The road to scaling Hotelling multivariate control charts¶

Presented at QPRC 2021, Wednesday, July 28¶

Francois Dion, Dion Research¶

Francois Dion (fdion@dionresearch.com) is the founder and Chief Data Scientist at Dion Research, creator of visu.ai, SeekEx & SeekErr and of open source software such as Stemgraphic, Signethic & Hotelling, and a contributor to PandasGUI, Plotly, Yellowbrick & more.

Hotelling documentation: https://dionresearch.github.io/hotelling/

Hotelling GitHub repository: https://github.com/dionresearch/hotelling

Hotelling PyPI link: https://pypi.org/project/hotelling/

"Who am I to argue with statistics?", Hetty, NCIS: Los Angeles, "Enemy Within"

How (it came to be)¶

  • Developed an end-to-end data quality, anomaly detection and data science platform
  • Initially, sought exotics in univariate data and used machine learning for multivariate data
  • Some data is IID and normal (or can be transformed to be); SPC (Statistical Process Control) applies here
  • A lot of data science is done in Python and with open source software
  • No Python module available in 2015, wrote Hotelling stats and charts for the platform
  • Released the Hotelling stats as open source in 2019

Why (Problem statement)¶

  • Python in a notebook (Jupyter) environment is inherently interactive
  • Should be able to zoom in and interact quickly with charts
  • A lot of data science is handling "big data" [1]
  • The typical way to handle this is to delay and distribute computation
  • Hotelling assumed local data, in memory and 1 CPU core (nothing parallel)
  • Many requests to allow it to scale and work with data pipelines

[1] https://blog.dionresearch.com/2019/11/data-infrastructures-for-rest-of-us.html

File on disk: 1GB rule of thumb gives intervals for each aspect of a pipeline. Example:

Memory usage = 1GB x 6 x 2 x 1.3 x 32 = ~500GB
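
As a quick check of the arithmetic, the multipliers can simply be chained. The factors are copied verbatim from the line above; what each one stands for is not spelled out on this slide (the linked post [1] has the background), so the names in this sketch are deliberately generic.

```python
# Multipliers copied from the slide; their individual meaning is not stated here.
file_size_gb = 1
multipliers = [6, 2, 1.3, 32]

memory_gb = file_size_gb
for factor in multipliers:
    memory_gb *= factor

print(f"~{memory_gb:.0f} GB")  # prints "~499 GB", i.e. roughly 500 GB
```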

The road to scaling Hotelling multivariate control charts¶

Where to start?

  • Make sure it works as intended first
  • Add chart interactivity
  • Add automated testing
  • Had done this for Stemgraphic already (http://stemgraphic.org/), should provide guidance
  • Investigate and choose parallel/distributed frameworks

The road to scaling Hotelling, part deux¶

Many choices, select one!

Chose dask and distributed, which allow some compatibility with standard Python dataframes (pandas) and scientific modules (NumPy)

  • Identify what we want to delay / combine / compute:
    • shape of data
    • mean
    • difference (of means)
    • covariance
  • Identify what we want to also delay, for charts:
    • sample
  • Implement a first pass, keeping it compatible with standard pandas dataframes or numpy arrays
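
The delay / combine / compute idea in the list above can be sketched without any distributed framework. The example below is an illustration only, not the code used in Hotelling: `partial_stats` and `combine` are hypothetical helpers, and plain numpy chunks stand in for dataframe partitions. Each partition contributes a small summary (row count, column sums, sum of outer products), and the summaries are merged into the global shape, mean and covariance.

```python
import numpy as np


def partial_stats(chunk):
    """Per-partition summary: row count, column sums, sum of outer products."""
    chunk = np.asarray(chunk, dtype=float)
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk


def combine(partials, bessel=True):
    """Merge per-partition summaries into the global n, mean and covariance."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    cross = sum(p[2] for p in partials)
    mean = total / n
    cov = (cross - n * np.outer(mean, mean)) / (n - 1 if bessel else n)
    return n, mean, cov


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
chunks = np.array_split(data, 4)  # stand-in for dataframe partitions

n, mean, cov = combine([partial_stats(c) for c in chunks])
```

Only the small per-partition summaries need to travel between workers, which is what makes this pattern a natural fit for a task graph. The sum-of-products form is the simplest to show but can lose precision when column means are large relative to the spread; a production version would likely use a more numerically stable merge.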

Demo¶

Typical imports in a Python script or notebook:

In [1]:
import pandas as pd
from hotelling.plots import control_chart, control_stats
from hotelling.helpers import load_df
from hotelling.stats import hotelling_t2

Data¶

Data from Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232

In [2]:
y = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
        "impurities": [14.92,16.90,17.38,16.90,16.92,16.71,17.07,16.93,16.71,16.88,16.73,17.07,17.60,16.90,],
        "temp": [85.77,83.77,84.46,86.27,85.23,83.81,86.08,85.85,85.73,86.27,83.46,85.81,85.92,84.23,],
        "concentration": [42.26,43.44,42.74,43.60,43.18,43.72,43.33,43.41,43.28,42.59,44.00,42.78,43.11,43.48,],
    }
)

y.set_index("id", inplace=True)
In [3]:
help(hotelling_t2)
Help on function hotelling_t2 in module hotelling.stats:

hotelling_t2(x, y=None, bessel=True, S=None)
    hotelling_t2.
    
    Compute the Hotelling (T2) test statistic.
    
    It is the multivariate extension of the Student's t-test.
    Test the null hypothesis that two multivariate samples have the same underlying
    probability distribution, when specifying samples for x and y. The number of samples do not have
    to be the same, but the number of features does have to be equal.
    
    Equation:
    
    Hotelling's t-squared statistic is defined as:
    
    .. math::
        T^2 = n (\\bar{x} - {\mu})^{T} S^{-1} (\\bar{x} - {\mu})
    
    Where S is the pooled covariance matrix and ᵀ represents the transpose.
    
    The two sample t-squared statistic is defined as:
    
    .. math::
        T^2 = (\\bar{x} - \\bar{y})^{T} [S(\\frac1 n_x +\\frac 1 n_y)]^{-1} (\\bar{x}̄ - \\bar{y})
    
    References:
        - Hotelling, Harold. (1931). The Generalization of Student's Ratio. Ann. Math. Statist. 2, no. 3, 360--378.
          doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
    
        - Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples
          et l'Analyse Factorielle. 107-119.
          In: L'Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
    
        - Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio.
          In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics.
          Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY
    
    :param x: array-like, samples of observations for one or two sample test (required)
    :param y: for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
    :param bessel: bool, apply bessel correction (default)
    :return:
        statistic: float,
            the t2 statistic
        f_value: float,
            the f value
        p_value: float,
            the p value
        s: 2d array,
            the pooled variance

Statistics¶

In [4]:
hotelling_t2(y[:7],y[7:])
Out[4]:
(1.1274962421214139,
 0.3131934005892816,
 0.8155799493855016,
 array([[ 0.3701119 , -0.04580476,  0.10414762],
        [-0.04580476,  1.10192857, -0.26898571],
        [ 0.10414762, -0.26898571,  0.2428881 ]]))
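
The first, second and fourth values of this output can be cross-checked by hand from the two-sample formula in the docstring. The sketch below uses numpy only and is not the library's internal implementation; the F conversion is the standard one for a two-sample T2 with (p, nx + ny - p - 1) degrees of freedom, which is an assumption about what `hotelling_t2` reports as `f_value`.

```python
import numpy as np

# Same measurements as the dataframe y above: impurities, temp, concentration.
data = np.array([
    [14.92, 85.77, 42.26], [16.90, 83.77, 43.44], [17.38, 84.46, 42.74],
    [16.90, 86.27, 43.60], [16.92, 85.23, 43.18], [16.71, 83.81, 43.72],
    [17.07, 86.08, 43.33], [16.93, 85.85, 43.41], [16.71, 85.73, 43.28],
    [16.88, 86.27, 42.59], [16.73, 83.46, 44.00], [17.07, 85.81, 42.78],
    [17.60, 85.92, 43.11], [16.90, 84.23, 43.48],
])
first, second = data[:7], data[7:]
nx, ny, p = len(first), len(second), data.shape[1]

# Pooled covariance, with Bessel's correction applied to each sample.
pooled = ((nx - 1) * np.cov(first, rowvar=False)
          + (ny - 1) * np.cov(second, rowvar=False)) / (nx + ny - 2)

# T2 = (xbar - ybar)^T [S (1/nx + 1/ny)]^-1 (xbar - ybar)
diff = first.mean(axis=0) - second.mean(axis=0)
t2 = diff @ np.linalg.solve(pooled * (1 / nx + 1 / ny), diff)

# Standard T2-to-F conversion (assumed to match the library's f_value).
f_value = t2 * (nx + ny - p - 1) / (p * (nx + ny - 2))

print(round(t2, 4), round(f_value, 4))  # agrees with Out[4]: 1.1275 0.3132
```

The `pooled` matrix corresponds to the array returned as the fourth element of Out[4].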

Charts¶

(static: PNG, SVG, PDF, etc.)

In [5]:
control_chart(y, alpha=0.01, legend_right=True, interactive=False);
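
For readers who want to see what such a chart plots, here is a minimal numpy sketch of the per-observation T2 values for the same data. It assumes the chart shows T2 for individual observations against the overall mean and sample covariance, as in the cited Tracy, Young & Mason paper; it is not taken from the `control_chart` source, and the control limit itself is left out to keep the example numpy-only.

```python
import numpy as np

# Same measurements as the dataframe y above: impurities, temp, concentration.
data = np.array([
    [14.92, 85.77, 42.26], [16.90, 83.77, 43.44], [17.38, 84.46, 42.74],
    [16.90, 86.27, 43.60], [16.92, 85.23, 43.18], [16.71, 83.81, 43.72],
    [17.07, 86.08, 43.33], [16.93, 85.85, 43.41], [16.71, 85.73, 43.28],
    [16.88, 86.27, 42.59], [16.73, 83.46, 44.00], [17.07, 85.81, 42.78],
    [17.60, 85.92, 43.11], [16.90, 84.23, 43.48],
])
n, p = data.shape
centered = data - data.mean(axis=0)
cov = np.cov(data, rowvar=False)  # Bessel-corrected sample covariance

# T2 for each observation: (x_i - xbar)^T S^-1 (x_i - xbar)
t2_i = np.einsum("ij,ij->i", centered, np.linalg.solve(cov, centered.T).T)

for obs_id, value in enumerate(t2_i, start=1):
    print(obs_id, round(value, 2))
```

Two sanity checks come for free: the values always sum to (n - 1) * p, here 39, and no single value can exceed (n - 1)^2 / n. The first observation, with its unusually low impurities reading, stands out clearly.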

Charts¶

(dynamic, interactive)

In [6]:
control_chart(y, alpha=0.01, legend_right=True, interactive=True);
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning:

I found a path object that I don't think is part of a bar chart. Ignoring.

In [7]:
help(load_df)
Help on function load_df in module hotelling.helpers:

load_df(filepath, server=None, dask=None, **kwargs)
    load_df.
    
    :param str filepath:
    :param str server: head node for distributed cluster, ip address and port or hostname and port (localhost for local)
    :param bool dask: if True, forces the use of dask,, even on smaller datasets
    :param kwargs: to pass arguments to pandas `read_csv`
    
    :return: dataframe
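
A loader of this kind can be pictured as a thin dispatcher between pandas and dask. The sketch below is hypothetical and is not the `load_df` implementation: the `load_frame` name, the `use_dask` flag and the 1 GB size cutoff are assumptions made for illustration, and the `server` argument is left out entirely.

```python
import glob
import os

import pandas as pd

SIZE_THRESHOLD_BYTES = 1_000_000_000  # hypothetical cutoff, not from hotelling


def load_frame(filepath, use_dask=None, **kwargs):
    """Hypothetical loader: pandas for small inputs, dask when large or forced."""
    paths = sorted(glob.glob(filepath))
    if not paths:
        raise FileNotFoundError(filepath)
    if use_dask is None:
        # Decide from the total size on disk of all files matching the pattern.
        use_dask = sum(os.path.getsize(p) for p in paths) > SIZE_THRESHOLD_BYTES
    if use_dask:
        import dask.dataframe as dd  # imported lazily so pandas-only use works
        return dd.read_csv(filepath, **kwargs)  # lazy: nothing is read yet
    # pandas.read_csv does not expand glob patterns, so read each file in turn.
    frames = (pd.read_csv(p, **kwargs) for p in paths)
    return pd.concat(frames, ignore_index=True)
```

Extra keyword arguments such as `delimiter`, `header`, `dtype` and `usecols` pass straight through to the CSV reader in both branches, which is what lets the same call work on a laptop-sized file or a directory of large extracts.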

In [8]:
x = load_df(
    'data/historical_2006*.txt',
    dask=True,
    delimiter='|',
    header=None,
    dtype={
           11: 'float64',
           22: 'float64',
           6: 'float64',
           8: 'float64'},
    usecols=[4,6,8,11,12,22],
)
In [9]:
%%timeit
hotelling_t2(x)
3.89 s ± 73.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]:
%%timeit
control_stats(x)
2.28 s ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]:
control_chart(x, alpha=0.01, legend_right=True, interactive=True, template="ggplot2+presentation");
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning:

I found a path object that I don't think is part of a bar chart. Ignoring.

The road AHEAD to scaling Hotelling¶

What's next?

Hotelling works at scale now, but more has to be done on the road to v1.0, as far as scaling goes:

  • Better selection of points when limiting the display
  • Issue another PR to resolve the latest Plotly warnings
  • Combine several delayed functions to optimize the DAG
  • Keep improving performance

Other enhancements will be brought in from visu.ai

More descriptive errors when data is not in a usable format

Questions?¶

Thank you!¶