Francois Dion (fdion@dionresearch.com) is the founder and Chief Data Scientist at Dion Research, creator of visu.ai, SeekEx & SeekErr, open source software such as Stemgraphic, Signethic & Hotelling and contributor to PandasGui, Plotly, Yellowbrick & more.

Hotelling documentation: https://dionresearch.github.io/hotelling/
Hotelling GitHub repository: https://github.com/dionresearch/hotelling
Hotelling PyPI link: https://pypi.org/project/hotelling/
"Who am I to argue with statistics?", Hetty, NCIS: Los Angeles, "Enemy Within"
anomaly detection and data science platform
seeking exotics on univariate, machine learning for multivariate
IID and normal (or transformed), SPC (Statistical Process Control) applies here
Python and with open source software
Hotelling stats and charts for the platform
Hotelling stats as open source in 2019
Python in a notebook (Jupyter) environment is inherently interactive
Hotelling assumed local data, in memory and 1 CPU core (nothing parallel)
data pipeline.
[1] https://blog.dionresearch.com/2019/11/data-infrastructures-for-rest-of-us.html
File on disk: 1GB
A rule of thumb gives intervals for each aspect of a pipeline. Example:
Memory usage = 1GB x 6 x 2 x 1.3 x 32 = ~500GB
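The arithmetic in this example can be checked in a couple of lines. The multipliers are copied as-is from the example; the variable names are placeholders, since the factors are not labelled here.

```python
import math

# Figures copied from the example above: a 1 GB file and its four multipliers.
# The names are placeholders; the source does not label the individual factors.
file_size_gb = 1
multipliers = [6, 2, 1.3, 32]

memory_gb = file_size_gb * math.prod(multipliers)
print(f"Estimated memory usage: {memory_gb:.1f} GB")  # 499.2 GB, i.e. ~500GB
```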
Where to start?
Stemgraphic already (http://stemgraphic.org/), should provide guidance
Many choices, select one!
Chose dask and distributed, allowing some compatibility with standard python dataframes (pandas) and scientific module (numpy)
pandas dataframes or numpy arrays
Typical imports in a Python script or notebook:
import pandas as pd
from hotelling.plots import control_chart, control_stats
from hotelling.helpers import load_df
from hotelling.stats import hotelling_t2
Data from Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232
y = pd.DataFrame(
{
"id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
"impurities": [14.92,16.90,17.38,16.90,16.92,16.71,17.07,16.93,16.71,16.88,16.73,17.07,17.60,16.90,],
"temp": [85.77,83.77,84.46,86.27,85.23,83.81,86.08,85.85,85.73,86.27,83.46,85.81,85.92,84.23,],
"concentration": [42.26,43.44,42.74,43.60,43.18,43.72,43.33,43.41,43.28,42.59,44.00,42.78,43.11,43.48,],
}
)
y.set_index("id", inplace=True)
help(hotelling_t2)
Help on function hotelling_t2 in module hotelling.stats:
hotelling_t2(x, y=None, bessel=True, S=None)
hotelling_t2.
Compute the Hotelling (T2) test statistic.
It is the multivariate extension of the Student's t-test.
Test the null hypothesis that two multivariate samples have the same underlying
probability distribution, when specifying samples for x and y. The number of samples do not have
to be the same, but the number of features does have to be equal.
Equation:
Hotelling's t-squared statistic is defined as:
.. math::
T^2 = n (\\bar{x} - {\mu})^{T} S^{-1} (\\bar{x} - {\mu})
Where S is the pooled covariance matrix and ᵀ represents the transpose.
The two sample t-squared statistic is defined as:
.. math::
T^2 = (\\bar{x} - \\bar{y})^{T} [S(\\frac1 n_x +\\frac 1 n_y)]^{-1} (\\bar{x} - \\bar{y})
References:
- Hotelling, Harold. (1931). The Generalization of Student's Ratio. Ann. Math. Statist. 2, no. 3, 360--378.
doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
- Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples
et l'Analyse Factorielle. 107-119.
In: L'Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
- Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio.
In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics.
Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY
:param x: array-like, samples of observations for one or two sample test (required)
:param y: for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
:param bessel: bool, apply bessel correction (default)
:return:
statistic: float,
the t2 statistic
f_value: float,
the f value
p_value: float,
the p value
s: 2d array,
the pooled variance
hotelling_t2(y[:7],y[7:])
(1.1274962421214139,
0.3131934005892816,
0.8155799493855016,
array([[ 0.3701119 , -0.04580476, 0.10414762],
[-0.04580476, 1.10192857, -0.26898571],
[ 0.10414762, -0.26898571, 0.2428881 ]]))
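The two-sample formula from the help text can be worked through by hand with numpy. This is a minimal sketch, not hotelling's own code: it assumes a Bessel-corrected pooled covariance and the usual conversion of T² to an F value with (p, n_x + n_y - p - 1) degrees of freedom. On the same 14 rows, split 7 and 7, it should land close to the statistic, F value and pooled covariance shown above.

```python
import numpy as np

# Same 14 observations as above: impurities, temp, concentration.
data = np.array([
    [14.92, 85.77, 42.26], [16.90, 83.77, 43.44], [17.38, 84.46, 42.74],
    [16.90, 86.27, 43.60], [16.92, 85.23, 43.18], [16.71, 83.81, 43.72],
    [17.07, 86.08, 43.33], [16.93, 85.85, 43.41], [16.71, 85.73, 43.28],
    [16.88, 86.27, 42.59], [16.73, 83.46, 44.00], [17.07, 85.81, 42.78],
    [17.60, 85.92, 43.11], [16.90, 84.23, 43.48],
])
first, second = data[:7], data[7:]
nx, ny, p = len(first), len(second), data.shape[1]

# Pooled covariance; np.cov applies Bessel's correction (ddof=1) by default.
pooled = (
    (nx - 1) * np.cov(first, rowvar=False)
    + (ny - 1) * np.cov(second, rowvar=False)
) / (nx + ny - 2)

# T^2 = d' [S (1/nx + 1/ny)]^-1 d, with d the difference of the mean vectors.
d = first.mean(axis=0) - second.mean(axis=0)
t2 = d @ np.linalg.solve(pooled * (1 / nx + 1 / ny), d)

# Assumed conversion to an F value with (p, nx + ny - p - 1) degrees of freedom.
f_value = t2 * (nx + ny - p - 1) / (p * (nx + ny - 2))

print(t2, f_value)
print(pooled)
```

Using `np.linalg.solve` instead of inverting the matrix explicitly is a common choice for numerical stability.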
(static, png, svg, pdf etc)
control_chart(y, alpha=0.01, legend_right=True, interactive=False);
(dynamic, interactive)
control_chart(y, alpha=0.01, legend_right=True, interactive=True);
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning: I found a path object that I don't think is part of a bar chart. Ignoring.
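A multivariate control chart of this kind plots one T² value per observation against a control limit. The sketch below computes only the per-observation values with numpy, assuming the mean vector and covariance matrix are estimated from the same 14 rows. It is an illustration of the general idea, not taken from hotelling's source, and the control limit (which depends on alpha) is left out.

```python
import numpy as np

# Same 14 observations as above: impurities, temp, concentration.
data = np.array([
    [14.92, 85.77, 42.26], [16.90, 83.77, 43.44], [17.38, 84.46, 42.74],
    [16.90, 86.27, 43.60], [16.92, 85.23, 43.18], [16.71, 83.81, 43.72],
    [17.07, 86.08, 43.33], [16.93, 85.85, 43.41], [16.71, 85.73, 43.28],
    [16.88, 86.27, 42.59], [16.73, 83.46, 44.00], [17.07, 85.81, 42.78],
    [17.60, 85.92, 43.11], [16.90, 84.23, 43.48],
])
n, p = data.shape

centered = data - data.mean(axis=0)
cov = np.cov(data, rowvar=False)  # Bessel-corrected sample covariance

# T^2_i = (x_i - mean)' S^-1 (x_i - mean), one value per row.
solved = np.linalg.solve(cov, centered.T).T
t2_i = np.einsum("ij,ij->i", centered, solved)

for obs_id, value in enumerate(t2_i, start=1):
    print(obs_id, round(float(value), 3))
```

A handy sanity check: when the covariance is estimated from the same rows, the per-observation values always sum to (n - 1) × p, here 13 × 3 = 39.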
help(load_df)
Help on function load_df in module hotelling.helpers:
load_df(filepath, server=None, dask=None, **kwargs)
load_df.
:param str filepath:
:param str server: head node for distributed cluster, ip address and port or hostname and port (localhost for local)
:param bool dask: if True, forces the use of dask,, even on smaller datasets
:param kwargs: to pass arguments to pandas `read_csv`
:return: dataframe
x = load_df(
'data/historical_2006*.txt',
dask=True,
delimiter='|',
header=None,
dtype={
11: 'float64',
22: 'float64',
6: 'float64',
8: 'float64'},
usecols=[4,6,8,11,12,22],
)
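Per the help text, the extra keyword arguments are forwarded to pandas `read_csv`. Their effect can be tried on a tiny in-memory sample; the rows below are made up for illustration and are not the historical data files.

```python
import io

import pandas as pd

# Made-up pipe-delimited rows with no header line (illustration only).
sample = io.StringIO("a|10|1.5|x\nb|20|2.5|y\n")

df = pd.read_csv(
    sample,
    delimiter="|",         # fields are separated by pipes
    header=None,           # no header row, so columns are labelled 0, 1, 2, ...
    usecols=[1, 2],        # keep only these column positions
    dtype={1: "float64"},  # read column 1 as float rather than the inferred int
)
print(df.dtypes)
```

With `header=None` the integer positions double as column labels, which is why the `dtype` mapping is keyed by integers.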
%%timeit
hotelling_t2(x)
3.89 s ± 73.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
control_stats(x)
2.28 s ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
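Statistics like these lend themselves to partitioned data because a mean vector and covariance matrix can be assembled from per-partition sums that are combined at the end. The following is a generic numpy illustration of that idea on synthetic data; it is an assumption about the general technique, not a description of how hotelling or dask computes it.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))   # synthetic stand-in for a large table
chunks = np.array_split(data, 8)    # pretend each chunk is one partition

# Each partition contributes a row count, column sums and a sum of outer products.
n = 0
total = np.zeros(3)
outer = np.zeros((3, 3))
for chunk in chunks:
    n += len(chunk)
    total += chunk.sum(axis=0)
    outer += chunk.T @ chunk

# Combine the partial results into the global mean and covariance.
mean = total / n
cov = (outer - n * np.outer(mean, mean)) / (n - 1)  # Bessel-corrected

print(np.allclose(cov, np.cov(data, rowvar=False)))
```

This textbook form can lose precision when column means are large relative to their spread; centering the data first or using a pairwise update is the usual remedy.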
control_chart(x, alpha=0.01, legend_right=True, interactive=True, template="ggplot2+presentation");
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning: I found a path object that I don't think is part of a bar chart. Ignoring.
What's next?
Hotelling works at scale now, but more has to be done on the road to v1.0, as far as scaling goes:
plotly
warnings
Other enhancements will be brought in from visu.ai
More descriptive errors when data is not in a usable format