Francois Dion (fdion@dionresearch.com) is the founder and Chief Data Scientist at Dion Research, creator of visu.ai, SeekEx & SeekErr, open source software such as Stemgraphic, Signethic & Hotelling, and contributor to PandasGui, Plotly, Yellowbrick & more.
Hotelling documentation: https://dionresearch.github.io/hotelling/
Hotelling GitHub repository: https://github.com/dionresearch/hotelling
Hotelling PyPI link: https://pypi.org/project/hotelling/
"Who am I to argue with statistics?", Hetty, NCIS: Los Angeles, "Enemy Within"
anomaly detection and data science platform
seeking exotics on univariate, machine learning for multivariate
IID and normal (or transformed), SPC (Statistical Process Control) applies here
Python and with open source software
Hotelling stats and charts for the platform
Hotelling stats as open source in 2019
Python in a notebook (Jupyter) environment is inherently interactive
Hotelling assumed local data, in memory and 1 CPU core (nothing parallel)
data pipeline.
[1] https://blog.dionresearch.com/2019/11/data-infrastructures-for-rest-of-us.html
File on disk: 1GB. A rule of thumb gives intervals for each aspect of a pipeline. Example:
Memory usage = 1GB x 6 x 2 x 1.3 x 32 = ~500GB
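The arithmetic in the example can be checked directly. The sketch below only multiplies the factors shown on the line above; what each factor stands for is covered in the referenced blog post and is not restated here, so the name `multipliers` is a placeholder rather than terminology from the source.

```python
from math import prod

# Factors copied from the example above. Their individual meaning is
# described in the referenced blog post; `multipliers` is a placeholder name.
file_on_disk_gb = 1
multipliers = [6, 2, 1.3, 32]

memory_gb = file_on_disk_gb * prod(multipliers)
print(f"Estimated memory usage: {memory_gb:.1f} GB")
```

The product is 499.2, which is the ~500GB quoted in the example.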
Where to start?
Stemgraphic already (http://stemgraphic.org/), should provide guidance
Many choices, select one!
Chose dask and distributed, allowing some compatibility with standard Python dataframes (pandas) and scientific module (numpy)
pandas dataframes or numpy arrays
Typical imports in a Python script or notebook:
import pandas as pd
from hotelling.plots import control_chart, control_stats
from hotelling.helpers import load_df
from hotelling.stats import hotelling_t2
data from Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232
y = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
        "impurities": [14.92, 16.90, 17.38, 16.90, 16.92, 16.71, 17.07, 16.93, 16.71, 16.88, 16.73, 17.07, 17.60, 16.90],
        "temp": [85.77, 83.77, 84.46, 86.27, 85.23, 83.81, 86.08, 85.85, 85.73, 86.27, 83.46, 85.81, 85.92, 84.23],
        "concentration": [42.26, 43.44, 42.74, 43.60, 43.18, 43.72, 43.33, 43.41, 43.28, 42.59, 44.00, 42.78, 43.11, 43.48],
    }
)
y.set_index("id", inplace=True)
help(hotelling_t2)
Help on function hotelling_t2 in module hotelling.stats:

hotelling_t2(x, y=None, bessel=True, S=None)
    hotelling_t2.

    Compute the Hotelling (T2) test statistic.

    It is the multivariate extension of the Student's t-test.
    Test the null hypothesis that two multivariate samples have the same underlying
    probability distribution, when specifying samples for x and y. The number of samples do not have to be
    the same, but the number of features does have to be equal.

    Equation:

    Hotelling's t-squared statistic is defined as:

    .. math::
        T^2 = n (\\bar{x} - {\mu})^{T} S^{-1} (\\bar{x} - {\mu})

    Where S is the pooled covariance matrix and ᵀ represents the transpose.

    The two sample t-squared statistic is defined as:

    .. math::
        T^2 = (\\bar{x} - \\bar{y})^{T} [S(\\frac1 n_x +\\frac 1 n_y)]^{-1} (\\bar{x} - \\bar{y})

    References:
    - Hotelling, Harold. (1931). The Generalization of Student's Ratio. Ann. Math. Statist. 2, no. 3, 360--378.
      doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
    - Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables
      Multiples et l'Analyse Factorielle. 107-119. In: L'Analyse Factorielle et ses Applications. Centre National
      de la Recherche Scientifique, Paris.
    - Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio. In: Kotz S.,
      Johnson N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics).
      Springer, New York, NY

    :param x: array-like, samples of observations for one or two sample test (required)
    :param y: for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
    :param bessel: bool, apply bessel correction (default)
    :return:
        statistic: float, the t2 statistic
        f_value: float, the f value
        p_value: float, the p value
        s: 2d array, the pooled variance
hotelling_t2(y[:7],y[7:])
(1.1274962421214139,
 0.3131934005892816,
 0.8155799493855016,
 array([[ 0.3701119 , -0.04580476,  0.10414762],
        [-0.04580476,  1.10192857, -0.26898571],
        [ 0.10414762, -0.26898571,  0.2428881 ]]))
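To make the two-sample formula from the help text concrete, here is a minimal dependency-free sketch that recomputes the statistic for the same split of the data. It is not the library's code: the helper names (`col_means`, `scatter`, `solve`) are made up for illustration, the pooled covariance is assumed to use the Bessel-corrected form (combined scatter divided by n_x + n_y - 2), and the F value uses the standard T2-to-F conversion. The p value is left out because it needs an F distribution, which the standard library does not provide.

```python
from statistics import mean

# Same 14 observations as above, split into the first and last seven rows.
impurities = [14.92, 16.90, 17.38, 16.90, 16.92, 16.71, 17.07,
              16.93, 16.71, 16.88, 16.73, 17.07, 17.60, 16.90]
temp = [85.77, 83.77, 84.46, 86.27, 85.23, 83.81, 86.08,
        85.85, 85.73, 86.27, 83.46, 85.81, 85.92, 84.23]
concentration = [42.26, 43.44, 42.74, 43.60, 43.18, 43.72, 43.33,
                 43.41, 43.28, 42.59, 44.00, 42.78, 43.11, 43.48]
rows = list(zip(impurities, temp, concentration))
x, y = rows[:7], rows[7:]


def col_means(sample):
    """Mean of each feature (column)."""
    return [mean(col) for col in zip(*sample)]


def scatter(sample):
    """Sum of outer products of the deviations from the sample mean."""
    m = col_means(sample)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in sample)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, n):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= factor * aug[k][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


n_x, n_y, p = len(x), len(y), len(x[0])
sx, sy = scatter(x), scatter(y)
# Pooled covariance: combined scatter over n_x + n_y - 2 degrees of freedom.
pooled = [[(sx[i][j] + sy[i][j]) / (n_x + n_y - 2) for j in range(p)]
          for i in range(p)]
diff = [a - b for a, b in zip(col_means(x), col_means(y))]
scaled = [[v * (1 / n_x + 1 / n_y) for v in row] for row in pooled]
# T2 = diff^T [S (1/n_x + 1/n_y)]^-1 diff, computed via a linear solve.
t2 = sum(d * z for d, z in zip(diff, solve(scaled, diff)))
# Standard conversion to an F value with (p, n_x + n_y - p - 1) d.o.f.
f_value = t2 * (n_x + n_y - p - 1) / (p * (n_x + n_y - 2))
print(t2, f_value)
```

Run as written, this should print values that agree with the statistic and F value in the output above, and `pooled` should agree with the returned matrix.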
(static, png, svg, pdf etc)
control_chart(y, alpha=0.01, legend_right=True, interactive=False);
(dynamic, interactive)
control_chart(y, alpha=0.01, legend_right=True, interactive=True);
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning: I found a path object that I don't think is part of a bar chart. Ignoring.
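For readers who want to see what a Hotelling control chart for individual observations typically plots, the sketch below computes one T2 value per row of the same data, using the sample mean and the Bessel-corrected sample covariance of all 14 rows. This is an illustration of the textbook statistic, not the internals of `control_chart` or `control_stats`; the `solve` helper is made up for the example, and the control limit (which depends on `alpha` and a beta or F quantile) is deliberately left out.

```python
from statistics import mean

impurities = [14.92, 16.90, 17.38, 16.90, 16.92, 16.71, 17.07,
              16.93, 16.71, 16.88, 16.73, 17.07, 17.60, 16.90]
temp = [85.77, 83.77, 84.46, 86.27, 85.23, 83.81, 86.08,
        85.85, 85.73, 86.27, 83.46, 85.81, 85.92, 84.23]
concentration = [42.26, 43.44, 42.74, 43.60, 43.18, 43.72, 43.33,
                 43.41, 43.28, 42.59, 44.00, 42.78, 43.11, 43.48]
rows = list(zip(impurities, temp, concentration))
n, p = len(rows), len(rows[0])

means = [mean(col) for col in zip(*rows)]
dev = [[r[j] - means[j] for j in range(p)] for r in rows]
# Sample covariance with Bessel's correction (divide by n - 1).
cov = [[sum(d[i] * d[j] for d in dev) / (n - 1) for j in range(p)]
       for i in range(p)]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(m)]
    for k in range(m):
        pivot = max(range(k, m), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, m):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, m + 1):
                aug[r][c] -= factor * aug[k][c]
    z = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, m))
        z[i] = (aug[i][m] - tail) / aug[i][i]
    return z


# T2_i = (x_i - mean)^T S^-1 (x_i - mean) for every observation.
t2 = [sum(di * zi for di, zi in zip(d, solve(cov, d))) for d in dev]
for obs_id, value in enumerate(t2, start=1):
    print(f"{obs_id:2d}  {value:6.2f}")
```

A handy sanity check on any such implementation: with this covariance estimate the individual values always sum to (n - 1) x p, here 13 x 3 = 39.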
help(load_df)
Help on function load_df in module hotelling.helpers:

load_df(filepath, server=None, dask=None, **kwargs)
    load_df.

    :param str filepath:
    :param str server: head node for distributed cluster, ip address and port or hostname and port (localhost for local)
    :param bool dask: if True, forces the use of dask, even on smaller datasets
    :param kwargs: to pass arguments to pandas `read_csv`
    :return: dataframe
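The `dask` parameter suggests that the loader otherwise picks a backend on its own. The help text does not say how, so the following is only a guess at what such a decision could look like: a hypothetical `should_use_dask` helper that honours an explicit flag and otherwise compares the total size of the files matching the glob pattern with a threshold. Both the helper and the 1GB threshold are assumptions made for illustration, not part of hotelling.

```python
import glob
import os

# Hypothetical cut-off: the 1GB figure is an assumption, not a hotelling constant.
SIZE_THRESHOLD_BYTES = 1 * 1024**3


def should_use_dask(filepath, dask=None, threshold=SIZE_THRESHOLD_BYTES):
    """Return True if the files matched by `filepath` should be loaded with dask.

    Hypothetical helper: an explicit True/False wins; otherwise the total
    size on disk of every file matching the glob pattern decides.
    """
    if dask is not None:
        return bool(dask)
    total = sum(os.path.getsize(path) for path in glob.glob(filepath))
    return total > threshold


# An explicit flag overrides the size check, mirroring dask=True in the help text.
print(should_use_dask("data/historical_2006*.txt", dask=True))
```

Keeping the decision in one small function makes the override behaviour easy to test without touching any real data files.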
x = load_df(
    'data/historical_2006*.txt',
    dask=True,
    delimiter='|',
    header=None,
    dtype={
        11: 'float64',
        22: 'float64',
        6: 'float64',
        8: 'float64'},
    usecols=[4, 6, 8, 11, 12, 22],
)
%%timeit
hotelling_t2(x)
3.89 s ± 73.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
control_stats(x)
2.28 s ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
control_chart(x, alpha=0.01, legend_right=True, interactive=True, template="ggplot2+presentation");
/home/fdion/anaconda3/envs/hotelling/lib/python3.8/site-packages/plotly/matplotlylib/renderer.py:612: UserWarning: I found a path object that I don't think is part of a bar chart. Ignoring.
What's next?
Hotelling works at scale now, but more has to be done on the road to v1.0, as far as scaling goes:
plotly warnings
Other enhancements will be brought in from visu.ai
More descriptive errors when data is not in a usable format