hotelling
Module implementing Hotelling one and two sample tests
Top-level package for Hotelling T2.
hotelling.stats
Stats.py.
Hotelling’s T-Squared multivariate test for one sample or two independent samples
- See:
- Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
-
hotelling.stats.
bessel_correction
(x, y=None) bessel_correction.
Sampling tends to underestimate the variability of a population, because we are more likely to sample around the mean than near the extremes. Bessel’s correction divides by n−1 instead of n when computing the variance (and related quantities), which corrects this bias in the estimate of the population variance.
Parameters: - x – array-like, samples of observations
- y – array-like, samples of observations, optional
Returns: x_n - 1, y_n - 1 (the sample sizes of x and y, each minus one)
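The effect of dividing by n−1 rather than n can be seen on a small example. This is a minimal sketch that uses only Python's standard `statistics` module, not `bessel_correction` itself; the sample values are made up for illustration.

```python
import statistics

# Made-up sample of five observations.
sample = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(sample)
mean = statistics.fmean(sample)
ss = sum((v - mean) ** 2 for v in sample)  # sum of squared deviations

biased_var = ss / n          # divides by n (no correction)
unbiased_var = ss / (n - 1)  # divides by n - 1 (Bessel's correction)

# The standard library exposes both conventions directly.
assert abs(biased_var - statistics.pvariance(sample)) < 1e-12
assert abs(unbiased_var - statistics.variance(sample)) < 1e-12
```

The corrected estimate is always the larger of the two, and the gap shrinks as n grows.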
-
hotelling.stats.
hotelling_dict
(x, y=None, bessel=True) hotelling_dict.
Returns the same values as hotelling_t2, but as a dictionary (for example, for use behind an API).
Parameters: - x – array-like, samples of observations for one or two sample test (required)
- y – for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
Returns: dict
-
hotelling.stats.
hotelling_t2
(x, y=None, bessel=True, S=None) hotelling_t2.
Compute the Hotelling (T2) test statistic.
It is the multivariate extension of Student’s t-test. When samples are given for both x and y, it tests the null hypothesis that the two multivariate samples have the same mean vector. With a single sample, y gives the list of means to test against. The two samples need not contain the same number of observations, but they must have the same number of features.
Equation:
Hotelling’s t-squared statistic is defined as:
\[T^2 = n (\bar{x} - \mu)^{T} S^{-1} (\bar{x} - \mu)\]
where S is the sample covariance matrix (for two samples, the pooled covariance matrix) and ᵀ denotes the transpose.
The two sample t-squared statistic is defined as:
\[T^2 = (\bar{x} - \bar{y})^{T} \left[S \left(\frac{1}{n_x} + \frac{1}{n_y}\right)\right]^{-1} (\bar{x} - \bar{y})\]
- References:
- Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
- Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples et l’Analyse Factorielle. 107-119. In: L’Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
- Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio. In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY
Parameters: - x – array-like, samples of observations for one or two sample test (required)
- y – for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
- bessel – bool, apply Bessel’s correction (default True)
Returns: - statistic: float,
the t2 statistic
- f_value: float,
the f value
- p_value: float,
the p value
- s: 2d array,
the pooled variance
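The one-sample statistic defined above can be computed by hand with NumPy. This is a minimal sketch, not the implementation behind `hotelling_t2`: the helper name `one_sample_t2` is hypothetical, the covariance is assumed to use Bessel's correction, and the conversion to an F value uses the standard textbook relation F = (n − p) / (p (n − 1)) · T², which follows an F(p, n − p) distribution under the null hypothesis.

```python
import numpy as np

def one_sample_t2(x, mu):
    """One-sample Hotelling T^2 and its F transform (illustrative only)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n, p = x.shape
    diff = x.mean(axis=0) - mu
    # Assumption: sample covariance with Bessel's correction (ddof=1).
    s = np.cov(x, rowvar=False, ddof=1)
    # Solve S z = diff instead of forming the explicit inverse of S.
    t2 = float(n * diff @ np.linalg.solve(s, diff))
    f_value = (n - p) / (p * (n - 1)) * t2
    return t2, f_value, s

# Made-up data: 30 observations of 3 features, tested against a zero mean.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))
t2, f_value, s = one_sample_t2(x, mu=[0.0, 0.0, 0.0])
```

A p value would then be the upper tail of the F(p, n − p) distribution at `f_value`, for instance via `scipy.stats.f.sf`.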
-
hotelling.stats.
inverse_covariance_matrix
(x, y, bessel=True) inverse_covariance_matrix.
Parameters: - x – array-like, samples of observations
- y – array-like, samples of observations
- bessel – bool, apply Bessel’s correction (default True)
Returns: float, the pooled variance inverse, the pooled variance
-
hotelling.stats.
pooled_covariance_matrix
(x, y, bessel=True) pooled_covariance.
Compute the pooled covariance matrix
Equation:
The pooled covariance matrix is defined as:
\[S = \frac{n_x S_x + n_y S_y}{n_x + n_y}\]
And with Bessel’s correction as:
\[S = \frac{(n_x - 1) S_x + (n_y - 1) S_y}{n_x + n_y - 2}\]
See: https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Pooled_covariance_matrix
Parameters: - x – array-like, samples of observations
- y – array-like, samples of observations
- bessel – bool, apply Bessel’s correction (default True)
Returns: float, the pooled variance
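The two weightings above, and the two-sample statistic they feed into, can be sketched with NumPy as follows. The helper `pool` is hypothetical and is not the library's `pooled_covariance_matrix`; it takes the per-sample covariance matrices as inputs, and how those are computed (here NumPy's default, ddof=1) is an assumption.

```python
import numpy as np

def pool(s_x, n_x, s_y, n_y, bessel=True):
    """Weight two per-sample covariance matrices as in the equations above."""
    if bessel:
        return ((n_x - 1) * s_x + (n_y - 1) * s_y) / (n_x + n_y - 2)
    return (n_x * s_x + n_y * s_y) / (n_x + n_y)

# Made-up data: two samples of different sizes with the same two features.
rng = np.random.default_rng(1)
x = rng.normal(size=(20, 2))
y = rng.normal(loc=0.5, size=(35, 2))
n_x, n_y = len(x), len(y)

# Assumption: per-sample covariances use NumPy's default (ddof=1).
s_x = np.cov(x, rowvar=False)
s_y = np.cov(y, rowvar=False)
s = pool(s_x, n_x, s_y, n_y)

# Two-sample T^2: (x_bar - y_bar)^T [S (1/n_x + 1/n_y)]^{-1} (x_bar - y_bar)
diff = x.mean(axis=0) - y.mean(axis=0)
t2 = float(diff @ np.linalg.solve(s * (1 / n_x + 1 / n_y), diff))
```

With ddof=1 inputs and `bessel=True`, the result equals the summed within-sample scatter matrices divided by n_x + n_y − 2.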
hotelling.plots
plot.py.
Hotelling’s T-Squared multivariate control charts
See:
- Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979.
- Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics. Stanford Univ. Press. 448-85
- Gnanadesikan, R. and J.R. Kettenring (1972). Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics 28, 81-124
-
hotelling.plots.
control_chart
(x, phase=1, alpha=0.001, x_bar=None, s=None, legend_right=False, interactive=False, width=10, cusum=False, template='none', marker='o', ooc_marker='x', random_state=42, limit=1000, no_display=False) control_chart.
Hotelling Control Chart based on Q / T^2.
See also control_interval for more detail
Parameters: - x – pandas dataframe, uni or multivariate
- phase – 1 or 2 - phase 1 is within initial sample, phase 2 is measuring implemented control
- alpha – significance level - used to calculate control lines at α/2 and 1-α/2
- x_bar – sample mean (optional, required with s)
- s – sample covariance (optional, required with x_bar)
- legend_right – bool, defaults to False (legend on the left); set to True to place the legend on the right
- interactive – if True and plotly is available, renders an interactive plot in the notebook; if False, renders a static image
- width – how many units wide. defaults to 10, good for notebooks
- cusum – use cumulative sum instead of average
- template – plotly template, defaults to ‘none’, matching default matplotlib
- marker – default marker symbol - one valid for matplotlib
- ooc_marker – out of control marker symbol (x) - one valid for matplotlib
- random_state – seed for sample (n > limit)
- limit – max number of points to plot, defaults to 1000
Returns: matplotlib ax / plotly fig
-
hotelling.plots.
control_interval
(m, n, f, phase=1, alpha=0.001) control_interval.
For Hotelling control charts, phase 1 uses Qi, computed within the initial sample; it follows a beta distribution, not an F distribution. Phase 2 uses future observations, which follow a known distribution ~ F (Seber, 1984). The lower and upper control lines are the quantiles of that distribution (its percent point function) at α/2 and 1 − α/2, while the center line is the median (50%).
- See:
- Seber, G (1984). Multivariate Observations. John Wiley & Sons.
- Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232
Parameters: - m – sample groups (between 1 and n)
- n – number of samples
- f – number of features in the multivariate samples
- phase – 1 or 2 - phase 1 is within initial sample, phase 2 is measuring implemented control
- alpha – significance level - used to calculate control lines at α/2 and 1-α/2
Returns:
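A rough sketch of such control lines for individual observations, following the beta (phase 1) and F (phase 2) forms given by Tracy, Young and Mason (1992), is shown below. It is not the body of `control_interval`: the helper name `t2_limits` is hypothetical, the grouping parameter m is ignored, `p` stands for the number of features (f above), and SciPy is assumed to be installed.

```python
from scipy import stats

def t2_limits(n, p, phase=1, alpha=0.001):
    """Lower, center and upper control lines for individual observations."""
    quantiles = (alpha / 2, 0.5, 1 - alpha / 2)
    if phase == 1:
        # Phase 1: T^2 ~ ((n - 1)^2 / n) * Beta(p / 2, (n - p - 1) / 2)
        scale = (n - 1) ** 2 / n
        dist = stats.beta(p / 2, (n - p - 1) / 2)
    else:
        # Phase 2: T^2 ~ (p (n + 1)(n - 1) / (n (n - p))) * F(p, n - p)
        scale = p * (n + 1) * (n - 1) / (n * (n - p))
        dist = stats.f(p, n - p)
    lcl, cl, ucl = (scale * dist.ppf(q) for q in quantiles)
    return lcl, cl, ucl

# Example: 50 observations of 3 features.
phase1 = t2_limits(n=50, p=3, phase=1)
phase2 = t2_limits(n=50, p=3, phase=2)
```

Because the beta distribution is bounded, the phase 1 upper line can never exceed (n − 1)² / n, whereas the phase 2 line has no such ceiling.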
-
hotelling.plots.
control_stats
(x) control_stats.
Compute the sample mean vector and the covariance matrix
Parameters: x – pandas dataframe, uni or multivariate
Returns: sample mean, sample covariance
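For a pandas dataframe, both quantities are available directly. This is a minimal sketch with made-up data; whether `control_stats` uses the same default normalization (pandas divides by n − 1) is an assumption.

```python
import pandas as pd

# Made-up data: four observations of two features.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})

x_bar = df.mean()  # sample mean vector, one entry per column
s = df.cov()       # sample covariance matrix (pandas divides by n - 1)
```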
-
hotelling.plots.
limit_display
(x, limit, random_state) limit_display.
Convenient way to get around the issue of very large datasets. We can’t show everything, so we display a subset. The tests and stats like T2, F and P values are not affected, because we calculate them on all the data.
Parameters: - x – dask or pandas dataframe, uni or multivariate
- random_state – seed for sample (n > limit)
- limit – max number of points to plot, defaults to 1000
Returns: the original number of rows and the limited dataframe
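The idea can be sketched with pandas as follows. The helper `subsample_for_display` is hypothetical and only mirrors the documented inputs and outputs; the use of `DataFrame.sample` and the re-sorting by index (to keep points in their original order) are assumptions, not a description of `limit_display`.

```python
import pandas as pd

def subsample_for_display(x, limit=1000, random_state=42):
    """Return the original row count and at most `limit` rows for plotting."""
    n = len(x)
    if n > limit:
        # A seeded random sample keeps the subset reproducible.
        x = x.sample(n=limit, random_state=random_state).sort_index()
    return n, x

# Made-up data: 5000 rows, of which only 1000 are kept for display.
df = pd.DataFrame({"value": range(5000)})
n, shown = subsample_for_display(df, limit=1000, random_state=42)
```

Statistics would still be computed on `df`; only `shown` is passed to the plotting code.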
-
hotelling.plots.
univariate_control_chart
(x, var=None, sigma=3, legend_right=False, interactive=False, connected=True, width=10, cusum=False, cusum_only=False, template='none', marker='o', ooc_marker='x', limit=1000, random_state=42, no_display=False) univariate_control_chart.
Parameters: - x – dask or pandas dataframe, uni or multivariate
- var – optional, variable to plot (default to all)
- sigma – default to 3 sigma from mean for upper and lower control lines
- legend_right – bool, defaults to False (legend on the left); set to True to place the legend on the right
- interactive – if True and plotly is available, renders an interactive plot in the notebook; if False, renders a static image
- connected – defaults to True. Appropriate when time related /consecutive batches, else, should be False
- width – how many units wide. defaults to 10, good for notebooks
- cusum – use cumulative sum instead of average
- cusum_only – don’t display values, just cusum referenced to 0
- template – plotly template, defaults to ‘none’, matching default matplotlib
- marker – default marker symbol (o) - one valid for matplotlib
- ooc_marker – out of control marker symbol (x) - one valid for matplotlib
- random_state – seed for sample (n > limit)
- limit – max number of points to plot, defaults to 1000
Returns: matplotlib figure or array of plotly figures
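The quantities such a chart displays can be sketched with NumPy. The helper `univariate_limits` is hypothetical; the use of the sample standard deviation (ddof=1) and a cumulative sum of deviations from the mean are assumptions, not a description of what `univariate_control_chart` computes internally.

```python
import numpy as np

def univariate_limits(values, sigma=3):
    """Control lines, out-of-control flags and a CUSUM for one variable."""
    values = np.asarray(values, dtype=float)
    center = values.mean()
    spread = values.std(ddof=1)  # assumption: sample standard deviation
    lcl = center - sigma * spread
    ucl = center + sigma * spread
    out_of_control = (values < lcl) | (values > ucl)
    cusum = np.cumsum(values - center)  # cumulative deviations, referenced to 0
    return lcl, center, ucl, out_of_control, cusum

# Made-up data: nineteen stable readings followed by one clear outlier.
values = [10.0] * 19 + [30.0]
lcl, center, ucl, out_of_control, cusum = univariate_limits(values, sigma=3)
```

Here only the last reading falls outside the 3 sigma band, and the CUSUM returns to zero at the final point because deviations from the mean sum to zero.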
hotelling.cli
Console script for hotelling.
hotelling.helpers
helpers.py.
-
hotelling.helpers.
load_df
(filepath, server=None, dask=None, **kwargs) load_df.
Parameters: - filepath (str) –
- server (str) – head node for distributed cluster, ip address and port or hostname and port (localhost for local)
- dask (bool) – if True, forces the use of dask, even on smaller datasets
- kwargs – to pass arguments to pandas read_csv
Returns: dataframe
-
hotelling.helpers.
savefig
(plt) savefig.
Displays a matplotlib figure in the console terminal. This requires pysixel (installable with pip) and a terminal with Sixel graphics support, such as a DEC terminal with graphics support, Linux xterm (started with -ti 340), or MLTerm (a multilingual terminal, available on Windows, Linux, etc.).
This is called by the command line tool when using --output stdout and can also be used in an IPython session.
Parameters: plt – matplotlib pyplot
Returns: