`hotelling`

Module implementing Hotelling’s one- and two-sample tests.

Top-level package for Hotelling T2.

`hotelling.stats`

stats.py.

Hotelling’s T-Squared multivariate test for one sample or two independent samples

See:

Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979.

https://projecteuclid.org/euclid.aoms/1177732979

hotelling.stats.bessel_correction(x, y=None)

bessel_correction.

Sampling tends to underestimate the variability of a population, because deviations are measured from the sample mean, which sits closer to the sampled points than the true population mean does. Bessel’s correction divides by n−1 instead of n when computing the variance and related quantities, in order to correct the bias in the estimate of the population variance.

Parameters:

x – array-like, samples of observations
y – array-like, samples of observations, optional

Returns:

x_n - 1, y_n - 1
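The effect of the correction can be checked on a small sample. The snippet below is an illustrative sketch that uses only the standard library; the data and variable names are made up for the example, and it does not call bessel_correction itself.

```python
import statistics

# Hypothetical sample, chosen so the arithmetic is easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)
mean = sum(sample) / n

# Sum of squared deviations from the sample mean.
ss = sum((v - mean) ** 2 for v in sample)

biased = ss / n          # divides by n
unbiased = ss / (n - 1)  # Bessel's correction: divides by n - 1

# The standard library exposes both estimators under these names.
assert abs(biased - statistics.pvariance(sample)) < 1e-12
assert abs(unbiased - statistics.variance(sample)) < 1e-12
```

The corrected estimate is always the larger of the two, and the gap shrinks as n grows.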

hotelling.stats.hotelling_dict(x, y=None, bessel=True)

hotelling_dict.

Returns the same values as hotelling_t2, but in a dictionary, which is convenient for APIs and similar uses.

Parameters:

x – array-like, samples of observations for a one- or two-sample test (required)
y – for a two-sample test, array-like samples of observations; for a one-sample test, a list of means to test (optional)
bessel – bool, apply Bessel’s correction (default: True)

Returns:

dict

hotelling.stats.hotelling_t2(x, y=None, bessel=True, S=None)

hotelling_t2.

Compute the Hotelling (T2) test statistic.

It is the multivariate extension of Student’s t-test. When samples are given for both x and y, it tests the null hypothesis that the two multivariate samples have the same mean vector. The number of observations does not have to be the same in x and y, but the number of features must be equal.

Equation:

The one-sample t-squared statistic is defined as:

\[\begin{split}T^2 = n (\bar{x} - \mu)^{T} S^{-1} (\bar{x} - \mu)\end{split}\]

where S is the sample covariance matrix, n is the number of observations and ᵀ denotes the transpose.

The two-sample t-squared statistic is defined as:

\[\begin{split}T^2 = (\bar{x} - \bar{y})^{T} \left[S \left(\frac{1}{n_x} + \frac{1}{n_y}\right)\right]^{-1} (\bar{x} - \bar{y})\end{split}\]

where S is the pooled covariance matrix.

References:

Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
Hotelling, Harold. (1955). Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples et l’Analyse Factorielle. 107-119. In: L’Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
Anderson, T.W. (1992). Introduction to Hotelling (1931) The Generalization of Student’s Ratio. In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY.

Parameters:

x – array-like, samples of observations for one or two sample test (required)
y – for a two-sample test, array-like samples of observations; for a one-sample test, a list of means to test (optional)
bessel – bool, apply Bessel’s correction (default: True)

Returns:

statistic – float, the T2 statistic
f_value – float, the F value
p_value – float, the p value
s – 2d array, the pooled covariance matrix
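The two-sample statistic can be reproduced with a few lines of NumPy. This is a minimal sketch, not the package's implementation: two_sample_t2 is a hypothetical helper, the data are simulated, and the conversion to an F value uses the standard textbook relation with (p, n_x + n_y - p - 1) degrees of freedom, which is assumed rather than taken from hotelling_t2.

```python
import numpy as np

def two_sample_t2(x, y):
    """Two-sample Hotelling T^2 with a Bessel-corrected pooled covariance (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, p = x.shape
    n_y = y.shape[0]

    # np.cov divides by n - 1 by default; rowvar=False treats columns as features.
    s_x = np.cov(x, rowvar=False)
    s_y = np.cov(y, rowvar=False)
    s = ((n_x - 1) * s_x + (n_y - 1) * s_y) / (n_x + n_y - 2)

    delta = x.mean(axis=0) - y.mean(axis=0)
    # Solve [S (1/n_x + 1/n_y)] z = delta rather than forming an explicit inverse.
    t2 = float(delta @ np.linalg.solve(s * (1.0 / n_x + 1.0 / n_y), delta))

    # Standard T^2 -> F conversion (an assumption about what f_value represents).
    f_value = t2 * (n_x + n_y - p - 1) / (p * (n_x + n_y - 2))
    return t2, f_value, s

# Simulated data: 3 features, unequal sample sizes, shifted mean in y.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))
y = rng.normal(loc=0.5, size=(25, 3))
t2, f_value, s = two_sample_t2(x, y)
```

A p value would follow from the upper tail of the F distribution at f_value, for example with scipy.stats.f.sf, which is left out here to keep the sketch dependent on NumPy only.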

hotelling.stats.inverse_covariance_matrix(x, y, bessel=True)

inverse_covariance_matrix.

Parameters:

x – array-like, samples of observations
y – array-like, samples of observations
bessel – bool, apply Bessel’s correction (default: True)

Returns:

the inverse of the pooled covariance matrix, the pooled covariance matrix

hotelling.stats.pooled_covariance_matrix(x, y, bessel=True)

pooled_covariance_matrix.

Compute the pooled covariance matrix

Equation:

The pooled covariance matrix is defined as:

\[\begin{split}S = \frac{n_x S_x + n_y S_y}{n_x + n_y}\end{split}\]

With Bessel’s correction, it is defined as:

\[\begin{split}S = \frac{(n_x - 1) S_x + (n_y - 1) S_y}{n_x + n_y - 2}\end{split}\]

see: https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Pooled_covariance_matrix

Parameters:

x – array-like, samples of observations
y – array-like, samples of observations
bessel – bool, apply Bessel’s correction (default: True)

Returns:

2d array, the pooled covariance matrix
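The two definitions differ only in their normalisation, which a short NumPy sketch makes visible. pooled_cov below is a hypothetical stand-in, not the package function, and it assumes that S_x and S_y are computed with the same divisor convention as the pooled estimate (n for the uncorrected form, n - 1 for the corrected one).

```python
import numpy as np

def pooled_cov(x, y, bessel=True):
    """Pooled covariance of two samples with the same number of features (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    if bessel:
        # Per-sample covariances divide by n - 1.
        s_x = np.cov(x, rowvar=False, ddof=1)
        s_y = np.cov(y, rowvar=False, ddof=1)
        return ((n_x - 1) * s_x + (n_y - 1) * s_y) / (n_x + n_y - 2)
    # Per-sample covariances divide by n.
    s_x = np.cov(x, rowvar=False, ddof=0)
    s_y = np.cov(y, rowvar=False, ddof=0)
    return (n_x * s_x + n_y * s_y) / (n_x + n_y)

# Simulated data with 2 features and unequal sample sizes.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 2))
y = rng.normal(size=(8, 2))
n_x, n_y = len(x), len(y)

s_corrected = pooled_cov(x, y, bessel=True)
s_plain = pooled_cov(x, y, bessel=False)
```

Under that assumption both numerators equal the sum of the two scatter matrices, so the corrected estimate is the uncorrected one scaled by (n_x + n_y) / (n_x + n_y - 2).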

`hotelling.plots`

plots.py.

Hotelling’s T-Squared multivariate control charts

See:

Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979.

Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics. Stanford Univ. Press. 448-85

Gnanadesikan, R. and J.R. Kettenring (1972). Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics 28, 81-124

hotelling.plots.control_chart(x, phase=1, alpha=0.001, x_bar=None, s=None, legend_right=False, interactive=False, width=10, cusum=False, template='none', marker='o', ooc_marker='x', random_state=42, limit=1000, no_display=False)

control_chart.

Hotelling Control Chart based on Q / T^2.

See also control_interval for more detail

Parameters:

x – pandas dataframe, uni- or multivariate
phase – 1 or 2; phase 1 is within the initial sample, phase 2 is measuring implemented control
alpha – significance level, used to calculate control lines at α/2 and 1-α/2
x_bar – sample mean (optional, required with s)
s – sample covariance (optional, required with x_bar)
legend_right – if True, place the legend on the right; defaults to False (legend on the left)
interactive – if True and plotly is available, render an interactive plot in the notebook; if False, render an image
width – width in units; defaults to 10, which works well in notebooks
cusum – use cumulative sum instead of average
template – plotly template; defaults to ‘none’, matching the default matplotlib look
marker – default marker symbol (o), one valid for matplotlib
ooc_marker – out-of-control marker symbol (x), one valid for matplotlib
random_state – seed for sampling (used when n > limit)
limit – maximum number of points to plot; defaults to 1000

Returns:

matplotlib ax / plotly fig

hotelling.plots.control_interval(m, n, f, phase=1, alpha=0.001)

control_interval.

For Hotelling control charts, phase 1 uses Q_i, computed from the observations of the initial sample itself. Q_i follows a beta distribution, not an F distribution. Phase 2 uses future observations, whose statistic follows a known distribution, F (Seber, 1984). The lower and upper control lines are the quantiles of the relevant distribution (values of the percent point function) at α and 1 - α, while the center line is the median (50%).

See:

Seber, G. (1984). Multivariate Observations. John Wiley & Sons.
Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232

Parameters:

m – sample groups (between 1 and n)
n – number of samples
f – number of features in the multivariate samples
phase – 1 or 2; phase 1 is within the initial sample, phase 2 is measuring implemented control
alpha – significance level, used to calculate control lines at α/2 and 1-α/2

Returns:
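For orientation, the two reference distributions for individual observations, as commonly quoted from Tracy, Young and Mason (1992), can be written out as below. This is a sketch under stated assumptions: m here denotes the number of observations in the initial sample and p the number of features, and how these symbols map onto the m, n and f arguments of control_interval is not specified in this page.

```latex
% Phase 1: observations from the initial sample (scaled beta distribution)
\[\begin{split}Q_i \sim \frac{(m-1)^2}{m} \, B\left(\frac{p}{2}, \frac{m-p-1}{2}\right)\end{split}\]

% Phase 2: future observations (scaled F distribution)
\[\begin{split}T^2 \sim \frac{p (m+1)(m-1)}{m (m-p)} \, F(p, m-p)\end{split}\]
```

Each control line is then the corresponding quantile of the beta or F distribution, multiplied by the scaling factor in front of it.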

hotelling.plots.control_stats(x)

control_stats.

Compute the sample mean vector and the covariance matrix

Parameters:

x – pandas dataframe, uni- or multivariate

Returns:

sample mean, sample covariance
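The mean vector and covariance matrix are the inputs to the per-observation statistic that a Hotelling chart plots. The following NumPy sketch shows that computation; per_observation_t2 is a hypothetical helper, the data are simulated, and the use of the n - 1 divisor for the covariance is an assumption rather than something stated above.

```python
import numpy as np

def per_observation_t2(x):
    """Mean vector, covariance matrix and per-row T^2 values (sketch, needs >= 2 features)."""
    x = np.asarray(x, dtype=float)
    x_bar = x.mean(axis=0)
    s = np.cov(x, rowvar=False)  # divides by n - 1
    centered = x - x_bar
    # Row-wise quadratic form (x_i - x_bar)^T S^{-1} (x_i - x_bar),
    # computed with a linear solve instead of an explicit inverse.
    solved = np.linalg.solve(s, centered.T).T
    q = np.einsum("ij,ij->i", centered, solved)
    return x_bar, s, q

# Simulated data: 50 observations of 2 features.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
x_bar, s, q = per_observation_t2(x)
```

A useful sanity check is that the values sum to (n - 1) · p when the covariance uses the n - 1 divisor, since the sum equals the trace of S⁻¹ times the scatter matrix.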

hotelling.plots.limit_display(x, limit, random_state)

limit_display.

Convenient way to get around the issue of very large datasets. We can’t show everything, so we display a subset. The tests and stats like T2, F and P values are not affected, because we calculate them on all the data.

Parameters:

x – dask or pandas dataframe, uni- or multivariate
limit – maximum number of points to plot; defaults to 1000
random_state – seed for sampling (used when n > limit)

Returns:

original number of rows, limited dataframe
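The idea of seeded subsampling for display can be sketched without any dataframe library. limit_rows below is a hypothetical stand-in that works on a plain list; it illustrates the behaviour described above and is not the package's implementation.

```python
import random

def limit_rows(rows, limit=1000, random_state=42):
    """Return the original row count and at most `limit` rows, sampled reproducibly."""
    n = len(rows)
    if n <= limit:
        # Small enough: nothing to drop.
        return n, rows
    rng = random.Random(random_state)
    # Sample row positions without replacement, then restore the original order.
    keep = sorted(rng.sample(range(n), limit))
    return n, [rows[i] for i in keep]

# Hypothetical data: 5000 rows reduced to 1000 for plotting.
n_rows, subset = limit_rows(list(range(5000)), limit=1000, random_state=42)
```

Returning the original count alongside the subset lets the caller report how many rows exist even though only some are drawn.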

hotelling.plots.univariate_control_chart(x, var=None, sigma=3, legend_right=False, interactive=False, connected=True, width=10, cusum=False, cusum_only=False, template='none', marker='o', ooc_marker='x', limit=1000, random_state=42, no_display=False)

univariate_control_chart.

Parameters:

x – dask or pandas dataframe, uni- or multivariate
var – optional, variable to plot (defaults to all)
sigma – number of standard deviations from the mean for the upper and lower control lines; defaults to 3
legend_right – if True, place the legend on the right; defaults to False (legend on the left)
interactive – if True and plotly is available, render an interactive plot in the notebook; if False, render an image
connected – defaults to True, which is appropriate for time-related or consecutive batches; otherwise it should be False
width – width in units; defaults to 10, which works well in notebooks
cusum – use cumulative sum instead of average
cusum_only – don’t display values, just the cusum referenced to 0
template – plotly template; defaults to ‘none’, matching the default matplotlib look
marker – default marker symbol (o), one valid for matplotlib
ooc_marker – out-of-control marker symbol (x), one valid for matplotlib
random_state – seed for sampling (used when n > limit)
limit – maximum number of points to plot; defaults to 1000

Returns:

returns matplotlib figure or array of plotly figures
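The control lines of a univariate chart can be illustrated with the standard library alone. The sketch below assumes the lines sit at the mean plus or minus sigma sample standard deviations; control_limits and out_of_control are hypothetical helpers, the data are made up, and whether the package uses the sample or the population standard deviation is not stated here.

```python
import statistics

def control_limits(values, sigma=3):
    """Lower line, center line and upper line at mean +/- sigma standard deviations."""
    center = statistics.fmean(values)
    spread = statistics.stdev(values)  # sample standard deviation (assumption)
    return center - sigma * spread, center, center + sigma * spread

def out_of_control(values, sigma=3):
    """Indices of the points that fall outside the control lines."""
    lcl, _, ucl = control_limits(values, sigma)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical measurements: a stable process followed by one shifted point.
values = [10.1, 9.9] * 10 + [12.0]
lcl, center, ucl = control_limits(values)
flagged = out_of_control(values)
```

Because the shifted point also inflates the estimated standard deviation, a single outlier in a very short series can widen the lines enough to hide itself, which is one reason to estimate the limits from a sample known to be in control.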

`hotelling.cli`

Console script for hotelling.

`hotelling.helpers`

helpers.py.

hotelling.helpers.load_df(filepath, server=None, dask=None, **kwargs)

load_df.

Parameters:

filepath (str) –
server (str) – head node for the distributed cluster: IP address and port, or hostname and port (localhost for local)
dask (bool) – if True, forces the use of dask, even on smaller datasets
kwargs – arguments passed on to pandas read_csv

Returns:

dataframe

hotelling.helpers.savefig(plt)

savefig.

Displays a matplotlib figure in the console terminal. This requires the pysixel package (installable with pip) and a terminal with Sixel graphics support, such as a DEC terminal with graphics support, Linux xterm (started with -ti 340), or MLTerm (a multilingual terminal available on Windows, Linux, etc.).

This is called by the command-line tool when using --output stdout and can also be used in an IPython session.

Parameters:

plt – matplotlib pyplot

Returns:
