hotelling

Module implementing Hotelling one and two sample tests

Top-level package for Hotelling T2.

hotelling.stats

Stats.py.

Hotelling’s T-Squared multivariate test for one sample or two independent samples

See:
Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2,
no. 3, 360–378. doi:10.1214/aoms/1177732979.

https://projecteuclid.org/euclid.aoms/1177732979

hotelling.stats.bessel_correction(x, y=None)[source]

bessel_correction.

Sampling tends to underestimate the variability of a population, because observations are more likely to fall near the mean than near the extremes. Bessel’s correction divides by n−1 instead of n when computing the variance and related quantities, which corrects the bias in the estimate of the population variance.

Parameters:
  • x – array-like, samples of observations
  • y – array-like, samples of observations, optional
Returns:

returns x_n - 1, y_n - 1
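
The effect of the correction can be checked numerically. The following is a minimal sketch using NumPy only, not this package's code: it draws many small samples from a simulated population (the sample size of 5 and the population variance of 4 are arbitrary choices for illustration) and averages the variance estimates obtained by dividing by n and by n−1.

```python
import numpy as np

rng = np.random.default_rng(0)
population_var = 4.0  # variance of the simulated population (std = 2)

# 20000 independent samples of n = 5 observations each.
samples = rng.normal(loc=0.0, scale=2.0, size=(20000, 5))

biased = samples.var(axis=1, ddof=0).mean()    # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1 (Bessel)

print(f"true variance:   {population_var:.3f}")
print(f"divide by n:     {biased:.3f}")
print(f"divide by n - 1: {unbiased:.3f}")
```

On average the n divisor falls short of the true variance by a factor of (n−1)/n, while the n−1 divisor does not.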

hotelling.stats.hotelling_dict(x, y=None, bessel=True)[source]

hotelling_dict.

Returns the same values as hotelling_t2, but in a dictionary, which is convenient for an API and similar uses.

Parameters:
  • x – array-like, samples of observations for one or two sample test (required)
  • y – optional; for a two-sample test, array-like samples of observations; for a one-sample test, a list of means to test against
Returns:

dict

hotelling.stats.hotelling_t2(x, y=None, bessel=True, S=None)[source]

hotelling_t2.

Compute the Hotelling (T2) test statistic.

It is the multivariate extension of Student’s t-test. When samples are given for both x and y, it tests the null hypothesis that the two multivariate samples have the same underlying probability distribution. The number of observations does not have to be the same in both samples, but the number of features must be equal.

Equation:

Hotelling’s t-squared statistic is defined as:

\[\begin{split}T^2 = n (\bar{x} - \mu)^{T} S^{-1} (\bar{x} - \mu)\end{split}\]

Where S is the sample covariance matrix (the pooled covariance matrix in the two-sample statistic) and ᵀ denotes the transpose.

The two sample t-squared statistic is defined as:

\[\begin{split}T^2 = (\bar{x} - \bar{y})^{T} \left[S \left(\frac{1}{n_x} + \frac{1}{n_y}\right)\right]^{-1} (\bar{x} - \bar{y})\end{split}\]
References:
  • Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
  • Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples et l’Analyse Factorielle. 107-119. In: L’Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
  • Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio. In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY
Parameters:
  • x – array-like, samples of observations for one or two sample test (required)
  • y – optional; for a two-sample test, array-like samples of observations; for a one-sample test, a list of means to test against
  • bessel – bool, apply bessel correction (default)
Returns:

statistic: float,

the t2 statistic

f_value: float,

the f value

p_value: float,

the p value

s: 2d array,

the pooled covariance matrix
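
As an illustration of the two formulas above, here is a minimal sketch written directly with NumPy and SciPy rather than with this package. The helper names one_sample_t2 and two_sample_t2 are hypothetical, Bessel's correction is always applied, and the conversion to an F value uses the standard textbook scaling, which is assumed (not confirmed) to match what hotelling_t2 does internally.

```python
import numpy as np
from scipy.stats import f as f_dist


def one_sample_t2(x, mu):
    """T^2 = n (x_bar - mu)^T S^-1 (x_bar - mu), S = sample covariance."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    diff = x.mean(axis=0) - np.asarray(mu, dtype=float)
    s = np.atleast_2d(np.cov(x, rowvar=False, ddof=1))
    t2 = n * diff @ np.linalg.solve(s, diff)
    # Textbook scaling: (n - p) / (p (n - 1)) * T^2 ~ F(p, n - p)
    f_value = (n - p) / (p * (n - 1)) * t2
    p_value = f_dist.sf(f_value, p, n - p)
    return t2, f_value, p_value, s


def two_sample_t2(x, y):
    """T^2 = (x_bar - y_bar)^T [S (1/n_x + 1/n_y)]^-1 (x_bar - y_bar)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, p = x.shape
    n_y = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    s_x = np.atleast_2d(np.cov(x, rowvar=False, ddof=1))
    s_y = np.atleast_2d(np.cov(y, rowvar=False, ddof=1))
    s = ((n_x - 1) * s_x + (n_y - 1) * s_y) / (n_x + n_y - 2)  # pooled
    t2 = diff @ np.linalg.solve(s * (1 / n_x + 1 / n_y), diff)
    # Textbook scaling to F(p, n_x + n_y - p - 1)
    dof = n_x + n_y - p - 1
    f_value = dof / (p * (n_x + n_y - 2)) * t2
    p_value = f_dist.sf(f_value, p, dof)
    return t2, f_value, p_value, s


rng = np.random.default_rng(1)
x = rng.normal(size=(30, 3))           # 30 observations, 3 features
y = rng.normal(loc=0.5, size=(40, 3))  # 40 observations, same 3 features
t2, f_value, p_value, s = two_sample_t2(x, y)
print(t2, f_value, p_value)
```

The two samples have different numbers of observations but the same number of features, as the description above requires.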

hotelling.stats.inverse_covariance_matrix(x, y, bessel=True)[source]

inverse_covariance_matrix.

Parameters:
  • x – array-like, samples of observations
  • y – array-like, samples of observations
  • bessel – bool, apply bessel correction (default)
Returns:

float, the inverse of the pooled covariance matrix, and the pooled covariance matrix

hotelling.stats.pooled_covariance_matrix(x, y, bessel=True)[source]

pooled_covariance.

Compute the pooled covariance matrix

Equation:

The pooled covariance matrix is defined as:

\[\begin{split}S = \frac{n_x S_x + n_y S_y}{n_x + n_y}\end{split}\]

And with Bessel’s correction as:

\[\begin{split}S = \frac{(n_x - 1) S_x + (n_y - 1) S_y}{n_x + n_y - 2}\end{split}\]

see: https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Pooled_covariance_matrix

Parameters:
  • x – array-like, samples of observations
  • y – array-like, samples of observations
  • bessel – bool, apply bessel correction (default)
Returns:

float, the pooled variance
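
Both forms of the pooled covariance can be sketched in a few lines of NumPy. The helper pooled_covariance below is hypothetical and is not the package's implementation. It assumes that S_x and S_y are computed with the same divisor as the pooling formula, that is, n without the correction and n−1 with it.

```python
import numpy as np


def pooled_covariance(x, y, bessel=True):
    """Pooled covariance of two samples with the same number of features."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    ddof = 1 if bessel else 0  # assumption: same divisor for S_x, S_y and S
    s_x = np.atleast_2d(np.cov(x, rowvar=False, ddof=ddof))
    s_y = np.atleast_2d(np.cov(y, rowvar=False, ddof=ddof))
    return ((n_x - ddof) * s_x + (n_y - ddof) * s_y) / (n_x + n_y - 2 * ddof)


rng = np.random.default_rng(2)
x = rng.normal(size=(25, 2))
y = rng.normal(size=(35, 2))

s_corrected = pooled_covariance(x, y, bessel=True)
s_plain = pooled_covariance(x, y, bessel=False)
print(s_corrected)
print(s_plain)
```

Under that assumption the two results differ only by the constant factor (n_x + n_y) / (n_x + n_y − 2).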

hotelling.plots

plot.py.

Hotelling’s T-Squared multivariate control charts

See:

  • Hotelling, Harold. (1931). The Generalization of Student’s Ratio. Ann. Math. Statist. 2, no. 3, 360–378. doi:10.1214/aoms/1177732979.
  • Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics. Stanford Univ. Press. 448-85
  • Gnanadesikan, R. and J.R. Kettenring (1972). Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics 28, 81-124
hotelling.plots.control_chart(x, phase=1, alpha=0.001, x_bar=None, s=None, legend_right=False, interactive=False, width=10, cusum=False, template='none', marker='o', ooc_marker='x', random_state=42, limit=1000, no_display=False)[source]

control_chart.

Hotelling Control Chart based on Q / T^2.

See also control_interval for more detail

Parameters:
  • x – pandas dataframe, uni or multivariate
  • phase – 1 or 2; phase 1 evaluates the initial sample itself, phase 2 monitors new observations against the implemented control
  • alpha – significance level - used to calculate control lines at α/2 and 1-α/2
  • x_bar – sample mean (optional, required with s)
  • s – sample covariance (optional, required with x_bar)
  • legend_right – defaults to False (legend on the left); set to True to place the legend on the right
  • interactive – if True and plotly is available, renders an interactive plot in the notebook; if False, renders a static image
  • width – how many units wide. defaults to 10, good for notebooks
  • cusum – use cumulative sum instead of average
  • template – plotly template, defaults to ‘none’, matching default matplotlib
  • marker – default marker symbol - one valid for matplotlib
  • ooc_marker – out of control marker symbol (x) - one valid for matplotlib
  • random_state – seed for sample (n > limit)
  • limit – max number of points to plot, defaults to 1000
Returns:

matplotlib ax / plotly fig

hotelling.plots.control_interval(m, n, f, phase=1, alpha=0.001)[source]

control_interval.

For Hotelling control charts, phase 1 uses Qi, computed on the initial sample itself. This statistic follows a beta distribution, not an F distribution. Phase 2 uses future observations, which follow a known distribution ~ F (Seber, 1984). The lower and upper control lines are quantiles of the relevant distribution (obtained from its percent point function) for α and 1 - α, while the center line is the median (50%).
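
A minimal sketch of such limits, using scipy.stats, is given here. It is not the package's code: the helper t2_control_limits is hypothetical, it ignores the m argument, and it uses the individual-observation scalings from Tracy, Young and Mason (1992), with a scaled beta distribution in phase 1 and a scaled F distribution in phase 2. The quantiles are taken at α/2 and 1 - α/2, as named in the alpha parameter description; that choice is an assumption.

```python
from scipy.stats import beta, f


def t2_control_limits(n, p, phase=1, alpha=0.001):
    """Lower limit, center line and upper limit for n samples of p features."""
    if phase == 1:
        # Phase 1: Q_i ~ ((n - 1)^2 / n) * Beta(p / 2, (n - p - 1) / 2)
        scale = (n - 1) ** 2 / n
        dist = beta(p / 2, (n - p - 1) / 2)
    else:
        # Phase 2: T^2 ~ (p (n + 1)(n - 1) / (n (n - p))) * F(p, n - p)
        scale = p * (n + 1) * (n - 1) / (n * (n - p))
        dist = f(p, n - p)
    # ppf is the percent point function (inverse CDF); 0.5 gives the median.
    lower, center, upper = (
        scale * dist.ppf(q) for q in (alpha / 2, 0.5, 1 - alpha / 2)
    )
    return lower, center, upper


phase1 = t2_control_limits(n=50, p=3, phase=1)
phase2 = t2_control_limits(n=50, p=3, phase=2)
print(phase1)
print(phase2)
```

Because a beta variable never exceeds 1, the phase 1 upper limit is bounded by (n − 1)² / n, whereas the phase 2 limit has no such bound.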

See:
  • Seber, G (1984). Multivariate Observations. John Wiley & Sons.
  • Nola D. Tracy, John C. Young & Robert L. Mason (1992) Multivariate Control Charts for Individual Observations, Journal of Quality Technology, 24:2, 88-95, DOI:10.1080/00224065.1992.12015232
Parameters:
  • m – sample groups (between 1 and n)
  • n – number of samples
  • f – number of features in the multivariate samples
  • phase – 1 or 2; phase 1 evaluates the initial sample itself, phase 2 monitors new observations against the implemented control
  • alpha – significance level - used to calculate control lines at α/2 and 1-α/2
Returns:

hotelling.plots.control_stats(x)[source]

control_stats.

Compute the sample mean vector and the covariance matrix

Parameters:
  • x – pandas dataframe, uni or multivariate
Returns:

sample mean, sample covariance

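
For a pandas dataframe, the sample mean vector and covariance matrix can be obtained as sketched below, together with the per-observation statistic Qi = (x_i − x̄)ᵀ S⁻¹ (x_i − x̄) that a T² control chart plots. This is an illustration with made-up data and column names, not the package's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

x_bar = df.mean()  # sample mean vector, one entry per column
s = df.cov()       # sample covariance matrix (divides by n - 1)

# Deviation of every row from the mean, then the quadratic form per row.
diff = (df - x_bar).to_numpy()
q = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(s.to_numpy()), diff)

print(x_bar)
print(s)
print(q[:5])
```

A useful sanity check is that the Qi values sum to (n − 1) times the number of features when S is computed from the same data with the n − 1 divisor.
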
hotelling.plots.limit_display(x, limit, random_state)[source]

limit_display.

Convenient way to get around the issue of very large datasets. We can’t show everything, so we display a subset. The tests and stats like T2, F and P values are not affected, because we calculate them on all the data.

Parameters:
  • x – dask or pandas dataframe, uni or multivariate
  • random_state – seed for sample (n > limit)
  • limit – max number of points to plot, defaults to 1000
Returns:

returns original number of rows and limited dataframe
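
The idea can be sketched for a pandas dataframe as follows. The helper limit_rows is hypothetical and covers pandas only; how the package handles dask dataframes is not shown here.

```python
import pandas as pd


def limit_rows(x, limit=1000, random_state=42):
    """Return the original row count and at most `limit` sampled rows."""
    n_rows = len(x)
    if n_rows > limit:
        # A fixed seed makes the displayed subset reproducible.
        x = x.sample(n=limit, random_state=random_state)
    return n_rows, x


df = pd.DataFrame({"value": range(5000)})
n_rows, shown = limit_rows(df, limit=1000)
print(n_rows, len(shown))
```

Only the plotted points are subsampled; any statistic should still be computed on the full dataframe.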

hotelling.plots.univariate_control_chart(x, var=None, sigma=3, legend_right=False, interactive=False, connected=True, width=10, cusum=False, cusum_only=False, template='none', marker='o', ooc_marker='x', limit=1000, random_state=42, no_display=False)[source]

univariate_control_chart.

Parameters:
  • x – dask or pandas dataframe, uni or multivariate
  • var – optional, variable to plot (default to all)
  • sigma – default to 3 sigma from mean for upper and lower control lines
  • legend_right – defaults to False (legend on the left); set to True to place the legend on the right
  • interactive – if True and plotly is available, renders an interactive plot in the notebook; if False, renders a static image
  • connected – defaults to True. Appropriate when time related /consecutive batches, else, should be False
  • width – how many units wide. defaults to 10, good for notebooks
  • cusum – use cumulative sum instead of average
  • cusum_only – don’t display values, just cusum referenced to 0
  • template – plotly template, defaults to ‘none’, matching default matplotlib
  • marker – default marker symbol (o) - one valid for matplotlib
  • ooc_marker – out of control marker symbol (x) - one valid for matplotlib
  • random_state – seed for sample (n > limit)
  • limit – max number of points to plot, defaults to 1000
Returns:

returns matplotlib figure or array of plotly figures
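
The quantities behind such a chart can be sketched without any plotting. The example below is an assumption about the usual construction (limits at the mean plus or minus sigma standard deviations, and a cumulative sum of deviations from the mean), not a copy of the package's code; the data and the injected outlier are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=1.0, size=200)
values[150] = 20.0  # injected out-of-control point, for illustration

sigma = 3
center = values.mean()
spread = values.std(ddof=1)

lcl = center - sigma * spread  # lower control line
ucl = center + sigma * spread  # upper control line
out_of_control = (values < lcl) | (values > ucl)

# Cumulative sum of deviations from the mean, referenced to 0.
cusum = np.cumsum(values - center)

print(lcl, center, ucl)
print(np.flatnonzero(out_of_control))
```

Points flagged in out_of_control are the ones a chart would draw with the out-of-control marker.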

hotelling.cli

Console script for hotelling.

hotelling.helpers

helpers.py.

hotelling.helpers.load_df(filepath, server=None, dask=None, **kwargs)[source]

load_df.

Parameters:
  • filepath (str) –
  • server (str) – head node for distributed cluster, ip address and port or hostname and port (localhost for local)
  • dask (bool) – if True, forces the use of dask, even on smaller datasets
  • kwargs – to pass arguments to pandas read_csv
Returns:

dataframe

hotelling.helpers.savefig(plt)[source]

savefig.

Allows displaying a matplotlib figure in the console terminal. This requires pysixel to be installed with pip. It also requires a terminal with Sixel graphics support, such as a DEC terminal with graphics support, Linux xterm (started with -ti 340), or MLTerm (a multilingual terminal available on Windows, Linux, etc.).

This is called by the command line tool when using --output stdout and can also be used in an IPython session.

Parameters:
  • plt – matplotlib pyplot
Returns: