Coverage for /home/runner/work/hotelling/hotelling/hotelling/stats.py : 86%

# -*- coding: utf-8 -*-
"""
Hotelling's T-Squared multivariate test for one sample or two independent samples.

See: Hotelling, Harold. (1931). The Generalization of Student's Ratio. Ann. Math. Statist. 2, no. 3, 360--378. doi:10.1214/aoms/1177732979.
https://projecteuclid.org/euclid.aoms/1177732979
"""
"""bessel_correction.
A sample tends to underestimate the variability of its population: observations are more likely to fall near the mean than near the extremes, and deviations are measured from the sample mean rather than from the true population mean. Bessel's correction divides by n−1 instead of n when computing the variance (and covariance), which removes the bias in the estimate of the population variance.
:param x: array-like, samples of observations
:param y: array-like, samples of observations, optional
:return: returns x_n - 1, y_n - 1
"""
else:
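The effect of Bessel's correction can be illustrated with a minimal NumPy sketch. It is not taken from this module: the data values are made up for illustration, and `ddof` is NumPy's own parameter for choosing the divisor.

```python
import numpy as np

# Illustrative data only; any 1-D sample would do.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.shape[0]

biased = np.var(x, ddof=0)    # divides the sum of squared deviations by n
unbiased = np.var(x, ddof=1)  # divides by n - 1 (Bessel's correction)

# The two estimates differ exactly by the factor n / (n - 1).
assert np.isclose(unbiased, biased * n / (n - 1))
```

The corrected estimate is always the larger of the two, and the gap shrinks as n grows.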
"""inverse_covariance_matrix.
:param x: array-like, samples of observations
:param y: array-like, samples of observations
:param bessel: bool, apply bessel correction (default)
:return: float, the pooled variance inverse, the pooled variance
"""
r"""pooled_covariance.
Compute the pooled covariance matrix
Equation:
The pooled covariance matrix is defined as:
.. math:: S = \frac{n_x S_x + n_y S_y}{n_x + n_y}
And with bessel correction as:
.. math:: S = \frac{(n_x - 1) S_x + (n_y - 1) S_y}{n_x + n_y - 2}
Reference
---------
see: https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Pooled_covariance_matrix

:param x: array-like, samples of observations
:param y: array-like, samples of observations
:param bessel: bool, apply bessel correction (default)
:return: float, the pooled variance
"""
else: n2 = y.shape[0] n2 = n2.compute()
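The two pooled covariance formulas can be sketched with plain NumPy as follows. This is a hedged illustration, not the module's implementation: `pooled_covariance_sketch` is a hypothetical name, and it assumes that each per-sample covariance is computed with the same divisor convention as the pooling (n−1 with Bessel's correction, n without), which the docstring does not state explicitly.

```python
import numpy as np

def pooled_covariance_sketch(x, y, bessel=True):
    """Hypothetical helper: pooled covariance of two samples (rows = observations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.shape[0], y.shape[0]
    # Assumption: ddof matches the pooling convention (1 with Bessel, 0 without).
    ddof = 1 if bessel else 0
    sx = np.cov(x, rowvar=False, ddof=ddof)
    sy = np.cov(y, rowvar=False, ddof=ddof)
    # Weight each covariance by its degrees of freedom, then normalise.
    return ((nx - ddof) * sx + (ny - ddof) * sy) / (nx + ny - 2 * ddof)

# Made-up data: 3 and 4 observations of 2 features.
x = [[1, 2], [3, 4], [5, 6]]
y = [[2, 1], [4, 3], [6, 5], [8, 7]]
S = pooled_covariance_sketch(x, y)
```

Weighting by degrees of freedom means the larger sample contributes more to the pooled estimate.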
r"""hotelling_t2.
Compute the Hotelling (T2) test statistic.
It is the multivariate extension of Student's t-test. When samples are given for both x and y, it tests the null hypothesis that the two multivariate samples come from populations with the same mean vector. The two samples need not contain the same number of observations, but they must have the same number of features.
Equation:
Hotelling's t-squared statistic is defined as:
.. math:: T^2 = n (\bar{x} - \mu)^{T} S^{-1} (\bar{x} - \mu)
Where S is the sample covariance matrix (the pooled covariance matrix in the two-sample case) and ᵀ represents the transpose.
The two sample t-squared statistic is defined as:
.. math:: T^2 = (\bar{x} - \bar{y})^{T} [S(\frac{1}{n_x} + \frac{1}{n_y})]^{-1} (\bar{x} - \bar{y})
References:
- Hotelling, Harold. (1931). The Generalization of Student's Ratio. Ann. Math. Statist. 2, no. 3, 360--378. doi:10.1214/aoms/1177732979. https://projecteuclid.org/euclid.aoms/1177732979
- Hotelling, Harold. (1955) Les Rapports entre les Methodes Statistiques recentes portant sur des Variables Multiples et l'Analyse Factorielle. 107-119. In: L'Analyse Factorielle et ses Applications. Centre National de la Recherche Scientifique, Paris.
- Anderson T.W. (1992) Introduction to Hotelling (1931) The Generalization of Student’s Ratio. In: Kotz S., Johnson N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY
:param x: array-like, samples of observations for one or two sample test (required)
:param y: for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
:param bessel: bool, apply bessel correction (default)
:return: statistic: float, the t2 statistic
         f_value: float, the f value
         p_value: float, the p value
         s: 2d array, the pooled variance
"""  # noqa: W605
except AttributeError as ex:
    if "list" in str(ex):
        x = np.asarray(x)
        nx, *p = x.shape
        p = p[0] if p else 1
        y = np.asarray(y)
    else:
        warn("Error: The two samples must be in arrays or dataframes format.")
        raise ValueError
# samples observed means
# One sample T-squared
else:
else:
# Two sample T-squared
# difference of means
warn(
    f"Error: the two samples must have the same number of features ({p} != {py})."
)
raise ValueError
# bessel correction ( -1 )
else:
else:
# calculate the T2 statistics
# Technically, we use diff_bar.T for the transpose, but with Pandas, a 1 dimensional dataframe
# is automatically aligned for @ and is not required
else:
except AttributeError:
    cov = np.cov(x, rowvar=False)
# for f test
# term = (n - p) / (p * (n - 1)) # getting different results
# f statistic
# TODO: use chi square instead of f statistic for large sample
else:
# pooled covariance
# f statistic
# TODO: use chi square instead of f statistic for large sample
# p-value
# return the list of results
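The two-sample computation described in the docstring can be sketched with NumPy alone. This is an illustration under stated assumptions, not the module's code: `two_sample_t2_sketch` is a hypothetical name, both inputs are assumed to be 2-D arrays with observations in rows, Bessel's correction is always applied, and the T² to F conversion uses the standard factor (n_x + n_y − p − 1) / (p (n_x + n_y − 2)). The p-value step is left out; it would come from the upper tail of an F distribution with p and n_x + n_y − p − 1 degrees of freedom.

```python
import numpy as np

def two_sample_t2_sketch(x, y):
    """Hypothetical helper: two-sample Hotelling T^2 and its F statistic."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, p = x.shape
    ny, py = y.shape
    if p != py:
        raise ValueError(f"samples must have the same number of features ({p} != {py})")
    # Difference of the observed sample means.
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled covariance with Bessel's correction; atleast_2d covers p == 1.
    sx = np.atleast_2d(np.cov(x, rowvar=False))
    sy = np.atleast_2d(np.cov(y, rowvar=False))
    s = ((nx - 1) * sx + (ny - 1) * sy) / (nx + ny - 2)
    # T^2 = diff^T [S (1/nx + 1/ny)]^{-1} diff, using solve instead of an explicit inverse.
    t2 = float(diff @ np.linalg.solve(s * (1.0 / nx + 1.0 / ny), diff))
    # Standard conversion of T^2 to an F statistic.
    f_stat = t2 * (nx + ny - p - 1) / (p * (nx + ny - 2))
    return t2, f_stat, s

# Made-up data: y is x shifted by one unit in each feature.
x = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
y = x + 1.0
t2, f_stat, s = two_sample_t2_sketch(x, y)
```

Solving the linear system avoids forming the inverse covariance matrix explicitly, which is usually the more numerically stable choice.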
"""hotelling_dict.
Returns the same values as `hotelling_t2`, but as a dictionary (for use in an API, for example).
:param x: array-like, samples of observations for one or two sample test (required)
:param y: for two sample test, array-like, samples of observations (optional), for one sample, list of means to test
:param bessel: bool, apply bessel correction (default)
:return: dict
"""
t2_stat, f_stat, p_value, s = hotelling_t2(x, y, bessel)
return dict(t2_stat=t2_stat, f_stat=f_stat, p_value=p_value, pooled_var=s)