Theory#
This section explains the statistics provided by this package (distance covariance, distance correlation, partial distance correlation, and energy distance). The package can be used without a deep understanding of the underlying mathematics, so feel free to skip this chapter.
Distance covariance and distance correlation#
Distance covariance and distance correlation are recently introduced measures of dependence between random vectors [CSRB07]. Let \(X\) and \(Y\) be two random vectors with finite first moments, and let \(\phi_X\) and \(\phi_Y\) be their respective characteristic functions.
Let \(\phi_{X, Y}\) be the joint characteristic function. Then, if \(X\) and \(Y\) take values in \(\mathbb{R}^p\) and \(\mathbb{R}^q\) respectively, the distance covariance between them, \(\mathcal{V}(X, Y)\) or \(\text{dCov}(X, Y)\), is the non-negative number defined by

\[\mathcal{V}(X, Y) = \sqrt{\int_{\mathbb{R}^{p+q}} |\phi_{X, Y}(t, s) - \phi_X(t) \phi_Y(s)|^2 w(t, s) \, dt \, ds},\]
where \(w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}\), \(|{}\cdot{}|_d\) is the Euclidean norm in \(\mathbb{R}^d\) and \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^{d+1}\). The distance correlation \(\mathcal{R}(X, Y)\), or \(\text{dCor}(X, Y)\), is defined as

\[\mathcal{R}(X, Y) = \begin{cases} \dfrac{\mathcal{V}(X, Y)}{\sqrt{\mathcal{V}(X, X) \mathcal{V}(Y, Y)}} & \text{if } \mathcal{V}(X, X) \mathcal{V}(Y, Y) > 0, \\ 0 & \text{otherwise.} \end{cases}\]
Properties#
The distance covariance has the following properties:
\(\mathcal{V}(X, Y) \geq 0\).
\(\mathcal{V}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.
\(\mathcal{V}(X, Y) = \mathcal{V}(Y, X)\).
\(\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(X, Y)\) for all constant real-valued vectors \(\mathbf{a}_1, \mathbf{a}_2\), scalars \(b_1, b_2\) and orthonormal matrices \(\mathbf{C}_1, \mathbf{C}_2\).
If the random vectors \((X_1, Y_1)\) and \((X_2, Y_2)\) are independent then

\[\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).\]
The distance correlation has the following properties:
\(0 \leq \mathcal{R}(X, Y) \leq 1\).
\(\mathcal{R}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.
If \(\mathcal{R}(X, Y) = 1\) then there exists a vector \(\mathbf{a}\), a nonzero real number \(b\) and an orthogonal matrix \(\mathbf{C}\) such that \(Y = \mathbf{a} + b\mathbf{C}X\).
Estimators#
Distance covariance has an estimator with a simple form. Suppose that we have
\(n\) observations of \(X\) and \(Y\), denoted by \(x\) and \(y\).
We denote as \(x_i\) the
\(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of
\(y\). If we define \(a_{ij} = | x_i - x_j |_p\) and \(b_{ij} = | y_i - y_j |_q\),
the corresponding double centered matrices (double_centered()) \((A_{i, j})_{i,j=1}^n\) and \((B_{i, j})_{i,j=1}^n\) are defined by

\[A_{i, j} = a_{ij} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2} \sum_{k, l=1}^n a_{kl},\]
\[B_{i, j} = b_{ij} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2} \sum_{k, l=1}^n b_{kl}.\]

Then

\[\mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i, j=1}^n A_{i, j} B_{i, j}\]

is called the squared sample distance covariance (distance_covariance_sqr()), and it is an estimator of \(\mathcal{V}^2(X, Y)\). Its square root \(\mathcal{V}_n(x, y)\) (distance_covariance()) is thus an estimator of the distance covariance.
The sample distance correlation \(\mathcal{R}_n(x, y)\) (distance_correlation()) can be obtained as the standardized sample distance covariance

\[\mathcal{R}_n(x, y) = \begin{cases} \dfrac{\mathcal{V}_n(x, y)}{\sqrt{\mathcal{V}_n(x, x) \mathcal{V}_n(y, y)}} & \text{if } \mathcal{V}_n(x, x) \mathcal{V}_n(y, y) > 0, \\ 0 & \text{otherwise.} \end{cases}\]
These estimators have the following properties:
\(\mathcal{V}_n^2(x, y) \geq 0\)
\(0 \leq \mathcal{R}_n^2(x, y) \leq 1\)
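These estimators translate almost directly into NumPy. The following is a didactic sketch for one-dimensional samples, not the package's optimized implementation (the function names here are illustrative):

```python
import numpy as np

def double_center(d):
    # Subtract row means and column means; add back the grand mean.
    return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov_sqr(x, y):
    # Squared sample distance covariance V_n^2(x, y) for 1-d samples.
    a = double_center(np.abs(x[:, None] - x[None, :]))
    b = double_center(np.abs(y[:, None] - y[None, :]))
    return (a * b).mean()

def dcor(x, y):
    # Sample distance correlation R_n(x, y).
    denom = np.sqrt(dcov_sqr(x, x) * dcov_sqr(y, y))
    return np.sqrt(dcov_sqr(x, y) / denom) if denom > 0 else 0.0

x = np.arange(10.0)
print(round(dcor(x, 3.0 * x + 1.0), 6))  # a perfect linear relation gives 1.0
```

For vector-valued samples the only change would be computing the pairwise Euclidean distance matrices instead of `np.abs` differences.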
In a similar way one can define an unbiased estimator \(\Omega_n(x, y)\) (u_distance_covariance_sqr()) of the squared distance covariance \(\mathcal{V}^2(X, Y)\). Given the previous definitions of \(a_{ij}\) and \(b_{ij}\), we define the \(U\)-centered matrices (u_centered()) \((\widetilde{A}_{i, j})_{i,j=1}^n\) and \((\widetilde{B}_{i, j})_{i,j=1}^n\) by

\[\widetilde{A}_{i, j} = \begin{cases} a_{ij} - \dfrac{1}{n-2} \sum_{l=1}^n a_{il} - \dfrac{1}{n-2} \sum_{k=1}^n a_{kj} + \dfrac{1}{(n-1)(n-2)} \sum_{k, l=1}^n a_{kl} & \text{if } i \neq j, \\ 0 & \text{if } i = j, \end{cases} \tag{1}\]

and analogously for \(\widetilde{B}_{i, j}\). Then, \(\Omega_n(x, y)\) is defined as

\[\Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i, j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]
We can also obtain an estimator of \(\mathcal{R}^2(X, Y)\) (u_distance_correlation_sqr()) using \(\Omega_n(x, y)\), as we did with \(\mathcal{V}_n^2(x, y)\). Note that \(\Omega_n(x, y)\) is not guaranteed to satisfy \(\Omega_n(x, y) \geq 0\): being unbiased, it can take small negative values when the true squared distance covariance is near \(0\).
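The unbiased estimator admits the same kind of sketch. Again this is an illustrative NumPy version for one-dimensional samples, not the package's implementation:

```python
import numpy as np

def u_center(d):
    # U-centered distance matrix; the diagonal is set to zero.
    n = d.shape[0]
    u = (d
         - d.sum(axis=1, keepdims=True) / (n - 2)
         - d.sum(axis=0, keepdims=True) / (n - 2)
         + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def u_dcov_sqr(x, y):
    # Unbiased estimator Omega_n(x, y) for 1-d samples (requires n > 3).
    n = len(x)
    a = u_center(np.abs(x[:, None] - x[None, :]))
    b = u_center(np.abs(y[:, None] - y[None, :]))
    return (a * b).sum() / (n * (n - 3))
```

Because the diagonals of the \(U\)-centered matrices are zero, summing over all entries is equivalent to summing over \(i \neq j\).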
There are algorithms that can compute \(\mathcal{V}_n^2(x, y)\) and \(\Omega_n(x, y)\) for univariate random variables with \(O(n \log n)\) complexity [CCH19, CHS16]. Since the estimator formulas above have complexity \(O(n^2)\), this improvement is significant, especially for larger samples.
Partial distance covariance and partial distance correlation#
Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in which the effect of a third random vector is removed [CSR14]. The population partial distance covariance \(\mathcal{V}^{*}(X, Y; Z)\), or \(\text{pdCov}^{*}(X, Y; Z)\), between two random vectors \(X\) and \(Y\) with respect to a random vector \(Z\) is

\[\mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \dfrac{\mathcal{V}^2(X, Z) \mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0, \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0, \end{cases}\]
where \(\mathcal{V}^2({}\cdot{}, {}\cdot{})\) is the squared distance covariance.
The corresponding partial distance correlation \(\mathcal{R}^{*}(X, Y; Z)\), or \(\text{pdCor}^{*}(X, Y; Z)\), is

\[\mathcal{R}^{*}(X, Y; Z) = \begin{cases} \dfrac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z) \mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)} \sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1, \\ 0 & \text{otherwise,} \end{cases}\]
where \(\mathcal{R}({}\cdot{}, {}\cdot{})\) is the distance correlation.
Estimators#
As in distance covariance and distance correlation, the \(U\)-centered distance matrices \(\widetilde{A}_{i, j}\), \(\widetilde{B}_{i, j}\) and \(\widetilde{C}_{i, j}\) corresponding to the samples \(x\), \(y\) and \(z\) taken from the random vectors \(X\), \(Y\) and \(Z\) can be computed using (1).
The set of all \(U\)-centered distance matrices is a Hilbert space with the inner product (u_product())

\[\langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i, j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]
Then, the projection of a sample \(x\) over \(z\) (u_projection()) can be taken in this Hilbert space using the associated matrices, as

\[P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle} \widetilde{C}.\]
The complementary projection (u_complementary_projection()) is then

\[P_{z^{\perp}}(x) = \widetilde{A} - P_z(x).\]
We can now define the sample partial distance covariance (partial_distance_covariance()) as

\[\mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle.\]
The sample partial distance correlation (partial_distance_correlation()) is defined as the cosine of the angle between the vectors \(P_{z^{\perp}}(x)\) and \(P_{z^{\perp}}(y)\):

\[\mathcal{R}_n^{*}(x, y; z) = \begin{cases} \dfrac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{\|P_{z^{\perp}}(x)\| \, \|P_{z^{\perp}}(y)\|} & \text{if } \|P_{z^{\perp}}(x)\| \, \|P_{z^{\perp}}(y)\| \neq 0, \\ 0 & \text{otherwise.} \end{cases}\]
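The projection machinery above can also be sketched directly in NumPy. As before, this is an illustrative version for one-dimensional samples with hypothetical helper names, not the package's implementation:

```python
import numpy as np

def u_center(d):
    # U-centered distance matrix; the diagonal is set to zero.
    n = d.shape[0]
    u = (d
         - d.sum(axis=1, keepdims=True) / (n - 2)
         - d.sum(axis=0, keepdims=True) / (n - 2)
         + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def u_product(a, b):
    # Inner product in the Hilbert space of U-centered matrices.
    n = a.shape[0]
    return (a * b).sum() / (n * (n - 3))

def complementary_projection(a, c):
    # Projection of a onto the orthogonal complement of c.
    denom = u_product(c, c)
    if denom == 0:
        return a
    return a - (u_product(a, c) / denom) * c

def u_dist_matrix(v):
    return u_center(np.abs(v[:, None] - v[None, :]))

def partial_dcov(x, y, z):
    # Sample partial distance covariance for 1-d samples.
    a, b, c = (u_dist_matrix(v) for v in (x, y, z))
    pa = complementary_projection(a, c)
    pb = complementary_projection(b, c)
    return u_product(pa, pb)

def partial_dcor(x, y, z):
    # Cosine of the angle between the two projected matrices.
    a, b, c = (u_dist_matrix(v) for v in (x, y, z))
    pa = complementary_projection(a, c)
    pb = complementary_projection(b, c)
    denom = np.sqrt(u_product(pa, pa) * u_product(pb, pb))
    return u_product(pa, pb) / denom if denom > 0 else 0.0
```

As a sanity check, conditioning on \(z = x\) removes all of \(x\)'s contribution, so `partial_dcov(x, y, x)` is zero for any `y`.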
Energy distance#
Energy distance is a statistical distance between random vectors \(X, Y \in \mathbb{R}^d\) [CSR13], defined as

\[\mathcal{E}(X, Y) = 2\mathbb{E}|X - Y|_d - \mathbb{E}|X - X'|_d - \mathbb{E}|Y - Y'|_d,\]
where \(X'\) and \(Y'\) are independent and identically distributed copies of \(X\) and \(Y\), respectively.
It can be proved that, if the characteristic functions of \(X\) and \(Y\) are \(\phi_X(t)\) and \(\phi_Y(t)\), the energy distance can be alternatively written as

\[\mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{|t|_d^{1+d}} \, dt,\]
where again \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^{d+1}\).
Estimator#
Suppose that we have \(n_1\) observations of \(X\) and \(n_2\) observations of
\(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the
\(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of
\(y\). Then, an estimator of the energy distance (energy_distance()) is

\[\mathcal{E}_{n_1, n_2}(x, y) = \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} |x_i - y_j|_d - \frac{1}{n_1^2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_1} |x_i - x_j|_d - \frac{1}{n_2^2} \sum_{i=1}^{n_2} \sum_{j=1}^{n_2} |y_i - y_j|_d.\]
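This estimator is a few lines of NumPy for one-dimensional samples. Again, a didactic sketch rather than the package's implementation:

```python
import numpy as np

def energy_distance(x, y):
    # Estimator of the energy distance between two 1-d samples
    # (possibly of different sizes).
    d_xy = np.abs(x[:, None] - y[None, :]).mean()
    d_xx = np.abs(x[:, None] - x[None, :]).mean()
    d_yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * d_xy - d_xx - d_yy

x = np.arange(20.0)
print(energy_distance(x, x))        # identical samples: 0.0
print(energy_distance(x, x + 100))  # well-separated samples: large positive value
```

Note that the estimate is zero only when the two samples have the same empirical distribution, mirroring the population property that \(\mathcal{E}(X, Y) = 0\) if and only if \(X\) and \(Y\) are identically distributed.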
References#
- CCH19
Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, July 2019. doi:10.1016/j.csda.2019.01.016.
- CHS16
Xiaoming Huo and Gábor J. Székely. Fast computing for distance covariance. Technometrics, 58(4):435–447, 2016. URL: http://dx.doi.org/10.1080/00401706.2015.1054435, doi:10.1080/00401706.2015.1054435.
- CSR13
Gábor J. Székely and Maria L. Rizzo. Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013. URL: http://www.sciencedirect.com/science/article/pii/S0378375813000633, doi:10.1016/j.jspi.2013.03.018.
- CSR14
Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, 12 2014. URL: https://doi.org/10.1214/14-AOS1255, doi:10.1214/14-AOS1255.
- CSRB07
Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 12 2007. URL: http://dx.doi.org/10.1214/009053607000000505, doi:10.1214/009053607000000505.