Theory#

This section explains the distance measures provided by this package (distance covariance and distance correlation). The package can be used without a deep understanding of the mathematics involved, so feel free to skip this chapter.

Distance covariance and distance correlation#

Distance covariance and distance correlation are recently introduced dependency measures between random vectors [CSRB07]. Let \(X\) and \(Y\) be two random vectors with finite first moments, and let \(\phi_X\) and \(\phi_Y\) be the respective characteristic functions

\[\begin{split}\phi_X(t) &= \mathbb{E}[e^{itX}] \\ \phi_Y(t) &= \mathbb{E}[e^{itY}]\end{split}\]

Let \(\phi_{X, Y}\) be the joint characteristic function. Then, if \(X\) and \(Y\) take values in \(\mathbb{R}^p\) and \(\mathbb{R}^q\) respectively, the distance covariance between them \(\mathcal{V}(X, Y)\), or \(\text{dCov}(X, Y)\), is the non-negative number defined by

\[\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}}|\phi_{X, Y}(t, s) - \phi_X(t)\phi_Y(s)|^2w(t,s)dt ds,\]

where \(w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}\), \(|{}\cdot{}|_d\) is the Euclidean norm in \(\mathbb{R}^d\) and \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\). The distance correlation \(\mathcal{R}(X, Y)\), or \(\text{dCor}(X, Y)\), is defined as

\[\begin{split}\mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y)}} &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) > 0$} \\ 0 &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) = 0$.} \end{cases}\end{split}\]

Properties#

The distance covariance has the following properties:

  • \(\mathcal{V}(X, Y) \geq 0\).

  • \(\mathcal{V}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • \(\mathcal{V}(X, Y) = \mathcal{V}(Y, X)\).

  • \(\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(X, Y)\) for all constant real-valued vectors \(\mathbf{a}_1, \mathbf{a}_2\), scalars \(b_1, b_2\) and orthonormal matrices \(\mathbf{C}_1, \mathbf{C}_2\).

  • If the random vectors \((X_1, Y_1)\) and \((X_2, Y_2)\) are independent then

\[\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).\]

The distance correlation has the following properties:

  • \(0 \leq \mathcal{R}(X, Y) \leq 1\).

  • \(\mathcal{R}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • If \(\mathcal{R}(X, Y) = 1\) then there exists a vector \(\mathbf{a}\), a nonzero real number \(b\) and an orthogonal matrix \(\mathbf{C}\) such that \(Y = \mathbf{a} + b\mathbf{C}X\).
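The second property is what makes distance correlation attractive in practice: unlike the Pearson correlation coefficient, which only detects linear relationships, a zero distance correlation characterizes independence. The following sketch illustrates the contrast using the sample estimator distance_correlation() described in the next section (the dcor import name is an assumption):

```python
import numpy as np
import dcor  # import name assumed; distance_correlation() is cited below

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=1000)
y = x ** 2  # strongly dependent on x, yet linearly uncorrelated

# The Pearson correlation is near zero despite the dependence...
print(np.corrcoef(x, y)[0, 1])
# ...while the sample distance correlation is clearly positive.
print(dcor.distance_correlation(x, y))
```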

Estimators#

Distance covariance has an estimator with a simple form. Suppose that we have \(n\) observations of \(X\) and \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of \(y\). If we define \(a_{ij} = | x_i - x_j |_p\) and \(b_{ij} = | y_i - y_j |_q\), the corresponding double-centered matrices (double_centered()) \((A_{i, j})_{i,j=1}^n\) and \((B_{i, j})_{i,j=1}^n\) are defined as

\[\begin{split}A_{i, j} &= a_{i,j} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n a_{kl}, \\ B_{i, j} &= b_{i,j} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n b_{kl}.\end{split}\]
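A minimal NumPy sketch of this operation follows (the package's double_centered() computes the same thing; the helper name here is illustrative):

```python
import numpy as np

def double_center(a):
    # Subtract the row and column means and add back the grand mean,
    # exactly as in the formula above.
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

# Pairwise distance matrix of a small one-dimensional sample.
x = np.array([1.0, 2.0, 4.0, 8.0])
a = np.abs(x[:, None] - x[None, :])

A = double_center(a)
# Every row and column of a double-centered matrix sums to zero.
print(np.allclose(A.sum(axis=0), 0.0), np.allclose(A.sum(axis=1), 0.0))
```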

Then

\[\mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{i, j} B_{i, j}\]

is called the squared sample distance covariance (distance_covariance_sqr()), and it is an estimator of \(\mathcal{V}^2(X, Y)\). Its square root (distance_covariance()) is thus an estimator of the distance covariance. The sample distance correlation \(\mathcal{R}_n(x, y)\) (distance_correlation()) can be obtained as the standardized sample distance covariance

\[\begin{split}\mathcal{R}_n^2(x, y) = \begin{cases} \frac{\mathcal{V}_n^2(x, y)}{\sqrt{\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y)}}, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) > 0$}, \\ 0, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) = 0$.} \end{cases}\end{split}\]
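These formulas translate directly into NumPy. The sketch below is a naive \(O(n^2)\) implementation of both sample statistics (the package's distance_covariance_sqr() and distance_correlation() compute the same quantities; the helper names are illustrative):

```python
import numpy as np

def pairwise_distances(x):
    # Euclidean distance matrix of a sample, one observation per row.
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

def double_center(a):
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def dcov_sqr(x, y):
    # Squared sample distance covariance V_n^2(x, y).
    A = double_center(pairwise_distances(x))
    B = double_center(pairwise_distances(y))
    return (A * B).mean()  # (1/n^2) * sum_ij A_ij B_ij

def dcor_sqr(x, y):
    # Squared sample distance correlation R_n^2(x, y).
    denom_sqr = dcov_sqr(x, x) * dcov_sqr(y, y)
    return dcov_sqr(x, y) / np.sqrt(denom_sqr) if denom_sqr > 0 else 0.0
```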

These estimators have the following properties:

  • \(\mathcal{V}_n^2(x, y) \geq 0\)

  • \(0 \leq \mathcal{R}_n^2(x, y) \leq 1\)

In a similar way one can define an unbiased estimator \(\Omega_n(x, y)\) (u_distance_covariance_sqr()) of the squared distance covariance \(\mathcal{V}^2(X, Y)\). Given the previous definitions of \(a_{ij}\) and \(b_{ij}\), we define the \(U\)-centered matrices (u_centered()) \((\widetilde{A}_{i, j})_{i,j=1}^n\) and \((\widetilde{B}_{i, j})_{i,j=1}^n\)

(1)#\[\begin{split}\widetilde{A}_{i, j} &= \begin{cases} a_{i,j} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n a_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j, \end{cases} \\ \widetilde{B}_{i, j} &= \begin{cases} b_{i,j} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n b_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j. \end{cases}\end{split}\]

Then, \(\Omega_n(x, y)\) is defined as

\[\Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]
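A corresponding sketch of the unbiased estimator, following (1) (the package implements these as u_centered() and u_distance_covariance_sqr(); the helper names are illustrative):

```python
import numpy as np

def pairwise_distances(x):
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

def u_center(a):
    # U-centering of a square distance matrix, following (1).
    n = a.shape[0]
    u = (a
         - a.sum(axis=1, keepdims=True) / (n - 2)
         - a.sum(axis=0, keepdims=True) / (n - 2)
         + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)  # the diagonal is defined to be zero
    return u

def u_dcov_sqr(x, y):
    # Unbiased estimator Omega_n(x, y); requires more than 3 observations.
    A = u_center(pairwise_distances(x))
    B = u_center(pairwise_distances(y))
    n = A.shape[0]
    return (A * B).sum() / (n * (n - 3))
```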

We can also obtain an estimator of \(\mathcal{R}^2(X, Y)\) (u_distance_correlation_sqr()) using \(\Omega_n(x, y)\), as we did with \(\mathcal{V}_n^2(x, y)\). Note that \(\Omega_n(x, y)\) is not guaranteed to be non-negative: it can sometimes take small negative values near \(0\).

There are algorithms that can compute \(\mathcal{V}_n^2(x, y)\) and \(\Omega_n(x, y)\) for univariate random variables with \(O(n\log n)\) complexity [CCH19, CHS16]. Since the estimator formulas explained above have complexity \(O(n^2)\), this improvement is significant, especially for larger samples.
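If the installed version of the package allows choosing the algorithm, the fast variant can be selected explicitly for large univariate samples. The sketch below assumes a method keyword on the estimator functions; this parameter name and its accepted values are assumptions, so check the package's reference documentation:

```python
import numpy as np
import dcor  # import name assumed

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)

# For univariate data an O(n log n) algorithm applies; the 'method'
# keyword and its value are assumptions about the package API.
print(dcor.distance_covariance_sqr(x, y, method='avl'))
```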

Partial distance covariance and partial distance correlation#

Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in which the effect of a third random vector is removed [CSR14]. The population partial distance covariance \(\mathcal{V}^{*}(X, Y; Z)\), or \(\text{pdCov}^{*}(X, Y; Z)\), between two random vectors \(X\) and \(Y\) with respect to a random vector \(Z\) is

\[\begin{split}\mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \frac{\mathcal{V}^2(X, Z)\mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0 \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0 \end{cases}\end{split}\]

where \(\mathcal{V}^2({}\cdot{}, {}\cdot{})\) is the squared distance covariance.

The corresponding partial distance correlation \(\mathcal{R}^{*}(X, Y; Z)\), or \(\text{pdCor}^{*}(X, Y; Z)\), is

\[\begin{split}\mathcal{R}^{*}(X, Y; Z) = \begin{cases} \frac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z)\mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)}\sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1 \\ 0 & \text{if } \mathcal{R}^4(X, Z) = 1 \text{ or } \mathcal{R}^4(Y, Z) = 1 \end{cases}\end{split}\]

where \(\mathcal{R}({}\cdot{}, {}\cdot{})\) is the distance correlation.

Estimators#

As in distance covariance and distance correlation, the \(U\)-centered distance matrices \(\widetilde{A}_{i, j}\), \(\widetilde{B}_{i, j}\) and \(\widetilde{C}_{i, j}\) corresponding to the samples \(x\), \(y\) and \(z\), taken from the random vectors \(X\), \(Y\) and \(Z\), can be computed using (1).

The set of all \(U\)-centered distance matrices is a Hilbert space with the inner product (u_product())

\[\langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]

Then, the projection of a sample \(x\) onto \(z\) (u_projection()) can be computed in this Hilbert space, using the associated \(U\)-centered matrices, as

\[P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

The complementary projection (u_complementary_projection()) is then

\[P_{z^{\perp}}(x) = \widetilde{A} - P_z(x) = \widetilde{A} - \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

We can now define the sample partial distance covariance (partial_distance_covariance()) as

\[\mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle\]

The sample partial distance correlation (partial_distance_correlation()) is defined as the cosine of the angle between the vectors \(P_{z^{\perp}}(x)\) and \(P_{z^{\perp}}(y)\)

\[\begin{split}\mathcal{R}_n^{*}(x, y; z) = \begin{cases} \frac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)||} & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| \neq 0 \\ 0 & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| = 0 \end{cases}\end{split}\]
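The whole construction above fits in a few lines of NumPy. The sketch below mirrors u_product(), u_complementary_projection(), partial_distance_covariance() and partial_distance_correlation() under illustrative names, reusing the \(U\)-centering from (1):

```python
import numpy as np

def pairwise_distances(x):
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

def u_center(a):
    # U-centering, following (1).
    n = a.shape[0]
    u = (a
         - a.sum(axis=1, keepdims=True) / (n - 2)
         - a.sum(axis=0, keepdims=True) / (n - 2)
         + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def u_product(u, v):
    # Inner product on the Hilbert space of U-centered matrices.
    n = u.shape[0]
    return (u * v).sum() / (n * (n - 3))

def complementary_projection(A, C):
    # P_{z^perp}(x) for the U-centered matrices A (of x) and C (of z).
    cc = u_product(C, C)
    return A - (u_product(A, C) / cc) * C if cc != 0 else A

def partial_dcov(x, y, z):
    # Sample partial distance covariance V_n^*(x, y; z).
    A, B, C = (u_center(pairwise_distances(s)) for s in (x, y, z))
    return u_product(complementary_projection(A, C),
                     complementary_projection(B, C))

def partial_dcor(x, y, z):
    # Sample partial distance correlation R_n^*(x, y; z).
    A, B, C = (u_center(pairwise_distances(s)) for s in (x, y, z))
    P_x, P_y = complementary_projection(A, C), complementary_projection(B, C)
    denom = np.sqrt(u_product(P_x, P_x) * u_product(P_y, P_y))
    return u_product(P_x, P_y) / denom if denom > 0 else 0.0
```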

Energy distance#

Energy distance is a statistical distance between random vectors \(X, Y \in \mathbb{R}^d\) [CSR13], defined as

\[\mathcal{E}(X, Y) = 2\mathbb{E}(|| X - Y ||) - \mathbb{E}(|| X - X' ||) - \mathbb{E}(|| Y - Y' ||)\]

where \(X'\) and \(Y'\) are independent and identically distributed copies of \(X\) and \(Y\), respectively.

It can be proved that, if the characteristic functions of \(X\) and \(Y\) are \(\phi_X(t)\) and \(\phi_Y(t)\), the energy distance can alternatively be written as

\[\mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{||t||^{d+1}}dt\]

where again \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\).

Estimator#

Suppose that we have \(n_1\) observations of \(X\) and \(n_2\) observations of \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of \(y\). Then, an estimator of the energy distance (energy_distance()) is

\[\mathcal{E}_{n_1, n_2}(x, y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|| x_i - y_j || - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}|| x_i - x_j || - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2}|| y_i - y_j ||.\]
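A direct NumPy sketch of this estimator (the package's energy_distance() computes the same statistic; the helper name is illustrative):

```python
import numpy as np

def energy_distance_naive(x, y):
    # Sample energy distance between two samples, one observation per row.
    x, y = (np.asarray(s, dtype=float) for s in (x, y))
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    d_xy = np.linalg.norm(x[:, None] - y[None, :], axis=-1)  # n1 x n2
    d_xx = np.linalg.norm(x[:, None] - x[None, :], axis=-1)  # n1 x n1
    d_yy = np.linalg.norm(y[:, None] - y[None, :], axis=-1)  # n2 x n2
    return 2 * d_xy.mean() - d_xx.mean() - d_yy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(200, 2))
y = rng.normal(1, 1, size=(300, 2))  # same shape, shifted mean
print(energy_distance_naive(x, y))   # clearly positive
```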

References#

CCH19

Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, July 2019. doi:10.1016/j.csda.2019.01.016.

CHS16

Xiaoming Huo and Gábor J. Székely. Fast computing for distance covariance. Technometrics, 58(4):435–447, 2016. URL: http://dx.doi.org/10.1080/00401706.2015.1054435, doi:10.1080/00401706.2015.1054435.

CSR13

Gábor J. Székely and Maria L. Rizzo. Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249 – 1272, 2013. URL: http://www.sciencedirect.com/science/article/pii/S0378375813000633, doi:10.1016/j.jspi.2013.03.018.

CSR14

Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, 12 2014. URL: https://doi.org/10.1214/14-AOS1255, doi:10.1214/14-AOS1255.

CSRB07

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 12 2007. URL: http://dx.doi.org/10.1214/009053607000000505, doi:10.1214/009053607000000505.