Theory

This section explains the distance measures provided by this package (distance covariance and distance correlation, among others). The package can be used without a deep understanding of the mathematics involved, so feel free to skip this chapter.

Distance covariance and distance correlation

Distance covariance and distance correlation are recently introduced dependency measures between random vectors [CSRB07]. Let \(X\) and \(Y\) be two random vectors with finite first moments, and let \(\phi_X\) and \(\phi_Y\) be the respective characteristic functions

\[\begin{split}\phi_X(t) &= \mathbb{E}[e^{itX}] \\ \phi_Y(t) &= \mathbb{E}[e^{itY}]\end{split}\]

Let \(\phi_{X, Y}\) be the joint characteristic function. Then, if \(X\) and \(Y\) take values in \(\mathbb{R}^p\) and \(\mathbb{R}^q\) respectively, the distance covariance between them \(\mathcal{V}(X, Y)\), or \(\text{dCov}(X, Y)\), is the non-negative number defined by

\[\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}}|\phi_{X, Y}(t, s) - \phi_X(t)\phi_Y(s)|^2w(t,s)dt ds,\]

where \(w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}\), \(|{}\cdot{}|_d\) is the Euclidean norm in \(\mathbb{R}^d\) and \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\). The distance correlation \(\mathcal{R}(X, Y)\), or \(\text{dCor}(X, Y)\), is defined as

\[\begin{split}\mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y)}} &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) > 0$} \\ 0 &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) = 0$.} \end{cases}\end{split}\]

Properties

The distance covariance has the following properties:

  • \(\mathcal{V}(X, Y) \geq 0\).

  • \(\mathcal{V}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • \(\mathcal{V}(X, Y) = \mathcal{V}(Y, X)\).

  • \(\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(X, Y)\) for all constant real-valued vectors \(\mathbf{a}_1, \mathbf{a}_2\), scalars \(b_1, b_2\) and orthonormal matrices \(\mathbf{C}_1, \mathbf{C}_2\).

  • If the random vectors \((X_1, Y_1)\) and \((X_2, Y_2)\) are independent then

\[\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).\]

The distance correlation has the following properties:

  • \(0 \leq \mathcal{R}(X, Y) \leq 1\).

  • \(\mathcal{R}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • If \(\mathcal{R}(X, Y) = 1\) then there exists a vector \(\mathbf{a}\), a nonzero real number \(b\) and an orthogonal matrix \(\mathbf{C}\) such that \(Y = \mathbf{a} + b\mathbf{C}X\).

Estimators

Distance covariance has an estimator with a simple form. Suppose that we have \(n\) observations of \(X\) and \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and as \(y_i\) the \(i\)-th observation of \(y\). If we define \(a_{ij} = | x_i - x_j |_p\) and \(b_{ij} = | y_i - y_j |_q\), the corresponding double-centered matrices \((A_{i, j})_{i,j=1}^n\) and \((B_{i, j})_{i,j=1}^n\) are defined as

\[\begin{split}A_{i, j} &= a_{i,j} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n a_{kl}, \\ B_{i, j} &= b_{i,j} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n b_{kl}.\end{split}\]

Then

\[\mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{i, j} B_{i, j}\]

is called the squared sample distance covariance, and it is an estimator of \(\mathcal{V}^2(X, Y)\). The sample distance correlation \(\mathcal{R}_n(x, y)\) can be obtained as the standardized sample distance covariance

\[\begin{split}\mathcal{R}_n^2(x, y) = \begin{cases} \frac{\mathcal{V}_n^2(x, y)}{\sqrt{\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y)}}, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) > 0$}, \\ 0, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) = 0$.} \end{cases}\end{split}\]
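As a concrete illustration, here is a minimal NumPy sketch of these estimators. The function names are illustrative, not the API of this package, and the samples are assumed to be arrays of shape \((n, p)\) and \((n, q)\):

    import numpy as np
    from scipy.spatial.distance import cdist


    def double_centered(d):
        # Subtract row means and column means, then add the grand mean,
        # as in the definition of A and B above.
        return (d - d.mean(axis=0, keepdims=True)
                  - d.mean(axis=1, keepdims=True)
                  + d.mean())


    def distance_covariance_sqr(x, y):
        # V_n^2(x, y): mean of the elementwise product of the
        # double-centered distance matrices.
        a = double_centered(cdist(x, x))  # a_ij = |x_i - x_j|_p
        b = double_centered(cdist(y, y))  # b_ij = |y_i - y_j|_q
        return (a * b).mean()


    def distance_correlation_sqr(x, y):
        # R_n^2(x, y): standardized squared sample distance covariance.
        denom_sqr = distance_covariance_sqr(x, x) * distance_covariance_sqr(y, y)
        if denom_sqr == 0:
            return 0.0
        return distance_covariance_sqr(x, y) / np.sqrt(denom_sqr)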

These estimators have the following properties:

  • \(\mathcal{V}_n^2(x, y) \geq 0\).

  • \(0 \leq \mathcal{R}_n^2(x, y) \leq 1\).
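For example, the sample distance correlation detects a nonlinear dependence such as \(y = x^2\), for which the Pearson correlation is close to zero. Using the sketch above:

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(1000, 1))
    y = x ** 2  # dependent on x, but linearly uncorrelated

    print(np.corrcoef(x.ravel(), y.ravel())[0, 1])  # approximately 0
    print(distance_correlation_sqr(x, y))           # clearly positive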

In a similar way, one can define an unbiased estimator \(\Omega_n(x, y)\) of the squared distance covariance \(\mathcal{V}^2(X, Y)\). Given the previous definitions of \(a_{ij}\) and \(b_{ij}\), we define the \(U\)-centered matrices \((\widetilde{A}_{i, j})_{i,j=1}^n\) and \((\widetilde{B}_{i, j})_{i,j=1}^n\)

(1)\[\begin{split}\widetilde{A}_{i, j} &= \begin{cases} a_{i,j} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n a_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j, \end{cases} \\ \widetilde{B}_{i, j} &= \begin{cases} b_{i,j} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n b_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j. \end{cases}\end{split}\]

Then, \(\Omega_n(x, y)\) is defined as

\[\Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]

We can also obtain an estimator of \(\mathcal{R}^2(X, Y)\) using \(\Omega_n(x, y)\), as we did with \(\mathcal{V}_n^2(x, y)\). Note that \(\Omega_n(x, y)\) does not satisfy \(\Omega_n(x, y) \geq 0\), since it can take small negative values near \(0\).
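Continuing the sketch above, the \(U\)-centering in (1) and the unbiased estimator can be written as follows (names illustrative; at least \(n = 4\) observations are needed):

    def u_centered(d):
        # U-centering as in (1); the diagonal is set to zero.
        n = d.shape[0]
        out = (d - d.sum(axis=0, keepdims=True) / (n - 2)
                 - d.sum(axis=1, keepdims=True) / (n - 2)
                 + d.sum() / ((n - 1) * (n - 2)))
        np.fill_diagonal(out, 0)
        return out


    def u_distance_covariance_sqr(x, y):
        # Omega_n(x, y), the unbiased estimator of V^2(X, Y).
        n = x.shape[0]
        a = u_centered(cdist(x, x))
        b = u_centered(cdist(y, y))
        return (a * b).sum() / (n * (n - 3))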

There are algorithms that can compute \(\mathcal{V}_n^2(x, y)\) and \(\Omega_n(x, y)\) for univariate random variables with \(O(n\log n)\) complexity [CCH19, CHS16]. Since the estimator formulas given above have complexity \(O(n^2)\), this improvement is significant, especially for larger samples.

Partial distance covariance and partial distance correlation

Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in which the effect of a random vector is removed [CSR14]. The population partial distance covariance \(\mathcal{V}^{*}(X, Y; Z)\), or \(\text{pdCov}^{*}(X, Y; Z)\), between two random vectors \(X\) and \(Y\) with respect to a random vector \(Z\) is

\[\begin{split}\mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \frac{\mathcal{V}^2(X, Z)\mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0 \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0 \end{cases}\end{split}\]

where \(\mathcal{V}^2({}\cdot{}, {}\cdot{})\) is the squared distance covariance.

The corresponding partial distance correlation \(\mathcal{R}^{*}(X, Y; Z)\), or \(\text{pdCor}^{*}(X, Y; Z)\), is

\[\begin{split}\mathcal{R}^{*}(X, Y; Z) = \begin{cases} \frac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z)\mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)}\sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1 \\ 0 & \text{if } \mathcal{R}^4(X, Z) = 1 \text{ or } \mathcal{R}^4(Y, Z) = 1 \end{cases}\end{split}\]

where \(\mathcal{R}({}\cdot{}, {}\cdot{})\) is the distance correlation.

Estimators

As in distance covariance and distance correlation, the \(U\)-centered distance matrices \(\widetilde{A}_{i, j}\), \(\widetilde{B}_{i, j}\) and \(\widetilde{C}_{i, j}\) corresponding to the samples \(x\), \(y\) and \(z\), taken from the random vectors \(X\), \(Y\) and \(Z\), can be computed using (1).

The set of all \(U\)-centered distance matrices is a Hilbert space with the inner product

\[\langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]

Then, the projection of a sample \(x\) onto \(z\) can be taken in this Hilbert space using the associated matrices, as

\[P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

The complementary projection is then

\[P_{z^{\perp}}(x) = \widetilde{A} - P_z(x) = \widetilde{A} - \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

We can now define the sample partial distance covariance as

\[\mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle.\]

The sample partial distance correlation is defined as the cosine of the angle between the vectors \(P_{z^{\perp}}(x)\) and \(P_{z^{\perp}}(y)\)

\[\begin{split}\mathcal{R}_n^{*}(x, y; z) = \begin{cases} \frac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)||} & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| \neq 0 \\ 0 & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| = 0 \end{cases}\end{split}\]
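Reusing u_centered and the imports from the previous sketches, the sample partial distance covariance and correlation could be computed as follows (the names are again illustrative; the zero-denominator conventions mirror the definitions above):

    def u_inner(a, b):
        # Inner product in the Hilbert space of U-centered matrices.
        n = a.shape[0]
        return (a * b).sum() / (n * (n - 3))


    def complementary_projection(a, c):
        # P_{z^perp}: remove from a its component along c; if c is the
        # zero matrix, a is returned unchanged.
        c_sqr = u_inner(c, c)
        if c_sqr == 0:
            return a
        return a - (u_inner(a, c) / c_sqr) * c


    def partial_distance_covariance(x, y, z):
        # Sample partial distance covariance V*_n(x, y; z).
        c = u_centered(cdist(z, z))
        proj_a = complementary_projection(u_centered(cdist(x, x)), c)
        proj_b = complementary_projection(u_centered(cdist(y, y)), c)
        return u_inner(proj_a, proj_b)


    def partial_distance_correlation(x, y, z):
        # Sample partial distance correlation R*_n(x, y; z): the cosine
        # of the angle between the complementary projections.
        c = u_centered(cdist(z, z))
        proj_a = complementary_projection(u_centered(cdist(x, x)), c)
        proj_b = complementary_projection(u_centered(cdist(y, y)), c)
        norm = np.sqrt(u_inner(proj_a, proj_a) * u_inner(proj_b, proj_b))
        return u_inner(proj_a, proj_b) / norm if norm != 0 else 0.0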

Energy distance

Energy distance is a statistical distance between random vectors \(X, Y \in \mathbb{R}^d\) [CSR13], defined as

\[\mathcal{E}(X, Y) = 2\mathbb{E}(|| X - Y ||) - \mathbb{E}(|| X - X' ||) - \mathbb{E}(|| Y - Y' ||)\]

where \(X'\) and \(Y'\) are independent and identically distributed copies of \(X\) and \(Y\), respectively.

It can be proved that, if the characteristic functions of \(X\) and \(Y\) are \(\phi_X(t)\) and \(\phi_Y(t)\), the energy distance can be alternatively written as

\[\mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{||t||^{d+1}}dt\]

where again \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\).

Estimator

Suppose that we have \(n_1\) observations of \(X\) and \(n_2\) observations of \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of \(y\). Then, an estimator of the energy distance is

\[\mathcal{E}_{n_1, n_2}(x, y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|| x_i - y_j || - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}|| x_i - x_j || - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2}|| y_i - y_j ||.\]
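This estimator translates directly into a short, self-contained NumPy sketch (the name energy_distance is illustrative; the samples are arrays of shape \((n_1, d)\) and \((n_2, d)\)):

    from scipy.spatial.distance import cdist


    def energy_distance(x, y):
        # E_{n1,n2}(x, y): each cdist(...).mean() already divides the sum
        # of pairwise distances by the product of the two sample sizes.
        return (2 * cdist(x, y).mean()
                - cdist(x, x).mean()
                - cdist(y, y).mean())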

References

CCH19

Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, July 2019. doi:10.1016/j.csda.2019.01.016.

CHS16

Xiaoming Huo and Gábor J. Székely. Fast computing for distance covariance. Technometrics, 58(4):435–447, 2016. doi:10.1080/00401706.2015.1054435.

CSR13

Gábor J. Székely and Maria L. Rizzo. Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013. doi:10.1016/j.jspi.2013.03.018.

CSR14

Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. doi:10.1214/14-AOS1255.

CSRB07

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, December 2007. doi:10.1214/009053607000000505.