Theory

This section explains the distance measures provided by this package (distance covariance and distance correlation, among others). The package can be used without a deep understanding of the mathematics involved, so feel free to skip this chapter.

Distance covariance and distance correlation

Distance covariance and distance correlation are recently introduced dependency measures between random vectors [CSRB07]. Let \(X\) and \(Y\) be two random vectors with finite first moments, and let \(\phi_X\) and \(\phi_Y\) be the respective characteristic functions

\[\begin{split}\phi_X(t) &= \mathbb{E}[e^{itX}] \\ \phi_Y(t) &= \mathbb{E}[e^{itY}]\end{split}\]

Let \(\phi_{X, Y}\) be the joint characteristic function. Then, if \(X\) and \(Y\) take values in \(\mathbb{R}^p\) and \(\mathbb{R}^q\) respectively, the distance covariance between them \(\mathcal{V}(X, Y)\), or \(\text{dCov}(X, Y)\), is the non-negative number defined by

\[\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}}|\phi_{X, Y}(t, s) - \phi_X(t)\phi_Y(s)|^2w(t,s)dt ds,\]

where \(w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}\), \(|{}\cdot{}|_d\) is the Euclidean norm in \(\mathbb{R}^d\) and \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\). The distance correlation \(\mathcal{R}(X, Y)\), or \(\text{dCor}(X, Y)\), is defined as

\[\begin{split}\mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y)}} &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) > 0$} \\ 0 &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) = 0$.} \end{cases}\end{split}\]

Properties

The distance covariance has the following properties:

  • \(\mathcal{V}(X, Y) \geq 0\).

  • \(\mathcal{V}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • \(\mathcal{V}(X, Y) = \mathcal{V}(Y, X)\).

  • \(\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(X, Y)\) for all constant real-valued vectors \(\mathbf{a}_1, \mathbf{a}_2\), scalars \(b_1, b_2\) and orthonormal matrices \(\mathbf{C}_1, \mathbf{C}_2\).

  • If the random vectors \((X_1, Y_1)\) and \((X_2, Y_2)\) are independent then

\[\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).\]

The distance correlation has the following properties:

  • \(0 \leq \mathcal{R}(X, Y) \leq 1\).

  • \(\mathcal{R}(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent.

  • If \(\mathcal{R}(X, Y) = 1\) then there exists a vector \(\mathbf{a}\), a nonzero real number \(b\) and an orthogonal matrix \(\mathbf{C}\) such that \(Y = \mathbf{a} + b\mathbf{C}X\).

Estimators

Distance covariance has an estimator with a simple form. Suppose that we have \(n\) observations of \(X\) and \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and as \(y_i\) the \(i\)-th observation of \(y\). If we define \(a_{ij} = | x_i - x_j |_p\) and \(b_{ij} = | y_i - y_j |_q\), the corresponding double-centered matrices \((A_{i, j})_{i,j=1}^n\) and \((B_{i, j})_{i,j=1}^n\) are defined as

\[\begin{split}A_{i, j} &= a_{i,j} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n a_{kl}, \\ B_{i, j} &= b_{i,j} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n b_{kl}.\end{split}\]

Then

\[\mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{i, j} B_{i, j}\]

is called the squared sample distance covariance, and it is an estimator of \(\mathcal{V}^2(X, Y)\). The sample distance correlation \(\mathcal{R}_n(x, y)\) can be obtained as the standardized sample distance covariance

\[\begin{split}\mathcal{R}_n^2(x, y) = \begin{cases} \frac{\mathcal{V}_n^2(x, y)}{\sqrt{\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y)}}, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) > 0$}, \\ 0, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) = 0$.} \end{cases}\end{split}\]
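As a concrete illustration, here is a minimal NumPy sketch of these estimators. The function names are illustrative, not the API of this package, and the samples are assumed to be arrays of shape \((n, p)\) and \((n, q)\):

    import numpy as np
    from scipy.spatial.distance import cdist


    def double_centered(d):
        # Subtract row means and column means, then add the grand mean,
        # as in the definition of A and B above.
        return (d - d.mean(axis=0, keepdims=True)
                  - d.mean(axis=1, keepdims=True)
                  + d.mean())


    def distance_covariance_sqr(x, y):
        # V_n^2(x, y): mean of the elementwise product of the
        # double-centered distance matrices.
        a = double_centered(cdist(x, x))  # a_ij = |x_i - x_j|_p
        b = double_centered(cdist(y, y))  # b_ij = |y_i - y_j|_q
        return (a * b).mean()


    def distance_correlation_sqr(x, y):
        # R_n^2(x, y): standardized squared sample distance covariance.
        denom_sqr = distance_covariance_sqr(x, x) * distance_covariance_sqr(y, y)
        if denom_sqr == 0:
            return 0.0
        return distance_covariance_sqr(x, y) / np.sqrt(denom_sqr)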

These estimators have the following properties:

  • \(\mathcal{V}_n^2(x, y) \geq 0\).

  • \(0 \leq \mathcal{R}_n^2(x, y) \leq 1\).
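For example, the sample distance correlation detects a nonlinear dependence such as \(y = x^2\), for which the Pearson correlation is close to zero. Using the sketch above:

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(1000, 1))
    y = x ** 2  # dependent on x, but linearly uncorrelated

    print(np.corrcoef(x.ravel(), y.ravel())[0, 1])  # approximately 0
    print(distance_correlation_sqr(x, y))           # clearly positive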

In a similar way, one can define an unbiased estimator \(\Omega_n(x, y)\) of the squared distance covariance \(\mathcal{V}^2(X, Y)\). Given the previous definitions of \(a_{ij}\) and \(b_{ij}\), we define the \(U\)-centered matrices \((\widetilde{A}_{i, j})_{i,j=1}^n\) and \((\widetilde{B}_{i, j})_{i,j=1}^n\)

(1)\[\begin{split}\widetilde{A}_{i, j} &= \begin{cases} a_{i,j} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n a_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j, \end{cases} \\ \widetilde{B}_{i, j} &= \begin{cases} b_{i,j} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n b_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j. \end{cases}\end{split}\]

Then, \(\Omega_n(x, y)\) is defined as

\[\Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]

We can also obtain an estimator of \(\mathcal{R}^2(X, Y)\) using \(\Omega_n(x, y)\), as we did with \(\mathcal{V}_n^2(x, y)\). Note that \(\Omega_n(x, y)\) does not satisfy \(\Omega_n(x, y) \geq 0\), since it can take small negative values near \(0\).
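Continuing the sketch above, the \(U\)-centering in (1) and the unbiased estimator can be written as follows (names illustrative; at least \(n = 4\) observations are needed):

    def u_centered(d):
        # U-centering as in (1); the diagonal is set to zero.
        n = d.shape[0]
        out = (d - d.sum(axis=0, keepdims=True) / (n - 2)
                 - d.sum(axis=1, keepdims=True) / (n - 2)
                 + d.sum() / ((n - 1) * (n - 2)))
        np.fill_diagonal(out, 0)
        return out


    def u_distance_covariance_sqr(x, y):
        # Omega_n(x, y), the unbiased estimator of V^2(X, Y).
        n = x.shape[0]
        a = u_centered(cdist(x, x))
        b = u_centered(cdist(y, y))
        return (a * b).sum() / (n * (n - 3))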

There are algorithms that can compute \(\mathcal{V}_n^2(x, y)\) and \(\Omega_n(x, y)\) for univariate random variables with \(O(n\log n)\) complexity [CCH19, CHS16]. Since the estimator formulas given above have complexity \(O(n^2)\), this improvement is significant, especially for larger samples.

Partial distance covariance and partial distance correlation

Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in which the effect of a random vector is removed [CSR14]. The population partial distance covariance \(\mathcal{V}^{*}(X, Y; Z)\), or \(\text{pdCov}^{*}(X, Y; Z)\), between two random vectors \(X\) and \(Y\) with respect to a random vector \(Z\) is

\[\begin{split}\mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \frac{\mathcal{V}^2(X, Z)\mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0 \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0 \end{cases}\end{split}\]

where \(\mathcal{V}^2({}\cdot{}, {}\cdot{})\) is the squared distance covariance.

The corresponding partial distance correlation \(\mathcal{R}^{*}(X, Y; Z)\), or \(\text{pdCor}^{*}(X, Y; Z)\), is

\[\begin{split}\mathcal{R}^{*}(X, Y; Z) = \begin{cases} \frac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z)\mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)}\sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1 \\ 0 & \text{if } \mathcal{R}^4(X, Z) = 1 \text{ or } \mathcal{R}^4(Y, Z) = 1 \end{cases}\end{split}\]

where \(\mathcal{R}({}\cdot{}, {}\cdot{})\) is the distance correlation.

Estimators

As in distance covariance and distance correlation, the \(U\)-centered distance matrices \(\widetilde{A}_{i, j}\), \(\widetilde{B}_{i, j}\) and \(\widetilde{C}_{i, j}\) corresponding to the samples \(x\), \(y\) and \(z\), taken from the random vectors \(X\), \(Y\) and \(Z\), can be computed using (1).

The set of all \(U\)-centered distance matrices is a Hilbert space with the inner product

\[\langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.\]

Then, the projection of a sample \(x\) onto \(z\) can be taken in this Hilbert space using the associated matrices, as

\[P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

The complementary projection is then

\[P_{z^{\perp}}(x) = \widetilde{A} - P_z(x) = \widetilde{A} - \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.\]

We can now define the sample partial distance covariance as

\[\mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle.\]

The sample partial distance correlation is defined as the cosine of the angle between the vectors \(P_{z^{\perp}}(x)\) and \(P_{z^{\perp}}(y)\)

\[\begin{split}\mathcal{R}_n^{*}(x, y; z) = \begin{cases} \frac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)||} & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| \neq 0 \\ 0 & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| = 0 \end{cases}\end{split}\]
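Reusing u_centered and the imports from the previous sketches, the sample partial distance covariance and correlation could be computed as follows (the names are again illustrative; the zero-denominator conventions mirror the definitions above):

    def u_inner(a, b):
        # Inner product in the Hilbert space of U-centered matrices.
        n = a.shape[0]
        return (a * b).sum() / (n * (n - 3))


    def complementary_projection(a, c):
        # P_{z^perp}: remove from a its component along c; if c is the
        # zero matrix, a is returned unchanged.
        c_sqr = u_inner(c, c)
        if c_sqr == 0:
            return a
        return a - (u_inner(a, c) / c_sqr) * c


    def partial_distance_covariance(x, y, z):
        # Sample partial distance covariance V*_n(x, y; z).
        c = u_centered(cdist(z, z))
        proj_a = complementary_projection(u_centered(cdist(x, x)), c)
        proj_b = complementary_projection(u_centered(cdist(y, y)), c)
        return u_inner(proj_a, proj_b)


    def partial_distance_correlation(x, y, z):
        # Sample partial distance correlation R*_n(x, y; z): the cosine
        # of the angle between the complementary projections.
        c = u_centered(cdist(z, z))
        proj_a = complementary_projection(u_centered(cdist(x, x)), c)
        proj_b = complementary_projection(u_centered(cdist(y, y)), c)
        norm = np.sqrt(u_inner(proj_a, proj_a) * u_inner(proj_b, proj_b))
        return u_inner(proj_a, proj_b) / norm if norm != 0 else 0.0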

Energy distance

Energy distance is a statistical distance between random vectors \(X, Y \in \mathbb{R}^d\) [CSR13], defined as

\[\mathcal{E}(X, Y) = 2\mathbb{E}(|| X - Y ||) - \mathbb{E}(|| X - X' ||) - \mathbb{E}(|| Y - Y' ||)\]

where \(X'\) and \(Y'\) are independent and identically distributed copies of \(X\) and \(Y\), respectively.

It can be proved that, if the characteristic functions of \(X\) and \(Y\) are \(\phi_X(t)\) and \(\phi_Y(t)\), the energy distance can be alternatively written as

\[\mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{||t||^{d+1}}dt\]

where again \(c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}\) is half the surface area of the unit sphere in \(\mathbb{R}^d\).

Estimator

Suppose that we have \(n_1\) observations of \(X\) and \(n_2\) observations of \(Y\), denoted by \(x\) and \(y\). We denote as \(x_i\) the \(i\)-th observation of \(x\), and \(y_i\) the \(i\)-th observation of \(y\). Then, an estimator of the energy distance is

\[\mathcal{E}_{n_1, n_2}(x, y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|| x_i - y_j || - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}|| x_i - x_j || - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2}|| y_i - y_j ||.\]
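This estimator translates directly into a short, self-contained NumPy sketch (the name energy_distance is illustrative; the samples are arrays of shape \((n_1, d)\) and \((n_2, d)\)):

    from scipy.spatial.distance import cdist


    def energy_distance(x, y):
        # E_{n1,n2}(x, y): each cdist(...).mean() already divides the sum
        # of pairwise distances by the product of the two sample sizes.
        return (2 * cdist(x, y).mean()
                - cdist(x, x).mean()
                - cdist(y, y).mean())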

References

CCH19

Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, July 2019. doi:10.1016/j.csda.2019.01.016.

CHS16

Xiaoming Huo and Gábor J. Székely. Fast computing for distance covariance. Technometrics, 58(4):435–447, 2016. doi:10.1080/00401706.2015.1054435.

CSR13

Gábor J. Székely and Maria L. Rizzo. Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013. doi:10.1016/j.jspi.2013.03.018.

CSR14

Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. doi:10.1214/14-AOS1255.

CSRB07

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, December 2007. doi:10.1214/009053607000000505.