# Theory

This section provides an explanation of the distance measures provided by this package (distance covariance and distance correlation). The package can be used without a deep understanding of the mathematics involved, so feel free to skip this chapter.

## Distance covariance and distance correlation

Distance covariance and distance correlation are recently introduced dependency measures between random vectors [CSRB07]. Let $$X$$ and $$Y$$ be two random vectors with finite first moments, and let $$\phi_X$$ and $$\phi_Y$$ be the respective characteristic functions

$\begin{split}\phi_X(t) &= \mathbb{E}[e^{itX}] \\ \phi_Y(t) &= \mathbb{E}[e^{itY}]\end{split}$

Let $$\phi_{X, Y}$$ be the joint characteristic function. Then, if $$X$$ and $$Y$$ take values in $$\mathbb{R}^p$$ and $$\mathbb{R}^q$$ respectively, the distance covariance between them $$\mathcal{V}(X, Y)$$, or $$\text{dCov}(X, Y)$$, is the non-negative number defined by

$\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}}|\phi_{X, Y}(t, s) - \phi_X(t)\phi_Y(s)|^2w(t,s)dt ds,$

where $$w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}$$, $$|{}\cdot{}|_d$$ is the Euclidean norm in $$\mathbb{R}^d$$ and $$c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}$$ is half the surface area of the unit sphere in $$\mathbb{R}^d$$. The distance correlation $$\mathcal{R}(X, Y)$$, or $$\text{dCor}(X, Y)$$, is defined as

$\begin{split}\mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y)}} &\text{if } \mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) > 0, \\ 0 &\text{if } \mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) = 0. \end{cases}\end{split}$

### Properties

The distance covariance has the following properties:

• $$\mathcal{V}(X, Y) \geq 0$$.

• $$\mathcal{V}(X, Y) = 0$$ if and only if $$X$$ and $$Y$$ are independent.

• $$\mathcal{V}(X, Y) = \mathcal{V}(Y, X)$$.

• $$\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(X, Y)$$ for all constant real-valued vectors $$\mathbf{a}_1, \mathbf{a}_2$$, scalars $$b_1, b_2$$ and orthonormal matrices $$\mathbf{C}_1, \mathbf{C}_2$$.

• If the random vectors $$(X_1, Y_1)$$ and $$(X_2, Y_2)$$ are independent then

$\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).$

The distance correlation has the following properties:

• $$0 \leq \mathcal{R}(X, Y) \leq 1$$.

• $$\mathcal{R}(X, Y) = 0$$ if and only if $$X$$ and $$Y$$ are independent.

• If $$\mathcal{R}(X, Y) = 1$$ then there exists a vector $$\mathbf{a}$$, a nonzero real number $$b$$ and an orthogonal matrix $$\mathbf{C}$$ such that $$Y = \mathbf{a} + b\mathbf{C}X$$.

### Estimators

Distance covariance has an estimator with a simple form. Suppose that we have $$n$$ observations of $$X$$ and $$Y$$, denoted by $$x$$ and $$y$$. We denote by $$x_i$$ the $$i$$-th observation of $$x$$, and by $$y_i$$ the $$i$$-th observation of $$y$$. If we define $$a_{ij} = | x_i - x_j |_p$$ and $$b_{ij} = | y_i - y_j |_q$$, the corresponding double centered matrices (double_centered()) $$(A_{i, j})_{i,j=1}^n$$ and $$(B_{i, j})_{i,j=1}^n$$ are defined as

$\begin{split}A_{i, j} &= a_{i,j} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n a_{kl}, \\ B_{i, j} &= b_{i,j} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n b_{kl}.\end{split}$
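As a concrete sketch, the double centering above fits in a few lines of NumPy. This standalone version is only illustrative (the package provides its own double_centered(), whose name it mirrors); it assumes a square matrix of pairwise distances as input.

```python
import numpy as np

def double_centered(a):
    """Subtract row means and column means, then add back the grand mean,
    as in the definition of A_{i,j} above."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

# Example: pairwise distances between three scalar observations
x = np.array([1.0, 2.0, 4.0])
a = np.abs(x[:, None] - x[None, :])  # a_{ij} = |x_i - x_j|
A = double_centered(a)
# Every row and column of a double-centered matrix sums to zero
print(np.allclose(A.sum(axis=0), 0.0))  # True
print(np.allclose(A.sum(axis=1), 0.0))  # True
```

The zero row and column sums follow directly from the definition: subtracting the row and column means removes the grand mean twice, and the last term adds it back once.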

Then

$\mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{i, j} B_{i, j}$

is called the squared sample distance covariance (distance_covariance_sqr()), and it is an estimator of $$\mathcal{V}^2(X, Y)$$. Its square root (distance_covariance()) is thus an estimator of the distance covariance. The sample distance correlation $$\mathcal{R}_n(x, y)$$ (distance_correlation()) can be obtained as the standardized sample distance covariance

$\begin{split}\mathcal{R}_n^2(x, y) = \begin{cases} \frac{\mathcal{V}_n^2(x, y)}{\sqrt{\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y)}}, &\text{if } \mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) > 0, \\ 0, &\text{if } \mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) = 0. \end{cases}\end{split}$
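Putting the pieces together, a minimal NumPy sketch of these estimators for one-dimensional samples could look as follows. The names mirror the package functions referenced above, but the package's own implementations are more general (multivariate input) and optimized; this is only for illustration.

```python
import numpy as np

def double_centered(a):
    """Double center a square matrix of pairwise distances."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def distance_covariance_sqr(x, y):
    """Squared sample distance covariance V_n^2(x, y) for 1-d samples."""
    A = double_centered(np.abs(np.subtract.outer(x, x)))
    B = double_centered(np.abs(np.subtract.outer(y, y)))
    return (A * B).mean()  # (1 / n^2) * sum_{i,j} A_{ij} B_{ij}

def distance_correlation_sqr(x, y):
    """Squared sample distance correlation R_n^2(x, y)."""
    denom_sqr = distance_covariance_sqr(x, x) * distance_covariance_sqr(y, y)
    if denom_sqr == 0:
        return 0.0
    return distance_covariance_sqr(x, y) / np.sqrt(denom_sqr)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x**2  # dependent on x, yet with (near) zero Pearson correlation
r2 = distance_correlation_sqr(x, y)
print(0.0 <= r2 <= 1.0)  # True
print(r2 > 0.0)          # True: the nonlinear dependence is detected
```

Note that a Pearson correlation test would see almost nothing here, while the sample distance correlation is clearly positive.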

These estimators have the following properties:

• $$\mathcal{V}_n^2(x, y) \geq 0$$

• $$0 \leq \mathcal{R}_n^2(x, y) \leq 1$$

In a similar way one can define an unbiased estimator $$\Omega_n(x, y)$$ (u_distance_covariance_sqr()) of the squared distance covariance $$\mathcal{V}^2(X, Y)$$. Given the previous definitions of $$a_{ij}$$ and $$b_{ij}$$, we define the $$U$$-centered matrices (u_centered()) $$(\widetilde{A}_{i, j})_{i,j=1}^n$$ and $$(\widetilde{B}_{i, j})_{i,j=1}^n$$

(1)$\begin{split}\widetilde{A}_{i, j} &= \begin{cases} a_{i,j} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n a_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j, \end{cases} \\ \widetilde{B}_{i, j} &= \begin{cases} b_{i,j} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n b_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j. \end{cases}\end{split}$

Then, $$\Omega_n(x, y)$$ is defined as

$\Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.$
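Under the same assumptions as the previous sketch (one-dimensional samples, NumPy), the $$U$$-centering (1) and the estimator $$\Omega_n$$ can be sketched as:

```python
import numpy as np

def u_centered(a):
    """U-center a square matrix of pairwise distances, following (1)."""
    n = a.shape[0]
    out = (a
           - a.sum(axis=1, keepdims=True) / (n - 2)
           - a.sum(axis=0, keepdims=True) / (n - 2)
           + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(out, 0.0)  # the diagonal is defined to be zero
    return out

def u_distance_covariance_sqr(x, y):
    """Unbiased estimator Omega_n(x, y) of the squared distance covariance."""
    n = len(x)
    A = u_centered(np.abs(np.subtract.outer(x, x)))
    B = u_centered(np.abs(np.subtract.outer(y, y)))
    return (A * B).sum() / (n * (n - 3))

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)  # independent of x
omega = u_distance_covariance_sqr(x, y)
# For independent samples, omega is close to 0 and may be slightly negative
print(omega)
```

Since the diagonal terms are zero, summing over all $$i, j$$ is the same as summing over $$i \neq j$$, which keeps the code close to the formula above.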

We can also obtain an estimator of $$\mathcal{R}^2(X, Y)$$ (u_distance_correlation_sqr()) using $$\Omega_n(x, y)$$, as we did with $$\mathcal{V}_n^2(x, y)$$. Note that $$\Omega_n(x, y)$$ is not necessarily non-negative: since it is unbiased, it can take small negative values when the true squared distance covariance is close to $$0$$.

There are algorithms that can compute $$\mathcal{V}_n^2(x, y)$$ and $$\Omega_n(x, y)$$ for univariate random variables with $$O(n\log n)$$ complexity [CCH19, CHS16]. Since the estimator formulas explained above have complexity $$O(n^2)$$, this improvement is significant, especially for larger samples.

## Partial distance covariance and partial distance correlation

Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in which the effect of a third random vector is removed [CSR14]. The population partial distance covariance $$\mathcal{V}^{*}(X, Y; Z)$$, or $$\text{pdCov}^{*}(X, Y; Z)$$, between two random vectors $$X$$ and $$Y$$ with respect to a random vector $$Z$$ is

$\begin{split}\mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \frac{\mathcal{V}^2(X, Z)\mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0 \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0 \end{cases}\end{split}$

where $$\mathcal{V}^2({}\cdot{}, {}\cdot{})$$ is the squared distance covariance.

The corresponding partial distance correlation $$\mathcal{R}^{*}(X, Y; Z)$$, or $$\text{pdCor}^{*}(X, Y; Z)$$, is

$\begin{split}\mathcal{R}^{*}(X, Y; Z) = \begin{cases} \frac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z)\mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)}\sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1 \\ 0 & \text{if } \mathcal{R}^4(X, Z) = 1 \text{ or } \mathcal{R}^4(Y, Z) = 1 \end{cases}\end{split}$

where $$\mathcal{R}({}\cdot{}, {}\cdot{})$$ is the distance correlation.

### Estimators

As in distance covariance and distance correlation, the $$U$$-centered distance matrices $$\widetilde{A}_{i, j}$$, $$\widetilde{B}_{i, j}$$ and $$\widetilde{C}_{i, j}$$ corresponding to the samples $$x$$, $$y$$ and $$z$$ taken from the random vectors $$X$$, $$Y$$ and $$Z$$ can be computed using (1).

The set of all $$U$$-centered distance matrices is a Hilbert space with the inner product (u_product())

$\langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}.$

Then, the projection of a sample $$x$$ over $$z$$ (u_projection()) can be taken in this Hilbert space using the associated matrices, as

$P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.$

The complementary projection (u_complementary_projection()) is then

$P_{z^{\perp}}(x) = \widetilde{A} - P_z(x) = \widetilde{A} - \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}.$

We can now define the sample partial distance covariance (partial_distance_covariance()) as

$\mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle$

The sample partial distance correlation (partial_distance_correlation()) is defined as the cosine of the angle between the vectors $$P_{z^{\perp}}(x)$$ and $$P_{z^{\perp}}(y)$$

$\begin{split}\mathcal{R}_n^{*}(x, y; z) = \begin{cases} \frac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)||} & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| \neq 0 \\ 0 & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| = 0 \end{cases}\end{split}$
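The projection machinery above translates almost directly to NumPy. The sketch below handles one-dimensional samples and mirrors the package function names referenced in this section, but it is only illustrative:

```python
import numpy as np

def u_centered(a):
    """U-center a square matrix of pairwise distances."""
    n = a.shape[0]
    out = (a
           - a.sum(axis=1, keepdims=True) / (n - 2)
           - a.sum(axis=0, keepdims=True) / (n - 2)
           + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(out, 0.0)
    return out

def u_product(A, B):
    """Inner product in the Hilbert space of U-centered matrices."""
    n = A.shape[0]
    return (A * B).sum() / (n * (n - 3))

def u_complementary_projection(A, C):
    """Project A onto the orthogonal complement of C."""
    cc = u_product(C, C)
    if cc == 0:
        return A.copy()
    return A - (u_product(A, C) / cc) * C

def partial_distance_covariance(x, y, z):
    """Sample partial distance covariance V_n^*(x, y; z)."""
    A = u_centered(np.abs(np.subtract.outer(x, x)))
    B = u_centered(np.abs(np.subtract.outer(y, y)))
    C = u_centered(np.abs(np.subtract.outer(z, z)))
    return u_product(u_complementary_projection(A, C),
                     u_complementary_projection(B, C))

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
# Removing the effect of z = x leaves nothing of x, so the result is 0
print(abs(partial_distance_covariance(x, y, x)) < 1e-10)  # True
```

The final example is a sanity check: when $$z = x$$, the complementary projection of $$\widetilde{A}$$ is the zero matrix, so the partial distance covariance vanishes.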

## Energy distance

Energy distance is a statistical distance between random vectors $$X, Y \in \mathbb{R}^d$$ [CSR13], defined as

$\mathcal{E}(X, Y) = 2\mathbb{E}(|| X - Y ||) - \mathbb{E}(|| X - X' ||) - \mathbb{E}(|| Y - Y' ||)$

where $$X'$$ and $$Y'$$ are independent and identically distributed copies of $$X$$ and $$Y$$, respectively.

It can be proved that, if the characteristic functions of $$X$$ and $$Y$$ are $$\phi_X(t)$$ and $$\phi_Y(t)$$, then the energy distance can alternatively be written as

$\mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{||t||^{d+1}}dt$

where again $$c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}$$ is half the surface area of the unit sphere in $$\mathbb{R}^d$$.

### Estimator

Suppose that we have $$n_1$$ observations of $$X$$ and $$n_2$$ observations of $$Y$$, denoted by $$x$$ and $$y$$. We denote as $$x_i$$ the $$i$$-th observation of $$x$$, and $$y_i$$ the $$i$$-th observation of $$y$$. Then, an estimator of the energy distance (energy_distance()) is

$\mathcal{E}_{n_1, n_2}(x, y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|| x_i - y_j || - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}|| x_i - x_j || - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2}|| y_i - y_j ||$
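As an illustrative NumPy sketch of this estimator for one-dimensional samples (the package's energy_distance() handles the general multivariate case):

```python
import numpy as np

def energy_distance(x, y):
    """Sample energy distance E_{n1,n2}(x, y) between two 1-d samples."""
    between = np.abs(np.subtract.outer(x, y)).mean()   # mean of |x_i - y_j|
    within_x = np.abs(np.subtract.outer(x, x)).mean()  # mean of |x_i - x_j|
    within_y = np.abs(np.subtract.outer(y, y)).mean()  # mean of |y_i - y_j|
    return 2.0 * between - within_x - within_y

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(loc=3.0, size=100)  # sample from a shifted distribution

print(energy_distance(x, x) == 0.0)  # True: a sample against itself
print(energy_distance(x, y) > 0.0)   # True: the shift is detected
```

Note how `.mean()` over the full distance matrix absorbs the $$\frac{1}{n_1 n_2}$$, $$\frac{1}{n_1^2}$$ and $$\frac{1}{n_2^2}$$ factors of the formula above.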