Theory ====== This section provides an explanation of the distance measures provided by this package (distance covariance and distance correlation). The package can be used without a deep understanding of the mathematics involved, so feel free to skip this chapter. Distance covariance and distance correlation -------------------------------------------- Distance covariance and distance correlation are recently introduced dependency measures between random vectors :cite:`c-distance_correlation`. Let :math:`X` and :math:`Y` be two random vectors with finite first moments, and let :math:`\phi_X` and :math:`\phi_Y` be the respective characteristic functions .. math:: \phi_X(t) &= \mathbb{E}[e^{itX}] \\ \phi_Y(t) &= \mathbb{E}[e^{itY}] Let :math:`\phi_{X, Y}` be the joint characteristic function. Then, if :math:`X` and :math:`Y` take values in :math:`\mathbb{R}^p` and :math:`\mathbb{R}^q` respectively, the distance covariance between them :math:`\mathcal{V}(X, Y)`, or :math:`\text{dCov}(X, Y)`, is the non-negative number defined by .. math:: \mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}}|\phi_{X, Y}(t, s) - \phi_X(t)\phi_Y(s)|^2w(t,s)dt ds, where :math:`w(t, s) = (c_p c_q |t|_p^{1+p}|s|_q^{1+q})^{-1}`, :math:`|{}\cdot{}|_d` is the euclidean norm in :math:`\mathbb{R}^d` and :math:`c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}` is half the surface area of the unit sphere in :math:`\mathbb{R}^d`. The distance correlation :math:`\mathcal{R}(X, Y)`, or :math:`\text{dCor}(X, Y)`, is defined as .. math:: \mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y)}} &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) > 0$} \\ 0 &\text{ if $\mathcal{V}^2(X, X)\mathcal{V}^2(Y, Y) = 0$.} \end{cases} Properties ^^^^^^^^^^ The distance covariance has the following properties: * :math:`\mathcal{V}(X, Y) \geq 0`. * :math:`\mathcal{V}(X, Y) = 0` if and only if :math:`X` and :math:`Y` are independent. * :math:`\mathcal{V}(X, Y) = \mathcal{V}(Y, X)`. * :math:`\mathcal{V}^2(\mathbf{a}_1 + b_1 \mathbf{C}_1 X, \mathbf{a}_2 + b_2 \mathbf{C}_2 Y) = |b_1 b_2| \mathcal{V}^2(Y, X)` for all constant real-valued vectors :math:`\mathbf{a}_1, \mathbf{a}_2`, scalars :math:`b_1, b_2` and orthonormal matrices :math:`\mathbf{C}_1, \mathbf{C}_2`. * If the random vectors :math:`(X_1, Y_1)` and :math:`(X_2, Y_2)` are independent then .. math:: \mathcal{V}(X_1 + X_2, Y_1 + Y_2) \leq \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2). The distance correlation has the following properties: * :math:`0 \leq \mathcal{R}(X, Y) \leq 1`. * :math:`\mathcal{R}(X, Y) = 0` if and only if :math:`X` and :math:`Y` are independent. * If :math:`\mathcal{R}(X, Y) = 1` then there exists a vector :math:`\mathbf{a}`, a nonzero real number :math:`b` and an orthogonal matrix :math:`\mathbf{C}` such that :math:`Y = \mathbf{a} + b\mathbf{C}X`. Estimators ^^^^^^^^^^ Distance covariance has an estimator with a simple form. Suppose that we have :math:`n` observations of :math:`X` and :math:`Y`, denoted by :math:`x` and :math:`y`. We denote as :math:`x_i` the :math:`i`-th observation of :math:`x`, and :math:`y_i` the :math:`i`-th observation of :math:`y`. If we define :math:`a_{ij} = | x_i - x_j |_p` and :math:`b_{ij} = | y_i - y_j |_q`, the corresponding double centered matrices (:func:`~dcor.double_centered`) are defined by :math:`(A_{i, j})_{i,j=1}^n` and :math:`(B_{i, j})_{i,j=1}^n` .. math:: A_{i, j} &= a_{i,j} - \frac{1}{n} \sum_{l=1}^n a_{il} - \frac{1}{n} \sum_{k=1}^n a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n a_{kl}, \\ B_{i, j} &= b_{i,j} - \frac{1}{n} \sum_{l=1}^n b_{il} - \frac{1}{n} \sum_{k=1}^n b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^n b_{kl}. Then .. math:: \mathcal{V}_n^2(x, y) = \frac{1}{n^2} \sum_{i,j=1}^n A_{i, j} B_{i, j} is called the squared sample distance covariance (:func:`~dcor.distance_covariance_sqr`), and it is an estimator of :math:`\mathcal{V}^2(X, Y)`. Its square root (:func:`~dcor.distance_covariance`) is thus an estimator of the distance covariance. The sample distance correlation :math:`\mathcal{R}_n(x, y)` (:func:`~dcor.distance_correlation`) can be obtained as the standardized sample covariance .. math:: \mathcal{R}_n^2(x, y) = \begin{cases} \frac{\mathcal{V}_n^2(x, y)}{\sqrt{\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y)}}, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) > 0$}, \\ 0, &\text{ if $\mathcal{V}_n^2(x, x)\mathcal{V}_n^2(y, y) = 0$.} \end{cases} These estimators have the following properties: * :math:`\mathcal{V}_n^2(x, y) \geq 0` * :math:`0 \leq \mathcal{R}_n^2(x, y) \leq 1` In a similar way one can define an unbiased estimator :math:`\Omega_n(x, y)` (:func:`~dcor.u_distance_covariance_sqr`) of the squared distance covariance :math:`\mathcal{V}^2(X, Y)`. Given the previous definitions of :math:`a_{ij}` and :math:`b_{ij}`, we define the :math:`U`-centered matrices (:func:`~dcor.u_centered`) :math:`(\widetilde{A}_{i, j})_{i,j=1}^n` and :math:`(\widetilde{B}_{i, j})_{i,j=1}^n` .. math:: :label: ucentering \widetilde{A}_{i, j} &= \begin{cases} a_{i,j} - \frac{1}{n-2} \sum_{l=1}^n a_{il} - \frac{1}{n-2} \sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n a_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j, \end{cases} \\ \widetilde{B}_{i, j} &= \begin{cases} b_{i,j} - \frac{1}{n-2} \sum_{l=1}^n b_{il} - \frac{1}{n-2} \sum_{k=1}^n b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^n b_{kl}, &\text{if } i \neq j, \\ 0, &\text{if } i = j. \end{cases} Then, :math:`\Omega_n(x, y)` is defined as .. math:: \Omega_n(x, y) = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}. We can also obtain an estimator of :math:`\mathcal{R}^2(X, Y)` (:func:`~dcor.u_distance_correlation_sqr`) using :math:`\Omega_n(x, y)`, as we did with :math:`\mathcal{V}_n^2(x, y)`. :math:`\Omega_n(x, y)` does not verify that :math:`\Omega_n(x, y) \geq 0`, because sometimes could take negative values near :math:`0`. There are algorithms that can compute :math:`\mathcal{V}_n^2(x, y)` and :math:`\Omega_n(x, y)` for random variables with :math:`O(n\log n)` complexity :cite:`c-fast_distance_correlation_avl,c-fast_distance_correlation_mergesort`. Since the estimator formulas explained above have complexity :math:`O(n^2)`, this improvement is significant, specially for larger samples. Partial distance covariance and partial distance correlation ------------------------------------------------------------ Partial distance covariance and partial distance correlation are dependency measures between random vectors, based on distance covariance and distance correlation, in with the effect of a random vector is removed :cite:`c-partial_distance_correlation`. The population partial distance covariance :math:`\mathcal{V}^{*}(X, Y; Z)`, or :math:`\text{pdCov}^{*}(X, Y; Z)`, between two random vectors :math:`X` and :math:`Y` with respect to a random vector :math:`Z` is .. math:: \mathcal{V}^{*}(X, Y; Z) = \begin{cases} \mathcal{V}^2(X, Y) - \frac{\mathcal{V}^2(X, Z)\mathcal{V}^2(Y, Z)}{\mathcal{V}^2(Z, Z)} & \text{if } \mathcal{V}^2(Z, Z) \neq 0 \\ \mathcal{V}^2(X, Y) & \text{if } \mathcal{V}^2(Z, Z) = 0 \end{cases} where :math:`\mathcal{V}^2({}\cdot{}, {}\cdot{})` is the squared distance covariance. The corresponding partial distance correlation :math:`\mathcal{R}^{*}(X, Y; Z)`, or :math:`\text{pdCor}^{*}(X, Y; Z)`, is .. math:: \mathcal{R}^{*}(X, Y; Z) = \begin{cases} \frac{\mathcal{R}^2(X, Y) - \mathcal{R}^2(X, Z)\mathcal{R}^2(Y, Z)}{\sqrt{1 - \mathcal{R}^4(X, Z)}\sqrt{1 - \mathcal{R}^4(Y, Z)}} & \text{if } \mathcal{R}^4(X, Z) \neq 1 \text{ and } \mathcal{R}^4(Y, Z) \neq 1 \\ 0 & \text{if } \mathcal{R}^4(X, Z) = 1 \text{ or } \mathcal{R}^4(Y, Z) = 1 \end{cases} where :math:`\mathcal{R}({}\cdot{}, {}\cdot{})` is the distance correlation. Estimators ^^^^^^^^^^ As in distance covariance and distance correlation, the :math:`U`-centered distance matrices :math:`\widetilde{A}_{i, j}`, :math:`\widetilde{B}_{i, j}` and :math:`\widetilde{C}_{i, j}` corresponding with the samples :math:`x`, :math:`y` and :math:`z` taken from the random vectors :math:`X`, :math:`Y` and :math:`Z` can be computed using using :eq:`ucentering`. The set of all :math:`U`-centered distance matrices is a Hilbert space with the inner product (:func:`~dcor.u_product`) .. math:: \langle \widetilde{A}, \widetilde{B} \rangle = \frac{1}{n(n-3)} \sum_{i,j=1}^n \widetilde{A}_{i, j} \widetilde{B}_{i, j}. Then, the projection of a sample :math:`x` over :math:`z` (:func:`~dcor.u_projection`) can be taken in this Hilbert space using the associated matrices, as .. math:: P_z(x) = \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}. The complementary projection (:func:`~dcor.u_complementary_projection`) is then .. math:: P_{z^{\perp}}(x) = \widetilde{A} - P_z(x) = \widetilde{A} - \frac{\langle \widetilde{A}, \widetilde{C} \rangle}{\langle \widetilde{C}, \widetilde{C} \rangle}\widetilde{C}. We can now define the sample partial distance covariance (:func:`~dcor.partial_distance_covariance`) as .. math:: \mathcal{V}_n^{*}(x, y; z) = \langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle The sample distance correlation (:func:`~dcor.partial_distance_correlation`) is defined as the cosine of the angle between the vectors :math:`P_{z^{\perp}}(x)` and :math:`P_{z^{\perp}}(y)` .. math:: \mathcal{R}_n^{*}(x, y; z) = \begin{cases} \frac{\langle P_{z^{\perp}}(x), P_{z^{\perp}}(y) \rangle}{||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)||} & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| \neq 0 \\ 0 & \text{if } ||P_{z^{\perp}}(x)|| ||P_{z^{\perp}}(y)|| = 0 \end{cases} Energy distance --------------- Energy distance is an statistical distance between random vectors :math:`X, Y \in \mathbb{R}^d` :cite:`c-energy_distance`, defined as .. math:: \mathcal{E}(X, Y) = 2\mathbb{E}(|| X - Y ||) - \mathbb{E}(|| X - X' ||) - \mathbb{E}(|| Y - Y' ||) where :math:`X'` and :math:`Y'` are independent and identically distributed copies of :math:`X` and :math:`Y`, respectively. It can be proved that, if the characteristic functions of :math:`X` and :math:`Y` are :math:`\phi_X(t)` and :math:`\phi_Y(t)` the energy distance can be alternatively written as .. math:: \mathcal{E}(X, Y) = \frac{1}{c_d} \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{||t||^{d+1}}dt where again :math:`c_d = \frac{\pi^{(1 + d)/2}}{\Gamma((1 + d)/2)}` is half the surface area of the unit sphere in :math:`\mathbb{R}^d`. Estimator ^^^^^^^^^ Suppose that we have :math:`n_1` observations of :math:`X` and :math:`n_2` observations of :math:`Y`, denoted by :math:`x` and :math:`y`. We denote as :math:`x_i` the :math:`i`-th observation of :math:`x`, and :math:`y_i` the :math:`i`-th observation of :math:`y`. Then, an estimator of the energy distance (:func:`~dcor.energy_distance`) is .. math:: \mathcal{E_{n_1, n_2}}(x, y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|| x_i - y_j || - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}|| x_i - x_j || - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2}|| y_i - y_j || References ---------- .. bibliography:: refs.bib :labelprefix: C :keyprefix: c-