.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_dcov_test.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_dcov_test.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_dcov_test.py:


The distance covariance test of independence
============================================

This example shows how to use the distance covariance test of independence.

.. GENERATED FROM PYTHON SOURCE LINES 8-19

.. code-block:: Python


    import time

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    import dcor

    # sphinx_gallery_thumbnail_number = 3


.. GENERATED FROM PYTHON SOURCE LINES 20-27

Given matching samples of two random vectors with arbitrary dimensions, the
distance covariance can be used to construct a permutation test of
independence.
The null hypothesis :math:`\mathcal{H}_0` is that the two random
vectors are independent, whereas the alternative hypothesis
:math:`\mathcal{H}_1` is that there is some (possibly nonlinear)
dependence between them.
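
The permutation procedure behind such a test can be sketched with NumPy alone.
The snippet below is a minimal illustration, not the implementation used by
``dcor``: the helper names ``centered_distances`` and
``permutation_test_sketch`` are hypothetical, and it assumes that the statistic
is :math:`n` times the squared sample distance covariance and that the p-value
is estimated as :math:`(1 + k) / (1 + B)`, where :math:`k` of the :math:`B`
permuted statistics are at least as large as the observed one.

.. code-block:: Python

    import numpy as np


    def centered_distances(a):
        """Double-centered matrix of pairwise Euclidean distances (sketch)."""
        a = np.asarray(a, dtype=float)
        if a.ndim == 1:
            a = a[:, np.newaxis]  # n scalar observations -> shape (n, 1)
        dist = np.linalg.norm(a[:, np.newaxis, :] - a[np.newaxis, :, :], axis=-1)
        row_mean = dist.mean(axis=1, keepdims=True)
        col_mean = dist.mean(axis=0, keepdims=True)
        return dist - row_mean - col_mean + dist.mean()


    def permutation_test_sketch(x, y, num_resamples, rng):
        """Hypothetical helper returning (p-value, statistic)."""
        a = centered_distances(x)
        b = centered_distances(y)
        n = a.shape[0]

        # Assumed statistic: n times the squared sample distance covariance.
        observed = (a * b).sum() / n

        n_exceed = 0
        for _ in range(num_resamples):
            perm = rng.permutation(n)
            # Permuting the observations of y permutes rows and columns of b.
            permuted = (a * b[np.ix_(perm, perm)]).sum() / n
            n_exceed += int(permuted >= observed)

        pvalue = (1 + n_exceed) / (1 + num_resamples)
        return pvalue, observed


    # Noise-free points on a circle: dependent, but not linearly related.
    rng = np.random.default_rng(0)
    u = rng.uniform(-1, 1, size=300)
    pvalue, statistic = permutation_test_sketch(
        np.sin(u * np.pi),
        np.cos(u * np.pi),
        num_resamples=99,
        rng=rng,
    )
    print(pvalue, statistic)

Permuting the observations of one sample breaks any dependence between the two
samples while preserving both marginal distributions, which is why the permuted
statistics approximate the distribution of the statistic under
:math:`\mathcal{H}_0`.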

.. GENERATED FROM PYTHON SOURCE LINES 29-30

As an example, we can consider a case with independent observations:

.. GENERATED FROM PYTHON SOURCE LINES 30-47

.. code-block:: Python


    n_samples = 1000
    random_state = np.random.default_rng(83110)

    x = random_state.uniform(0, 1, size=n_samples)
    y = random_state.normal(0, 1, size=n_samples)

    plt.scatter(x, y, s=1)
    plt.show()

    dcor.independence.distance_covariance_test(
        x,
        y,
        num_resamples=200,
        random_state=random_state,
    )


.. image-sg:: /auto_examples/images/sphx_glr_plot_dcov_test_001.png
   :alt: plot dcov test
   :srcset: /auto_examples/images/sphx_glr_plot_dcov_test_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    HypothesisTest(pvalue=0.8955223880597015, statistic=0.196086744754483)


.. GENERATED FROM PYTHON SOURCE LINES 48-61

Under the null hypothesis, the p-value is (approximately) uniformly
distributed between 0 and 1.
Under the alternative hypothesis, the p-value tends to concentrate near 0.
It is therefore common to reject the null hypothesis when the p-value is
below a predefined threshold :math:`\alpha` (the significance level).
With this rule there is a probability :math:`\alpha` of rejecting the null
hypothesis even when it is true (Type I error).
To keep this error rare, one typically chooses a value for :math:`\alpha`
of 0.05 or 0.01, so that a Type I error occurs at most 5% or 1% of the
time, respectively.
In this case, as the p-value is greater than the threshold, we (correctly)
do not reject the null hypothesis: the data provide no evidence against the
independence of the random variables.

.. GENERATED FROM PYTHON SOURCE LINES 63-64

We can now consider the following data:

.. GENERATED FROM PYTHON SOURCE LINES 64-79

.. code-block:: Python


    u = random_state.uniform(-1, 1, size=n_samples)

    y = (
        np.cos(u * np.pi)
        + random_state.normal(0, 0.01, size=n_samples)
    )
    x = (
        np.sin(u * np.pi)
        + random_state.normal(0, 0.01, size=n_samples)
    )

    plt.scatter(x, y, s=1)
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_plot_dcov_test_002.png
   :alt: plot dcov test
   :srcset: /auto_examples/images/sphx_glr_plot_dcov_test_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 80-83

The points lie close to a circle, so there is clearly a nonlinear
relationship between ``x`` and ``y``.
We can use the distance covariance test to check that this
is the case:

.. GENERATED FROM PYTHON SOURCE LINES 83-91

.. code-block:: Python


    dcor.independence.distance_covariance_test(
        x,
        y,
        num_resamples=200,
        random_state=random_state,
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    HypothesisTest(pvalue=0.004975124378109453, statistic=12.173422287248243)


.. GENERATED FROM PYTHON SOURCE LINES 92-95

The p-value obtained is very small, so we can reject the null hypothesis
at the usual significance levels and conclude that the random vectors
are dependent.
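
Both p-values printed in this example are multiples of :math:`1/201`.
This is consistent with a common convention for permutation tests, assumed
here rather than taken from the ``dcor`` documentation, in which the p-value
is estimated as :math:`(1 + k) / (1 + B)`, with :math:`B` the number of
permutations and :math:`k` the number of permuted statistics at least as large
as the observed one.
The sketch below uses a hypothetical helper, with the counts :math:`k`
inferred from the printed values:

.. code-block:: Python

    def permutation_pvalue(n_exceed, num_resamples):
        """Hypothetical helper: permutation p-value estimate (1 + k) / (1 + B)."""
        return (1 + n_exceed) / (1 + num_resamples)


    # Independent data: k = 179 is the count that matches the printed p-value.
    pvalue_independent = permutation_pvalue(179, 200)
    # Dependent data: k = 0, no permuted statistic reaches the observed one.
    pvalue_dependent = permutation_pvalue(0, 200)

    print(pvalue_independent, pvalue_dependent)

Under this assumption, :math:`1/201 \approx 0.005` is the smallest p-value
attainable with 200 permutations, so the second result is as small as it can
be; resolving smaller p-values requires more permutations.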

.. GENERATED FROM PYTHON SOURCE LINES 97-103

The test illustrated here is a permutation test, which compares the distance
covariance of the original dataset with the values obtained after randomly
permuting one of the input arrays.
Increasing the number of permutations therefore makes the p-value more
accurate, but also increases the computational cost.
The following graph illustrates this:

.. GENERATED FROM PYTHON SOURCE LINES 103-132

.. code-block:: Python


    num_resamples_list = [10, 50, 100, 200, 500]

    pvalues = []
    times = []

    for num_resamples in num_resamples_list:

        start_time = time.monotonic()
        test_result = dcor.independence.distance_covariance_test(
            x,
            y,
            num_resamples=num_resamples,
            random_state=random_state,
        )
        end_time = time.monotonic()

        pvalues.append(test_result.pvalue)
        times.append(end_time - start_time)

    fig, axes = plt.subplots(2, 1, sharex=True)
    axes[0].plot(num_resamples_list, pvalues)
    axes[1].plot(num_resamples_list, times, color="C1")
    axes[1].set_xticks(num_resamples_list)
    axes[1].set_xlabel("number of permutations")
    axes[0].set_ylabel("p-value")
    axes[1].set_ylabel("time (in seconds)")
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_plot_dcov_test_003.png
   :alt: plot dcov test
   :srcset: /auto_examples/images/sphx_glr_plot_dcov_test_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 133-142

In order to check that this test effectively controls the Type I error,
we can run a simple Monte Carlo experiment, as explained in Example 1 of
:footcite:t:`szekely++_2007_measuring`.
What follows is a replication of the results obtained in that example, using
a lower number of test repetitions due to time constraints.
Users are encouraged to download this example and increase that number to
obtain better estimates of the Type I error.
In order to replicate the original results, one should set the value of
``n_tests`` to 10000.
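
The effect of the number of repetitions can be quantified with a small sketch.
Assuming that the tests are independent and that the true rejection rate equals
the nominal level :math:`\alpha` (0.1 in this example), the estimated rate is a
binomial proportion, whose standard error is
:math:`\sqrt{\alpha (1 - \alpha) / n_{\text{tests}}}`.
The helper name below is illustrative:

.. code-block:: Python

    import math


    def rejection_rate_std_error(level, n_tests):
        """Standard error of a rejection rate estimated from n_tests tests."""
        return math.sqrt(level * (1 - level) / n_tests)


    # Compare the reduced number of repetitions with the original one.
    for n_tests in (100, 10000):
        print(n_tests, round(rejection_rate_std_error(0.1, n_tests), 4))

With 100 repetitions the standard error is about 0.03, so estimates between
roughly 0.04 and 0.16 are compatible with a true rate of 0.1; with 10000
repetitions it drops to about 0.003.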

.. GENERATED FROM PYTHON SOURCE LINES 144-151

We generate independent data following a multivariate Gaussian distribution
as well as different :math:`t(\nu)` distributions.
In all cases we consider random vectors with dimension 5.
We perform the tests for different numbers :math:`n` of observations,
computing the number of permutations used as
:math:`\lfloor 200 + 5000 / n \rfloor`.
We fix the significance level to 0.1.

.. GENERATED FROM PYTHON SOURCE LINES 151-216

.. code-block:: Python


    n_tests = 100
    dim = 5
    n_obs_list = [25, 30, 35, 50, 70, 100]
    significance = 0.1


    def num_resamples_from_obs(n_obs):
        return 200 + 5000 // n_obs


    num_resamples_list = [num_resamples_from_obs(n_obs) for n_obs in n_obs_list]


    def multivariate_normal(n_obs):
        return random_state.normal(
            size=(n_obs, dim),
        )


    def t_dist_generator(df):
        def t_dist(n_obs):
            return random_state.standard_t(
                df=df,
                size=(n_obs, dim),
            )

        return t_dist


    distributions = {
        "Multivariate normal": multivariate_normal,
        "t(1)": t_dist_generator(1),
        "t(2)": t_dist_generator(2),
        "t(3)": t_dist_generator(3),
    }
    table = pd.DataFrame()
    table["n_obs"] = n_obs_list
    table["num_resamples"] = num_resamples_list

    for dist_name, dist in distributions.items():
        dist_results = []
        for n_obs, num_resamples in zip(n_obs_list, num_resamples_list):
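            # Count how often the (true) null hypothesis is rejected.
            n_errors = 0
            for _ in range(n_tests):
                # x and y are drawn independently, so H0 holds by construction.
                x = dist(n_obs)
                y = dist(n_obs)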

                test_result = dcor.independence.distance_covariance_test(
                    x,
                    y,
                    num_resamples=num_resamples,
                    random_state=random_state,
                )

                if test_result.pvalue < significance:
                    n_errors += 1

            error_prob = n_errors / n_tests
            dist_results.append(error_prob)

        table[dist_name] = dist_results

    table


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>n_obs</th>
          <th>num_resamples</th>
          <th>Multivariate normal</th>
          <th>t(1)</th>
          <th>t(2)</th>
          <th>t(3)</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>25</td>
          <td>400</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.05</td>
          <td>0.09</td>
        </tr>
        <tr>
          <th>1</th>
          <td>30</td>
          <td>366</td>
          <td>0.07</td>
          <td>0.05</td>
          <td>0.07</td>
          <td>0.12</td>
        </tr>
        <tr>
          <th>2</th>
          <td>35</td>
          <td>342</td>
          <td>0.09</td>
          <td>0.15</td>
          <td>0.10</td>
          <td>0.08</td>
        </tr>
        <tr>
          <th>3</th>
          <td>50</td>
          <td>300</td>
          <td>0.09</td>
          <td>0.09</td>
          <td>0.09</td>
          <td>0.09</td>
        </tr>
        <tr>
          <th>4</th>
          <td>70</td>
          <td>271</td>
          <td>0.09</td>
          <td>0.12</td>
          <td>0.11</td>
          <td>0.09</td>
        </tr>
        <tr>
          <th>5</th>
          <td>100</td>
          <td>250</td>
          <td>0.10</td>
          <td>0.11</td>
          <td>0.12</td>
          <td>0.09</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 217-220

Bibliography
------------
.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 58.105 seconds)


.. _sphx_glr_download_auto_examples_plot_dcov_test.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/VNMabus/dcor/develop?filepath=examples/plot_dcov_test.py
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_dcov_test.ipynb <plot_dcov_test.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_dcov_test.py <plot_dcov_test.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_