.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINXGALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_energy_test.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphxglrdownloadlinknote
Click :ref:`here `
to download the full example code or to run this example in your browser via Binder
.. rstclass:: sphxglrexampletitle
.. _sphx_glr_auto_examples_plot_energy_test.py:
The energy distance test of homogeneity
=======================================
Example that shows the usage of the energy distance test.
.. GENERATED FROM PYTHON SOURCE LINES 815
.. codeblock:: default
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import dcor
.. GENERATED FROM PYTHON SOURCE LINES 1622
Given samples (of possible different sizes) of several random vectors with
the same arbitrary dimension, the energy distance can be used to construct a
permutation test of homogeneity.
The null hypothesis :math:`\mathcal{H}_0` is that the two random
vectors have the same distribution, where the alternative hypothesis
:math:`\mathcal{H}_1` is that their distributions differ.
.. GENERATED FROM PYTHON SOURCE LINES 2425
As an example, we can consider a case with identically distributed data:
.. GENERATED FROM PYTHON SOURCE LINES 2544
.. codeblock:: default
n_samples_x = 1000
n_samples_y = 600
random_state = np.random.default_rng(63263)
x = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_x)
y = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_y)
plt.scatter(x[:, 0], x[:, 1], s=1)
plt.scatter(y[:, 0], y[:, 1], s=1)
plt.show()
dcor.homogeneity.energy_test(
x,
y,
num_resamples=200,
random_state=random_state,
)
.. imagesg:: /auto_examples/images/sphx_glr_plot_energy_test_001.png
:alt: plot energy test
:srcset: /auto_examples/images/sphx_glr_plot_energy_test_001.png
:class: sphxglrsingleimg
.. rstclass:: sphxglrscriptout
.. codeblock:: none
HypothesisTest(pvalue=0.9502487562189055, statistic=0.730988457642523)
.. GENERATED FROM PYTHON SOURCE LINES 4558
Under the null hypothesis, the pvalue would have a uniform
distribution between 0 and 1.
Under the alternative hypothesis, the pvalue would tend to 0.
Thus, it is common to reject the null hypothesis when the pvalue is
below a predefined threshold :math:`\alpha` (the significance level).
There is thus a probability :math:`\alpha` of rejecting the null
hypothesis even when it is true (Type I error).
To ensure that this does not happen often one typically chooses a value
for :math:`\alpha` of 0.05 or 0.01, to obtain a Type I error less than
5% or 1% of the time, respectively.
In this case as the pvalue is greater than the threshold we (correctly)
don't reject the null hypothesis, and thus we would consider the random
variables independent.
.. GENERATED FROM PYTHON SOURCE LINES 6061
We can now consider the following data:
.. GENERATED FROM PYTHON SOURCE LINES 6176
.. codeblock:: default
x = random_state.multivariate_normal(
np.zeros(2),
[
[1, 0.5],
[0.5, 1],
],
size=n_samples_x,
)
y = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_y)
plt.scatter(x[:, 0], x[:, 1], s=1)
plt.scatter(y[:, 0], y[:, 1], s=1)
plt.show()
.. imagesg:: /auto_examples/images/sphx_glr_plot_energy_test_002.png
:alt: plot energy test
:srcset: /auto_examples/images/sphx_glr_plot_energy_test_002.png
:class: sphxglrsingleimg
.. GENERATED FROM PYTHON SOURCE LINES 7779
Now the two distributions have different variance. Thus, the test should
reject the null hypothesis:
.. GENERATED FROM PYTHON SOURCE LINES 7987
.. codeblock:: default
dcor.homogeneity.energy_test(
x,
y,
num_resamples=200,
random_state=random_state,
)
.. rstclass:: sphxglrscriptout
.. codeblock:: none
HypothesisTest(pvalue=0.004975124378109453, statistic=8.192156215078683)
.. GENERATED FROM PYTHON SOURCE LINES 8891
We can see that the pvalue obtained is indeed very small,
and thus we can safely reject the null hypothesis, and consider
that the distributions are very different.
.. GENERATED FROM PYTHON SOURCE LINES 93101
The test illustrated here is a permutation test, which compares the distance
covariance of the original dataset with the one obtained after random
permutations of one of the input arrays.
Thus, increasing the number of permutations makes the pvalue more accurate,
but increases the computational cost.
With a low number of permutations or low number of observations, it is even
possible to not reject the true hypothesis when it is not true
(Type II error).
.. GENERATED FROM PYTHON SOURCE LINES 103113
We can now check how this test control effectively the Type I and Type II
errors.
We can do a simple Monte Carlo test, as explained in the Example 1 of
:footcite:t:`szekely+rizzo_2004_testing`.
What follows is a replication of the results obtained in that example, using
a lower number of test repetitions due to time constraints.
Users are encouraged to download this example and increase that number to
obtain better estimates of the Type I and Type II errors.
In order to replicate the original results, one should set the value of
``n_tests`` to 10000 and ``num_resamples`` to 499.
.. GENERATED FROM PYTHON SOURCE LINES 115120
We generate data from two uncorrelated multivariate normal distributions,
with means :math:`(0, 0)` and :math:`(0, \delta)`.
For :math:`\delta = 0` the two random vectors have the same distribution,
and thus we can check the Type I error. In all the other cases we can check
the Type II error for a particular value of :math:`\delta`.
.. GENERATED FROM PYTHON SOURCE LINES 120165
.. codeblock:: default
n_tests = 100
n_obs_list = [10, 15, 20, 25, 30, 40, 50, 75, 100]
num_resamples = 200
significance = 0.1
def multivariate_normal(n_obs, delta):
return random_state.multivariate_normal(
[0, delta],
np.eye(2),
size=n_obs,
)
deltas = [0, 0.5, 0.75, 1]
table = pd.DataFrame()
table["n₁"] = n_obs_list
table["n₂"] = n_obs_list
for delta in deltas:
dist_results = []
for n_obs in n_obs_list:
n_errors = 0
for _ in range(n_tests):
x = multivariate_normal(n_obs, 0)
y = multivariate_normal(n_obs, delta)
test_result = dcor.homogeneity.energy_test(
x,
y,
num_resamples=num_resamples,
random_state=random_state,
)
if test_result.pvalue < significance:
n_errors += 1
error_prob = n_errors / n_tests
dist_results.append(error_prob)
table[f"δ = {delta}"] = dist_results
table
.. raw:: html

n₁ 
n₂ 
δ = 0 
δ = 0.5 
δ = 0.75 
δ = 1 
0 
10 
10 
0.11 
0.20 
0.43 
0.47 
1 
15 
15 
0.08 
0.25 
0.53 
0.77 
2 
20 
20 
0.07 
0.36 
0.59 
0.82 
3 
25 
25 
0.13 
0.42 
0.71 
0.92 
4 
30 
30 
0.06 
0.47 
0.79 
0.98 
5 
40 
40 
0.07 
0.59 
0.91 
0.98 
6 
50 
50 
0.09 
0.71 
0.93 
1.00 
7 
75 
75 
0.13 
0.82 
0.99 
1.00 
8 
100 
100 
0.14 
0.95 
1.00 
1.00 
.. GENERATED FROM PYTHON SOURCE LINES 166169
Bibliography

.. footbibliography::
.. rstclass:: sphxglrtiming
**Total running time of the script:** ( 2 minutes 6.807 seconds)
.. _sphx_glr_download_auto_examples_plot_energy_test.py:
.. only:: html
.. container:: sphxglrfooter sphxglrfooterexample
.. container:: binderbadge
.. image:: images/binder_badge_logo.svg
:target: https://mybinder.org/v2/gh/VNMabus/dcor/master?filepath=examples/plot_energy_test.py
:alt: Launch binder
:width: 150 px
.. container:: sphxglrdownload sphxglrdownloadpython
:download:`Download Python source code: plot_energy_test.py `
.. container:: sphxglrdownload sphxglrdownloadjupyter
:download:`Download Jupyter notebook: plot_energy_test.ipynb `
.. only:: html
.. rstclass:: sphxglrsignature
`Gallery generated by SphinxGallery `_