.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/plot_energy_test.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:here  to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_plot_energy_test.py: The energy distance test of homogeneity ======================================= Example that shows the usage of the energy distance test. .. GENERATED FROM PYTHON SOURCE LINES 8-15 .. code-block:: default import matplotlib.pyplot as plt import numpy as np import pandas as pd import dcor .. GENERATED FROM PYTHON SOURCE LINES 16-22 Given samples (of possible different sizes) of several random vectors with the same arbitrary dimension, the energy distance can be used to construct a permutation test of homogeneity. The null hypothesis :math:\mathcal{H}_0 is that the two random vectors have the same distribution, where the alternative hypothesis :math:\mathcal{H}_1 is that their distributions differ. .. GENERATED FROM PYTHON SOURCE LINES 24-25 As an example, we can consider a case with identically distributed data: .. GENERATED FROM PYTHON SOURCE LINES 25-44 .. code-block:: default n_samples_x = 1000 n_samples_y = 600 random_state = np.random.default_rng(63263) x = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_x) y = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_y) plt.scatter(x[:, 0], x[:, 1], s=1) plt.scatter(y[:, 0], y[:, 1], s=1) plt.show() dcor.homogeneity.energy_test( x, y, num_resamples=200, random_state=random_state, ) .. image-sg:: /auto_examples/images/sphx_glr_plot_energy_test_001.png :alt: plot energy test :srcset: /auto_examples/images/sphx_glr_plot_energy_test_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none HypothesisTest(pvalue=0.9502487562189055, statistic=0.730988457642523) .. GENERATED FROM PYTHON SOURCE LINES 45-58 Under the null hypothesis, the p-value would have a uniform distribution between 0 and 1. Under the alternative hypothesis, the p-value would tend to 0. Thus, it is common to reject the null hypothesis when the p-value is below a predefined threshold :math:\alpha (the significance level). There is thus a probability :math:\alpha of rejecting the null hypothesis even when it is true (Type I error). To ensure that this does not happen often one typically chooses a value for :math:\alpha of 0.05 or 0.01, to obtain a Type I error less than 5% or 1% of the time, respectively. In this case as the p-value is greater than the threshold we (correctly) don't reject the null hypothesis, and thus we would consider the random variables independent. .. GENERATED FROM PYTHON SOURCE LINES 60-61 We can now consider the following data: .. GENERATED FROM PYTHON SOURCE LINES 61-76 .. code-block:: default x = random_state.multivariate_normal( np.zeros(2), [ [1, 0.5], [0.5, 1], ], size=n_samples_x, ) y = random_state.multivariate_normal(np.zeros(2), np.eye(2), size=n_samples_y) plt.scatter(x[:, 0], x[:, 1], s=1) plt.scatter(y[:, 0], y[:, 1], s=1) plt.show() .. image-sg:: /auto_examples/images/sphx_glr_plot_energy_test_002.png :alt: plot energy test :srcset: /auto_examples/images/sphx_glr_plot_energy_test_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 77-79 Now the two distributions have different variance. Thus, the test should reject the null hypothesis: .. GENERATED FROM PYTHON SOURCE LINES 79-87 .. code-block:: default dcor.homogeneity.energy_test( x, y, num_resamples=200, random_state=random_state, ) .. rst-class:: sphx-glr-script-out .. code-block:: none HypothesisTest(pvalue=0.004975124378109453, statistic=8.192156215078683) .. GENERATED FROM PYTHON SOURCE LINES 88-91 We can see that the p-value obtained is indeed very small, and thus we can safely reject the null hypothesis, and consider that the distributions are very different. .. GENERATED FROM PYTHON SOURCE LINES 93-101 The test illustrated here is a permutation test, which compares the distance covariance of the original dataset with the one obtained after random permutations of one of the input arrays. Thus, increasing the number of permutations makes the p-value more accurate, but increases the computational cost. With a low number of permutations or low number of observations, it is even possible to not reject the true hypothesis when it is not true (Type II error). .. GENERATED FROM PYTHON SOURCE LINES 103-113 We can now check how this test control effectively the Type I and Type II errors. We can do a simple Monte Carlo test, as explained in the Example 1 of :footcite:t:szekely+rizzo_2004_testing. What follows is a replication of the results obtained in that example, using a lower number of test repetitions due to time constraints. Users are encouraged to download this example and increase that number to obtain better estimates of the Type I and Type II errors. In order to replicate the original results, one should set the value of n_tests to 10000 and num_resamples to 499. .. GENERATED FROM PYTHON SOURCE LINES 115-120 We generate data from two uncorrelated multivariate normal distributions, with means :math:(0, 0) and :math:(0, \delta). For :math:\delta = 0 the two random vectors have the same distribution, and thus we can check the Type I error. In all the other cases we can check the Type II error for a particular value of :math:\delta. .. GENERATED FROM PYTHON SOURCE LINES 120-165 .. code-block:: default n_tests = 100 n_obs_list = [10, 15, 20, 25, 30, 40, 50, 75, 100] num_resamples = 200 significance = 0.1 def multivariate_normal(n_obs, delta): return random_state.multivariate_normal( [0, delta], np.eye(2), size=n_obs, ) deltas = [0, 0.5, 0.75, 1] table = pd.DataFrame() table["n₁"] = n_obs_list table["n₂"] = n_obs_list for delta in deltas: dist_results = [] for n_obs in n_obs_list: n_errors = 0 for _ in range(n_tests): x = multivariate_normal(n_obs, 0) y = multivariate_normal(n_obs, delta) test_result = dcor.homogeneity.energy_test( x, y, num_resamples=num_resamples, random_state=random_state, ) if test_result.pvalue < significance: n_errors += 1 error_prob = n_errors / n_tests dist_results.append(error_prob) table[f"δ = {delta}"] = dist_results table .. raw:: html
n₁ n₂ δ = 0 δ = 0.5 δ = 0.75 δ = 1
0 10 10 0.11 0.20 0.43 0.47
1 15 15 0.08 0.25 0.53 0.77
2 20 20 0.07 0.36 0.59 0.82
3 25 25 0.13 0.42 0.71 0.92
4 30 30 0.06 0.47 0.79 0.98
5 40 40 0.07 0.59 0.91 0.98
6 50 50 0.09 0.71 0.93 1.00
7 75 75 0.13 0.82 0.99 1.00
8 100 100 0.14 0.95 1.00 1.00

.. GENERATED FROM PYTHON SOURCE LINES 166-169 Bibliography ------------ .. footbibliography:: .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 2 minutes 6.807 seconds) .. _sphx_glr_download_auto_examples_plot_energy_test.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/VNMabus/dcor/master?filepath=examples/plot_energy_test.py :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-python :download:Download Python source code: plot_energy_test.py  .. container:: sphx-glr-download sphx-glr-download-jupyter :download:Download Jupyter notebook: plot_energy_test.ipynb  .. only:: html .. rst-class:: sphx-glr-signature Gallery generated by Sphinx-Gallery _