Usage
=====

.. _basic_concepts:

Basic concepts
--------------

Once loaded, the functions provided by ``entropy.entropy`` can be used straightforwardly::

    import numpy
    import entropy.entropy as entropy  # estimation functions

    x = numpy.random.randn(1, 10000)
    H = entropy.compute_entropy(x)
    print(H)

The library offers a set of compute_XXX functions to compute quantities XXX over any multi-dimensional data/signal. As of now, XXX can be:

* entropy : the continuous entropy :math:`H`
* entropy_increments : the entropy :math:`H_{\rm inc}` of the increments (of arbitrary order)
* entropy_rate : the entropy rate :math:`h` (of arbitrary order)
* regularity index : the difference :math:`H_{\rm inc} - h`
* relative entropy : the relative entropy between the distributions of 2 signals
* MI : the mutual information between 2 signals
* PMI : the partial mutual information between 2 signals, conditioned on a third one
* TE : the transfer entropy from one signal to another
* PTE : the partial transfer entropy from one signal to another, conditioned on a third one
* DI : the directed information from one signal to another
* Renyi : the Rényi entropy
* complexities : the ApEn and SampEn complexity measures
* ...

A few concepts underlie the whole set of available compute_XXX functions. They affect the way the input parameters of the functions are defined, and the way the estimations are performed.

At its core, the library uses a nearest-neighbors approach to estimate information-theoretic quantities, such as the continuous entropy (the Shannon entropy for continuous signals). Nearest-neighbors searches are performed with the ANN (Approximate Nearest Neighbors) library by David M. Mount and Sunil Arya, which has been patched to work with the pthreads library and thus benefit from multithreading.
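To give a concrete idea of what a nearest-neighbors estimation looks like, here is a minimal, self-contained sketch of a Kozachenko-Leonenko entropy estimate for a 1-d signal with k=1 (in nats). This is an illustration only, written here with plain NumPy; it is not the library's implementation, which relies on ANN and multithreading.

```python
import numpy as np

def knn_entropy_1d(x):
    """Kozachenko-Leonenko entropy estimate (k=1) of a 1-d sample, in nats."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    gaps = np.diff(x)
    # distance from each (sorted) point to its nearest neighbor
    eps = np.minimum(np.r_[gaps[0], gaps], np.r_[gaps, gaps[-1]])
    eps = np.maximum(eps, 1e-300)            # guard against exact duplicates
    euler_gamma = 0.5772156649015329         # equals -psi(1)
    psi_n = np.log(n) - 1.0 / (2.0 * n)      # asymptotic expansion of psi(n)
    # H ~ psi(n) - psi(1) + log(V_1) + (1/n) * sum(log eps_i), with V_1 = 2
    return psi_n + euler_gamma + np.log(2.0) + np.log(eps).mean()

rng = np.random.default_rng(0)
H = knn_entropy_1d(rng.standard_normal(10000))
# for N(0,1), the true entropy is 0.5*log(2*pi*e) ~ 1.419 nats
```

With 10000 samples, the estimate lands close to the theoretical value; the library's estimators follow the same principle in arbitrary dimension, with k configurable.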
One important concept is the Takens (time-)embedding that is performed on the fly by every function of the library. A specific feature of the library is its ability to explore any quantity at a given scale (stride); an exploration of the behavior of the quantity as a function of the scale is called a multi-scale analysis. If you do not want to use this feature, simply do not use the parameters m or n_embed, nor the parameter stride.

Conversely, if you plan to vary the scale/stride, take a look at the function :any:`compute_over_scales`.
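The Takens embedding above can be sketched as follows, assuming the usual delay-embedding definition (the function name and the explicit copy are illustrative only; the library builds the embedding internally):

```python
import numpy as np

def takens_embed(x, m, stride):
    """Illustrative Takens (time-)embedding: each point t becomes the vector
    (x[t], x[t-stride], ..., x[t-(m-1)*stride])."""
    x = np.asarray(x, dtype=float)
    n = x.size - (m - 1) * stride          # number of embedded points
    cols = [x[j * stride : j * stride + n] for j in range(m)]
    return np.stack(cols[::-1])            # shape (m, n), time along last axis

x = np.arange(10.0)
X = takens_embed(x, m=3, stride=2)
# X[:, 0] is (x[4], x[2], x[0]) = (4., 2., 0.)
```

With m=1 no embedding is performed, which is why the parameters m/n_embed and stride can simply be left out when this feature is not wanted.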
Below are listed input parameters that can be used in any compute_XXX function, and global settings that apply to all functions.

.. _input_parameters:

Input parameters
----------------

The following parameters directly influence the estimation and can be provided to each compute_XXX function independently, i.e., by indicating a value in each call.

* x (and possibly y, z) = the data : all functions expect some data as input. A single dataset is expected as a 2-dimensional Numpy array that represents either uni-dimensional or multi-dimensional data along an index that is referred to as time. Note that the dimension ordering is such that the direction of analysis (the "time" direction) is always assumed to be the last dimension of the data. If this is not the case, you can always reorder the data with the function :any:`reorder` from entropy.tools::
    import entropy.tools as tools  # tools for input formatting, among others

    x = numpy.random.randn(2, 100000)  # OK, correct ordering for a bi-variate signal
    H = entropy.compute_entropy(x)     # works fine
    y = numpy.random.randn(100000, 2)  # not OK, the functions will complain
    H = entropy.compute_entropy(y)     # complains
    y = tools.reorder(y)               # put the time direction last
    H = entropy.compute_entropy(y)     # works fine

  The function :any:`reorder` operates on 1-dimensional or multi-dimensional Numpy arrays, and always returns a C-contiguous multi-dimensional Numpy array.

* k, the number of nearest neighbors to consider in the estimations. A typical value for k is 5, and this is the default value.

* Gaussian approximation: if you provide the special value k=-1 to a compute_XXX function, then a "Gaussian approximation" is performed instead of a k-nn estimation. This is much faster, but it completely discards any higher-order dependence structure that could be present in the data.
  Note that the Gaussian approximation is performed in the very same way as a regular k-nn estimation would be: the data is sampled in the same way, and you have access to the same errorbar estimation (using N_real independent realizations).

* stride is a very important parameter, as it specifies the (time-)scale at which the estimation is performed. As you may often want to vary stride, the library offers a function to iteratively estimate your favorite quantity over a set of time-scales (strides); see :any:`compute_over_scales`.

* m, the embedding dimension, is available in almost all functions.
* Theiler, the Theiler sub-sampling (time-)scale, can be specified explicitly, or left to an automatic selection by the library.
  If you impose a positive value for Theiler, then this value will be used as the sub-sampling (time-)scale.
  In most cases, Theiler can be set to a negative value between -1 and -4, which indicates that the function(s) can automatically select the most appropriate Theiler sub-sampling scale, according to:

  - -1 (or "legacy") : tau_Theiler=tau(=stride) + uniform sampling (thus localized in the dataset) (legacy)
  - -2 (or "smart") : tau_Theiler=max>=tau(=stride) + uniform sampling (covering the full dataset)
  - -3 (or "random") : tau_Theiler=tau(=stride) + random sampling
  - -4 (or "adapted") : tau_Theiler>(or <)tau(=stride)

  Note that the sub-sampling is applied even if you use m=1 (no embedding).

* N_eff, the number of effective points to use for the estimation. As the bias depends directly on this number, it is reasonable to keep it fixed when the time-scale (stride) is varied.

* N_real, the number of realizations to consider in the provided data. As soon as this number is larger than 1, the original dataset will be split into smaller chunks over which separate estimations will be performed. This allows for a better estimation, and also provides an errorbar for the estimator (via the standard deviation of the estimator; see :any:`errorbars`). If you have a lot of points, you can probably ask for a large N_real, at the cost of a higher computational time.

* mask, an optional parameter, allows for explicit selection/deselection of the points in the dataset. This is useful when you want to compute statistics on parts of the signal only (e.g., on specific epochs: you can then define a mask that is True at the dates in the epoch), or simply discard abnormal values from the computation (e.g., if you have NaN values, you can build a mask corresponding to the locations of these NaNs; see :any:`entropy.masks`). If you plan on using masks for your estimations, take a look at the module :any:`entropy.masks`, which offers some useful functions::

    import entropy.masks as masks  # functions for mask manipulation

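The "Gaussian approximation" selected with k=-1 amounts to using the entropy of a Gaussian having the sample covariance of the data. A minimal sketch (not the library's code), assuming the library's convention of time along the last axis:

```python
import numpy as np

def gaussian_entropy(x):
    """Entropy (in nats) of a Gaussian with the sample covariance of x,
    where x is a (d, N) array with time along the last axis."""
    x = np.atleast_2d(x)
    d = x.shape[0]
    cov = np.atleast_2d(np.cov(x))       # (d, d) sample covariance
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(1)
x = rng.standard_normal((1, 50000))
H = gaussian_entropy(x)
# for N(0,1), the exact value is 0.5*log(2*pi*e) ~ 1.419 nats
```

Because only the covariance enters the formula, any dependence beyond second order is invisible to this approximation, which is exactly the trade-off described above.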
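The differences between the Theiler prescriptions above can be pictured with simple index arithmetic. The sketch below is hypothetical (the library's actual point selection may differ); it only illustrates how "legacy", "smart" and "random" sampling could pick N_eff points:

```python
import numpy as np

N, tau, n_eff = 100000, 5, 2000   # dataset size, stride, effective points

# "legacy" (-1): spacing equal to the stride -> a localized block of the data
idx_legacy = np.arange(n_eff) * tau

# "smart" (-2): spacing stretched so the n_eff points cover the full dataset
tau_smart = (N - 1) // (n_eff - 1)
idx_smart = np.arange(n_eff) * tau_smart

# "random" (-3): n_eff positions drawn at random, on the stride grid
rng = np.random.default_rng(0)
idx_random = np.sort(rng.choice(N // tau, size=n_eff, replace=False)) * tau
```

With these numbers, the "legacy" indices span only the first 10% of the dataset, while the "smart" indices span almost all of it, which is the point of the -2 prescription.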
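As an example of the mask mechanism described above, here is how a mask discarding NaN values could be built; the convention assumed here (a boolean array, True for points to keep) is an illustration of the idea, see :any:`entropy.masks` for the library's own helpers:

```python
import numpy as np

x = np.random.randn(1, 1000)
x[0, 100:110] = np.nan       # simulate 10 abnormal values

mask = ~np.isnan(x[0])       # True where the data is valid
# such a mask could then be passed to a compute_XXX function
# so that the estimation skips the NaN locations
```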

.. _global_settings:

Global settings
---------------

Besides explicitly passing input parameters to the compute_XXX functions, it is possible to tweak the behavior of the library by setting global parameter values.

* Kraskov-Stogbauer-Grassberger algorithm: there are 2 different algorithms, which only share parts of their functioning. Any compute_XXX function, for XXX in {MI, TE, PMI, PTE, etc.}, can use either or both algorithm(s). Of course, the computation time increases if you ask for the 2 algorithms simultaneously, but you then get 2 estimates. See the function :any:`choose_algorithm` to select which algorithm you want (default : algorithm 1).

* the default value for the Theiler prescription can be set using the function :any:`set_Theiler`. It will then apply to each new call to a compute_XXX function.

* default values for the Theiler prescription, N_eff and N_real can be set using the function :any:`set_sampling`. These values will then apply to each new call to a compute_XXX function, without the need to explicitly specify these parameters.

* multithreading can be controlled by the function :any:`multithreading`. By default, multithreading is activated, and an automatic selection of the number of cores to use is performed when the library is imported. You can track the number of cores used with::

    entropy.multithreading('info')

* etc. (doc to do)


.. _errorbars:

Errorbars for the estimators
----------------------------

Because several statistically independent realizations (N_real) are usually used in the estimation, it is possible to obtain an estimate of the errorbar of the estimator. It is defined as the standard deviation of the estimator computed over the N_real independent realizations.

The function :any:`get_last_info` gives access to this standard deviation at no additional computational cost (it is already computed anyway)::

    N_eff = 2000
    H = entropy.compute_entropy(x, N_eff=N_eff)
    H_std = entropy.get_last_info()[0]
    print("with %d points, H = %1.3f +/- %1.3f" % (N_eff, H, H_std))

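The N_real mechanism behind this errorbar can be mimicked with plain NumPy. The sketch below uses the sample mean as a stand-in estimator (an assumption for illustration; the library does the equivalent internally for its own estimators):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(10000)

n_real = 10
chunks = x.reshape(n_real, -1)     # split the data into N_real realizations
estimates = chunks.mean(axis=1)    # one estimate per realization
value = estimates.mean()           # final estimate
errorbar = estimates.std()         # standard deviation over realizations
```

This is why increasing N_real yields a more trustworthy errorbar, at the price of fewer points (smaller chunks) per realization.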
If the function returns 2 estimates (e.g., MI, PMI, TE, PTE, DI, etc.), their respective standard deviations are returned as the first and second elements of the values returned by :any:`get_last_info`::

    MI = entropy.compute_MI(x, y)
    MI_std = entropy.get_last_info()[0:2]
    print("with algorithm 1, MI = %1.3f +/- %1.3f" % (MI[0], MI_std[0]))
    print("with algorithm 2, MI = %1.3f +/- %1.3f" % (MI[1], MI_std[1]))


Extra tools
-----------

If you import :any:`entropy.tools`, you'll have access to several useful functions::

    import entropy.tools as tools

Some other useful functions are in the main module :any:`entropy.entropy`.
* reorder() : the function :any:`tools.reorder` offers a straightforward way of making **any** nd-array compatible with the convention of the library (C-order, contiguous memory alignment). See the example in the section :any:`input_parameters` above.
* compute_over_scales() : the function :any:`tools.compute_over_scales` performs a multi-scale (time-)analysis, using any compute_XXX function of the library. You simply have to provide a set of (time-)scales in the form of an nd-array.
* surrogate() : the function :any:`entropy.surrogate` creates a surrogate version of its input variable. Such a surrogate is usually defined as having the same statistics, but without any residual auto-correlation (as obtained by shuffling the original data, which destroys all correlations). The function provides several methods, e.g., to create surrogate data with the same correlation but a different (Gaussian) distribution.
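The effect of :any:`tools.reorder` described above can be approximated with a short NumPy stand-in. The function below is hypothetical (the library's heuristic for deciding which axis is time may differ); it only illustrates the transpose-and-make-contiguous idea:

```python
import numpy as np

def reorder_like(a):
    """Hypothetical stand-in for tools.reorder: put the (longer) time axis
    last and return a C-contiguous array. Illustration only."""
    a = np.atleast_2d(np.asarray(a))
    if a.shape[0] > a.shape[-1]:      # time axis seems to be first: transpose
        a = a.T
    return np.ascontiguousarray(a)

y = np.random.randn(100000, 2)
z = reorder_like(y)                   # shape (2, 100000), C-contiguous
```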
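The simplest kind of surrogate mentioned above, a shuffle of the original data, can be sketched as follows (a minimal illustration, not the library's surrogate() implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(1000))   # strongly auto-correlated signal
surrogate = rng.permutation(x)             # same values, correlations destroyed

def lag1_autocorr(u):
    """Lag-1 autocorrelation of a 1-d signal."""
    return np.corrcoef(u[:-1], u[1:])[0, 1]
```

The surrogate has exactly the same one-point distribution as x (it contains the same values), while its autocorrelation drops to noise level, which is what makes it useful as a null model.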