Cross-field concepts are mathematical concepts shared across multiple fields of study, which are employed in Data Science as well.
To run the code here, you just need some imports:
import numpy as np
from matplotlib import pyplot as plt
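In thermodynamics, the entropy is defined as
$$dS = \frac{\delta Q_{rev}}{T}$$
where $\delta Q_{rev}$ is the heat exchanged reversibly by the system and $T$ is its absolute temperature.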
In statistical mechanics, Boltzmann defined the entropy as a measure of uncertainty and showed that this definition is equivalent to the thermodynamic one:
the entropy quantifies the degree to which the probability of the system is spread over different microstates, and is proportional to the logarithm of the number of possible microconfigurations which give rise to the macrostate.
Written down, this is
$$S = -k_B \sum_i p_i \ln p_i$$
(the sum runs over all the possible microstates, where $p_i$ is the probability that state $i$ is occupied and $k_B$ is the Boltzmann constant). The postulate is that the occupation of every microstate is equiprobable.
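As a rough numerical check of the equiprobability postulate, the sketch below evaluates $S = -k_B \sum_i p_i \ln p_i$ for a uniform distribution over $\Omega$ microstates and compares it with $k_B \ln \Omega$, the logarithmic form mentioned above. The function name `statistical_entropy` and the choice of 1000 microstates are illustrative assumptions, not part of the original text.

```python
from math import log

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def statistical_entropy(probabilities):
    """S = -k_B * sum_i p_i ln p_i over the microstate probabilities."""
    # States with p = 0 are skipped, since p ln p -> 0 as p -> 0
    return -K_B * sum(p * log(p) for p in probabilities if p > 0)


n_microstates = 1000  # hypothetical number of microstates (Omega)
s_uniform = statistical_entropy([1.0 / n_microstates] * n_microstates)
s_log_count = K_B * log(n_microstates)  # k_B ln(Omega)

print(s_uniform, s_log_count)
```

With equiprobable microstates the two values should agree up to floating point error, which is why the entropy is proportional to the logarithm of the number of microconfigurations.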
In Information Theory, Shannon defined the entropy as a measure of the missing information before the reception of a message:
$$H = -\sum_i p_i \log_2 p_i$$
where $p_i$ is the probability that a character of type $i$ appears in the string of interest. This entropy measures the average number of binary (YES/NO) questions needed to determine the content of the message. The link between the statistical mechanics and the information theory concepts is debated.
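A minimal sketch of the Shannon entropy of a string, assuming that the probability of each character type is estimated by its relative frequency in the message. The function name `shannon_entropy` and the example strings are illustrative choices.

```python
from collections import Counter
from math import log2


def shannon_entropy(message):
    """Shannon entropy, in bits per character, of the characters in `message`."""
    counts = Counter(message)
    total = len(message)
    # p_i is estimated as the relative frequency of character type i
    return -sum((n / total) * log2(n / total) for n in counts.values())


h_uniform = shannon_entropy("abcd")   # four equally frequent characters
h_mixed = shannon_entropy("aabb")     # two equally frequent characters
h_constant = shannon_entropy("aaaa")  # a single character type

print(h_uniform, h_mixed, h_constant)
```

Four equally likely characters need two YES/NO questions each, two equally likely characters need one, and a message made of a single repeated character carries no missing information.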
In Ecology, the diversity index $D$ is defined as the effective number of different types (species) in a dataset among which individuals are distributed, so that it is maximised when all types are equally abundant:
$$D = e^H$$
where $H$ is the uncertainty in predicting the species of an individual taken at random from the dataset,
$$H = -\sum_i p_i \ln p_i = -\sum_i \ln p_i^{p_i} = \ln \frac{1}{\prod_i p_i^{p_i}}$$
which has at the denominator the weighted geometric mean of the $p_i$.
If all $N$ types are equally occupied, $p_i = 1/N \ \forall i$, then $H = \ln N$ ($H$ max)
If only one type is present, $N = 1$ and $p_1 = 1$, then $H = 0$
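The two limiting cases can be checked numerically. The sketch below assumes that the diversity is the exponential of the Shannon index, $D = e^H$ with $H = -\sum_i p_i \ln p_i$, and that $p_i$ is the relative abundance of type $i$. The function names and the abundance counts are made up for illustration.

```python
from math import exp, log


def shannon_index(abundances):
    """H = -sum_i p_i ln p_i, with p_i the relative abundance of type i."""
    total = sum(abundances)
    # Types with zero individuals contribute nothing (p ln p -> 0 as p -> 0)
    return -sum((n / total) * log(n / total) for n in abundances if n > 0)


def diversity(abundances):
    """Diversity D = exp(H), assumed here to be the effective number of types."""
    return exp(shannon_index(abundances))


d_even = diversity([10, 10, 10, 10])  # four equally abundant species
d_single = diversity([40, 0, 0, 0])   # only one species present
d_skewed = diversity([37, 1, 1, 1])   # one dominant species

print(d_even, d_single, d_skewed)
```

The even community gives a diversity equal to the number of species, the single-species community gives 1, and the skewed community falls in between.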
Given two distributions over the same set of events, $p$ and $q$, the cross entropy between them is calculated as
$$H(p, q) = -\sum_i p_i \log_2 q_i = H(p) + D_{KL}(p \| q)$$
where $H(p)$ is the entropy of the distribution $p$ and $D_{KL}(p \| q)$ is the Kullback-Leibler divergence of $q$ from $p$, also known as the relative entropy of $p$ with respect to $q$.
The cross entropy measures the average number of bits needed to identify an event drawn from the set when the distribution $q$ is assumed in place of the true distribution $p$. The KL divergence measures the difference between the two probability distributions, or, better, the information gained when the priors $q$ are revised in light of posteriors $p$ (in other words, the amount of information lost when $q$ is used instead of $p$). It is defined as
$$D_{KL}(p \| q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
Note that for continuous variables the sums become integrals.
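The decomposition of the cross entropy into entropy plus KL divergence can be verified on a small discrete example. This is a sketch under the assumption that logarithms are taken in base 2 (so that results are in bits) and that $q_i > 0$ wherever $p_i > 0$. The two distributions are arbitrary illustrative values.

```python
from math import log2


def entropy(p):
    """H(p) = -sum_i p_i log2 p_i."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log2 q_i."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log2(p_i / q_i)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]  # "true" distribution (illustrative)
q = [0.25, 0.25, 0.5]  # assumed distribution (illustrative)

h_p = entropy(p)
h_pq = cross_entropy(p, q)
d_pq = kl_divergence(p, q)

print(h_p, d_pq, h_pq)  # h_pq should equal h_p + d_pq
```

The KL divergence is zero when the two distributions coincide, in which case the cross entropy reduces to the entropy of $p$.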
The inverse participation ratio quantifies how many states a particle, or anything else described by a distribution, is spread over, and is defined as
$$I = \frac{1}{\sum_i p_i^2}$$
where $p_i$ is the probability of occupation of state $i$.
The extreme situations are:
If there is only one state, so that $N = 1$ and $p_1 = 1$, then $I = 1$
If there is an even distribution, so that $p_i = 1/N \ \forall i$ where $N$ is the number of states, then $I = N$
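Both extremes can be reproduced with a few lines of code. The sketch below assumes the definition $I = 1 / \sum_i p_i^2$ and a normalised distribution. The function name and the example probabilities are illustrative.

```python
def inverse_participation_ratio(probabilities):
    """I = 1 / sum_i p_i**2 for a normalised probability distribution."""
    return 1.0 / sum(p * p for p in probabilities)


n_states = 5
ipr_single = inverse_participation_ratio([1.0])                        # one state
ipr_even = inverse_participation_ratio([1.0 / n_states] * n_states)    # even spread
ipr_uneven = inverse_participation_ratio([0.7, 0.1, 0.1, 0.05, 0.05])  # in between

print(ipr_single, ipr_even, ipr_uneven)
```

An uneven distribution over five states gives a value between 1 and 5, which can be read as the effective number of states the distribution occupies.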
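With two states, occupied with probabilities $p$ and $1 - p$, we have
$$I = \frac{1}{p^2 + (1 - p)^2} = \frac{1}{1 + 2p^2 - 2p}$$
which has the shape shown in the figure below, where you can see that the maximum is at $p = 0.5$, that is, equally probable states (a fair coin).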
p = np.arange(0, 1.05, 0.05)     # probabilities from 0 to 1 in steps of 0.05
I = 1. / (1 + 2 * p**2 - 2 * p)  # two-state inverse participation ratio
plt.plot(p, I)
plt.xlabel('$p$')
plt.ylabel('$I$')
plt.show()
The "no free lunch" theorem is a concept that originated in mathematics (optimisation) but is often employed in Machine Learning. It asserts that the computational cost of finding a solution for a problem of a given class, averaged over all problems in the class, is the same for every method employed (see references). In short, you don't get anything for nothing (the "free lunch"). The phrasing seems to have its origins in an old practice of US saloons, where you could get food for free when purchasing drinks.
This means that there is no algorithm which is optimal on all possible problems, as its excellent performance on a problem is counterbalanced by bad performance on another problem.
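A toy enumeration can make this concrete. It is not a proof of the theorem, only an illustration under stated assumptions: a domain of three candidate solutions, objective values restricted to 0 or 1, a budget of two evaluations without revisiting points, and "best value found" as the performance measure. The three search strategies are hypothetical examples invented for this sketch.

```python
from itertools import product

DOMAIN = (0, 1, 2)  # candidate solutions (illustrative)
VALUES = (0, 1)     # possible objective values (illustrative)


def fixed_order(f):
    """Evaluate points 0 then 1, regardless of what is observed."""
    return [f[0], f[1]]


def reversed_order(f):
    """Evaluate points 2 then 1, regardless of what is observed."""
    return [f[2], f[1]]


def adaptive(f):
    """Evaluate point 0, then choose the next point based on what was seen."""
    first = f[0]
    second = f[1] if first == 1 else f[2]
    return [first, second]


def average_best(algorithm):
    """Mean of the best value found in two evaluations, over all functions."""
    # Every function from DOMAIN to VALUES, encoded as a tuple of its values
    functions = list(product(VALUES, repeat=len(DOMAIN)))
    return sum(max(algorithm(f)) for f in functions) / len(functions)


results = {
    alg.__name__: average_best(alg)
    for alg in (fixed_order, reversed_order, adaptive)
}
print(results)
```

Averaged over all eight possible objective functions, the three strategies reach the same score: any advantage one of them has on some functions is paid back on others.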
Also see reference for a deeper explanation.
D. H. Wolpert, W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, 1(1), 1997