Information and data are not the same thing, and not all data are equally informative. But how much information can a given piece of data contain? This question was first addressed in the 1948 paper “A Mathematical Theory of Communication” by MIT Professor Emeritus Claude Shannon. One of Shannon's breakthrough results is the idea of entropy, which makes it possible to quantify the amount of information inherent in any random object, including the random variables that model observed data. Shannon's results laid the foundations of information theory and modern telecommunications. The concept of entropy has also proved central to computer science and machine learning.
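To make the idea of entropy concrete, here is a minimal Python sketch of Shannon's formula for a discrete distribution, H = -sum(p * log2(p)), measured in bits. The function name and the example distributions are illustrative choices, not taken from Shannon's paper or from the MIT work.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative distributions (assumed for this example).
fair_coin = shannon_entropy([0.5, 0.5])      # two equally likely outcomes
biased_coin = shannon_entropy([0.9, 0.1])    # a more predictable coin
fair_die = shannon_entropy([1 / 6] * 6)      # six equally likely outcomes

print(fair_coin, biased_coin, fair_die)
```

The more predictable the outcome, the lower the entropy: the biased coin carries less information per toss than the fair one.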
But applying Shannon's formula can quickly become computationally intractable. It requires precisely calculating the probability of the data, which in turn requires accounting for every possible way the data could have arisen under a probabilistic model. This becomes a problem in realistic cases such as medical testing, where a positive test result is the outcome of hundreds of interacting variables, all of them unknown. With just 10 unknowns, there are already roughly 1,000 possible explanations for the data. With a few hundred unknowns, there are more possible explanations than atoms in the known universe, which makes calculating the entropy exactly an intractable problem.
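The blow-up can be sketched with a toy model. The snippet below assumes a hypothetical "noisy-OR" test: each of the unknowns is a yes/no cause that is present with some prior probability, and each present cause raises the chance of a positive result. The model and all of its numbers (`prior`, `leak`, `strength`) are assumptions made for illustration, not taken from the paper. Computing the exact probability of a positive test, and hence its entropy, means visiting every configuration of the unknowns.

```python
import itertools
import math

def positive_test_probability(n_unknowns, prior=0.1, leak=0.01, strength=0.3):
    """Exact p(test = positive), summing over every configuration of the unknowns."""
    total, n_explanations = 0.0, 0
    for config in itertools.product((0, 1), repeat=n_unknowns):
        k = sum(config)  # number of causes present in this explanation
        p_config = prior ** k * (1 - prior) ** (n_unknowns - k)
        # Noisy-OR: the test stays negative only if the leak and every present cause fail.
        p_positive_given_config = 1 - (1 - leak) * (1 - strength) ** k
        total += p_config * p_positive_given_config
        n_explanations += 1
    return total, n_explanations

def binary_entropy_bits(p):
    """Entropy in bits of a yes/no outcome that is positive with probability p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p_positive, n_explanations = positive_test_probability(10)
entropy_of_test = binary_entropy_bits(p_positive)
print(n_explanations, p_positive, entropy_of_test)

# With 300 binary unknowns the same loop would need 2**300 iterations,
# which exceeds the commonly cited estimate of about 10**80 atoms.
explanations_for_300_unknowns = 2 ** 300
```

Ten binary unknowns already require 1,024 terms, and every additional unknown doubles the work.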
MIT researchers have developed a new method for estimating good approximations to many information quantities, such as Shannon entropy, using probabilistic inference. The work is presented in a paper at the AISTATS 2022 conference. The key insight is to use probabilistic inference algorithms instead of enumerating all explanations: first infer which explanations are probable, then use those probable explanations to construct high-quality entropy estimates. The paper shows that this inference-based approach can be much faster and more accurate than previous approaches.
Estimating entropy and information in a probabilistic model is fundamentally hard, because it often requires solving a high-dimensional integration problem. Many previous works have developed estimators of these quantities for certain special cases, but the new estimators of entropy via inference (EEVI) offer the first approach that can deliver sharp upper and lower bounds on a broad set of information-theoretic quantities. Upper and lower bounds mean that although we do not know the true entropy, we can get one number that is smaller than it and another that is larger. The difference between the upper and lower bounds gives a quantitative sense of how confident we should be in the estimates. By using more computational resources, the difference between the two bounds can be driven toward zero, which “squeezes” the true value with a high degree of accuracy. These bounds can also be combined to form estimates of many other quantities that tell us how informative different variables in a model are of one another.
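The "squeezing" idea can be illustrated with a simplified importance-sampling sketch. This is not the EEVI implementation; it is a toy in the same spirit, under stated assumptions: a hypothetical noisy-OR test model with made-up parameters, and the prior used as the proposal distribution, so each importance weight is simply p(y | x). Averaging weights from the proposal alone tends to underestimate log p(y), which yields an upper bound on the entropy; swapping in the explanation that actually generated y tends to overestimate it, which yields a lower bound. Both hold on average, so individual runs carry Monte Carlo noise.

```python
import math
import random

# Hypothetical noisy-OR model; all numbers are illustrative assumptions.
N_UNKNOWNS, PRIOR, LEAK, STRENGTH = 10, 0.1, 0.01, 0.3

def sample_active_count(rng):
    """Draw the unknowns from the prior; only the number that are present matters here."""
    return sum(rng.random() < PRIOR for _ in range(N_UNKNOWNS))

def likelihood(y, k):
    """p(y | k causes present) under the noisy-OR model."""
    p_positive = 1.0 - (1.0 - LEAK) * (1.0 - STRENGTH) ** k
    return p_positive if y == 1 else 1.0 - p_positive

def exact_entropy():
    """Closed form in nats, available only because this toy model is so simple."""
    p = 1.0 - (1.0 - LEAK) * (1.0 - PRIOR * STRENGTH) ** N_UNKNOWNS
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def entropy_bounds(n_outer, n_particles, seed=0):
    """Monte Carlo estimates of a lower and an upper bound on H(Y), in nats."""
    rng = random.Random(seed)
    lower_sum = upper_sum = 0.0
    for _ in range(n_outer):
        # Simulate (x, y) from the model; k_true is an exact posterior sample for y.
        k_true = sample_active_count(rng)
        y = 1 if rng.random() < likelihood(1, k_true) else 0
        # Importance weights for particles drawn from the proposal (the prior).
        fresh = [likelihood(y, sample_active_count(rng)) for _ in range(n_particles)]
        upper_sum -= math.log(sum(fresh) / n_particles)
        # Replace one particle with the explanation that produced y.
        with_true = [likelihood(y, k_true)] + fresh[1:]
        lower_sum -= math.log(sum(with_true) / n_particles)
    return lower_sum / n_outer, upper_sum / n_outer

true_entropy = exact_entropy()
results = {k: entropy_bounds(5000, k) for k in (1, 20)}
for k, (lower, upper) in results.items():
    print(f"particles={k:2d}  lower={lower:.3f}  true={true_entropy:.3f}  upper={upper:.3f}")
```

In this sketch, using more particles per estimate is the extra computation that narrows the gap between the two bounds.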
The new method is especially useful for querying probabilistic models in areas such as medical diagnostics. For example, it can be used to answer new queries using rich generative models of complex diseases that were previously developed by medical experts.