Compare "3. THE SERIES OF APPROXIMATIONS TO ENGLISH" in http://people.math.harvard.edu/~ctm/home/text/others/shannon... (1948)
I imagine the timing of this post is correlated with the release of The Bit Player, the documentary about Claude Shannon. Haven't seen it yet but looking forward to it.
The article does a decent job at graphing and laying out some of the concepts of entropy for information theory, but I'm not sure who the target reader is, since prereqs are perhaps only slightly narrower than what one needs to read Shannon's paper and the article is really illustrating only a fraction of the concept.
It can perhaps work as a primer for what shows up starting on pages 10-11 of the original document. In any case, provided you grasp the mathematical definition of entropy through thermodynamics and the microstates-based definition through Boltzmann, as well as "basic probabilities" (expected value, typical discrete distributions, terms like "i.i.d."), you should be good to go. But then you might already know all this.
And if you do, and you like what you read, then the full original thing by Shannon is a delight to explore to truly grasp what has been so foundational to a lot of things since 1948.
Shannon’s: A Mathematical Theory of Communication
I like the original paper by Shannon "A Mathematical Theory of Communication" (http://people.math.harvard.edu/~ctm/home/text/others/shannon...) a lot. It is quite readable and also probably the most important paper in the field of information theory.
In my opinion, saying entropy is a measure of randomness is confusing at best and wrong at worst.
Entropy is the amount of information it takes to describe a system. That is, how many bits does it take, on average, to "encode" the state of the system?
For example, say I had to communicate the result of 100 (fair) coin flips to you. This requires 100 bits of information, as each of the 2^100 possible 100-bit vectors is equally likely.
If I were to complicate things by adding in a coin that was unfair, I would need fewer than 100 bits, as the unfair coin's two outcomes would not be equally likely. In the extreme case where 1 of the 100 coins is completely unfair and always turns up heads, for example, I only need to send 99 bits, as we both already know the result of flipping the one unfair coin.
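To put numbers on the coin example, here is a minimal Python sketch. It assumes the flips are independent, so the total entropy is just the sum of the per-coin entropies. The function names and the 0.9 bias are my own picks for illustration, not anything from Shannon's paper.

```python
import math

def coin_entropy_bits(p_heads):
    """Shannon entropy, in bits, of one coin flip with P(heads) = p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # by convention, 0 * log2(0) counts as 0
            total -= p * math.log2(p)
    return total

def total_entropy_bits(biases):
    """Entropy of independent flips: the sum of the per-coin entropies."""
    return sum(coin_entropy_bits(p) for p in biases)

# 100 fair coins
fair_100 = total_entropy_bits([0.5] * 100)
# 99 fair coins plus one coin that always lands heads
one_fixed = total_entropy_bits([0.5] * 99 + [1.0])
# 99 fair coins plus one coin that lands heads 90% of the time (assumed bias)
one_biased = total_entropy_bits([0.5] * 99 + [0.9])

print(fair_100, one_fixed, one_biased)
```

The always-heads coin contributes 0 bits and a fair coin contributes exactly 1 bit, so the first two totals are 100 and 99. The 90/10 coin lands in between, adding a bit under half a bit.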
The shorthand of calling it a "measure of randomness" probably comes from the problem setup. For the 100 coin case, we could say (in my opinion, incorrectly) that flipping 100 fair coins is "more random" than flipping 99 fair coins with one bad penny that always comes up heads.
Shannon's original paper is extremely accessible and I encourage everyone to read it. If you'll permit self-promotion, I made a condensed blog post about the derivations that you can also read, though it's really Shannon's paper without most of the text.