I like the original paper by Shannon "A Mathematical Theory of Communication" (http://people.math.harvard.edu/~ctm/home/text/others/shannon...) a lot. It is quite readable and also probably the most important paper in the field of information theory.
In my opinion, saying entropy is a measure of randomness is confusing at best and wrong at worst.
Entropy is the amount of information it takes to describe a system. That is, how many bits does it take, on average, to "encode" which of its possible states the system is in?
For example, say I had to communicate the result of 100 (fair) coin flips to you. This requires 100 bits of information, as each of the 2^100 possible 100-bit vectors is equally likely.
If I were to complicate things by swapping one of the coins for an unfair one, I would need fewer than 100 bits on average, as the outcomes would no longer be equally likely. In the extreme case where 1 of the 100 coins is completely unfair and always turns up heads, for example, I only need to send 99 bits, as we both already know the result of flipping the one unfair coin.
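To make the arithmetic concrete, here is a rough Python sketch of the coin example. It assumes the flips are independent, so the total entropy is just the sum of the per-coin entropies. The function names and the 0.9 bias are my own picks for illustration, not anything from Shannon's paper.

```python
import math

def coin_entropy_bits(p_heads: float) -> float:
    """Shannon entropy, in bits, of a single flip with P(heads) = p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0.0:  # convention: an impossible outcome contributes 0 bits
            total -= p * math.log2(p)
    return total

def total_entropy_bits(biases) -> float:
    """Entropy of independent flips: the per-coin entropies simply add up."""
    return sum(coin_entropy_bits(p) for p in biases)

# 100 fair coins
fair_100 = total_entropy_bits([0.5] * 100)
# 99 fair coins plus one coin that always comes up heads
one_fixed = total_entropy_bits([0.5] * 99 + [1.0])
# 99 fair coins plus one coin with an (arbitrary, illustrative) 90% heads bias
one_biased = total_entropy_bits([0.5] * 99 + [0.9])

print(fair_100, one_fixed, one_biased)
```

The first two values come out to 100 and 99 bits, matching the example above. The 90% coin lands in between at roughly 99.47 bits: its result is not known in advance, but it carries less than one full bit.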
The shorthand of calling it a "measure of randomness" probably comes from the problem setup. For the 100 coin case, we could say (in my opinion, incorrectly) that flipping 100 fair coins is "more random" than flipping 99 fair coins with one bad penny that always comes up heads.
Shannon's original paper is extremely accessible and I encourage everyone to read it. If you'll permit self-promotion, I made a condensed blog post about the derivations that you can also read, though it's really Shannon's paper without most of the text.