Or, the inherent backdoor in poorly encrypted VOIP without proper packet padding: http://cs.unc.edu/~fabian/papers/tissec2010.pdf
Hey, cool. Always nice to see this work show up on HN. But I don't think this is the paper you're looking for. In '08, we could only spot phrases that we knew in advance, and they had to be at least a certain length.
The most impressive results -- going from encrypted VoIP to text -- were done by Andy White and others, a couple years after the paper you linked above. It's this one:
A.M. White, A.R. Matthews, K.Z. Snow, and F. Monrose. "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks." In Proceedings of IEEE S&P, 2011. http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf
The paper linked in the article, "Phonotactic Reconstruction of Encrypted VoIP Conversations", specifically works when packets are not padded, because the compression scheme makes the length of each packet depend on the speech being encoded.
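To make the padding point concrete, here is a minimal sketch, not anything taken from the paper. The frame sizes are made up, and `observed_lengths` and `pad_to` are hypothetical names; the sketch also assumes the cipher preserves plaintext length, so an eavesdropper sees the codec's frame sizes directly unless every packet is padded to one fixed size.

```python
# Hypothetical frame sizes (bytes) that a variable-bit-rate codec might emit
# for successive frames. The numbers are invented for illustration only.
frame_sizes = [20, 38, 38, 62, 20, 46]


def observed_lengths(sizes, pad_to=None):
    """Return the packet lengths an eavesdropper would see.

    Assumes a length-preserving cipher, so without padding the ciphertext
    length equals the codec frame size.
    """
    if pad_to is None:
        # No padding: the size sequence leaks straight through encryption.
        return list(sizes)
    if sizes and max(sizes) > pad_to:
        raise ValueError("pad_to must be at least the largest frame size")
    # Padding every packet to one fixed size makes all lengths identical.
    return [pad_to for _ in sizes]


unpadded = observed_lengths(frame_sizes)
padded = observed_lengths(frame_sizes, pad_to=64)

print(unpadded)  # same sequence as frame_sizes
print(padded)    # every entry is 64
```

The unpadded sequence varies with the audio, which is the side channel being discussed; the padded one is constant, at the cost of extra bandwidth.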
The paper cited in this article (Phonotactic Reconstruction of Encrypted VoIP Conversations) really deserves to be highlighted, so I submitted it separately: