Review of The Information

All disciplines have interesting histories that explain their development, yet different fields seem to value their history differently. Art and music students are required to study art and music history in multiple dedicated classes. Computer science and mathematics students do not have to study the history of their fields in a typical undergraduate program (if you are lucky, maybe one or two classes touch on it). Instead, if the computer science or math student is lucky, they get a charismatic professor who is a good storyteller and fits in anecdotes about the creators of the topics being studied, or a pointer to a book on the subject.

Presumably, the history of art is useful to an artist, not only for telling interesting stories to their students one day, but also for understanding something deeper about where new art comes from. A deeper understanding may be useful for a number of things, including how to go about creating novel art. The creation of new abstract concepts in math and art has certain high-level similarities – both provide a new way to look at things. Perhaps the poster child for this type of thinking is Xenakis, who united architecture, statistics, algorithms, and music, and whose knowledge across those fields had an interesting synergy. But even if we ignore cross-discipline examples, I think we will find that the innovators typically have had an interest in the history. Is the assumption for the sciences that this is correlation, and not causation? That argument can be made, but it seems less likely to me. Or is the assumption that, in the interest of time, most undergraduate students don’t need to be innovators, and rather just need to understand how to solve the damn equation, not how it came to be?

Perhaps this is a straw-man. I may be generalizing from an unrepresentative experience, and it has been a number of years since I was in school. It seems like folks pursuing graduate degrees in math/computer science have more understanding of history, and because this information is easier to come by today than 20 years ago, people that self-study also pick this up naturally. In that case all I can say is that I was not aware of so much of the history involved in computer science, and I wish I had started my studies with something like James Gleick’s The Information.

Information theory is a relatively new field of the sciences. Of course, it did not spring out of nowhere. There are a few history-oriented books that describe its formation, but not many, and Gleick’s coverage is by far the widest I’ve seen.

The book has an excellent cast of characters, starting out with long-distance communication via African drums, Babbage/Lovelace and early computers, and Laplace. As the book develops, the more typical founding characters of information theory appear: Maxwell and his heat demon, Clausius and his entropy, Morse with his codes, Hartley, Shannon, Wiener, Turing, and Kolmogorov. What makes the book’s presentation special is the depth given to each character. There are a large number of supporting characters and competitors I hadn’t heard of, which provides great context for the developments. Naturally, a lot of time is spent on the juicy rivalries, such as the Shannon-Wiener relationship, but also on how it all fit into the rest of the world, e.g., how Kolmogorov felt about them.

I was introduced to a range of new connections that I was not aware of, including the Schrodinger (yes, that Schrodinger) connection to molecular biology and What is Life?. There were also nice teasers for the parts of information theory I haven’t had exposure to, such as quantum computing and Shor’s and Feynman’s thoughts on it. There are also deeper ties to fundamental math history, such as the early developments in the Greek and Arabic worlds, from Aristotle to al-Khwarizmi. I was also unaware of the amount of now-obsolete infrastructure required for telegraph networks, and the book spends a good deal of time on the logistics of this kind of thing.

I very much enjoyed this book, although it still misses a few important areas. Notably, Kullback’s application of information theory to statistics, as well as Bayesian statistics and the related information criteria are not mentioned. Deep learning is also not mentioned, but the book was published in 2011, before the recent surge. Naturally, Gleick also discusses the fictional works of Borges. Unfortunately as much as I enjoy Borges, I found this to be the weakest part of the book.

At 426 pages, Gleick’s presentation is almost entirely conceptual and non-technical, so I think this would be a great bedtime read for anyone interested in the topic who isn’t in a rush. For a faster and more technical approach, one might consider John Pierce’s book.

Noise bands from interpolating instantaneous frequency

Frequency bands are often used in analysis or input representation; for example, a mel spectrogram uses a number of bands of differing frequency widths to represent the signal. Frequency bands are also used in synthesis. However, synthesizing a frequency band of nonzero width is usually a noisy process. The most common approach is to generate wide-band noise, either with a white noise generator or an impulse, and filter it down to the desired band, for example with a band-pass filter or a high-Q IIR filter. When the band is narrow enough, the result resembles a sinusoid with some attractive roughness. FM can be used to create narrow bands as well, and made more complex by daisy-chaining FM stages. On the music page, the piece “December 9” uses daisy-chained FM and granular synthesis exclusively to create tones and rain-like sounds. Each of these has drawbacks and advantages depending on your use case – IIR filters might explode (go unstable), and daisy-chained FM is unstable in other ways. They are all pretty neat – somehow it’s enjoyable to turn noise into something that resembles a sinusoid.
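As a minimal sketch of the filtered-noise approach, here is white noise pushed through a high-Q two-pole resonator (an IIR filter). The sample rate, center frequency, and pole radius below are illustrative choices of mine, not values from the post.

```python
import numpy as np

sr = 44100
rng = np.random.default_rng(0)
x = rng.standard_normal(sr)          # 1 second of white noise

f0 = 1000.0                          # center frequency in Hz
r = 0.999                            # pole radius; closer to 1 means a narrower band
theta = 2 * np.pi * f0 / sr
a1, a2 = 2 * r * np.cos(theta), -r * r

# Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2].
# Keeping r < 1 keeps the poles inside the unit circle (stable);
# this is the knob that "explodes" if you push it past 1.
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
```

The closer `r` gets to 1, the more the output sounds like a rough sinusoid at `f0` rather than noise.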

I want to describe another technique for creating banded noise which I’ve used before, one that does the inverse: turning a sinusoid into noise. I’m fairly sure others have used it as well, but it doesn’t seem to be well documented. The basic idea is to start with a sinusoidal unit generator and linearly interpolate the instantaneous frequency from the previous target frequency to a new random target frequency, choosing a new target after a random duration whose average is inversely proportional to the band width. The phase advances monotonically each step. The result is a band of noise that is a perfect sinusoidal tone at zero width and white noise at full width.
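The idea above can be sketched in a few lines. The post leaves some details open, so the choices below – targets drawn uniformly around a center frequency, and exponentially distributed segment lengths – are my own assumptions, as are all the parameter names.

```python
import numpy as np

def banded_noise(n_samples, center_hz, width_hz, sr=44100, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty(n_samples)
    phase = 0.0
    freq = center_hz                       # current instantaneous frequency
    # Wider bands update the target more often (duration ~ 1 / width).
    avg_len = max(1, int(sr / max(width_hz, 1e-6)))
    i = 0
    while i < n_samples:
        target = center_hz + rng.uniform(-width_hz / 2, width_hz / 2)
        seg = min(max(1, int(rng.exponential(avg_len))), n_samples - i)
        # Linearly interpolate the instantaneous frequency over the
        # segment, advancing the phase monotonically each sample.
        for f in np.linspace(freq, target, seg):
            phase += 2 * np.pi * f / sr
            out[i] = np.sin(phase)
            i += 1
        freq = target
    return out

tone = banded_noise(44100, center_hz=440.0, width_hz=0.0)    # pure sinusoid
hiss = banded_noise(44100, center_hz=440.0, width_hz=200.0)  # banded noise
```

At zero width every target equals the center, so the output degenerates to a plain sine; as the width grows, the frequency wanders faster and the tone dissolves into noise.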

I may try to add more details to this post later.

Books I Read in 2020: Stats

At the very start of the year, I wanted to write in-depth reviews of books I enjoyed in 2020. Well, it’s already been a month and I haven’t gotten to it.
To kick things off, I’m going to make the task easier and simply list some of the noteworthy books I read in 2020, with one or two sentences each. And to narrow it down further, I’ll stick to stats-related books, because I read a few of them.

  • Science Fictions, Stuart Ritchie (2020)

  • This is a relentless, even thrilling debunking of bad science, starting with the replication crisis in psychology and ending up touching much more than I expected. The stories are captivating, and the explanations focus on the system’s perverse incentives for fraud, hype, and negligence more than on blaming individuals (although there is definitely shaming where it is called for). The criticism of Kahneman’s overconfidence in Thinking, Fast and Slow was refreshing to read, because he (and Yudkowsky in his Sequences) says something to the effect of ‘you have no choice but to believe after reading these studies’, which didn’t match the larger message about questioning your beliefs and updating them on new information. It is a good lesson, with a healthy helping of irony: rationalists, of all people, need to be reminded that they don’t know everything yet and should be less confident. Another great criticism was of Matthew Walker’s hugely popular and often unchallenged Why We Sleep, which I had taken at face value before reading this book.

  • Statistical Rethinking, Richard McElreath, 2nd edition (2020)

  • This is an excellent introduction to Bayesian statistics, and pairs wonderfully with the author’s engaging, enlightening, and entertaining recorded 2019 lectures on YouTube, as well as the homework problems on GitHub. Miraculously, McElreath manages to pull off a new video phenomenon that mixes statistics with hints of standup comedy. There are no mesmerizing 3blue1brown-like plots, but McElreath picks interesting problems and datasets to play with, and breaks each model down into the core components a newcomer would need to understand it. I also appreciate that the course is designed for a wide range of people, so there are very few assumptions about math background, but if you know calculus, information theory, and linear algebra, there are nice little asides that go deeper. It’s also great how up-to-date the book is – I didn’t expect to be so interested by the developments in Hamiltonian Monte Carlo from the past few years, but it seems the field is undergoing rapid development. This book also helped me understand causal inference and confounders, and is a great follow-up to the casual-reader-oriented Pearl book mentioned below.

  • The Book of Why, Judea Pearl (2018)

  • This book was the first thing I read on causal inference and causality in general. It’s a light non-fiction book that assumes no math background, and does a good job of explaining how to answer the question of disentangling correlation and causation. The book has interesting problems and examples, such as how controlling on certain types of variables (colliders) can actually cause you to find spurious causation, and how to go about showing that smoking really does cause cancer when it is unethical to do a randomized controlled trial and there are tons of confounders. There is a fair amount of interesting history in it, and because of that, there are traces of personal politics that seem slightly out of place, but it doesn’t detract from the book too much. This book is sort of a teaser for Pearl’s deeper textbook, Causality, which I haven’t read. The Book of Why doesn’t really go into how you would create a model of your own, or how such a Bayesian model would compare to the deep neural network/stochastic gradient descent models that are driving the computing industry today. Perhaps Causality covers some of these things, but I felt the book would benefit from a few comparisons between the popular frameworks, with a toy problem that deep learning can’t solve. Still, it could be argued that this was not the point of the book. In any case, it was an enlightening introduction that gave me other questions to pursue, which is the type of book I am after.

Log probability, entropy, and intuition about uncertainty in random events

Probability is a hard thing for humans to think about. Setting that aside, there are a whole bunch of fields that care about log probability. Log probability is an elemental quantity of information theory. Entropy is a measure of uncertainty, and at the core of information theory and data compression/coding. Entropy is the negative expected log probability:
H(X) = -\sum_x p(x) \log p(x) = -\mathop{\mathbb{E}}[\log p(X)]

For a uniform distribution this simplifies even further. A fair coin toss or die throw, for example, has uniform probability over heads/tails or over the faces, and we get:
H(X) = -\sum \frac{1}{n} \log \frac{1}{n} = -n \frac{1}{n} \log \frac{1}{n} =  -\log \frac{1}{n} = \log{n},
where n = 2 for a coin flip and n = 8 for an 8-sided die (because there are 2 and 8 possible values, respectively). So with a base-2 log, we get H(X_{coin}) = \log{2} = 1 for the coin flip, and H(X_{d8}) = \log{8} = 3 for an 8-sided die.

One thing that might not be immediately obvious is that this allows us to compare different types of events to each other. We can now compare the uncertainty in a coin flip to the uncertainty in an 8-sided die. H(X_{d8}) = 3 H(X_{coin}), so it takes 3 coin flips to have the same uncertainty. In fact, this means you could simulate an 8-sided die with 3 coin flips (but not with 2) using a tree structure: the first flip determines whether the result is in 1-4 or 5-8; the second narrows it to 1-2 vs. 3-4 (after heads) or 5-6 vs. 7-8 (after tails); and the last flip resolves which of the two remaining numbers the die lands on.
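The tree scheme above can be written out directly: three fair coin flips decide the three bits of a number from 1 to 8. The function name is just illustrative.

```python
import random

random.seed(0)  # for reproducibility

def d8_from_coins():
    roll = 0
    for _ in range(3):                 # one fair flip per bit of entropy
        flip = random.randint(0, 1)    # 0 = tails, 1 = heads
        roll = roll * 2 + flip         # descend one level of the tree
    return roll + 1                    # map 0..7 to the faces 1..8

rolls = [d8_from_coins() for _ in range(10000)]
```

Over many rolls, each face comes up about 1/8 of the time, as a fair die should.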

You probably could have come up with this scheme for simulating die throws from coin flips without knowing how entropy is formulated. I find this interesting for a couple of reasons. First, it means there may be something intuitive about entropy already in our brains that we can dig up. Second, it gives us a formal way to verify and check what we intuited about randomness. For this post I want to focus on intuition.

The first time you are presented with entropy, you might wonder why we take the log of the probability. That would be a funny thing to do without a reason. Why couldn’t I say, ‘take your logs and build a door and get out, I’ll just take the square or the root instead and use that for my measure of uncertainty’, and continue with my life? It turns out there are reasons. I wanted to use this post to capture those reasons and a reference.

If you look at Shannon’s A Mathematical Theory of Communication, you will find a proof-based solution that’s quite nice. But even after looking at it, if you haven’t seen a convexity-based proof in a while, it can still be somewhat unintuitive why there needs to be a logarithm involved. Here is a less formal and incomplete explanation that would have given me more perspective on the problem.

There are a few desirable properties of entropy that Shannon describes. For example, entropy should be highest when all events are equally likely. Another concerns how independent events, like coin flips or dice rolls, combine their possibilities: the number of joint outcomes is exponential in the number of flips or rolls. So if I compute the entropy of one coin flip and the entropy of another coin flip and add them together, the sum should equal the entropy computed over the two coin flips taken as a single joint event.

If you want a measure that grows linearly in the number of coin flips or die rolls and achieves this additivity property, then taking the logarithm of the number of combinations gives you exactly that; no function that isn’t a logarithm will. This is because the number of possibilities for n coin flips is exponential: notice that \log_2(2^n) = n, where n is the number of coin tosses and 2^n is the number of possible outcomes for n tosses. So the log inverts the exponential growth from combining multiple events, which gives entropy its linear behavior in n.
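The additivity property is easy to check numerically: the entropy of the joint distribution of three independent fair flips, computed directly over all 2^3 outcomes, equals three times the entropy of a single flip. The helper name is mine.

```python
import math

def uniform_entropy(n_outcomes):
    """Entropy in bits of a uniform distribution over n_outcomes values."""
    p = 1.0 / n_outcomes
    return -sum(p * math.log2(p) for _ in range(n_outcomes))

h_coin = uniform_entropy(2)        # one fair flip: 1 bit
h_joint = uniform_entropy(2 ** 3)  # all 8 joint outcomes of 3 flips: 3 bits
```

A square or a root in place of the log would break this: (2^3)^2 is not 3 times 2^2, but \log_2(2^3) is exactly 3 times \log_2(2).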

Nostalgia, Music, and Utility Functions

We accept that certain events early in our lives influence us with a permanence that lives on in our identity. For many people, music will be one of those events. In the opening of High Fidelity, the main character sarcastically warns about the dangers of music for kids. In this post I want to consider the aftereffects of music (such as nostalgia) after the novelty wears off. “Sentimental music has this great way of taking you back somewhere at the same time that it takes you forward, so you feel nostalgic and hopeful all at the same time.” – also from High Fidelity, and closer to today’s topic.

The older I get, the fewer surprises there seem to be. I should probably restate that: there are still many surprises, but the intensity and the excitement – the amount something grips me – seem to go down with age. This is a bit depressing if you look at it one way, but I don’t think it has to be. Rather, it seems like a natural consequence of understanding more about the world. The most obvious analogue is film and literature – at first you can start with the classics, and they are all amazing. Then eventually you reach a point where you can see how those classics influenced the newer works, gaining more perspective and insight into the world. This is quite an interesting feeling too. But the novel spark of that first foray into the arts is really something special.

Where did those bright-eyed, wonderful times go? I’m reminded of them most viscerally when I put on music. Sometimes I talk or write about experimental computer music. But the particular music that does this for me is only a few specific bands and genres that are now very far from me culturally – the teenage and college years of grunge, pop-punk, emo, and indie rock, from the Descendents to the popular Garden State soundtrack. Near 40, I still feel a certain strong and invincible euphoria when listening to these genres, even though it’s been years since I listened. If I get into it and shake my head and air-drum, it is even better. The music wired up a strong pathway for belonging and excitement in my brain, and then the neurons were set to read-only, to last far longer than I would have thought. When I am 80, will it still be there?

At the same time, the act of listening to this music feels more than a bit weird. Perhaps it feels masturbatory or escapist. There is a disconnect between who I am now and who I was then, and the music is an instant portal that revives those youthful brain pathways.

And if we take it one step further, where did those bright eyes and times of wonder come from? The act of falling in love with a (sub)culture because of someone you loved is one of the slyest and most curious phenomena near the heart of western individualism. The association with a beautiful face that you might get to see if you go to the next concert. Culture is much more than a sexually transmitted meme. The acceptance from dressing a certain way and having a crew is really powerful, and dancing in unison to a beat is a primal joy even if the footsteps are different. For me, it was always the elusive nature and the promise that glory was just around the corner. The Cat Power song goes “It must just be the colors and kids that keep me alive, because the music is boring me to death”, which is something a popular artist might be able to say after overwhelming success. But I was always on the outskirts, never really feeling in the middle of the culture that I loved. The colors, the kids, they were great, but the thing that lasted for me was the music. And I think, just as with the friendship paradox, this is the case for most people.

What would it take to find something like that again? Is it a matter of age, hormones, and college or can we recreate that somehow? Would I even want to if I could? Why can’t I, say, take these nostalgic feelings from music that I grew out of and ‘install’ them into music and things I am interested in now?

This is related to rewriting your utility function, which is a very scary thing that will literally annihilate your identity if you get it wrong. Just as we are programmed through evolution to love doing certain things – to feel joy from the wind going through your hair on a good walk, we are programmed through society and culture to a lesser extent during our developmental youth. But we do not get to swap the way we feel about walking through nature and thinking about high dimensional spaces. The latter has an intellectual pleasure for sure, but it is less ‘natural’. If we could do this, the world would be very different in very little time. I am not sure if we would converge or diverge as a species, but I would posit that if you could make being good at your job feel like dancing for the average person, productivity would blow up globally and previous social problems would disappear in place of newer probably scarier problems. But this is a rabbit hole that has many tropes and discussions such as wireheading that I would like to leave aside for now. I just want to note that it seems that technology is still a ways off from this, but this feeling I get when I listen to the right music is a sneak preview. It is frightening and feels great at the same time.