03 Jan 24
06:45

AR/VR 3D Audio editors in 2024

I’m curious what the ideal 3D audio editing interface for casual editing/playback would be like in VR or AR. ‘3D audio editors’ might not be the right term here. There are a few companies, including one I ran into recently called Sound Particles in Leiria, Portugal, that produce professional audio editors for post-production, including cinematic rendering of battle scenes and the like, where spatial audio is important and capturing the 3D scene is key. 3D scene management (tracking the position of the camera and all of the entities) is the core of game engines and CGI.

I’m actually interested in something else: audio editing in a VR (or AR) context, where you want to mix some tracks, or edit a podcast, or do the typical things you’d do in a 2D audio editor like Audacity, where scene and entity locations aren’t the primary concern. I wasn’t aware of this kind of editor, but I bet something exists, either as FOSS or a commercial app, and if not, definitely in academic research. So here’s what I found.

Before I dive in, here are my relatively naive thoughts on possible VR applications in the editor domain. I’ve developed a few audio editors, including being part of the Audacity team (2008-2012) and making an audio editor on iOS called Voicer that I retired a few years ago. But I haven’t been very close to editor development for a while.

  • Spectral envelopes/3D spectrograms are a fairly obvious physical mapping to do, and kind of fun to look at, as evinced by ’90s music videos and Winamp. However, most people that I know prefer waveform to spectrogram editing. At the end of the day the artefact being produced is a time-domain waveform, and spectra are literally convoluted with respect to time, leaving the user uncertain about whether any spectral edit would add an unintentional click or blurriness. Another way to explain this: spectra are computed over time windows, so if we plot spectrograms in 3D with one of the axes being time, there is ambiguity about what editing a certain point should mean. Another issue is that there are numerically impossible power spectrograms because of the overlap in time, but there are no impossible waveforms, since the waveform is the ground truth (see the sketch after this list).
  • Providing a waveform interface is still important. Being able to zoom and accurately select, move, cut, and apply effects to a waveform is the core of an audio editor. The waveform provides quite a bit of information: it’s easy to tell if there is a tonal or noisy component present when zoomed in, and the RMS information at zoomed-out scales gives a reasonable amount of info about where the events lie in time. 3D elements might be used to add more space for large numbers of tracks, or possibly for positioning sources in a stereo or ambisonic field.
  • It’s now obligatory to mention why AI won’t make any new tool obsolete before getting started. So why not make a 0-D audio editor that is just a text box telling the AI what to edit? If it worked well enough, that would capture some percentage of the market (e.g. removing the pauses from a recording is already a popular use case). Generative audio will become more useful for audio creators too. But it’s still a while before we capture what human audio editors do. There is a lot of audio data, but little collected data about the creative process of editing audio. Editing is also a highly interactive process with necessary trial and error, where the trials and errors build the aesthetic sense and the deeper understanding of the objects behind the audio that reveals the next step to the editor. I think as long as humans want to be creative, we will need audio editors.
  • Audio editing capability has been stagnant until recently. Although I worked on Audacity, I was always rooting for something better to come around. In fact, one of the reasons I worked on it was because it had obvious issues that could be resolved (multithreading, hardware interface UI). Sound Forge was my favorite audio editor in the early 2000s. When I talked to sound engineers, they mostly wanted something that was fast, accurate, and reliable, with some preferring support for certain plugins. They don’t need a DAW for everything, but everything basically turned into a DAW. The basic linear interface wasn’t really improved on; support for more tracks or inputs was just added. This could mean that innovation in interfaces is highly constrained, because what we have today gets us there eventually without having to relearn anything or encounter problems with a new interface. Because of this, I would consider VR editors better suited as a hobbyist or research project than a business venture.
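
To make the ‘numerically impossible spectrograms’ point concrete, here is a minimal sketch (my own addition, not from the post, using scipy.signal’s stft/istft with default overlapping Hann windows and arbitrary values): an edited magnitude spectrogram generally doesn’t survive a resynthesis round trip, because overlapping analysis windows constrain neighboring frames.

```python
# Any waveform has a valid spectrogram, but an arbitrarily edited spectrogram
# generally has no waveform that reproduces it exactly.
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)               # 1 s of noise at 16 kHz

f, t, Z = stft(x, fs=16000, nperseg=512)     # overlapping Hann frames
Z_edit = Z.copy()
Z_edit[50:60, 20:30] = 0.0                   # a "spectral edit": zero out a patch

# Resynthesize from the edited STFT, then re-analyze.
_, y = istft(Z_edit, fs=16000, nperseg=512)
_, _, Z_round = stft(y, fs=16000, nperseg=512)

# The re-analyzed spectrogram does not match the edited one: the edit was not
# a consistent (realizable) spectrogram, so istft produced a compromise signal.
m = min(Z_round.shape[1], Z_edit.shape[1])
err = np.max(np.abs(np.abs(Z_round[:, :m]) - np.abs(Z_edit[:, :m])))
print(f"max magnitude mismatch after round trip: {err:.4f}")
```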

Here’s what I found:

  • Immersive VisualAudioDesign: Spectral Editing in VR (Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion): A university-driven research project that made a nice analogy between morning (low-elevation) shadows cast by mountains and spectral masking. Also cool that paper prototyping for academics is a thing. I remember it catching on for game dev and mobile app design in the late ’00s and thought it would make more sense in other areas as well. It works for spectrograms because they are a linear physical-to-physical mapping. This project doesn’t seem to have been developed further, though.

  • There are a few synth/patch programming VR projects (one called SoundStage from at least 2016), but I don’t consider these audio editors. Music production/education is an interesting area to look into as well, and probably a lot of fun.
  • Almost all VR audio editor searches return results on how to create audio for VR, which is what I expected. There might not be a lot of demand. OTOH, I feel like people on the Meta Quest really like novel utilities.
  • The Sound Particles interface is clearly designed for a 2D/DAW scene-entity paradigm, which I said wasn’t the focus of the post, but it’s actually the closest example I could find to something that you could drop into VR, since it renders the audio scene visually in 3D.

So I didn’t do a lot of research, but I promised myself to do enough to make a hot take post in a day. I feel like there isn’t much, probably due to lack of demand and a history of audio editing progress being relatively incremental, but that also means it’s possible to do something fun in the space, even if being useful still seems like it would take some time to dial in. So there you go. Please let me know if you know of any interesting editors, applications, or research in the area.

31 Dec 23
05:48

Code mysticism and Halt and Catch Fire post LLMs

I’ve been re-watching AMC’s epic series Halt and Catch Fire. As a historical drama covering computer and internet developments, and the ambitious seizing of opportunity, from the ’80s into the ’90s, it was nostalgic and motivating to watch when it came out in 2014. It never really got that popular. It was scoped somewhat like Mad Men, covering several years per season, with a much smaller budget, and there are some holes in the writing, including a few two-dimensional characters. Still, it fills a unique niche and has a solid fan base. It’s one of a few series I re-watch occasionally.

HACF breaks the main roles up into hardware/systems engineers, software ‘creative’ engineers, investors, and vision/product people that work together or against each other as leaders or founders of a company. When I watch shows with plausible tech or science experts, it’s fun to see how expertise is communicated to a general audience. I’m not alone – there are many who track the technobabble in Star Trek (e.g. inverse tachyon beam) or noted how Emmett Brown endearingly lost his science cred when he called it a ‘jigawatt’. There is also the non-sarcastic appreciation of fictitious displays of tech expertise and the realism of the futuristic interfaces in movies like Hackers. One thing I liked about HACF is that despite using a decent amount of technobabble, it plausibly captures the approach and spirit of hacking and coding, like reverse engineering a chip by rigging up a hex LED system to read out the values for each of the 65536 inputs to a ROM.

The expertise of the coders is demonstrated mainly by others admiring the structural complexity of their code as an object of beauty. This is something that feels extra nostalgic now. Some examples from real life: Donald Knuth wrote The Art of Computer Programming. The book I learned 3D programming from was called The Black Art of Macintosh Game Programming (there were many like this in the ’90s/early ’00s, and this was the Mac version of the popular Andre LaMothe books). In 2008 when I joined Google Summer of Code, they gave everyone a free O’Reilly book called Beautiful Code, which covered a lot of real-world algorithms and problems that had ‘beauty’ in their code solutions.

I think this code mysticism was already a fading trend, but LLMs with their code generation have made coding seem a bit less magical because now I didn’t write it, or at least, I didn’t have to write it. Maybe getting rid of the woo is a good thing; maybe there’s a lot of ego behind appreciating code this way. But re-watching HACF made me think about some of the nice parts of banging your head against a wall and waking up with the solution that is just right for your constraints.

It’s also possible I’m not representative of others here. I should definitely ask some younger startup employees and grad students how they feel about coding. From my observation, people still appreciate good, elegant, clean, efficient code, and attribute expertise to the people that can produce this regularly and well. But it feels like the legendary 10x hackerman is slowly being made more approachable, with ‘average’ coders able to write better code, faster, by employing an LLM to fill in their gaps.

But the point I’m making is not about the hackerman rockstar, but about the worship of the code. It seems like a special kind of appreciation that is reserved for the arts. In most other science domains, expertise and feelings of admiration are generally attributed to the innovator, not the invention. Code takes on a life of its own. But now that something else is getting close to being able to create nice code, maybe that kills a part of the appreciation of the art. Here is a hot take that is certainly problematic: once something can be mass produced by sufficiently skilled artisans, it is not artistically interesting to make more of that thing. The thing becomes a craft – it can still be difficult for you or me to produce from scratch, but with the right tools and knowledge, a large enough percentage of the population can do it.

LLMs are not quite there yet for coding tasks that are beyond leetcode. If you ask one to do complex in-domain things related to speech/music DSP (e.g. please synthesize a flute with an ADSR envelope) or ML, it gets the outline but fails on the details. To me, the leetcode problems are more difficult, but the LLM has seen enough of them to have the right inductive bias. It’s also only able to make local code suggestions at the single function or class level – it starts to fail once you need a nice structure (e.g. using polymorphism gracefully in game entity objects/maps). The latter point – the structural complexity of classes/APIs/functions – is still out of the LLM’s grasp. It’s closer to the aesthetic of (physical) architecture that combines form and function with consistency. Maybe it’s difficult due to context window restrictions, but I won’t be surprised if it requires something extra to be able to learn this particular aesthetic.
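
For context, here is roughly the kind of in-domain task I mean; a minimal, hypothetical sketch of a ‘flute’ built from a few harmonics plus breath noise, shaped by an ADSR envelope (all parameter values are arbitrary illustrations, not a reference synth):

```python
# A flute-like additive tone shaped by an ADSR (attack/decay/sustain/release)
# amplitude envelope. Parameters are arbitrary, for illustration only.
import numpy as np

def adsr(n, sr, attack=0.05, decay=0.1, sustain=0.7, release=0.2):
    """Piecewise-linear ADSR envelope, n samples long."""
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n - a - d - r, 0)
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),      # attack
        np.linspace(1.0, sustain, d, endpoint=False),  # decay
        np.full(s, sustain),                           # sustain
        np.linspace(sustain, 0.0, r),                  # release
    ])
    return env[:n]

sr, dur, f0 = 16000, 1.0, 440.0
t = np.arange(int(sr * dur)) / sr
# A few weak upper harmonics plus a touch of breath noise, very roughly flute-like.
tone = (np.sin(2 * np.pi * f0 * t)
        + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
        + 0.1 * np.sin(2 * np.pi * 3 * f0 * t))
breath = 0.02 * np.random.default_rng(0).standard_normal(t.size)
note = adsr(t.size, sr) * (tone + breath)
```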

There is also a kind of beauty in low-level, highly constrained programming and system design, such as embedded systems and low-resource or highly reliable systems. This is an information-theoretic type of beauty that can be seen through an objective lens, with things like Kolmogorov complexity and reducing a solution to the smallest amount of object code. This is the type that is demonstrated a lot in the first season of HACF. I don’t think the LLMs are great at this either, since correctness and meeting precise resource requirements are things they struggle with. Since this can be made objective, it’s probably easier to approach for the LLMs, though I’m not sure how many people are focusing on this right now. For the past few years it’s been the case that more data trumps all, but I could see these kinds of things (and other things that require correctness and precision, like math proofs) requiring some special focus.

Perhaps part of the fading mysticism that I perceive is that these two areas of code appreciation are at the opposite extremes, and more of the actual problems and engineering are in the middle, at least for my career path. Fewer people need to roll out their entire stack from scratch on their own, and this is a good thing. The beauty still exists and is even more complex and interesting if you zoom out from an individual’s work to the team or organization. But the object is now missing that ’70s-’90s American individualism aspect that has been culturally ingrained in my generation. Again, maybe that is a good change. Maybe the mystery and appreciation of a relatively new frontier was part of a generation’s collective motivation for diving so deep into the matrix. I don’t know. I’m curious if this is just my own feeling or others notice this too. It’s entirely possible that this is just a way of coping with ‘losing to the machines’, or at least, losing ground to the machines. To be clear, I am fascinated by the research and application progress of ML and will continue to be. But I think it’s fine to be nostalgic about things too.

19 Nov 23
14:16

If conditional entropy only goes down, why does finding out some things make me more uncertain?

Conditional probability, P(X|Y), gives the probability of x given y, as opposed to the unconditional probability P(X). Conditional entropy, H(X|Y), measures the uncertainty in x as opposed to unconditional entropy H(X). It follows from information theory that H(X|Y) <= H(X), or in other words ‘entropy only goes down (or stays the same) when conditioning on additional information’. This is fine in the relatively clean world of communication of a message, but can be a can of worms if you try to interpret this in your life with real world, everyday scenarios.

It’s somewhat non-intuitive that conditional entropy can never go up. It’s straightforward that conditional probability doesn’t necessarily decrease, since probabilities must sum to 1, but entropy has no upper bound the way probability does, so it’s less obvious why conditioning can’t increase it. Does this translate into how we measure uncertainty for real-life predictions? Certainly it seems like sometimes we get more uncertain after receiving some information. In theory, Jaynes’ robot would adhere to the laws of entropy, and there is nothing to prevent Jaynes’ robot from being a human. So where does this mismatch come from?

At least one part of it is that the mathematical concept of entropy is well defined, while the psychological feeling of uncertainty has a cloudy definition with lots of emotional baggage. When you find out your business partner betrayed you, or your hometown was hit by a natural disaster, you may feel less certain about the world despite receiving additional information. However, this form of uncertainty is more akin to a psychological or emotional response, differing fundamentally from the mathematical concept of entropy; the two senses are essentially homonyms.

Another problem is that it’s common to ignore priors, so we write H(X|Y) instead of H(X|YI), but in practice we are always conditioning on some prior. The closest thing we have to unconditional entropy is some entropy we compute with the principle of indifference where the number of outcomes is known but nothing else, which can never be the case for real world problems (as opposed to abstract geometric problems like balls in urns or the Bertrand paradox which is about dropping sticks onto a circle). But there is more.

An even more interesting mismatch comes from receiving information that is not consistent with your previous (prior) information. In some cases, we can condition on information that can’t possibly be true for reasons we don’t quite see yet, and this causes us to have an incorrect estimate of entropy. If you ask a sheltered 16-year-old monotheist how likely it is that their god exists, many will say it is 100% because they are computing P(‘My god exists’ | ‘There is an old book that says my god exists, it seems important and well written; everyone around me thinks so too’). If you ask some of them after they have been exposed to college, philosophy, or the inner workings of other religions, suddenly it will be less than 100% for a sufficiently rational student. So the entropy ‘became’ higher after conditioning on more things, and indeed, they are more uncertain.

Another real-world scenario is present in the literature: What is the probability that a coin is fair (without flips)? What if the world has just one tricky Alice, who uses a fair coin only half the time, and ten Honest Bobs, who always use a fair coin, and you don’t know who will flip the coin?

Yes, the uncertainty can increase with certain additional information here – if you know that it’s Alice doing the flipping, you are no longer confident the coin is fair. But if a random person flips it, you are fairly certain it will be fair because there are so many Honest Bobs. This example highlights a subtle but important distinction: the rule H(X) >= H(X|Y) holds on average over all possible instances of Y, but it does not necessarily hold for a specific instance of Y. That is, H(X) < H(X|Y=y) is possible: if you know Alice is flipping, H(X|Y=Alice) = 1 bit. But H(X|Y) = (1/11)(1 bit) + (10/11)(0 bits) ≈ 0.09 bits is still near zero, since most of the time one of the many Honest Bobs will be doing the flipping. Meanwhile, the marginal is p(X=fair) = 10.5/11 and p(X=unfair) = 0.5/11, so H(X) ≈ 0.27 bits with the standard entropy formula, comfortably above H(X|Y).
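
Here is a small sketch of that bookkeeping (my own addition; the H helper and the numbers just restate the example above):

```python
# Entropy bookkeeping for the one-tricky-Alice, ten-Honest-Bobs coin example.
# X = "is the coin fair?", Y = "who flips". Alice uses a fair coin half the
# time, each Bob always uses a fair coin, and the flipper is chosen uniformly.
import numpy as np

def H(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_alice, p_bob = 1 / 11, 10 / 11

H_given_alice = H([0.5, 0.5])        # 1.0 bit: Alice's coin may or may not be fair
H_given_bob   = H([1.0, 0.0])        # 0.0 bits: a Bob's coin is always fair
H_X_given_Y   = p_alice * H_given_alice + p_bob * H_given_bob   # = 1/11 ~ 0.09 bits

p_fair = p_alice * 0.5 + p_bob * 1.0                            # marginal: 10.5/11
H_X = H([p_fair, 1 - p_fair])                                   # ~ 0.27 bits

print(H_X_given_Y, H_X, H_given_alice)   # 0.09 < 0.27 < 1.0
```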

If we go back to the case of the teenager who becomes more uncertain over time, this is clearly a different mechanism by which uncertainty increases, and a more subjective one. It does not consider all conditions, just the specific Y = ‘There is an old book that says my god exists, it seems important and well written; everyone around me thinks so too’, so no conditional entropy rules are violated. But the key is that entropy exists only in the eye of the beholder, not as a property of the coin, or in god. And you are allowed to use a completely incorrect probability distribution (more certain than is reasonable, or less certain than you should be given the conditioning). But if your distribution has no bearing on reality, you will pay the price (e.g. in bits, when using the wrong Huffman tree). Fixing overconfidence also means becoming less certain. One lesson here is that it’s probably best to leave conditional entropy at the door when thinking about specific conditioning, and use probability or evidence in dB instead. It’s a good thing when you realize that you’re incorrect and update your model. Jaynes invented a probability theory robot that tries to be relatively objective by being the ‘correct’ amount of certain, using things like the principle of indifference and maximum entropy. There is another step further that Jaynes takes, where he highlights the importance of the choice of hypotheses being considered beyond just the binary, but that is for another day.
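
As a quick numerical aside on ‘paying the price in bits’ (my own sketch, with arbitrary example distributions): the average code length under a mismatched model is the cross-entropy, which exceeds the true entropy by the KL divergence.

```python
# Cost of coding with the wrong model: the expected code length is the
# cross-entropy H(p, q), which exceeds H(p) by KL(p || q) bits per symbol.
import numpy as np

def xent(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-(p * np.log2(q)).sum())

p = np.array([0.7, 0.2, 0.1])    # the true symbol distribution
q = np.array([1/3, 1/3, 1/3])    # a mismatched model (here, uniform)

H_p  = xent(p, p)                # ~1.16 bits/symbol: the best achievable
H_pq = xent(p, q)                # ~1.58 bits/symbol under the wrong model
print(H_p, H_pq, H_pq - H_p)     # the difference is the KL penalty
```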

19 Jul 23
07:28

Brainstorming on Windows

The choice of a window in a discrete Fourier transform (DFT) is an art due to the tradeoffs it implies. When a time-domain signal with a single frequency (e.g. a sine tone) is transformed into the frequency domain, it creates a lot of other ‘phantom’ frequencies called sidelobes. The no-window choice of the rectangular window has a -13 dB first sidelobe whenever the signal’s frequency doesn’t fit an integer number of cycles into the window (when it does, there are no sidelobes). In practice, for audio processing, I see Hann and Hamming windows being used. The Hamming window is my personal favorite due to its sidelobe rejection of more than 40 dB, and it has a raised edge, which destroys less information. Also, Richard Hamming is cool (see: Bell Labs, Shannon’s manager, and the famous asker of the ‘Hamming Question’).
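
As a rough check on those numbers, here is a small sketch (my own, using scipy.signal.windows; the sidelobe-finding heuristic is ad hoc) comparing the highest sidelobe levels of the rectangular, Hann, and Hamming windows:

```python
# Compare sidelobe levels of rectangular, Hann, and Hamming windows by
# inspecting their zero-padded magnitude spectra (normalized to 0 dB at DC).
import numpy as np
from scipy.signal import windows

N, nfft = 256, 1 << 14
for name, w in [("rect", np.ones(N)),
                ("hann", windows.hann(N, sym=False)),
                ("hamming", windows.hamming(N, sym=False))]:
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20 * np.log10(W / W.max() + 1e-12)
    # The first local minimum after DC marks the edge of the mainlobe;
    # the highest value beyond it is the worst-case sidelobe.
    first_null = np.where(np.diff(W_db) > 0)[0][0] + 1
    print(name, round(W_db[first_null:].max(), 1), "dB")
# Approximate output: rect -13.3 dB, hann -31.5 dB, hamming -42.7 dB
```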

When I think about windows I usually think about the Fourier transforms of these windows. But they don’t tell the full story. Phase, as usual, complicates everything. Notice how the sidelobes are themselves periodic. Without even analyzing anything, this implies there is some complex rotation going on, and the DFT is just ‘sampling’ this rotation at regular intervals. This is why you can use a rectangular window on a signal with a certain frequency and get no windowing artifacts or sidelobes – in this case the sampling/aliasing of the window happens to fall exactly where the complex values are the same. So as you move away from the exactly aligned frequency to other frequencies, the sidelobes ‘pop’ into existence.
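
Here is a minimal demonstration of that ‘popping into existence’ (my own sketch; the bin numbers and the exclusion range around the peak are arbitrary):

```python
# Rectangular-window leakage: a sine with an integer number of cycles per
# window lands exactly on a DFT bin and shows essentially no sidelobes,
# while a half-bin detuning exposes them.
import numpy as np

N = 256
n = np.arange(N)
for cycles in (16.0, 16.5):                    # bin-aligned vs. half-bin offset
    x = np.sin(2 * np.pi * cycles * n / N)
    X_db = 20 * np.log10(np.abs(np.fft.rfft(x)) + 1e-9)
    X_db -= X_db.max()
    # Look well away from the peak bin (16): essentially nothing when aligned,
    # energy spread across many bins when misaligned.
    far = np.delete(X_db, np.arange(12, 22))
    print(cycles, "max far-bin level:", round(far.max(), 1), "dB")
```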

Generally, DSP practitioners just pick one window for an application and stick to it. In theory, we could use an adaptive window to reduce sidelobe confusion, but it is not simple. Using a convolution directly on the waveform in ML processing effectively assumes a fixed window (and often comes with windowed preprocessing), as in TasNet or wav2vec. Something about this feels like a Bayesian problem, since the choice of window provides different uncertainties conditional on the application and the properties of the signal. It’s fun to think about how an adaptive window might work. Adaptive filters are fairly common, of course (and virtually all ML processing is adaptive), but somehow the first thing that touches the signal – the window – is more difficult to touch. Perhaps this has to do with the relative stability that a fixed window brings. But you could imagine picking the window for each frame to reduce the uncertainty (a toy sketch below). The ‘easy’ way out is to use a very high sampling rate and/or a Mel spectrogram, which smooths out the sidelobes considerably for the high frequencies. Anyway, this is a brainstorm, and something to look at further in the future.
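
Purely as a toy sketch of that brainstorm (my own speculation, not an established method; the candidate windows and the spectral-entropy criterion are arbitrary choices), per-frame window selection might look like this:

```python
# Toy adaptive windowing: for each frame, pick whichever candidate window
# yields the most concentrated spectrum, using spectral entropy as a crude
# proxy for leakage/uncertainty.
import numpy as np
from scipy.signal import windows

def spectral_entropy(frame, win):
    mag2 = np.abs(np.fft.rfft(frame * win)) ** 2
    p = mag2 / (mag2.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def pick_windows(x, frame_len=512, hop=256):
    candidates = {
        "rect": np.ones(frame_len),
        "hann": windows.hann(frame_len, sym=False),
        "hamming": windows.hamming(frame_len, sym=False),
    }
    choices = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        scores = {name: spectral_entropy(frame, w) for name, w in candidates.items()}
        choices.append(min(scores, key=scores.get))   # lowest entropy wins
    return choices
```

A real version would also have to deal with resynthesis when adjacent frames use different windows, which is presumably part of why it is not simple.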

The other thing I want to dig further into is the complex phasor for these windows – in other words, how the real and imaginary values evolve with respect to frequency for the window. My guess is that it looks a lot steadier than the sidelobes in the typical magnitude frequency domain that we commonly look at, especially if we looked at the Dolph-Chebyshev window, which is designed to have flat sidelobes. This is not a big leap in perspective, and probably something people have looked at, but I don’t recall it being discussed in my signal processing education. I’d like a 3blue1brown-style animation of this.
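
Something like the following would be my starting point (a rough sketch; scipy.signal.windows.chebwin provides the Dolph-Chebyshev window, and the plotting choices are arbitrary):

```python
# Inspect the complex trajectory of window spectra: plot the real and
# imaginary parts of the zero-padded DFT versus frequency, not just the
# magnitude. The window's time offset contributes a linear phase (the
# rotation); centering the window around n = 0 would remove it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import windows

N, nfft = 64, 4096
wins = {
    "rect": np.ones(N),
    "hamming": windows.hamming(N, sym=False),
    "dolph-chebyshev (80 dB)": windows.chebwin(N, at=80),
}
fig, axes = plt.subplots(len(wins), 1, sharex=True, figsize=(8, 6))
for ax, (name, w) in zip(axes, wins.items()):
    W = np.fft.rfft(w, nfft)
    freqs = np.arange(W.size) / nfft          # cycles per sample
    ax.plot(freqs, W.real, label="real")
    ax.plot(freqs, W.imag, label="imag")
    ax.set_title(name)
    ax.legend(loc="upper right")
axes[-1].set_xlabel("frequency (cycles/sample)")
plt.tight_layout()
plt.show()
```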

10 Jul 23
08:19

Stochastic Processes in the real world

‘Uncertainty’ as a word is very useful in talking about probability theory with the general public. Because we associate fear and doubt and other emotions with uncertainty in casual usage, the word immediately gets across the point that at least some of the variance is due to the observer’s mental state, which translates well to the concept of prior information. But how much of the uncertainty in a random variable or parameter is in the observer, and how much is actually in the process?

Edwin Thompson Jaynes’ chapter 16 of Probability Theory: The Logic of Science (16.7 What is real, the probability or the phenomenon?) is very explicit in saying that believing randomness exists in the real world is a mind projection fallacy. The projection is that because things appear uncertain to us, we believe the generating process itself is random, rather than appreciating our lack of information as the source of the uncertainty. We had a discussion in our reading group about this – Jaynes being a physicist and all, it seems like a strong statement to make without qualifiers, since ‘true randomness’ might exist in quantum processes, like entanglement/wave functions, when an observer ‘samples the distribution’ by taking a measurement.

But if we take a step back, my reading group buddy mentioned that Jaynes’ field was closer to thermodynamics than quantum mechanics. So maybe he was only considering particular ‘higher level’ aspects of the real world that have deterministic properties. But in many fields, treating processes as stochastic is fairly common. Perhaps practitioners don’t worry about whether the stochasticity is part of the real world or not; usually humans don’t care about the territory as much as the map, because the map is literally what we see. I suppose the problem with this approach is when we encounter Monty Hall-like paradoxes, which are surprisingly infrequent in the real world, probably because things are tangled up and correlated for the most part. Below are some examples from my world where the stochastic process is considered. I don’t find these problematic, but they are kind of interesting to think about.

In discussions with machine learning/signal processing folk, I sometimes hear the distinction between stationary noise and signal, or between voiced and unvoiced speech, described as deterministic versus stochastic, with attempts to model them as such. My own group at Google even used such a strategy for speech synthesis. Here, ‘stochastic’ and ‘uncertain’ are interchangeable. If we knew the sound pressure levels at all points over the past millisecond within a 2.14 cm radius (343 m/s speed of sound divided by the 16 kHz sample rate), we would be able to predict the next sample at the center of the sphere with higher accuracy, even if it was Gaussian noise. Jaynes believes this uncertainty can be reduced to zero with enough information. For this case, it’s much easier than in quantum mechanics to see it as a deterministic process, since sound is mostly just pressure waves moving linearly through space.
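
A toy sketch of that deterministic-versus-stochastic modeling point (my own, using ordinary least squares over lagged samples rather than any production LPC; the parameters are arbitrary): a linear predictor captures a tone almost perfectly but gains nothing on white noise.

```python
# Linear prediction as a crude deterministic/stochastic probe: predict x[n]
# from the previous `order` samples and compare the residual to the signal.
import numpy as np

def prediction_gain_db(x, order=16):
    """Ratio of signal variance to linear-prediction residual variance, in dB."""
    # Design matrix of lagged samples: predict x[n] from x[n-1] .. x[n-order].
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return 10 * np.log10(np.var(y) / (np.var(resid) + 1e-20))

rng = np.random.default_rng(0)
n = np.arange(16000)
tone = np.sin(2 * np.pi * 220 * n / 16000)
noise = rng.standard_normal(n.size)

print("tone :", round(prediction_gain_db(tone), 1), "dB")    # very large gain
print("noise:", round(prediction_gain_db(noise), 1), "dB")   # near 0 dB
```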

Another connection from my perspective is to process-based ‘academic’ music, such as John Cage or Christian Wolff, and of course Iannis Xenakis, who explicitly references stochastic processes. Here, I think the term stochastic process tends to refer to randomness used explicitly (as in Xenakis’ granular synthesis) or implicitly (as in Cage’s sheet tossing in Cartridge Music or die rolling in other pieces). Die rolling music goes back at least as far as Mozart’s Würfelspiel, but I think Mozart thought about it more as a parlor trick than an appreciation of randomness. The Cage vs Xenakis style can be considered from Jaynes’ pure determinism stance, and gets even crazier if you consider the folks that believe consciousness arises from quantum processes, since Cage/Wolff often use the signalling/interaction between musicians to ‘generate entropy’.

I find that statisticians, or at least the statistics literature, tend to be much more opinionated than other areas of STEM, like computer science and math, to the point where it is quasi-religious, but I’m curious if insiders, e.g. in pure math, feel otherwise. Recent examples are Jaynes and Pearl, who make interesting arguments but spend at least a quarter of the text surveying history and preaching a bit. It makes them interesting to read, but also difficult to know if I’ve processed them well. Jaynes’ book is full of great examples that I feel I will need to look at from other perspectives.

At the end of the day though, I’m uncertain (no pun intended) if the real world contains random processes, and how these might bubble up to ‘real randomness’ in larger processes (or if the ‘randomness’ is aggregated in some central limit theorem-like or law of large numbers-type of way that ‘cancels it out’). I am certain that uncertainty exists in my mind, and that a good amount of it could be reduced with the right information. But I also like the information theory/coding framing where we treat some sources of uncertainty as things we don’t care about the fine details of (the precise order of grains of sand doesn’t matter for an image of a family on a beach). In this case we care about the grains of sand having some plausible structure (not all the black grains clustered together, but uniformly spread out, with some hills). This maps well to the way classical GANs or VAEs operate, injecting noise to fill in these details, which capture an incredible amount of entropy as far as the raw signal goes, and contrasts with the way recent GANs, Conformers, or MAEs typically don’t use any ‘input noise’ at all to generate all of the fine structure.

This was fun but it’s getting a bit rambly now, and I’m done for the day. I guess that’s what happens when I read and try to connect everything that I’ve done.