31 Mar 24

Brainstorming on why it’s still impossible to use deep learning to learn the Fast Fourier Transform

The Fourier transform allows one to take a time domain signal and turn it into frequencies with no loss of information.  It’s still at the front end of the vast majority of speech and audio ML systems, despite some pressure from learned kernels like TasNet.  The complexity of the fast Fourier transform is O(N log N), and interestingly, we can’t beat this yet with deep learning approaches.  This post is a collection of shower thoughts on this.

It’s a fair objection to say that a shallow linear network form of an FFT is not deep enough to be deep learning, but we don’t have many neural frontends that clearly outcompete the FFT either, and they are typically more computationally complex. The exception might be TasNet, which learns overcomplete basis functions of short length. But this is a different application where symmetry and perfect reconstruction are not needed. The spirit of this post is just about how a deep learning approach would take advantage of something *like* the structure of an FFT, which seems to be optimal and difficult to get to.

The question is still open whether there is a way to learn computationally optimal solutions with deep learning, but it is clear that it is not something we know how to do yet for the general case because of the way models are designed and trained.  Rectangular neural network layers are problematic for this purpose because they are a static shape.  They are convenient for GPUs and TPUs because at the core they are good at rectangular matrix multiplication.  We have ‘fake’ sparsity on these by using zeros in the matrix, but we still multiply them.  This is very different from even 15 years ago, when sparse matrix representations on CPU needed to skip the zeros out of necessity.  Good and clever tradeoffs like block sparsity and low rank multiplication exist, but again, these are all about keeping things rectangular.
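To make the ‘fake’ sparsity point concrete, here is a minimal pure-Python sketch that counts multiplications. The function names and the small example matrix are made up for illustration; real GPU kernels work very differently, but the accounting is the same: the dense version pays for every zero.

```python
def dense_matvec(matrix, vec):
    """Multiply every entry, zeros included, the way a dense kernel does."""
    mults = 0
    out = []
    for row in matrix:
        acc = 0.0
        for w, v in zip(row, vec):
            acc += w * v
            mults += 1
        out.append(acc)
    return out, mults


def to_sparse_rows(matrix):
    """Keep only (column, weight) pairs for the non-zero weights."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in matrix]


def sparse_matvec(sparse_rows, vec):
    """Skip the zeros entirely, the way a sparse CPU representation would."""
    mults = 0
    out = []
    for row in sparse_rows:
        acc = 0.0
        for j, w in row:
            acc += w * vec[j]
            mults += 1
        out.append(acc)
    return out, mults


# Hypothetical 4x4 weight matrix with 5 non-zero entries.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [4.0, 0.0, 5.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

dense_out, dense_mults = dense_matvec(weights, x)
sparse_out, sparse_mults = sparse_matvec(to_sparse_rows(weights), x)
print(dense_mults, sparse_mults)
```

Both versions return the same vector, but the dense one does one multiplication per matrix entry while the sparse one does one per non-zero weight.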

This is interesting to me on some deep level – it makes me recall a bit of the debate that happened in early AI, in the Dartmouth lore of the 50’s/60’s.  Minsky vs Rosenblatt involved one person showing how powerful stacked linear layers were, who was then devastated when the other tore him down with an XOR function and stole his girlfriend at the same time.  The similarity is not the drama (which I hope someone could give me more info on, because I’ve forgotten a lot of it), but that the obvious and vulgar XOR function being problematic for the linear NN is sort of like how we can do very complicated things with DNNs now but can’t learn an FFT for the life of us.  And just as there was a non-linear activation function that solved the XOR problem, I wonder what it would take to get us to be able to learn an FFT.

FFTs are a mostly solved problem, and a deep learning system that learns one won’t be changing the world because of its FFT implementation.  But a system that could learn an FFT probably would be useful for a lot of other problems where we don’t have a solution yet.

Have you ever seen one of those 3-D art pieces that are constructed of hanging bits that appear formless unless you look at them from a certain angle, where some interesting image appears? This is one from the modern art museum in Phoenix (unfortunately I forgot who the artist was, what the piece was called, and whether it even had an optimal angle, but this gives an idea). This is not far off from what we’re asking of the DL process, if you let the image be even more scrambled in 3D. Moving around the 3D object gradually changes your view’s projection matrix, just as training gradually changes a network’s weights. But for the most part there is really no smooth gradient unless you are within a few inches of the optimal viewing angle. The question is, are there some transformations you could do from far-from-optimal positions that would give a gradient? My completely ignorant intuition says there might be some useful transformation you could do with a subspace or subset of the image that would unlock a gradient, but this is something that requires more thought and is probably a problem people have worked on before. If you know of something like that please drop me a line.

There is a clear trend in the research that data and scale are mostly the priority, and that we should focus our efforts on larger, more interesting things.  But we should invest some time into learning the limits of the learning, if you will.  The rest of this post is going to be increasingly rambling, because these are unstructured notes on the topic.

Consider the discrete Fourier transform.  It’s easy to implement the DFT in N**2 operations, but an N log N algorithm exists: the FFT.  It’s objectively better in all regards except readability and ease of implementation.

As a reminder for why the FFT is faster: 

  • Take the sinusoid at frequency f multiplied by the input signal, and let’s call that sin_f(t) = sin(t * f * 2pi) * x(t).
  • sin_2f(t) has the same value as sin_f(t) at t==0, so you don’t need to recompute that.
    • Note that this is obvious because sin(0 * f * 2pi) * x(0) is zero since sin(0) is 0 – the key is realizing where the harmonic sinusoids overlap with the sinusoid at the fundamental frequency.
  • For any integer n, 2nf shares the same value at t=0 and t=.5 (assuming t=1 is the full time window).
  • For 4nf, it’s t=0, t=.25, t=.5, t=.75.
  • There aren’t going to be symmetries for 3nf other than t=0 between 3nf and f, but these exist between 3f and 3nf.
  • This equates to a lot of values we don’t need to recompute.  The highest frequencies have almost all of their values filled in.
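For reference, the two algorithms can be sketched side by side in pure Python. This is the standard textbook radix-2 Cooley-Tukey recursion written with complex exponentials rather than separate sin and cos terms, so it is not a literal transcription of the bullets above; the function names and the test signal are made up for illustration, and the input length is assumed to be a power of two.

```python
import cmath


def dft_naive(x):
    """O(N^2) DFT: every output bin sums over every input sample."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """O(N log N) radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # The even and odd samples each form a half-length transform.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One twiddle multiplication is shared by bins k and k + n/2.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, -0.25]
slow = dft_naive(signal)
fast = fft_radix2(signal)
max_diff = max(abs(a - b) for a, b in zip(slow, fast))
print(max_diff)
```

The reuse shows up in the inner loop: each product of a twiddle with `odd[k]` is computed once and feeds two output bins, and the same sharing repeats at every level of the recursion.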

Caching the sinusoids (the ‘twiddles’) in a lookup table would also be valuable here.  The cos() component can use the same table.  Note that we have to be selective with what symmetries we exploit here – for many frequencies only the first quarter of the sinusoid is actually required because of the symmetry that repeats every pi/2, but there are no savings there, since the table is one-time compute and restoring the full sinusoid from a quarter needs extra computation on every lookup.
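A minimal sketch of the shared table, assuming a table size N that is a multiple of 4 so that the quarter-period offset lands on an integer index. The names `SIN_TABLE`, `table_sin` and `table_cos` are hypothetical.

```python
import math

N = 16  # table size; assumed to be a multiple of 4
SIN_TABLE = [math.sin(2 * math.pi * i / N) for i in range(N)]


def table_sin(k):
    """sin(2*pi*k/N) looked up from the precomputed table."""
    return SIN_TABLE[k % N]


def table_cos(k):
    """cos(theta) = sin(theta + pi/2), i.e. the same table shifted by N/4."""
    return SIN_TABLE[(k + N // 4) % N]


# Compare against direct evaluation over three full periods.
max_err = max(
    abs(table_cos(k) - math.cos(2 * math.pi * k / N)) for k in range(3 * N)
)
print(max_err)
```

This stores the full period. Folding the table down to a quarter period would save memory but add index arithmetic and sign flips to every lookup.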

This process is done by manual inspection.  Could ChatGPT realize this?  It’s possible.  Could a modern neural network take advantage of this?  It seems very difficult.  There’s no incentive for the network to do this, and the gradient seems very uncertain between the traditional n**2 DFT matrix and the FFT one that is sparse.  

So one question is how to incentivize the network to use sparse and reducible structures.  L1/L2 regularization is a hack that won’t get us there because of this gradient issue. 

I remember coming across the following paper in 2018 which looked at just this, but did not propose solutions.

Are Efficient Deep Representations Learnable? – 2018 ICLR, MIT – tries to learn a sparse FFT with L1 regularization, and of course fails for the reasons I mentioned. Even when initialized with near-FFT weights, it is not able to recover once the added noise passes a threshold.

Note the steep cliff at the loss that shows this gradient issue.

The idea of reference is not typically built into NNs, but maybe should be. The FFT’s case is interesting because the structure of reference exists, but is complicated, related to harmonics, giving the famous butterfly structure.   

Efficient Architecture Search for Diverse Tasks (CMU) 2022 NeurIPS – uses the FFT, doesn’t learn it.

Do deep neural networks have an inbuilt Occam’s razor? – 2023 arXiv, Oxford.  Doesn’t do FFT, but asks about the Bayesian posterior for these networks appearing in efficient forms.  Claims the success of modern DNNs is due to an implicit Occam’s razor.  Has a method of artificially reducing compute.

Achieving Occam’s Razor: Deep Learning for Optimal Model Reduction – SUNYSB 2023 – uses a bottleneck to achieve minimal parameters.  I think the idea is to call the bottleneck a minimal representation that can discard the previous layers as a representation.  This is interesting, but not exactly novel.  How do they show the minimal parameters?  It doesn’t appear to have an independence term.  But more to the point, this doesn’t deal with the complexity involved in the transform to or from the representation, which is kind of my focus here.

It’s Hard For Neural Networks to Learn the Game of Life – 2021 International Joint Conference on Neural Networks (IJCNN), Los Alamos.  This is fun because there is the time component in the Game of Life.  Looks at it from the perspective of lucky random initial weights (lottery tickets).  Finds that networks rarely converge, and need many more parameters to converge consistently.  Minimal architectures fail with slight noise.  It’s hard to implement the joint sampling, so I’d recommend they do a single cell prediction only and see how that performs (the network can still take the 32×32 board as input, just predict the middle cell).

The boolean problem is interesting because it has to be sharp – predictions should be logits and sampled, probably.  But then, it should be jointly sampled, or if sequential, conditionally sampled to be sharp.  I don’t think they did that here, so it’s not surprising the result is bad (i.e. each cell is independently evaluated based on a logit value that is unconditional).

The other major issue is colinearity.  It’s easy to learn the same predictor twice and split it into two weights, which have infinite solutions (x + y) = k.  The only mechanism discouraging this is that doing this is not efficient – it restricts the amount of information that the network holds, since it’s possible to reduce x and y to a single weight x + y.  But here’s the problem, if x and y are not near zero, it’s very hard to reduce this, and the gradient won’t help.  L1 regularization will be indifferent to this, and L2 will actually encourage this kind of wastefulness, because it is the lowest when x = y = .5 k due to disliking larger values like x = k, y = 0.  (to be clear, L2 with x = y = .5k is 2(.5k)**2 = .5k**2 vs L2 with x=k;y=0 is k**2).  If you squint, this is similar to the sparsity/efficiency learning problem.
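The arithmetic in the parenthetical can be checked with a few lines of Python. The value k = 2.0 and the three example splits are arbitrary choices for illustration.

```python
def l1_penalty(x, y):
    return abs(x) + abs(y)


def l2_penalty(x, y):
    return x**2 + y**2


k = 2.0
# Three ways of splitting the same predictor weight k across x and y.
splits = [(k, 0.0), (0.75 * k, 0.25 * k), (0.5 * k, 0.5 * k)]

for x, y in splits:
    print(f"x={x}, y={y}: L1={l1_penalty(x, y)}, L2={l2_penalty(x, y)}")
```

L1 comes out the same for every split, so it gives no gradient toward collapsing the pair into one weight, while L2 is smallest at the even split x = y = .5k, the most redundant arrangement.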

These challenges may appear to run somewhat against the direction of the lottery ticket hypothesis, but they don’t contradict it.  The lottery ticket hypothesis says that there is an advantage to training subnetworks with certain initializations.  It says nothing about finding the optimal solution.  In fact, it shows evidence that training NNs the ‘standard way’ without lottery tickets produces results that are quite far from a better but non-optimal solution.

Genetic algorithms do seem promising for their ability to subdivide problems, and the ability to have variable length chromosomes.

Genetic programming was popular before and was a code solution.  Let’s see where they’re at.

Explainable Artificial Intelligence by Genetic Programming: A Survey – 2022 IEEE Transactions on Evolutionary Computation – examples are typical math operators/functions, but it also has image classification!  Has a nice table of real world applications.  

Deep learning typically only takes sequences via autoregression or attention.  These struggle to refer past a certain window.  Hierarchy and reference seem like one solution here.  Reference is built into attention, but only applies to the input.  In theory we’d like layer-free reference.

There is also the concept of optional residual layers to scale complexity.  The layers can reuse the same weights, or have new ones.

So let’s take code generating LLMs.  Assume the FFT has not yet been invented and you trained only in this world.  It should be possible to see the sparsity of the problem by using caching in the data to infer the structure of the new algorithm.  

Now let’s consider only a subset of LLM that uses tokens to represent assembly language code, or even object code.  Operations, constants, memory addresses for functions.  We could make things easier and add references by allowing functions, and possibly providing a library of common functions like print, read, memcpy.

In theory this makes the training easier than a high level programming language because of the reduced vocabulary.  But in practice, who knows.  We need a compiler that deals with this to handle the function tokens, since that’s not exactly visible in the bytecode except for push/pop/jump.  My guess is the part about LLMs being pretty bad at math will only compound in this kind of problem.  The other code generation research and projects are probably worth looking into, because they are aware of this.  AFAIK, there’s still nothing that can do high level consistency, but simple and contained problems like leetcode are ~70% there.  These are often very small, single function programs that have some computational complexity in them.  Because they are small, there isn’t as much chance for the consistency to fail as in a larger program.  But even then, the techniques involve post-training fixups like just generating many solutions, testing them, and discarding the incorrect ones.

What is a fun simple model I could write now?  Let’s assume the problem is ‘echo’ that appends ‘!’ to the end – it’s non-trivial to generate the assembly needed to carry the input to the output.  The easiest program to do is just hello world of course.  How would you train for such a program?  With the traditional LLM paradigm, the training data would cover all possible programs: for each, a lot of possible inputs paired with assembly that can generate the desired output (note that the program output isn’t required, but might help).  Then at inference you somehow pass the model the input/output pairs for your hello world program, and ask it to generate the assembly or object code.  

But of course, this generalization comes at a cost that is probably too high for today’s compute, so perhaps it’s better to focus on a subset of programs.  And even then, it’s not clear we’d get the optimal program even if it was correct.  To get an efficient program, we either have to rely on regularization, penalizing complexity, or hope that the inductive bias from efficient programs will create efficient programs.  The other way to do it is to train on various versions of the program that go from less efficient to more efficient, and have the model be aware of this.

Another consideration is how premature it is to worry about optimizing towards the ‘clever’ solutions like FFTs, and problems that have some beautiful symmetry that can be exploited for efficiency.  How many actual problems have these clean behaviors?  My intuition here is that most problems are not as ‘beautiful’ as the Fourier transform, but that it’s still useful to consider now, because basic things like colinearity and excess architecture size that we know exist in DNNs are strongly connected to this concept – we don’t just want to exploit symmetry, we want to reduce unnecessary asymmetry.  The FFT is a nice target that encapsulates both these problems because it has infinite data, is scalable, and has a well understood symmetry.  

So let’s get started with the echo example, assuming we have a syscall for print and leaving out string operators for now.  There are many ways to write this, but the key is to iterate over the input, copying and printing a character each step.  The loop should be a check for the null terminator, followed by an increment pointer, load char from memory to the register, print the register, and then jump back to the start.
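As a sanity check on the loop structure, here is a small Python emulation rather than real assembly. The memory list, the `ptr` and `reg` variables, and the list standing in for the print syscall are all hypothetical stand-ins. One ordering assumption is labeled in the code: this sketch loads the character before testing for the terminator and increments the pointer after printing, so that the first character is not skipped.

```python
def run_echo(memory, start=0):
    """Emulate the echo loop over a null-terminated buffer of byte values."""
    printed = []  # stand-in for the output of a print syscall
    ptr = start   # pointer register
    while True:
        reg = memory[ptr]         # load char from memory into the register
        if reg == 0:              # check for the null terminator
            break
        printed.append(chr(reg))  # "print the register"
        ptr += 1                  # increment pointer, then jump back to the top
    printed.append("!")           # the echo task appends '!' to the end
    return "".join(printed)


buffer = list(b"hello") + [0]
result = run_echo(buffer)
print(result)
```

Even this toy version needs five distinct steps in the right order, which gives a feel for how much structure a model would have to get exactly right when emitting assembly tokens.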

Often, we want program correctness first, and efficiency next.  But this is not always the case.  Consider audio codecs, which are usually lossy because we want a nice balance of efficiency and correctness, even if the correctness is above the threshold.  Since it’s hard to get DNNs to be exact for continuous output, (even learning the identity function with zero error is basically impossible), the balance of correctness and efficiency seems better suited for DNNs without special tricks.

If you think about this long enough in the context of genetic programming, it’s possible to see the (flawed) logic in intelligent design advocates’ arguments against evolution.  Irreducible complexity in biological systems is a mirage that disappears when you look closely.  And in fact we have nerve fibers and blood vessels in front of the rods and cones that reduce visual quality for no good reason other than something analogous to the random initialization that put them there, and it was hard to rearrange them.  However, I think it’s a fair argument to suggest that certain complexities aren’t approachable by certain systems.  Evolution is sufficient to create humans, but the FFT might be out of reach for the current deep learning paradigm.

Conceptually, the idea of replacing a value that is computed the same way is the idea of referencing another value (that is, sin_2f(0) should use the value computed by sin_f(0)).  

Hyperparameter search tools like Vizier, or even grid search, are one way of trying to identify the best tradeoff on the rate-distortion curve.  Network architecture search has seemingly lost steam since it was in full force a few years ago; 2018-2021 seemed like a peak. But even these methods were fairly rigid and probably wouldn’t allow the architectural changes to get to the FFT.

All this is to say, I still don’t know of a concrete method to approach this problem, but the concept of learnable references is something that seems like one area that deep learning has trouble with. We can hardcode the architecture to reference other values, such as we do with residual networks or skip connections, but we don’t have a great way of doing this across the entire network, or even within the same matrix. This is not my area, so there’s also a good chance I’m missing some important research here. But it does also seem like a fun area to work in.

01 Mar 24

The joy of the {game, simulation, physics} engine, and on implicit engines in large ML models

Writing a game with an engine is a special experience the first few times. At the core of the engine is some kind of loop. The loop is usually over a fixed time interval, but it can also be per event. The frequency of the loop doesn’t have to match the screen rendering frequency. It can be faster, since the game engine tracks the states of all of the entities and their interactions.
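One common way to decouple the two frequencies is a fixed-timestep loop with a time accumulator. The sketch below is a minimal version under made-up assumptions: times are integer milliseconds so the arithmetic is exact, frame durations are passed in as a list instead of being read from a clock, and the ‘game state’ is a single position moving at one unit per millisecond.

```python
def run_loop(frame_times_ms, dt_ms=10):
    """Advance the simulation in fixed dt_ms steps, rendering once per frame."""
    accumulator = 0
    position = 0
    velocity = 1  # units per millisecond (hypothetical)
    updates = 0
    renders = 0
    for frame_time in frame_times_ms:
        accumulator += frame_time
        # Run as many fixed-size updates as fit in the elapsed time.
        while accumulator >= dt_ms:
            position += velocity * dt_ms
            accumulator -= dt_ms
            updates += 1
        renders += 1  # a real engine would draw the current state here
    return position, updates, renders


# Three ~30 fps frames (33 ms each) with a 100 Hz update step.
position, updates, renders = run_loop([33, 33, 33])
print(position, updates, renders)
```

Here the state is updated several times per rendered frame, and any leftover time carries over to the next frame rather than being dropped.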

The interactions between entities include collision detection and defining what happens when entities collide (hit points reduced, force applied, etc). From these relatively simple rules, a complex system arises, and at some point you can’t predict what would happen for a given set of inputs, and have to observe the engine in action to understand it. After defining the rules and a few entities (like pong paddles, the ball, and the walls), there is another step where you observe the system living and breathing on the screen. And it is undoubtedly different somehow than you imagined it. There is a certain enjoyment I got from writing game engines when you realize that so many different combinations of things might happen as a result of the loop’s heartbeat, and it starts to feel alive. I imagine most coders feel similarly.

There are also rules behind an engine that aren’t explicitly concerned with interactions between two entities, such as gravity or how often some entity spawns, or how the sun rises and sets. These add additional complexity.
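Here is a toy sketch of how the two kinds of rules combine: a global rule (gravity) and an interaction rule (a collision with the floor). The `Ball` class, the constants, and the simple Euler integration are all assumptions for illustration, not taken from any particular engine.

```python
from dataclasses import dataclass

GRAVITY = -9.8     # global rule: applies to every entity, every step
FLOOR_Y = 0.0
RESTITUTION = 0.5  # fraction of speed kept after a bounce (made-up value)


@dataclass
class Ball:
    y: float
    vy: float


def step(ball, dt):
    """Apply the global rule, move, then resolve the floor collision."""
    ball.vy += GRAVITY * dt
    ball.y += ball.vy * dt
    if ball.y < FLOOR_Y:  # interaction rule: ball vs. floor
        ball.y = FLOOR_Y
        ball.vy = -ball.vy * RESTITUTION
        return True
    return False


ball = Ball(y=1.0, vy=0.0)
bounces = 0
min_y = ball.y
for _ in range(1000):
    if step(ball, dt=0.01):
        bounces += 1
    min_y = min(min_y, ball.y)
print(bounces, min_y)
```

Even with just these two rules, the exact number of bounces over a run is easier to find by running the loop than by reasoning about it, which is a small version of the unpredictability described above.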

The rendering pipeline can be seen as something that allows visualizing the game state, but typically will not affect it. You could render it to video, audio, or text, from various views if in a coordinate based engine.

What is the difference between a game engine and a simulation? With a simulation, typically you take something from the real world, and try to capture the phenomenon with rules and simplification (i.e. we don’t simulate most things at the sub-atomic level). With a game, you can be god, and create your own rules, and your own world. I mentioned earlier that eventually the engine produces more complex output than can be predicted. This feels like leverage – you can understand the individual rules, but not the output of the system. Without the rules, it would be very hard to create the output.

As I mentioned earlier, because simple rules in games can create complex output, it’s often very difficult or impossible to create the rules correctly for a desired output state. So there is usually a feedback loop as the developer observes, and not only corrects rules, but gets new ideas for what would be interesting. I’d guess it’s more intuitive for a game designer to start by going for a certain class of output states, but there are many games that are designed around a novel set of rules. For example, being able to reverse time in Braid.

With a simulation, the developer often needs to create many rules to match some abstraction of reality. An apt intersection of game development and simulation is the 3D game physics engine, since it has many constraints that we expect from our real world – like large masses needing more force to accelerate.

Lately, a number of posts driven by the Sora release have called into question to what extent these large machine learning models have a physics engine running inside them. You might recall earlier video generation tools like Google’s Imagen or Meta’s Make-A-Video.

These clearly have a lack of physics understanding and have the wiggly inconsistency that suggests the models aren’t quite sure about how objects should behave in the scene. Compare to Sora, which seems to mostly capture the scene as a game engine would render it, down to the reflections, with only a few artifacts that are observable on closer inspection. Does this model have something more like a physics engine built in? The answer seems like ‘sort of’ but not quite yet? If you are interested, I refer you to the discussions by Gary Marcus and Raphaël Millière.

To me, the more interesting question is about how generally an ML model can represent the data via rules as some kind of engine, and how this engine relates to the probabilistic output layer. Could an ML model reliably construct new rules? For example, if you tell it to imagine gravity exists between carbon atoms only. The last question, how you can make models relatively more reliable, seems to be answered to some extent by Sora, since it got rid of the wiggle and captured scenes more reliably. Having consistency over longer and longer spans of time seems crucial for anything like a physics or game engine.

Most of the video data we have is of the 3D world, since humans went out into the real world and captured it with a real camera. This translates super nicely to matrix multiplication, as any 3D programmer will tell you. I think the 2D game engine may actually be more interesting here, because it is divorced from this data, and usually comes from the game designer’s brain directly. If these models can capture something like this well and consistently, there should be a lot of interesting results we would see. Even better if we have interpretability/explainability built-in, so the rules within the models can be explicitly described and checked.

16 Feb 24

Trying Out Stable Cascade Local Image Generation

Stability AI released Stable Cascade on GitHub this week. It’s very open, and allows not only inference on a number of tasks from text prompting to in-painting but also allows training and fine tuning. It’s a three stage diffusion model, and they also provide pretrained weights you can download.

Here’s what I needed to do to get set up on Linux to play with inference. I’m running mint-xfce but it probably works the same or similar way on Ubuntu and other flavors. Here’s an example of what I generated:

Prompt: Anime girl finding funnel-shaped chanterelles (no gills) on moss overlooking Lisbon orange roofs from Monsanto park


  • Python, Jupyter
  • Probably at least 10 GB of VRAM (e.g. an RTX 3080), but at the time of writing I used 20 GB of the 24 GB on my RTX 3090


The GitHub instructions are not super clear yet, so here are all the steps I took.

  1. Clone the GitHub repo
    git clone https://github.com/Stability-AI/StableCascade.git; cd StableCascade
  2. Do whatever python environment thing you prefer to install the requirements or just run:
    pip install -r requirements.txt
  3. Download the weights. If you download other variants, the default script won’t work (you can change the model .yaml files if needed, more on that later)
    cd models; ./download_models.sh essential big-big bfloat16
  4. Run the jupyter notebook
    cd ..; jupyter notebook
  5. [Maybe optional, but I needed it] Reduce batch size to 2 from 4
    – Edit configs/inference/stage_b_3b.yaml and change batch_size to 2
    – Also search for batch_size = 4 in the notebook and change it to 2
    (Just doing one of these two changes causes shape errors)
  6. Change the prompt from the ‘anthropomorphic nerdy rodent’ to the image of your dreams
  7. Open the jupyter notebook from the CLI output in your browser if needed. Then run the script (Click the >> button to run all or shift-enter or > for individual blocks)

Other models

You can also download other models by reading the readme in the /models directory and editing or copying the .yaml files at configs/inference/stage_b_3b.yaml and configs/inference/stage_c_3b.yaml to use the other model files (lite/small or non bfloat16). You could probably run on smaller GPUs with this method.

Speed and Time

It takes about 20-30 seconds to generate two images at 1024×1024 with the big-big bfloat16 model.

It takes ~4 minutes to run the setup before generation can happen.

Downloading the weights took ~30 minutes on a 500 megabit connection (about 10 GiB).
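As a back-of-the-envelope check on those numbers, the snippet below treats the figures above as exact, which they are not (both the size and the time are approximate).

```python
size_bits = 10 * 2**30 * 8  # ~10 GiB of weights, expressed in bits
link_bps = 500e6            # 500 megabit/s connection

ideal_seconds = size_bits / link_bps               # time at full line rate
actual_seconds = 30 * 60                           # ~30 minutes observed
effective_mbps = size_bits / actual_seconds / 1e6  # achieved throughput

print(round(ideal_seconds), round(effective_mbps, 1))
```

At full line rate the download would take roughly three minutes, so the observed ~30 minutes corresponds to something under 50 megabit/s. That suggests the local connection was probably not the bottleneck, though I did not measure this.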

Importantly, it’s faster than the previous models. I would assume this is due to a more aggressive diffusion scheduling and getting the multiple stages of models just right (you can also consider stages as a part of the scheduling). There are 20 timesteps in the smallest (initial) model, and 10 timesteps in the second by default. These hyperparameters and architectural decisions seemed to me like they would be difficult when I first heard about diffusion models, and it makes sense they can squeeze more out of it here. I’m sure they’ve done other enhancements as well, but I haven’t even read the paper yet.


From the GitHub description though, it’s interesting to note the first stage is actually a VAE generator that works in the reduced latent space. This is different but related to how speech models are tending towards using a sampled generator that is like a language model (AudioLM) to generate latents that a decoder can ‘upsample’ into the final speech audio. One of the reasons for this is that the latent space for the autoencoder in SoundStream, while reduced in dimensionality, does not use the space as efficiently as it could for speech – for example, a random sequence in the latent space is not human babble, but probably closer to noise. The generator solves this problem by learning meaningful sequences in the space.


Stable Cascade is definitely better than the previous Stability AI models like Stable Diffusion XL, and the models from the free offerings they had on their website.

The model is less steerable than DALL-E 3 (although it’s hard to get DALL-E to do something exactly, you can usually get it to have most of the features you want). It’s hard to get it to draw several things at once – for example “Japanese american 150 lbs 5’11” programming on a laptop with a view of orange lisbon rooftops in the background” often only yields the Japanese American, and occasionally the laptop and the orange rooftop. In some cases, the image is noticeably distorted. There is a reason they typically request anime style and a single item in the demos. But it’s a step up from other off the shelf tools, and feels only slightly worse than DALL-E 3, where I pay ~5-10 cents per image. It also has a lot of extra functionality that can be built upon, so I’m very happy to have this tool. I’d guess it handles some subjects well and others not so much. The text handling is, however, significantly worse than DALL-E 3. I haven’t examined how much it censors, if at all, but that’s one of the limitations that sometimes cripple reasonable requests for ChatGPT+DALL-E.

Prompt: programming python dual screen 150lb Japanese American male 5’11 42 years old in Lisbon apartment with orange roof gelled side part black hair


Because of the errors it took about two to four hours of tinkering to get it all working. This is longer than anyone would want (though I did other things while waiting for downloads, etc.), but it is typical for me with research GitHub repos, and part of the reason why I am documenting it. In general I liked that the repo had the scripts and weights prepared, and none of the errors seemed like total blockers, especially with the issues tab on GitHub if I really got stuck. I hope we see more repos like this from other Open companies ;).

Extra notes

  • You can shift+right click to access the native browser menu to save the image output in the jupyter cells.
  • There are many other demos in the ipynbs.



Default settings allowed me to generate one batch of 4 images with the big-big model, and then all of the other 20 times I tried, it OOM’d when doing the final A stage. The only way I could work around this was to set the batch size smaller, from 4 to 2. You can edit the line in the ipynb notebook, but you have to also do it in the .yaml config file, or you’ll get some error about shapes.

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU 0 has a total capacty of 23.69 GiB of which 1022.19 MiB is free. Process 4274 has 954.00 MiB memory in use. Including non-PyTorch memory, this process has 18.79 GiB memory in use. Of the allocated memory 14.91 GiB is allocated by PyTorch, and 2.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
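The error message itself suggests trying max_split_size_mb. I did not test this; reducing the batch size is what actually worked for me. If you want to try it from inside the notebook, a sketch would be to set the environment variable in the very first cell, before PyTorch touches the GPU. The value 128 is an arbitrary assumption, not a recommendation.

```python
import os

# Must be set before PyTorch initializes its CUDA allocator,
# so put this in the first notebook cell, before any CUDA work.
# 128 is an arbitrary example value, not a tuned setting.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only targets fragmentation of memory that PyTorch has reserved but not allocated (2.44 GiB in the message above), so it may not be enough on its own if the model simply needs more memory than the card has.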

Redacted Bash history (in case I missed something)

1994 git clone https://github.com/Stability-AI/StableCascade.git
1995 ls
1996 cd StableCascade/
1997 ls
1998 cd inference/
1999 ls
2005 tree ..
2006 pip install requirements.txt
2007 pip install -r requirements.txt
2008 cd ..
2009 pip install -r requirements.txt
2010 ls
2011 cd inference/
2012 ls
2013 emacs readme.md
2020 cd ..
2021 ls
2022 cd models/
2023 ls
2024 emacs download_models.sh
2025 ./download_models.sh essential big-big bfloat16
2040 ./download_models.sh essential small-small bfloat16

09 Jan 24

Art and entertainment for AI entities

What kind of art or entertainment would hypothetical artificial intelligence entities of the future find interesting? Would they want to make a certain type of art? Would they appreciate human art? Would they appreciate art made by humans explicitly for them?

To be clear, this post is not about the art that Generative AI makes for humans, but rather about art made specifically for AI entities. But we can start with art made for humans as a way of getting there, since it’s not obvious at all what art made for AI would look like.

I want to avoid spending a lot of time defining art and entertainment, and especially avoid drawing a line between the two. So the rest of the post uses the term art to capture the highest possible art with the lowest of brows, and refers to something that is not directly needed for survival or well-being that is interesting to look at. Consider an AI agent that is not omnipotent, but is continuously online. What do they do with their idle time when there is no job, or they are waiting on a result? Today’s computers poll and wait for the next task, which is a possible outcome that does preserve energy, so the same behavior might be programmed in. But what if you don’t have that constraint for AI agents, and let them decide whether to sleep(0) or do something else? Do the agents do only purely useful research, or do they spend some time on something resembling what we call art or entertainment?

A great deal of human art focuses on a few ideas closely related to our evolutionary pressures:
* Sex
* Love
* Becoming Successful and/or Rich
* Defeating Enemies
* Overcoming Disasters
* Solving Mysteries
* Scary things
It’s possible that AI entities will want to learn more about humans, the same way we want to know more about our origins. This could turn out to be more interesting as a non-art discipline, like anthropology and history. However, art does have a way of capturing and transferring experience and perspective in a way that the humanities cannot.

One interesting thing about (non-vocal) music is that it doesn’t tie directly into language, and is therefore harder to connect with the concepts of evolutionary pressures. For example, it would be hard to communicate a love story using only an orchestra. Yet music is as old as time, and even when times were tough, humans made time to create and listen to music. Music begs for an explanation in this way. Sexual selection is one explanation (e.g. birdsong), evolutionary cheesecake (à la Steven Pinker) is another. Both are fascinating to think about. If music is cheesecake, cheesecake is very interesting, and such a deep reflection of humans that we not only consume it but dedicate academic study to it. But what is ‘evolutionary cheesecake’ for AI entities, and what does sexual selection correspond to for AI entities?

For ‘typical’ AI entities, sexual selection is not going to be directly a part of their fitness function, though connecting with and understanding other entities might be. AI appealing to human sexual desire seems like the oldest sci-fi trope there is, and remains one of the most viable means of hacking humans that we are sure to see more of. A few early AI safety folks including Nick Bostrom strongly warned against letting robots resemble humans. It turns out that we don’t even need the physical robot from the movie Ex Machina, and not even the rendered visuals like Replika is trying to do. People on Reddit’s r/ChatGPT subreddit are falling in love with ChatGPT’s live voice chat, and have even noted that the ‘Sky’ voice sounds like Scarlett Johansson in Her. And this is only using pure text to speech output and ASR input, without processing prosody at all. This is what the people want, and at least while AIs are trained by humans, it’s likely that it will be an implicit or explicit part of their training function. If this is an area that might produce art or entertainment, I wouldn’t be surprised if it was more like Office Space than Ex Machina.

With visual art, including dance, connecting to evolutionary concepts is not difficult. Some visual art explicitly tries to avoid language or ‘real-world references’ via abstraction, or dissolution of the meaning. The latter is quite interesting and appears in music as well. Music made with certain non-instrumental sounds has obvious real-world references, such as a footstep or a coin settling on the table after a flip. Pierre Schaeffer’s musique concrète devised ways of presenting these sounds in a way that obfuscated or removed the association with the real-world object, leaving the divorced sound object to be used for abstract creative purposes.

At the extreme end of entertainment, you get wireheading: pure pleasure stimulation. Most people seem to equate wireheading with extreme drug abuse and think that it is not interesting or meaningful. But there are some possibly edgelord rationalist-adjacent people who would press a button to turn the entire universe into wireheading to solve the problem of suffering. Fortunately for those that don’t agree, there is no such button, but the concept is more plausible with programmable AI. What is the ‘pleasure center’ for an AI entity? It doesn’t seem like minimizing a loss function is particularly pleasurable, and at inference time, there is usually no loss function being minimized. But there are certainly networks that can be aware of their loss at inference time and even use that iteratively to improve a result. Few would argue that this is pleasure. It’s entirely possible that pleasure and pain, which are the topic of so many human stories, are entirely foreign to AI.

One way to look at art made by abstraction is that it has the potential to be more general, or at least, less connected to the world that the artist lived in (though you could certainly argue the opposite). I bring it up because it’s possible that the art AI creates or appreciates is less connected to the human world, and would seem more ‘abstract’ to us. This is a bit of a sci-fi trope to be perfectly honest, but it’s plausible if we are reducing the world of possible art to abstract or non-abstract art. If we do this simple reduction for a thought experiment, there are a few interesting outcomes:

  • art for AI is abstract, and humans can appreciate the abstractions
  • art for AI is not abstract (e.g. based on the qualia native to AI), but humans can only appreciate it as abstract
  • art for AI is abstract, but humans interpret the abstractions differently (e.g. art about the human/AI relationship)

Some AI agent would probably be interested in some art we could only appreciate in the abstract.

A few priming questions for the creation of art for AI entities:

  • What is the equivalent of a mirror for an AI entity, supposing the AI exists as weights and matrices that process text/video/audio as input and output?
  • How would interactive art or video games for AI look? They don’t have the same visual or physical limitations that humans do. To keep it minimal, what would a text-based game look like if built for AI?
  • How is human art related to our evolutionary selection function? How would art for AI be related to its loss function? What would the ‘loss function’ look like for an AI that can appreciate art (without directly programming an art appreciation into it)?
  • What is meaning, beauty, and ugliness for an AI?
  • What will AI struggle to understand?
  • What is boring for AI?
  • Structural complexity and the beauty of some math proofs are discussed with an aesthetic similar to art.
  • Consider all human artifacts as sonification and visualization of a part of the world, or as a non-linear projection of the world. Which of these artifacts constitute art or entertainment to you? What are the important properties of the mapping function and the artifact that make it so?

03 Jan 24

AR/VR 3D Audio editors in 2024

I’m curious what the ideal 3D audio editing interface for casual editing/playback would be like in VR or AR. ‘3D audio editors’ might not be the right term here. There are a few companies, including one I ran into recently called Sound Particles in Leiria, Portugal, that produce professional audio editors for post-production, including cinematic rendering of battle scenes and the like, where spatial audio is important and capturing the 3D scene is key. 3D scene management (tracking the position of the camera and all of the entities) is the core of game engines and CGI.

I’m actually interested in something else: audio editing in a VR (or AR) context, where you want to mix some tracks, or edit a podcast, or do typical things you’d do in a 2D audio editor like Audacity, where scene and entity locations aren’t the primary concern. I wasn’t aware of this kind of editor, but I bet something exists, either in FOSS or commercial apps, and if not, definitely in academic research. So here’s what I found.

Before I dive in, here are my relatively naive thoughts on possible VR applications in the editor domain. I’ve developed a few audio editors, including being part of the Audacity team (2008-2012) and making an audio editor on iOS called Voicer that I retired a few years ago. But I haven’t been very close to editor development for a while.

  • Spectral envelopes/3D spectrograms are a fairly obvious physical mapping to do, and kind of fun to look at, as evinced by 90s music videos and Winamp. However, most people that I know prefer waveform to spectrogram editing. At the end of the day the artefact being produced is a time-domain signal, and spectra are literally convoluted with respect to time, leaving the user uncertain about whether a spectral edit would add an unintentional click or blurriness. Another way to explain this is that spectra are computed over time windows, and if we are plotting spectrograms in 3D, with one of the axes being time, there is ambiguity as to what editing a certain point should mean. Another issue is that there are numerically impossible power spectrograms because of the overlap in time, but there are no impossible waveforms, since the waveform is the ground truth.
  • Providing a waveform interface is still important. Being able to zoom and accurately select, move, cut, and apply effects to a waveform is the core of an audio editor. The waveform provides quite a bit of information: it’s easy to tell if there is a tonal or noisy component present when zoomed in, and the RMS information at zoomed-out scales gives a reasonable amount of info about where the events lie in time. 3D elements might be used to add more space for large numbers of tracks, or possibly positioning the stereo or ambisonic location.
  • It’s now obligatory to mention why AI won’t make any new tool obsolete before getting started. So why not make a 0-D audio editor that is just a text box telling the AI what to edit? If it worked well enough, that would capture some percentage of the market (e.g. removing the pauses from a recording is already a popular use case). Generative audio will become more useful for audio creators too. But it will still be a while before we capture what human audio editors do. There is a lot of audio data, but little collected data about the creative process of editing audio. Editing is also a highly interactive process with necessary trial and error, where the trials and errors are useful for aesthetic and deeper understanding of the objects behind the audio that reveal the next step to the editor. I think as long as humans want to be creative, we will need audio editors.
  • Audio editing capability has been stagnant until recently. Although I worked on Audacity, I was always rooting for something better to come around. In fact, one of the reasons I worked on it was because it had obvious issues that could be resolved (multithreading, hardware interface UI). Sound Forge was my favorite audio editor in the early 2000s. When I talked to sound engineers, they mostly wanted something that was fast, accurate, and reliable, with some preferring support of certain plugins. They don’t need a DAW for everything, but everything basically turned into a DAW. The basic linear interface wasn’t really improved on, just more support for tracks or inputs was added. This could mean that innovation in interface is highly constrained because what we have today gets us there eventually without having to relearn or encounter problems with a new interface. Because of this, I would consider VR editors better suited as a hobbyist or research project than a business venture.
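
The ‘impossible spectrogram’ point from the first bullet can be checked numerically. Below is a minimal sketch in plain Python, with assumptions I’m choosing purely for illustration: an 8-sample frame, 50% overlap, a square-root Hann window, and a random test signal. It also works on the complex STFT rather than a power spectrogram, which is a simplification, but the overlap constraint is the same. The helper names (stft, istft, max_difference) are made up for this sketch and don’t come from any library.

```python
import cmath
import math
import random

FRAME = 8  # samples per analysis frame (illustrative choice)
HOP = 4    # hop size; 50% overlap between neighbouring frames

# Square-root periodic Hann window: its square sums to 1 across
# 50%-overlapped frames, so windowed overlap-add reconstructs exactly.
WINDOW = [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME))
          for n in range(FRAME)]


def dft(frame):
    """Naive O(N^2) DFT; fine for 8-sample frames."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT, keeping the real part (input is conjugate-symmetric)."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(spectrum)).real / n for t in range(n)]


def stft(signal):
    """Windowed, overlapping DFT frames; zero-pads one hop on each side."""
    padded = [0.0] * HOP + list(signal) + [0.0] * HOP
    return [dft([w * s for w, s in zip(WINDOW, padded[start:start + FRAME])])
            for start in range(0, len(padded) - FRAME + 1, HOP)]


def istft(frames, length):
    """Windowed overlap-add back to a waveform of the given length."""
    out = [0.0] * (length + 2 * HOP)
    for i, spectrum in enumerate(frames):
        for t, value in enumerate(idft(spectrum)):
            out[i * HOP + t] += WINDOW[t] * value
    return out[HOP:HOP + length]


def max_difference(a, b):
    """Largest absolute difference between two STFTs, cell by cell."""
    return max(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(32)]
spec = stft(signal)

# An STFT that came from a real waveform survives a round trip.
roundtrip_error = max_difference(stft(istft(spec, len(signal))), spec)

# "Edit" one cell: zero bin 2 (and its mirror bin) in frame 4 only.
edited = [list(frame) for frame in spec]
edited[4][2] = 0j
edited[4][FRAME - 2] = 0j

# Resynthesise and re-analyse: the result no longer matches the edit,
# so no waveform has the edited array as its STFT.
reanalysed = stft(istft(edited, len(signal)))
edit_error = max_difference(reanalysed, edited)

print(f"untouched STFT round trip error: {roundtrip_error:.2e}")
print(f"edited STFT round trip error:    {edit_error:.2e}")
```

The untouched STFT comes back essentially unchanged, while the edited one does not: the neighbouring frames share samples with the edited frame, so the resynthesised waveform can’t honour the edit exactly. That is the ambiguity a spectrogram editor has to resolve somehow, and a waveform editor never runs into it.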

Here’s what I found:

  • Immersive VisualAudioDesign: Spectral Editing in VR (Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion): A university-driven research project that made a nice analogy of morning (low azimuth) shadows from mountains to spectral masking. Also cool that paper prototyping for academics is a thing. I remember it catching on for game dev and mobile app design in the late ’00s and thought it would make more sense in other areas as well. It works for spectrograms because they are a linear physical to physical mapping. This project seems like it wasn’t developed further though.

  • There are a few synth/patch programming VR projects (one called SoundStage from at least 2016), but I don’t consider these audio editors. Music production/education is an interesting area to look into as well, and probably a lot of fun.
  • Almost all VR audio editor searches return results on how to create audio for VR, which is what I expected. There might not be a lot of demand. OTOH, I feel like people on the Meta Quest really like novel utilities.
  • The Sound Particles interface is clearly designed for a 2d/DAW scene-entity paradigm, which I said wasn’t the focus of the post, but it’s actually the closest example I could find to something that you could drop into VR, since it renders the audio scene visually in 3D.

So I didn’t do a lot of research, but I promised myself to do enough to make a hot take post in a day. I feel like there isn’t much, probably due to lack of demand and a history of audio editing progress being relatively incremental, but that also means it’s possible to do something fun in the space, even if being useful still seems like it would take some time to dial in. So there you go. Please let me know if you know of any interesting editors, applications, or research in the area.