01 Mar 24

The joy of the {game, simulation, physics} engine, and on implicit engines in large ML models

Writing a game with an engine is a special experience the first few times. At the core of the engine is some kind of loop. The loop usually runs at a fixed time interval, but it can also run per event. The frequency of the loop doesn’t have to match the screen rendering frequency. It can be faster, since the game engine tracks the states of all entities and their interactions regardless of what is drawn.
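To make the decoupling concrete, here is a minimal sketch of a fixed-step loop that accumulates elapsed frame time and runs as many simulation updates as fit. The names (`Ball`, `update`, `run`, `STEP_MS`) and the 5 ms step are my own assumptions for illustration, not taken from any particular engine.

```python
from dataclasses import dataclass

STEP_MS = 5  # fixed simulation step in milliseconds (assumed value)


@dataclass
class Ball:
    x: float = 0.0
    vx: float = 0.5  # units per millisecond


def update(ball, step_ms):
    """Advance the game state by one fixed step."""
    ball.x += ball.vx * step_ms


def run(frame_times_ms):
    """Consume a list of frame durations; return the ball and the counters."""
    ball = Ball()
    accumulator = 0  # simulation time owed but not yet simulated
    updates = 0
    renders = 0
    for frame_ms in frame_times_ms:
        accumulator += frame_ms
        # Run as many fixed steps as fit into the elapsed time.
        while accumulator >= STEP_MS:
            update(ball, STEP_MS)
            accumulator -= STEP_MS
            updates += 1
        renders += 1  # a real engine would draw the current state here
    return ball, updates, renders
```

Feeding it ten 16 ms frames runs 32 updates for 10 renders, so the loop ticks faster than the screen.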

The interactions between entities include collision detection and defining what happens when entities collide (hit points reduced, force applied, etc). From these relatively simple rules, a complex system arises, and at some point you can’t predict what will happen for a given set of inputs and have to observe the engine in action to understand it. After defining the rules and a few entities (like pong paddles, the ball, and the walls), there is another step where you observe the system living and breathing on the screen. And it is undoubtedly different, somehow, from what you imagined. There is a certain enjoyment I get from writing game engines when I realize that so many different combinations of things might happen as a result of the loop’s heartbeat, and it starts to feel alive. I imagine most coders feel similarly.
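As a rough sketch of what such rules look like for pong, the step below moves the ball, then applies three collision rules: bounce off the top and bottom walls, bounce off the right wall, and bounce off a paddle on the left. The field size, the names, and the point-versus-segment paddle check are simplifying assumptions of mine.

```python
from dataclasses import dataclass

WIDTH, HEIGHT = 40.0, 20.0  # assumed playfield size


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float


@dataclass
class Paddle:
    x: float      # the paddle is a vertical segment at this x
    y_min: float
    y_max: float


def pong_step(ball, paddle):
    """Advance one tick. Returns True if the ball got past the paddle."""
    ball.x += ball.vx
    ball.y += ball.vy
    # Rule 1: top and bottom walls reflect the vertical velocity.
    if (ball.y <= 0.0 and ball.vy < 0) or (ball.y >= HEIGHT and ball.vy > 0):
        ball.vy = -ball.vy
    # Rule 2: the right wall reflects the horizontal velocity.
    if ball.x >= WIDTH and ball.vx > 0:
        ball.vx = -ball.vx
    # Rule 3: the paddle reflects the ball if it arrives within the paddle's span.
    hits_paddle = paddle.y_min <= ball.y <= paddle.y_max
    if ball.x <= paddle.x and ball.vx < 0 and hits_paddle:
        ball.vx = -ball.vx
    return ball.x < 0.0
```

Each rule is trivial on its own; the surprises come from running them together, tick after tick.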

There are also rules behind an engine that aren’t explicitly concerned with interactions between two entities, such as gravity, how often some entity spawns, or how the sun rises and sets. These add further complexity.
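Global rules like these can be sketched as code that runs once per tick over the whole world rather than per pair of entities. In this hypothetical example, a spawn timer and gravity are the only rules; the constants and names are assumptions for illustration.

```python
from dataclasses import dataclass, field

GRAVITY = -10.0        # assumed constant acceleration, units per second squared
SPAWN_INTERVAL = 4     # spawn a new entity every 4 ticks (assumed)
SPAWN_HEIGHT = 100.0


@dataclass
class Entity:
    y: float
    vy: float = 0.0


@dataclass
class World:
    tick: int = 0
    entities: list = field(default_factory=list)


def world_step(world, dt=0.5):
    """Apply the world-level rules once, then advance the tick counter."""
    # Global rule 1: spawning depends only on the clock, not on other entities.
    if world.tick % SPAWN_INTERVAL == 0:
        world.entities.append(Entity(y=SPAWN_HEIGHT))
    # Global rule 2: gravity acts on every entity, including a fresh spawn.
    for entity in world.entities:
        entity.vy += GRAVITY * dt
        entity.y += entity.vy * dt
    world.tick += 1
```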

The rendering pipeline can be seen as something that visualizes the game state but typically does not affect it. You could render the state to video, audio, or text, and from various viewpoints if the engine is coordinate-based.
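One way to picture this: renderers are functions that read the state and return some representation of it. In this small sketch the state is a frozen dataclass, so a renderer cannot modify it even by accident. `GameState`, `render_ascii`, and `render_log` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameState:
    ball_x: int
    ball_y: int
    width: int = 8
    height: int = 3


def render_ascii(state):
    """Draw the playfield as a grid of characters."""
    rows = []
    for y in range(state.height):
        row = "".join(
            "o" if (x, y) == (state.ball_x, state.ball_y) else "."
            for x in range(state.width)
        )
        rows.append(row)
    return "\n".join(rows)


def render_log(state):
    """Describe the same state as a line of text."""
    return f"ball at ({state.ball_x}, {state.ball_y})"
```

Both functions are views of the same `GameState`, and neither one changes it.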

What is the difference between a game engine and a simulation? With a simulation, you typically take something from the real world and try to capture the phenomenon with rules and simplification (e.g. we don’t simulate most things at the sub-atomic level). With a game, you can be god, and create your own rules and your own world. I mentioned earlier that eventually the engine produces more complex output than can be predicted. This feels like leverage: you can understand the individual rules, but not the output of the system. Without the rules, it would be very hard to create the output.

Because simple rules in games can create complex output, it’s often very difficult or impossible to write the rules correctly for a desired output state. So there is usually a feedback loop in which the developer observes, and not only corrects rules but also gets new ideas for what would be interesting. I’d guess it’s more intuitive for a game designer to start by going for a certain class of output states, but there are many games that are designed around a novel set of rules. One example is being able to reverse time in Braid.
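To show how a novel rule like time reversal can sit on top of an ordinary loop, here is one naive way to do it: snapshot the state before every tick and pop snapshots to rewind. This is only my illustration of the idea, and I am not claiming it is how Braid implements it.

```python
class RewindableWorld:
    """A one-dimensional world that remembers every previous tick."""

    def __init__(self, x=0.0, vx=1.0):
        self.x = x
        self.vx = vx
        self.history = []  # stack of (x, vx) snapshots, oldest first

    def step(self):
        # Save the state before changing it, so the tick can be undone.
        self.history.append((self.x, self.vx))
        self.x += self.vx

    def rewind(self, ticks=1):
        # Undo up to `ticks` steps, stopping at the start of history.
        for _ in range(min(ticks, len(self.history))):
            self.x, self.vx = self.history.pop()
```

Five steps forward followed by `rewind(2)` leaves the world where it was after step three.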

With a simulation, the developer often needs to create many rules to match some abstraction of reality. An apt intersection of game development and simulation is the 3D game physics engine, since it has many constraints that we expect from our real world – like large masses needing more force to accelerate.
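The mass constraint is Newton’s second law, a = F / m. A tiny sketch, with a hypothetical `accelerate` helper, shows the same force changing the velocity of a heavy body less than that of a light one.

```python
def accelerate(mass, force, dt, velocity=0.0):
    """Return the new velocity after applying `force` to `mass` for `dt` seconds."""
    acceleration = force / mass  # Newton's second law: a = F / m
    return velocity + acceleration * dt


light = accelerate(mass=1.0, force=10.0, dt=1.0)   # gains 10 units of velocity
heavy = accelerate(mass=10.0, force=10.0, dt=1.0)  # gains only 1 unit
```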

Lately, a number of posts, driven by the Sora release, have raised the question of to what extent these large machine learning models have a physics engine running inside them. You might recall earlier video generation tools like Google’s Imagen Video or Meta’s Make-A-Video.

These clearly lack an understanding of physics and have the wiggly inconsistency that suggests the models aren’t quite sure how objects should behave in the scene. Compare that to Sora, which seems to mostly capture the scene as a game engine would render it, down to the reflections, with only a few artifacts that are observable on closer inspection. Does this model have something more like a physics engine built in? The answer seems to be ‘sort of’, but not quite yet. If you are interested, I refer you to the discussions by Gary Marcus and Raphaël Millière.

To me, the more interesting question is how generally an ML model can represent the data via rules as some kind of engine, and how this engine relates to the probabilistic output layer. Could an ML model reliably construct new rules, for example if you tell it to imagine that gravity exists only between carbon atoms? The other question, how to make models more reliable, seems to be answered to some extent by Sora, since it got rid of the wiggle and captured scenes more consistently. Having consistency over longer and longer spans of time seems crucial for anything like a physics or game engine.

Most of the video data we have is of the 3D world, since humans went out into the real world and captured it with a real camera. This translates super nicely to matrix multiplication, as any 3D programmer will tell you. I think the 2D game engine may actually be more interesting here, because it is divorced from this data and usually comes directly from the game designer’s brain. If these models can capture something like this well and consistently, we should see a lot of interesting results. Even better if we have interpretability/explainability built in, so the rules within the models can be explicitly described and checked.
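For readers who haven’t done 3D programming, the matrix multiplication remark refers to things like the following: a pinhole-camera projection written as a 4x4 matrix applied to a point in homogeneous coordinates, followed by a divide. This is a bare-bones sketch in plain Python, assuming a camera at the origin looking down the +z axis; the helper names are made up.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def perspective(focal):
    """Matrix that maps (x, y, z, 1) to (f*x, f*y, z, z)."""
    return [
        [focal, 0.0, 0.0, 0.0],
        [0.0, focal, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],  # copies z into w for the perspective divide
    ]


def project(point, focal=1.0):
    """Project a 3D point (x, y, z) with z > 0 onto the 2D image plane."""
    x, y, _, w = mat_vec(perspective(focal), [*point, 1.0])
    return (x / w, y / w)
```

The same point twice as far from the camera lands half as far from the image center, which is the regularity a model trained on camera footage has to pick up.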