Sora, Groq, and Virtual Reality – Stratechery by Ben Thompson

Matthew Ball wrote a fun essay earlier this month entitled On Spatial Computing, Metaverse, the Terms Left Behind and Ideas Renewed, tracing the various terms that have been used to describe, well, that’s what the essay is about: virtual reality, augmented reality, mixed reality, Metaverse, are words that have been floating around for decades now, both in science fiction and in products, to describe what Apple is calling spatial computing.

Personally, I agree with Ball that “Metaverse” is the best of the lot, particularly given Ball’s succinct description of the concept in his conclusion:

I liked the term Metaverse because it worked like the Internet, but for 3D. It wasn’t about a device or even computing at large, just as the Internet was not about PC nor the client-server model. The Metaverse is a vast and interconnected network of real-time 3D experiences. For passthrough or optical MR to scale, a “3D Internet” is required – which means overhauls to networking infrastructure and protocols, advances in computing infrastructure, and more. This is, perhaps the one final challenge with the term – it describes more of an end state than a transition.

A challenge, perhaps, or exactly what makes the term the right one: to the extent the Metaverse is the “3D Internet” is the extent to which it is fully interoperable with and additive to the Internet. This, moreover, is a well-trodden path; two years ago I wrote in DALL-E, the Metaverse, and Zero Marginal Content:

Games have long been on the forefront of technological development, and that is certainly the case in terms of medium. The first computer games were little more than text:

Images followed, usually of the bitmap variety; I remember playing a lot of “Where in the world is Carmen San Diego” at the library:

A screenshot from "Where in the world is Carmen San Diego"

Soon games included motion as you navigated a sprite through a 2D world; 3D followed, and most of the last 25 years has been about making 3D games ever more realistic. Nearly all of those games, though, are 3D images on 2D screens; virtual reality offers the illusion of being inside the game itself.

Social media followed a similar path: text to images to video and, someday, shared experiences in 3D space (like the NBA Slam Dunk Contest); I noted that generative AI would follow this path as well:

What is fascinating about DALL-E is that it points to a future where these three trends can be combined. DALL-E, at the end of the day, is ultimately a product of human-generated content, just like its GPT-3 cousin. The latter, of course, is about text, while DALL-E is about images. Notice, though, that progression from text to images; it follows that machine learning-generated video is next. This will likely take several years, of course; video is a much more difficult problem, and responsive 3D environments more difficult yet, but this is a path the industry has trod before.

In a testament to how quickly AI has been moving, “several years” was incredibly pessimistic: Stable Diffusion was being used to generate video within a few months of that post, and now OpenAI has unveiled Sora. From OpenAI’s website:

Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world. The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.

The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory…

Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.

The last two paragraphs in that excerpt are in tension, and have been the subject of intense debate on X: does Sora have, or signal a future, of an emergent model of physical reality, simply by predicting pixels?

Sora and Virtual Reality

One of the more memorable Sora videos came from the prompt “Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.”

This is, frankly, astounding, particularly the rendition of water and especially light: it is only in the past few years that video games, thanks to ray-tracing, have been able to deliver something similar, and even then I would argue Sora has them beat. And yet, a 2nd or 3rd viewing reveals clear flaws; just follow the red flag flying from the ship on the right and how the ship completely flips directions:

The flipping ship in OpenAI's demo

Sora is a transformer-based model, which means it scales in quality with compute; from OpenAI’s technical report about Sora:

Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.

Sora's diffuser transformer model

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

How training compute improves Sora

This suggests that the flag on the ship in the coffee cup (what a phrase!) can be fixed; I’m skeptical, though, that what is, at the end, pixel prediction, could ever be used to replace the sort of physics modeling I discussed in last week’s Stratechery Interview with Rescale CEO Joris Poort about high-performance computing. Note this discussion about modeling an airplane wing:

So let’s take a simple example like fluid flow. You can actually break an airplane wing into many small little boxes or any kind of air or liquid into any small box and understand the science and the physics within that little box and we usually call that a mesh, so that’s well understood. But if you look at something like a more complicated concept like turbulent flow, we’ve all experienced turbulence on an aircraft and so this is not a smooth kind of flow and so it’s discontinuous, so you actually have to time step through that. You have to look at every single small little time step and recalculate all those physics and so each of those individual cells, that mesh can be calculated in parallel.

These physics simulations are meant to be the closest possible approximation to reality; if I’m skeptical that a transformer-based architecture can do this simulation, I am by extension skeptical about its ability to “understand and simulate the real world”; this, though, is where I return to Ball’s essay: we are approaching a product worthy of the term “virtual reality.”

Groq

The point of DALL-E, the Metaverse, and Zero Marginal Content was that generative AI was the key ingredient to making the Metaverse a reality:

In the very long run this points to a metaverse vision that is much less deterministic than your typical video game, yet much richer than what is generated on social media. Imagine environments that are not drawn by artists but rather created by AI: this not only increases the possibilities, but crucially, decreases the costs.

We don’t know the costs of Sora, but they are almost certainly substantial; they will also come down over time, as computing always has. What is also necessary is that rendering speed get a lot faster: one of the challenges of interacting with large language models today is speed: yes, accuracy may increase with compute and model size, but that only increases the amount of latency experienced in getting an answer (compare, say, the speed of GPT-3.5 Turbo to GPT-4). The answer here could also just be Moore’s Law, or maybe a different architectecture.

Enter Groq.1

Groq was founded in 2016 by Jonathan Ross, who created Google’s first Tensor Processing Unit; Ross’s thesis was that chips should take their cue from software-defined networking: instead of specialized hardware for routing data, a software-defined network uses commodity hardware with a software layer to handle the complexity of routing. Indeed, Groq’s paper explaining their technology is entitled “A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning.”

To that end Groq started with the compiler, the software that translates code into machine language that can be understood by chips; the goal was to be able to reduce machine-learning algorithms into a format that could be executed on dramatically simpler processors that could operate at very high speed, without expensive memory calls and prediction misses that make modern processors relatively slow.

The end result is that Groq’s chips are purely deterministic: instead of the high-bandwidth memory (HBM) used for modern GPUs or Dynamic Random Access Memory (DRAM) used in computers, both of which need to be refreshed regularly to function (which introduces latency and uncertainty about the location of data at a specific moment in time), Groq uses SRAM — Static Random Access Memory. SRAM stores data in what is called a bistable latching circuitry; this, unlike the transistor/capacitor architecture undergirding DRAM (and by extension, HBM), stores data in a stable state, which means that Groq always knows exactly where every piece of data is at any particular moment in time. This allows the Groq compiler to, in an ideal situation, pre-define every memory call, enabling extremely rapid computation with a relatively simple architecture.

It turns out that running inference on transformer-based models is an extremely ideal situation, because the computing itself is extremely deterministic. An LLM like GPT-4 processes text through a series of layers which have a predetermined set of operations, which is perfectly suited to Groq’s compiler. Meanwhile, token-based generation is a purely serial operation: every single token generated depends on knowing the previous token; there is zero parallelism for any one specific answer, which means the speed of token calculation is at an absolute premium.

The results are remarkable:2

This speed-up is so dramatic as to be a step-change in the experience of interacting with an LLM; it also makes it possible to do something like actually communicate with an LLM in real-time, even half-way across the world, live on TV:

One of the arguments I have made as to why OpenAI CEO Sam Altman may be exploring hardware is that the closer an AI comes to being human, the more grating and ultimately gating are the little inconveniences that get in the way of actually interacting with said AI. It is one thing to have to walk to your desk to use a PC, or even reach into your pocket for a smartphone: you are, at all times, clearly interacting with a device. Having to open an app or wait for text in the context of a human-like AI is far more painful: it breaks the illusion in a much more profound, and ultimately disappointing, way. Groq suggests a path to keeping the illusion intact.

Sora on Groq

It is striking that Groq is a deterministic system3 running deterministic software that, in the end, produces probabilistic output. I explained deterministic versus probabilistic computing in ChatGPT Gets a Computer:

Computers are deterministic: if circuit X is open, then the proposition represented by X is true; 1 plus 1 is always 2; clicking “back” on your browser will exit this page. There are, of course, a huge number of abstractions and massive amounts of logic between an individual transistor and any action we might take with a computer — and an effectively infinite number of places for bugs — but the appropriate mental model for a computer is that they do exactly what they are told (indeed, a bug is not the computer making a mistake, but rather a manifestation of the programmer telling the computer to do the wrong thing).

I’ve already mentioned Bing Chat and ChatGPT; on March 14 Anthropic released another AI assistant named Claude: while the announcement doesn’t say so explicitly, I assume the name is in honor of the aforementioned Claude Shannon. This is certainly a noble sentiment — Shannon’s contributions to information theory broadly extend far beyond what Dixon laid out above — but it also feels misplaced: while technically speaking everything an AI assistant is doing is ultimately composed of 1s and 0s, the manner in which they operate is emergent from their training, not proscribed, which leads to the experience feeling fundamentally different from logical computers — something nearly human — which takes us back to hallucinations; Sydney was interesting, but what about homework?

The idea behind ChatGPT Gets a Computer is that large language models seem to operate somewhat similarly to the human brain, which is incredible and also imprecise, and just as we need a computer to do exact computations, so does ChatGPT. A regular computer, though, is actually the opposite of Groq: you get deterministic answers from hardware that is, thanks to the design of modern processors and memory, more probabilistic than you might think, running software that assumes the processor will handle endless memory calls and branch prediction.

In the end, though, we are back where we started: a computer would know where the bow and stern are on a ship, while a transformer-based model like Sora made a bad guess. The former calculates reality; the latter a virtual reality.

Imagine, though, Sora running on Groq (which is absolutely doable): could we have generated videos in real-time? Even if we could not, we are certainly much closer than you might have expected. And where, you might ask, would we consume those videos? How about on a head-mounted display like the Apple Vision Pro or Meta Quest? Virtual reality (my new definition) for virtual reality (the old definition).

The Impending VR Moment

The iPhone didn’t happen in a vacuum. Apple needed to learn to make low-power devices with the iPod; flash memory needed to become viable at an accessible price point; Samsung needed to make a good enough processor; 3G networking needed to be rolled out; the iTunes Music Store needed to provide the foundation for the App Store; Unity needed to be on a misguided mission to build a game engine for the Mac. Everything, though, came together in 2007, and the mobile era exploded.

Three years ago Facebook changed its name to Meta, signaling the start of the Metaverse era that quickly fizzled into a punchline; it looked like the company was pulling too many technologies forward too quickly. Apple, though, might have better timing: it’s notable that the Vision Pro and Sora launched in the same month, just as Groq started to show that real-time inferencing might be more attainable than we thought. TSMC, meanwhile, is pushing to 2nm, and Intel is making a credible bid to join them, just as the demand for high performance chips is sky-rocketing thanks to large language models generally.

I don’t, for the record, think we are at an iPhone moment when it comes to virtual reality, by which I mean the moment where multiple technological innovations intersect in a perfect product. What is exciting, though, is that a lot of the pieces — unlike three years ago — are in sight. Sora might not be good enough, but it will get better; Groq might not be cheap enough or fast enough, but it, and whatever other competitors arise, will progress on both vectors. And Meta and Apple themselves have not, in my estimation, gotten the hardware quite right. You can, however, see a path from here to there on all fronts.

The most important difference, of course, is that mobile phones existed before the iPhone: it was an easy lift to simply sell a better phone. The big question — one that we are only now coming in reach of answering — is if virtual reality will, for a meaningful number of people, be a better reality.

I wrote a follow-up to this Article in this Daily Update.

Post Author: BackSpin Chief Editor

We are the editorial staff at BackSpin Records. We love music, technology, and other interesting things!