When Robots Started to Understand: LLMs, NVIDIA GR00T, and Google RT-2 Explained
For decades, robots could only do exactly what they were programmed to do. Then someone asked: what if we gave them a language model? The answer changed everything — and it happened faster than almost anyone expected.
Here's a test that roboticists used to love giving their systems: put an empty Pepsi can on a table, and ask the robot to “throw away the trash.” A classically programmed robot fails immediately. It doesn't know what trash is. It doesn't know what “throw away” means as a physical action. It may have seen a Pepsi can in its training data, but nothing tells it that an empty can counts as trash while a full one doesn't.
This sounds almost funny until you realize it represents the fundamental wall that blocked robotics from becoming generally useful for most of its history. Robots were fast, precise, and tireless — but they were also, in a very real sense, comprehension-free. They executed instructions. They did not understand them.
Then, roughly between 2022 and 2026, something shifted. Language models got applied to robots. And the Pepsi can test? A robot running Google's RT-2 passes it on the first try — not because someone programmed “empty Pepsi can = trash,” but because the model learned that from the internet, the same way you did.
Let me walk you through how that happened, what it means, and where it's going — because honestly, researching this piece was one of those moments where I had to keep stopping and re-reading things. The pace of this field is genuinely disorienting.
The Old Way: Why Classic Robots Were So Brittle
To appreciate what's changed, you need to understand just how limited the old approach was. A traditional industrial robot — the kind assembling cars on a factory floor — operates on what engineers call a closed-world assumption. It knows exactly the objects it was trained on, exactly the tasks it was given, and exactly the environment it was designed for. Change any of those variables, and performance collapses.
The engineering term for the inputs that trigger this failure is “out-of-distribution.” Move the table three centimeters. Add an unfamiliar object. Give an instruction that uses slightly different wording. The robot breaks. Not dramatically: it just stops, or worse, does something confidently wrong.
Traditional robots learn tasks, not concepts. A robot trained to “pick up the red cube” has no idea what “red” means beyond the specific pixel values it was trained on. Ask it to pick up a “crimson block” and it fails. Language models have learned that red, crimson, and scarlet name closely related colors, and that changes everything about how a robot can be instructed.
This brittleness is why, for decades, truly general-purpose robots remained science fiction. You could have a brilliant robot arm in a car factory — doing exactly one thing, perfectly, ten thousand times a day — or you could have a clumsy, expensive research robot that sort of worked in a carefully controlled lab. Nothing in between.
The question was: could a model trained on human knowledge — all the text and images on the internet — give a robot something closer to genuine understanding? The first serious answer came from Google.
RT-2: The First Time a Robot Read the Internet
In July 2023, Google DeepMind published RT-2 — Robotics Transformer 2 — and it was a genuine conceptual leap. The idea sounds almost too simple: take a large vision-language model already trained on web-scale data, and fine-tune it on robot demonstration data. Don't treat the robot's required output (motor commands) as something fundamentally different from language. Just tokenize the actions the same way you'd tokenize words, and let the model learn to produce them.
The result was the first so-called VLA model — Vision-Language-Action. A robot that sees, understands language, and acts, all from a single unified model. No separate vision pipeline, no separate planning module, no carefully hand-engineered skill library. One model, end to end.
How a VLA (Vision-Language-Action) model works — one unified model from perception to physical action. © Paradigm Shift Lab
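To make the "actions as tokens" idea concrete, here's a minimal sketch in Python. It assumes each continuous action dimension is split uniformly into 256 bins, which is the scheme the RT-2 paper describes. The value range, the 7-dimensional action layout, and the function names are my own illustrative assumptions, not RT-2's actual code.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension (the figure reported for RT-2)


def tokenize_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Map continuous values in [low, high] to a space-separated string of bin indices."""
    tokens: List[str] = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale to [0, 1], then to an integer bin in [0, NUM_BINS - 1].
        fraction = (clipped - low) / (high - low)
        bin_index = min(int(fraction * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def detokenize_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert tokenize_action, returning the centre of each bin."""
    width = (high - low) / NUM_BINS
    return [low + (int(tok) + 0.5) * width for tok in text.split()]


# Hypothetical 7-D action: xyz displacement, roll/pitch/yaw, gripper opening.
action = [0.0, 0.5, -1.0, 1.0, 0.25, -0.25, 0.9]
encoded = tokenize_action(action)      # a plain string the language model can emit
decoded = detokenize_action(encoded)   # what the robot controller would receive
```

The point of the round trip is that the model only ever has to produce text. The cost is a small quantization error, at most half a bin width per dimension in this sketch.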
What made RT-2 remarkable wasn't just the architecture — it was the emergent capabilities. Things nobody explicitly trained for. The robot could follow chains of reasoning: “move the object that could be used if you were sad” — and it picked up a tissue. It knew that a tissue is used when you're sad because it read that somewhere on the internet. The robot knowledge and the world knowledge had merged into a single system.
Honestly, when I first read the paper, I thought the “sad” example was cherry-picked for the press release. But the RT-2 team ran it systematically across hundreds of such novel instructions and consistently found that web knowledge transferred to robotic action. That's not a demo trick. That's a real result.
The limitation was latency and size. The largest RT-2 variants had up to 55 billion parameters, far too large to run on the robot itself, so the robot was essentially thinking via a cloud connection. Practical deployment at scale was a different problem. But as a proof of concept, it cracked open a door that has been swinging wider ever since.
NVIDIA GR00T N1: The Open Foundation Model for Robot Bodies
In March 2025 at GTC, NVIDIA did something significant: it open-sourced a robot foundation model. GR00T N1 — the name is a very NVIDIA thing to do, so let's just accept it — was billed as the world's first fully customizable open foundation model dedicated specifically to humanoid robots.
The architecture is clever and, I think, genuinely well-designed. GR00T N1 is a VLA model, like RT-2, but it uses what NVIDIA calls a dual-system design — and once you hear the framing, you can't unhear it.
GR00T N1's dual-system design — System 2 thinks, System 1 acts, both jointly trained end-to-end. © Paradigm Shift Lab
NVIDIA borrowed the framing from Daniel Kahneman's Thinking, Fast and Slow. System 2 is the slow, deliberate, reasoning module — a vision-language backbone that interprets the scene and the instruction. System 1 is the fast, intuitive, motor module — a diffusion transformer that generates fluid physical movements in real time. They're coupled together and jointly trained.
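The timing relationship between the two systems is easier to see in code. The sketch below is a toy, not NVIDIA's implementation: `slow_reasoner` and `fast_actor` are hypothetical stand-ins for the vision-language backbone and the diffusion transformer, the replanning interval is an arbitrary assumption, and nothing here captures the joint training.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Plan:
    """Stand-in for the context System 2 hands to System 1."""
    target: float


def slow_reasoner(observation: float, instruction: str) -> Plan:
    """Hypothetical System 2: a real model would be a vision-language backbone."""
    return Plan(target=1.0 if "raise" in instruction else 0.0)


def fast_actor(observation: float, plan: Plan) -> float:
    """Hypothetical System 1: a real model would be a diffusion transformer.
    Here it is just a proportional step toward the planned target."""
    return 0.2 * (plan.target - observation)


def run_episode(instruction: str, ticks: int = 24, replan_every: int = 12) -> Tuple[float, int, int]:
    """Run the fast loop every tick and the slow loop only every `replan_every` ticks."""
    position = 0.0
    plan = Plan(target=0.0)
    slow_calls = 0
    fast_calls = 0
    for tick in range(ticks):
        if tick % replan_every == 0:
            plan = slow_reasoner(position, instruction)  # expensive, infrequent
            slow_calls += 1
        position += fast_actor(position, plan)           # cheap, every tick
        fast_calls += 1
    return position, slow_calls, fast_calls


position, slow_calls, fast_calls = run_episode("raise the arm")
```

With these made-up numbers the slow module runs twice while the fast module runs 24 times, which is the whole design idea: the expensive reasoning is amortized over many cheap motor updates.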
What I found genuinely interesting here is the training data strategy. GR00T N1 uses a pyramid: at the base, internet-scale human video data (cheap and abundant but embodiment-agnostic); in the middle, synthetic data generated from simulation (controllable but artificial); at the top, real robot demonstration data (scarce but high-fidelity). The insight is that you need all three, at different proportions, to get a model that both understands the world and can act in it reliably.
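One common way to combine data tiers like these is weighted sampling when building each training batch. Here's a rough sketch of that idea. The mixing weights are numbers I made up for illustration; the text above names the three tiers but not their proportions, and this is not a description of NVIDIA's actual training pipeline.

```python
import random
from collections import Counter

# Hypothetical mixing weights (assumption): more weight on abundant tiers,
# less on scarce real-robot data.
MIX = {
    "human_video": 0.6,    # base of the pyramid: cheap, abundant
    "synthetic_sim": 0.3,  # middle: controllable but artificial
    "real_robot": 0.1,     # top: scarce, high fidelity
}


def sample_batch_sources(batch_size: int, rng: random.Random) -> Counter:
    """Pick a data tier for each example in a batch according to MIX."""
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=batch_size))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = sample_batch_sources(1000, rng)
```

Note that sampling weights don't have to mirror raw data volume. A scarce tier can be oversampled relative to its size, which is one reason the proportions are a tuning decision rather than a given.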
At the time of writing, NVIDIA has already iterated to GR00T N1.5, with improved grounding and language following. It's open-sourced on Hugging Face, which means every humanoid robot company on the planet can fine-tune it on their own hardware. 1X Technologies deployed it on their NEO Gamma robot for domestic tasks after minimal post-training. That turnaround time, from foundation model to working deployment, would have seemed impossible five years ago.
Google's Evolution: From RT-2 to Gemini Robotics
Meanwhile, Google didn't stop at RT-2. If anything, RT-2 was the opening move in a much longer game.
The 2026 version of Google's robotics stack — Gemini Robotics 1.5 + Gemini Robotics-ER 1.5 — is a two-model architecture that's structurally similar to GR00T N1's dual-system design, though Google came at it from a different direction. The ER (Embodied Reasoning) model handles high-level planning and can call external tools, including Google Search, to look up information mid-task. The VLA model handles physical execution.
The most striking thing to me about the newer Gemini Robotics-ER 1.6 is the specific capability they highlighted: reading the needle on a pressure gauge. That sounds mundane. But think about what it requires: understanding what a gauge is, knowing that the needle position encodes a value, inferring the scale, and translating that into a number. That's not object recognition. That's industrial literacy.
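The split between a reasoning model that plans (and may call tools) and an action model that executes can be sketched as a simple loop. Everything below is hypothetical: `fake_search`, `plan`, and `execute` are toy stand-ins I wrote for illustration, and none of it reflects Google's actual API.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def fake_search(query: str) -> str:
    """Hypothetical stand-in for a web search tool."""
    knowledge = {"recycling rules": "cans go in the blue bin"}
    return knowledge.get(query, "no result")


def plan(instruction: str, tools: Dict[str, Tool]) -> List[str]:
    """Hypothetical reasoning model: breaks an instruction into sub-steps,
    consulting a tool when it lacks information."""
    if "recycle" in instruction:
        rule = tools["search"]("recycling rules")
        bin_name = "blue bin" if "blue bin" in rule else "general bin"
        return ["locate can", "grasp can", f"move to {bin_name}", "release can"]
    return [instruction]


def execute(step: str, log: List[str]) -> bool:
    """Hypothetical action model: a real VLA would emit motor commands here."""
    log.append(f"done: {step}")
    return True


def run_task(instruction: str, tools: Dict[str, Tool]) -> List[str]:
    """Plan once, then hand each sub-step to the executor in order."""
    log: List[str] = []
    for step in plan(instruction, tools):
        if not execute(step, log):
            break  # a real system would replan here
    return log


log = run_task("recycle the empty can", {"search": fake_search})
```

The useful property of this shape is that the planner's output is plain language, so the same executor can serve any plan the reasoning model comes up with.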
What Actually Changes When You Put an LLM in a Robot
So practically speaking — what's actually different? Let me be concrete, because I think the implications are more specific than “robots get smarter.”
1. Instructions in plain language
Before: “execute subroutine PICK_RED_CUBE_SLOT_A3.” After: “hey, can you grab the red thing near the door?” The robot doesn't need a programmer to translate intent into code. Any person can give it a task in the same way they'd ask a colleague.
2. Generalization to new objects and environments
The old system fails the moment it sees something outside its training distribution. The new system brings internet-scale knowledge of objects, materials, and contexts. When a factory adds a new product line, you don't necessarily need to collect thousands of new robot demonstrations — fine-tuning on a handful of examples may be enough, because the base model already “knows” what most objects are and how they behave.
3. Multi-step reasoning over long tasks
Classic robots execute individual skill primitives: pick, place, push. Stringing those into long-horizon tasks required explicit programming of every transition. VLA models, especially with chain-of-thought-style reasoning built in, can figure out task sequences from a single high-level instruction: “make the workspace ready for assembly” becomes a sequence of clearing, organizing, and positioning actions without anyone spelling them out.
4. Transparent decision-making
This one surprised me when I came across it. Gemini Robotics 1.5 can actually explain what it's doing and why, in natural language, while it's doing it. For safety-critical applications — medical, industrial, service robots — a robot that can say “I'm not picking this up because it looks fragile” is fundamentally more trustworthy than one that just silently fails or succeeds.
The shift from programmed robots to LLM-powered robots is not just a performance upgrade. It's a change in what robots fundamentally are. A programmed robot is a tool that executes. An LLM-powered robot is closer to an agent that comprehends. The difference matters enormously for what tasks are even possible to assign.
The Honest Limitations
I'd be doing you a disservice if I didn't flag the real problems, because they're significant and the field is still actively working on them.
Latency is still a challenge. Large VLA models are slow relative to the physical world. A robot arm needs to react in milliseconds; a model doing serious reasoning takes longer. The dual-system designs (GR00T N1, Gemini Robotics) are partly an answer to this — offload reasoning to the slow system, keep physical actions fast. But it's still an active engineering problem.
Data remains scarce. Language models are trained on trillions of tokens scraped from the internet. The Open X-Embodiment dataset, one of the largest robot learning datasets in the world, has about 1 million trajectories. That sounds like a lot until you compare the two. Synthetic data from simulation helps, but the sim-to-real gap (the performance drop when a model trained in simulation meets the real world) hasn't been fully closed.
Safety and reliability in open environments. A robot in a controlled factory is one thing. A robot in a home, a hospital, or a city street — where the unexpected is the norm — is a genuinely hard safety problem that LLMs alone don't solve. Google has a Responsibility and Safety Council specifically for Gemini Robotics. That's not marketing; the problem is real.
Cost. Running frontier-scale VLA models still requires significant compute. GR00T N1 being open-sourced is a meaningful step toward accessibility, but inference costs for real-time robot control at scale remain non-trivial.
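To put the data-scarcity point above in rough numbers, here's a back-of-the-envelope calculation. The trajectory count comes from the text. The steps-per-trajectory figure is purely my assumption, "trillions" is taken at its lower bound of one trillion, and a robot timestep and a text token aren't really the same unit, so treat the result as an order-of-magnitude illustration only.

```python
ROBOT_TRAJECTORIES = 1_000_000      # from the text: roughly 1 million trajectories
STEPS_PER_TRAJECTORY = 100          # assumption, chosen only for illustration
LLM_TOKENS = 1_000_000_000_000      # lower bound for "trillions" of tokens


def scale_gap(tokens: int, trajectories: int, steps_per_trajectory: int) -> float:
    """How many text tokens exist per robot timestep under these assumptions."""
    robot_steps = trajectories * steps_per_trajectory
    return tokens / robot_steps


ratio = scale_gap(LLM_TOKENS, ROBOT_TRAJECTORIES, STEPS_PER_TRAJECTORY)
```

Under these assumptions, language data outnumbers robot data by a factor of about ten thousand, and changing the assumed trajectory length by a factor of ten either way doesn't alter the conclusion.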
The Bigger Picture
Here's what I keep coming back to after digging into all of this. The history of AI has a recurring pattern: a capability that seemed to require deep domain-specific engineering turns out to be achievable with enough scale and the right architecture. Language translation. Image recognition. Protein folding. Chess. Each time, the prediction was that the hard, nuanced, “human” parts would resist this treatment. Each time, they didn't.
Robotic manipulation — the ability to interact with physical objects in open environments — has been one of the most stubborn holdouts. Robots have been “almost there” for thirty years. What's different now isn't just better hardware or bigger datasets. It's that we finally have models that bring general world knowledge to bear on physical tasks. That combination — embodied AI that actually understands what it's doing — is what makes this moment feel qualitatively different.
Whether that feeling is right, or whether we're about to hit the next wall nobody anticipated, the honest answer is that nobody knows yet. But robots are no longer failing the Pepsi can test. And that matters.
Frequently Asked Questions
What is a VLA model in robotics?
VLA stands for Vision-Language-Action. It's a type of AI model that takes visual input (what the robot's cameras see), natural language instructions (what it's told to do), and outputs motor commands (physical actions). Unlike earlier systems with separate vision, language, and control modules, a VLA processes all three through a single unified model.
What is NVIDIA GR00T N1?
GR00T N1 is NVIDIA's open-source foundation model for humanoid robots, released in March 2025. It uses a dual-system architecture: a Vision-Language Model for reasoning (System 2) and a Diffusion Transformer for generating motor actions (System 1). It's trained on a mix of real robot data, human videos, and synthetic simulation data, and is available on Hugging Face for developers to fine-tune.
What is Google RT-2?
RT-2 (Robotics Transformer 2), published by Google DeepMind in July 2023, was the first VLA model to demonstrate that web-scale knowledge could transfer to robotic control. By fine-tuning a large vision-language model on robot demonstration data, RT-2 could follow novel instructions and reason about unfamiliar objects without explicit programming. It has since evolved into the Gemini Robotics family.
What is Gemini Robotics?
Gemini Robotics is Google DeepMind's current production-grade embodied AI family, built on the Gemini 2.0 foundation model. It consists of two models: Gemini Robotics (a VLA model for physical control) and Gemini Robotics-ER (an embodied reasoning model for high-level planning). As of April 2026, Gemini Robotics-ER 1.6 is the latest version, with state-of-the-art spatial reasoning.
What is the difference between LLM robots and traditional programmed robots?
Traditional robots execute pre-programmed instructions and fail outside their training distribution. LLM-powered robots bring general world knowledge to physical tasks — they can follow natural language instructions, generalize to new objects, reason across multi-step tasks, and in some cases explain their own decisions. The core shift is from robots that execute tasks to robots that comprehend them.
For thirty years, robots could do. Now, for the first time, some of them can understand. The gap between those two things is everything.