The artificial intelligence industry just experienced its most significant pivot since the transformer architecture revolutionized language models in 2017. While the world fixated on chatbots and text generation, AI’s most influential pioneers, among them Yann LeCun, Fei-Fei Li, and teams at Google DeepMind, quietly orchestrated a fundamental shift. Their target: moving AI beyond the flat, text-based understanding that has dominated for years, into a new frontier where machines comprehend physical reality in three dimensions.
In the span of just a few months bridging late 2025 and early 2026, over $1.3 billion has poured into “world model” startups. Yann LeCun, after 12 years at Meta, launched AMI Labs seeking €500 million at a €3 billion valuation. Fei-Fei Li’s World Labs, which just months ago launched Marble, its first commercial world model product, is now in talks to raise $500 million at a staggering $5 billion valuation. Google DeepMind released Genie 3, the first real-time interactive general-purpose world model. And NVIDIA’s Cosmos platform, trained on 20 million hours of real-world data, has been downloaded over 2 million times.
The message from AI’s most brilliant minds is unmistakable: the era of text-only intelligence has peaked. The next frontier belongs to AI systems that can see, simulate, and understand physical reality, a capability researchers call “spatial intelligence.” This isn’t just another incremental improvement. It’s the missing piece in the puzzle toward artificial general intelligence.
What Are World Models and Why Do They Matter?
Large language models, the technology powering ChatGPT, Claude, and Gemini, excel at predicting the next word in a sequence. They can write poetry, answer questions, and generate code. But they fundamentally lack what humans possess from infancy: an understanding of how the physical world works.
Ask GPT-4 what happens when you push a glass off a table, and it can describe the physics perfectly. But it doesn’t truly understand that action the way a toddler does after watching one glass shatter. LLMs predict text sequences; they don’t model reality.
World models take a radically different approach. Instead of predicting the next word, they predict the next state of a physical environment. They generate an internal representation of how things move, interact, and evolve in 3D space. This enables planning, physics reasoning, and cause-effect understanding: capabilities that remain largely absent from even the most advanced language models.
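To make the contrast concrete, here is a deliberately toy sketch of the world model prediction problem: given a physical state and a time step, roll the state forward under simple physics rather than predicting the next token. Everything here (the class names, the single-object ballistic “physics”) is illustrative, not any lab’s actual model.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class WorldState:
    """A toy 3D state: object positions and velocities (meters, m/s)."""
    positions: np.ndarray   # shape (num_objects, 3)
    velocities: np.ndarray  # shape (num_objects, 3)


class ToyWorldModel:
    """Illustrative world model: it predicts the next physical state,
    not the next word. Here, plain ballistic motion under gravity."""

    GRAVITY = np.array([0.0, 0.0, -9.81])

    def predict_next_state(self, state: WorldState, dt: float = 1 / 24) -> WorldState:
        new_velocities = state.velocities + self.GRAVITY * dt
        new_positions = state.positions + new_velocities * dt
        return WorldState(new_positions, new_velocities)


# A glass nudged off a table edge: the model rolls physics forward in time.
glass = WorldState(positions=np.array([[0.5, 0.0, 0.9]]),
                   velocities=np.array([[0.2, 0.0, 0.0]]))
model = ToyWorldModel()
for _ in range(24):  # simulate one second at 24 steps per second
    glass = model.predict_next_state(glass)
print(glass.positions)  # the glass has cleared the table edge and dropped (no collision handling here)
```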
“LLMs don’t really understand the world; they just predict the next word or idea,” notes TechCrunch’s analysis of the 2026 AI landscape. “That’s why many researchers believe the next big leap will come from world models: AI systems that learn how things move and interact in 3D spaces so they can make predictions and take actions.”
Leading researchers like LeCun argue that world models are essential for achieving human-level intelligence because they model reality rather than just describing it. As Fei-Fei Li wrote in her November 2025 manifesto, spatial intelligence provides “the scaffolding” of human cognition. For humans, spatial understanding of the physical world isn’t separate from intelligence; it’s foundational to how we think, learn, and interact.
The Three Frontrunners Racing Toward Spatial Intelligence
Fei-Fei Li’s World Labs: Marble and the Commercial Breakthrough
World Labs emerged from stealth in September 2024 with $230 million in funding and a bold mission: build AI with spatial intelligence. On November 12, 2025, the company launched Marble, the first commercially available world model product, marking a watershed moment for the industry.
Marble generates persistent, downloadable 3D environments from text prompts, photos, videos, 3D layouts, or panoramic images. Unlike competitors that generate worlds on-the-fly during exploration, Marble produces discrete, editable environments that users can export in industry-standard formats compatible with Unreal Engine, Unity, and professional VFX workflows.
The technical architecture represents a significant evolution beyond frame-by-frame video generation. Marble utilizes Gaussian Splatting, a technique that represents 3D volume with millions of semi-transparent particles, allowing users to navigate and explore generated worlds with full geometric consistency. The platform offers AI-native editing tools and a hybrid 3D editor enabling users to block out spatial structures before AI fills in the visual details.
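As a rough illustration of the Gaussian Splatting idea, and not World Labs’ actual renderer, a scene can be thought of as a collection of semi-transparent colored blobs that are depth-sorted and alpha-composited along each viewing ray. The sketch below keeps only that compositing logic; a production renderer projects full anisotropic 3D Gaussians and rasterizes millions of them on the GPU.

```python
import numpy as np


def composite_splats(splats, ray_origin, ray_dir):
    """Front-to-back alpha compositing of semi-transparent splats along one
    viewing ray. Each splat is a tuple of (center, radius, color, opacity)."""
    ray_dir = ray_dir / np.linalg.norm(ray_dir)

    # Sort splats by distance along the ray (nearest first).
    def depth(splat):
        center, _, _, _ = splat
        return np.dot(center - ray_origin, ray_dir)

    color = np.zeros(3)
    transmittance = 1.0  # how much light still passes through to this point
    for center, radius, splat_color, opacity in sorted(splats, key=depth):
        # Perpendicular distance from the ray to the splat center.
        to_center = center - ray_origin
        perp = to_center - np.dot(to_center, ray_dir) * ray_dir
        # Gaussian falloff: splats contribute less the farther the ray misses them.
        alpha = opacity * np.exp(-0.5 * (np.linalg.norm(perp) / radius) ** 2)
        color += transmittance * alpha * np.asarray(splat_color)
        transmittance *= 1.0 - alpha
        if transmittance < 1e-3:  # early exit once the pixel is effectively opaque
            break
    return color


pixel = composite_splats(
    splats=[(np.array([0.0, 0.0, 2.0]), 0.3, (1.0, 0.2, 0.2), 0.8),
            (np.array([0.1, 0.0, 4.0]), 0.5, (0.2, 0.2, 1.0), 0.6)],
    ray_origin=np.zeros(3),
    ray_dir=np.array([0.0, 0.0, 1.0]),
)
print(pixel)  # composited RGB value for this ray
```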
“What you don’t see here behind the scene is how much computation and why inference speed really matters,” Li demonstrated at CES 2026 in early January. “The faster we can run these models, the more responsive the world becomes. Instant camera moves, instant edits, and a scene that stays coherent as you actually navigate and explore: that’s what’s really important.”
Early adopters span gaming, visual effects, and virtual reality. Game developers use Marble to generate background environments and ambient spaces, then import those assets into game engines to add interactive elements. VFX professionals leverage Marble’s 3D precision to sidestep the inconsistency and poor camera control that plague AI video generators. Every generated world is compatible with Vision Pro and Quest 3 VR headsets, addressing what one industry observer called VR’s hunger for content.
But Li’s ambitions extend far beyond entertainment. “We can use this technology to create many virtual worlds that connect, extend, or complement our physical world,” she explained. The robotics implications are particularly profound. Unlike image and video generation, robotics lacks large repositories of training data. With generators like Marble, simulating training environments for robots becomes dramatically easier and safer than physical testing.
Li positions Marble as “the first step toward creating a truly spatially intelligent world model.” The ultimate vision: AI that doesn’t just generate pretty scenes, but genuinely understands the physics, geometry, and causality of the 3D world, enabling everything from autonomous robots to scientific simulation.
Yann LeCun’s AMI Labs: The JEPA Architecture Revolution
In late 2025, Yann LeCun, Meta’s former chief AI scientist and one of the “godfathers” of deep learning, made a stunning move. After 12 years at Meta, he left to launch Advanced Machine Intelligence (AMI) Labs, immediately seeking €500 million in funding at a €3 billion valuation.
LeCun has long been critical of the AI industry’s overreliance on scaling. “I think most likely in the next five years, we are going to find a better architecture that is a significant improvement on transformers,” he’s stated publicly. “And if we don’t, we can’t expect much improvement on the models.”
AMI Labs represents LeCun’s answer: an architecture called JEPA (Joint Embedding Predictive Architecture), which he first published in a 2022 paper. Unlike transformers that predict the next token in a sequence, JEPA learns by predicting representations of future states in an abstract embedding space.
The key innovation: JEPA doesn’t try to predict every pixel of what happens next. Instead, it predicts high-level features, the meaningful patterns that matter. This mirrors how humans learn. When a child watches a ball roll down a ramp, they don’t encode every pixel of motion. They extract the essential pattern: round objects roll, gravity pulls downward, momentum carries things forward.
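A minimal sketch of that training objective in PyTorch, using toy MLP encoders on random tensors: the point to notice is that the loss is computed between predicted and actual embeddings, never between pixels. The exponential-moving-average target encoder follows common JEPA-style practice but is heavily simplified here.

```python
import torch
import torch.nn as nn

dim, emb = 256, 64

# Context encoder, target encoder, and a predictor operating in embedding space.
context_encoder = nn.Sequential(nn.Linear(dim, emb), nn.ReLU(), nn.Linear(emb, emb))
target_encoder = nn.Sequential(nn.Linear(dim, emb), nn.ReLU(), nn.Linear(emb, emb))
predictor = nn.Sequential(nn.Linear(emb, emb), nn.ReLU(), nn.Linear(emb, emb))

# The target encoder receives no gradients; it trails the context encoder as an
# exponential moving average (a common trick to avoid collapsed embeddings).
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for _ in range(100):
    current_obs = torch.randn(32, dim)  # stand-in for the visible context (e.g. current frames)
    future_obs = torch.randn(32, dim)   # stand-in for the masked or future observation

    predicted = predictor(context_encoder(current_obs))
    with torch.no_grad():
        target = target_encoder(future_obs)

    # The loss lives in abstract embedding space: no pixel reconstruction anywhere.
    loss = nn.functional.mse_loss(predicted, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update of the target encoder.
    with torch.no_grad():
        for tp, cp in zip(target_encoder.parameters(), context_encoder.parameters()):
            tp.mul_(0.99).add_(0.01 * cp)
```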
LeCun’s architecture has already shown promise in vision tasks. V-JEPA (Vision JEPA), released by Meta AI Research, demonstrated strong performance on visual understanding benchmarks by learning to predict masked portions of videos without pixel-level reconstruction. The model learns rich representations of how objects move and interact, forming an internal “world model” of visual dynamics.
AMI Labs aims to scale this approach into production-ready world models that can power everything from robotics to autonomous systems. The €500 million funding target signals LeCun’s confidence that JEPA represents a fundamental breakthrough: not just an incremental improvement over transformers, but a genuine paradigm shift.
The implications are profound. If LeCun succeeds, AI systems could learn to understand physical reality with orders of magnitude less data and compute than current approaches require. The race is on to see whether JEPA can deliver on its theoretical promise at commercial scale.
Google DeepMind’s Genie 3: Real-Time Interactive Worlds
While startups commanded headlines, Google DeepMind quietly built the most technically impressive world model to date. In late 2025, the company launched Genie 3 and Project Genie, the first real-time interactive general-purpose world model.
Genie 3 generates navigable 3D environments at 24 frames per second from text prompts alone. Unlike previous systems that required extensive 3D data or specialized training, Genie 3 learned from videos of real-world environments and gameplay footage, then generalized to create entirely new interactive worlds.
The technical achievement is staggering. Real-time generation at gaming-ready frame rates, with geometric consistency maintained as users navigate through environments, represents a massive computational challenge. DeepMind’s breakthrough came from advances in transformer efficiency, novel attention mechanisms for 3D spatial reasoning, and clever training strategies that teach the model physics and geometric relationships.
Project Genie takes this further, providing a research platform where developers can experiment with interactive world generation. While still in limited research preview, early demonstrations show users creating explorable game-like environments through simple text descriptions, with the model generating not just visuals but consistent physics and object interactions.
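Project Genie remains in limited research preview and its API is not public, so purely as an illustration of the interaction pattern, an action-conditioned world model loop from a client’s point of view might look like the following. Every class and method name is hypothetical; only the 24 frames-per-second budget comes from DeepMind’s stated figure for Genie 3.

```python
import time

# Hypothetical client-side loop for an interactive world model: the model takes
# the current world plus a user action and returns the next frame. None of these
# classes or methods correspond to a real Genie/DeepMind API.

class FakeInteractiveWorldModel:
    def create_world(self, prompt: str) -> dict:
        return {"prompt": prompt, "frame": 0}

    def step(self, world: dict, action: str):
        world["frame"] += 1
        return world, f"frame {world['frame']} after action '{action}'"


model = FakeInteractiveWorldModel()
world = model.create_world("a foggy coastal village with wooden piers")

actions = ["walk_forward", "turn_left", "walk_forward", "jump"]
frame_budget = 1 / 24  # 24 frames per second leaves roughly 42 ms per frame

for action in actions:
    start = time.perf_counter()
    world, frame = model.step(world, action)  # in a real system: run model inference here
    elapsed = time.perf_counter() - start
    print(frame, f"(generated in {elapsed * 1000:.1f} ms, budget {frame_budget * 1000:.0f} ms)")
```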
Vincent Sitzmann, an MIT assistant professor and expert on AI world modeling, explains that video generation models are essentially “proto-world models.” The progression from static images to video to fully interactive 3D worlds represents increasing levels of understanding about how reality works.
Google’s approach emphasizes generality. While Marble focuses on persistent, downloadable environments for creative professionals, and AMI Labs pursues architectural innovation with JEPA, DeepMind aims to build foundation models that can generate any type of interactive 3D environment on-the-fly: gaming worlds, scientific simulations, training environments for robots, or virtual spaces for human collaboration.
The competitive dynamics are fascinating. Google has the computational resources and research depth to push boundaries. Startups like World Labs move faster and focus on specific commercial applications. The winner won’t be determined by who has the best model in isolation, but who can translate technical capability into genuine business value.
NVIDIA’s Infrastructure Play: The Cosmos Platform
While others build world models, NVIDIA provides the plumbing. At CES 2025, the chip giant launched Cosmos, a platform for physical AI development specifically targeting autonomous vehicles and robotics.
By January 2026, the results spoke for themselves: over 2 million downloads of Cosmos world foundation models. The platform trained on 9,000 trillion tokens from 20 million hours of real-world data spanning human interactions, environments, industrial settings, robotics, and driving scenarios.
Cosmos comprises generative world foundation models, advanced tokenizers, guardrails, and an accelerated video processing pipeline. The models predict and generate physics-aware videos of future environment states, enabling synthetic training data generation at massive scale, which is critical for robotics and autonomous systems that need millions of training scenarios.
NVIDIA’s strategy is a classic platform play: provide the tools, infrastructure, and pre-trained models that everyone building world-aware AI systems will need. As companies race to deploy robots, autonomous vehicles, and embodied AI, Cosmos positions NVIDIA as the essential infrastructure provider, much as CUDA came to dominate deep learning training.
The 2 million download milestone, reached within a year of launch, validates the strategy. Robotics companies, autonomous vehicle developers, and research labs all need world models for simulation and training. Cosmos provides production-ready tools with the computational efficiency only NVIDIA’s specialized hardware can deliver.
Why Now? The Convergence of Multiple Breakthroughs
The world models explosion isn’t accidental. Several technological streams converged in late 2025 to make this moment possible:
Computational Efficiency: Training and running 3D-aware models requires vastly more compute than text generation. Advances in GPU efficiency, novel scene representations such as Gaussian Splatting, and training optimizations made real-time world generation feasible for the first time.
Architectural Innovation: Transformers reached their limits for certain tasks. New architectures, including JEPA, diffusion models adapted for 3D, and specialized spatial reasoning modules, enabled breakthroughs that pure scaling couldn’t achieve.
Data Availability: Years of video games, autonomous vehicle footage, robotics datasets, and 3D scanning created the training data foundation. NVIDIA’s Cosmos alone trained on 20 million hours of real-world observation.
Market Demand: The AI industry’s “demos to production” inflection point created pressure for systems that actually work in the real world. Text-only AI hit adoption limits; spatial intelligence addresses concrete use cases in robotics, autonomous systems, gaming, and design.
Research Maturity: Computer vision, 3D reconstruction, neural rendering, and physics simulation research compounded over decades. World Labs co-founder Ben Mildenhall noted that Marble represents “the integration and scaling of breakthroughs the computer vision community has had over the last decade.”
The timing crystallized in late 2025 when multiple labs simultaneously demonstrated commercial viability. TechCrunch’s analysis captures the mood: “If 2025 was the year AI got a vibe check, 2026 will be the year the tech gets practical. The focus is already shifting away from building ever-larger language models and toward the harder work of making AI usable.”
The Applications: From Gaming to AGI
World models unlock capabilities impossible with text-only AI:
Creative Industries
Game developers use world models to generate background environments, architectural spaces, and ambient settings at a fraction of traditional costs. VFX artists leverage persistent 3D outputs for frame-perfect camera control in film production. Architects and designers rapidly prototype spaces by describing them in natural language, then refining the generated 3D models.
Marble’s commercial launch targets these users specifically. As Justin Johnson, World Labs co-founder, explained: “It’s not designed to replace the entire existing pipeline for gaming, but to just give you assets that you can drop into that pipeline.”
Robotics and Autonomous Systems
Training robots in the real world is expensive and dangerous. World models enable massive-scale simulation. A robot learning to navigate can train in thousands of simulated environments, from cluttered warehouses to busy streets to disaster zones, before ever entering a physical space.
NVIDIA’s Cosmos specifically targets this application. The platform’s physics-aware generation creates training data that captures the complexity and unpredictability of real-world operation, which is essential for building robust autonomous systems.
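As a hedged sketch of how generated environments feed simulation-based training, the toy loop below creates a fresh environment per episode and rolls out a trivial navigation policy in it. The environment generator is a stand-in for a Cosmos- or Marble-style system; none of the names correspond to a real API, and no actual learning happens here.

```python
import random

# Stand-in for a world model that generates training environments on demand.
# A real system would return a full 3D scene; this toy returns only a
# parameterized "warehouse" described by its obstacle density.
def generate_environment(prompt: str) -> dict:
    return {"prompt": prompt, "obstacle_density": random.uniform(0.1, 0.6)}


def policy(env: dict) -> str:
    """Trivial navigation policy: slow down in cluttered environments."""
    return "slow" if env["obstacle_density"] > 0.3 else "fast"


def run_episode(env: dict, act) -> float:
    """Roll out a short episode and return an accumulated reward signal."""
    reward = 0.0
    for _ in range(100):
        action = act(env)
        # Denser obstacle layouts make cautious actions pay off more often.
        success = random.random() > env["obstacle_density"] * (0.5 if action == "slow" else 1.0)
        reward += 1.0 if success else -1.0
    return reward


prompts = ["cluttered warehouse aisle", "busy street crossing", "collapsed building interior"]
for episode in range(3):
    env = generate_environment(random.choice(prompts))
    print(f"episode {episode}: {env['prompt']:>28}  reward = {run_episode(env, policy):+.0f}")
```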
Scientific Simulation
Researchers could explore the inside of a human cell, simulate drug interactions at molecular scales, or model climate systems with unprecedented detail. The key requirement: accuracy. For entertainment, visual realism suffices. For science, faithfulness to real-world physics is paramount.
Fei-Fei Li envisions future applications: “If I’m a surgeon being trained to do laparoscopic surgery, I could be inside an intestine.” World Labs’ founders acknowledge the tradeoffs between realism and faithfulness but believe models will eventually provide both.
The Path to AGI
Perhaps most significantly, leading researchers view world models as essential for artificial general intelligence.
For years, the AI community debated whether language alone suffices for AGI. Large language models demonstrated impressive reasoning within the confines of text, but fundamental limitations emerged. They hallucinate facts, struggle with spatial reasoning, and lack common sense about physical reality.
The success of world models suggests language is necessary but insufficient. To truly understand the world, AI must understand physics, geometry, and causality, the substrate of reality itself. As one analysis notes: “To truly understand the world, an AI must understand that if you push a glass off a table, it will break, a concept that Marble’s physics-aware modeling aims to master.”
This milestone is being compared to the “ImageNet moment” of 2012, which Fei-Fei Li also spearheaded. Just as ImageNet provided the data needed to kickstart the deep learning revolution, spatial intelligence provides the geometric foundation needed to kickstart the AGI revolution.
The Challenges: Not All Smooth Sailing
Despite enormous progress, world models face significant hurdles:
Computational Cost: Generating and maintaining consistent 3D environments requires vastly more compute than text generation. Real-time interaction at scale remains expensive.
Accuracy vs. Realism: Beautiful visuals don’t guarantee physical accuracy. A simulation that looks perfect but violates physics is useless for robotics training or scientific research.
Data Efficiency: World models currently require massive training datasets. Improving few-shot or zero-shot generation remains an open challenge.
Blurring Reality: As AI-generated worlds become indistinguishable from real footage, concerns about manipulation, deepfakes, and misinformation intensify. The technology that creates training simulations for helpful robots could equally generate convincing fake videos for harmful purposes.
Commercial Viability: The gap between impressive demos and sustainable business models remains wide. Game developers remain cautious: a recent Game Developers Conference survey found that a third of respondents believed generative AI negatively impacts the industry, citing intellectual property theft and quality concerns.
The Competitive Landscape: Who Wins?
The race has just begun, but patterns are emerging:
World Labs leads in commercial deployment and creative tools. The $5 billion valuation under discussion reflects confidence that Marble’s approach of persistent, editable 3D environments addresses real customer needs today. Its focus on gaming, VFX, and design provides clear revenue paths.
AMI Labs bets on architectural innovation. If JEPA delivers on its promise of learning from less data and requiring less compute, it could leapfrog competitors. The €500 million raise signals serious ambition, but converting research breakthroughs into products remains uncertain.
Google DeepMind leverages research depth and computational resources. Genie 3’s real-time performance sets the technical bar. Integration with Google’s broader ecosystem, including cloud services, developer tools, and potential Android/AR applications, provides strategic advantages startups can’t match.
NVIDIA plays the infrastructure angle. Cosmos doesn’t compete with application-layer products; it powers them. The 2 million download milestone suggests every serious player needs NVIDIA’s tools, cementing the company’s position across the AI stack.
The ultimate winner might be “all of the above.” World models could fragment into specialized domains: entertainment (World Labs), robotics (NVIDIA Cosmos), general intelligence (AMI Labs/DeepMind), with different architectures optimal for different applications.
What This Means for the AI Industry
The shift from text-only LLMs to spatial intelligence represents more than new products. It signals the maturation of AI from narrow tools to systems with general understanding.
For Developers: Master 3D graphics, physics simulation, and spatial reasoning. The skills that dominated game development become central to AI engineering. Computer vision expertise moves from specialized niche to core competency.
For Enterprises: Evaluate use cases where spatial understanding adds value. Manufacturing, logistics, architecture, training simulations, and robotics become AI-ready in ways they weren’t with text-only models.
For Researchers: The transformer’s dominance is ending. Novel architectures, hybrid approaches, and fundamentally new learning paradigms are back on the table. The next major breakthrough might not be “bigger transformer” but something entirely different.
For Investors: Follow the funding. Over $1.3 billion flowing into world models in early 2026 signals where smart money believes the future lies. The question isn’t whether to invest in spatial intelligence, but which approach and which applications will dominate.
The Road Ahead: 2026 and Beyond
Industry experts see 2026 as a transition year. “The party isn’t over, but the industry is starting to sober up,” TechCrunch observes. The focus shifts from scaling language models to researching new architectures, from flashy demos to targeted deployments, from agents that promise autonomy to ones that actually work.
IBM’s Kevin Chung identifies three defining trends: AI shifting from individual usage to team and workflow orchestration; systems that don’t just follow instructions but anticipate needs; and democratization of AI agent creation beyond developers to everyday business users.
Peter Staar at IBM predicts “robotics and physical AI are definitely going to pick up” in 2026. While LLMs remain dominant, the industry recognizes diminishing returns from pure scaling. The next breakthroughs require new approaches, and world models are leading candidates.
The regulatory landscape will intensify. President Trump’s December 2025 executive order aimed to preempt state AI laws, sparking political warfare over who governs the technology. As world models enable increasingly powerful applications, from deepfakes to autonomous weapons, the stakes escalate.
But the potential outweighs the risks. Imagine surgeons training in photorealistic simulations, architects visualizing buildings before laying foundations, robots learning in virtual environments indistinguishable from reality, scientists exploring molecular interactions at unprecedented scales. These applications aren’t decades away; they’re emerging now.
As Fei-Fei Li wrote in her manifesto: “Spatial intelligence will transform how we create and interact with real and virtual worlds, revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond.”
The world models revolution isn’t coming. It’s here. And it’s rewriting the rules of what artificial intelligence can achieve.
Further Reading:
- Fei-Fei Li’s Manifesto: “From Words to Worlds: Spatial Intelligence is AI’s Next Frontier” (DrFeiFei on Substack)
- World Labs Case Study: “Marble: A Multimodal World Model” (World Labs)
- TechCrunch: “In 2026, AI will move from hype to pragmatism”
- MIT Technology Review: “What’s next for AI in 2026”