“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

Key Concepts

Visual spatial intelligence
Deep learning
Generative AI
3D computer vision
Neural Radiance Fields (NeRF)
Spatial Computing
Multimodal models
1D vs. 3D representation
World Generation
Augmented Reality (AR)
Robotics
Deep Tech

AI Evolution and Key Contributions

AI Winter to Cambrian Explosion: The discussion starts by highlighting the current exciting moment in AI, contrasting it with the "AI winter" of the past. The field has progressed from early AI to deep learning, and now to a "Cambrian explosion" with applications in text, pixels, videos, and audio.
Deep Learning's Rise: Justin Johnson describes his entry into AI around 2011-2012, driven by the "cat paper" from Google Brain. This paper demonstrated the power of combining generic learning algorithms with large amounts of compute and data.
ImageNet's Impact: Fei-Fei Li emphasizes the importance of data in driving AI advancements. She recounts the "crazy bet" on ImageNet, which scaled image datasets from thousands to millions, enabling breakthroughs in computer vision.
Transformers and Stable Diffusion: The conversation identifies the Transformers paper (attention mechanism) and Stable Diffusion as key algorithmic unlocks that have propelled the field forward.
Compute as a Key Enabler: Justin argues that the role of compute is often underestimated. He illustrates this with AlexNet (2012), a 60-million-parameter network trained on two GTX 580 GPUs for six days. The equivalent training run would now take under five minutes on a single GB200.
The Bitter Lesson: The "bitter lesson" is mentioned, emphasizing the importance of leveraging available compute. Algorithms should be designed to take advantage of increasing computational power.
Data Sources and Human Labeling: The discussion explores the role of data, particularly human labeling. While self-attention in Transformers is significant, the use of human-provided labels, as in the ImageNet dataset and the image-text pairs used to train CLIP, is also crucial.
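
The compute comparison above can be made concrete with some quick arithmetic. The sketch below uses only the figures quoted in the conversation (two GPUs for six days in 2012, versus under five minutes on one GB200). It treats five minutes as an upper bound, so the resulting ratios are lower bounds. The variable names are illustrative and not from the source.

```python
# Figures as quoted in the conversation.
# Assumption: "under five minutes" is treated as exactly five minutes,
# so the speedups computed here are lower bounds.
ALEXNET_GPUS = 2
ALEXNET_DAYS = 6
MODERN_GPUS = 1
MODERN_MINUTES = 5

MINUTES_PER_DAY = 24 * 60

# Wall-clock time and total GPU time for the 2012 run, in minutes.
alexnet_wall_minutes = ALEXNET_DAYS * MINUTES_PER_DAY
alexnet_gpu_minutes = alexnet_wall_minutes * ALEXNET_GPUS

# Total GPU time for the modern run, in minutes.
modern_gpu_minutes = MODERN_MINUTES * MODERN_GPUS

wall_clock_speedup = alexnet_wall_minutes / MODERN_MINUTES
per_gpu_speedup = alexnet_gpu_minutes / modern_gpu_minutes

print(f"Wall-clock speedup: at least {wall_clock_speedup:.0f}x")
print(f"Per-GPU speedup:    at least {per_gpu_speedup:.0f}x")
```

Six days is 8,640 minutes, so the wall-clock ratio is at least 1,728 to 1, and the per-GPU ratio is twice that because the original run used two cards.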

Generative AI

Evolution of Generative Models: Fei-Fei notes that generative models have existed for a long time, but their output was not impressive until recently. Justin's PhD work is highlighted as a mini-story of the field's trajectory, from image retrieval to style transfer to generative image creation.
Real-time Style Transfer: Justin's work on real-time style transfer, based on Leon Gatys's paper, is mentioned as an example of academic work having industry impact.
Image Generation from Scene Graphs: Justin's PhD work on generating images from scene graphs (inputting language and getting a whole picture out) is cited as an early example of generative AI.

Spatial Intelligence and World Labs

Fei-Fei's Journey to Spatial Intelligence: Fei-Fei describes her intellectual journey and passion for visual intelligence. She believes that visual spatial intelligence is fundamental for intelligent beings (humans, robots, etc.) to perceive, reason, and interact with the world.
Defining Spatial Intelligence: Spatial intelligence is defined as the ability of machines to perceive, reason, and act in 3D space and time. It involves understanding how objects and events are positioned in 3D space, how interactions affect those positions, and the ability to generate and interact with 3D environments.
Why Now is the Right Time: Fei-Fei argues that the moment is right to focus on spatial intelligence due to advancements in compute, data understanding, and algorithms (including NeRF).
Justin's Perspective on New Data: Justin explains that his interest in spatial intelligence stemmed from the realization that the next decade of AI would be about understanding new data from sensors positioned in the 3D world.
NeRF as a Breakthrough: Ben Mildenhall's NeRF paper is identified as a significant breakthrough in recovering 3D structure from 2D observations.
Merging Reconstruction and Generation: Fei-Fei notes that NeRF has led to a merging of 3D reconstruction and generative methods in computer vision.

Spatial Intelligence vs. Language Models

1D vs. 3D Representation: Justin emphasizes the core difference between language models (which use a 1D representation) and spatial intelligence (which prioritizes a 3D representation). Language models shoehorn other modalities into a 1D sequence of tokens, while spatial intelligence focuses on the 3D nature of the world.
Generated vs. Real-World Signals: Fei-Fei argues that language is a purely generated signal, while the 3D world has its own structures and laws of physics. Extracting and representing information from the 3D world is a fundamentally different problem.
Affordances and User Experience: Justin explains that even if the final output is a 2D image or video, using a 3D representation under the hood can enable better affordances for users, such as moving objects or the camera around.
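
The affordance argument above can be illustrated with a toy example: when the scene is stored in 3D, moving the camera is just a re-projection of the same data. This is a minimal sketch assuming a pinhole camera that looks down the +z axis with no rotation. The scene, the function name `project`, and all parameters are hypothetical and do not come from the conversation.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], camera_position: Point3D,
            focal_length: float = 1.0) -> List[Point2D]:
    """Pinhole projection for a camera looking down +z with no rotation."""
    cx, cy, cz = camera_position
    image = []
    for x, y, z in points:
        depth = z - cz
        if depth <= 0:
            continue  # the point is behind the camera, so it is not drawn
        image.append((focal_length * (x - cx) / depth,
                      focal_length * (y - cy) / depth))
    return image


# Hypothetical scene: one near point and one far point.
scene = [(0.0, 0.0, 2.0), (1.0, 0.0, 4.0)]

# The same 3D scene rendered from two camera positions.
view_a = project(scene, camera_position=(0.0, 0.0, 0.0))
view_b = project(scene, camera_position=(1.0, 0.0, 0.0))

print(view_a)
print(view_b)
```

Shifting the camera one unit to the right moves the near point further across the image than the far point. That parallax falls out of the 3D representation for free, whereas a model that only stores 2D pixels would have to synthesize it.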

Use Cases for Spatial Intelligence

World Generation: Generating full simulated, vibrant, and interactive 3D worlds for gaming, virtual photography, education, and other applications.
New Form of Media: Spatial intelligence could enable a new form of media by reducing the cost of creating detailed virtual worlds, making them accessible for niche applications.
Augmented Reality (AR): Spatial intelligence is essential for AR devices to understand the real world and seamlessly blend virtual content with it.
Robotics: Spatial intelligence is crucial for connecting the digital brains of robots with the 3D physical world, enabling them to perform tasks.
Deprecating Screens: The ability to seamlessly blend virtual content with the physical world could reduce or eliminate the need for multiple physical screens.

Team and Future

Deep Tech vs. Application Areas: World Labs is positioned as a deep tech company building a platform that can serve different use cases.
Team Construction: The company focuses on assembling a multidisciplinary team of experts in AI, computer vision, computer graphics, systems engineering, and other areas.
North Star and Long-Term Vision: The ultimate goal is for many people and businesses to use World Labs' models to meet their spatial intelligence needs. The journey is expected to take the company to places that cannot even be imagined today.

Notable Quotes

Justin: "You can get these amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data, and magic things started to happen."
Fei-Fei: "Visual spatial intelligence is so fundamental, it's as fundamental as language, possibly more ancient and more fundamental in certain ways."
Justin: "The universe is a giant evolving four-dimensional structure, and spatial intelligence writ large is just understanding that in all of its depths and figuring out all the applications to that."

Technical Terms

Visual Spatial Intelligence: The ability of machines to perceive, reason, and act in 3D space and time.
Deep Learning: A type of machine learning that uses artificial neural networks with multiple layers to analyze data.
Generative AI: A type of AI that can generate new content, such as images, text, or audio.
3D Computer Vision: A field of computer vision that deals with understanding and reconstructing 3D scenes from images or videos.
Neural Radiance Fields (NeRF): A technique for representing 3D scenes as continuous functions, allowing for photorealistic rendering from novel viewpoints.
Spatial Computing: A type of computing that involves interacting with digital information in a 3D space.
Multimodal Models: AI models that can process and integrate information from multiple modalities, such as text, images, and audio.
1D vs. 3D Representation: Refers to the underlying data structure used by AI models to represent the world. Language models typically use a 1D sequence of tokens, while spatial intelligence models use a 3D representation.
World Generation: The process of creating virtual 3D worlds using AI.
Augmented Reality (AR): A technology that overlays digital information onto the real world.
Deep Tech: Companies that are based on fundamental scientific or engineering breakthroughs.
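
The NeRF entry above describes a scene as a continuous function that can be rendered from new viewpoints. The sketch below shows the rendering half of that idea: a pixel's color is accumulated by sampling density and color along a camera ray. It is an illustration under stated assumptions, not the published method. The learned neural network is replaced by a hand-written toy field (a dense red sphere), view-dependent color is ignored, samples are evenly spaced, and the names `toy_field` and `render_ray` are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def toy_field(p: Vec3) -> Tuple[float, Vec3]:
    """Stand-in for the learned network: returns (density, rgb) at point p.

    Assumption: a dense red sphere of radius 1 centred at the origin.
    """
    inside = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) < 1.0
    return (10.0 if inside else 0.0), (1.0, 0.0, 0.0)


def render_ray(origin: Vec3, direction: Vec3, near: float, far: float,
               n_samples: int = 64) -> Vec3:
    """Approximate the volume rendering integral with evenly spaced samples."""
    delta = (far - near) / n_samples
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    color = [0.0, 0.0, 0.0]
    for i in range(n_samples):
        t = near + (i + 0.5) * delta  # midpoint of the i-th ray segment
        p = tuple(o + t * d for o, d in zip(origin, direction))
        sigma, rgb = toy_field(p)
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        for k in range(3):
            color[k] += transmittance * alpha * rgb[k]
        transmittance *= 1.0 - alpha
    return (color[0], color[1], color[2])


# One ray through the sphere's centre, and one that passes above it.
hit = render_ray(origin=(0.0, 0.0, -3.0), direction=(0.0, 0.0, 1.0),
                 near=0.0, far=6.0)
miss = render_ray(origin=(0.0, 2.0, -3.0), direction=(0.0, 0.0, 1.0),
                  near=0.0, far=6.0)

print(hit)
print(miss)
```

The ray through the sphere comes back almost fully red, and the ray that misses it comes back black. In an actual NeRF, the field would be a neural network fitted so that rays rendered this way reproduce a set of input photographs.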

Synthesis/Conclusion

The conversation provides a comprehensive overview of the evolution of AI, culminating in the current focus on spatial intelligence. It highlights the importance of compute, data, and algorithmic advancements in driving progress. The discussion emphasizes the fundamental differences between language models and spatial intelligence, particularly in terms of representation and the nature of the data being processed. The potential use cases for spatial intelligence are vast, ranging from world generation to augmented reality and robotics. World Labs, with its multidisciplinary team and deep tech approach, aims to unlock the full potential of spatial intelligence and create a new era of computing.