State of Robotics Data (May 2026)

A field report on the data that robots are trained on.

Vanessa Li · May 2026

Software was eating the world; now AI is eating software. The next frontier will be physical AI. We care not just about what AI can say, but about what AI can do, physically, in the real world.

Waymo crossed 450,000 weekly paid rides. NVIDIA declared that "the ChatGPT moment for robotics is here". Sora was discontinued in April 2026, and most of the people on that team were video experts. Where do you think their talent was reallocated? My guess is world models, in service of robotics. Video is essential for training robotic manipulation. Forget about making silly AI video slop to make people laugh; there are better quests, such as advancing humanoid robots.

OpenAI trained GPT on trillions of tokens of essentially free text. To teach a humanoid robot to reliably fold a shirt, however, a frontier lab needs roughly 10,000 to 20,000 hours of annotated teleoperation data, billed at north of $100/hour, every minute of it physically produced by a human piloting a real robot through a real room. There is no physical data to scrape from the web. The result is that robotics research has become one of the most capital-intensive corners of AI, and a quiet race for high-quality training data is now underway across every frontier lab.
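A quick back-of-envelope calculation makes the scale concrete. This sketch reads the hour estimate above as 10,000 to 20,000 hours and uses $100/hour as a floor, so the totals are lower bounds for a single skill; the function name is just for illustration.

```python
# Back-of-envelope cost of teleoperation data for one skill.
# The hour range and the $100/hour floor come from the text; because the
# text says "north of $100/hour", these totals are lower bounds.

HOURS_LOW = 10_000
HOURS_HIGH = 20_000
RATE_USD_PER_HOUR = 100  # floor; actual billing is stated to be higher


def teleop_cost(hours: int, rate: float = RATE_USD_PER_HOUR) -> float:
    """Return the total collection cost in USD for a given number of hours."""
    return hours * rate


low = teleop_cost(HOURS_LOW)
high = teleop_cost(HOURS_HIGH)
print(f"Lower-bound cost range: ${low:,.0f} to ${high:,.0f}")
```

That is at least $1 million to $2 million for one task, before hardware, staffing overhead, or any second skill.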

The three main types of robotics data being used in training:


Teleoperation

Teleoperation remains the industry gold standard. Major research labs draw 95%+ of their training data from teleoperation rather than the other two types. Unfortunately, the best kind of data comes at a high price. Teleop data is the hardest and most expensive to collect because it requires leader-arm and follower-arm hardware plus dedicated staff to run the operation. You will become a billionaire if you can find a way to scale up teleoperation data collection. Unfortunately, I don't have any good ideas.

Most teleop data companies rely on labor arbitrage in developing countries. Hardware costs in those countries also tend to be lower, so teleop data is cheapest to collect outside the US. The problem, ethical considerations aside, is that the data farms are too far from the American labs buying the data. Researchers want to iterate quickly, catch problems with a collection run as early as possible, and receive data as soon as they ask for it. Outsourcing to external laborers doubles the collection timeline: the provider first has to produce samples of the task for validation, ship the correct hardware to the farm, then collect and post-process the data. On top of that, every robotics company uses different grippers. Standardization remains an unsolved problem.

Egocentric video

This is the most abundant type of data. Everybody and their mother can collect egocentric video as long as they have a camera. Data collectors design their own headsets to film workers' hands and wrists in various settings. These videos are then used to train VLA models.

The unfortunate thing about ego data is that it is the least useful. Because of the way VLA models digest data, even small variations in pixel values can hurt training quality. Videos are fed into Vision-Language-Action (VLA) models by splitting them into discrete frames, passing those frames through specialized encoders to extract visual and motion features, and then fusing these features with language commands and robot states to predict physical actions. This means that if the background of the video changes, the same action filmed against a different background might not transfer. Any difference in lighting can dramatically affect the training outcome.
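The flow described above (frames, encoders, fusion with language and robot state, action) can be sketched in a few lines. This is a toy, not a real VLA: every function below is a hypothetical stand-in, the "encoders" are simple pixel statistics, and the weights are arbitrary rather than trained. It only shows why a pure lighting change, with the motion left identical, moves the output.

```python
# Toy illustration of the frame -> encoder -> fusion -> action flow.
# Every function here is a hypothetical stand-in, not a real VLA component.

Frame = list[float]  # one flattened grayscale frame, pixel values in [0, 1]


def encode_frame(frame: Frame) -> list[float]:
    """Stand-in visual encoder: mean brightness and a crude contrast value."""
    mean = sum(frame) / len(frame)
    contrast = max(frame) - min(frame)
    return [mean, contrast]


def encode_motion(prev: Frame, curr: Frame) -> list[float]:
    """Stand-in motion encoder: mean absolute pixel change between frames."""
    return [sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)]


def embed_instruction(text: str) -> list[float]:
    """Stand-in language encoder: one deterministic number per command."""
    return [(sum(ord(c) for c in text) % 100) / 100]


def predict_action(video: list[Frame], instruction: str,
                   robot_state: list[float]) -> float:
    """Fuse visual, motion, language, and state features into one action value."""
    visual = encode_frame(video[-1])
    motion = encode_motion(video[-2], video[-1])
    fused = visual + motion + embed_instruction(instruction) + robot_state
    weights = [0.8, -0.3, 1.5, 0.2, 0.5]  # arbitrary fixed weights, not trained
    return sum(w * x for w, x in zip(weights, fused))


video = [[0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.6, 0.9]]
# Same clip, same motion, but every pixel is 0.1 brighter.
brighter = [[min(1.0, p + 0.1) for p in frame] for frame in video]

base = predict_action(video, "pick up the cup", [0.5])
shifted = predict_action(brighter, "pick up the cup", [0.5])
print(f"original lighting: {base:.3f}, brighter lighting: {shifted:.3f}")
```

The motion feature is unchanged between the two clips, yet the predicted action value shifts because the brightness feature feeds straight into the output. Real encoders are far more robust than a mean-brightness statistic, but the dependence on raw pixels is the same in kind.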

Most of the ego video supply comes from India and China, at very low labor costs, and pricing is approaching commodity levels. Around 2023-2024, when ego-video data vendors first started selling to robotics labs, they typically priced raw footage at $5-10/hr.

As more and more players rush into this gold mine, ego video vendors face enormous competition while demand stays flat. By April 2026, supply had outrun demand. At this rate, ego video data pricing will settle well under $1/hr by 2027.

Ego video data pricing over time
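To put a number on that slide, here is the implied yearly price decline. Two assumptions that are mine, not measured: the $5-10/hr starting range is anchored at 2024, and "well under $1/hr" is treated as at most $1/hr in 2027. The true decline would be steeper if prices land further below $1.

```python
# Implied annual price decline for raw ego video.
# Assumptions (not from market data): the $5-10/hr range is anchored at 2024,
# and "well under $1/hr" is treated as at most $1/hr in 2027.

def annual_decline(start_price: float, end_price: float, years: int) -> float:
    """Constant yearly fractional decline that takes start_price to end_price."""
    return 1 - (end_price / start_price) ** (1 / years)


YEARS = 2027 - 2024
for start in (5.0, 10.0):
    rate = annual_decline(start, 1.0, YEARS)
    print(f"${start:.0f}/hr -> $1/hr over {YEARS} years: {rate:.0%} per year")
```

Under those assumptions, prices fall by roughly 40 to 55 percent every year, which is commodity behavior.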

Vendors are racing to differentiate themselves and trying to sell premium data at a higher price.

Differentiating Factors for Ego Video Data

Multimodal data

The promise of multimodal data is that it makes the training data more informative by capturing more signals per recording. Data collectors are adding sensors such as IMUs, pressure sensors, and IR sensors alongside RGB cameras to extract more information about the movement. These streams then need to be calibrated. The challenge with multimodal data is that we're not sure whether it is actually useful in training. More input signals don't necessarily lead to better outcomes.

The ability to get into hard-to-reach environments

One of the biggest strengths of egocentric video is environmental variety. Anyone with a head-mounted camera can capture a task wherever it naturally happens, and that diversity is the real value proposition. A model needs to see thousands of different kitchens, warehouses, hospital rooms, and workshops to generalize. Startups that can get a camera into environments labs can't reach on their own have the most defensible moat. Currently, there seems to be a lack of US-based data. It is easier to access settings in countries with fewer privacy concerns, so the market is flooded with data from developing countries. US-based data is more valuable not only because it is scarcer and more expensive to collect, but also because deployment will likely happen first in the US. Kitchen layouts in India look very different from those in the US; even the sizes of household appliances differ. US-based data will be necessary.

How ego data will become more useful

Geometric Foundation Models (GFMs) have shown promising results, and they are the strongest reason to think egocentric video isn't a dead end. They take ordinary RGB images or video and directly predict 3D structure without needing depth sensors, which might turn cheap, abundant ego video into something labs can actually use. Because today's VLA models are brittle to pixel-level variance, I am bullish that geometry-based policies will displace pure pixel-to-action VLAs as the dominant design.

Synthetic Data

The pitch is irresistible: spin up a robot in a simulator like NVIDIA's Isaac Sim, run thousands of trials in parallel overnight, and generate perfectly labeled trajectories at near-zero marginal cost, with no hardware to ship or operators to pay. The challenge is the sim-to-real gap. A policy trained in simulation tends to fail when deployed, because even state-of-the-art physics engines can't fully reproduce the complexity of the real world, friction in particular. The field's response has been less about achieving perfect realism and more about engineering around the gap. World foundation models like NVIDIA's Cosmos add photorealistic textures, weather, and sensor noise on top of simulated scenes to make them harder to distinguish from reality, but the current consensus is that synthetic data works best as a complement rather than a replacement. Synthetic data is excellent for perception and navigation, but poor for manipulation skills.
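One common way of "engineering around the gap" is domain randomization: perturbing simulated observations so a policy never sees the same clean render twice. The sketch below is a hand-rolled toy version that varies lighting and adds sensor noise to a tiny grayscale frame. It does not use Isaac Sim or Cosmos, and the function name, gain range, and noise level are all assumptions chosen for illustration.

```python
import random

# Toy augmentation of a simulated frame with lighting and sensor-noise variation.
# This is a hand-rolled stand-in; it does not use Isaac Sim or Cosmos.


def randomize_frame(frame, rng, gain_range=(0.7, 1.3), noise_std=0.02):
    """Return a copy of `frame` with a random brightness gain and Gaussian noise."""
    gain = rng.uniform(*gain_range)
    noisy = (p * gain + rng.gauss(0.0, noise_std) for p in frame)
    return [min(1.0, max(0.0, p)) for p in noisy]  # clamp to valid pixel range


rng = random.Random(0)            # seeded so runs are repeatable
sim_frame = [0.2, 0.4, 0.6, 0.8]  # flattened grayscale frame, values in [0, 1]
variants = [randomize_frame(sim_frame, rng) for _ in range(5)]
for v in variants:
    print([round(p, 3) for p in v])
```

Each variant shows the same scene under different lighting and noise. This helps with appearance, which fits the point that synthetic data does well for perception, but it does nothing for contact physics like friction, which is where manipulation breaks.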

Me picking up an IKEA bear. May 2026, SF.

UMI (Universal Manipulation Interface) data sits in an odd spot. It has the collection style of ego video and the output quality of teleop. A GoPro and a 3D-printed gripper are the whole rig. But you don't walk away with just raw footage. A processing step runs SLAM over the video to recover the gripper's full 6-DoF path, so the action labels are reconstructed, not measured. UMI is promising, but it isn't the proven default for now.
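To make "6-DoF path" concrete: each pose is three position values plus three orientation values, and a recovered path can be turned into per-step motion by differencing consecutive poses. The sketch below is my own illustration, not the UMI pipeline. The `Pose6DoF` class, the roll/pitch/yaw representation, and the sample path are all hypothetical, and the naive per-axis angle differences are only reasonable for small steps.

```python
from dataclasses import dataclass
import math


@dataclass
class Pose6DoF:
    """Gripper pose: position in meters, orientation as roll/pitch/yaw in radians."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float


def wrap_angle(a: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def pose_deltas(path: list[Pose6DoF]) -> list[tuple[float, ...]]:
    """Per-step differences between consecutive poses (naive, per axis)."""
    deltas = []
    for prev, curr in zip(path, path[1:]):
        deltas.append((
            curr.x - prev.x, curr.y - prev.y, curr.z - prev.z,
            wrap_angle(curr.roll - prev.roll),
            wrap_angle(curr.pitch - prev.pitch),
            wrap_angle(curr.yaw - prev.yaw),
        ))
    return deltas


# Hypothetical three-pose path, standing in for SLAM output.
path = [
    Pose6DoF(0.00, 0.00, 0.10, 0.0, 0.0, 3.10),
    Pose6DoF(0.02, 0.00, 0.10, 0.0, 0.0, -3.10),  # yaw crosses the +/-pi seam
    Pose6DoF(0.04, 0.01, 0.08, 0.0, 0.1, -3.00),
]
for d in pose_deltas(path):
    print([round(v, 3) for v in d])
```

Because every pose comes from SLAM rather than from joint encoders, any tracking error in the video flows straight into these deltas. That is the practical meaning of "reconstructed, not measured".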

The race in robotics has quietly shifted. The bottleneck is high-quality data. Teleoperation will stay the gold standard, but it can't scale on labor arbitrage alone. Ego video is cheap and abundant but only becomes useful once geometric foundation models strip away the pixel-level noise that breaks today's policies. And synthetic data, for all its appeal, remains a complement rather than a substitute for the real thing. The companies that crack this will decide which humanoids actually fold the shirt, and which stay expensive demos.