Building a Data Pipeline for pi0, OpenVLA, and Octo

What enterprise robotics teams need to know about data collection and formatting requirements for the leading open-source VLA and generalist robot policy models - pi0, OpenVLA, and Octo.

8 min read | By the DataX Power team

Why model architecture determines data collection requirements

The emergence of generalist robot policy models - pi0 from Physical Intelligence, OpenVLA from Stanford and Berkeley, and Octo from the Berkeley Robot Learning Lab - has changed the data collection calculus for enterprise robotics teams. These models can be fine-tuned from pre-trained checkpoints, which means a team does not need to collect from scratch to achieve competitive performance on a target task set.

But fine-tuning is not zero-shot. Each model was pre-trained on a specific data distribution with specific sensor configurations, action space representations, and episode structures. Collection programs that do not match those assumptions produce fine-tuning data that fights against the pre-trained representations rather than extending them.

This guide covers the data collection requirements specific to pi0, OpenVLA, and Octo - including sensor configuration, action space representation, episode structure, and the format each model's training pipeline expects.

1. pi0 data requirements

pi0 is a flow-matching policy model from Physical Intelligence built on a vision-language backbone (PaliGemma). The architecture processes language instructions, RGB observations from multiple cameras (typically wrist and overhead), and proprioceptive state, and outputs continuous action trajectories rather than discrete action tokens.

Data collection for pi0 fine-tuning requires multi-camera RGB - at minimum a wrist camera plus an overhead or front-facing camera covering the scene. Single-camera programs cannot provide the complementary views that pi0's architecture relies on. The wrist camera should be positioned for a clear view of the manipulator and the task objects; the overhead or front camera should capture the full workspace, including the robot base.

Action representation for pi0 uses end-effector delta positions and gripper state in a continuous action space. Collection programs using teleoperation should record the full kinematic state at each timestep - not just keyframes - because pi0 is designed for control rates of up to 50 Hz and needs dense temporal coverage to learn smooth trajectories.
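The dense-recording requirement can be checked mechanically before data is packaged. The following is a minimal sketch, assuming a hypothetical log of `(timestamp, x, y, z, gripper)` rows; the row layout, the function names, and the tolerance value are illustrative and are not part of any pi0 tooling.

```python
from typing import List, Tuple

# Hypothetical log row: (timestamp_s, x, y, z, gripper). Not a pi0-defined format.
Sample = Tuple[float, float, float, float, float]


def to_delta_actions(samples: List[Sample]) -> List[Tuple[float, float, float, float]]:
    """Convert absolute end-effector positions into (dx, dy, dz, gripper) actions."""
    actions = []
    for prev, nxt in zip(samples, samples[1:]):
        dx, dy, dz = (nxt[i] - prev[i] for i in (1, 2, 3))
        # The gripper command is taken from the later of the two samples.
        actions.append((dx, dy, dz, nxt[4]))
    return actions


def find_gaps(samples: List[Sample], target_hz: float = 50.0,
              tolerance: float = 1.5) -> List[int]:
    """Return each index i where the interval between sample i and i + 1
    exceeds tolerance / target_hz, i.e. where the recording dropped data."""
    max_dt = tolerance / target_hz
    return [i for i, (a, b) in enumerate(zip(samples, samples[1:]))
            if b[0] - a[0] > max_dt]
```

Running `find_gaps` over every episode gives a quick way to reject recordings that were logged at keyframes only, or that dropped samples during teleoperation.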

Language instruction pairing is required for pi0 fine-tuning. Each demonstration episode needs a natural language instruction that describes the task. The instruction does not need to be complex - "pick up the red cup and place it on the plate" is sufficient - but it must be consistent with the task being demonstrated and must be recorded per episode, not per batch.
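One way to enforce the per-episode instruction and multi-camera requirements is to validate each episode record as it is written. The sketch below assumes a hypothetical `Episode` record and camera names (`wrist`, `overhead`); neither is prescribed by pi0, and a real program would adapt both to its own rig.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    episode_id: str
    instruction: str                     # stored on the episode, not on the batch
    camera_frames: Dict[str, List[str]]  # camera name -> ordered frame paths


def validate_episode(ep: Episode,
                     required_cameras: Sequence[str] = ("wrist", "overhead")) -> List[str]:
    """Return a list of human-readable problems; an empty list means the episode passes."""
    problems = []
    if not ep.instruction.strip():
        problems.append("missing language instruction")
    for cam in required_cameras:
        if not ep.camera_frames.get(cam):
            problems.append(f"no frames from camera '{cam}'")
    # All cameras should cover the same timesteps.
    lengths = {len(frames) for frames in ep.camera_frames.values()}
    if len(lengths) > 1:
        problems.append("cameras have different frame counts")
    return problems
```

Returning a list of problems rather than raising on the first one lets a collection operator see everything wrong with an episode in a single pass.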

2. OpenVLA data requirements

OpenVLA is a 7B-parameter vision-language-action model built on Prismatic VLMs. It tokenizes actions into discrete bins and outputs action tokens autoregressively, which makes it compatible with standard language model training infrastructure but creates specific requirements for action binning at collection time.

OpenVLA was pre-trained on the Open X-Embodiment dataset, which covers a wide range of robot embodiments and tasks but is dominated by tabletop manipulation with single overhead cameras. Fine-tuning data from single overhead-camera programs stays close to this distribution and benefits most from the pre-trained representations. Programs using non-standard camera configurations should expect longer fine-tuning to overcome the distributional shift.

The most important collection requirement for OpenVLA fine-tuning is action space normalization. OpenVLA discretizes actions into 256 bins per dimension, with bin boundaries calculated from the training data distribution. For fine-tuning data to integrate correctly, the action space range in the collection program must be compatible with the pre-trained bin boundaries - or the fine-tuning must recompute boundaries from the full combined dataset. This is a data pipeline engineering step that the ML team needs to plan before collection begins.
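The binning step can be illustrated in a few lines. This is a sketch under stated assumptions, not OpenVLA's own implementation: it assumes bin boundaries that uniformly divide the interval between the 1st and 99th percentiles of each action dimension, and the helper names are hypothetical. Substitute whatever statistics ship with the checkpoint you fine-tune.

```python
from typing import Sequence, Tuple

N_BINS = 256  # bins per action dimension, as stated above


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bin_bounds(values: Sequence[float], low_q: float = 1.0,
               high_q: float = 99.0) -> Tuple[float, float]:
    """Assumed boundary rule: the low_q-th and high_q-th percentiles of one dimension."""
    return percentile(values, low_q), percentile(values, high_q)


def discretize(value: float, low: float, high: float, n_bins: int = N_BINS) -> int:
    """Map a continuous value to a bin index in [0, n_bins - 1], clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def clipped_fraction(values: Sequence[float], low: float, high: float) -> float:
    """Share of samples outside [low, high]."""
    return sum(1 for v in values if v < low or v > high) / len(values)
```

A large `clipped_fraction` when new data is scored against the pre-trained bounds is a sign that the collected action range is incompatible and that the boundaries should be recomputed from the combined dataset.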

Episode quality for OpenVLA is more sensitive to action noise than pi0 because discretization amplifies noise at bin boundaries. Collection programs using teleoperation should target smooth, consistent demonstrator behavior and filter noisy episodes before packaging for training.
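Filtering noisy episodes needs some measurable notion of smoothness. The sketch below uses the mean absolute second difference of the action sequence as a roughness score; this metric and the threshold are illustrative choices rather than anything OpenVLA specifies, and a threshold would have to be tuned per robot and per recording rate.

```python
from typing import List, Sequence

Actions = Sequence[Sequence[float]]  # one action vector per timestep


def roughness(actions: Actions) -> float:
    """Mean absolute second difference across all action dimensions.
    Smooth, constant-velocity motion scores near zero; jitter scores high."""
    if len(actions) < 3:
        return 0.0
    total, count = 0.0, 0
    for a, b, c in zip(actions, actions[1:], actions[2:]):
        for x0, x1, x2 in zip(a, b, c):
            total += abs(x2 - 2 * x1 + x0)
            count += 1
    return total / count


def filter_smooth(episodes: Sequence[Actions], max_roughness: float) -> List[Actions]:
    """Keep only the episodes whose roughness is at or below the threshold."""
    return [ep for ep in episodes if roughness(ep) <= max_roughness]
```

Plotting the roughness distribution across a pilot batch is usually a more defensible way to choose `max_roughness` than picking a value up front.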

3. Octo data requirements

Octo is a transformer-based generalist policy model that accepts multimodal inputs (language, images from arbitrary camera configurations) and outputs action sequences. Its architecture is designed for flexibility in sensor configuration, which makes it more tolerant of collection programs that do not exactly match the pre-training distribution.

Octo was trained on the Open X-Embodiment dataset and uses a diffusion policy action head for continuous control. Fine-tuning requires episodes in the RLDS (Reinforcement Learning Datasets) format - specifically the tensorflow_datasets format used in Open X-Embodiment. Data that is not in this format requires a preprocessing conversion step that adds engineering overhead but is well-supported by the Octo codebase.
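To make the target structure concrete, the sketch below assembles one episode as a nested dictionary with the standard RLDS step fields (`observation`, `action`, `reward`, `discount`, `is_first`, `is_last`, `is_terminal`). The observation keys, the per-step `language_instruction` field, and the `episode_metadata` entry are assumptions modeled on common Open X-Embodiment conventions; check the dataset configuration you fine-tune against. The sketch deliberately stops short of `tensorflow_datasets`: a real conversion would emit these dictionaries from a dataset builder.

```python
from typing import Any, Dict, List, Sequence


def to_rlds_episode(frames: Sequence[Any], states: Sequence[Any],
                    actions: Sequence[Any], instruction: str,
                    success: bool) -> Dict[str, Any]:
    """Assemble one episode in an RLDS-style layout (plain Python, no TFDS)."""
    if not (len(frames) == len(states) == len(actions)):
        raise ValueError("frames, states, and actions must have the same length")
    if not actions:
        raise ValueError("episode has no steps")
    last = len(actions) - 1
    steps: List[Dict[str, Any]] = []
    for t, (frame, state, action) in enumerate(zip(frames, states, actions)):
        steps.append({
            # Hypothetical observation keys; real datasets name these differently.
            "observation": {"image_primary": frame, "proprio": state},
            "action": action,
            "reward": float(success and t == last),  # sparse reward on the final step
            "discount": 1.0,
            "is_first": t == 0,
            "is_last": t == last,
            "is_terminal": bool(success and t == last),
            "language_instruction": instruction,
        })
    return {"steps": steps, "episode_metadata": {"success": success}}
```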

Camera configuration for Octo fine-tuning is more flexible than for pi0 or OpenVLA because the model architecture handles variable numbers of camera inputs through attention. Programs with a single overhead camera, a single wrist camera, or a multi-camera setup can all be used for fine-tuning without architectural changes.

Action chunking - predicting sequences of actions rather than single-step actions - is supported by Octo and can improve fine-tuning quality for programs where demonstration actions have natural temporal structure. This is a training configuration choice rather than a collection requirement, but it is worth noting that collection programs should record at sufficient temporal resolution (minimum 5Hz, ideally 10-30Hz) to support action chunking configurations.
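The link between recording rate and chunking is simple arithmetic: a chunk of H actions recorded at f Hz spans H / f seconds of motion. The sketch below is a generic illustration of building chunk targets from a recorded action sequence; it is not Octo's data loader, and the function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

A = TypeVar("A")


def make_chunks(actions: Sequence[A], horizon: int) -> List[Sequence[A]]:
    """Pair each timestep t with the `horizon` actions starting at t.
    Timesteps too close to the end to fill a whole chunk are dropped."""
    return [actions[t:t + horizon] for t in range(len(actions) - horizon + 1)]


def chunk_duration_s(horizon: int, record_hz: float) -> float:
    """Wall-clock time covered by one chunk at the given recording rate."""
    return horizon / record_hz
```

For example, a 4-step chunk recorded at 5 Hz spans 0.8 seconds, while the same chunk at 20 Hz spans 0.2 seconds, which is why low recording rates limit how fine-grained a chunked policy can be.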

4. Shared requirements across all three models

Despite their architectural differences, pi0, OpenVLA, and Octo share several data collection requirements that apply to any fine-tuning program.

Episode completeness: each demonstration episode should include the full task trajectory from initial state through task completion. Partial episodes - where collection stopped before task completion, or where the demonstrator failed and the episode was not discarded - add noise to fine-tuning and degrade policy performance on the target task.

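Proprioceptive state recording: pi0 takes robot proprioceptive state (for example joint positions, velocities, and end-effector pose) as part of the observation, and Octo can take it as an additional input during fine-tuning. The released OpenVLA checkpoints condition on a single image and a language instruction only, but proprioceptive logs are still needed to derive and verify action labels. Collection programs using teleoperation must therefore log proprioceptive state at the same frequency as the visual observations. Programs that record video only, without proprioceptive state, cannot be used for fine-tuning any of these models.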
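Logging at the same frequency only helps if the two streams can be matched sample for sample. A minimal sketch of that check follows, assuming each stream carries its own sorted timestamps in seconds; the function names and the 10 ms skew bound are illustrative assumptions.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(sorted_times: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t. Assumes a non-empty, sorted sequence."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if after - t < t - before else i - 1


def align(frame_times: Sequence[float], state_times: Sequence[float],
          max_skew_s: float = 0.01) -> List[Optional[int]]:
    """For each camera frame, the index of the nearest proprioceptive sample,
    or None when the nearest sample is further away than max_skew_s."""
    matches: List[Optional[int]] = []
    for t in frame_times:
        j = nearest_index(state_times, t)
        matches.append(j if abs(state_times[j] - t) <= max_skew_s else None)
    return matches
```

Any `None` in the result marks a frame with no usable state sample, which usually means the state logger stalled or the two clocks were not synchronized.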

Task success annotation: episodes should be labeled as successful or failed at the task level. This is used during training to weight or filter demonstrations. Programs that deliver raw collected footage without success annotation create an annotation step that adds cost and delay.
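Completeness and success labels are most useful when they travel with the data as structured fields rather than as notes. The sketch below assumes a hypothetical `EpisodeLabel` record with two boolean flags and shows the filtering step they enable; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EpisodeLabel:
    """Hypothetical annotation record; field names are illustrative only."""
    episode_id: str
    reached_end: bool  # recording ran through to task completion
    success: bool      # annotator's task-level verdict


def select_for_training(labels: Sequence[EpisodeLabel]) -> List[str]:
    """IDs of episodes that are both complete and successful."""
    return [lab.episode_id for lab in labels if lab.reached_end and lab.success]


def success_rate(labels: Sequence[EpisodeLabel]) -> float:
    """Fraction of labeled episodes marked successful; 0.0 for an empty batch."""
    if not labels:
        return 0.0
    return sum(1 for lab in labels if lab.success) / len(labels)
```

Tracking `success_rate` per demonstrator and per task also gives an early signal of where collection quality is slipping.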

Demonstrator consistency: fine-tuning quality improves when demonstrations within a task type follow consistent motion strategies. Programs using multiple demonstrators should run calibration sessions to align strategy before production collection begins.
