How to Collect RGB-D Data at Scale for Embodied AI

A technical guide to planning, executing, and quality-controlling RGB-D data collection programs for embodied AI and robotics - covering sensor selection, synchronization, storage, and delivery at production scale.

8 min read | By the DataX Power team
[Image: Depth sensor and RGB camera setup for embodied AI robot training data collection]

Why RGB-D matters for embodied AI

RGB-D data - paired color and depth frames from a synchronized camera-depth sensor rig - is the dominant input modality for embodied AI models that need to reason about 3D space. Policy models that must estimate object position, grasp geometry, or scene layout from onboard sensors require depth as a primary signal, not an optional supplement. RGB-only programs produce training data that cannot teach 3D spatial reasoning.

Despite this, most video data collection vendors offer RGB programs and treat RGB-D as a premium add-on they have not actually run at scale. The result is programs where depth synchronization is poor, calibration is not validated per session, and depth frames arrive with systematic noise patterns that the annotation team cannot correct retroactively.

This guide covers what makes RGB-D collection technically demanding, how to specify a program that produces annotation-ready output, and what QA checks catch common failure modes before delivery.

1. Sensor selection and configuration

The three most common sensor families for managed RGB-D collection programs are Intel RealSense (D400 and D500 series), Microsoft Azure Kinect, and Orbbec Femto series. Each has distinct strengths relevant to specific program types.

Intel RealSense D435i and D455 are the workhorses of field RGB-D collection. They are lightweight, powered over USB, and have an integrated IMU, which makes them practical for head-mount and wrist-mount egocentric programs. Effective depth range is about 0.3-3 m, which is adequate for indoor manipulation tasks. The D455 trades some portability for improved depth accuracy and longer range, making it better suited to robot navigation datasets.

Azure Kinect DK offers higher depth resolution and wider field-of-view than RealSense but requires external power and is not suited for wrist or head-mount configurations. It is the best choice for stationary scene capture - table-top manipulation tasks, workstation activity, or fixed-camera environment datasets.

Orbbec Femto Bolt is the most recent of the three and works with Azure Kinect SDK-based pipelines through Orbbec's compatibility wrapper. That makes it a near drop-in replacement for programs that need the Azure Kinect form factor and software pipeline but want current hardware.

For programs requiring LiDAR-quality depth at outdoor range or for large-scale scene capture, the short-range depth cameras above are replaced by LiDAR or long-range ToF (time-of-flight) sensors. These are less common in current embodied AI training programs but increasingly relevant for outdoor navigation and autonomous vehicle-adjacent datasets.

2. Synchronization and calibration requirements

Synchronization is the technical criterion that most distinguishes qualified RGB-D collection vendors from those who cannot execute at production quality. Embodied AI models that fuse RGB and depth frames assume temporal alignment between the two signals. Misalignment of more than 8-10ms produces ghost edges at motion boundaries that teach the model incorrect spatial priors.
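
To see why a few milliseconds matter, the time offset can be converted into an image-space shift with a pinhole approximation. This is a minimal sketch: the function name and the speed, distance, and focal length values are illustrative assumptions, not sensor specifications.

```python
def ghost_edge_pixels(speed_m_s, offset_ms, depth_m, focal_px):
    """Approximate pixel shift between RGB and depth for lateral motion.

    Pinhole approximation: shift_px = focal_px * (speed * offset) / depth.
    """
    displacement_m = speed_m_s * (offset_ms / 1000.0)
    return focal_px * displacement_m / depth_m


# Assumed scenario: a hand moving sideways at 1 m/s, 0.5 m from the camera,
# imaged with a 600 px focal length (all three values are hypothetical).
for offset_ms in (1, 10, 15):
    shift = ghost_edge_pixels(1.0, offset_ms, 0.5, 600.0)
    print(f"{offset_ms:>2} ms offset -> {shift:.1f} px")
```

Under these assumptions a 10 ms offset moves the object boundary by roughly a dozen pixels between the RGB and depth frames, while a 1 ms offset stays near one pixel.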

Hardware-level synchronization - using a trigger signal to fire the RGB and depth sensors at the same moment - achieves sub-millisecond alignment and should be the default for any manipulation or action recognition program. Software synchronization through timestamp matching achieves 5-15ms alignment depending on sensor and host timing jitter. For programs where the camera is mostly stationary and scene motion is slow, software sync may be acceptable. For manipulation programs with fast hand and object motion, hardware sync is required.
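
Software synchronization by timestamp matching can be sketched as a nearest-neighbour pairing with a tolerance. The function name `pair_frames` and the 5 ms default are assumptions for illustration; a real pipeline would read timestamps from the sensor SDK and set the tolerance from the program specification.

```python
from bisect import bisect_left


def pair_frames(rgb_ts_ms, depth_ts_ms, max_offset_ms=5.0):
    """Pair each RGB timestamp with the nearest depth timestamp.

    Both lists must be sorted ascending and share one clock. Returns
    (pairs, unmatched): pairs holds (rgb_ts, depth_ts) tuples that are
    within max_offset_ms, unmatched holds RGB timestamps with no
    close-enough depth frame.
    """
    pairs, unmatched = [], []
    for t in rgb_ts_ms:
        i = bisect_left(depth_ts_ms, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = depth_ts_ms[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda d: abs(d - t), default=None)
        if best is not None and abs(best - t) <= max_offset_ms:
            pairs.append((t, best))
        else:
            unmatched.append(t)
    return pairs, unmatched


pairs, unmatched = pair_frames([0.0, 33.3, 66.7, 100.0], [2.0, 35.0, 80.0])
print(pairs)      # [(0.0, 2.0), (33.3, 35.0)]
print(unmatched)  # [66.7, 100.0]
```

This sketch lets one depth frame match more than one RGB frame. A stricter version would enforce one-to-one pairing and report the unmatched rate per session.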

Calibration drift is the other alignment failure mode. Intrinsic and extrinsic calibration parameters - the values that map depth pixels to 3D coordinates and align depth with RGB - drift over time due to temperature changes and mechanical vibration. A session that starts with valid calibration may have degraded calibration by the end of a 4-hour collection day. Production programs should validate calibration at the start of each session using a calibration target, and flag sessions where calibration drift exceeds a specified threshold.
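
A per-session drift check can be as simple as comparing detected calibration-target corners with where the stored calibration predicts them. The sketch below assumes the corner detection and reprojection happen elsewhere (for example in a vision library) and only shows the residual and the flag. The 0.5 px threshold is a hypothetical spec value, not a standard.

```python
import math


def rms_reprojection_error(detected_px, projected_px):
    """RMS distance in pixels between detected and reprojected target corners."""
    if len(detected_px) != len(projected_px) or not detected_px:
        raise ValueError("need two equal-length, non-empty point lists")
    squared = [
        (dx - px) ** 2 + (dy - py) ** 2
        for (dx, dy), (px, py) in zip(detected_px, projected_px)
    ]
    return math.sqrt(sum(squared) / len(squared))


def needs_recalibration(residual_px, threshold_px=0.5):
    """True when the residual exceeds the (hypothetical) program threshold."""
    return residual_px > threshold_px


residual = rms_reprojection_error([(0.0, 0.0), (10.0, 0.0)],
                                  [(0.3, 0.4), (10.0, 0.0)])
print(round(residual, 3), needs_recalibration(residual))
```

Logging this residual at the start and end of each session makes drift over the collection day visible rather than assumed.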

IMU synchronization applies to programs using RealSense D435i or other sensors with an onboard IMU. The IMU provides acceleration and rotation data that is essential for egomotion estimation in dynamic programs. IMU-camera synchronization requires careful timestamp handling - hardware trigger on the camera does not automatically synchronize the IMU, which operates on an independent polling cycle.
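
Because IMU samples and camera frames arrive on different cycles, one common approach is to interpolate the IMU stream at each frame timestamp. This is a minimal sketch for a single IMU channel. It assumes both streams are already expressed on the same clock, which is the hard part in practice, and `imu_at` is a hypothetical helper name.

```python
from bisect import bisect_right


def imu_at(frame_ts, imu_ts, imu_values):
    """Linearly interpolate one IMU channel at a frame timestamp.

    imu_ts must be sorted ascending and share a clock with frame_ts.
    Returns None when frame_ts lies outside the recorded IMU samples.
    """
    if not imu_ts or frame_ts < imu_ts[0] or frame_ts > imu_ts[-1]:
        return None
    hi = bisect_right(imu_ts, frame_ts)
    if hi == len(imu_ts):  # frame_ts coincides with the last sample
        return imu_values[-1]
    lo = hi - 1
    weight = (frame_ts - imu_ts[lo]) / (imu_ts[hi] - imu_ts[lo])
    return imu_values[lo] + weight * (imu_values[hi] - imu_values[lo])


# Hypothetical gyro readings at 0, 5 and 10 ms, queried at a 7.5 ms frame time.
print(imu_at(7.5, [0.0, 5.0, 10.0], [0.0, 1.0, 3.0]))  # 2.0
```

Frames that return None have no IMU coverage and are worth counting in the session QA log.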

3. Storage and data pipeline at scale

RGB-D programs generate substantially more raw data than RGB-only programs. A RealSense D435i at 30fps produces approximately 2.5 GB per minute of synchronized RGB + depth frames at full resolution. A 100-hour collection program produces around 15 TB of raw sensor data before any annotation. Storage, transfer, and preprocessing infrastructure is not optional at this scale - it is a core operational capability.
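
The volume estimate is straightforward arithmetic and worth rerunning for your own stream settings, since the per-minute rate depends heavily on resolution and pixel format. In the sketch below the function names are illustrative, and the 640x480 stream settings are an assumed example, not the configuration behind the 2.5 GB figure.

```python
def raw_volume_tb(gb_per_minute, hours):
    """Total raw volume in TB, using decimal units (1 TB = 1000 GB)."""
    return gb_per_minute * 60 * hours / 1000


def uncompressed_rate_gb_per_min(width, height, fps, bytes_per_pixel):
    """Uncompressed data rate of one stream in GB per minute."""
    return width * height * bytes_per_pixel * fps * 60 / 1e9


# The figures quoted above: 2.5 GB/min over a 100-hour program.
print(raw_volume_tb(2.5, 100))  # 15.0

# Assumed example: 640x480 16-bit depth plus 640x480 24-bit RGB at 30 fps.
rate = (uncompressed_rate_gb_per_min(640, 480, 30, 2)
        + uncompressed_rate_gb_per_min(640, 480, 30, 3))
print(round(rate, 2))
```

Higher resolutions scale the rate with pixel count, so the storage plan should be derived from the exact capture configuration in the program specification.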

Collection vendors running programs at scale use on-site NAS storage to buffer daily collection, SSD-to-NAS transfer for immediate backup, and compressed format conversion during overnight processing. Raw RealSense bag files are typically converted to HDF5 or a dataset-specific format before annotation, reducing storage volume by 30-50% through lossless depth compression.
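
"Lossless" here means the decompressed depth values are bit-identical to the originals. The sketch below illustrates that round trip with Python's standard `zlib` module as a stand-in for a container-level filter such as HDF5's gzip option. The helper names are hypothetical, and the ratio on this synthetic frame says nothing about real sensor data.

```python
import zlib
from array import array


def compress_depth(depth_mm):
    """Losslessly compress a flat sequence of 16-bit depth values (mm)."""
    raw = array("H", depth_mm).tobytes()  # native byte order, 2 bytes per value
    return zlib.compress(raw, level=6)


def decompress_depth(blob):
    """Inverse of compress_depth; returns a list of ints."""
    values = array("H")
    values.frombytes(zlib.decompress(blob))
    return values.tolist()


# Synthetic frame: a block of holes (0) followed by two flat surfaces.
frame = [0] * 100 + [1200] * 200 + [1210] * 100
blob = compress_depth(frame)
assert decompress_depth(blob) == frame  # every value is recovered exactly
print(len(frame) * 2, "bytes raw ->", len(blob), "bytes compressed")
```

Whatever codec the vendor uses, the same round-trip assertion on a sample of real frames is a cheap way to confirm that conversion did not alter depth values.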

Delivery format should be specified before collection begins. Common formats: ROS2 bag (preferred for teams with ROS infrastructure), HDF5 with documented schema, LeRobot dataset format (increasingly common for policy training programs), and custom formats specified by the buyer. Converting between formats after collection is expensive and error-prone. Specify the delivery format in the program specification document and confirm the vendor can produce it natively.

4. QA checks specific to RGB-D programs

Standard video QA - checking for blur, overexposure, and coverage completeness - is necessary but not sufficient for RGB-D programs. Four additional checks are specific to depth data quality.

Depth hole coverage: active-infrared depth sensors (structured-light, active-stereo, and ToF) produce invalid pixels (holes) where the infrared signal cannot be resolved - typically at shiny surfaces, transparent materials, and extreme lighting contrasts. Acceptable hole rates depend on the scene type and the annotation requirements. Programs capturing metallic objects or glass should specify a maximum hole percentage per frame and flag sessions that exceed it.
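
A hole-rate check can be sketched in a few lines. This example assumes invalid pixels are reported as 0, which is a common convention but should be confirmed for the sensor in use, and the 10% limit is a hypothetical spec value.

```python
def hole_percentage(depth_frame, invalid_value=0):
    """Percentage of pixels in a frame (a list of rows) equal to invalid_value."""
    total = sum(len(row) for row in depth_frame)
    if total == 0:
        raise ValueError("empty frame")
    holes = sum(row.count(invalid_value) for row in depth_frame)
    return 100.0 * holes / total


def flag_frames(frames, max_hole_pct=10.0):
    """Indices of frames whose hole percentage exceeds max_hole_pct."""
    return [i for i, f in enumerate(frames) if hole_percentage(f) > max_hole_pct]


frames = [
    [[0, 500], [600, 0]],      # half the pixels are holes
    [[500, 510], [600, 620]],  # no holes
]
print([hole_percentage(f) for f in frames])  # [50.0, 0.0]
print(flag_frames(frames))                   # [0]
```

In production the same logic would run on array data, and the flagged-frame ratio per session is usually the number that goes into the QA report.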

Depth range compliance: frames where the primary objects of interest fall outside the effective depth range of the sensor are useless for 3D annotation. QA should verify that scene geometry at the distance relevant to the task is within the sensor's valid range for each session.
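
Range compliance can be checked on the depth samples inside the region of interest. In this sketch the 0.3-3 m defaults come from the RealSense range quoted earlier, and how the region of interest is chosen (for example from a task-specific bounding box) is left as an assumption.

```python
def in_range_fraction(roi_depths_m, min_m=0.3, max_m=3.0):
    """Fraction of valid (non-zero) ROI depth samples within [min_m, max_m]."""
    valid = [d for d in roi_depths_m if d > 0]
    if not valid:
        return 0.0
    return sum(min_m <= d <= max_m for d in valid) / len(valid)


# One hole, one sample too close, two in range, one too far.
print(in_range_fraction([0.0, 0.25, 0.5, 1.0, 3.5]))  # 0.5
```

A session-level rule might require this fraction to stay above a chosen minimum for most frames, with that minimum set in the program specification.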

Calibration validation: as noted above, per-session calibration validation using a known-geometry target is required for production programs. QA should log calibration residual per session and flag for recalibration if residual exceeds the specification.

RGB-depth alignment: visual spot-checks on a random sample of frames to verify that depth edges align with RGB edges at object boundaries. Systematic misalignment indicates a calibration or synchronization problem that propagates through the entire session.
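
Visual spot-checks can be supplemented with a crude numeric one: on an image row that crosses a single object boundary, compare where the strongest RGB edge and the strongest depth edge fall. This is a deliberately simple sketch with hypothetical helper names. It assumes depth is already registered to the RGB frame and that the row contains one dominant boundary.

```python
def strongest_edge(scanline):
    """Index i where |scanline[i + 1] - scanline[i]| is largest."""
    if len(scanline) < 2:
        raise ValueError("scanline needs at least two samples")
    steps = [abs(b - a) for a, b in zip(scanline, scanline[1:])]
    return steps.index(max(steps))


def edge_offset_px(rgb_gray_row, aligned_depth_row):
    """Signed pixel offset between the strongest depth edge and RGB edge."""
    return strongest_edge(aligned_depth_row) - strongest_edge(rgb_gray_row)


rgb_row = [10, 10, 10, 200, 200, 200]          # intensity jumps after index 2
depth_row = [500, 500, 500, 500, 500, 1500]    # depth jumps after index 4
print(edge_offset_px(rgb_row, depth_row))      # 2
```

An offset that keeps the same sign and size across many sampled frames points to a calibration or synchronization problem rather than random noise.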

Planning your RGB-D program

A production-ready RGB-D collection program requires four workstreams running in parallel: sensor configuration and calibration infrastructure, capture protocol and scenario scripting, storage and transfer pipeline, and QA process design.

The lead time for a new RGB-D program from vendor selection to first collection is typically 3-4 weeks. Hardware procurement and calibration rig construction account for most of this time if the vendor does not already own the required sensor configuration. Vendors with existing RGB-D programs can typically onboard a new program in 1-2 weeks.

Budget planning: fully loaded RGB-D managed collection programs in Vietnam run $18-28 per hour of synchronized footage, inclusive of hardware amortization, crew, calibration, and QA. This compares to $45-70 per hour for equivalent US-sourced programs. On a 500-hour program, comparing the low ends and the high ends of the two ranges, the differential is $13,500-$21,000 in absolute savings.
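
The savings figure depends on which ends of the two price ranges are compared. The sketch below reproduces the like-for-like comparison and also shows the widest possible spread. The function name is illustrative and the rates are the ones quoted above.

```python
def savings_range(hours, low_cost, high_cost):
    """Savings for a program of `hours` hours.

    low_cost and high_cost are (min, max) dollar-per-hour tuples. Returns
    (like_for_like, extremes): savings pairing low with low and high with
    high, and the widest spread from pairing opposite ends.
    """
    like_for_like = (
        (high_cost[0] - low_cost[0]) * hours,
        (high_cost[1] - low_cost[1]) * hours,
    )
    extremes = (
        (high_cost[0] - low_cost[1]) * hours,
        (high_cost[1] - low_cost[0]) * hours,
    )
    return like_for_like, extremes


like_for_like, extremes = savings_range(500, (18, 28), (45, 70))
print(like_for_like)  # (13500, 21000)
print(extremes)       # (8500, 26000)
```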
