Why Vietnam for robot and egocentric video data
Robot training data programs have specific requirements that separate them from general video collection. The footage must cover manipulation tasks, human demonstrations, and egocentric perspectives that are hard to crowdsource reliably. You need participants who can execute task scripts consistently, facilities that support controlled environment recording, and engineering staff who can operate multi-sensor rig setups and maintain synchronization across sensor streams.
Vietnam provides all three. Hanoi has a strong engineering and technical talent base, affordable access to dedicated indoor facilities for controlled environment recording, and diverse outdoor environments that cover the pedestrian-dense, high-density-commercial, and logistics contexts where robots in APAC need to generalize.
For teams whose robots deploy in APAC, Vietnamese environments provide deployment-matched training data. Street scenes, market environments, and commercial district contexts from Hanoi cover distributions that US- or EU-sourced footage does not. The cost advantage is significant - typically 30 to 50 percent lower than equivalent US or EU programs - without the data-quality trade-offs that unmanaged crowd-platform collection introduces.
1. Indoor robot manipulation programs
Indoor programs are the primary format for robot manipulation training data. They require controlled environments - tabletop setups, specific object configurations, defined lighting conditions - and participants who can execute manipulation tasks consistently to protocol. The controlled environment requirement rules out most crowd-platform approaches and favors managed program vendors with dedicated facility access.
In Hanoi, DataX Power operates dedicated indoor facilities for tabletop manipulation recording, kitchen and home environment scenarios, and industrial task demonstrations. Participants are trained on specific task scripts before production recording begins and calibrated on any hardware they will interact with during capture. This calibration phase is not overhead - it is what separates usable training data from footage that cannot be annotated consistently.
Output formats include RGB video, depth map streams, force/torque sensor logs, and proprioceptive data synchronized to millisecond precision. Delivery format is matched to the buyer's training pipeline - HDF5, ROS2 bags, or flat file structures depending on the model framework in use.
2. Egocentric and head-mounted camera programs
Egocentric programs - first-person footage from head-mounted or wearable cameras - are the primary data format for embodied AI and VR/AR scene understanding. The footage captures what the agent sees, which is what the model needs to learn from. Programs in Vietnam cover head-mounted rig setups with GoPro-class hardware and smart glasses configurations, participant pools trained on egocentric scenario scripts, and QA workflows for temporal consistency and gaze alignment.
The collection protocols behind the EPIC-Kitchens and Ego4D datasets, two widely used egocentric video benchmarks, inform the scenario design and QA approach. They serve as a working reference for what "production-ready egocentric data" means: consistent field of view, scenario coverage across defined task categories, and annotation compatibility from collection to label. Programs designed to these standards produce footage that integrates cleanly with existing benchmark-validated training recipes.
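As an illustration of the temporal consistency QA mentioned above, the sketch below flags dropped frames in a head-mounted recording by looking for oversized gaps between consecutive frame timestamps. It assumes that "temporal consistency" includes steady frame spacing; the function name `find_frame_gaps`, the `slack` factor, and the sample timestamps are hypothetical and not taken from any vendor's actual QA tooling.

```python
def find_frame_gaps(timestamps, fps, slack=1.5):
    """Return (index, gap_seconds) for every gap larger than slack * (1 / fps).

    `timestamps` is a list of frame capture times in seconds, in capture order.
    `slack` is an assumed tolerance factor: 1.5 means a gap must exceed one and
    a half nominal frame intervals before it is reported.
    """
    expected = 1.0 / fps
    return [
        (i, later - earlier)
        for i, (earlier, later) in enumerate(zip(timestamps, timestamps[1:]))
        if later - earlier > slack * expected
    ]


# Hypothetical 10 fps clip with frames missing between 0.2 s and 0.5 s.
clip = [0.0, 0.1, 0.2, 0.5, 0.6]
for index, gap in find_frame_gaps(clip, fps=10):
    print(f"gap of {gap:.3f} s after frame {index}")
```

A check like this only covers frame spacing. Gaze alignment would need its own test against the eye-tracking or head-pose stream.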
3. Outdoor and multi-environment collection
Outdoor programs extend collection into the environments where robots and embodied AI systems need to generalize. Hanoi provides pedestrian-dense street environments, market and commercial district scenes, construction and logistics contexts, and accessible outdoor public spaces for scripted human activity scenarios. The density and variety of Hanoi's urban environments are a genuine asset for models that need to generalize across high-activity real-world settings.
For robot programs requiring outdoor scene diversity - yard tasks, loading dock scenarios, last-mile delivery environments - Vietnamese outdoor environments cover the APAC distribution. Programs can be scoped to specific environment types (indoor, street, commercial, industrial) or run across environment types within a single program to maximize scene diversity within budget.
4. Multi-sensor fusion programs
Advanced robot training programs combine RGB video with depth sensors (Intel RealSense, Kinect-class hardware), IMU, and force/torque sensors. The multi-sensor requirement is common for manipulation programs where 2D video alone does not capture the depth and force information the model needs to learn contact-rich tasks. Multi-sensor programs require hardware synchronization to under 10ms error for the data to be training-ready - desynchronized sensor streams produce data that looks complete but fails during training.
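The 10 ms bound can be checked directly on recorded timestamps. The sketch below is a minimal example, assuming every stream is stamped against a shared clock and stored as a sorted list of seconds. For each frame of the slower stream (for example RGB), it finds the nearest sample in the faster stream and reports the worst gap. The function names and the sample values are illustrative assumptions.

```python
from bisect import bisect_left

SYNC_TOLERANCE_S = 0.010  # the 10 ms bound stated in the text


def nearest_offset(t, samples):
    """Absolute gap between t and the closest timestamp in sorted, non-empty `samples`."""
    i = bisect_left(samples, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = samples[max(i - 1, 0):i + 1]
    return min(abs(t - s) for s in candidates)


def max_sync_error(slow_stream, fast_stream):
    """Worst-case gap from any slow-stream frame to its nearest fast-stream sample."""
    return max(nearest_offset(t, fast_stream) for t in slow_stream)


def is_within_tolerance(slow_stream, fast_stream, tolerance=SYNC_TOLERANCE_S):
    return max_sync_error(slow_stream, fast_stream) < tolerance


# Hypothetical timestamps in seconds.
rgb = [0.000, 0.033, 0.066]
depth = [0.002, 0.036, 0.069]
print(is_within_tolerance(rgb, depth))
```

This measures timestamp alignment only. If the sensors stamp data with separate, unsynchronized clocks, matching timestamps do not prove the samples were captured together, which is why the text asks for hardware synchronization.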
DataX Power's multi-sensor program capability covers sensor calibration, sync verification at recording time, and HDF5/ROS2 bag delivery formats for direct ingestion into standard robotics training pipelines. Sync verification at recording time - not post-processing - is the critical control. Catching synchronization drift during capture prevents data loss that only becomes visible at training time.
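One way to picture sync verification at recording time is a small monitor that runs inside the capture loop and raises a flag as soon as the offset between two streams starts to drift. The sketch below is a hedged illustration, not DataX Power's actual tooling: the `DriftMonitor` class, the rolling-mean rule, and the window size are all assumptions.

```python
from collections import deque


class DriftMonitor:
    """Tracks the rolling mean offset between a sensor and a reference clock."""

    def __init__(self, tolerance_s=0.010, window=30):
        self.tolerance_s = tolerance_s
        # Keep only the most recent `window` offsets.
        self.offsets = deque(maxlen=window)

    def update(self, t_reference, t_sensor):
        """Record one paired sample.

        Returns True while the rolling mean offset stays under the tolerance,
        False once it reaches or exceeds it.
        """
        self.offsets.append(t_sensor - t_reference)
        mean_offset = sum(self.offsets) / len(self.offsets)
        return abs(mean_offset) < self.tolerance_s


# Hypothetical capture loop: the sensor slips 4 ms further behind every frame.
monitor = DriftMonitor(window=3)
for frame in range(5):
    t_ref = frame * 0.1
    in_sync = monitor.update(t_ref, t_ref + frame * 0.004)
    if not in_sync:
        print(f"drift detected at frame {frame}; pause and re-sync before continuing")
```

Averaging over a short window avoids stopping a session on a single jittery sample while still catching a steady drift within a few frames.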
5. Teleoperation recording for robot learning from demonstration
Learning from demonstration (LfD) programs require recording human operators controlling robot arms - typically through teleoperation setups - while completing target tasks. The recording captures both the robot's sensor stream and the operator's control inputs. This paired data is what the model learns from: the sensor state and the corresponding human-expert action, frame by frame across thousands of demonstrations.
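The pairing described above can be sketched as a small data structure. The example below assumes observations and operator commands arrive as separately timestamped, time-sorted lists, and pairs each observation with the most recent command issued at or before it. The `Step` record, the pairing rule, and the field names are illustrative assumptions; real pipelines may interpolate actions or use a fixed control rate instead.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Step:
    t: float            # observation timestamp in seconds
    observation: dict   # e.g. camera frame reference, joint positions
    action: list        # operator command in effect at time t


def pair_demonstration(observations, actions):
    """Build (observation, action) steps from two time-sorted streams.

    observations: list of (timestamp, observation dict)
    actions:      list of (timestamp, action list)
    Observations recorded before the first command are skipped.
    """
    action_times = [t for t, _ in actions]
    steps = []
    for t, obs in observations:
        i = bisect_right(action_times, t) - 1  # latest command at or before t
        if i < 0:
            continue  # no command issued yet
        steps.append(Step(t, obs, actions[i][1]))
    return steps


# Hypothetical three-frame demonstration.
observations = [(0.00, {"frame": 0}), (0.10, {"frame": 1}), (0.20, {"frame": 2})]
actions = [(0.05, [0.1]), (0.15, [0.2])]
for step in pair_demonstration(observations, actions):
    print(step)
```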
This is a specialist capability requiring both robotic hardware operation and data recording pipeline management. The operator must be able to execute target tasks to a quality threshold, not just operate the teleoperation hardware. Vietnam-based programs can run teleoperation recording using operator-owned hardware setups or vendor-provided rigs, depending on program scope. For teams at early stages of LfD data collection, a program that includes vendor rigs reduces capital requirements and setup time.
Matching the program to your model requirements
Not all robot training programs need the same collection format. Manipulation models need tabletop demonstrations with diverse objects and grasps. Navigation models need scene diversity and pedestrian interaction. Language-conditioned policies need natural language instruction pairing with task demonstrations. Egocentric models for VR/AR need consistent field-of-view footage across defined scenario types. Each of these has different facility, participant, and hardware requirements.
The collection program design should start with the model architecture and training recipe, not with what a vendor happens to offer. The right starting point is a scoping conversation that covers your model type, the data formats your training pipeline accepts, your volume requirements, and your timeline. From that, the program structure - environment types, sensor configuration, participant protocols, delivery format - follows directly.


