Why egocentric collection has more steps than standard video
Standard video data collection - stationary or operator-held cameras capturing a scene - can be set up and run with minimal specialized knowledge. Egocentric video collection, where the camera is mounted on or worn by the subject, requires additional steps at every stage of the workflow because the spatial relationship between the camera and the subject's body is what the downstream model needs to learn from.
A camera mount that shifts between participants breaks the spatial consistency that the annotation ontology assumes. A participant who performs the task differently from the scenario script produces footage that cannot be annotated with the required labels. Lighting that casts shadows on the hands - the primary region of interest in manipulation tasks - degrades hand tracking and depth estimation, even on RGB-only programs.
This guide covers the complete workflow for a managed egocentric video collection program, from initial program specification through final delivery.
Step 1: Program specification
The program specification document is the contract between the collection team and the ML team. Everything downstream - scenario scripts, hardware configuration, QA criteria, delivery format - is derived from the specification. Changes to the specification after collection begins are expensive.
The specification should cover: the task set (what activities will be captured, in what sequence, with what objects), the camera configuration (sensor type, mount position, field of view, resolution, frame rate), the participant profile (demographics, physical characteristics, any domain expertise required), the environment list (indoor, outdoor, specific location types), the annotation ontology (what will be labeled and how), the diversity matrix (how many participants, how many environments, how many sessions per scenario), and the delivery format.
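As a rough illustration, the sections listed above can be held as structured data so that a missing section is caught before collection starts. This is a minimal sketch, not a standard schema: the class names, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CameraConfig:
    sensor_type: str             # e.g. "rgb" or "rgbd"
    mount_position: str          # e.g. "head" or "wrist"
    fov_degrees: float
    resolution: tuple[int, int]  # (width, height) in pixels
    frame_rate: int              # frames per second


@dataclass
class ProgramSpec:
    tasks: list[str]
    camera: CameraConfig
    participant_profile: dict[str, str]
    environments: list[str]
    annotation_ontology: list[str]   # label names; hypothetical flat form
    participants: int                # diversity matrix: participant count
    sessions_per_scenario: int       # diversity matrix: sessions per scenario
    delivery_format: str

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "tasks": self.tasks,
            "environments": self.environments,
            "annotation_ontology": self.annotation_ontology,
            "delivery_format": self.delivery_format,
        }
        return [name for name, value in required.items() if not value]


# Example values are illustrative only.
spec = ProgramSpec(
    tasks=["pour water into cup"],
    camera=CameraConfig("rgb", "head", 100.0, (1920, 1080), 30),
    participant_profile={"handedness": "mixed"},
    environments=["kitchen"],
    annotation_ontology=[],          # ontology not yet written
    participants=20,
    sessions_per_scenario=3,
    delivery_format="mp4 + json",
)
print(spec.missing_sections())  # ['annotation_ontology']
```

A check like this makes the point of the next paragraph mechanical: a specification with an empty ontology is flagged as incomplete rather than discovered as a problem after footage exists.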
The annotation ontology belongs in the specification, not in a separate document delivered after collection. Annotation requirements determine what scenarios are necessary and how participants must execute them. A specification without an annotation ontology produces footage that may not be annotatable to the required precision.
Step 2: Hardware configuration and calibration
Hardware configuration for egocentric collection covers camera selection, mount design, power management, and calibration. For head-mount programs, the critical variables are field of view (95-110 degrees is standard for most manipulation tasks), camera center-to-eye-position offset (which affects the perspective distortion in the footage), and mount rigidity (a mount that shifts during activity produces inconsistent spatial geometry).
For wrist-mount programs, the camera orientation relative to the palm and finger positions is the critical variable. The annotation team needs a consistent camera-to-hand geometry to label manipulation events accurately. Programs using variable-angle wrist mounts - where the angle changes between participants or between sessions - require custom calibration per participant.
Calibration for egocentric programs requires, at minimum, intrinsic calibration of each camera (focal length, principal point, distortion coefficients) and, for multi-sensor programs, extrinsic calibration between all sensor pairs. Calibration should be validated per session, not just at program startup.
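One common way to validate intrinsics at the start of a session is to capture a calibration target and check the reprojection error against a threshold. The sketch below assumes a pinhole model with two radial distortion terms, assumes the target corners are already expressed in the camera frame, and uses a placeholder 1.0 px threshold. The function names and the threshold are illustrative, not taken from any particular toolkit.

```python
import math


def project(point_xyz, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point_xyz
    xn, yn = x / z, y / z                       # normalized image coordinates
    r2 = xn * xn + yn * yn
    radial = 1.0 + k1 * r2 + k2 * r2 * r2       # two-term radial distortion
    return (fx * xn * radial + cx, fy * yn * radial + cy)


def rms_reprojection_error(points_xyz, observed_px, **intrinsics):
    """RMS pixel distance between projected and observed target corners."""
    squared = []
    for point, (u_obs, v_obs) in zip(points_xyz, observed_px):
        u, v = project(point, **intrinsics)
        squared.append((u - u_obs) ** 2 + (v - v_obs) ** 2)
    return math.sqrt(sum(squared) / len(squared))


def calibration_still_valid(points_xyz, observed_px, max_rms_px=1.0, **intrinsics):
    """True if stored intrinsics still explain the observed corners."""
    return rms_reprojection_error(points_xyz, observed_px, **intrinsics) <= max_rms_px


# Illustrative intrinsics and target corners (metres, camera frame).
intr = dict(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
corners = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.0)]
detected = [project(p, **intr) for p in corners]       # perfect detections
drifted = [(u + 3.0, v + 4.0) for u, v in detected]    # simulated mount or lens shift

print(calibration_still_valid(corners, detected, **intr))  # True
print(calibration_still_valid(corners, drifted, **intr))   # False
```

In practice the corner detections and target pose would come from a calibration library; the per-session check itself reduces to comparing one number against the program's tolerance.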
Power management is an operational concern that is easy to underestimate. A RealSense D455 on a wrist mount draws 2.5W continuously, so a 4-hour collection day consumes about 10 Wh for the camera alone. Supplying that requires either battery packs integrated into the mount design or tethered power - each with ergonomic tradeoffs that affect natural movement and therefore footage quality.
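The battery sizing implied by those figures is simple arithmetic. The sketch below uses the 2.5W draw and 4-hour day from the text. The 80% usable-capacity margin and the 3.7V nominal pack voltage are assumptions for illustration, and the draw of any host computer or recorder is not included.

```python
def required_battery_wh(draw_watts, hours, usable_fraction=0.8):
    """Pack capacity in watt-hours needed to sustain a constant draw.

    usable_fraction is an assumed safety margin for capacity that should
    not be drained (ageing, cold, converter losses).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return draw_watts * hours / usable_fraction


def wh_to_mah(watt_hours, nominal_volts=3.7):
    """Convert watt-hours to milliamp-hours at an assumed nominal voltage."""
    return watt_hours / nominal_volts * 1000.0


energy_used = 2.5 * 4                       # 10 Wh consumed by the camera
pack_wh = required_battery_wh(2.5, 4)       # capacity with the assumed margin
print(f"consumed: {energy_used:.1f} Wh, pack: {pack_wh:.1f} Wh, "
      f"about {wh_to_mah(pack_wh):.0f} mAh at 3.7 V")
```

The resulting pack size is what drives the ergonomic tradeoff: the capacity has to be carried somewhere on the participant or replaced by a cable.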
Step 3: Participant recruitment and onboarding
Participant recruitment for egocentric collection programs is more demanding than for standard video programs because the participant's physical characteristics directly affect the footage. For hand-centric programs, hand size, handedness (left- or right-dominant), and grip patterns all affect what the camera captures. Embodied AI training datasets typically call for participants across multiple hand size categories, age groups, and ethnicities - which requires an established participant pool rather than ad hoc recruitment.
Participant onboarding covers two elements: equipment familiarization and scenario rehearsal. Equipment familiarization typically takes 15-30 minutes per participant and covers how to wear the hardware, how to move without snagging or straining cables, and how to signal equipment problems without disrupting a session. Scenario rehearsal covers the specific task sequences in the capture protocol - participants who have rehearsed the scenario produce more consistent footage than those executing it for the first time.
Consent documentation must be completed before any recording begins. For programs destined for commercial AI training, the consent form must explicitly cover the use of footage for AI model training, the right to share footage with annotation vendors and cloud infrastructure providers, data retention duration, and deletion rights. Adapting an existing consent template is faster than drafting one from scratch - qualified vendors have templates that cover standard enterprise requirements and can be modified for jurisdiction-specific additions.
Step 4: Collection execution and in-session QA
Collection execution follows the scenario script, with an on-site crew member observing each session and flagging deviations in real time. In-session QA catches problems that cannot be fixed in post: camera mount drift that shifts the perspective mid-session, participant behavior that diverges from the script, equipment failure that corrupts specific clips, and environmental changes (lighting shift, background motion) that affect footage quality.
Session length for egocentric collection programs is typically 2-4 hours per participant per day. Longer sessions produce fatigue-related behavior changes that degrade script fidelity. Programs requiring large total hours should spread collection across multiple shorter sessions rather than maximizing session length.
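A small planning helper makes the "multiple shorter sessions" rule concrete. This is a minimal sketch that assumes equal-length sessions and uses the 4-hour upper bound from the text as the default cap; the function name is hypothetical.

```python
import math


def plan_sessions(total_hours, max_session_hours=4.0):
    """Split a participant's total hours into the fewest equal sessions
    that each stay at or under the per-day cap."""
    if total_hours <= 0 or max_session_hours <= 0:
        raise ValueError("hours must be positive")
    count = math.ceil(total_hours / max_session_hours)
    return [total_hours / count] * count


# 10 hours from one participant becomes three sessions under the cap,
# rather than two 4-hour days plus an overlong third.
sessions = plan_sessions(10)
print(len(sessions), [round(h, 2) for h in sessions])
```

A program that wants to stay nearer the lower end of the range can pass a smaller `max_session_hours`.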
Metadata logging during collection is an operational step that many programs skip and later regret. Per-session metadata should record: participant ID, session start and end time, scenario sequence, equipment configuration, any in-session anomalies flagged by the crew, and environment conditions. This metadata is essential for QA, for debugging annotation questions, and for stratified sampling of the delivered dataset.
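The fields listed above map naturally onto one record per session. The following is a sketch of one possible layout written as a JSON line; the field names, ID formats, and example values are hypothetical, not a required schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SessionMetadata:
    participant_id: str
    start_time: str                      # ISO 8601 timestamp
    end_time: str                        # ISO 8601 timestamp
    scenario_sequence: list[str]
    equipment_config: dict[str, str]
    environment: dict[str, str]
    anomalies: list[str] = field(default_factory=list)

    def to_json_line(self) -> str:
        """Serialize the record as one line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = SessionMetadata(
    participant_id="P-0042",
    start_time="2024-05-01T09:00:00",
    end_time="2024-05-01T11:30:00",
    scenario_sequence=["pour_water", "wipe_counter"],
    equipment_config={"mount": "head", "camera": "rgb"},
    environment={"location": "kitchen", "lighting": "overhead"},
)
# Crew flags are appended during the session, as they happen.
record.anomalies.append("00:41:10 mount re-seated after slip")

line = record.to_json_line()
print(line)
```

Keeping anomalies in the same record as the participant and environment fields is what later allows QA and stratified sampling to filter on them without a separate lookup.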
Step 5: Post-collection QA and processing
Post-collection QA reviews the footage from each session against the QA criteria defined in the program specification. Standard QA checks for egocentric programs: completeness (all required scenarios captured), coverage (all specified environments and participant types represented), technical quality (focus, exposure, motion blur within spec), and scenario fidelity (participants executed the task sequences as scripted).
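The completeness and coverage checks are set comparisons against the specification. A minimal sketch, assuming each session is summarized as a dictionary with hypothetical keys `environment`, `participant_type`, and `passed_qa`:

```python
def completeness_gaps(required_scenarios, captured_scenarios):
    """Scenarios the specification requires that were never captured."""
    return sorted(set(required_scenarios) - set(captured_scenarios))


def coverage_gaps(required_cells, sessions):
    """(environment, participant_type) pairs with no session that passed QA."""
    covered = {
        (s["environment"], s["participant_type"])
        for s in sessions
        if s["passed_qa"]
    }
    return sorted(set(required_cells) - covered)


# Illustrative data only.
missing_scenarios = completeness_gaps(
    ["pour", "stir", "wipe"], ["pour", "wipe"]
)
missing_cells = coverage_gaps(
    {("kitchen", "left-handed"), ("kitchen", "right-handed")},
    [
        {"environment": "kitchen", "participant_type": "right-handed", "passed_qa": True},
        {"environment": "kitchen", "participant_type": "left-handed", "passed_qa": False},
    ],
)
print(missing_scenarios)  # ['stir']
print(missing_cells)      # [('kitchen', 'left-handed')]
```

Technical quality and scenario fidelity need footage review or automated image metrics and are not covered by this sketch.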
Processing covers format conversion, metadata attachment, and packaging for delivery. Raw footage from most field collection setups is in a vendor-specific format that requires conversion before annotation. Format conversion should be validated against the annotation tooling to confirm the output is compatible before a large batch is processed.
Failed sessions - those that did not meet QA criteria - must be scheduled for recollection before the program is considered complete. Production programs should build a recollection buffer into the timeline: approximately 10-15% of planned session time for programs with new scenario types or first-time participants, and 5-8% for established programs.
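Applied to a concrete plan, the buffer percentages above translate into an hours range. A small sketch, with a hypothetical function name and an illustrative 200-hour program:

```python
def recollection_buffer_hours(planned_hours, established):
    """Return (low, high) buffer hours using the percentage ranges in the text:
    5-8% for established programs, 10-15% for new scenarios or participants."""
    if planned_hours < 0:
        raise ValueError("planned_hours must be non-negative")
    low_pct, high_pct = (0.05, 0.08) if established else (0.10, 0.15)
    return planned_hours * low_pct, planned_hours * high_pct


new_low, new_high = recollection_buffer_hours(200, established=False)
est_low, est_high = recollection_buffer_hours(200, established=True)
print(f"new program: {new_low:.0f}-{new_high:.0f} h of buffer")
print(f"established program: {est_low:.0f}-{est_high:.0f} h of buffer")
```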
Step 6: Delivery and handoff
Delivery for a managed egocentric collection program includes the footage, session metadata, QA documentation, consent records, and calibration logs. Each element is necessary for the downstream annotation and training workflow.
The QA documentation - per-session pass/fail records, anomaly logs, and sampling reports - is the element most often omitted by vendors without mature QA processes. It is also the element that matters most when a downstream annotation question requires understanding what happened during a specific session.
Consent records are not optional for commercial programs. Enterprise buyers need documented evidence of participant consent for every recording in the dataset, stored in a form that can be retrieved if a participant exercises deletion rights. Programs that do not maintain consent records create compliance liability that surfaces during procurement audits.
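Both requirements, evidence for every recording and retrieval on a deletion request, come down to keeping a reliable mapping from recordings to participants. The sketch below assumes a simple in-memory mapping. The record fields mirror the consent terms described in Step 3, but the names and ID formats are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    participant_id: str
    signed_on: str                    # ISO 8601 date
    covers_ai_training: bool
    covers_third_party_sharing: bool
    retention_days: int


def recordings_without_consent(recordings, consents):
    """Recording IDs whose participant lacks a consent record covering
    both AI training use and third-party sharing.

    recordings maps recording_id -> participant_id.
    """
    valid = {
        c.participant_id
        for c in consents
        if c.covers_ai_training and c.covers_third_party_sharing
    }
    return sorted(rid for rid, pid in recordings.items() if pid not in valid)


def recordings_to_delete(recordings, participant_id):
    """Every recording that must be removed when a participant requests deletion."""
    return sorted(rid for rid, pid in recordings.items() if pid == participant_id)


# Illustrative data only.
recordings = {"rec-001": "P-01", "rec-002": "P-02", "rec-003": "P-01"}
consents = [ConsentRecord("P-01", "2024-05-01", True, True, 730)]

print(recordings_without_consent(recordings, consents))  # ['rec-002']
print(recordings_to_delete(recordings, "P-01"))          # ['rec-001', 'rec-003']
```

Running the first check before delivery surfaces gaps while recollection or re-consent is still possible, rather than during a procurement audit.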


