Virtual Prod
ML Rig Predictor
A two-model machine learning system that classifies 18 human motion actions from 3D skeletal joint data — designed for virtual production rigs where real-time action detection drives animation, FX, and scene logic.
// 01 — Overview
What it does
& why it matters
In virtual production, performers wear marker suits and move through a tracked volume — but raw 3D joint positions are only useful if the system knows what the performer is doing. This project builds that classification layer, training two Random Forest pipelines to map skeletal sequences to discrete action labels with no manual keyframing.
// 02 — Dataset
Kinect Action
Recognition Dataset
Each CSV file in the dataset represents a single motion clip captured with a Microsoft Kinect sensor. Joint positions are stored as real-world metric coordinates (x, y, z in meters), not pixel-space projections, making them directly usable for distance-based normalization.
180 total clips spanning 18 distinct physical actions, each performed by multiple subjects to ensure the model learns action-invariant features rather than person-specific motion signatures. Labels are encoded from filename prefixes (a01–a18).
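The filename-prefix labelling above can be sketched as a small helper plus sklearn's LabelEncoder. The exact suffix fields (subject, episode) in the sample filenames are illustrative assumptions; only the `a01`–`a18` prefix convention comes from the dataset description.

```python
import re
from sklearn.preprocessing import LabelEncoder

def label_from_filename(name: str) -> str:
    """Extract the action prefix (a01-a18) from a KARD-style clip filename."""
    match = re.match(r"(a\d{2})", name)
    if match is None:
        raise ValueError(f"no action prefix found in {name!r}")
    return match.group(1)

# Hypothetical filenames following the 'aXX_...' prefix convention
files = ["a01_s01_e01.csv", "a18_s02_e03.csv", "a05_s01_e02.csv"]

# Encode string prefixes to integer class ids for the classifier
le = LabelEncoder()
y_enc = le.fit_transform([label_from_filename(f) for f in files])
```

`inverse_transform` recovers the original string labels from the encoded ids at prediction time.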
// 03 — Models
Two pipelines,
one competition
Both models are sklearn.pipeline.Pipeline objects serialized with pickle. Model 1 works in the full feature space; Model 2 first compresses that space with PCA before handing it to the same forest architecture — trading a small accuracy margin for significantly faster inference.
Operates on the complete 4,998-dimensional feature vector (joint positions + frame-to-frame velocities). StandardScaler normalizes each feature before the forest. With 300 trees and unbounded depth, this model captures fine-grained motion differences.
- StandardScaler → full 4,998-D feature space
- 300 decision trees, max_depth=None
- Trained on 80/20 stratified split
- Best for accuracy-critical applications
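A minimal sketch of how Model 1 could be assembled and trained. The hyperparameters follow the description above; `random_state` and the synthetic stand-in data (64-D instead of 4,998-D, to keep the sketch fast) are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Model 1: StandardScaler feeding a 300-tree forest with unbounded depth
m1 = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=300, max_depth=None,
                                      random_state=42)),
])

# Synthetic stand-in for the real feature matrix: 180 clips, 18 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 64))
y = np.repeat(np.arange(18), 10)

# 80/20 stratified split, as described above
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
m1.fit(X_tr, y_tr)
```

Because scaling lives inside the Pipeline, the scaler's statistics are fit only on the training split and reapplied automatically at inference.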
Adds a PCA stage before the forest, compressing the scaled features to the top 100 principal components. This removes correlated noise in the joint data while preserving the dominant motion variance — useful for real-time inference budgets in live production.
- StandardScaler → PCA (100 components)
- Same 300-tree forest as M1
- Visualizable: 2D & 3D PCA scatter plots
- Best for latency-sensitive VP pipelines
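Model 2 adds a PCA stage between the scaler and the same forest. A sketch, with the component cap mirroring the safety rule noted in section 07; the 512-D training matrix is a synthetic stand-in for the real feature space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 80% of 180 clips, 18 classes
rng = np.random.default_rng(1)
X_train = rng.normal(size=(144, 512))
y_train = np.repeat(np.arange(18), 8)

# Cap components for small datasets (see section 07)
n_components = min(100, X_train.shape[0] - 1, X_train.shape[1])

m2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=n_components)),
    ("forest", RandomForestClassifier(n_estimators=300, random_state=42)),
])
m2.fit(X_train, y_train)

# The compressed space M2 actually classifies in
reduced = m2.named_steps["pca"].transform(
    m2.named_steps["scaler"].transform(X_train))
```

The `reduced` array is what the 2D/3D PCA scatter plots visualize: the first two or three of its columns, coloured by label.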
import pickle

# Load serialized models and label encoder
with open("finalized_model_M1.sav", "rb") as f:
    clf_full = pickle.load(f)
with open("finalized_model_M2.sav", "rb") as f:
    clf_reduced = pickle.load(f)
with open("label_encoder.sav", "rb") as f:
    le = pickle.load(f)

# Evaluate on the test set (X_test comes from the preprocessing pipeline)
score_m1 = clf_full.score(X_test, y_test)
score_m2 = clf_reduced.score(X_test, y_test)
print(f"M1 Accuracy: {score_m1:.3f}")
print(f"M2 Accuracy: {score_m2:.3f}")

# Predict on a new sequence (already preprocessed to a flat vector)
pred_encoded = clf_full.predict([new_feature_vector])
pred_label = le.inverse_transform(pred_encoded)
print(f"Detected Action: {pred_label[0]}")

// 04 — Performance
Model metrics
& comparisons
Both models were evaluated on a held-out 20% test set using stratified sampling to ensure all 18 action classes were represented. Confusion matrices and PCA scatter plots were generated to inspect where inter-class confusion occurs.
The notebook produces a 2D PCA scatter of all 180 clips (coloured by encoded label), a 3D PCA scatter, a 2D PCA scatter of test-set predictions through M2's learned component space, and a full confusion matrix using ConfusionMatrixDisplay.from_predictions. These reveal which action pairs the forest struggles to separate — typically slow, low-amplitude motions like phone call vs drink.
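The "which pairs does the forest confuse" question can be answered numerically as well as visually. A sketch using `sklearn.metrics.confusion_matrix` on illustrative stand-in labels (three classes instead of eighteen):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in test labels and predictions for three action classes
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)

# Zero the diagonal (correct predictions), then find the largest
# remaining cell: the (true, predicted) pair confused most often
off_diag = cm - np.diag(np.diag(cm))
worst = np.unravel_index(off_diag.argmax(), off_diag.shape)
```

On the real model, feeding the 18-class test predictions through the same two lines pinpoints pairs like phone call vs drink without reading the plot.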
// 05 — Why This Approach
Random Forest over
deep learning — intentionally
A transformer or LSTM could theoretically model temporal sequences better — so why Random Forest? In a constrained research and virtual production context, the decision is defensible on multiple grounds.
With only 180 clips across 18 classes, deep learning models would overfit immediately. Random Forests are known to generalize well on small-to-medium tabular datasets where the number of samples per class is low. The hand-crafted feature vector (positions + velocities, resampled to 60 frames) encodes enough temporal structure that a tree ensemble can discriminate effectively without needing raw sequence modelling.
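The positions-plus-velocities vector can be sketched as follows. One layout consistent with the stated 4,998-D count is 14 tracked joints with 60 position frames and 59 frame-to-frame velocity frames (60·14·3 + 59·14·3 = 4,998); this is a reconstruction, not necessarily the notebook's exact scheme, and the nearest-frame resampling is likewise an assumption.

```python
import numpy as np

def clip_to_features(joints: np.ndarray, n_frames: int = 60) -> np.ndarray:
    """Flatten a (T, J, 3) joint sequence into a fixed-length vector:
    resample to n_frames, then concatenate positions with
    frame-to-frame velocities."""
    T = joints.shape[0]
    idx = np.round(np.linspace(0, T - 1, n_frames)).astype(int)
    pos = joints[idx]                 # (60, J, 3) resampled positions
    vel = np.diff(pos, axis=0)        # (59, J, 3) per-frame deltas
    return np.concatenate([pos.ravel(), vel.ravel()])

# A 90-frame, 14-joint clip flattens to the 4,998-D vector under
# this assumed layout
vec = clip_to_features(np.zeros((90, 14, 3)))
```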
In a production pipeline where a TD or supervisor needs to understand why the classifier fired, Random Forests offer feature importances out of the box. You can inspect which joint, which axis, and which frame window most influenced a prediction — something opaque to RNNs or attention models. This matters when debugging a faulty trigger on a live volume set.
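The joint/axis/frame introspection described above amounts to un-flattening `feature_importances_`. A sketch, assuming the (60 frames × 14 joints × xyz) position-block layout from the feature-vector reconstruction; adapt the shape to the real preprocessing order.

```python
import numpy as np

def top_position_features(importances, n_frames=60, n_joints=14, k=2):
    """Map the position block of a flat importance vector back to
    (frame, joint, axis) indices, most important first."""
    pos_block = importances[: n_frames * n_joints * 3]
    order = np.argsort(pos_block)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, (n_frames, n_joints, 3)))
            for i in order]

# Toy importance vector with two planted hot spots; on the real model
# this would be m1.named_steps["forest"].feature_importances_
imp = np.zeros(4998)
imp[2] = 0.9    # frame 0, joint 0, z axis
imp[45] = 0.5   # frame 1, joint 1, x axis
```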
A pickled sklearn pipeline has zero runtime dependencies beyond numpy and scikit-learn — no GPU, no CUDA, no framework version pinning. It loads in milliseconds, runs on CPU in under 1 ms per inference, and integrates into any Python-based DCC tool (Maya, Houdini, Unreal's Python API) with a single pickle.load() call. That operational simplicity is a feature, not a compromise.
| Approach | Strengths | Weaknesses | Viable Here? |
|---|---|---|---|
| Random Forest ✓ | Small data, fast, interpretable, CPU-only | Loses fine temporal ordering | Yes — optimal for this scale |
| LSTM / GRU | Models temporal order natively | Needs 1,000s of clips, GPU, slow to train | Risky — would overfit at 180 clips |
| Transformer | State-of-the-art on large motion datasets | Data-hungry, heavy, hard to interpret | No — unsuitable at this data scale |
| SVM | Good on high-dim, small data | No feature importances, slow at inference | Comparable — RF preferred for introspection |
| 1D CNN | Captures local temporal patterns | Requires more data, harder to deploy in DCC | Possible next step with augmented data |
// 06 — Future Development
Where this system
could go next
The current build is a strong, documented baseline. These are the concrete next steps that would move it from a research proof-of-concept into a production-grade virtual production tool.
At 180 clips, the model is data-starved. Jitter joint positions with Gaussian noise, mirror sequences along the sagittal plane, randomly offset the root position, and time-warp clips ±20% to synthetically expand to 1,000+ samples. This alone would allow testing an LSTM or 1D-CNN to compare temporal-aware architectures.
- Gaussian noise on joint coords (±2 cm)
- Mirror flip across sagittal plane
- Time-warp ±20% speed variation
- Random root translation offset
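The four augmentations above can be sketched as one function over a (T, J, 3) clip. The x axis is assumed to be the left/right axis for the sagittal mirror, and the 10 cm root-offset range is an illustrative choice; swap in the rig's real conventions.

```python
import numpy as np

def augment(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Jitter, optionally mirror, time-warp, and root-shift one clip."""
    # Gaussian jitter on metric joint coords (~2 cm std)
    out = clip + rng.normal(scale=0.02, size=clip.shape)
    # Mirror across the sagittal plane half the time
    if rng.random() < 0.5:
        out = out * np.array([-1.0, 1.0, 1.0])
    # Time-warp: resample to a ±20% frame count
    T = out.shape[0]
    factor = rng.uniform(0.8, 1.2)
    idx = np.round(np.linspace(0, T - 1, int(T * factor))).astype(int)
    out = out[idx]
    # Random root translation applied to every joint and frame
    return out + rng.uniform(-0.1, 0.1, size=(1, 1, 3))
```

Applying this a handful of times per clip is what turns 180 samples into the 1,000+ needed to trial temporal architectures.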
Wrap the prediction pipeline in a sliding-window server: buffer the last 60 frames of live joint data from a Kinect or OptiTrack stream, run M2 every frame, and emit OSC or WebSocket events when a new action is detected with confidence above a threshold. This turns the batch classifier into a live trigger system for reactive VP scenes.
- Sliding window buffer (60-frame ring)
- OSC / WebSocket event emission
- Confidence threshold gating
- Integration with Unreal / TouchDesigner
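The trigger loop above can be sketched as a small class around a ring buffer. The `emit` callback stands in for the OSC/WebSocket layer, and the stub classifier exists only to make the sketch self-contained; any fitted pipeline with `predict_proba` slots in.

```python
from collections import deque

import numpy as np

class ActionTrigger:
    """Buffer the last `window` frames, classify the window, and emit
    an event when a new action clears the confidence threshold."""

    def __init__(self, model, featurize, emit, window=60, threshold=0.7):
        self.buffer = deque(maxlen=window)   # ring buffer of joint frames
        self.model = model                   # anything with predict_proba
        self.featurize = featurize           # (window, J, 3) -> flat vector
        self.emit = emit                     # stands in for OSC/WebSocket
        self.threshold = threshold
        self.last_label = None

    def push_frame(self, joints):
        self.buffer.append(joints)
        if len(self.buffer) < self.buffer.maxlen:
            return                           # window not yet full
        vec = self.featurize(np.stack(self.buffer))
        proba = self.model.predict_proba([vec])[0]
        label = int(np.argmax(proba))
        if proba[label] >= self.threshold and label != self.last_label:
            self.last_label = label          # fire once per action change
            self.emit(label, float(proba[label]))

# Demo with a stub classifier that always reports class 1 at 0.9
class _Stub:
    def predict_proba(self, X):
        return np.array([[0.1, 0.9]])

events = []
trig = ActionTrigger(_Stub(), lambda w: w.ravel(),
                     lambda label, conf: events.append((label, conf)),
                     window=3)
for _ in range(5):
    trig.push_frame(np.zeros((14, 3)))
```

Gating on a label *change* rather than every confident frame is what keeps a held pose from re-triggering scene logic sixty times a second.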
Once data augmentation or an expanded capture session provides 500+ clips per class, replace the flattened feature vector with a raw sequence and train an LSTM, GRU, or skeletal GCN (Graph Convolutional Network). Graph-based approaches treat the skeleton as a graph of joints connected by bones — architectures like ST-GCN have shown state-of-the-art results on NTU-RGB+D, a dataset structurally similar to KARD.
- LSTM on raw (T, J, 3) sequences
- ST-GCN — graph-based spatial + temporal
- Transformer encoder on frame tokens
- Benchmark against M1/M2 baseline
Package the preprocessing + M2 inference as a Maya Python shelf tool. Expose a live playback mode: as an animator scrubs a mocap clip, the tool updates a HUD overlay with the predicted action label and confidence — helping TDs verify retargeting accuracy or flag clips that were mislabelled during capture.
- Maya Python shelf installer
- PySide2 confidence HUD overlay
- Houdini VEX node via Python SOP
- Clip auto-labelling on import
Production volumes capture multiple performers simultaneously. Extend the pipeline to handle multiple skeleton streams in parallel, add a pose estimation step to detect person-to-person interactions, and train an interaction classifier on top of per-person action labels — enabling scene logic like "when performer A is pointing at performer B who is walking, trigger FX sequence C."
- Parallel multi-skeleton streams
- Relative joint features between bodies
- Interaction classification layer
- Scene-graph event routing
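The "relative joint features between bodies" bullet can be sketched for a pair of skeletons in the same world frame. Joint 0 as root and this particular feature set (inter-root offset, inter-root distance, per-joint distances to the other root) are illustrative assumptions.

```python
import numpy as np

def relative_features(skel_a: np.ndarray, skel_b: np.ndarray) -> np.ndarray:
    """Pairwise features for two (J, 3) skeletons: root-to-root offset,
    root-to-root distance, and each of A's joints' distance to B's root."""
    offset = skel_b[0] - skel_a[0]                      # 3 values
    dist = np.linalg.norm(offset)                       # 1 value
    to_b_root = np.linalg.norm(skel_a - skel_b[0], axis=1)  # J values
    return np.concatenate([offset, [dist], to_b_root])

# Two 14-joint skeletons, B standing 2 m from A along x
a = np.zeros((14, 3))
b = np.zeros((14, 3))
b[:, 0] += 2.0
feat = relative_features(a, b)
```

Concatenating these pairwise vectors with each performer's own action label is the input an interaction classifier would train on.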
The current system classifies a full, pre-segmented clip. A production-ready system needs to detect action boundaries in a continuous stream — transitioning from a classification problem to a temporal segmentation one. Models like MS-TCN (Multi-Scale Temporal Convolutional Network) are designed exactly for this: they ingest a full performance and output a frame-level action label sequence.
- Continuous stream boundary detection
- MS-TCN or CTC-based segmentation
- Frame-level action label sequences
- Onset / offset event timestamps
// 07 — Skills
Built to signal
ML + VP readiness
Every design decision mirrors patterns used in production-grade motion capture and ML pipelines — from the preprocessing choices to the serialization format.
Each CSV clip is parsed into a (T, J, 3) tensor, using 'Head' row detection to segment frames without relying on a fixed joint count — handling missing or irregular data gracefully.
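The 'Head'-based segmentation can be sketched as follows. The per-row format (joint name, x, y, z) is an assumption about the CSV layout; the point is that frame boundaries come from the joint name, not a fixed per-frame row count.

```python
import numpy as np

def parse_clip(rows):
    """Group (name, x, y, z) joint rows into frames: a new frame
    starts at each 'Head' row. Returns a (T, J, 3) array when joint
    counts are consistent across frames."""
    frames, current = [], []
    for name, x, y, z in rows:
        if name == "Head" and current:
            frames.append(current)      # close the previous frame
            current = []
        current.append((x, y, z))
    if current:
        frames.append(current)
    return np.array(frames)

# Two 2-joint frames, each opened by a 'Head' row
rows = [("Head", 0.0, 1.7, 0.0), ("Neck", 0.0, 1.5, 0.0),
        ("Head", 0.1, 1.7, 0.0), ("Neck", 0.1, 1.5, 0.0)]
tensor = parse_clip(rows)
```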
Each model is wrapped in a sklearn.pipeline.Pipeline so that scaler and (optionally) PCA parameters learned on training data are automatically applied consistently at inference time — preventing data leakage and simplifying deployment.
n_components = min(100, n_samples−1, n_features) safely caps components for small datasets. Explained-variance reporting and both 2D and 3D scatter plots confirm the components capture meaningful action-space structure.
Both fitted pipelines and the LabelEncoder are persisted as .sav files using Python's pickle module — following the assignment specification and a widely used convention in production ML workflows for model versioning and swap-in deployment.
stratify=y_enc ensures each of the 18 action classes is proportionally represented in both the 80% training split and the 20% held-out evaluation set — critical when class counts are uneven.