Machine Learning · Virtual Production · CPCS 483

Virtual Prod
ML Rig Predictor

A two-model machine learning system that classifies 18 human motion actions from 3D skeletal joint data — designed for virtual production rigs where real-time action detection drives animation, FX, and scene logic.

Python Scikit-learn Random Forest PCA NumPy Pandas Matplotlib Pickle Jupyter KARD Dataset
18
Action Classes
180
Motion Clips
2
RF Models
60
Frames/Clip

// 01 — Overview

What it does
& why it matters

In virtual production, performers wear marker suits and move through a tracked volume — but raw 3D joint positions are only useful if the system knows what the performer is doing. This project builds that classification layer, training two Random Forest pipelines to map skeletal sequences to discrete action labels with no manual keyframing.

// Data → Prediction Pipeline
01 📂 CSV Load: KARD joint coords per frame
02 🔧 Parse: (T, J, 3) array — frames × joints × xyz
03 📐 Resample: fixed 60 frames via linear interp
04 ⚖️ Normalize: root-center + torso scale
05 🧮 Features: positions + velocities → 4,998-D
06 🌲 Predict: Random Forest → action label
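
Condensed, stages 02–05 reduce to a few NumPy helpers. A minimal sketch, assuming the parser yields a (T, J, 3) float array and that the root and head joint indices are known (both indices are assumptions here; the notebook's actual signatures may differ):

import numpy as np

def resample_sequence(seq, target_len=60):
    """Step 03: linearly interpolate a (T, J, 3) clip to target_len frames."""
    T, J, C = seq.shape
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, target_len)
    out = np.empty((target_len, J, C))
    for j in range(J):
        for c in range(C):
            out[:, j, c] = np.interp(dst, src, seq[:, j, c])
    return out

def root_center_normalize(seq, root_idx=0, head_idx=1):
    """Step 04: remove global translation and body scale (joint indices assumed)."""
    centered = seq - seq[:, root_idx:root_idx + 1, :]                 # root at origin each frame
    scale = np.linalg.norm(centered[:, head_idx, :], axis=-1).mean()  # mean root-to-head distance
    return centered / max(scale, 1e-8)

def to_feature_vector(seq):
    """Step 05: positions plus frame-to-frame velocities, flattened to one row."""
    vel = np.diff(seq, axis=0)                                        # (T-1, J, 3) velocities
    return np.concatenate([seq.ravel(), vel.ravel()])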

// 02 — Dataset

Kinect Activity
Recognition Dataset

Source
KARD — Real World Coordinates

Each CSV file in the dataset represents a single motion clip captured with a Microsoft Kinect sensor. Joint positions are stored as real-world metric coordinates (x, y, z in meters), not pixel-space projections, making them directly usable for distance-based normalization.

Structure
18 Actions × 10 Subjects

180 total clips spanning 18 distinct physical actions, each performed by multiple subjects to ensure the model learns action-invariant features rather than person-specific motion signatures. Labels are encoded from filename prefixes (a01–a18).

Horizontal Wave High Wave Catch Cap High Throw Draw X Draw Tick Toss Paper Forward Kick Side Kick Take Umbrella Bend Hand Clap Walk Phone Call Drink Sit Down Stand Up Point
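
A sketch of that encoding step, assuming clip filenames start with the action prefix (the dataset folder name and any suffix pattern beyond a01–a18 are assumptions):

from pathlib import Path
from sklearn.preprocessing import LabelEncoder

# Collect clips and take the "aNN" action prefix from each filename
files = sorted(Path("KARD").glob("*.csv"))      # folder name assumed
labels = [f.stem[:3] for f in files]            # "a01" ... "a18"

le = LabelEncoder()
y_enc = le.fit_transform(labels)                # integers 0-17, later persisted as label_encoder.sav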

// 03 — Models

Two pipelines,
one competition

Both models are sklearn.pipeline.Pipeline objects serialized with pickle. Model 1 works in the full feature space; Model 2 first compresses that space with PCA before handing it to the same forest architecture — trading a small accuracy margin for significantly faster inference.

M1 — Full Feature
Random Forest
finalized_model_M1.sav

Operates on the complete 4,998-dimensional feature vector (joint positions + frame-to-frame velocities). StandardScaler normalizes each feature before the forest. With 300 trees and unbounded depth, this model captures fine-grained motion differences.

  • StandardScaler → full 4,998-D feature space
  • 300 decision trees, max_depth=None
  • Trained on 80/20 stratified split
  • Best for accuracy-critical applications
M2 — PCA Reduced
RF + PCA
finalized_model_M2.sav

Adds a PCA stage before the forest, compressing the scaled features to the top 100 principal components. This removes correlated noise in the joint data while preserving the dominant motion variance — useful for real-time inference budgets in live production.

  • StandardScaler → PCA (100 components)
  • Same 300-tree forest as M1
  • Visualizable: 2D & 3D PCA scatter plots
  • Best for latency-sensitive VP pipelines
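
Given the stages listed above, the two pipelines are plausibly assembled like this (a sketch; any hyperparameters beyond tree count, depth, and PCA size are assumptions):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# M1: scaler feeding a 300-tree forest on the full 4,998-D feature space
m1 = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=300, max_depth=None)),
])

# M2: identical forest, but features first compressed to 100 principal components
m2 = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=100)),
    ("rf", RandomForestClassifier(n_estimators=300, max_depth=None)),
])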
Python — Load & Run Both Models
import pickle
from pathlib import Path

def load_pickle(path):
    """Load a pickled object, closing the file handle cleanly."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Load serialized models and the label encoder
clf_full    = load_pickle("finalized_model_M1.sav")
clf_reduced = load_pickle("finalized_model_M2.sav")
le          = load_pickle("label_encoder.sav")

# Evaluate on the test set (X_test / y_test come from the preprocessing pipeline)
score_m1 = clf_full.score(X_test, y_test)
score_m2 = clf_reduced.score(X_test, y_test)

print(f"M1 Accuracy: {score_m1:.3f}")
print(f"M2 Accuracy: {score_m2:.3f}")

# Predict on a new sequence (already preprocessed to a flat feature vector)
pred_encoded = clf_full.predict([new_feature_vector])
pred_label   = le.inverse_transform(pred_encoded)
print(f"Detected Action: {pred_label[0]}")

// 04 — Performance

Model metrics
& comparisons

Both models were evaluated on a held-out 20% test set using stratified sampling to ensure all 18 action classes were represented. Confusion matrices and PCA scatter plots were generated to inspect where inter-class confusion occurs.

Model 1 · Full Feature Random Forest
  • Accuracy
  • Feature Dims: 4,998
  • Estimators: 300
  • Preprocessing: StandardScaler
  • Inference Speed: Standard

Model 2 · PCA-Reduced RF Pipeline
  • Accuracy
  • Feature Dims: 100
  • Estimators: 300
  • Preprocessing: Scaler + PCA
  • Inference Speed: Faster
Visualizations Generated
PCA Projections & Confusion Matrix

The notebook produces a 2D PCA scatter of all 180 clips (coloured by encoded label), a 3D PCA scatter, a 2D PCA scatter of test-set predictions through M2's learned component space, and a full confusion matrix using ConfusionMatrixDisplay.from_predictions. These reveal which action pairs the forest struggles to separate — typically slow, low-amplitude motions like phone call vs drink.
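
A sketch of how those plots can be produced with the models and test split from the snippet above (styling choices are assumptions):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Full 18-class confusion matrix from held-out predictions
y_pred = clf_reduced.predict(X_test)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=le.classes_, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()

# 2D scatter of the test set in M2's learned PCA component space
pcs = clf_reduced[:-1].transform(X_test)   # scaler + PCA stages only, forest excluded
plt.scatter(pcs[:, 0], pcs[:, 1], c=y_test, cmap="tab20", s=30)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()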

// 05 — Why This Approach

Random Forest over
deep learning — intentionally

A transformer or LSTM could theoretically model temporal sequences better — so why Random Forest? In a constrained research and virtual production context, the decision is defensible on multiple grounds.

Reason 01
Dataset Size

With only 180 clips across 18 classes, deep learning models would overfit immediately. Random Forests are known to generalize well on small-to-medium tabular datasets where the number of samples per class is low. The hand-crafted feature vector (positions + velocities, resampled to 60 frames) encodes enough temporal structure that a tree ensemble can discriminate effectively without needing raw sequence modelling.

Reason 02
Interpretability

In a production pipeline where a TD or supervisor needs to understand why the classifier fired, Random Forests offer feature importances out of the box. You can inspect which joint, which axis, and which frame window most influenced a prediction — something opaque to RNNs or attention models. This matters when debugging a faulty trigger on a live volume set.
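
For example, pulling ranked importances out of M1 takes a few lines (the step name "rf" is an assumption about how the pipeline is keyed):

import numpy as np

# Pull the forest out of the fitted pipeline and rank its features
rf = clf_full.named_steps["rf"]
importances = rf.feature_importances_
top = np.argsort(importances)[::-1][:10]

for rank, idx in enumerate(top, 1):
    # A flat index maps back to (frame, joint, axis) given the known feature layout
    print(f"{rank:2d}. feature {idx:4d}  importance = {importances[idx]:.4f}")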

Reason 03
Deployment Simplicity

A pickled sklearn pipeline has zero runtime dependencies beyond numpy and scikit-learn — no GPU, no CUDA, no framework version pinning. It loads in milliseconds, runs on CPU in under 1 ms per inference, and integrates into any Python-based DCC tool (Maya, Houdini, Unreal's Python API) with a single pickle.load() call. That operational simplicity is a feature, not a compromise.

// Approach tradeoffs — RF vs alternatives

Random Forest
  • Strengths: small data, fast, interpretable, CPU-only
  • Weaknesses: loses fine temporal ordering
  • Viable here? Yes — optimal for this scale

LSTM / GRU
  • Strengths: models temporal order natively
  • Weaknesses: needs 1,000s of clips, GPU, slow to train
  • Viable here? Risky — would overfit at 180 clips

Transformer
  • Strengths: state-of-the-art on large motion datasets
  • Weaknesses: data-hungry, heavy, hard to interpret
  • Viable here? No — unsuitable at this data scale

SVM
  • Strengths: good on high-dimensional, small data
  • Weaknesses: no feature importances, slow at inference
  • Viable here? Comparable — RF preferred for introspection

1D CNN
  • Strengths: captures local temporal patterns
  • Weaknesses: requires more data, harder to deploy in DCC
  • Viable here? Possible next step with augmented data

// 06 — Future Development

Where this system
could go next

The current build is a strong, documented baseline. These are the concrete next steps that would move it from a research proof-of-concept into a production-grade virtual production tool.

Near Term
Data Augmentation

At 180 clips, the model is data-starved. Jitter joint positions with Gaussian noise, mirror sequences across the sagittal plane, randomly offset the root position, and time-warp clips ±20% to synthetically expand the set to 1,000+ samples; a minimal sketch follows the list below. This alone would allow testing an LSTM or 1D CNN against the temporal-aware alternatives.

  • Gaussian noise on joint coords (±2 cm)
  • Mirror flip across sagittal plane
  • Time-warp ±20% speed variation
  • Random root translation offset
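
A minimal sketch of those four augmentations, reusing resample_sequence from the preprocessing sketch in the Overview (the 2 cm and ±20% magnitudes come from the list above; other values are assumptions):

import numpy as np

rng = np.random.default_rng(seed=7)   # seed chosen arbitrarily

def augment(seq, mirror_axis=0):
    """Yield four augmented copies of one (T, J, 3) clip in metric coordinates."""
    # 1. Gaussian jitter on joint coords, ~2 cm standard deviation
    yield seq + rng.normal(0.0, 0.02, seq.shape)
    # 2. Mirror across the sagittal plane (which axis that is depends on the rig)
    flipped = seq.copy()
    flipped[..., mirror_axis] *= -1.0
    yield flipped
    # 3. Time-warp ±20%, then resample back to the original length
    T = seq.shape[0]
    warped = resample_sequence(seq, int(T * rng.uniform(0.8, 1.2)))
    yield resample_sequence(warped, T)
    # 4. Random root translation (only meaningful before root-centering)
    yield seq + rng.normal(0.0, 0.10, size=(1, 1, 3))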
Near Term
Real-Time Inference

Wrap the prediction pipeline in a sliding-window server: buffer the last 60 frames of live joint data from a Kinect or OptiTrack stream, run M2 every frame, and emit OSC or WebSocket events when a new action is detected with confidence above a threshold. This turns the batch classifier into a live trigger system for reactive VP scenes; a minimal loop is sketched after the list below.

  • Sliding window buffer (60-frame ring)
  • OSC / WebSocket event emission
  • Confidence threshold gating
  • Integration with Unreal / TouchDesigner
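
A minimal per-frame loop under those constraints, reusing the preprocessing helpers and the loaded clf_reduced and le from earlier snippets (the window size matches the model's input length; the threshold value is an assumption):

from collections import deque
import numpy as np

WINDOW = 60        # frames per inference window, matching training
THRESHOLD = 0.6    # minimum class probability before an event is emitted

ring = deque(maxlen=WINDOW)   # 60-frame ring buffer

def on_frame(joints):
    """Call once per incoming mocap frame; joints is a (J, 3) array."""
    ring.append(joints)
    if len(ring) < WINDOW:
        return None                                   # still filling the buffer
    seq = root_center_normalize(np.stack(ring))       # same normalization as training
    proba = clf_reduced.predict_proba([to_feature_vector(seq)])[0]
    if proba.max() >= THRESHOLD:
        label = le.inverse_transform([proba.argmax()])[0]
        return label                                  # emit via OSC / WebSocket here
    return None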
Longer Term
Temporal Models

Once data augmentation or an expanded capture session provides 500+ clips per class, replace the flattened feature vector with a raw sequence and train an LSTM, GRU, or skeletal GCN (Graph Convolutional Network); a minimal LSTM sketch follows the list below. Graph-based approaches treat the skeleton as a graph of joints connected by bones — architectures like ST-GCN have shown state-of-the-art results on NTU RGB+D, a dataset structurally similar to KARD.

  • LSTM on raw (T, J, 3) sequences
  • ST-GCN — graph-based spatial + temporal
  • Transformer encoder on frame tokens
  • Benchmark against M1/M2 baseline
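
As a starting point for that comparison, a minimal PyTorch sketch of the LSTM variant (joint count, hidden size, and layer count are all assumptions):

import torch
import torch.nn as nn

class ActionLSTM(nn.Module):
    """Classify a raw (T, J, 3) skeletal sequence into one of 18 actions."""

    def __init__(self, n_joints=15, hidden=128, n_classes=18):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_joints * 3, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                         # x: (B, T, J, 3)
        B, T = x.shape[:2]
        out, _ = self.lstm(x.reshape(B, T, -1))   # flatten joints per frame
        return self.head(out[:, -1])              # logits from the final timestep

# e.g. logits = ActionLSTM()(torch.randn(8, 60, 15, 3))  yields shape (8, 18)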
Near Term
Maya / Houdini Integration

Package the preprocessing + M2 inference as a Maya Python shelf tool. Expose a live playback mode: as an animator scrubs a mocap clip, the tool updates a HUD overlay with the predicted action label and confidence — helping TDs verify retargeting accuracy or flag clips that were mislabelled during capture.

  • Maya Python shelf installer
  • PySide2 confidence HUD overlay
  • Houdini VEX node via Python SOP
  • Clip auto-labelling on import
Longer Term
Multi-Person Scenes

Production volumes capture multiple performers simultaneously. Extend the pipeline to handle multiple skeleton streams in parallel, add a pose estimation step to detect person-to-person interactions, and train an interaction classifier on top of per-person action labels — enabling scene logic like "when performer A is pointing at performer B who is walking, trigger FX sequence C."

  • Parallel multi-skeleton streams
  • Relative joint features between bodies
  • Interaction classification layer
  • Scene-graph event routing
Research
Action Segmentation

The current system classifies a full, pre-segmented clip. A production-ready system needs to detect action boundaries in a continuous stream — transitioning from a classification problem to a temporal segmentation one. Models like MS-TCN (Multi-Stage Temporal Convolutional Network) are designed exactly for this: they ingest a full performance and output a frame-level action label sequence.

  • Continuous stream boundary detection
  • MS-TCN or CTC-based segmentation
  • Frame-level action label sequences
  • Onset / offset event timestamps

// 07 — Skills

Built to signal
ML + VP readiness

Every design decision mirrors patterns used in production-grade motion capture and ML pipelines — from the preprocessing choices to the serialization format.

Skeletal Data Parsing
parse_sequence()
Robust CSV-to-array parser that reconstructs per-frame joint data into a (T, J, 3) tensor. Uses 'Head' row detection to segment frames without relying on a fixed joint count — handles missing or irregular data gracefully.
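A minimal sketch of that segmentation logic, assuming joint names sit in the first CSV column with x, y, z in the next three (the exact column layout is an assumption):

import numpy as np
import pandas as pd

def parse_sequence(csv_path):
    """Rebuild a (T, J, 3) array by splitting rows at each 'Head' marker."""
    df = pd.read_csv(csv_path, header=None)
    starts = df.index[df[0] == "Head"].tolist()     # one 'Head' row starts each frame
    frames = []
    for i, s in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(df)
        frames.append(df.iloc[s:end, 1:4].to_numpy(dtype=float))
    counts = [len(f) for f in frames]
    J = max(set(counts), key=counts.count)          # modal joint count
    frames = [f for f in frames if len(f) == J]     # drop irregular frames gracefully
    return np.stack(frames)                         # (T, J, 3)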
Temporal Resampling
resample_sequence()
Linear interpolation along the time axis resizes every clip to exactly 60 frames. This is the same approach used in retargeting and motion matching — decoupling capture frame rate from model input length.
Root-Space Normalization
root_center_normalize()
Subtracts the hip/root joint position each frame to remove global translation, then scales by the torso-to-head distance to remove global body scale. The model learns motion shape, not where in the volume the performer stood.
Velocity Features
Feature extraction
Frame-to-frame differences (velocities) are appended to raw joint positions before flattening, doubling the dynamic information fed to the classifier. Velocity features are what distinguish, for example, a slow wave from a fast one at identical peak positions.
Scikit-learn Pipelines
Both models
Both classifiers are wrapped in sklearn.pipeline.Pipeline so that scaler and (optionally) PCA parameters learned on training data are automatically applied consistently at inference time — preventing data leakage and simplifying deployment.
PCA Dimensionality Reduction
finalized_model_M2.sav
PCA with n_components = min(100, n_samples−1, n_features) safely caps components for small datasets. Explained-variance reporting and both 2D and 3D scatter plots confirm the components capture meaningful action-space structure.
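As a sketch, the cap plus the variance report look like this (X_train_scaled is a hypothetical name for the scaled training matrix):

from sklearn.decomposition import PCA

# Cap components so PCA stays valid on small datasets
n_components = min(100, X_train_scaled.shape[0] - 1, X_train_scaled.shape[1])
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_train_scaled)
print(f"Explained variance retained: {pca.explained_variance_ratio_.sum():.1%}")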
Model Serialization
.sav via pickle
Trained pipelines and the LabelEncoder are persisted as .sav files using Python's pickle module — following the assignment specification and a widely used convention in production ML workflows for model versioning and swap-in deployment.
Stratified Train/Test Split
train_test_split
stratify=y_enc ensures each of the 18 action classes is proportionally represented in both the 80% training split and the 20% held-out evaluation set — critical when class counts are uneven.
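As used here, that is a single call (random_state is an assumption):

from sklearn.model_selection import train_test_split

# 80/20 split with every action class proportionally represented
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.2, stratify=y_enc, random_state=42
)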