Virtual Prod
ML Rig Predictor
A two-model machine learning system that classifies 18 human motion actions from 3D skeletal joint data — designed for virtual production rigs where real-time action detection drives animation, FX, and scene logic.
// 01 — Overview
What it does
& why it matters
In virtual production, performers wear marker suits and move through a tracked volume — but raw 3D joint positions are only useful if the system knows what the performer is doing. This project builds that classification layer, training two Random Forest pipelines to map skeletal sequences to discrete action labels with no manual keyframing.
// 02 — Dataset
Kinect Action
Recognition Dataset
Each CSV file in the dataset represents a single motion clip captured with a Microsoft Kinect sensor. Joint positions are stored as real-world metric coordinates (x, y, z in meters), not pixel-space projections, making them directly usable for distance-based normalization.
180 total clips spanning 18 distinct physical actions, each performed by multiple subjects to ensure the model learns action-invariant features rather than person-specific motion signatures. Labels are encoded from filename prefixes (a01–a18).
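The filename-prefix labelling above can be sketched as a small helper plus sklearn's LabelEncoder. The exact suffix fields (subject, episode) in the sample filenames are illustrative assumptions; only the `a01`–`a18` prefix convention comes from the dataset description.

```python
import re
from sklearn.preprocessing import LabelEncoder

def label_from_filename(name: str) -> str:
    """Extract the action prefix (a01-a18) from a KARD-style clip filename."""
    match = re.match(r"(a\d{2})", name)
    if match is None:
        raise ValueError(f"no action prefix found in {name!r}")
    return match.group(1)

# Hypothetical filenames following the 'aXX_...' prefix convention
files = ["a01_s01_e01.csv", "a18_s02_e03.csv", "a05_s01_e02.csv"]

# Encode string prefixes to integer class ids for the classifier
le = LabelEncoder()
y_enc = le.fit_transform([label_from_filename(f) for f in files])
```

`inverse_transform` recovers the original string labels from the encoded ids at prediction time.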
// 03 — Models
Two pipelines,
one competition
Both models are sklearn.pipeline.Pipeline objects serialized with pickle. Model 1 works in the full feature space; Model 2 first compresses that space with PCA before handing it to the same forest architecture — trading a small accuracy margin for significantly faster inference.
Operates on the complete 4,998-dimensional feature vector (joint positions + frame-to-frame velocities). StandardScaler normalizes each feature before the forest. With 300 trees and unbounded depth, this model captures fine-grained motion differences.
- StandardScaler → full 4,998-D feature space
- 300 decision trees, max_depth=None
- Trained on 80/20 stratified split
- Best for accuracy-critical applications
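A minimal sketch of how Model 1 could be assembled and trained. The hyperparameters follow the description above; `random_state` and the synthetic stand-in data (64-D instead of 4,998-D, to keep the sketch fast) are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Model 1: StandardScaler feeding a 300-tree forest with unbounded depth
m1 = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=300, max_depth=None,
                                      random_state=42)),
])

# Synthetic stand-in for the real feature matrix: 180 clips, 18 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 64))
y = np.repeat(np.arange(18), 10)

# 80/20 stratified split, as described above
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
m1.fit(X_tr, y_tr)
```

Because scaling lives inside the Pipeline, the scaler's statistics are fit only on the training split and reapplied automatically at inference.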
Adds a PCA stage before the forest, compressing the scaled features to the top 100 principal components. This removes correlated noise in the joint data while preserving the dominant motion variance — useful for real-time inference budgets in live production.
- StandardScaler → PCA (100 components)
- Same 300-tree forest as M1
- Visualizable: 2D & 3D PCA scatter plots
- Best for latency-sensitive VP pipelines
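Model 2 adds a PCA stage between the scaler and the same forest. A sketch, with the component cap mirroring the safety rule noted in section 07; the 512-D training matrix is a synthetic stand-in for the real feature space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 80% of 180 clips, 18 classes
rng = np.random.default_rng(1)
X_train = rng.normal(size=(144, 512))
y_train = np.repeat(np.arange(18), 8)

# Cap components for small datasets (see section 07)
n_components = min(100, X_train.shape[0] - 1, X_train.shape[1])

m2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=n_components)),
    ("forest", RandomForestClassifier(n_estimators=300, random_state=42)),
])
m2.fit(X_train, y_train)

# The compressed space M2 actually classifies in
reduced = m2.named_steps["pca"].transform(
    m2.named_steps["scaler"].transform(X_train))
```

The `reduced` array is what the 2D/3D PCA scatter plots visualize: the first two or three of its columns, coloured by label.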
import pickle

# Load serialized models and label encoder
with open("finalized_model_M1.sav", "rb") as f:
    clf_full = pickle.load(f)
with open("finalized_model_M2.sav", "rb") as f:
    clf_reduced = pickle.load(f)
with open("label_encoder.sav", "rb") as f:
    le = pickle.load(f)

# Evaluate on the test set (X_test comes from the preprocessing pipeline)
score_m1 = clf_full.score(X_test, y_test)
score_m2 = clf_reduced.score(X_test, y_test)
print(f"M1 Accuracy: {score_m1:.3f}")
print(f"M2 Accuracy: {score_m2:.3f}")

# Predict on a new sequence (already preprocessed to a flat vector)
pred_encoded = clf_full.predict([new_feature_vector])
pred_label = le.inverse_transform(pred_encoded)
print(f"Detected Action: {pred_label[0]}")

// 04 — Performance
Model metrics
& comparisons
Both models were evaluated on a held-out 20% test set using stratified sampling to ensure all 18 action classes were represented. Confusion matrices and PCA scatter plots were generated to inspect where inter-class confusion occurs.
The notebook produces a 2D PCA scatter of all 180 clips (coloured by encoded label), a 3D PCA scatter, a 2D PCA scatter of test-set predictions through M2's learned component space, and a full confusion matrix using ConfusionMatrixDisplay.from_predictions. These reveal which action pairs the forest struggles to separate — typically slow, low-amplitude motions like phone call vs drink.
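The "which pairs does the forest confuse" question can be answered numerically as well as visually. A sketch using `sklearn.metrics.confusion_matrix` on illustrative stand-in labels (three classes instead of eighteen):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in test labels and predictions for three action classes
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)

# Zero the diagonal (correct predictions), then find the largest
# remaining cell: the (true, predicted) pair confused most often
off_diag = cm - np.diag(np.diag(cm))
worst = np.unravel_index(off_diag.argmax(), off_diag.shape)
```

On the real model, feeding the 18-class test predictions through the same two lines pinpoints pairs like phone call vs drink without reading the plot.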
// 05 — Why This Approach
Random Forest over
deep learning — intentionally
A transformer or LSTM could theoretically model temporal sequences better — so why Random Forest? In a constrained research and virtual production context, the decision is defensible on multiple grounds.
With only 180 clips across 18 classes, deep learning models would overfit immediately. Random Forests are known to generalize well on small-to-medium tabular datasets where the number of samples per class is low. The hand-crafted feature vector (positions + velocities, resampled to 60 frames) encodes enough temporal structure that a tree ensemble can discriminate effectively without needing raw sequence modelling.
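The positions-plus-velocities vector can be sketched as follows. One layout consistent with the stated 4,998-D count is 14 tracked joints with 60 position frames and 59 frame-to-frame velocity frames (60·14·3 + 59·14·3 = 4,998); this is a reconstruction, not necessarily the notebook's exact scheme, and the nearest-frame resampling is likewise an assumption.

```python
import numpy as np

def clip_to_features(joints: np.ndarray, n_frames: int = 60) -> np.ndarray:
    """Flatten a (T, J, 3) joint sequence into a fixed-length vector:
    resample to n_frames, then concatenate positions with
    frame-to-frame velocities."""
    T = joints.shape[0]
    idx = np.round(np.linspace(0, T - 1, n_frames)).astype(int)
    pos = joints[idx]                 # (60, J, 3) resampled positions
    vel = np.diff(pos, axis=0)        # (59, J, 3) per-frame deltas
    return np.concatenate([pos.ravel(), vel.ravel()])

# A 90-frame, 14-joint clip flattens to the 4,998-D vector under
# this assumed layout
vec = clip_to_features(np.zeros((90, 14, 3)))
```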
In a production pipeline where a TD or supervisor needs to understand why the classifier fired, Random Forests offer feature importances out of the box. You can inspect which joint, which axis, and which frame window most influenced a prediction — something opaque to RNNs or attention models. This matters when debugging a faulty trigger on a live volume set.
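The joint/axis/frame introspection described above amounts to un-flattening `feature_importances_`. A sketch, assuming the (60 frames × 14 joints × xyz) position-block layout from the feature-vector reconstruction; adapt the shape to the real preprocessing order.

```python
import numpy as np

def top_position_features(importances, n_frames=60, n_joints=14, k=2):
    """Map the position block of a flat importance vector back to
    (frame, joint, axis) indices, most important first."""
    pos_block = importances[: n_frames * n_joints * 3]
    order = np.argsort(pos_block)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, (n_frames, n_joints, 3)))
            for i in order]

# Toy importance vector with two planted hot spots; on the real model
# this would be m1.named_steps["forest"].feature_importances_
imp = np.zeros(4998)
imp[2] = 0.9    # frame 0, joint 0, z axis
imp[45] = 0.5   # frame 1, joint 1, x axis
```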
A pickled sklearn pipeline has zero runtime dependencies beyond numpy and scikit-learn — no GPU, no CUDA, no framework version pinning. It loads in milliseconds, runs on CPU in under 1 ms per inference, and integrates into any Python-based DCC tool (Maya, Houdini, Unreal's Python API) with a single pickle.load() call. That operational simplicity is a feature, not a compromise.
| Approach | Strengths | Weaknesses | Viable Here? |
|---|---|---|---|
| Random Forest ✓ | Small data, fast, interpretable, CPU-only | Loses fine temporal ordering | Yes — optimal for this scale |
| LSTM / GRU | Models temporal order natively | Needs 1,000s of clips, GPU, slow to train | Risky — would overfit at 180 clips |
| Transformer | State-of-the-art on large motion datasets | Data-hungry, heavy, hard to interpret | No — unsuitable at this data scale |
| SVM | Good on high-dim, small data | No feature importances, slow at inference | Comparable — RF preferred for introspection |
| 1D CNN | Captures local temporal patterns | Requires more data, harder to deploy in DCC | Possible next step with augmented data |
// 06 — Future Development
Where this system
could go next
The current build is a strong, documented baseline. These are the concrete next steps that would move it from a research proof-of-concept into a production-grade virtual production tool.
At 180 clips, the model is data-starved. Jitter joint positions with Gaussian noise, mirror sequences along the sagittal plane, randomly offset the root position, and time-warp clips ±20% to synthetically expand to 1,000+ samples. This alone would allow testing an LSTM or 1D-CNN to compare temporal-aware architectures.
- Gaussian noise on joint coords (±2 cm)
- Mirror flip across sagittal plane
- Time-warp ±20% speed variation
- Random root translation offset
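The four augmentations above can be sketched as one function over a (T, J, 3) clip. The x axis is assumed to be the left/right axis for the sagittal mirror, and the 10 cm root-offset range is an illustrative choice; swap in the rig's real conventions.

```python
import numpy as np

def augment(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Jitter, optionally mirror, time-warp, and root-shift one clip."""
    # Gaussian jitter on metric joint coords (~2 cm std)
    out = clip + rng.normal(scale=0.02, size=clip.shape)
    # Mirror across the sagittal plane half the time
    if rng.random() < 0.5:
        out = out * np.array([-1.0, 1.0, 1.0])
    # Time-warp: resample to a ±20% frame count
    T = out.shape[0]
    factor = rng.uniform(0.8, 1.2)
    idx = np.round(np.linspace(0, T - 1, int(T * factor))).astype(int)
    out = out[idx]
    # Random root translation applied to every joint and frame
    return out + rng.uniform(-0.1, 0.1, size=(1, 1, 3))
```

Applying this a handful of times per clip is what turns 180 samples into the 1,000+ needed to trial temporal architectures.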
Wrap the prediction pipeline in a sliding-window server: buffer the last 60 frames of live joint data from a Kinect or OptiTrack stream, run M2 every frame, and emit OSC or WebSocket events when a new action is detected with confidence above a threshold. This turns the batch classifier into a live trigger system for reactive VP scenes.
- Sliding window buffer (60-frame ring)
- OSC / WebSocket event emission
- Confidence threshold gating
- Integration with Unreal / TouchDesigner
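The trigger loop above can be sketched as a small class around a ring buffer. The `emit` callback stands in for the OSC/WebSocket layer, and the stub classifier exists only to make the sketch self-contained; any fitted pipeline with `predict_proba` slots in.

```python
from collections import deque

import numpy as np

class ActionTrigger:
    """Buffer the last `window` frames, classify the window, and emit
    an event when a new action clears the confidence threshold."""

    def __init__(self, model, featurize, emit, window=60, threshold=0.7):
        self.buffer = deque(maxlen=window)   # ring buffer of joint frames
        self.model = model                   # anything with predict_proba
        self.featurize = featurize           # (window, J, 3) -> flat vector
        self.emit = emit                     # stands in for OSC/WebSocket
        self.threshold = threshold
        self.last_label = None

    def push_frame(self, joints):
        self.buffer.append(joints)
        if len(self.buffer) < self.buffer.maxlen:
            return                           # window not yet full
        vec = self.featurize(np.stack(self.buffer))
        proba = self.model.predict_proba([vec])[0]
        label = int(np.argmax(proba))
        if proba[label] >= self.threshold and label != self.last_label:
            self.last_label = label          # fire once per action change
            self.emit(label, float(proba[label]))

# Demo with a stub classifier that always reports class 1 at 0.9
class _Stub:
    def predict_proba(self, X):
        return np.array([[0.1, 0.9]])

events = []
trig = ActionTrigger(_Stub(), lambda w: w.ravel(),
                     lambda label, conf: events.append((label, conf)),
                     window=3)
for _ in range(5):
    trig.push_frame(np.zeros((14, 3)))
```

Gating on a label *change* rather than every confident frame is what keeps a held pose from re-triggering scene logic sixty times a second.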
Once data augmentation or an expanded capture session provides 500+ clips per class, replace the flattened feature vector with a raw sequence and train an LSTM, GRU, or skeletal GCN (Graph Convolutional Network). Graph-based approaches treat the skeleton as a graph of joints connected by bones — architectures like ST-GCN have shown state-of-the-art results on NTU-RGB+D, a dataset structurally similar to KARD.
- LSTM on raw (T, J, 3) sequences
- ST-GCN — graph-based spatial + temporal
- Transformer encoder on frame tokens
- Benchmark against M1/M2 baseline
Package the preprocessing + M2 inference as a Maya Python shelf tool. Expose a live playback mode: as an animator scrubs a mocap clip, the tool updates a HUD overlay with the predicted action label and confidence — helping TDs verify retargeting accuracy or flag clips that were mislabelled during capture.
- Maya Python shelf installer
- PySide2 confidence HUD overlay
- Houdini VEX node via Python SOP
- Clip auto-labelling on import
Production volumes capture multiple performers simultaneously. Extend the pipeline to handle multiple skeleton streams in parallel, add a pose estimation step to detect person-to-person interactions, and train an interaction classifier on top of per-person action labels — enabling scene logic like "when performer A is pointing at performer B who is walking, trigger FX sequence C."
- Parallel multi-skeleton streams
- Relative joint features between bodies
- Interaction classification layer
- Scene-graph event routing
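The "relative joint features between bodies" bullet can be sketched for a pair of skeletons in the same world frame. Joint 0 as root and this particular feature set (inter-root offset, inter-root distance, per-joint distances to the other root) are illustrative assumptions.

```python
import numpy as np

def relative_features(skel_a: np.ndarray, skel_b: np.ndarray) -> np.ndarray:
    """Pairwise features for two (J, 3) skeletons: root-to-root offset,
    root-to-root distance, and each of A's joints' distance to B's root."""
    offset = skel_b[0] - skel_a[0]                      # 3 values
    dist = np.linalg.norm(offset)                       # 1 value
    to_b_root = np.linalg.norm(skel_a - skel_b[0], axis=1)  # J values
    return np.concatenate([offset, [dist], to_b_root])

# Two 14-joint skeletons, B standing 2 m from A along x
a = np.zeros((14, 3))
b = np.zeros((14, 3))
b[:, 0] += 2.0
feat = relative_features(a, b)
```

Concatenating these pairwise vectors with each performer's own action label is the input an interaction classifier would train on.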
The current system classifies a full, pre-segmented clip. A production-ready system needs to detect action boundaries in a continuous stream — transitioning from a classification problem to a temporal segmentation one. Models like MS-TCN (Multi-Scale Temporal Convolutional Network) are designed exactly for this: they ingest a full performance and output a frame-level action label sequence.
- Continuous stream boundary detection
- MS-TCN or CTC-based segmentation
- Frame-level action label sequences
- Onset / offset event timestamps
// 07 — Skills
Built to signal
ML + VP readiness
Every design decision mirrors patterns used in production-grade motion capture and ML pipelines — from the preprocessing choices to the serialization format.
Each CSV clip is parsed into a (T, J, 3) tensor, using 'Head' row detection to segment frames without relying on a fixed joint count — handling missing or irregular data gracefully.
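The 'Head'-based segmentation can be sketched as follows. The per-row format (joint name, x, y, z) is an assumption about the CSV layout; the point is that frame boundaries come from the joint name, not a fixed per-frame row count.

```python
import numpy as np

def parse_clip(rows):
    """Group (name, x, y, z) joint rows into frames: a new frame
    starts at each 'Head' row. Returns a (T, J, 3) array when joint
    counts are consistent across frames."""
    frames, current = [], []
    for name, x, y, z in rows:
        if name == "Head" and current:
            frames.append(current)      # close the previous frame
            current = []
        current.append((x, y, z))
    if current:
        frames.append(current)
    return np.array(frames)

# Two 2-joint frames, each opened by a 'Head' row
rows = [("Head", 0.0, 1.7, 0.0), ("Neck", 0.0, 1.5, 0.0),
        ("Head", 0.1, 1.7, 0.0), ("Neck", 0.1, 1.5, 0.0)]
tensor = parse_clip(rows)
```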
Each model is wrapped in a sklearn.pipeline.Pipeline so that scaler and (optionally) PCA parameters learned on training data are automatically applied consistently at inference time — preventing data leakage and simplifying deployment.
n_components = min(100, n_samples−1, n_features) safely caps components for small datasets. Explained-variance reporting and both 2D and 3D scatter plots confirm the components capture meaningful action-space structure.
Both fitted pipelines and the LabelEncoder are persisted as .sav files using Python's pickle module — following the assignment specification and a widely used convention in production ML workflows for model versioning and swap-in deployment.
stratify=y_enc ensures each of the 18 action classes is proportionally represented in both the 80% training split and the 20% held-out evaluation set — critical when class counts are uneven.