capstone · CPSC 490/491 · unity · ar foundation · llm · 4 Sprints · 5 Contributors

Augmented Object
Intelligence XR

A mobile AR capstone that makes everyday objects intelligent — point your camera at anything and AI-generated context appears anchored to it in 3D space. Built on Google's XR-Objects research, extended into a functional prototype across four sprints. My role was integration engineering and QA: the work that kept the Unity project building, the team unblocked, and every feature tested before it shipped.

Unity 2022.3 C# / URP AR Foundation ARKit ARCore MediaPipe LLM API (REST) GitHub Actions Xcode / iOS Meta Quest 3
4 Sprints · 5 Contributors · 31 Test Cases · 10+ Bugs Fixed · 100% CI Pass Rate

// 01 — The Goal

Making objects
intelligent.

The core idea behind this project is simple: what if you could point your phone at anything and the world explained itself back to you? Most AR experiences today overlay fixed, pre-authored content — a label, a logo, a marker. AOI XR takes a different approach. Instead of pre-authored responses, it uses a live LLM to generate context-aware answers the moment an object is detected. The overlay isn't static — it's a real AI response, tailored to that specific object, in that moment.

This project extends Google's published XR-Objects research, which demonstrated the concept but stopped short of a mobile-deployable prototype. Our goal was to close that gap — to build something that actually runs on an iPhone or Android device, is accessible to everyday users, and can serve real-world needs across education, healthcare, retail, and beyond. The Sprint 4 scavenger hunt demo is a concrete proof of that: a fully playable, LLM-driven AR experience running on a physical device.

🎯

Democratize intelligent AR

XR-Objects existed as a research demo. We wanted to make it real — running on an ordinary smartphone, not a research rig. No special hardware, no pre-authored content, just point and learn.

Accessibility first

A strong emphasis on inclusive design: colorblind-safe confidence palettes, support for visually impaired and neurodiverse users, and gesture/voice as alternative input methods alongside tap.

🔁

Real-world applicability

Not a toy — a foundation for practical use. A student sees an unfamiliar lab instrument and asks what it does. A nurse checks a medication bottle. A warehouse worker identifies a part. The LLM adapts to each context.

Designed for real domains

Education
Students can point at lab equipment, historical artifacts, or biological specimens and receive instant contextual explanations. The LLM can be prompted to match the user's level — elementary to graduate school. Removes the barrier of needing a teacher present at every moment of discovery.
Healthcare
Nurses and technicians can scan medication bottles, medical devices, or anatomical models for dosage info, usage warnings, or procedural reminders — reducing the cognitive load of high-stakes environments where referencing a manual takes precious time.
Retail & Industry
Warehouse workers identify unfamiliar parts. Retail staff pull product specs or inventory status by pointing at items. Field technicians get step-by-step guidance anchored directly to the machinery they're working on, without pulling out a manual or calling for help.
Accessibility
For users with visual impairments, the detected label can be read aloud via text-to-speech. Haptic feedback on detection events creates a non-visual signal. The system's architecture specifically supports adaptive content delivery based on user profile and environmental context.

// 02 — How It Works

Four stages.
Real time.

Every time the camera detects an object, four stages run in sequence — from raw camera frame to a 3D text anchor appearing in the user's physical environment. Each stage is a deliberate architectural choice balancing latency, accuracy, and device constraints.

📷
Capture
AR Camera + AR Session
frames & spatial data
🧠
Detect
MediaPipe on-device
object + hand tracking
💬
Query
AOI Integration Manager
→ LLM REST API
Anchor
AR Foundation places
world-space canvas
👁
Display
Billboard label locked
to physical object
📷

Stage 1 — Detection

AR Camera · XRCameraSubsystem · MediaPipe

AR Foundation's XRCameraSubsystem streams frames from the device camera while XRSessionSubsystem manages AR lifecycle. MediaPipe Unity Plugin runs a TensorFlow Lite object detection model on-device — no cloud vision call — and outputs detected object labels and bounding boxes each frame.

  • ≥5 FPS detection rate target
  • 10+ concurrent detections supported
  • On-device ML — privacy preserved, no streaming
  • Hand tracking enabled for gesture input
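
To make the hand-off concrete, here is a minimal sketch of how a detection result might be passed from the on-device detector into the rest of the pipeline. The struct, event, and confidence threshold shown are illustrative assumptions, not the MediaPipe Unity Plugin's actual API.

```csharp
using System;
using UnityEngine;

// Hypothetical shape of one on-device detection result.
public struct DetectionResult
{
    public string Label;        // e.g. "cup", "laptop"
    public float Confidence;    // 0..1 from the TFLite model
    public Rect ScreenBounds;   // bounding box in screen space
}

// Bridges detector output to downstream systems (LLM query, anchoring).
public class DetectionFeed : MonoBehaviour
{
    public event Action<DetectionResult> OnObjectDetected;

    [SerializeField] private float minConfidence = 0.5f;

    // Called by the detection layer for each result it produces.
    public void Report(DetectionResult result)
    {
        if (result.Confidence < minConfidence) return;
        OnObjectDetected?.Invoke(result);
    }
}
```
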
💬

Stage 2 — LLM Query

AOI Integration Manager · UnityWebRequest · REST LLM

The AOI Integration Manager receives the detection label, builds a structured natural-language prompt, and fires it to the LLM backend via UnityWebRequest. The backend — evaluated across Google PaLI, LLaMA, and OpenAI-compatible APIs — returns a contextual response. The manager routes that text to the anchor system. Target response time is under 100ms.

  • <100ms response time target
  • Swappable backend — endpoint and auth header only
  • API key secured via environment variables
  • Mock fallback for offline development and testing
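
As a rough sketch of this step, assuming a generic JSON endpoint — the payload shape, field names, and environment variable are illustrative, not the project's exact backend contract:

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class LLMQueryClient : MonoBehaviour
{
    [SerializeField] private string endpointUrl = "https://example.invalid/v1/generate"; // placeholder

    // Key is read from the environment, never hardcoded in source.
    private string ApiKey => System.Environment.GetEnvironmentVariable("AOI_LLM_API_KEY");

    // Builds a prompt from the detection label and POSTs it to the backend.
    public IEnumerator Query(string label, System.Action<string> onResponse)
    {
        string prompt = $"In one short sentence, explain what a \"{label}\" is and how it is used.";
        string body = JsonUtility.ToJson(new Payload { prompt = prompt });

        using (var request = new UnityWebRequest(endpointUrl, "POST"))
        {
            request.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(body));
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("Authorization", "Bearer " + ApiKey);

            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                onResponse(request.downloadHandler.text);    // routed on to the anchor system
            else
                onResponse($"[offline] {label}");             // mock fallback path for offline dev
        }
    }

    [System.Serializable] private class Payload { public string prompt; }
}
```

In use, the integration manager would start this as a coroutine and hand the callback's text to the anchoring stage.
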

Stage 3 — Spatial Anchoring

AR Foundation · World-Space Canvas · UIAnchorManager

The UIAnchorManager creates a world-space Unity Canvas at the detected object's estimated 3D position. A billboard effect ensures it always rotates to face the camera. Object pooling recycles anchor components instead of instantiating and destroying them every frame, eliminating the GC pressure that would cause visible stutters during sustained detection sessions.

  • Billboard effect — readable from any angle
  • Confidence-coded color with colorblind-safe palette
  • Object pool — zero GC allocations at runtime
  • Fade-in/out + pulse animation coroutines
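
The billboard behaviour is the easiest piece to show in code. A minimal sketch of the idea follows; the component name and the choice to keep labels upright are assumptions, not the project's exact implementation.

```csharp
using UnityEngine;

// Keeps a world-space anchor's canvas facing the AR camera so its text
// stays readable from any viewing angle.
public class BillboardAnchor : MonoBehaviour
{
    private Camera arCamera;

    private void Awake()
    {
        // Guarded lookup — Camera.main can be null before the AR session starts.
        arCamera = Camera.main;
    }

    private void LateUpdate()
    {
        if (arCamera == null)
        {
            arCamera = Camera.main;
            if (arCamera == null) return;
        }

        // Point the canvas away from the camera, keeping the label upright.
        Vector3 awayFromCamera = transform.position - arCamera.transform.position;
        awayFromCamera.y = 0f;
        if (awayFromCamera.sqrMagnitude > 0.0001f)
            transform.rotation = Quaternion.LookRotation(awayFromCamera, Vector3.up);
    }
}
```

Pooled anchors would simply have a component like this enabled and disabled rather than created and destroyed.
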
🎮

Sprint 4 — Scavenger Hunt

LLM item gen · Tap-to-detect · AR Marker Spawner

The final sprint pivoted the detection pipeline into an AR scavenger hunt for the live class demo. The LLM generates a 5–10 item list appropriate to the physical environment (classroom, outdoor, etc.), a new TapToDetectionFeeder feeds taps into the detection pipeline, a DetectorBridge formats results for matching, and AR markers spawn on detected surfaces via raycasting. A full session lifecycle runs from main menu through gameplay to end screen.

  • LLM generates context-appropriate item lists
  • Tap → detect → match → highlight → register
  • AR markers placed on detected planes via raycast
  • Session timer · ascending score · clean end state
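
A sketch of the tap-to-place step using AR Foundation's raycast manager. The marker prefab field is illustrative, and in the real project the tap would also be routed through TapToDetectionFeeder; only the raycast call itself is standard AR Foundation API.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;

public class TapMarkerSpawner : MonoBehaviour
{
    [SerializeField] private ARRaycastManager raycastManager;
    [SerializeField] private GameObject markerPrefab;   // illustrative marker prefab

    private static readonly List<ARRaycastHit> hits = new List<ARRaycastHit>();

    private void Update()
    {
        if (Input.touchCount == 0) return;

        Touch touch = Input.GetTouch(0);
        if (touch.phase != TouchPhase.Began) return;

        // Raycast from the tap position against detected planes.
        if (raycastManager.Raycast(touch.position, hits, TrackableType.PlaneWithinPolygon))
        {
            Pose hitPose = hits[0].pose;
            Instantiate(markerPrefab, hitPose.position, hitPose.rotation);
            // The project would also feed this tap into the detection pipeline here.
        }
    }
}
```
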

// 03 — Architecture

Why we built it
this way.

Each technology decision was a deliberate trade-off. Understanding these choices is part of understanding the project — not just what was built, but why.

Unity over native
ARKit / ARCore / OpenXR
Writing native Swift for iOS, Kotlin for Android, and a separate Quest build would have tripled the codebase for the same feature set. Unity's AR Foundation abstraction lets a single C# codebase drive ARKit on iOS, ARCore on Android, and OpenXR on Meta Quest — switching providers by changing a build target, not rewriting subsystem logic. The trade-off is that Unity's URP pipeline requires explicit configuration for AR camera transparency, which produced real bugs (yellow background, black screen) that had to be diagnosed and fixed during the project.
On-device detection
MediaPipe over cloud vision
Streaming camera frames to a cloud vision API introduces a network round-trip for every detection — unacceptable for AR where you need feedback within a frame or two. MediaPipe runs a TensorFlow Lite model directly on the device CPU/GPU, keeping detection fast enough to hit the ≥5 FPS target with no network dependency and no privacy risk from transmitting camera data. The limitation is that on-device object detection models are less semantically rich than a full cloud model, which is why the LLM handles interpretation rather than the detector.
Split inference
Local detect + Cloud LLM
Modern multimodal LLMs are billions of parameters — far too large for mobile inference at reasonable speed. The solution is split inference: MediaPipe does the fast, cheap, on-device spatial grounding (what is this object and where is it in 3D space?), and a single lightweight REST call gives that label to a cloud LLM for semantic enrichment (what should I tell the user about this object?). This mirrors the architecture of the XR-Objects paper itself. The integration layer is intentionally backend-agnostic — swapping from PaLI to LLaMA to any OpenAI-compatible API requires only changing the endpoint URL and authentication header.
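
As a rough illustration of what "endpoint and header only" means in practice, a backend definition might be as small as the following — the field names and asset-based approach are assumptions, not the project's actual configuration class:

```csharp
using UnityEngine;

// Everything backend-specific lives in one small asset; the rest of the
// pipeline only ever sees "send prompt, receive text".
[CreateAssetMenu(menuName = "AOI/LLM Backend Config")]
public class LLMBackendConfig : ScriptableObject
{
    public string endpointUrl;                       // PaLI, LLaMA server, or OpenAI-compatible
    public string authHeaderName = "Authorization";  // some backends expect "x-api-key"
    public string authHeaderPrefix = "Bearer ";

    // The key itself comes from the environment, never from source control.
    public string ApiKey =>
        System.Environment.GetEnvironmentVariable("AOI_LLM_API_KEY");
}
```

Swapping backends then means creating a second asset and pointing the manager at it, with no change to the detection or anchoring code.
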
World-space anchors
over HUD / screen-space UI
A 2D HUD overlay loses its connection to the physical object the moment the user moves the camera — the label floats on screen while the object slides out of frame. World-space anchors are positioned in the 3D coordinate space that AR Foundation maintains relative to the real environment, so they stay associated with the physical object as the user walks around it. The billboard effect solves the readability problem: no matter what angle you view the anchor from, its text plane rotates to face you.
Feature branches + CI gates
over trunk-based development
Unity projects fail in ways that are silent — a missing .meta file, a scene load order issue, or a misplaced serialized field can leave the project compiling but broken at runtime. Feature branch development with mandatory PR reviews caught several of these before they reached main. The CI pipeline adds automated checks specifically designed for Unity failure modes: meta file validation, null safety scanning, script compilation verification, and a smoke test that fails the build on any logged exception during scene initialization.

// 04 — My Role

Integration engineer.
The glue of the project.

My contribution was less about owning a single feature and more about owning reliability. When the Unity project entered Safe Mode in Sprint 4, I diagnosed and fixed the compilation errors that blocked the entire team. When scene transitions broke after a merge, I rewired the Build Settings and inspector references. When the AR camera rendered a yellow background or black screen on an iOS build, I traced the URP clear flag configuration and fixed it. Alongside that integration work, I ran QA across every sprint and contributed to the CI/CD pipeline gates that prevented these issues from recurring.

Primary
Unity Integration & Scene Engineer

AR rendering, scene transitions, inspector wiring, merge conflict resolution, Safe Mode compilation fixes. The work that kept the project in a runnable state.

Secondary
QA & Manual Testing

Test plans, manual test execution, and documented expected vs. actual results across AR rendering, iOS camera permissions, and scene flow — every sprint.

Throughout
Code Reviewer & Team Lead

Reviewed every teammate PR with technical feedback on null safety, coroutine cleanup, and architecture. Managed communication, task division, and sprint coordination.

Contribution · What I did · Sprint
AR Foundation Setup · Set up Unity's AR Foundation subsystem and provider layer in Sprint 1 — wiring XRSessionSubsystem, XRCameraSubsystem, XRPlaneSubsystem, and XRImageTrackingSubsystem to their ARKit (iOS) and ARCore/OpenXR (Quest) providers. Configured Player Settings for AR compatibility across iOS and Android. This established the AR session lifecycle and camera intrinsics pipeline that every later feature relied on. · S1
iOS Build Work · Committed iterative iOS builds throughout the project to validate AR rendering on physical hardware. Identified the Unity → Xcode connection error in Sprint 1 (documented as a known bug, fixed in Sprint 2). In Sprints 2–3 debugged the full iOS build pipeline — camera clear flags, URP transparency, background color, and Xcode export settings — and documented iOS-specific AR build differences for camera permissions and device compatibility. · S1 – S3
AR Rendering Fixes · Debugged URP transparency and background rendering for iOS builds — fixed camera clear flags, background color (yellow bug), and scene lighting resets after merges. Resolved black screen after URP build. Maintained consistent AR camera rendering across merges and Player Settings changes across both sprints. · S2 · S3
Scene Management · Organized Asset and Scene hierarchy for cross-platform builds. Merged MainScene.unity after binary conflict. Verified scene references, prefab integrity, and material assignments after every teammate merge across three sprints to prevent silent breakage. · S1 – S3
AOISetupHelper Expansion · Expanded AOISetupHelper in Sprint 3 so the full XR stack installs automatically — auto-creating ARSession, ARSessionOrigin, AR Camera, world-space Canvas, and required Prefabs on scene load. This eliminated the manual per-machine setup friction that caused inconsistent environments across the team, and validated camera permissions and anchor pool creation as part of the auto-init flow. · S3
Anchor Debug Panel · Built UIAnchorDebugPanel — in-editor context menu buttons for running deterministic anchor smoke tests, counting active anchors, and clearing the pool. Validates the full anchor pipeline without needing a physical AR device connected. · S3
Anchor Lifecycle · Refactored UIAnchor to properly handle fade-in/out, pulse animation, and click callbacks. Fixed coroutine timing conflicts that caused anchors to persist after their lifetime expired. Implemented UpdateAnchor() for dynamic detection refreshes. · S3
CI/CD Extensions · Added pre-commit C# formatter (whitespace + naming rules), script compilation validation, missing .meta file detection, and pipeline logging. Wired the CI smoke test to fail the build if any exception is logged during scene initialization — turning a manual catch into an automated gate. · S3
UI & Scene Flow · Implemented Entry/Main Menu screen, gameplay navigation, and Exit/Quit for the Sprint 4 scavenger hunt. Repaired scene management connections across menu → gameplay → end screen, including correcting Build Settings ordering and rewiring broken inspector references after merge. · S4
Compilation Unblocking · Resolved the Unity Safe Mode startup issue that prevented the project from running — identified invalid class structures, misplaced [SerializeField] attributes, and missing type references introduced by a merge, then corrected them to restore the project for the team before the live demo. · S4
Code Reviews · Reviewed PRs each sprint: scene merges, MediaPipe simulation updates, AOISetupHelper, UI anchor adjustments. Feedback covered null reference handling around Camera.main, animation coroutine cleanup, confidence threshold hardcoding, and Unity naming conventions. · S1 – S4
Project Coordination · Managed team communication across all four sprints. Logged merge issues and bug reports clearly so teammates could act without re-investigating the same problem. Divided tasks for Sprint 4, ran recap meetings, and contributed to all sprint deliverable documentation. · S1 – S4

// 05 — Quality Assurance

Testing a system
you can't easily mock.

AR systems are notoriously hard to test — the environment is physical, the camera feed is live, and subsystems like ARKit and ARCore don't have simple unit-testable interfaces. The test strategy had to work around this by splitting coverage across four approaches: automated smoke tests in CI, manual functional tests on device, in-editor simulation via mock data, and formal test case documentation with build-linked results for traceability.

8 Testing Methodologies

Exploratory

Open-ended sessions on physical devices to find edge cases the spec didn't anticipate — especially around AR rendering, camera permissions, and lighting changes in real environments.

Smoke Testing

Scene load and AR subsystem initialization verified on every CI run. The UIAnchorDebugPanel enables deterministic in-editor smoke tests without a connected AR device.

Scenario-Based

Full end-to-end flows: object detected → LLM queried → anchor placed → user taps anchor → result displayed. Tests the integration seams between components, not just individual units.

Performance

Frame rate benchmarks targeting ≥5 FPS detection, <100ms LLM response, and memory usage under 2GB during sustained AR sessions. Anchor pooling verified to eliminate GC allocation spikes.

Security

API key exposure scanning (no hardcoded keys in source), input validation for LLM prompts (injection prevention), and camera data privacy (no unnecessary storage or transmission of frames).

Accessibility

Colorblind-safe confidence color palette verified across three standard color blindness types. Anchor text contrast ratios checked. Foundation for future text-to-speech and haptic output.

Integration

Verifying the full detection → manager → LLM → anchor chain works together across real build targets, not just in editor. Cross-platform parity between iOS ARKit and Android ARCore builds.

Regression

Post-merge re-runs of all test cases after scene conflicts, PR merges, or Player Settings changes — preventing the category of bug where a fix in one area silently breaks another.

31 Test Cases Across 4 Sprints

Each test case includes component scope, configuration, exact steps, expected result, and a linked build commit for traceability. Sprints 1–3 built the core anchor and detection coverage; Sprint 4 added tap-to-detect and scavenger hunt session flow.

TC-001 Functional
Anchor Creation Smoke Test
TC-002 Functional
Anchor Pooling Load & Lifetime
TC-003 Functional
Capacity & Rate Limiting
TC-004 Integration
LLM Connectivity Test
TC-005 Integration
Detection Loop Stability
TC-006 Performance
Frame Rate Benchmark
TC-007 Performance
Memory Usage Under Load
TC-008 Security
API Key Exposure Scan
TC-009 Security
Input Validation / Injection
TC-010 Accessibility
Colorblind Palette Verification
TC-011 Functional
Billboard Effect
TC-012 Functional
Anchor Lifetime Expiration
TC-013 Functional
Max Anchor Capacity
TC-014 Integration
End-to-End Detection Flow
TC-015 Performance
Network Failure Graceful Degradation
TC-016 Security
Invalid API Key Handling
TC-017 Functional
Anchor Click Callback
TC-018 Integration
AOISetupHelper Auto-Init
TC-019 Functional
Tap-Based Detection Trigger
TC-020 Functional
Tap Detection Accuracy
TC-021 Functional
Object Matching Logic
TC-022 Functional
Tap-to-Register
TC-023 Functional
Proximity Detection Range
TC-024 Integration
Game State Management
TC-025 Functional
Visual Highlighting System
TC-026 Integration
Format Conversion (MediaPipe Bridge)
TC-027 Functional
Item Registration Logic
TC-028 Functional
Duplicate Registration Guard
TC-029 Functional
UI Auto-Creation (UIHelper)
TC-030 Functional
UI State After Registration
TC-031 Integration
End-to-End Scavenger Hunt Flow

// 06 — CI / CD Pipeline

Automated gates
for a Unity project.

Unity projects break in ways that aren't obvious from a diff. A missing .meta file, a scene serialized with a stale GUID, or a missing component reference compiles cleanly but crashes at runtime. The CI/CD pipeline was designed around these Unity-specific failure modes — with quality gates that catch issues before they reach main and cause the team to lose a working build. Every push and pull request runs the full pipeline automatically.

🔀
Trigger
Push or PR to
main / develop
🔍
Quality Gate
Lint · null scan
meta validation
my additions
🧪
Test
Unity Edit Mode
Play Mode tests
💨
Smoke Test
Scene init · subsystems
fail on exception
my additions
🏗️
Build
iOS · Android
GitHub artifact
🚀
Release
Artifact upload
version tag
// checks I added — Sprint 3
  • Pre-commit C# formatter — enforces whitespace and naming rules before a commit lands, so style issues never enter PR review
  • Null safety scan — counts null checks in every C# script; flags components that dereference without guarding (the root cause of BUG-001)
  • Error handling verification — scans for try-catch coverage in async and network-facing code paths
  • Missing .meta file detection — Unity silently breaks when .meta files are absent from commits; this gate catches them before merge
  • Script compilation validation — runs Unity's headless compiler to surface compile errors that only appear on specific platform targets
  • Smoke test gate — boots the Unity scene in Edit Mode via CLI; if any exception is logged during initialization, the build fails immediately
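
The smoke test gate can be sketched as an Edit Mode test — the scene path and the decision to also fail on plain errors are assumptions, but the mechanism (capture anything logged while the scene loads, fail the test, fail the CI run) is the one described above:

```csharp
using System.Collections.Generic;
using NUnit.Framework;
using UnityEditor.SceneManagement;
using UnityEngine;

public class SceneSmokeTest
{
    [Test]
    public void MainScene_Initializes_Without_Logged_Exceptions()
    {
        var failures = new List<string>();
        Application.LogCallback capture = (condition, stackTrace, type) =>
        {
            if (type == LogType.Exception || type == LogType.Error)
                failures.Add(condition);
        };

        Application.logMessageReceived += capture;
        try
        {
            // Hypothetical scene path — the CI job opens the project's main scene.
            EditorSceneManager.OpenScene("Assets/Scenes/MainScene.unity");
        }
        finally
        {
            Application.logMessageReceived -= capture;
        }

        // Any captured exception or error fails the test, which fails the run.
        Assert.IsEmpty(failures, string.Join("\n", failures));
    }
}
```
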
// full pipeline checks (Riya + team)
  • Unity Edit Mode and Play Mode test execution on every push
  • Multi-platform build matrix — iOS (ARKit), Android (ARCore)
  • Component initialization checks — Start/Awake present in MonoBehaviours
  • Input validation patterns — IsNullOrEmpty / IsNullOrWhiteSpace coverage
  • API key security scan — regex-based detection, excludes valid placeholder strings
  • Code quality checks — TODO/FIXME detection, script file counting
  • Artifact upload — build outputs stored per run for download and device deployment
  • Cross-platform compatibility — Ubuntu, macOS, and Windows runners tested

Security & Threat Model

The project included a formal threat model identifying four attack surfaces and their mitigations — integrated into the CI pipeline rather than left as documentation only.

Asset / Surface · Risk · Mitigation
AR Camera Data · Data leakage — raw camera frames captured or intercepted · TLS encryption on all API calls; frames never stored or transmitted beyond the LLM prompt; no cloud vision API used for detection
LLM API Key · Key exposure in source code or logs · Environment variables only — never hardcoded; CI security scan with regex detection on every push; excluded placeholder strings from false-positive triggering
LLM Prompt Input · Prompt injection via crafted object labels or user input · Input validation on all user-generated strings before they enter the prompt template; IsNullOrEmpty guards on detection labels; CI scan for validation patterns
Cloud Storage / Backend · Unauthorized access to backend APIs or stored data · Role-based access control; Firebase authentication for backend endpoints; TLS/SSL on all client-server communication

// 07 — Team

Started as 3.
Grew to 5 for the demo.

The project ran through CPSC 490/491 with Alyssa Barrientos and Riya Jain as the two primary contributors after a third original team member withdrew following Sprint 1. Three additional contributors joined for Sprint 4 to build the scavenger hunt prototype for the live class demo.

Alyssa Barrientos
Sprints 1–4  ·  me
  • AR rendering & URP fixes
  • Scene management & merges
  • Anchor debug tooling
  • UI / scene flow (Sprint 4)
  • Compilation unblocking
  • Code reviews · QA · coordination
Riya Jain
Sprints 1–4
  • MediaPipe integration
  • AOI Integration Manager & LLM backend
  • GitHub Actions CI/CD pipeline
  • Formal test cases (TC-001–005, TC-019–031)
  • Bug tracking & operations docs
Sprint 4 Team
Sprint 4 only · 3 contributors
  • David — LLM item list gen, timer, screens
  • Mohamed — session lifecycle & interaction flow
  • Marco — offline generator, AR marker spawner, UI manager

// 08 — Future Work

Where this goes
from here.

The prototype proves the concept — detection, LLM query, and spatial anchor rendering all running in a real AR session. These are the natural next engineering steps to close the gap between research prototype and production system.

// 01

On-Device LLM

Replace the REST call with a quantized model (Phi-3, Gemma 2B, or LLaMA 3 8B) running via llama.cpp or ExecuTorch on the device. Eliminates network latency, API key exposure, and the offline failure mode entirely. Feasible on iPhone 15 Pro and modern Android flagships.

// 02

Vision-Language Input

Upgrade from label-based prompting (MediaPipe outputs "chair" → send "chair" to LLM) to sending a cropped image patch to a vision-language model (LLaVA, PaLI-Gemma). Richer semantic context, better responses for ambiguous or unusual objects the detector labels poorly.

// 03

Persistent Spatial Memory

Use ARKit Scene Reconstruction or ARCore's Geospatial API to persist anchor positions between sessions — so the same physical object in the same room shows its cached LLM response on re-entry without re-querying the backend.

// 04

Multi-User Shared AR

Synchronize anchor state across devices via ARCore Cloud Anchors or Niantic Lightship. Multiple users in the same physical space could collaboratively tag and annotate objects — turning the tool into a shared, persistent knowledge layer for classrooms or field teams.

// 05

Full Accessibility Layer

Text-to-speech output of anchor labels for users with visual impairments, haptic feedback on detection events, high-contrast and large-text anchor modes. The colorblind-safe palette from Sprint 3 is the first layer of this work.

// 06

Known Issues to Fix

Unsubscribe from AR events in OnDisable as well as OnDestroy in AOIIntegrationManager to close the memory leak. Replace the Arial.ttf runtime font reference with a bundled asset. Add structured fallback caching when the LLM endpoint is unreachable.
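
The first of these is a small pattern worth showing. A sketch of the intended fix follows — the specific event used here (planesChanged) is illustrative; the real manager subscribes to its own detection and session events:

```csharp
using UnityEngine;
using UnityEngine.XR.ARFoundation;

public class AOIIntegrationManager : MonoBehaviour
{
    [SerializeField] private ARPlaneManager planeManager;

    private void OnEnable()
    {
        if (planeManager != null) planeManager.planesChanged += HandlePlanesChanged;
    }

    // Unsubscribing in OnDisable (not only OnDestroy) releases the handler
    // whenever the component is toggled off, closing the leak.
    private void OnDisable()
    {
        if (planeManager != null) planeManager.planesChanged -= HandlePlanesChanged;
    }

    private void HandlePlanesChanged(ARPlanesChangedEventArgs args)
    {
        // ... react to added / updated / removed planes ...
    }
}
```
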