capstone · CPSC 490/491 · unity · ar foundation · llm · 4 Sprints · 5 Contributors

Augmented Object
Intelligence XR

A mobile AR capstone that makes everyday objects intelligent — point your camera at anything and AI-generated context appears anchored to it in 3D space. Built on Google's XR-Objects research, extended into a functional prototype across four sprints. My role was integration engineering and QA: the work that kept the Unity project building, the team unblocked, and every feature tested before it shipped.

Unity 2022.3 C# / URP AR Foundation ARKit ARCore MediaPipe LLM API (REST) GitHub Actions Xcode / iOS Meta Quest 3
4 Sprints · 5 Contributors · 31 Test Cases · 10+ Bugs Fixed · 100% CI Pass Rate

// 01 — The Goal

Making objects
intelligent.

The core idea behind this project is simple: what if you could point your phone at anything and the world explained itself back to you? Most AR experiences today overlay fixed, pre-authored content — a label, a logo, a marker. AOI XR takes a different approach. Instead of pre-authored responses, it uses a live LLM to generate context-aware answers the moment an object is detected. The overlay isn't static — it's a real AI response, tailored to that specific object, in that moment.

This project extends Google's published XR-Objects research, which demonstrated the concept but stopped short of a mobile-deployable prototype. Our goal was to close that gap — to build something that actually runs on an iPhone or Android device, is accessible to everyday users, and can serve real-world needs across education, healthcare, retail, and beyond. The Sprint 4 scavenger hunt demo is a concrete proof of that: a fully playable, LLM-driven AR experience running on a physical device.

🎯

Democratize intelligent AR

XR-Objects existed as a research demo. We wanted to make it real — running on an ordinary smartphone, not a research rig. No special hardware, no pre-authored content, just point and learn.

Accessibility first

A strong emphasis on inclusive design: colorblind-safe confidence palettes, support for visually impaired and neurodiverse users, and gesture/voice as alternative input methods alongside tap.

🔁

Real-world applicability

Not a toy — a foundation for practical use. A student sees an unfamiliar lab instrument and asks what it does. A nurse checks a medication bottle. A warehouse worker identifies a part. The LLM adapts to each context.

Designed for real domains

Education
Students can point at lab equipment, historical artifacts, or biological specimens and receive instant contextual explanations. The LLM can be prompted to match the user's level — elementary to graduate school. Removes the barrier of needing a teacher present at every moment of discovery.
Healthcare
Nurses and technicians can scan medication bottles, medical devices, or anatomical models for dosage info, usage warnings, or procedural reminders — reducing the cognitive load of high-stakes environments where referencing a manual takes precious time.
Retail & Industry
Warehouse workers identify unfamiliar parts. Retail staff pull product specs or inventory status by pointing at items. Field technicians get step-by-step guidance anchored directly to the machinery they're working on, without pulling out a manual or calling for help.
Accessibility
For users with visual impairments, the detected label can be read aloud via text-to-speech. Haptic feedback on detection events creates a non-visual signal. The system's architecture specifically supports adaptive content delivery based on user profile and environmental context.

// 02 — How It Works

Four stages.
Real time.

Every time the camera detects an object, four stages run in sequence — from raw camera frame to a 3D text anchor appearing in the user's physical environment. Each stage is a deliberate architectural choice balancing latency, accuracy, and device constraints.

📷
Capture
AR Camera + AR Session
frames & spatial data
🧠
Detect
MediaPipe on-device
object + hand tracking
💬
Query
AOI Integration Manager
→ LLM REST API
Anchor
AR Foundation places
world-space canvas
👁
Display
Billboard label locked
to physical object
📷

Stage 1 — Detection

AR Camera · XRCameraSubsystem · MediaPipe

AR Foundation's XRCameraSubsystem streams frames from the device camera while XRSessionSubsystem manages AR lifecycle. MediaPipe Unity Plugin runs a TensorFlow Lite object detection model on-device — no cloud vision call — and outputs detected object labels and bounding boxes each frame.

  • ≥5 FPS detection rate target
  • 10+ concurrent detections supported
  • On-device ML — privacy preserved, no streaming
  • Hand tracking enabled for gesture input
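
To make the hand-off concrete, here is a minimal sketch of how a detection result might be passed from the on-device detector into the rest of the pipeline. The struct, event, and confidence threshold shown are illustrative assumptions, not the MediaPipe Unity Plugin's actual API.

```csharp
using System;
using UnityEngine;

// Hypothetical shape of one on-device detection result.
public struct DetectionResult
{
    public string Label;        // e.g. "cup", "laptop"
    public float Confidence;    // 0..1 from the TFLite model
    public Rect ScreenBounds;   // bounding box in screen space
}

// Bridges detector output to downstream systems (LLM query, anchoring).
public class DetectionFeed : MonoBehaviour
{
    public event Action<DetectionResult> OnObjectDetected;

    [SerializeField] private float minConfidence = 0.5f;

    // Called by the detection layer for each result it produces.
    public void Report(DetectionResult result)
    {
        if (result.Confidence < minConfidence) return;
        OnObjectDetected?.Invoke(result);
    }
}
```
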
💬

Stage 2 — LLM Query

AOI Integration Manager · UnityWebRequest · REST LLM

The AOI Integration Manager receives the detection label, builds a structured natural-language prompt, and fires it to the LLM backend via UnityWebRequest. The backend — evaluated across Google PaLI, LLaMA, and OpenAI-compatible APIs — returns a contextual response. The manager routes that text to the anchor system. Target response time is under 100ms.

  • <100ms response time target
  • Swappable backend — endpoint and auth header only
  • API key secured via environment variables
  • Mock fallback for offline development and testing
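
As a rough sketch of this step, assuming a generic JSON endpoint — the payload shape, field names, and environment variable are illustrative, not the project's exact backend contract:

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class LLMQueryClient : MonoBehaviour
{
    [SerializeField] private string endpointUrl = "https://example.invalid/v1/generate"; // placeholder

    // Key is read from the environment, never hardcoded in source.
    private string ApiKey => System.Environment.GetEnvironmentVariable("AOI_LLM_API_KEY");

    // Builds a prompt from the detection label and POSTs it to the backend.
    public IEnumerator Query(string label, System.Action<string> onResponse)
    {
        string prompt = $"In one short sentence, explain what a \"{label}\" is and how it is used.";
        string body = JsonUtility.ToJson(new Payload { prompt = prompt });

        using (var request = new UnityWebRequest(endpointUrl, "POST"))
        {
            request.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(body));
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("Authorization", "Bearer " + ApiKey);

            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                onResponse(request.downloadHandler.text);    // routed on to the anchor system
            else
                onResponse($"[offline] {label}");             // mock fallback path for offline dev
        }
    }

    [System.Serializable] private class Payload { public string prompt; }
}
```

In use, the integration manager would start this as a coroutine and hand the callback's text to the anchoring stage.
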

Stage 3 — Spatial Anchoring

AR Foundation · World-Space Canvas · UIAnchorManager

The UIAnchorManager creates a world-space Unity Canvas at the detected object's estimated 3D position. A billboard effect ensures it always rotates to face the camera. Object pooling recycles anchor components instead of instantiating and destroying them every frame, eliminating the GC pressure that would cause visible stutters during sustained detection sessions.

  • Billboard effect — readable from any angle
  • Confidence-coded color with colorblind-safe palette
  • Object pool — zero GC allocations at runtime
  • Fade-in/out + pulse animation coroutines
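
The billboard behaviour is the easiest piece to show in code. A minimal sketch of the idea follows; the component name and the choice to keep labels upright are assumptions, not the project's exact implementation.

```csharp
using UnityEngine;

// Keeps a world-space anchor's canvas facing the AR camera so its text
// stays readable from any viewing angle.
public class BillboardAnchor : MonoBehaviour
{
    private Camera arCamera;

    private void Awake()
    {
        // Guarded lookup — Camera.main can be null before the AR session starts.
        arCamera = Camera.main;
    }

    private void LateUpdate()
    {
        if (arCamera == null)
        {
            arCamera = Camera.main;
            if (arCamera == null) return;
        }

        // Point the canvas away from the camera, keeping the label upright.
        Vector3 awayFromCamera = transform.position - arCamera.transform.position;
        awayFromCamera.y = 0f;
        if (awayFromCamera.sqrMagnitude > 0.0001f)
            transform.rotation = Quaternion.LookRotation(awayFromCamera, Vector3.up);
    }
}
```

Pooled anchors would simply have a component like this enabled and disabled rather than created and destroyed.
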
🎮

Sprint 4 — Scavenger Hunt

LLM item gen · Tap-to-detect · AR Marker Spawner

The final sprint pivoted the detection pipeline into an AR scavenger hunt for the live class demo. The LLM generates a 5–10 item list appropriate to the physical environment (classroom, outdoor, etc.), a new TapToDetectionFeeder feeds taps into the detection pipeline, a DetectorBridge formats results for matching, and AR markers spawn on detected surfaces via raycasting. A full session lifecycle runs from main menu through gameplay to end screen.

  • LLM generates context-appropriate item lists
  • Tap → detect → match → highlight → register
  • AR markers placed on detected planes via raycast
  • Session timer · ascending score · clean end state
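
A sketch of the tap-to-place step using AR Foundation's raycast manager. The marker prefab field is illustrative, and in the real project the tap would also be routed through TapToDetectionFeeder; only the raycast call itself is standard AR Foundation API.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;

public class TapMarkerSpawner : MonoBehaviour
{
    [SerializeField] private ARRaycastManager raycastManager;
    [SerializeField] private GameObject markerPrefab;   // illustrative marker prefab

    private static readonly List<ARRaycastHit> hits = new List<ARRaycastHit>();

    private void Update()
    {
        if (Input.touchCount == 0) return;

        Touch touch = Input.GetTouch(0);
        if (touch.phase != TouchPhase.Began) return;

        // Raycast from the tap position against detected planes.
        if (raycastManager.Raycast(touch.position, hits, TrackableType.PlaneWithinPolygon))
        {
            Pose hitPose = hits[0].pose;
            Instantiate(markerPrefab, hitPose.position, hitPose.rotation);
            // The project would also feed this tap into the detection pipeline here.
        }
    }
}
```
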

// 03 — Architecture

Why we built it
this way.

Each technology decision was a deliberate trade-off. Understanding these choices is part of understanding the project — not just what was built, but why.

Unity over native
ARKit / ARCore / OpenXR
Writing native Swift for iOS, Kotlin for Android, and a separate Quest build would have tripled the codebase for the same feature set. Unity's AR Foundation abstraction lets a single C# codebase drive ARKit on iOS, ARCore on Android, and OpenXR on Meta Quest — switching providers by changing a build target, not rewriting subsystem logic. The trade-off is that Unity's URP pipeline requires explicit configuration for AR camera transparency, which produced real bugs (yellow background, black screen) that had to be diagnosed and fixed during the project.
On-device detection
MediaPipe over cloud vision
Streaming camera frames to a cloud vision API introduces a network round-trip for every detection — unacceptable for AR where you need feedback within a frame or two. MediaPipe runs a TensorFlow Lite model directly on the device CPU/GPU, keeping detection fast enough to hit the ≥5 FPS target with no network dependency and no privacy risk from transmitting camera data. The limitation is that on-device object detection models are less semantically rich than a full cloud model, which is why the LLM handles interpretation rather than the detector.
Split inference
Local detect + Cloud LLM
Modern multimodal LLMs are billions of parameters — far too large for mobile inference at reasonable speed. The solution is split inference: MediaPipe does the fast, cheap, on-device spatial grounding (what is this object and where is it in 3D space?), and a single lightweight REST call gives that label to a cloud LLM for semantic enrichment (what should I tell the user about this object?). This mirrors the architecture of the XR-Objects paper itself. The integration layer is intentionally backend-agnostic — swapping from PaLI to LLaMA to any OpenAI-compatible API requires only changing the endpoint URL and authentication header.
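
As a rough illustration of what "endpoint and header only" means in practice, a backend definition might be as small as the following — the field names and asset-based approach are assumptions, not the project's actual configuration class:

```csharp
using UnityEngine;

// Everything backend-specific lives in one small asset; the rest of the
// pipeline only ever sees "send prompt, receive text".
[CreateAssetMenu(menuName = "AOI/LLM Backend Config")]
public class LLMBackendConfig : ScriptableObject
{
    public string endpointUrl;                       // PaLI, LLaMA server, or OpenAI-compatible
    public string authHeaderName = "Authorization";  // some backends expect "x-api-key"
    public string authHeaderPrefix = "Bearer ";

    // The key itself comes from the environment, never from source control.
    public string ApiKey =>
        System.Environment.GetEnvironmentVariable("AOI_LLM_API_KEY");
}
```

Swapping backends then means creating a second asset and pointing the manager at it, with no change to the detection or anchoring code.
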
World-space anchors
over HUD / screen-space UI
A 2D HUD overlay loses its connection to the physical object the moment the user moves the camera — the label floats on screen while the object slides out of frame. World-space anchors are positioned in the 3D coordinate space that AR Foundation maintains relative to the real environment, so they stay associated with the physical object as the user walks around it. The billboard effect solves the readability problem: no matter what angle you view the anchor from, its text plane rotates to face you.
Feature branches + CI gates
over trunk-based development
Unity projects fail in ways that are silent — a missing .meta file, a scene load order issue, or a misplaced serialized field can leave the project compiling but broken at runtime. Feature branch development with mandatory PR reviews caught several of these before they reached main. The CI pipeline adds automated checks specifically designed for Unity failure modes: meta file validation, null safety scanning, script compilation verification, and a smoke test that fails the build on any logged exception during scene initialization.

// 04 — My Role

Integration engineer.
The glue of the project.

My contribution was less about owning a single feature and more about owning reliability. When the Unity project entered Safe Mode in Sprint 4, I diagnosed and fixed the compilation errors that blocked the entire team. When scene transitions broke after a merge, I rewired the Build Settings and inspector references. When the AR camera rendered a yellow background or black screen on an iOS build, I traced the URP clear flag configuration and fixed it. Alongside that integration work, I ran QA across every sprint and contributed to the CI/CD pipeline gates that prevented these issues from recurring.

Primary
Unity Integration & Scene Engineer

AR rendering, scene transitions, inspector wiring, merge conflict resolution, Safe Mode compilation fixes. The work that kept the project in a runnable state.

Secondary
QA & Manual Testing

Test plans, manual test execution, and documented expected vs. actual results across AR rendering, iOS camera permissions, and scene flow — every sprint.

Throughout
Code Reviewer & Team Lead

Reviewed every teammate PR with technical feedback on null safety, coroutine cleanup, and architecture. Managed communication, task division, and sprint coordination.

Contribution · What I did · Sprint
AR Foundation Setup · Set up Unity's AR Foundation subsystem and provider layer in Sprint 1 — wiring XRSessionSubsystem, XRCameraSubsystem, XRPlaneSubsystem, and XRImageTrackingSubsystem to their ARKit (iOS) and ARCore/OpenXR (Quest) providers. Configured Player Settings for AR compatibility across iOS and Android. This established the AR session lifecycle and camera intrinsics pipeline that every later feature relied on. · S1
iOS Build Work · Committed iterative iOS builds throughout the project to validate AR rendering on physical hardware. Identified the Unity → Xcode connection error in Sprint 1 (documented as a known bug, fixed in Sprint 2). In Sprints 2–3 debugged the full iOS build pipeline — camera clear flags, URP transparency, background color, and Xcode export settings — and documented iOS-specific AR build differences for camera permissions and device compatibility. · S1 – S3
AR Rendering Fixes · Debugged URP transparency and background rendering for iOS builds — fixed camera clear flags, background color (yellow bug), and scene lighting resets after merges. Resolved black screen after URP build. Maintained consistent AR camera rendering across merges and Player Settings changes across both sprints. · S2 · S3
Scene Management · Organized Asset and Scene hierarchy for cross-platform builds. Merged MainScene.unity after binary conflict. Verified scene references, prefab integrity, and material assignments after every teammate merge across three sprints to prevent silent breakage. · S1 – S3
AOISetupHelper Expansion · Expanded AOISetupHelper in Sprint 3 so the full XR stack installs automatically — auto-creating ARSession, ARSessionOrigin, AR Camera, world-space Canvas, and required Prefabs on scene load. This eliminated the manual per-machine setup friction that caused inconsistent environments across the team, and validated camera permissions and anchor pool creation as part of the auto-init flow. · S3
Anchor Debug Panel · Built UIAnchorDebugPanel — in-editor context menu buttons for running deterministic anchor smoke tests, counting active anchors, and clearing the pool. Validates the full anchor pipeline without needing a physical AR device connected. · S3
Anchor Lifecycle · Refactored UIAnchor to properly handle fade-in/out, pulse animation, and click callbacks. Fixed coroutine timing conflicts that caused anchors to persist after their lifetime expired. Implemented UpdateAnchor() for dynamic detection refreshes. · S3
CI/CD Extensions · Added pre-commit C# formatter (whitespace + naming rules), script compilation validation, missing .meta file detection, and pipeline logging. Wired the CI smoke test to fail the build if any exception is logged during scene initialization — turning a manual catch into an automated gate. · S3
UI & Scene Flow · Implemented Entry/Main Menu screen, gameplay navigation, and Exit/Quit for the Sprint 4 scavenger hunt. Repaired scene management connections across menu → gameplay → end screen, including correcting Build Settings ordering and rewiring broken inspector references after merge. · S4
Compilation Unblocking · Resolved the Unity Safe Mode startup issue that prevented the project from running — identified invalid class structures, misplaced [SerializeField] attributes, and missing type references introduced by a merge, then corrected them to restore the project for the team before the live demo. · S4
Code Reviews · Reviewed PRs each sprint: scene merges, MediaPipe simulation updates, AOISetupHelper, UI anchor adjustments. Feedback covered null reference handling around Camera.main, animation coroutine cleanup, confidence threshold hardcoding, and Unity naming conventions. · S1 – S4
Project Coordination · Managed team communication across all four sprints. Logged merge issues and bug reports clearly so teammates could act without re-investigating the same problem. Divided tasks for Sprint 4, ran recap meetings, and contributed to all sprint deliverable documentation. · S1 – S4

// 05 — Quality Assurance

Testing a system
you can't easily mock.

AR systems are notoriously hard to test — the environment is physical, the camera feed is live, and subsystems like ARKit and ARCore don't have simple unit-testable interfaces. The test strategy had to work around this by splitting coverage across four approaches: automated smoke tests in CI, manual functional tests on device, in-editor simulation via mock data, and formal test case documentation with build-linked results for traceability.

8 Testing Methodologies

Exploratory

Open-ended sessions on physical devices to find edge cases the spec didn't anticipate — especially around AR rendering, camera permissions, and lighting changes in real environments.

Smoke Testing

Scene load and AR subsystem initialization verified on every CI run. The UIAnchorDebugPanel enables deterministic in-editor smoke tests without a connected AR device.

Scenario-Based

Full end-to-end flows: object detected → LLM queried → anchor placed → user taps anchor → result displayed. Tests the integration seams between components, not just individual units.

Performance

Frame rate benchmarks targeting ≥5 FPS detection, <100ms LLM response, and memory usage under 2GB during sustained AR sessions. Anchor pooling verified to eliminate GC allocation spikes.

Security

API key exposure scanning (no hardcoded keys in source), input validation for LLM prompts (injection prevention), and camera data privacy (no unnecessary storage or transmission of frames).

Accessibility

Colorblind-safe confidence color palette verified across three standard color blindness types. Anchor text contrast ratios checked. Foundation for future text-to-speech and haptic output.

Integration

Verifying the full detection → manager → LLM → anchor chain works together across real build targets, not just in editor. Cross-platform parity between iOS ARKit and Android ARCore builds.

Regression

Post-merge re-runs of all test cases after scene conflicts, PR merges, or Player Settings changes — preventing the category of bug where a fix in one area silently breaks another.

31 Test Cases Across 4 Sprints

Each test case includes component scope, configuration, exact steps, expected result, and a linked build commit for traceability. Sprints 1–3 built the core anchor and detection coverage; Sprint 4 added tap-to-detect and scavenger hunt session flow.

TC-001 Functional
Anchor Creation Smoke Test
TC-002 Functional
Anchor Pooling Load & Lifetime
TC-003 Functional
Capacity & Rate Limiting
TC-004 Integration
LLM Connectivity Test
TC-005 Integration
Detection Loop Stability
TC-006 Performance
Frame Rate Benchmark
TC-007 Performance
Memory Usage Under Load
TC-008 Security
API Key Exposure Scan
TC-009 Security
Input Validation / Injection
TC-010 Accessibility
Colorblind Palette Verification
TC-011 Functional
Billboard Effect
TC-012 Functional
Anchor Lifetime Expiration
TC-013 Functional
Max Anchor Capacity
TC-014 Integration
End-to-End Detection Flow
TC-015 Performance
Network Failure Graceful Degradation
TC-016 Security
Invalid API Key Handling
TC-017 Functional
Anchor Click Callback
TC-018 Integration
AOISetupHelper Auto-Init
TC-019 Functional
Tap-Based Detection Trigger
TC-020 Functional
Tap Detection Accuracy
TC-021 Functional
Object Matching Logic
TC-022 Functional
Tap-to-Register
TC-023 Functional
Proximity Detection Range
TC-024 Integration
Game State Management
TC-025 Functional
Visual Highlighting System
TC-026 Integration
Format Conversion (MediaPipe Bridge)
TC-027 Functional
Item Registration Logic
TC-028 Functional
Duplicate Registration Guard
TC-029 Functional
UI Auto-Creation (UIHelper)
TC-030 Functional
UI State After Registration
TC-031 Integration
End-to-End Scavenger Hunt Flow

// 06 — CI / CD Pipeline

Automated gates
for a Unity project.

Unity projects break in ways that aren't obvious from a diff. A missing .meta file, a scene serialized with a stale GUID, or a missing component reference compiles cleanly but crashes at runtime. The CI/CD pipeline was designed around these Unity-specific failure modes — with quality gates that catch issues before they reach main and cause the team to lose a working build. Every push and pull request runs the full pipeline automatically.

🔀
Trigger
Push or PR to
main / develop
🔍
Quality Gate
Lint · null scan
meta validation
my additions
🧪
Test
Unity Edit Mode
Play Mode tests
💨
Smoke Test
Scene init · subsystems
fail on exception
my additions
🏗️
Build
iOS · Android
GitHub artifact
🚀
Release
Artifact upload
version tag
// checks I added — Sprint 3
  • Pre-commit C# formatter — enforces whitespace and naming rules before a commit lands, so style issues never enter PR review
  • Null safety scan — counts null checks in every C# script; flags components that dereference without guarding (the root cause of BUG-001)
  • Error handling verification — scans for try-catch coverage in async and network-facing code paths
  • Missing .meta file detection — Unity silently breaks when .meta files are absent from commits; this gate catches them before merge
  • Script compilation validation — runs Unity's headless compiler to surface compile errors that only appear on specific platform targets
  • Smoke test gate — boots the Unity scene in Edit Mode via CLI; if any exception is logged during initialization, the build fails immediately
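
The smoke test gate can be sketched as an Edit Mode test — the scene path and the decision to also fail on plain errors are assumptions, but the mechanism (capture anything logged while the scene loads, fail the test, fail the CI run) is the one described above:

```csharp
using System.Collections.Generic;
using NUnit.Framework;
using UnityEditor.SceneManagement;
using UnityEngine;

public class SceneSmokeTest
{
    [Test]
    public void MainScene_Initializes_Without_Logged_Exceptions()
    {
        var failures = new List<string>();
        Application.LogCallback capture = (condition, stackTrace, type) =>
        {
            if (type == LogType.Exception || type == LogType.Error)
                failures.Add(condition);
        };

        Application.logMessageReceived += capture;
        try
        {
            // Hypothetical scene path — the CI job opens the project's main scene.
            EditorSceneManager.OpenScene("Assets/Scenes/MainScene.unity");
        }
        finally
        {
            Application.logMessageReceived -= capture;
        }

        // Any captured exception or error fails the test, which fails the run.
        Assert.IsEmpty(failures, string.Join("\n", failures));
    }
}
```
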
// full pipeline checks (Riya + team)
  • Unity Edit Mode and Play Mode test execution on every push
  • Multi-platform build matrix — iOS (ARKit), Android (ARCore)
  • Component initialization checks — Start/Awake present in MonoBehaviours
  • Input validation patterns — IsNullOrEmpty / IsNullOrWhiteSpace coverage
  • API key security scan — regex-based detection, excludes valid placeholder strings
  • Code quality checks — TODO/FIXME detection, script file counting
  • Artifact upload — build outputs stored per run for download and device deployment
  • Cross-platform compatibility — Ubuntu, macOS, and Windows runners tested

Security & Threat Model

The project included a formal threat model identifying four attack surfaces and their mitigations — integrated into the CI pipeline rather than left as documentation only.

Asset / Surface · Risk · Mitigation
AR Camera Data · Data leakage — raw camera frames captured or intercepted · TLS encryption on all API calls; frames never stored or transmitted beyond the LLM prompt; no cloud vision API used for detection
LLM API Key · Key exposure in source code or logs · Environment variables only — never hardcoded; CI security scan with regex detection on every push; excluded placeholder strings from false-positive triggering
LLM Prompt Input · Prompt injection via crafted object labels or user input · Input validation on all user-generated strings before they enter the prompt template; IsNullOrEmpty guards on detection labels; CI scan for validation patterns
Cloud Storage / Backend · Unauthorized access to backend APIs or stored data · Role-based access control; Firebase authentication for backend endpoints; TLS/SSL on all client-server communication

// 07 — Team

Started as 3.
Grew to 5 for the demo.

The project ran through CPSC 490/491 with Alyssa Barrientos and Riya Jain as the two primary contributors after a third original team member withdrew following Sprint 1. Three additional contributors joined for Sprint 4 to build the scavenger hunt prototype for the live class demo.

Alyssa Barrientos
Sprints 1–4  ·  me
  • AR rendering & URP fixes
  • Scene management & merges
  • Anchor debug tooling
  • UI / scene flow (Sprint 4)
  • Compilation unblocking
  • Code reviews · QA · coordination
Riya Jain
Sprints 1–4
  • MediaPipe integration
  • AOI Integration Manager & LLM backend
  • GitHub Actions CI/CD pipeline
  • Formal test cases (TC-001–005, TC-019–031)
  • Bug tracking & operations docs
Sprint 4 Team
Sprint 4 only · 3 contributors
  • David — LLM item list gen, timer, screens
  • Mohamed — session lifecycle & interaction flow
  • Marco — offline generator, AR marker spawner, UI manager

// 08 — Future Work

Where this goes
from here.

The prototype proves the concept — detection, LLM query, and spatial anchor rendering all running in a real AR session. These are the natural next engineering steps to close the gap between research prototype and production system.

// 01

On-Device LLM

Replace the REST call with a quantized model (Phi-3, Gemma 2B, or LLaMA 3 8B) running via llama.cpp or ExecuTorch on the device. Eliminates network latency, API key exposure, and the offline failure mode entirely. Feasible on iPhone 15 Pro and modern Android flagships.

// 02

Vision-Language Input

Upgrade from label-based prompting (MediaPipe outputs "chair" → send "chair" to LLM) to sending a cropped image patch to a vision-language model (LLaVA, PaLI-Gemma). Richer semantic context, better responses for ambiguous or unusual objects the detector labels poorly.

// 03

Persistent Spatial Memory

Use ARKit Scene Reconstruction or ARCore's Geospatial API to persist anchor positions between sessions — so the same physical object in the same room shows its cached LLM response on re-entry without re-querying the backend.

// 04

Multi-User Shared AR

Synchronize anchor state across devices via ARCore Cloud Anchors or Niantic Lightship. Multiple users in the same physical space could collaboratively tag and annotate objects — turning the tool into a shared, persistent knowledge layer for classrooms or field teams.

// 05

Full Accessibility Layer

Text-to-speech output of anchor labels for users with visual impairments, haptic feedback on detection events, high-contrast and large-text anchor modes. The colorblind-safe palette from Sprint 3 is the first layer of this work.

// 06

Known Issues to Fix

Unsubscribe from AR events in OnDisable as well as OnDestroy in AOIIntegrationManager to close the memory leak. Replace the Arial.ttf runtime font reference with a bundled asset. Add structured fallback caching when the LLM endpoint is unreachable.
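
The first of these is a small pattern worth showing. A sketch of the intended fix follows — the specific event used here (planesChanged) is illustrative; the real manager subscribes to its own detection and session events:

```csharp
using UnityEngine;
using UnityEngine.XR.ARFoundation;

public class AOIIntegrationManager : MonoBehaviour
{
    [SerializeField] private ARPlaneManager planeManager;

    private void OnEnable()
    {
        if (planeManager != null) planeManager.planesChanged += HandlePlanesChanged;
    }

    // Unsubscribing in OnDisable (not only OnDestroy) releases the handler
    // whenever the component is toggled off, closing the leak.
    private void OnDisable()
    {
        if (planeManager != null) planeManager.planesChanged -= HandlePlanesChanged;
    }

    private void HandlePlanesChanged(ARPlanesChangedEventArgs args)
    {
        // ... react to added / updated / removed planes ...
    }
}
```
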