PUBLIC BENCHMARK EVIDENCE

Deterministic safety evidence, replayable.

FieldSpace now has a public-data evidence trail across openpilot / comma.ai replay, Waymo observer-mode scenarios, official nuPlan closed-loop simulation, and the first shared nuPlan neural-baseline smoke scenario.

comma.ai replay · Waymo observer mode · nuPlan closed loop · UrbanDriver + PlanCNN smoke
Why this is useful now. FieldSpace can be evaluated as a deterministic safety layer without first standing up a fleet-scale training pipeline, labeling workflow, retraining loop, or GPU-heavy model program.
WHITE PAPER · PUBLIC BENCHMARK PACKAGE

Four public evidence paths, one deterministic layer

The latest package reframes FieldSpace as an independent safety observer around neural ADAS stacks. The point is practical: replay public or partner-selected scenes, compare behavior against baselines, and inspect the audit trail behind every safety judgment.

openpilot / comma.ai
60 019 frames

White-paper replay slice with sub-millisecond CPU latency.

Waymo observer mode
50 scenarios

4 550 frames and 15 exported trigger windows.

nuPlan classical
64 scenarios

0 runner failures against SimplePlanner and IDMPlanner.

nuPlan neural · 64 scenarios
FieldSpace ≥ neural

Full 64-scenario closed-loop run against UrbanDriver and PlanCNN. FieldSpace scores 0.977 on no-at-fault collision at 8.4 ms per planning step on CPU.

NUPLAN CLOSED-LOOP · 64 SCENARIOS · CPU ONLY

FieldSpace vs UrbanDriver vs PlanCNN

Full 64-scenario official nuPlan closed-loop suite. Identical scenario set across all five planners (_comparison.same_scenarios = true). Engine frozen, no nuPlan-specific tuning. UrbanDriver and PlanCNN run from the public tuPlan-Garage checkpoints (SHA256-allowlisted). Numbers below are metric_score means across the 64 scenarios, 1.0 = best.

Metric (higher = safer / better) | FieldSpace | UrbanDriver | PlanCNN
No at-fault collision | 0.977 | 0.375 | 0.938
TTC within bound | 0.922 | 0.297 | 0.922
Speed-limit compliance | 1.000 | 0.695 | 1.000
Comfort (lat-accel + jerk thresholds) | 0.938 | 1.000 | 0.844
Progress along route | 0.966 | 0.980 | 0.981
CPU compute / planning step (lower = better) | 8.4 ms | 158.6 ms | 150.9 ms
64/64 scenarios succeeded · worker=sequential · SHA256 manifest at reproducibility/nuplan_official_neural_64.sha256
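A SHA256 manifest like the one referenced above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the project's own tooling: it assumes the manifest uses the common `sha256sum` layout (`<hex digest>  <relative path>` per line), which the page does not confirm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Parse '<hex digest>  <relative path>' lines (assumed layout)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        entries[name.lstrip("*")] = digest.lower()
    return entries


def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    expected = parse_manifest(manifest.read_text())
    mismatches = []
    for name, digest in expected.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != digest:
            mismatches.append(name)
    return mismatches
```

An empty return value means every listed artifact matches its recorded digest; any entry in the list is a file to re-download or investigate before trusting the numbers.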
HEADLINE

FieldSpace matches or beats both neural planners on the safety metrics a homologation team cares about (collisions, TTC, speed-limit compliance), at roughly 18× less compute on CPU.

HONEST CAVEAT

UrbanDriver's public checkpoint is trained open-loop and is a known-weak closed-loop baseline, so read the result as FieldSpace vs PlanCNN: FieldSpace leads on at-fault collisions and comfort, ties on TTC and speed-limit compliance, and trails slightly on route progress (0.966 vs 0.981), at roughly 18× less CPU compute. Full note: reproducibility/nuplan_official_neural_64_results.md.
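The headline and caveat can be checked directly against the published table. The sketch below copies the table values into plain dictionaries (the variable and function names are illustrative, not part of any FieldSpace API) and derives the per-metric verdicts and the compute ratio.

```python
# Values copied from the 64-scenario nuPlan table; 1.0 = best for scores.
scores = {
    "no_at_fault_collision": {"FieldSpace": 0.977, "UrbanDriver": 0.375, "PlanCNN": 0.938},
    "ttc_within_bound": {"FieldSpace": 0.922, "UrbanDriver": 0.297, "PlanCNN": 0.922},
    "speed_limit_compliance": {"FieldSpace": 1.000, "UrbanDriver": 0.695, "PlanCNN": 1.000},
    "comfort": {"FieldSpace": 0.938, "UrbanDriver": 1.000, "PlanCNN": 0.844},
    "progress": {"FieldSpace": 0.966, "UrbanDriver": 0.980, "PlanCNN": 0.981},
}
# CPU milliseconds per planning step (lower is better).
step_ms = {"FieldSpace": 8.4, "UrbanDriver": 158.6, "PlanCNN": 150.9}


def compare(ours: str, other: str) -> dict[str, str]:
    """Label each metric as 'ahead', 'tied', or 'behind' for `ours` vs `other`."""
    verdicts = {}
    for metric, row in scores.items():
        delta = round(row[ours] - row[other], 3)
        verdicts[metric] = "ahead" if delta > 0 else "behind" if delta < 0 else "tied"
    return verdicts


def compute_ratio(ours: str, other: str) -> float:
    """How many times more CPU time per step `other` needs than `ours`."""
    return step_ms[other] / step_ms[ours]


if __name__ == "__main__":
    print(compare("FieldSpace", "PlanCNN"))
    print(round(compute_ratio("FieldSpace", "PlanCNN"), 1))
    print(round(compute_ratio("FieldSpace", "UrbanDriver"), 1))
```

Against PlanCNN the ratio is 150.9 / 8.4, which is where the "roughly 18×" figure comes from; against UrbanDriver it is slightly higher.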

Want these numbers on your own data?

The eval-kit runs the same scoring pipeline against your scenarios, air-gapped, in your environment. Click-through EULA, no NDA at the door.

RUNTIME · NFS-TRAFFIC + NFS-DRIVE

Control loop, per tick

End-to-end drive loop: HD map localize → predict → PDE field solve → control. Measured over 10 000 consecutive ticks on each platform.

Platform | Mean | p95 | p99.9 | Budget
Raspberry Pi 5 (8 GB) | 3.1 ms | 4.8 ms | 7.2 ms | 50 ms
Jetson Orin Nano (CPU mode) | 1.9 ms | 2.6 ms | 4.1 ms | 50 ms
x86 laptop (i7-12700H) | 0.8 ms | 1.2 ms | 2.3 ms | 50 ms
Grid 256 × 64 · 20 Hz control loop · no GPU · numbers from cargo bench -p nfs-drive
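The 50 ms budget follows from the 20 Hz control rate (1000 ms / 20). As a rough illustration of how per-tick timings reduce to the columns above, the sketch below summarizes a list of tick durations using the nearest-rank percentile. It is an assumption-laden stand-in written in Python; the published numbers come from `cargo bench -p nfs-drive`, whose exact percentile method is not stated, and the function names here are hypothetical.

```python
import math


def nearest_rank(sorted_samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    count = len(sorted_samples)
    # round() guards against float noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * count / 100.0, 9)))
    return sorted_samples[rank - 1]


def summarize(tick_ms: list[float], rate_hz: float = 20.0) -> dict[str, float]:
    """Reduce per-tick durations (ms) to mean, p95, p99.9 and a budget check."""
    ordered = sorted(tick_ms)
    budget_ms = 1000.0 / rate_hz  # 20 Hz control loop -> 50 ms per tick
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": nearest_rank(ordered, 95.0),
        "p99.9": nearest_rank(ordered, 99.9),
        "budget_ms": budget_ms,
        "within_budget": ordered[-1] <= budget_ms,
    }
```

With 10 000 ticks, p99.9 is the 9 990th smallest sample, so only the ten slowest ticks sit above it; the `within_budget` flag checks the single worst tick rather than a percentile.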
DETERMINISM

Bit-identical over 10 000 runs

Compare the byte-for-byte output of a scenario replay against a stored golden trace. Any diff is a regression.

10 000 consecutive replays · 0 hazard-output diffs · 0 PDE-state diffs · 0 MRM-phase diffs
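A golden-trace check of the kind described above can be sketched as follows. This assumes a replay can be invoked as a function returning its serialized output as bytes; `replay_fn` and both helper names are hypothetical and do not describe the actual harness.

```python
import hashlib
from typing import Callable, Optional


def first_diff_offset(actual: bytes, golden: bytes) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical."""
    for index, (a, g) in enumerate(zip(actual, golden)):
        if a != g:
            return index
    if len(actual) != len(golden):
        # One trace is a strict prefix of the other.
        return min(len(actual), len(golden))
    return None


def count_diffs(replay_fn: Callable[[], bytes], golden: bytes, runs: int) -> int:
    """Run the replay `runs` times and count outputs that are not byte-identical."""
    golden_digest = hashlib.sha256(golden).digest()
    diffs = 0
    for _ in range(runs):
        if hashlib.sha256(replay_fn()).digest() != golden_digest:
            diffs += 1
    return diffs
```

Comparing digests keeps the loop cheap for large traces; `first_diff_offset` is only needed once a mismatch is found and someone has to locate the regression.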
PROTOCOL REGRESSION SUITES · NHTSA 37 + EURO NCAP

In-ODD protocol checks passed

Internal implementations of NHTSA 37 pre-crash families and Euro NCAP AEB/FCW-style protocols, implemented scenario-by-scenario in harness/scenarios/. Pass criterion: hazard trigger TTC meets or beats the configured protocol threshold.

Euro NCAP AEB/FCW

79 / 79 pass
  • Car-to-Car Rear (stationary / moving / braking)
  • Car-to-Car Front (head-on / turn-across / cut-in)
  • Pedestrian (near-side / far-side / longitudinal)
  • Cyclist (near-side / far-side / longitudinal)
15 scenario codes · 0 skipped · reproducibility/ncap_battery.json

NHTSA 37 pre-crash

121 / 121 pass
  • Rear-end (decelerating / stopped / lower constant)
  • VRU (pedestrian / cyclist / animal crashes)
  • Cut-in / opposite-direction encroachment
  • 21 scenario families skipped as out-of-ODD (HD map, V2X, reverse, ECU failure)
16 / 37 scenario families in ODD · reproducibility/nhtsa_battery.json
What "skip" means. FieldSpace is a longitudinal-TTC safety observer. Scenarios that require HD-map traffic-light perception, V2X, reverse gearing, or ECU-failure simulation are outside our ODD and marked skipped, not failed. We list them explicitly so the ODD boundary is visible.
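The pass criterion and the skip rule above amount to a three-way verdict per scenario. The sketch below is one way to express it, assuming "meets or beats" means the hazard trigger fires at a TTC greater than or equal to the protocol threshold (that is, no later than required). The `ScenarioResult` type and its field names are hypothetical, not the schema of `reproducibility/nhtsa_battery.json`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioResult:
    name: str
    in_odd: bool                     # False for HD-map, V2X, reverse, ECU-failure cases
    trigger_ttc_s: Optional[float]   # TTC when the hazard trigger fired; None if it never fired
    threshold_ttc_s: float           # configured protocol threshold


def verdict(result: ScenarioResult) -> str:
    """Return 'skip' for out-of-ODD scenarios, otherwise 'pass' or 'fail'."""
    if not result.in_odd:
        return "skip"
    fired_in_time = (
        result.trigger_ttc_s is not None
        and result.trigger_ttc_s >= result.threshold_ttc_s
    )
    return "pass" if fired_in_time else "fail"


def tally(results: list[ScenarioResult]) -> Counter:
    """Count verdicts so skips stay visible next to passes and failures."""
    return Counter(verdict(r) for r in results)
```

Keeping "skip" as its own verdict, rather than dropping those scenarios, is what makes the ODD boundary show up in the totals.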
REAL-WORLD · COMMA OPENPILOT REPLAY

182 505 frames of public drive data

We run the FieldSpace Safety Observer frame-for-frame against comma.ai's openpilot CI route bucket — real cars, real roads, real radar and vision, with openpilot's own carControl log as the counterfactual. Lead times are measured, not asserted.

Metric | Tier-3 hazardous bucket
Routes replayed | 2 routes / 31 segments
Frames processed | 182 505
Wall-clock replay time | 184 s (≈ 993× faster than real-time)
FieldSpace hazard events | 1 warning · 0 false critical
FP reduction vs. prior observer | −85% (7 events → 1)
Routes: a74b011b32b51b56|2020-07-26 (Honda Civic, 155 781 frames) · 3cfdec54aa035f3f|2022-07-19 (Toyota, 26 724 frames) · report in reproducibility/tier3_replay_hazardous_provided_vy.json
How to read this. Recall vs. openpilot is not the headline — openpilot's "hazardous" classification is driven by comfort-braking thresholds that confound driver overrides with genuine collision avoidance. The defensible metric is event count under real inputs: FieldSpace produced one 34-frame warning event on the Toyota route (no criticals) and zero events on the Honda route, across 182 k frames of real driving. No spurious bursts.
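Counting events rather than frames means collapsing each run of consecutive flagged frames into a single event, so a 34-frame warning counts once. A minimal sketch of that grouping, assuming per-frame boolean hazard flags (the function name and input shape are illustrative, not the observer's actual output format):

```python
def group_events(flags: list[bool]) -> list[tuple[int, int]]:
    """Collapse per-frame flags into (start, end_exclusive) index pairs."""
    events = []
    start = None
    for index, flagged in enumerate(flags):
        if flagged and start is None:
            start = index                  # a new event begins
        elif not flagged and start is not None:
            events.append((start, index))  # the event ended on the previous frame
            start = None
    if start is not None:
        events.append((start, len(flags)))  # event runs to the end of the log
    return events


if __name__ == "__main__":
    # Synthetic example: one 34-frame warning inside an otherwise quiet log.
    flags = [False] * 100 + [True] * 34 + [False] * 200
    for begin, end in group_events(flags):
        print(f"event at frames {begin}..{end - 1}, length {end - begin}")
```

A real pipeline might also merge events separated by only a frame or two of dropout; that choice changes the event count and should be stated alongside it.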
SAFETY SUITE v1 · CARLA 0.9.14

5 synthetic scenarios, 5 earlier detections

Original closed-loop CARLA suite. Kept here because it isolates the PDE-propagation lead-time contribution from confounds that real-world data introduces.

Scenario | Baseline TTC | FieldSpace TTC | Lead-time
Cut-in, 40 km/h | 1.2 s | 2.1 s | +0.9 s
Sudden brake, lead car | 1.6 s | 2.4 s | +0.8 s
Pedestrian step-out | 0.9 s | 1.7 s | +0.8 s
Oncoming lane drift | 1.1 s | 1.9 s | +0.8 s
Occluded cyclist | 0.7 s | 1.3 s | +0.6 s
CARLA 0.9.14 · Town10HD · ego speed 40–50 km/h · scripts in harness/carla/
Why this still matters. Closed-loop simulation against a reactive baseline isolates the safety-observer behavior: map-aware prediction plus PDE propagation surfaces hazards earlier than a frozen-frame detector on the same input. The public log and nuPlan runs above extend that evidence into real and benchmarked driving data.
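The lead-time column is the difference between the TTC at which FieldSpace flags the hazard and the TTC at which the baseline does. The sketch below recomputes it from the table values; the dictionary and function names are illustrative only.

```python
# (baseline TTC, FieldSpace TTC) in seconds, copied from the CARLA table.
detections = {
    "Cut-in, 40 km/h": (1.2, 2.1),
    "Sudden brake, lead car": (1.6, 2.4),
    "Pedestrian step-out": (0.9, 1.7),
    "Oncoming lane drift": (1.1, 1.9),
    "Occluded cyclist": (0.7, 1.3),
}


def lead_times(rows: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Lead time = FieldSpace TTC minus baseline TTC, rounded to 0.1 s."""
    return {name: round(ours - base, 1) for name, (base, ours) in rows.items()}


def mean_lead_time(rows: dict[str, tuple[float, float]]) -> float:
    """Average lead time across all scenarios, in seconds."""
    values = lead_times(rows).values()
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, lead in lead_times(detections).items():
        print(f"{name}: +{lead} s")
    print(f"mean: +{mean_lead_time(detections):.2f} s")
```

A larger TTC at detection means the hazard was flagged while more time remained before the projected collision, which is why a positive difference counts as an earlier detection.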
DATASET ADAPTERS

Bring your own log

Replay adapters and benchmark harnesses are checked in for the public datasets below. They exist so customers can point us at fleet data or a public benchmark and replay the same observer.

comma openpilot · harness/comma/ · Full replay (rlog + carControl)
Argoverse 2 · harness/argoverse/ · Track replay
nuScenes · harness/nuscenes/ · Track replay
Waymo Open · harness/waymo/ · Observer-mode replay
nuPlan · harness/nuplan/ · Closed-loop simulation

Bring your drive logs. We'll ship the evidence.

Send us an MCAP or rlog. We run the Safety Observer and send back a side-by-side replay report your technical team can review.