Autonomous Driving - AI Infrastructure
Training infrastructure is the industrial backbone of the autonomous driving race. The ability to train larger models, on more data, in less time, is now a direct competitive variable — as important as the vehicle hardware itself. This page covers the purpose-built AI compute clusters operated by leading AV developers, the silicon powering them, and how training scale translates into deployed model capability.
Datacenter build-out, power infrastructure, cooling architecture, and hyperscale facility design are covered in depth on DatacentersX. See: Autonomous Vehicle AI Datacenters — DatacentersX.
Why Training Scale Is a Competitive Variable
In the era of rule-based ADAS, software capability was largely a function of engineering hours. In the era of end-to-end neural networks, capability is a function of three compounding inputs: training data volume, model architecture, and compute scale. Of these, compute scale is the most capital-intensive and hardest to replicate quickly.
- Iteration speed: Larger clusters run training jobs faster — a model that takes 4 weeks on a small cluster may take 3 days on a large one; faster iteration means more experiments per quarter
- Model size: Larger models require more compute to train; Tesla FSD v14.3's neural network is reportedly 10× larger than prior versions — this is only feasible with a proportionally larger training cluster
- Simulation scale: Waymo and Tesla run millions of simulated miles per day; simulation requires dedicated GPU/TPU capacity separate from model training
- Auto-labeling throughput: Auto-labeling pipelines (which process raw fleet video into training data) require sustained high-throughput compute; bottlenecks here constrain the entire OTA loop
- Reinforcement learning: RL-based training (introduced in Tesla FSD v14.x, XPeng VLA 2.0) is significantly more compute-intensive than supervised learning — it requires continuous rollouts and reward computation at scale
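The leverage from cluster scale in the first bullet can be made concrete with back-of-envelope arithmetic (an illustrative sketch; the per-GPU throughput and utilization figures are assumptions, not vendor numbers):

```python
def training_days(model_flops: float, n_gpus: int,
                  peak_tflops_per_gpu: float = 990.0,  # ~H100 dense BF16 peak, illustrative
                  mfu: float = 0.4) -> float:
    """Days to complete a training run, assuming near-linear scaling.

    model_flops: total FLOPs the run requires
    mfu: model FLOPs utilization (real clusters typically sustain ~30-50% of peak)
    """
    effective_flops_per_s = n_gpus * peak_tflops_per_gpu * 1e12 * mfu
    return model_flops / effective_flops_per_s / 86_400  # seconds per day

# Hypothetical 1e24-FLOP run: a 10x larger cluster finishes ~10x sooner,
# which compounds into ~10x more experiments per quarter.
small = training_days(1e24, n_gpus=2_000)    # ~14.6 days
large = training_days(1e24, n_gpus=20_000)   # ~1.5 days
```

The linear-scaling assumption is optimistic at the high end (interconnect and data-pipeline overheads grow with cluster size), but the direction of the argument holds: wall-clock time per experiment falls roughly in proportion to usable compute.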
Tesla — Cortex and Dojo
- Cluster name: Cortex (primary training cluster); Dojo (custom training supercomputer)
- Location: Giga Texas, Austin; additional capacity in Palo Alto
- Silicon — Cortex: Nvidia H100 (80 GB HBM3); approximately 50,000 H100-equivalent GPUs as of 2024; continuing expansion
- Silicon — Dojo: Tesla D1 custom training chip (362 TFLOPS BF16); ExaPOD architecture — 25 D1 chips per training tile, 120 tiles per ExaPOD; custom high-bandwidth die-to-die interconnect (9 TB/s per tile edge)
- AI5 (next-gen): Tesla's AI5 inference chip, announced 2025–2026; designed for both vehicle inference and datacenter training; described by Musk as co-designed with the FSD software stack for maximum efficiency
- TERAFAB: Tesla/SpaceX/xAI joint venture announced March 2026 for vertically integrated chip fabrication at Giga Texas; long-term play to reduce dependence on TSMC and Nvidia supply chains
- Workloads: FSD end-to-end neural network training; auto-labeling pipeline; simulation; Optimus robot model training
- Training data input: ~6 million global fleet vehicles; shadow mode extends data collection beyond enrolled FSD users
- Update cadence enabled: FSD major releases every 2–6 weeks; minor point releases more frequently
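The silicon figures above can be sanity-checked with simple arithmetic (a sketch; peak-TFLOPS numbers are nameplate, not sustained, and public descriptions of Dojo's tile composition vary, so chips-per-tile is left as a parameter — a 25-chip tile matches the ~9 PFLOPS BF16 per tile Tesla has cited publicly):

```python
D1_BF16_TFLOPS = 362    # per Tesla D1 chip, per the spec above
H100_BF16_TFLOPS = 990  # approximate dense BF16 peak per GPU, illustrative

def tile_pflops(chips_per_tile: int) -> float:
    """Nameplate BF16 PFLOPS for one Dojo training tile."""
    return chips_per_tile * D1_BF16_TFLOPS / 1000

def cluster_eflops(n_gpus: int, tflops_per_gpu: float) -> float:
    """Nameplate BF16 EFLOPS for a homogeneous GPU cluster."""
    return n_gpus * tflops_per_gpu / 1e6

tile = tile_pflops(25)                               # ~9.05 PFLOPS per tile
cortex = cluster_eflops(50_000, H100_BF16_TFLOPS)    # ~49.5 EFLOPS nameplate
```

Nameplate aggregates overstate usable throughput by a factor of 2–3x in practice, but they are the figures operators quote, so they are what the comparison table below reflects.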
Waymo — TPU-Based Training Infrastructure
- Silicon: Google TPU v4 and v5 pods via Google Cloud; Waymo is an Alphabet company with direct access to Google's TPU infrastructure
- Architecture advantage: TPU pods are purpose-built for the large matrix multiplications that dominate neural network training; Google's TPU v5p delivers ~459 TFLOPS per chip with high-bandwidth interconnect between chips in a pod
- Simulation City: Waymo's primary workload is simulation — reconstructing real-world scenarios and running them as synthetic variants; simulation generates the majority of Waymo's effective training miles
- Real-world fleet: Relatively small physical fleet (~1,500 vehicles across all cities as of 2025); simulation compensates for limited physical scale
- Data strategy: Quality over quantity — each Waymo vehicle is sensor-rich (LiDAR + cameras + radar) and generates high-fidelity 3D training data; Waymo argues this is more valuable per mile than camera-only fleet data
- Key constraint: Dependent on Google Cloud infrastructure; less vertically integrated than Tesla; TPU access shared with other Alphabet workloads
XPeng — Turing Cluster
- Cluster name: Turing Cluster (named after in-house Turing AI chip)
- Silicon: Nvidia H800 GPUs (China export-compliant variant of H100); approximately 3,000 H800 units as of 2024; continuing expansion
- In-house chip: XPeng Turing AI chip — in-house designed; used for vehicle inference (XNGP system); roadmap includes Turing chip integration into training cluster to reduce Nvidia dependence
- Iteration speed: XNGP model updates every 1–2 days; enabled by efficient training pipeline and dedicated cluster capacity
- Training data: 1 billion+ km of video data; 429,000 vehicles delivered in 2025 (125% YoY growth) expanding the data flywheel rapidly
- VW leverage: VLA 2.0 open-sourced to Volkswagen; Turing chip supplied to VW models from 2026; this extends XPeng's training data input beyond its own fleet
- Export control exposure: H800 GPUs are a China-specific Nvidia product designed to comply with US export controls; further tightening of export restrictions could constrain XPeng's cluster expansion
Huawei — Ascend Cluster
- Silicon: Huawei Ascend 910B and 910C AI chips; domestic alternative to Nvidia developed in response to US export controls; Ascend 910C reported at ~256 TFLOPS (FP16)
- Scale: Exact cluster size undisclosed; Huawei Cloud provides AI compute services broadly; HIMA alliance training runs on dedicated Ascend infrastructure
- Strategic position: Huawei is the only major AV player with full-stack vertical integration — training silicon (Ascend), inference silicon (HiSilicon/Kirin), training cluster, ADS software, and cockpit OS; this insulates it from US export control disruption more than any other Chinese player
- Fleet data: 7.28 billion km accumulated mileage from HIMA alliance vehicles as of January 2026; 28 vehicle models across 5 OEM partners; 80+ models targeted by end 2026
- Ascend vs. Nvidia: Ascend 910C training performance is broadly competitive with the Nvidia A100; a gap versus the H100 remains
- Key constraint: The CANN (Compute Architecture for Neural Networks) software stack is less mature than Nvidia's CUDA, though improving rapidly under domestic pressure, and the third-party model ecosystem is smaller; but a captive workload (ADS training) reduces the need for broad ecosystem compatibility
NIO — Adam Supercomputer
- Cluster name: Adam (announced 2021; expanded since)
- Silicon: Nvidia A100 GPUs; approximately 1,000 A100 units at initial announcement
- Paired sensor platform: NIO Aquila — 33 high-performance sensors including LiDAR, cameras, radar, ultrasonic; generates high-fidelity training data per vehicle
- Business model: NIO NAD (NIO Autonomous Driving) sold as a subscription service (NIO Aquila hardware + Adam compute as a service); differentiates NIO's AV stack as a recurring revenue product
- Key constraint: Smaller fleet than XPeng or Tesla; Adam cluster scale modest relative to leading players; dependent on Nvidia supply chain
Rivian — RAP1 and Training Infrastructure
- In-house chip: RAP1 (Rivian Autonomy Processor 1); dual-chip configuration in the Gen 3 platform; 1,600 TOPS total inference compute; designed both for vehicle inference and as the basis for training acceleration
- Training cluster: Details not publicly disclosed; VW joint venture (announced 2024, $5.8B) provides additional software infrastructure investment
- Data flywheel strategy: LiDAR point clouds from R1 and R2 consumer fleet feed training pipeline with 3D ground truth data — higher quality per mile than camera-only fleets; Uber partnership (50,000 robotaxi vehicles) will significantly expand fleet data volume post-2028
- Large Driving Model: Trained using Group-Relative Policy Optimization (GRPO) — same RL technique used in DeepSeek R1 and adapted for driving policy; compute-intensive relative to pure supervised learning
- Key constraint: Consumer fleet currently small relative to Tesla/XPeng; training cluster scale not publicly disclosed; Georgia robotaxi factory still under construction
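GRPO's core step — scoring each rollout against its sampled group rather than against a learned value baseline — can be sketched as follows (a minimal illustration of the advantage computation, not Rivian's implementation):

```python
import math

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean and std of its own sampled group. No critic network is needed,
    which is what makes GRPO cheaper per step than PPO-style RL."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 hypothetical driving rollouts scored by a reward model:
advs = grpo_advantages([1.0, 0.5, 0.5, 0.0])  # roughly [1.414, 0.0, 0.0, -1.414]
```

The compute intensity noted above comes from the rollouts themselves: each policy update requires sampling and scoring a full group of trajectories, which for driving means running the policy through simulation at scale.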
Training Cluster Comparison Table
| Cluster | Operator | Primary Silicon | Reported Scale | In-House Chip | Fleet Data Input | Export Control Risk |
|---|---|---|---|---|---|---|
| Cortex / Dojo | Tesla | Nvidia H100 + Tesla D1 | ~50,000 H100-equiv. | Yes — D1, AI5 (pending) | ~6M vehicles global | Low — US-based OEM |
| TPU Infrastructure | Waymo | Google TPU v4/v5 | Undisclosed (Google Cloud) | No — uses Google TPU | ~1,500 vehicles + sim | Low — US-based |
| Turing Cluster | XPeng | Nvidia H800 | ~3,000 H800 GPUs | Yes — Turing AI Chip | ~429K vehicles (2025) | Medium — H800 supply at risk |
| Ascend Cluster | Huawei | Ascend 910B/910C | Undisclosed | Yes — fully domestic | 7.28B km (HIMA fleet) | Low — fully domestic silicon |
| Adam | NIO | Nvidia A100 | ~1,000 A100 GPUs | No | NIO fleet (China) | Medium — A100 restricted |
| Li Auto Cluster | Li Auto | Nvidia H800 | ~5,000 H800 GPUs | No | Li Auto fleet (China) | Medium — H800 supply at risk |
| RAP1 Platform | Rivian | Undisclosed (training) | Not publicly disclosed | Yes — RAP1 (inference) | R1/R2 consumer fleet | Low — US-based OEM |
Nvidia — AV Training and Simulation Platform
Nvidia occupies a unique position in the autonomous vehicle landscape: unlike the OEMs and AV developers profiled above, it neither builds vehicles nor fields its own AV software stack. Instead, it supplies the compute infrastructure, simulation environment, and increasingly the pre-trained foundation models that other OEMs build on. Its automotive business is architecture-agnostic — it supplies the HD map camp, the vision-neural camp, and the sensor fusion camp simultaneously.
Nvidia frames its AV offering as a three-computer model spanning the full development pipeline: DGX for AI model training, Omniverse + Cosmos for simulation and synthetic data generation, and DRIVE AGX for in-vehicle inference compute. An OEM can theoretically use Nvidia for every layer of the stack except the vehicle itself. Nvidia's automotive vertical is projected to reach approximately $5 billion revenue in fiscal year 2026.
DGX — Training Compute (Cloud and On-Premise)
- What it is: Purpose-built AI supercomputing platform for training AV perception, mapping, prediction, and decision models; available as on-premise DGX systems or via DGX Cloud as a managed service
- DGX Cloud: Hosted on Microsoft Azure, Google Cloud, Oracle Cloud, and AWS; delivers H100 GPU clusters optimized for multi-node training without OEM capital expenditure on hardware; priced at approximately $36,999 per instance per month
- Physical AI Data Factory Blueprint: Software toolkit for curating, augmenting, and evaluating the large datasets required for AV model training; sits between raw fleet upload and training run
- Confirmed OEM users: Volvo Cars / Zenseact (DGX Cloud for model training); NIO (DGX platform for perception model training efficiency); Waabi (DGX for complex simulation training runs); Uber (DGX Cloud for AV AI acceleration, early 2025)
- Software stack: CUDA-X AI, GPU-optimized kernels via NGC containers, TensorRT and Triton for inference optimization; Run:ai (acquired by Nvidia, ~$700M, 2024) for GPU workload orchestration and resource management across hybrid infrastructure
Cosmos — World Foundation Model and Synthetic Data Generation
- What it is: Generative world foundation model platform that reconstructs real-world sensor data into interactive simulation and generates physically accurate synthetic sensor data for AV training and validation
- Core capability: An OEM with a small physical fleet can use Cosmos to synthetically generate rare edge-case scenarios — construction zones, unusual pedestrian behavior, adverse weather, sensor failure modes — that would require millions of real-world miles to encounter at statistical frequency; directly addresses the long-tail problem in AV training
- Cosmos Curator on DGX Cloud: GPU-accelerated service that ingests raw fleet video, segments it into meaningful clips, generates embeddings, and auto-labels with text prompts; sits at Stage 3 of the OTA loop (ingestion and auto-labeling)
- Omniverse NuRec: Reconstructs real-world driving data into interactive 3D simulation environments; enables closed-loop testing where the simulated world responds dynamically to the vehicle's actions rather than replaying a fixed scenario
- OVX systems: Dedicated compute hardware for running Omniverse and Cosmos simulation workloads; separate from DGX training systems; optimized for real-time 3D rendering and physics simulation at scale
- Confirmed users: Waabi uses Cosmos for data curation and simulation; Toyota, Aurora, and Continental announced as Cosmos simulation partners (September 2025)
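The long-tail argument behind Cosmos is fundamentally statistical and can be made concrete (illustrative arithmetic with a hypothetical event rate, not Nvidia's figures):

```python
import math

def miles_for_confidence(event_rate_per_mile: float,
                         confidence: float = 0.95) -> float:
    """Real-world miles needed to observe at least one occurrence of a rare
    event with the given confidence, modeling encounters as a Poisson
    process: P(at least one) = 1 - exp(-rate * miles)."""
    return -math.log(1.0 - confidence) / event_rate_per_mile

# A hypothetical edge case occurring once per 10 million driven miles:
miles = miles_for_confidence(1 / 10_000_000)  # ~30 million miles for one sample
```

One sample is not a training set — useful coverage of such an event requires hundreds of examples, pushing the physical-mileage requirement into the billions. Synthetic generation replaces that mileage with compute, which is the economic case for the platform.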
Alpamayo — Open AV Foundation Models
- What it is: Open portfolio of pre-trained AI models, simulation frameworks, and physical AI datasets for autonomous vehicle development; released by Nvidia for OEM fine-tuning on proprietary fleet data
- Includes: Vision-Language-Action (VLA) models with contextual reasoning and decision-making; designed to be fine-tuned rather than trained from scratch, significantly lowering the compute and data barrier for smaller AV developers
- Strategic implication: Nvidia moves from pure infrastructure supplier to model provider; OEMs that adopt Alpamayo as a foundation layer become more deeply embedded in the Nvidia ecosystem
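The fine-tune-rather-than-pretrain economics can be sketched with the standard ~6·N·D FLOPs approximation for dense transformer training (an approximation; the parameter count and token budgets below are hypothetical, not Alpamayo specifications):

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer: ~6 * N * D,
    where N is parameter count and D is tokens processed."""
    return 6 * params * tokens

# Hypothetical 7B-parameter VLA model:
pretrain = train_flops(7e9, 2e12)  # from scratch on a 2T-token corpus
finetune = train_flops(7e9, 2e10)  # fine-tuned on 20B tokens of fleet data
ratio = pretrain / finetune        # 100x less training compute
```

At equal model size, the fine-tune run is two orders of magnitude cheaper than pretraining — the barrier reduction that makes a foundation-model release attractive to smaller AV developers.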
DRIVE AGX — In-Vehicle Inference Compute
- DRIVE AGX Orin: Current production SoC; 254 TOPS; safety-certified; in production vehicles including Volvo EX90, JLR (from 2026), and others
- DRIVE AGX Thor: Next-generation SoC; 2,000 TOPS; in production ramp 2025–2026; designed for L4 autonomous driving workloads
- DriveOS: Safety-certified operating system running on DRIVE AGX; provides real-time processing, security, and functional safety compliance (ISO 26262); runs CUDA, TensorRT, NvMedia, and NvStreams
- DRIVE Hyperion: Full reference architecture combining DRIVE AGX compute with a validated multimodal sensor suite (cameras, radar, ultrasonic); production-ready platform from L2++ to L4; used by OEMs as a validated starting point rather than designing sensor integration from scratch
- Confirmed vehicle platforms: Volvo EX90 (Orin, migrating to Thor); JLR fleet from 2026; Rivian (DRIVE AGX partnership confirmed)
Halos — Cross-Stack Safety System
- What it is: Full-stack safety framework that spans training, simulation, and in-vehicle deployment; unifies hardware, software, tools, and AI models under a common safety architecture
- Purpose: Addresses the certification gap in neural-network-based AV systems — provides a structured safety layer from DGX training through Cosmos simulation through DRIVE AGX deployment; designed to support ISO 26262 and SOTIF (ISO 21448) compliance paths
- Strategic role: Nvidia's answer to the hardest unsolved problem in AV commercialization — proving that an end-to-end neural system is safe enough for regulatory approval
Nvidia AV Platform Summary
| Platform | Function | OTA Loop Stage | Deployment Model | Key OEM Users |
|---|---|---|---|---|
| DGX (on-premise) | Large-scale AV model training | Stage 4 — Training Cluster | On-premise hardware purchase | NIO, Li Auto, Aurora, Waabi |
| DGX Cloud | Managed cloud training compute | Stage 4 — Training Cluster | SaaS via Azure / GCP / AWS / OCI | Volvo / Zenseact, Uber, NIO |
| Cosmos | Synthetic data generation; world model simulation | Stage 3 — Ingestion / Labeling | Cloud service; OVX on-premise | Waabi, Toyota, Aurora, Continental |
| Cosmos Curator | Automated video segmentation and auto-labeling | Stage 3 — Ingestion / Labeling | DGX Cloud managed service | Waabi; broad OEM availability |
| Omniverse / NuRec | Closed-loop simulation; real-world data reconstruction | Stage 5 — Validation | Cloud and on-premise (OVX) | Toyota, Aurora, Continental, Waabi |
| Alpamayo | Open pre-trained VLA foundation models for AV | Stage 4 — Training (fine-tuning) | Open model; fine-tune on proprietary data | Open release; broad availability |
| DRIVE AGX Orin | In-vehicle inference compute (current gen) | Stage 1 — Edge Capture; Stage 6 — OTA Push | SoC in production vehicles | Volvo EX90, JLR, Rivian |
| DRIVE AGX Thor | In-vehicle inference compute (next gen, 2,000 TOPS) | Stage 1 — Edge Capture; Stage 6 — OTA Push | SoC in production vehicles (2025–2026 ramp) | Volvo (migration), broad OEM pipeline |
| Halos | Cross-stack AV safety framework | All stages | Integrated across DGX / Cosmos / DRIVE AGX | All DRIVE platform adopters |
Cross-Site Links
- OTA Loop mechanics: The OTA Loop: Autonomous Vehicle Learning Cycle — ElectronsX
- Autonomous driving architectures: Autonomous Driving: Technology Approaches — ElectronsX