Lessons from AVs on safety in end-to-end pipelines
Recent developments in autonomous vehicles on recognizing and handling distribution shift
This short post covers a couple of recent updates from the autonomous vehicle (AV) industry with connections to broader and more general safety in robotics.
Recognizing performance deterioration
This Verge article from March 19 reports that there could be an impending recall of Tesla’s Full Self-Driving (FSD) service. I’m not interested in making any judgments about self-driving capability, but rather in whether the root cause offers lessons for broader robotics.
The issue appears to be either that the system didn’t know when it wasn’t working well (causing the issues described in the NHTSA filing), or that it knew and failed to notify the driver. The latter seems unlikely, so we’ll assume the former.

This phenomenon isn’t isolated to AVs. The latest article in my Vision-Language-Action (VLA) robotics pipeline series took a hands-on look at debugging one, and while we found some techniques that can aid developers, they didn’t directly help at inference time. Item 1 in Rui Xu’s candid post-mortem of K-Scale Labs mentions the pitfalls of trusting a “large model” vs. dedicated safety features. Recent papers on VLAs note their fragility when moving away from the training distribution (e.g. Fang et al Jun 2025, Hu et al Jan 2026).
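One common framing of "the system didn't know it wasn't working" is a runtime monitor that watches some uncertainty signal and flags when it drifts from its recent baseline. Here is a minimal sketch of that idea; the class name, windowed z-score rule, and thresholds are my own illustrative choices, not any particular vendor's mechanism, and `signal` stands in for whatever proxy a stack exposes (action-head entropy, ensemble disagreement, reconstruction error):

```python
from collections import deque
from statistics import mean, stdev

class DeteriorationMonitor:
    """Flags when a policy's uncertainty signal drifts above its recent baseline.

    Hypothetical interface for illustration: `signal` is any scalar that
    tends to rise when the policy leaves its training distribution.
    """

    def __init__(self, window: int = 100, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)  # rolling baseline of recent signals
        self.z_threshold = z_threshold

    def update(self, signal: float) -> bool:
        """Return True if the new signal is anomalously high vs. the window."""
        alert = False
        if len(self.history) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and (signal - mu) / sigma > self.z_threshold:
                alert = True
        self.history.append(signal)
        return alert

# In-distribution signals hover near 0.1; a sudden spike trips the monitor.
monitor = DeteriorationMonitor()
for i in range(50):
    monitor.update(0.1 + 0.001 * (i % 5))
spike_detected = monitor.update(5.0)
```

The point isn’t the statistics — it’s that the check runs at inference time and produces a yes/no the rest of the stack (or the driver notification path) can act on.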
Potential solutions: redundancy, confidence, architecture
NVIDIA recently announced their new Alpamayo model and accompanying AV stack as a reference open model and toolchain. During the CES 2026 keynote, Jensen Huang said something intriguing about safety:

This parallel or hybrid architecture, with a classical stack and a policy arbitrator, was also covered in this CounterPoint research article. Interestingly, I can’t find references from NVIDIA themselves about this parallel system other than Jensen’s keynote — it may simply be early in development.
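To make the arbitrator idea concrete, here is a toy sketch under my own assumptions (the `Trajectory` type, the confidence threshold, and the "stay near the classical plan" envelope check are all illustrative stand-ins for real collision and kinematics checks, not NVIDIA's design):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    waypoints: list    # planned path, here just lateral offsets in meters
    confidence: float  # self-reported score in [0, 1]

def within_envelope(candidate: Trajectory, reference: Trajectory,
                    max_deviation: float = 2.0) -> bool:
    """Toy safety envelope: candidate waypoints stay within `max_deviation`
    meters of the classical reference path."""
    return all(abs(c - r) <= max_deviation
               for c, r in zip(candidate.waypoints, reference.waypoints))

def arbitrate(learned: Trajectory, classical: Trajectory,
              min_confidence: float = 0.8) -> Trajectory:
    """Prefer the learned plan only when it is confident AND inside the
    classical planner's envelope; otherwise fall back to the classical plan."""
    if learned.confidence >= min_confidence and within_envelope(learned, classical):
        return learned
    return classical
```

The key property is that the learned policy can only ever *refine* behavior inside a region the classical stack has already vetted — a low-confidence or out-of-envelope plan degrades gracefully to the fallback.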
A related approach is to have the VLA output some kind of confidence (vs. a separate “policy arbitrator”). Zollo et al (Dec 2025) formalizes the problem of confidence calibration for VLA policies, describes how to extract confidence estimates from contemporary VLA architectures, and notes that current VLAs lack a reliable mechanism for quantifying the uncertainty of their chosen action sequences. It also introduces two potential remedies: prompt ensembles and action-wise Platt scaling.
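Platt scaling is a classic calibration recipe, so it’s worth sketching what "action-wise Platt scaling" could look like in spirit — this is a minimal stand-in fit by plain gradient descent on held-out (score, success) pairs, not the paper’s exact procedure:

```python
import math

def fit_platt(scores, successes, lr=0.1, steps=2000):
    """Fit p(success) = sigmoid(a * score + b) by logistic-loss gradient
    descent on held-out (raw_score, success) pairs."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, successes):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n  # dloss/da for logistic loss
            grad_b += (p - y) / n      # dloss/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrated_confidence(score, a, b):
    """Map a raw model score to a calibrated success probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy held-out data: actions with raw score above ~0.5 tended to succeed.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
successes = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_platt(scores, successes)
```

After fitting, `calibrated_confidence` turns the model’s raw score into a probability a downstream arbitrator or driver-alert threshold can actually be set against — which is the practical payoff of calibration.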
Lastly, inserting debuggable interfaces into end-to-end pipelines can facilitate inspection and safety — lower-level controllers can apply dedicated safety constraints to the commands passed down from a higher-level controller. This still appears possible in most of today’s successful humanoid robotics demonstrations, due to a combination of factors. Keeping that architectural feature around may have lasting benefits, based on current events in the AV industry!
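A simple sketch of that split: the low-level controller enforces hard limits no matter what the high-level (possibly learned) policy commands. The specific limits below are made-up illustrative numbers, not real robot parameters:

```python
def safe_low_level_step(commanded_velocity: float, current_velocity: float,
                        max_speed: float = 1.5, max_accel: float = 0.5,
                        dt: float = 0.02) -> float:
    """Low-level velocity controller that constrains whatever the
    high-level policy commands. Illustrative limits, not real numbers.

    Returns the velocity actually applied this control tick.
    """
    # Hard clamp to the platform's absolute speed envelope.
    target = max(-max_speed, min(max_speed, commanded_velocity))
    # Rate-limit so a single bad command can't demand an instant jump.
    max_delta = max_accel * dt
    delta = max(-max_delta, min(max_delta, target - current_velocity))
    return current_velocity + delta

# Even an absurd command (10 m/s from standstill) yields only one
# acceleration-limited step — and every intermediate value is inspectable.
applied = safe_low_level_step(commanded_velocity=10.0, current_velocity=0.0)
```

Because the interface between the two layers is an explicit, typed quantity (a velocity) rather than an opaque latent, both debugging and safety constraints get something concrete to hold onto.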
Thanks for reading! I have been working on the next part of the end-to-end pipeline series, with a deep dive into the action head and closed-loop behavior. If you liked this post, please share and subscribe.


