Excellent post. For the open-source component, will you only be comparing what you call end-to-end methods? Or will you also have some classical/hybrid baselines as well?
I am not fully sure yet, but adding a model-based option may be one of the easier parts of this so probably!
This is a great post, I especially liked the clear hardware mapping for each example.
This may be an oversimplification, but from what I understand, there seems to be a clear split between two layers - planning and control.
Planning seems to be "heavy compute, low frequency" - like running VLMs or a World model on a GPU. Control seems to be more lightweight, but needs higher frequency, hence mostly running on CPUs.
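Concretely, I'm picturing something like the toy structure below (just a sketch to illustrate the split I mean; the function names and rates are invented, not from the post):

```python
import threading, time

# Placeholder stand-ins for a VLM/world-model planner and a tracking controller;
# both names are invented for this sketch.
def run_big_model():
    time.sleep(0.3)                      # pretend GPU inference takes ~300 ms
    return {"waypoint": [1.0, 2.0]}

def track(target):
    return [0.0, 0.0] if target is None else target["waypoint"]

latest_plan = {"target": None}           # latest-value mailbox shared between the two rates
lock = threading.Lock()

def planning_loop():                     # heavy compute, low frequency
    while True:
        plan = run_big_model()
        with lock:
            latest_plan["target"] = plan

def control_loop(steps=1000):            # lightweight, high frequency (~500 Hz here)
    for _ in range(steps):
        with lock:
            target = latest_plan["target"]
        command = track(target)          # this is what would go to the actuators
        time.sleep(0.002)

threading.Thread(target=planning_loop, daemon=True).start()
control_loop()
```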
So is it realistic to expect a hybrid architecture where some, or all, of the planning moves to "the cloud"? I know that some applications cannot tolerate the unpredictability of cloud latency, but there could be many applications where it would be fine - like robots in a controlled factory environment. If this is realistic, then the hybrid architecture might be better economically (by sharing compute across a group of robots) and might also make the robots lighter and more power efficient.
Do you think that's a viable direction for robotics?
That's a good observation, and in fact, the 5G push from a few years back was exactly about things like this - networked cars and robots with their brains in the cloud, leveraging 5G networking features for managed latency. This push has somewhat died down, but I think the potential is still there.
For mobile robots, this kind of heterogeneous architecture makes sense even today, with the dividing line drawn at higher-level decision-making. For example, industrial inspection robots could have their AI anomaly detection algorithms running on a networked local or cloud computer. That way the robots don't need to run power-hungry GPUs and algorithms that sap their battery life. However, any algorithms that *actuate* the robot (like controls, or basic autonomy like collision avoidance) should probably stay local for safety reasons.
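As a very rough illustration of that division of labor (the names, rates, and latencies are made up, not from any real system), the actuating loop stays on the robot and never blocks on the remote call:

```python
import concurrent.futures
import time

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def detect_anomaly_remote(frame):
    # Stand-in for a request to a networked/cloud inference service.
    time.sleep(0.2)                        # pretend network + GPU inference latency
    return {"anomaly": False}

def collision_avoidance(scan):
    # Stand-in for the local, safety-critical autonomy that actuates the robot.
    return {"forward_speed": 0.0 if min(scan) < 0.3 else 0.5}

pending = None
for step in range(100):                    # local loop at ~50 Hz
    scan = [1.0, 0.8, 0.6]                 # fake range-sensor data
    command = collision_avoidance(scan)    # always computed on the robot

    if pending is None:
        pending = executor.submit(detect_anomaly_remote, "camera_frame")
    elif pending.done():                   # only consume the result once it has arrived
        report = pending.result()
        pending = None

    time.sleep(0.02)                       # the actuating loop never waits on the network
```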
Really enjoyed this breakdown. Reading this through a compute architecture lens, my takeaway is that while there’s a strong push to bring learning closer to the control boundary, every successful system still seems to retain a structured low-level control layer running at kHz rates, backed by deterministic compute (CPUs, controllers) rather than fully relying on GPUs.
The implementation choices you described clearly address the determinism and real-time constraints, but they also seem to limit how much learning can realistically live in that layer. It feels like there’s a growing tension here: pressure to move learning inward, but control-plane compute isn’t designed to support it.
I’m wondering how you think about that gap – is the expectation that control-plane learning stays minimal and highly constrained, or do you see room for a different class of AI compute emerging closer to the control boundary?
That's a really good question.
Some of the computational hardware developed so far for neural network operations is obviously good at accelerating matrix multiplication, which features in some way in many of these "control layer" calculations as well. However, there are a few issues with mapping control-layer calculations onto hardware designed for neural networks:
1) Many of these control-layer calculations (at least naively implemented) are not as regular as a neural network, and may have branching, irregular data sizes, etc. They don't obviously map as well to systolic arrays or SIMT, though perhaps better to VLIW.
2) Since the calculations are relatively small, moving data across a bus to SoC-level or last-level cache, or worse, via DRAM, and then back to the CPU to eventually send to actuators, seems likely to be the bottleneck. Additionally, since small calculations already run well on a CPU (especially taking advantage of SVE / SME), a large effort to map them to other hardware may not be justifiable.
Neural-network accelerator hardware has become increasingly specialized (NPUs, TPUs), and it will take some algorithmic upheaval before control-layer calculations can map to it. With GPUs, I am not 100% sure, but point 2 definitely holds. There's also some possibility that specialized VLIW coprocessors, or CPUs with SME instructions, could be a good middle ground.
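To put some rough numbers on point 2 (the problem size, bandwidth, and latency below are made up, but order-of-magnitude plausible):

```python
# Back-of-envelope for point 2: a control-sized problem moves so little data and does
# so little math that the round trip to an accelerator dominates.
n = 30                                   # e.g. a control-sized matrix dimension
flops = 2 * n**3                         # one n x n matrix-matrix multiply: ~54k FLOPs
bytes_moved = 3 * n * n * 4              # two inputs + one output, float32: ~10.8 kB

cpu_gflops = 50e9                        # a few wide SIMD cores (SVE/SME-class), assumed
compute_time_us = flops / cpu_gflops * 1e6
print(f"compute on CPU:   ~{compute_time_us:.2f} us")

pcie_round_trip_us = 5.0                 # assumed latency just to cross the bus and back
transfer_us = bytes_moved / 16e9 * 1e6   # assumed ~16 GB/s effective bus bandwidth
print(f"offload overhead: ~{pcie_round_trip_us + transfer_us:.2f} us before any math runs")
```

Even with generous assumptions, just crossing the bus costs more than doing the whole calculation in place on the CPU.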
Either way, again, this is a great question, and I'll hope to revisit it in future posts--possibly sooner than later.
Thanks again for another great article! On the topic of generalization, I’m curious about your take on the policies from Skild.AI and their blogs. Will you be covering their hierarchical approach too in the future?
I’m curious because they seem to be tackling the generalization problem, though by training multiple embodiments in one go. I’m hesitant about whether multi-embodiment pre-training really helps single-embodiment training, unlike TRI’s report that multi-task pre-training improves on single-task performance.
Thanks for the comment! Yes, Skild's work is also very impressive, but I couldn't find as much detail in their blog posts. Based on what I see here https://www.skild.ai/blogs/one-policy-all-scenarios, the architecture sounds like the one in the Figure Helix blog post from March 2025.
About the multi-embodiment pretraining, to my knowledge it seems like there's some agreement but it isn't fully understood (e.g. see Scott Kuindersma's statement here https://spectrum.ieee.org/boston-dynamics-atlas-scott-kuindersma )--to me it seems a bit like the "scaling law" for LLMs. This may not be a relevant comparison, but from a biological standpoint, I'm not sure if a human brain (e.g.) needs to know how to actuate a cat body in order to have its general locomotion capabilities.
I appreciate the reply!
I definitely missed Scott Kuindersma's comment on that in that article. I wonder how close the embodiments need to be in terms of morphology for the pretraining to help. I can see how pretraining on a whole humanoid can help with upper-body manipulation, but what about going from dual manipulators to a humanoid? I'm speculating at this point, but it could be similar to how it's hard for us to imagine how to use a tail or extremities that we don't have.
Regardless, I'm looking forward to your next posts!
Excellent post Avik.
I've been searching for some of these answers for several months now. I had hoped to build an intuition for the real-world speed of these models under some fixed hardware constraints, but given the ambiguity of where "end-to-end" begins and ends, it's been challenging. Based on your article, I'm not hopeful that will change soon, but at least I can put that search to rest.
I'm curious about your intuition on the long-term evolution of the control layer. Do you see a path toward standardized "motor APIs" where the control layer becomes an adapter that any reasoning layer can plug into via a well-defined interface?
If that abstraction emerges, what does the error interface look like? How does the control layer signal constraint violations back to the reasoning layer for replanning?
Thanks, and good questions!
In terms of the interface, in model-based methods it is common to pass a reference trajectory or reference velocity to track from a higher-level controller. Those are still reasonable candidates, but I think there is some feeling that more information is needed in the interface, and it isn’t clear what. It is possible to develop the high-level (HL) and low-level (LL) controllers modularly if that interface is fixed, but then there is necessarily some structure being imposed. I think this will need continued research.
Also, very good point about the error signal - the HL controller does eventually get feedback for replanning from its own observations, but if it is running slower, it will not react very quickly. I had wanted to cover this point in the part 2 post (https://www.avikde.me/p/is-it-learning-online-motor-adaptation), but just covering adaptation got too long as it is. There is some biology context for this (e.g. see Fig. 1.3 in https://escholarship.org/uc/item/279092tz#page=11), which I hope to cover in a future post.
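To make the kind of interface I have in mind a bit more concrete, here is one hypothetical shape it could take (the names and fields are my own invention, not an existing standard): a reference goes down, and tracking quality plus constraint violations come back up for the HL controller to replan on.

```python
# One hypothetical "motor API": the reasoning layer sends references down, and the
# control layer reports tracking quality and constraint violations back up.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class ReferenceCommand:
    timestamps: list[float]               # horizon of the reference
    positions: list[list[float]]          # reference trajectory (or swap in velocities)
    stiffness: float = 1.0                # how aggressively to track

@dataclass
class ControlFeedback:
    tracking_error: float                 # e.g. RMS error over the last window
    violations: list[str] = field(default_factory=list)   # "joint_limit", "torque_limit", ...
    feasible: bool = True                 # the LL controller's own judgment of the reference

class ControlLayer(Protocol):
    def submit(self, command: ReferenceCommand) -> None: ...
    def feedback(self) -> ControlFeedback: ...

# The reasoning layer would poll feedback() at its own (slower) rate and replan
# when feasible is False or violations is non-empty.
```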
This is a wonderfully insightful piece, thank you for outlining the architectural shift so clearly. Regarding the 'actions' side discussed in Part 1, could you elaborate further on the inherent challenges and specific architectural patterns that emerge when these learned systems need to tightly integrate with precise physical control mechanisms, especially when considering robustness in varied environments?
Thanks! Other than some of the points in the "why not end-to-end" section of part 1, there are more issues to consider when adaptation to unexpected environmental or external factors is required, as you say. I plan to dig into that particularly in part 2!