Great overview — I really appreciated the thoughtful comparison between scaling in LLMs and recent trends in robotics / physical AI.
This resonates strongly with patterns we’ve seen repeatedly in semiconductors over the last 20+ years. “One size fits all” compute tends to break down once you hit hard real-time loops, tight power envelopes, and latency/jitter constraints — especially in embedded systems.
GPUs (and NVIDIA’s ecosystem in particular) will clearly dominate for a long time due to tooling, legacy code, and scale, but historically we’ve seen specialized solutions emerge around the edges where general-purpose architectures struggle. Curious how you see this playing out as perception and control loops get tighter.
Thanks for the really insightful comment — it’s helpful to hear your perspective from the semiconductor industry. For perception and control, I think we’ll need a bit more time to see what wins out, but there’s a case for units more capable than linear layers. For example, as I wrote in my latest post about von Neumann’s last writings, there’s a striking contrast between an artificial vision encoder hundreds of layers deep and the 3-synapse retina in biology, which performs similar functions. Without more research, though, it isn’t clear what functionality more complex units should have. On the control side, it’s possible that putting simple analog control loops (PD control or LQR, for example) into dedicated units could be beneficial, and I’d like to explore that in some future posts.
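To make the control-side idea concrete, here is a minimal sketch of the kind of fixed-function loop that could live in a small dedicated unit — a discrete PD controller driving a toy 1-D point mass. The plant dynamics and the gains are illustrative assumptions, not tuned values from any real system.

```python
# Toy PD control loop: u = Kp*e + Kd*de/dt, applied to a unit point mass.
# Gains chosen for (roughly) critical damping; purely illustrative.

def pd_step(error, prev_error, dt, kp=4.0, kd=4.0):
    """One PD update from the current and previous tracking error."""
    return kp * error + kd * (error - prev_error) / dt

def simulate(target=1.0, dt=0.01, steps=500):
    pos, vel = 0.0, 0.0              # point-mass state
    prev_error = target - pos
    for _ in range(steps):
        error = target - pos
        u = pd_step(error, prev_error, dt)
        prev_error = error
        vel += u * dt                # unit mass: acceleration = control force
        pos += vel * dt
    return pos

print(simulate())                    # settles close to the target of 1.0
```

The point is how little state and arithmetic the loop needs — two gains, one previous error — which is why it maps so naturally onto a tiny dedicated (even analog) unit rather than a general-purpose accelerator.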
This is a really detailed post, Avik. I'm quite new to this, so this might be a naive question: are world models being built as foundation models + some domain/environment specific post training/fine tuning? I'm trying to draw a comparison between language models which have one foundational model like GPT, but many domain specific implementations that use the foundation model.
As far as I know, there isn't a consistent, agreed-upon way to introduce the added structure for a world model. That's actually one of the things that makes this difficult: different problem domains may disagree about what's needed. It's safe to say that much more research will be needed to realize the benefits of models with added structure. I'm not sure how big the architectural changes made by the "next wave" of AI companies will be, but we'll know more by the end of this year, I'm sure.
Fine-tuning a foundation model would be a different approach; while it may add constraints or bias the foundation model's behavior, it would still have the same generic architecture. That means it couldn't drastically change model efficiency, or fundamentally affect things like safety or semantic representation.
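A toy sketch of that distinction (not any real world-model API — the shapes, data, and "backbone" here are all made up): fine-tuning reuses the foundation model's architecture unchanged and only nudges some of its weights, here by refitting the final layer while the backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Foundation model": a fixed random feature map plus a linear head.
W_backbone = rng.normal(size=(8, 16))   # frozen after "pretraining"
w_head = rng.normal(size=16)            # the only part we fine-tune

def forward(x):
    features = np.tanh(x @ W_backbone)  # identical architecture before and after
    return features @ w_head

# Domain-specific "fine-tuning": least-squares refit of the head only,
# on stand-in domain data.
X = rng.normal(size=(64, 8))
y = rng.normal(size=64)
features = np.tanh(X @ W_backbone)
w_head, *_ = np.linalg.lstsq(features, y, rcond=None)
```

The forward pass is byte-for-byte the same function afterwards — only `w_head` changed — which is the sense in which fine-tuning biases behavior without touching the architecture (and so can't change its fundamental efficiency profile).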