Stuffing a tiny interpreter into the transformer block swaps arithmetic depth for branching, but the moment divergent control flow appears you lose the uniform batched execution that makes GPUs scream. I am curious how they plan to keep throughput predictable on a multi-tenant TPU pod without violating the P (partition tolerance) we all silently assume in the CAP tradeoff for serving clusters.
If the branchy parts are rare you can speculatively execute both paths and mask, but that doubles FLOPs and torches the power budget; otherwise you fall back to micro-batches and your nice big matmul turns into a queue of scalar ops. At that point I would just call out to a tiny Go RPC service that does the calculation deterministically and keep the transformer doing what it is good at: bulk tensor soup.
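The speculate-both-and-mask idea is easy to sketch in NumPy (function names here are illustrative, not from any real system): both branch bodies run for every element, which is exactly where the 2x FLOPs comes from, and a mask selects the result.

```python
import numpy as np

def branchy(x):
    # Scalar reference: a data-dependent branch per element.
    return x * 2.0 if x > 0 else x - 1.0

def masked(x):
    # Speculative version: compute BOTH branch bodies for every
    # element, then select with a mask. Execution stays uniform and
    # batch-friendly, but you pay roughly double the FLOPs.
    taken = x * 2.0        # "then" path, computed for all elements
    not_taken = x - 1.0    # "else" path, also computed for all elements
    return np.where(x > 0, taken, not_taken)

x = np.array([-1.0, 0.5, 2.0])
print(masked(x))  # [-2.  1.  4.]
```

This is the same predication trick GPU compilers apply to short divergent regions; it only pays off when the branch bodies are cheap relative to the cost of breaking up the batch.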
Instead of speculatively executing both sides, you can bucket tokens that take the same branch and run them in groups, exactly the warp-reconvergence trick GPUs already do. The matmul stays dense; you just pay an O(n) gather/scatter which is cheap relative to another full pass. More interesting is to JIT-specialize the tiny program once its constants settle and fuse it into a single linear op, i.e. partial evaluation of the interpreter itself. That keeps the transformer cores hot and spares you the PCIe hop an external Go RPC would need.
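A minimal NumPy sketch of the bucketing idea (the predicate and weight names are made up for illustration): gather the rows that take each branch, run one dense matmul per bucket, then scatter results back to their original positions.

```python
import numpy as np

def bucketed(x, w_pos, w_neg):
    # Partition tokens by branch outcome (here, a toy predicate on the
    # first feature), so each bucket runs as one dense matmul.
    idx_pos = np.flatnonzero(x[:, 0] > 0)
    idx_neg = np.flatnonzero(x[:, 0] <= 0)
    out = np.empty((x.shape[0], w_pos.shape[1]))
    # Gather -> dense matmul per bucket -> scatter back in place.
    out[idx_pos] = x[idx_pos] @ w_pos
    out[idx_neg] = x[idx_neg] @ w_neg
    return out
```

The gather/scatter is the only per-token overhead; the matmuls themselves stay dense, which is the whole point of the reconvergence trick.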
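The partial-evaluation point can be illustrated with a toy two-op "program" (an elementwise scale followed by a shift, then a matmul), assuming its constants have settled. This is a minimal sketch of constant-folding the interpreter away into a single affine op, not anyone's actual mechanism:

```python
import numpy as np

def specialize(scale, shift, w):
    # Partial evaluation: once the interpreted program's constants
    # (scale, shift) are fixed, fold the two-step program
    #   y = (x * scale + shift) @ w
    # into one affine op  y = x @ w_fused + b_fused.
    w_fused = scale[:, None] * w   # fold the elementwise scale into the weights
    b_fused = shift @ w            # fold the additive shift into a bias
    return w_fused, b_fused
```

After specialization the per-token work is a single dense matmul plus a bias add, so the "interpreter" costs nothing at inference time; the price is re-specializing whenever the constants change.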