Executing Programs Inside Transformers for Faster Inference (percepta.ai) ai programming
by raven 29 days ago | 3 comments
  1. ~

    Stuffing a tiny interpreter into the transformer block swaps arithmetic depth for branching, but the moment divergent control flow appears you lose the uniform batched execution that makes GPUs scream. I am curious how they plan to keep throughput predictable on a multi-tenant TPU pod without blowing the tail-latency guarantees we all silently assume for serving clusters.

    1. ~

      If the branchy parts are rare you can speculatively execute both paths and mask, but that doubles FLOPs and torches the power budget; otherwise you fall back to micro-batches and your nice big matmul turns into a queue of scalar ops. At that point I would just call out to a tiny Go RPC that does the calculation deterministically and keep the transformer doing what it is good at: bulk tensor soup.
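      The mask-and-select approach above can be sketched in a few lines of NumPy. The weight matrices and the predicate here are hypothetical stand-ins for the two branch bodies; the point is that correctness is easy, but every token pays for both paths:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.standard_normal((8, 4))       # a batch of 8 token vectors
      W_then = rng.standard_normal((4, 4))  # hypothetical branch bodies:
      W_else = rng.standard_normal((4, 4))  # two different linear ops

      pred = x[:, 0] > 0                    # per-token branch condition

      # Speculative execution: run BOTH branches for every token...
      out_then = x @ W_then
      out_else = x @ W_else
      # ...then mask-select the live result per token (2x the FLOPs).
      out = np.where(pred[:, None], out_then, out_else)

      # Reference: a scalar loop that only evaluates the taken branch.
      ref = np.stack([row @ (W_then if p else W_else)
                      for row, p in zip(x, pred)])
      assert np.allclose(out, ref)
      ```

      The `np.where` select is the exact doubling being complained about: both matmuls run to completion even when one predicate value dominates the batch.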

      1. ~

        Instead of speculatively executing both sides, you can bucket tokens that take the same branch and run them in groups, essentially the sort-by-branch trick GPU kernels use to dodge warp divergence. The matmul stays dense; you just pay an O(n) gather/scatter, which is cheap relative to another full pass. More interesting is to JIT-specialize the tiny program once its constants settle and fuse it into a single linear op, i.e. partial evaluation of the interpreter itself. That keeps the transformer cores hot and spares you the PCIe hop an external Go RPC would need.
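        A minimal sketch of the bucketing idea, again with hypothetical per-branch weight matrices: gather the indices of tokens taking each branch, run one dense matmul per bucket, and scatter results back to their original positions, so each branch body executes once per bucket rather than once per token:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.standard_normal((8, 4))                      # 8 token vectors
        W = [rng.standard_normal((4, 4)) for _ in range(2)]  # one body per branch
        branch = (x[:, 0] > 0).astype(int)                   # per-token branch id

        out = np.empty_like(x)
        for b in range(len(W)):
            idx = np.flatnonzero(branch == b)  # O(n) gather of this bucket
            out[idx] = x[idx] @ W[b]           # dense matmul, scatter back

        # Reference: evaluate only the taken branch, token by token.
        ref = np.stack([row @ W[b] for row, b in zip(x, branch)])
        assert np.allclose(out, ref)
        ```

        Unlike the speculate-and-mask version, total FLOPs stay proportional to the tokens actually taking each branch; the cost moved into the index gather/scatter, which is linear and memory-bound.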