Unlocking 2.5x Faster LLM Pre-Training: Nous Research's Token Superposition Training Explained

Pre-training large language models (LLMs) is notoriously expensive, so even small efficiency gains can translate into big savings in time and money. Nous Research has introduced Token Superposition Training (TST), a novel method that cuts pre-training wall-clock time by up to 2.5x without changing the model architecture, optimizer, tokenizer, or training data. The Q&A below walks through the problem TST solves, its two-phase design, and the results at scales from 270M to 10B parameters.

What problem does Token Superposition Training solve?

Modern LLM pre-training is extremely data-hungry, often exceeding compute-optimal estimates. A key lever for efficiency is raw text throughput — how much text the model can process per unit of compute. Subword tokenizers like BPE improve throughput by compressing sequences; TST asks whether this lever can be pulled further during training itself, without permanently altering the tokenizer or architecture. The result is a method that reduces the total pre-training time for a 10B-parameter mixture-of-experts model from 12,311 B200-GPU-hours to just 4,768 — roughly a 2.5x speedup — while also achieving a lower final training loss. These savings translate directly into lower costs and faster experimentation cycles for AI labs. By decoupling throughput gains from tokenizer changes, TST offers a drop-in enhancement to any existing training pipeline.

How does TST achieve faster pre-training?

TST modifies the standard training loop in two sequential phases. In Phase 1 – Superposition, which covers the first 20–40% of training steps, the model never sees individual tokens. Instead, the input sequence is divided into non-overlapping bags of s contiguous tokens, and within the embedding layer each bag is collapsed into a single latent “s-token” by averaging its s embeddings. The transformer then processes a sequence of L/s latent positions rather than L tokens. Crucially, each TST step uses the same FLOPs as a standard step, because the raw data sequence is lengthened by a factor of s; the model therefore ingests s times more text per unit of compute, which is the core driver of the throughput gain. On the output side, each latent position predicts the next bag of s tokens using a multi-hot cross-entropy loss, which is simply the average of standard cross-entropy terms over the s targets. Because this loss reuses existing fused cross-entropy kernels, no new kernels or auxiliary heads are required. After Phase 1, training enters Phase 2 – Recovery: it resumes standard next-token prediction from the saved checkpoint for the remaining steps, with all TST code removed.
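
To make the superposition phase concrete, here is a minimal PyTorch sketch of the Phase 1 input path. The shapes, bag size, and variable names are illustrative assumptions, not details from the paper.

```python
import torch

# Toy sizes for illustration (not from the paper): bag size s, a latent
# context the transformer actually processes, and a small vocab/width.
s = 4
latent_len = 512             # positions the transformer processes per step
raw_len = latent_len * s     # raw token sequence is s times longer
batch, vocab, d_model = 2, 1000, 64

embedding = torch.nn.Embedding(vocab, d_model)

# token_ids covers s times more raw text than a standard training step.
token_ids = torch.randint(vocab, (batch, raw_len))

# Split the sequence into non-overlapping bags of s contiguous tokens, then
# collapse each bag into one latent "s-token" by averaging its embeddings.
bags = token_ids.view(batch, latent_len, s)     # (batch, L/s, s)
latents = embedding(bags).mean(dim=2)           # (batch, L/s, d_model)

# `latents` feeds the transformer as a sequence of latent_len positions, so
# the step costs the same FLOPs as a standard step over latent_len tokens
# while ingesting s times more raw text.
```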

What is the multi-hot cross-entropy loss and why is it important?

The multi-hot cross-entropy (MCE) loss replaces the usual next-token prediction loss during the superposition phase. In standard training, the model predicts a single next token and computes cross-entropy against that token. But in TST, each latent position must predict a bag of s tokens. The MCE loss assigns equal probability mass of 1/s to each token in the target bag, effectively treating them all as equally likely. Mathematically, MCE reduces to a simple mean of standard cross-entropy terms over the s targets. This is crucial because it allows TST to be implemented using the same fused cross-entropy kernels already present in any major pre-training library — no custom kernel development or extra heads needed. The elegance of this design means TST can be integrated into existing training stacks with minimal engineering effort, making it accessible to a wide range of research groups and companies.
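
A minimal sketch of this reduction, assuming illustrative tensor shapes: the helper below repeats each latent position's logits once per target in its bag and calls PyTorch's standard cross-entropy, whose mean reduction gives every target the 1/s weight described above.

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, target_bags: torch.Tensor) -> torch.Tensor:
    """MCE as a mean of standard cross-entropy terms.

    logits:      (batch, positions, vocab)  one prediction per latent position
    target_bags: (batch, positions, s)      the s target tokens per position
    """
    b, p, v = logits.shape
    s = target_bags.shape[-1]
    # Repeat each position's logits once per target in its bag, then apply
    # the ordinary (fused) cross-entropy kernel. The default mean reduction
    # weights every target by 1/s, exactly the multi-hot formulation.
    flat_logits = logits.unsqueeze(2).expand(b, p, s, v).reshape(-1, v)
    flat_targets = target_bags.reshape(-1)
    return F.cross_entropy(flat_logits, flat_targets)

# Quick check with toy shapes:
logits = torch.randn(2, 8, 100)
bags = torch.randint(100, (2, 8, 4))
print(multi_hot_cross_entropy(logits, bags))   # scalar loss
```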

What are the key results and at what scales were they tested?

Nous Research tested TST across models ranging from 270M to 10B parameters, covering both dense and mixture-of-experts architectures. The most striking result came at the 10B-A1B MoE scale, where TST reached a lower final training loss than a matched-FLOPs baseline while consuming only 4,768 B200-GPU-hours versus the baseline’s 12,311 — a roughly 2.5x reduction in total pre-training time. Similar trends held at other scales, with the optimal superposition fraction (the r parameter) falling between 0.2 and 0.4. Importantly, the gains persisted across model sizes and training configurations, and the method required no changes to the architecture, optimizer, tokenizer, parallelism strategy, or training data. This consistency suggests TST is a broadly applicable efficiency boost.
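
As a back-of-the-envelope check, the snippet below splits a hypothetical step budget by a superposition fraction r in the reported 0.2 to 0.4 range and recomputes the headline speedup; the helper and the 100k-step budget are our illustration, while the GPU-hour figures come from the results above.

```python
def phase_split(total_steps: int, r: float) -> tuple[int, int]:
    """Split a run into Phase 1 (superposition) and Phase 2 (recovery)
    steps for a given superposition fraction r."""
    phase1 = int(r * total_steps)
    return phase1, total_steps - phase1

# Hypothetical 100k-step run with r = 0.3, inside the reported optimum.
print(phase_split(100_000, r=0.3))   # (30000, 70000)

# Headline result at the 10B-A1B MoE scale: baseline vs. TST GPU-hours.
print(12_311 / 4_768)                # ~2.58, i.e. the ~2.5x speedup
```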

Why is TST a practical drop-in improvement without trade-offs?

Unlike many efficiency techniques that require fundamental architectural changes or compromise final model quality, TST is designed as a drop-in improvement for existing LLM pre-training pipelines. It does not touch the model architecture, optimizer, tokenizer, parallelism strategy, or training data. The superposition phase uses only existing fused cross-entropy kernels, so no new CUDA kernels or auxiliary heads are needed. The recovery phase then removes all TST code and returns to standard training, so the final model is fully compatible with conventional inference and fine-tuning. Moreover, the method reliably achieves higher throughput during the superposition phase, yielding overall time savings without degrading loss — in fact, at larger scales the TST-trained model even shows slightly lower loss. This combination of simplicity, compatibility, and performance makes TST an attractive option for AI labs looking to accelerate training timelines with minimal risk.

How does TST compare to other throughput-improving methods like BPE?

Subword tokenizers like BPE improve throughput by compressing sequences, but they are a permanent design choice — once trained, the model is locked into that tokenizer. TST, in contrast, operates only temporarily during training and is completely removed afterward. This gives it two advantages. First, it can be applied to any existing model or tokenizer without modification. Second, its gains compound with whatever compression the tokenizer already provides. The paper argues that much of BPE’s advantage over byte-level models comes simply from shorter sequences, which means more text per FLOP; TST pushes that same lever further by artificially shortening sequences by a factor of s during the superposition phase. In effect, TST borrows the core insight behind tokenizer speedups — more text per compute — and applies it dynamically during training, without permanently altering the input representation. This makes it a complementary technique that can stack on top of existing efficiency methods, as the sketch below illustrates.
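
To see how the two levers stack, here is a rough throughput sketch under assumed numbers: a BPE tokenizer averaging about 4 bytes per token and a bag size s of 4, both hypothetical rather than figures from the paper.

```python
# Raw text processed per training step, measured in bytes.
bytes_per_token = 4.0     # assumed average BPE compression
context = 4096            # transformer positions per step
s = 4                     # assumed TST bag size during superposition

bpe_bytes_per_step = context * bytes_per_token    # tokenizer lever alone
tst_bytes_per_step = bpe_bytes_per_step * s       # TST stacks on top of it

print(bpe_bytes_per_step, tst_bytes_per_step)     # 16384.0 65536.0
```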
