Kog has released the weights and model code of Laneformer 2B on the Hugging Face Hub. Laneformer 2B is a 2.3B-parameter instruction-tuned coding model designed for high-speed decoding.

Most LLM research optimizes for benchmark quality first, and inference metrics like speed are often treated as a serving problem that comes later: train the model, then quantize it, shard it, batch inputs, cache inputs, and write better kernels. Kog took a different route and treated speed as the first objective. What changes when a model is designed from the ground up to maximize decoding speed? Which architectural choices does that rule out, and which ones still preserve strong model performance?

This blog post is the story of how Kog trained Laneformer 2B from scratch into a capable coding model while respecting the hardware constraints required by our Kog Inference Engine and the budget constraints of a startup.

Kog is a Paris-based AI infrastructure startup building a real-time inference engine for AI agents with innovative low-level GPU engineering and LLM architecture research. For more background, see Kog’s website and introductory blog post: Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request).

TL;DR

- Kog designed a lane-structured Transformer architecture for high-speed single-request decoding on our inference stack.
- Kog validated the custom architectural changes at small scale, then trained the final 2.3B model from scratch on about 4T pre-training tokens, continued on about 2T code/reasoning-heavy tokens, and instruction-tuned on about 210M tokens.
- Kog shows that, even with moderate resources, it is possible to build and deploy a custom small language model with competitive coding benchmark results in its size range. Laneformer 2B reaches 45.1% on HumanEval+ and 51.6% on MBPP+ with greedy decoding.
- Kog releases the weights, Hugging Face model code implementation, and documentation as kogai/laneformer-2b-it.
You can try the accelerated version, powered by our Kog Inference Engine, in our playground.

The idea

At low batch sizes, decode speed is not just a FLOPs problem.
A lot of time goes into moving weights, synchronizing kernels, and paying communication costs layer after layer. This overhead grows further in multi-GPU setups, which add inter-GPU communication. At the model architecture level, Tensor Parallelism (TP) is a well-known way to split work across GPUs, but each layer forces the devices to stop and exchange results before moving on to the next layer.

This led us to a simple question: can we hide those communication costs instead of paying them at every layer? Naive attempts to solve this problem can introduce ad hoc architectural changes that hurt model quality and make the method difficult to apply to an existing pre-trained architecture without leaving performance on the table.

Fast inference does not require training a new model from scratch: Kog’s inference engine already achieves very high decoding speeds on standard pre-trained architectures through low-level GPU optimization. But to go further, the runtime can no longer be treated as a separate serving layer: the model architecture itself has to expose the right structure for the engine to exploit.

Those observations left us with a single conclusion: for the fastest single-request inference, architecture and runtime should be designed together. Laneformer is our first model trained from scratch to explore that co-design point.

As a small startup, we could not yet solve this by scaling indefinitely. The target had to be deliberately constrained: design and train a small-scale model with strong coding capabilities and extreme decoding speed.

The story

Hiding overhead

Tensor Parallelism (TP) is effective because it splits large matrix operations across GPUs, and it pays for that with inter-GPU synchronization. At batch-size-one decoding, this cost is especially painful. The obvious idea is to delay the communication introduced by TP.
In practice, doing this naively hurts the model: once hidden states are no longer synchronized at the usual boundaries, quality drops off sharply, and architectural changes become necessary to keep training stable and preserve quality. We spent this phase testing variants at small scale.
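To make the per-layer synchronization cost concrete, here is a minimal pure-Python sketch of standard tensor parallelism for a single two-matrix MLP block. It is not Kog's DTP mechanism and not taken from the Laneformer code: the function names, the toy weights, and the simulated "devices" are all assumptions made for illustration. Each device computes a partial output from its shard of the hidden units, and only the sum across devices matches the unsharded result.

```python
# Toy simulation of standard tensor parallelism (TP) for one MLP block.
# Everything here is illustrative: the names, shapes, and weights are
# assumptions, and "devices" are just loop iterations, not real GPUs.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def mlp_reference(w_up, w_down, x):
    """Unsharded MLP: w_down @ relu(w_up @ x)."""
    return matvec(w_down, relu(matvec(w_up, x)))


def mlp_tensor_parallel(w_up, w_down, x, num_devices):
    """Shard the hidden dimension across devices, then sum the partial outputs."""
    hidden = len(w_up)
    if hidden % num_devices != 0:
        raise ValueError("hidden size must be divisible by num_devices")
    shard = hidden // num_devices

    partials = []
    for device in range(num_devices):
        units = slice(device * shard, (device + 1) * shard)
        up_shard = w_up[units]                       # this device's hidden units
        down_shard = [row[units] for row in w_down]  # matching input columns
        partials.append(matvec(down_shard, relu(matvec(up_shard, x))))

    # Stand-in for the all-reduce: every device needs this sum before the
    # next layer can start, which is the synchronization point.
    reduced = [sum(values) for values in zip(*partials)]
    return partials, reduced


w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
w_down = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 2.0]]
x = [1.0, 2.0]

partials, reduced = mlp_tensor_parallel(w_up, w_down, x, num_devices=2)
print("per-device partial outputs:", partials)
print("after all-reduce:", reduced)
print("unsharded reference:", mlp_reference(w_up, w_down, x))
```

No single device's partial output equals the reference, which is why a layer that consumes the result before the sum arrives sees different hidden states than a standard Transformer would.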
Interestingly, many of our more complex ideas either degraded quality or made the implementation too brittle. The useful lesson was almost embarrassingly simple: try the obvious thing first, understand why it fails, and fix it with the smallest architectural change that works. That path led to the mechanism we now call Delayed Tensor Parallelism (DTP). For the full mechanism, see our DTP deep dive.

Designing the architecture

Once DTP had a viable shape, the rest of the model design had to stay conservative. DTP was already changing one important assumption of a standard Tensor-Parallel Transformer, so we did not want to introduce unrelated architectural novelty at the same time.

There was also a speed budget. In the idealized limit where communication and other overheads are hidden, batch-size-one throughput becomes bounded by GPU memory bandwidth. Each generated token requires streaming model weights and the KV cache from GPU memory to compute units, so the maximum theoretical tokens per second depends strongly on how much data must be read per forward pass. Since we were not targeting very long contexts in this release, the main practical question became: how large can the model be?

In theory, this question could be answered from hardware bandwidth alone. In practice, another constraint mattered just as much: what budget and compute could we actually sustain, and how long could we train? The final size was therefore chosen at the intersection of three constraints: small enough to train from scratch with our resources; large enough to make coding benchmarks and post-training meaningful; and able to support DTP and our inference engine at the highest possible speed. The 2B scale turned out to sit at that intersection.
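The memory-bandwidth bound described above can be sketched as a back-of-the-envelope calculation. This is a rough illustration, not Kog's sizing method: the 2.3B parameter count comes from this post, while the 2 bytes per parameter (16-bit weights), the 1 TB/s bandwidth, the zero KV cache size, and the device count are placeholder assumptions, and the function names are hypothetical. It also assumes the bytes read per token are split evenly across devices and that all other overheads are fully hidden.

```python
def bytes_read_per_token(param_count, bytes_per_param, kv_cache_bytes=0):
    """Bytes streamed from GPU memory for one decoded token at batch size one."""
    return param_count * bytes_per_param + kv_cache_bytes


def max_tokens_per_second(param_count, bytes_per_param, bandwidth_bytes_per_s,
                          kv_cache_bytes=0, num_devices=1):
    """Idealized upper bound: per-device bandwidth divided by per-device bytes.

    Assumes weights and KV cache are sharded evenly over num_devices and that
    communication and kernel overheads are completely hidden.
    """
    total_bytes = bytes_read_per_token(param_count, bytes_per_param, kv_cache_bytes)
    per_device_bytes = total_bytes / num_devices
    return bandwidth_bytes_per_s / per_device_bytes


PARAM_COUNT = 2_300_000_000   # 2.3B parameters, from the post
BYTES_PER_PARAM = 2           # assumption: 16-bit weights
BANDWIDTH = 1e12              # assumption: placeholder 1 TB/s per device

single = max_tokens_per_second(PARAM_COUNT, BYTES_PER_PARAM, BANDWIDTH)
sharded = max_tokens_per_second(PARAM_COUNT, BYTES_PER_PARAM, BANDWIDTH,
                                num_devices=8)  # assumption: illustrative count

print(f"one device bound:    {single:.0f} tokens/s")
print(f"eight devices bound: {sharded:.0f} tokens/s")
```

Under these assumptions the bound scales inversely with model size, which is why the parameter count is the main lever once context length is fixed. Real throughput sits below this ceiling whenever overheads are not fully hidden.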
Below is our architecture card, showing how DTP and lanes fit into a mostly standard decoder Transformer. Some notes:

- The most significant architectural change is the new 8-lane system that supports DTP. To fully hide the TP communication overhead, we calibrated the delay to 2.
- We use causal Grouped-Query Attention (GQA) with 32 query heads and 16 key/value heads, which lets us shard the heads evenly across our 8 lanes.
- We use Sliding Window Attention (SWA) for 10 of the model's 15 layers, so that streaming the KV cache never adds significant latency at our context size.

Figure: Laneformer 2B architecture overview, showing the eight-lane structure used to support DTP.

Pre-training and mid-training

Pre-training, even at the 2B scale, is a mountain to climb! Fortunately, there are now many useful resources, such as the Smol training playbook, to help you get started. So first, let's devise the data recipe!

Data mixtures and recipes

Our goal was to produce a capable coding model. Since we did not have access to proprietary data, we approached the problem under an open-source constraint. Thanks to the current open ecosystem, and particularly NVIDIA's work on Nemotron, we were able to gather more than 20TB of high-quality filtered open-source data and avoid a long and costly data mixture search.

Remark: Getting 20TB of data from HF might sound simple, but working at this scale becomes a systems problem in its own right. Moving, filtering, validating, and reshuffling the data can take days, so small mistakes in the data pipeline can quickly become expensive.

Following the NVIDIA Nemotron paper, we used a