Model Release

Startup Subquadratic Claims Breakthrough in LLM Efficiency

Miami-based startup Subquadratic claims its new model SubQ is 56 times faster than FlashAttention models in speed tests, with a context window up to 12 million tokens.

Photo: Magda Ehlers / Pexels

Subquadratic, a Miami-based AI startup, emerged from stealth mode with a bold claim: it had solved a decade-old mathematical bottleneck hindering large language models. The company introduced its new model, SubQ, which it asserts is faster, cheaper, and more energy-efficient than existing models. SubQ can process up to 12 times as much text at once as most other models, enabling data-heavy tasks like analyzing hundreds of documents or entire code bases. According to Subquadratic, SubQ matches the performance of top models from Google DeepMind, OpenAI, and Anthropic on key tasks like coding. However, the company initially faced skepticism due to a lack of concrete evidence beyond self-published test scores. Subquadratic has since released independent evaluations by third-party firm Appen, which appear to support many of its claims. Source: MIT Technology Review

The core of Subquadratic’s innovation lies in its use of sparse attention, a technique that reduces the computational burden of the dense attention used in most large language models. Dense attention compares every token with every other token by multiplying their numerical representations, so the number of computations grows quadratically as the text gets longer: doubling the length roughly quadruples the work. This results in high energy consumption and computational costs. Subquadratic’s sparse attention method selects only some tokens for multiplication, significantly cutting down on computations. Cofounder and CTO Alex Whedon acknowledges that sparse attention itself is not new, but the company claims it has developed a dynamic selection mechanism that outperforms previous sparse-attention techniques. Source: MIT Technology Review
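To make the scaling argument concrete, the short sketch below counts attention score computations under dense attention and under a generic sparse scheme in which each token attends to a fixed budget of other tokens. It is an illustration only: the function names and the budget of 4,096 tokens are assumptions made for this example, and the code does not represent Subquadratic’s selection mechanism.

```python
def dense_pairs(n_tokens: int) -> int:
    """Score computations when every token attends to every token."""
    return n_tokens * n_tokens


def sparse_pairs(n_tokens: int, budget: int) -> int:
    """Score computations when each token attends to at most `budget` tokens.

    `budget` is a hypothetical fixed per-token limit, not a SubQ parameter.
    """
    return n_tokens * min(budget, n_tokens)


if __name__ == "__main__":
    budget = 4096  # assumed value, chosen only for illustration
    for n in (1_000, 100_000, 1_000_000, 12_000_000):
        dense = dense_pairs(n)
        sparse = sparse_pairs(n, budget)
        print(f"{n:>12,} tokens: dense={dense:.2e} sparse={sparse:.2e} "
              f"ratio={dense / sparse:,.0f}x")
```

Under these assumptions, doubling the text length quadruples the dense count but only doubles the sparse one. The sketch counts only the scoring step; a real system also has to spend computation deciding which tokens to keep.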

Subquadratic’s claims about cost savings are difficult to verify without broader access to SubQ, but the company estimates that running Nvidia’s RULER 128 test costs $8 with SubQ, compared with $2,600 with Anthropic’s Opus 4.6. SubQ also boasts a context window of up to 12 million tokens, far exceeding the one-million-token limit of most top models. In a demo, SubQ successfully processed information from 400 documents in seconds, while Perplexity, a popular AI search engine, failed the same task. Source: MIT Technology Review
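For a sense of scale, the ratios implied by those figures can be worked out directly. The snippet below is plain arithmetic on the numbers quoted above; the variable names are chosen for this example, and the dollar amounts are Subquadratic’s own estimates rather than independently measured costs.

```python
# Figures as reported in the article (company estimates, not verified).
subq_cost_usd = 8            # SubQ on the RULER 128 test
opus_cost_usd = 2_600        # Opus 4.6 on the same test
subq_context = 12_000_000    # SubQ context window, in tokens
typical_context = 1_000_000  # limit cited for most top models

cost_ratio = opus_cost_usd / subq_cost_usd
context_ratio = subq_context / typical_context

print(f"Cost ratio: {cost_ratio:.0f}x")        # Cost ratio: 325x
print(f"Context ratio: {context_ratio:.0f}x")  # Context ratio: 12x
```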

Key points

Subquadratic claims its new model SubQ is 56 times faster than FlashAttention models in speed tests.
SubQ can process up to 12 times as much text at once as most other models.
Subquadratic says SubQ matches the performance of top models from Google DeepMind, OpenAI, and Anthropic on coding tasks.
SubQ has a context window up to 12 million tokens, far exceeding the one-million-token limit of most top models.
Subquadratic claims its sparse-attention approach dynamically selects important tokens for multiplication.
Appen’s tests found SubQ scored 89.7% on LiveCodeBench, comparable to other top coding models.
Subquadratic estimates that running Nvidia's RULER 128 test costs $8 with SubQ, compared with $2,600 with Anthropic's Opus 4.6.

Source: MIT Technology Review

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.