
Falconer Outperforms Competitors in Enterprise AI Benchmarks

Falconer won 64% of its head-to-head matchups against Notion across real support and code questions, and led every other tool on both benchmarks.


A recent benchmarking study compared Falconer, Notion, Atlassian Rovo, Claude Code, and Codex on real-world support and engineering tasks. Falconer won 64% of its head-to-head matchups against Notion and outperformed all other tools in both the documentation and code tests. The study evaluated 200 real questions drawn from public datasets and an open-source project's codebase. Answers were compared against human reference answers and judged under multiple scoring methods by three AI judges: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro.

Falconer's strengths were evident in both support and engineering tasks. In the Wix help center test, which used 100 real customer questions, Falconer posted a 70.5% win rate against Notion and an 88.4% win rate against Atlassian Rovo. On the Apache Spark code test, Falconer achieved a 57.7% win rate against Notion and a 97.1% win rate against Atlassian Rovo. The study used three scoring methods (weighted-sum, holistic, and Pareto) so that no single method decided the outcome, and Falconer consistently led in weighted-sum scores. The results point to Falconer's accuracy and its concise, actionable answers.
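As a rough check on how the headline figure relates to the per-benchmark numbers, the sketch below pools the two reported win rates against Notion, weighting each by its question count. The function name `pooled_win_rate` is hypothetical, and the count of 100 questions for the Spark test is an assumption inferred from the 200-question total; the study may handle ties or weighting differently.

```python
def pooled_win_rate(rates_and_counts):
    """Combine per-benchmark win rates (in percent), weighted by question count."""
    total_questions = sum(count for _, count in rates_and_counts)
    weighted = sum(rate * count for rate, count in rates_and_counts)
    return weighted / total_questions


# Reported Falconer-vs-Notion win rates: Wix help center and Apache Spark.
# Assumption: 100 questions per benchmark (only the Wix count is stated).
vs_notion = [(70.5, 100), (57.7, 100)]
overall = pooled_win_rate(vs_notion)
print(round(overall, 1))  # prints 64.1
```

Under these assumptions the pooled rate is about 64.1%, which is consistent with the reported 64% overall win rate.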

The benchmarking process was designed to be transparent and reproducible. It used public datasets and disabled web access for all systems, so that results reflected retrieval and grounding rather than external resources. The study emphasized quality metrics such as faithfulness, helpfulness, completeness, and relevance, with Falconer scoring highly across all categories. The results are available for public review, allowing others to replicate the benchmarks and validate the findings.
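To illustrate why a study might report more than one scoring method, here is a minimal sketch of how a weighted-sum comparison and a Pareto comparison can disagree on the same pair of answers. The metric names come from the study; the weights, the example scores, and both function names are hypothetical, and the study's actual formulas are not given in this summary. The holistic method, a judge's single overall verdict, is not modeled here.

```python
METRICS = ("faithfulness", "helpfulness", "completeness", "relevance")


def weighted_sum_winner(a, b, weights):
    """Pick the answer with the higher weighted total across all metrics."""
    score_a = sum(weights[m] * a[m] for m in METRICS)
    score_b = sum(weights[m] * b[m] for m in METRICS)
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"


def pareto_winner(a, b):
    """An answer wins only if it is at least as good on every metric
    and strictly better on at least one; otherwise the pair is a tie."""
    a_at_least = all(a[m] >= b[m] for m in METRICS)
    b_at_least = all(b[m] >= a[m] for m in METRICS)
    if a_at_least and not b_at_least:
        return "a"
    if b_at_least and not a_at_least:
        return "b"
    return "tie"


# Hypothetical per-metric scores for two answers to the same question.
answer_a = {"faithfulness": 5, "helpfulness": 4, "completeness": 3, "relevance": 4}
answer_b = {"faithfulness": 4, "helpfulness": 4, "completeness": 4, "relevance": 4}
# Hypothetical weights that count faithfulness double.
weights = {"faithfulness": 2, "helpfulness": 1, "completeness": 1, "relevance": 1}

ws_result = weighted_sum_winner(answer_a, answer_b, weights)
pareto_result = pareto_winner(answer_a, answer_b)
```

In this example the weighted sum favors answer A (21 to 20), while the Pareto comparison calls it a tie because each answer is better on one metric. A system that leads under several methods is less likely to owe its lead to one particular choice of weights.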

Source: Hugging Face

Key points

Falconer won 64% of its head-to-head matchups against Notion across real support and code questions.
Falconer outperformed all competitors in both documentation and code tests.
The study evaluated 200 real questions from public datasets and an open-source project's codebase.
Answers were scored by three AI judges: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro.
Falconer achieved a 70.5% win rate against Notion in the Wix help center test.
Falconer achieved a 97.1% win rate against Atlassian Rovo in the Apache Spark code test.
The benchmarking process used public datasets and disabled web access for all systems.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.