A recent benchmarking study compared Falconer, Notion, Atlassian Rovo, Claude Code, and Codex on real-world support and engineering tasks. Falconer won 64% of its head-to-head matchups against Notion and outperformed all other tools in both the documentation and code tests. The study used 200 real questions drawn from public datasets and an open-source project's codebase. Results were scored with multiple scoring methods and by three AI judges: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Falconer's answers were measured against human reference answers to check their accuracy and relevance.

Falconer led in both support and engineering tasks. In the Wix help center test, which used 100 real customer questions, Falconer had a 70.5% win rate against Notion and an 88.4% win rate against Atlassian Rovo. On the Apache Spark code test, Falconer had a 57.7% win rate against Notion and a 97.1% win rate against Atlassian Rovo. The study applied three scoring methods (weighted-sum, holistic, and Pareto) so that no single method determined the outcome, and Falconer consistently led in weighted-sum scores. According to the study, the results reflect Falconer's ability to deliver concise, actionable, and accurate answers.

The benchmarking process was designed to be transparent and reproducible. It used public datasets and disabled web access for all systems, so that results reflected retrieval and grounding rather than external resources. The study scored answers on four quality metrics (faithfulness, helpfulness, completeness, and relevance), and Falconer scored highly on all of them. The results are available for public review, so others can replicate the benchmarks and validate the findings.
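
To make the difference between the scoring methods concrete, here is a minimal Python sketch of how a weighted-sum comparison and a Pareto comparison could each decide a single head-to-head matchup, and how a win rate could be computed from the outcomes. It is an illustration under stated assumptions, not the study's implementation: the metric names come from the summary above, but the weights, the 1-to-5 score scale, the tie handling, and every function name are hypothetical. The holistic method is left out because it would be a single overall rating given directly by a judge rather than something computed from per-metric scores.

```python
# Metric names are taken from the study summary; everything else is assumed.
METRICS = ("faithfulness", "helpfulness", "completeness", "relevance")

# Hypothetical weights (the study's actual weights are not given here).
ASSUMED_WEIGHTS = {
    "faithfulness": 0.4,
    "helpfulness": 0.3,
    "completeness": 0.2,
    "relevance": 0.1,
}


def weighted_sum(scores, weights=ASSUMED_WEIGHTS):
    """Collapse per-metric scores into one number using fixed weights."""
    return sum(weights[m] * scores[m] for m in METRICS)


def weighted_winner(a, b):
    """Return 'a', 'b', or 'tie' by comparing weighted sums."""
    sa, sb = weighted_sum(a), weighted_sum(b)
    if sa > sb:
        return "a"
    if sb > sa:
        return "b"
    return "tie"


def pareto_winner(a, b):
    """Return 'a' only if a is at least as good on every metric and
    strictly better on one (and likewise for 'b'); otherwise 'tie'."""
    a_ge = all(a[m] >= b[m] for m in METRICS)
    b_ge = all(b[m] >= a[m] for m in METRICS)
    if a_ge and not b_ge:
        return "a"
    if b_ge and not a_ge:
        return "b"
    return "tie"


def win_rate(outcomes):
    """Share of decided matchups won by 'a'. Excluding ties from the
    denominator is an assumption, not the study's stated rule."""
    decided = [o for o in outcomes if o != "tie"]
    return outcomes.count("a") / len(decided) if decided else 0.0


# Made-up scores on an assumed 1-to-5 scale for one question.
answer_a = {"faithfulness": 5, "helpfulness": 4, "completeness": 3, "relevance": 4}
answer_b = {"faithfulness": 4, "helpfulness": 4, "completeness": 4, "relevance": 4}

print(weighted_winner(answer_a, answer_b))
print(pareto_winner(answer_a, answer_b))
```

With these made-up scores the two methods disagree: answer A has the higher weighted sum, but neither answer dominates the other on every metric, so the Pareto comparison calls it a tie. Disagreements of this kind are one reason a benchmark might report results under more than one scoring method.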

Source: huggingface