Model Release

Tsinghua University Unveils 'Count Anything' AI Model

Tsinghua University researchers have introduced 'Count Anything,' an AI model that counts and labels objects across diverse image types from a text prompt and outperforms many competitors in the team's own benchmark tests.

Researchers at Tsinghua University and other institutions have developed a new AI model called 'Count Anything.' The model counts and labels objects across a wide variety of image types, from satellite imagery and medical scans to everyday photos, using only a text prompt. To improve accuracy, it combines two approaches: it draws boxes around large objects and places points on small, dense targets, then merges the results without double counting. According to the paper, the model outperforms many competitors in tests but still struggles with ambiguous terms and extremely dense scenes. Source: The Decoder

The key idea behind 'Count Anything' is to combine two complementary approaches. One handles large, clearly visible objects by drawing bounding boxes, while the other handles small, densely packed objects by placing a dot on each detected target. The model merges these predictions into a final point set so that the same object is not counted twice. The system builds on SAM3, a pretrained model from Meta, and adds small adapter components for the counting task instead of retraining the whole model from scratch. Source: The Decoder
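
To make the merging step concrete, here is a minimal Python sketch of one way box and point predictions could be combined into a single point set. The article does not describe the paper's actual merge rule, so the rule used here (drop any point that falls inside a predicted box) is an assumption, and the names Box and merge_predictions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical helper, not from the paper)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self):
        # Represent the boxed object by a single point at the box center.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, point):
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def merge_predictions(boxes, points):
    """Merge box and point predictions into one point set.

    Assumed rule: a point lying inside any box is treated as a duplicate
    of that boxed object and is dropped, so the object is counted once.
    """
    merged = [box.center() for box in boxes]
    for point in points:
        if not any(box.contains(point) for box in boxes):
            merged.append(point)
    return merged


boxes = [Box(0, 0, 10, 10)]
points = [(5, 5), (20, 20), (30, 30)]
merged = merge_predictions(boxes, points)
print(len(merged))  # the count is the size of the merged point set
```

In this toy example the point at (5, 5) lies inside the box, so it is dropped and the object is counted once through the box center. A real system would need a more careful rule for crowded scenes where several small objects sit inside one large box.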

The researchers built a custom dataset called CLOC, which spans six image domains: everyday photos, satellite imagery, medical tissue samples, microscopic cell images, agricultural images, and bacterial culture photos. The dataset contains about 220,000 images, 619 categories, and 15 million labeled objects. Both reported error metrics drop sharply as the amount of CLOC training data grows, which points to the value of large cross-domain counting datasets. Source: The Decoder
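
The article does not name the two error metrics. Counting benchmarks commonly report mean absolute error (MAE) and root mean squared error (RMSE), so the sketch below assumes that pair purely for illustration; the function names and the sample counts are made up and are not results from the paper.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true object counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large miscounts more than MAE."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


# Illustrative counts for three images (not data from the paper).
predicted_counts = [10, 20, 30]
true_counts = [12, 20, 26]

print(mae(predicted_counts, true_counts))
print(rmse(predicted_counts, true_counts))
```

Lower values are better for both metrics. RMSE reacts more strongly to a few badly miscounted images, which is why the two are usually reported together.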

Key points

Count Anything counts and labels objects across a wide variety of image types, from satellite imagery and medical scans to everyday photos, using nothing more than a text prompt.
The system builds on Meta's SAM3 and combines two approaches: it draws boxes around large objects and places points on small, dense targets, then merges the results without double counting.
Trained on the custom-built CLOC dataset, the model outperforms many competitors in tests but still struggles with ambiguous terms and extremely dense scenes.
The CLOC dataset bundles six very different image domains, from everyday photos and satellite imagery to microscopy and histopathology.
It contains about 220,000 images, 619 categories, and 15 million labeled objects.
In the team's own comparison tests, Count Anything sits well ahead of competing systems like CountGD, CLIP-Count, and Grounding DINO, according to the paper.

Source: The Decoder

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.