Hugging Face has released Pulpie, a family of Pareto-optimal models that extract the main content from HTML pages. The models reach state-of-the-art extraction quality at a fraction of the cost of existing solutions. The smallest model in the family, pulpie-orange-small, scores 0.862 ROUGE-5 F1 on WebMainBench, matching Dripper, the leading extractor, at a third of its size. Pulpie gets there with a more efficient architecture that labels every HTML block as content or boilerplate in a single forward pass. The same design makes it fast: pulpie-orange-small processes 13.7 pages per second on an NVIDIA L4 GPU, compared with 0.68 pages per second for Dripper. At $0.39/hr for an L4 instance, cleaning 1 billion pages costs about $7,900 with Pulpie versus about $159,000 with Dripper. This combination of efficiency and quality makes Pulpie suitable for large-scale web content cleaning, which matters for both pre-training and inference in language models. The models are open source and available on Hugging Face. Source: huggingface
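The cost figures follow directly from the reported throughput and the hourly GPU price. The sketch below reproduces that arithmetic; the function name `cleaning_cost_usd` is ours, and it assumes a single L4 running at the quoted sustained rate with no overhead for loading, I/O, or idle time.

```python
def cleaning_cost_usd(pages: int, pages_per_second: float, usd_per_gpu_hour: float) -> float:
    """Estimate GPU cost to process `pages` at a fixed sustained throughput."""
    gpu_seconds = pages / pages_per_second
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * usd_per_gpu_hour


PAGES = 1_000_000_000      # 1 billion pages, as in the text
L4_USD_PER_HOUR = 0.39     # L4 instance price quoted in the text

# Throughput figures quoted in the text (pages per second on one L4).
pulpie_cost = cleaning_cost_usd(PAGES, 13.7, L4_USD_PER_HOUR)
dripper_cost = cleaning_cost_usd(PAGES, 0.68, L4_USD_PER_HOUR)

print(f"pulpie-orange-small: ${pulpie_cost:,.0f}")
print(f"Dripper:             ${dripper_cost:,.0f}")
print(f"cost ratio:          {dripper_cost / pulpie_cost:.1f}x")
```

Because the hourly price is the same for both models, the cost ratio is simply the throughput ratio, roughly 20x in Pulpie's favor.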
Hugging Face highlights that poor extraction can significantly degrade model quality. The AICC study (Ma et al., 2025) showed that pre-training on cleaner data led to a 1.08 percentage point increase in average accuracy across 13 benchmarks. The gain came from training on corpora produced with model-based parsers instead of heuristic methods. The study also found that the same model trained on this cleaner data outperformed models trained on heavily filtered datasets such as FineWeb and RefinedWeb. These results underscore how much extraction quality matters for training effective language models.
To train the models, Hugging Face built a large labeled dataset of HTML pages. The team sampled 16,670 English pages from Common Crawl, labeled them with DeepSeek V3.2, and removed inconsistent or corrupted pages. The final training set contained 14,959 pages on which DeepSeek and Dripper agreed on the labels of at least 70% of the blocks. This dataset was used to fine-tune a teacher model, EuroBERT-2.1B, which was then distilled into smaller models such as Pulpie Orange Small and Orange Base. The distilled models retain most of the teacher's performance while being more cost-effective for production use.
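The agreement filter described above can be sketched as follows. This is an illustration only: the source does not say how agreement was computed, so the `LabeledPage` type, the field names, the per-block exact-match rule, and the handling of mismatched block counts are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class LabeledPage:
    # Hypothetical container: one label per HTML block from each labeler.
    url: str
    labels_a: list[str]  # e.g. "content" / "boilerplate" from the first labeler
    labels_b: list[str]  # labels for the same blocks from the second labeler


def agreement_ratio(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of blocks on which the two labelers assign the same label."""
    # Assumption: pages with no blocks or mismatched block counts count as 0 agreement.
    if not labels_a or len(labels_a) != len(labels_b):
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def filter_consistent(pages: list[LabeledPage], threshold: float = 0.70) -> list[LabeledPage]:
    """Keep pages whose block-level agreement is at least `threshold`."""
    return [p for p in pages if agreement_ratio(p.labels_a, p.labels_b) >= threshold]


# Toy data: the first page has 7/10 matching blocks, the second 6/10.
pages = [
    LabeledPage("https://example.com/a", ["content"] * 10, ["content"] * 7 + ["boilerplate"] * 3),
    LabeledPage("https://example.com/b", ["content"] * 10, ["content"] * 6 + ["boilerplate"] * 4),
]
kept = filter_consistent(pages)
print([p.url for p in kept])
```

With the 70% threshold, only the first toy page survives; the second falls just below the cutoff and is dropped.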