Hugging Face has released DoctoBERT, a French medical encoder it presents as state-of-the-art, alongside a new pretraining corpus called FineMed. The model is designed to improve medical natural language processing (NLP) by pretraining on filtered web content rather than on traditional hand-curated corpora. The release includes a rephrased version of the corpus, FineMed-rephrased, and a suite of tools for reproducible data curation. The project is part of a broader effort to address the limitations of small, manually curated datasets in specialized domains like medicine. Source: huggingface
FineMed is built with a three-stage pipeline. The first stage extracts medical documents from web-derived datasets such as FineWeb-2, FinePDFs, and FineWiki. These datasets are already cleaned and filtered for quality, but the team refined them further with a multilingual domain classifier that isolates medical content. Lightweight annotators then score each document on three axes: subdomain, educational quality, and medical-term density. Finally, these scores are used to filter and rephrase the data, making it more useful for pretraining. Source: huggingface
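The annotate-then-filter step can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `AnnotatedDoc` schema, the `filter_docs` helper, the score ranges, the subdomain labels, and the 0.5 threshold are hypothetical and are not taken from the FineMed tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AnnotatedDoc:
    """A document plus its three annotations (hypothetical schema)."""
    text: str
    subdomain: str              # label set is assumed, e.g. "cardiology"
    educational_quality: float  # assumed to be normalized to [0, 1]
    term_density: float         # assumed to be normalized to [0, 1]


def filter_docs(
    docs: Iterable[AnnotatedDoc],
    keep: Callable[[AnnotatedDoc], bool],
) -> List[AnnotatedDoc]:
    """Keep only the documents for which the predicate returns True."""
    return [doc for doc in docs if keep(doc)]


# Toy documents with made-up scores, for illustration only.
docs = [
    AnnotatedDoc("Prise en charge de l'infarctus du myocarde.", "cardiology", 0.4, 0.8),
    AnnotatedDoc("Conseils pour bien dormir.", "general", 0.9, 0.1),
    AnnotatedDoc("Posologie de l'amoxicilline chez l'enfant.", "pharmacology", 0.6, 0.7),
]

# Single-axis filter on term density; the 0.5 threshold is illustrative.
dense = filter_docs(docs, lambda d: d.term_density >= 0.5)
print([d.subdomain for d in dense])
```

Passing the filter as a predicate keeps the three axes interchangeable, so the same helper can select on educational quality or on subdomain without any change to the surrounding code.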
The team found that medical-term density was the most effective single-axis filter, outperforming both educational quality and subdomain classification. This finding contrasts with practice for decoder-based large language models (LLMs), where educational quality is often prioritized. For encoders, especially in domains with dense terminology like medicine, the density of domain-specific terms plays a crucial role in training effectiveness. The rephrasing process, inspired by Massive Genre–Audience (MGA) reformulation, also helps by increasing the variety of contexts and styles in which medical concepts appear, making the data more robust for downstream tasks. Source: huggingface
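To make the idea of medical-term density concrete, here is a rough Python sketch that measures it as the share of tokens found in a medical lexicon. This is an assumed, simplified definition: the source does not say how the annotators compute the score, and the `MEDICAL_TERMS` lexicon and `term_density` function below are hypothetical stand-ins.

```python
import re
from typing import Set

# Toy lexicon for illustration; a real one would be far larger.
MEDICAL_TERMS: Set[str] = {"infarctus", "myocarde", "thrombose", "anticoagulant"}

# Split on runs of word characters; in Python 3, \w matches accented letters.
TOKEN_RE = re.compile(r"\w+")


def term_density(text: str, lexicon: Set[str] = MEDICAL_TERMS) -> float:
    """Return the fraction of tokens that appear in the lexicon."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


print(term_density("Infarctus du myocarde et thrombose"))  # 3 of 5 tokens match
print(term_density("Il fait beau ce matin"))               # no token matches
```

A lexicon lookup like this is cheap enough to run over a web-scale corpus, but it misses inflected forms and multi-word terms, so it should be read as an intuition aid rather than a description of the released annotators.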