Hugging Face Introduces Trimming Technique for Lightweight Models
Hugging Face's trimming technique reduces model size by up to 25% without performance loss, as demonstrated with GPT2-small.
Hugging Face has introduced a technique called trimming, which produces a model that is lighter than the original while maintaining its performance. The method requires no retraining and can be run on a simple CPU. In the blog post's Practice section, which reports the experiments, key points are listed in 🧠 Key Takeaways boxes, and the conclusion summarizes the advantages of the approach. To support the discussion, the blog post presents 5,526 models produced by applying the technique.
Trimming can be viewed as a subset of pruning. As with pruning, the goal is to modify or remove model weights to make the model lighter, in number of parameters or in memory size. What sets trimming apart is that it focuses exclusively on the parts of the architecture tied to the vocabulary, whereas pruning generally modifies the weights or layers of the rest of the architecture (i.e., the backbone). In trimming, tokens are removed from the model's original vocabulary, so the tokenizer must be updated as well. The final embedding layer, which produces the probability distribution over the model's vocabulary, is modified accordingly, as is the input layer if embeddings are tied.
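The mechanics described above can be sketched on a toy example. This is a minimal illustration in plain Python, not the blog post's implementation: the function name `trim_vocabulary`, the toy vocabulary, and the list-of-rows embedding matrix are all assumptions made for the example.

```python
def trim_vocabulary(embeddings, vocab, keep_tokens):
    """Keep only the embedding rows whose token is in `keep_tokens`.

    embeddings: list of rows, one vector per original token id
    vocab: dict mapping token string -> original token id
    keep_tokens: iterable of token strings to retain
    Returns (new_embeddings, new_vocab, old_to_new).
    """
    keep = set(keep_tokens)
    # Sort by original id so the kept tokens preserve their relative order.
    kept = sorted((old_id, tok) for tok, old_id in vocab.items() if tok in keep)
    new_embeddings = [embeddings[old_id] for old_id, _ in kept]
    # The tokenizer side: kept tokens are renumbered contiguously from 0.
    new_vocab = {tok: new_id for new_id, (_, tok) in enumerate(kept)}
    old_to_new = {old_id: new_id for new_id, (old_id, _) in enumerate(kept)}
    return new_embeddings, new_vocab, old_to_new


# Hypothetical toy model: 5 tokens, 2-dimensional embeddings.
vocab = {"hello": 0, "bonjour": 1, "world": 2, "monde": 3, "!": 4}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

# Keep only the tokens of interest (here, the English ones).
new_emb, new_vocab, old_to_new = trim_vocabulary(
    embeddings, vocab, ["hello", "world", "!"]
)
```

The backbone is untouched: only the vocabulary mapping and the rows of the embedding matrix change. With tied embeddings, the same trimmed matrix would serve as both the input and the output layer.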
The technique is particularly useful for multilingual models. A multilingual model's vocabulary size may be poorly suited to a given use for two main reasons: first, not all of its languages are necessarily of interest; second, the vocabulary size may not be a multiple of 8 or 64, which is preferred for efficient GPU usage. According to Karpathy's observations, following this simple recommendation can speed up model training by 25%. Since 2023-2024 it has become common practice, and models generally use a vocabulary size that is a multiple of 8 or 64 by default. For older models, however, it may be worth modifying the size.
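The size rule can be made concrete with a little arithmetic. The sketch below assumes GPT-2's vocabulary of 50,257 tokens as the example; the helper names are hypothetical. Trimming removes tokens, so it rounds down, while padding the vocabulary with unused entries rounds up.

```python
def round_down_to_multiple(vocab_size, multiple):
    """Largest multiple of `multiple` that is <= vocab_size (trimming)."""
    return (vocab_size // multiple) * multiple


def round_up_to_multiple(vocab_size, multiple):
    """Smallest multiple of `multiple` that is >= vocab_size (padding)."""
    return -(-vocab_size // multiple) * multiple


gpt2_vocab = 50257  # GPT-2 vocabulary size

down_8 = round_down_to_multiple(gpt2_vocab, 8)    # 50256: remove 1 token
down_64 = round_down_to_multiple(gpt2_vocab, 64)  # 50240: remove 17 tokens
up_64 = round_up_to_multiple(gpt2_vocab, 64)      # 50304: add 47 unused entries
```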
Reducing the vocabulary (to the languages of interest and to a multiple of 8 or 64) reduces the model size in both number of parameters and memory size. This is illustrated with the example of GPT2-small by Radford, Wu et al. (2019). This model makes it easy to understand the second reason above (a vocabulary size that is not a multiple of 8 or 64), and it also shows that trimming works on monolingual models, whereas most examples in the Practice section focus on multilingual models.
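As a rough illustration of the parameter savings, each removed token drops one row of the embedding matrix. The sketch below assumes GPT2-small's hidden size of 768 and vocabulary of 50,257 tokens, with tied input and output embeddings; the function names are hypothetical and the figures are plain arithmetic, not results from the blog post.

```python
HIDDEN_SIZE = 768       # GPT2-small embedding dimension
ORIGINAL_VOCAB = 50257  # GPT2-small vocabulary size


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in one vocabulary embedding matrix."""
    return vocab_size * hidden_size


def params_saved(original_vocab, trimmed_vocab, hidden_size=HIDDEN_SIZE, tied=True):
    """Weights removed by trimming; counted twice if embeddings are not tied."""
    per_matrix = (original_vocab - trimmed_vocab) * hidden_size
    return per_matrix if tied else 2 * per_matrix


original = embedding_params(ORIGINAL_VOCAB)            # 38,597,376 weights
saved_to_8 = params_saved(ORIGINAL_VOCAB, 50256)       # 768 weights (1 token)
saved_to_64 = params_saved(ORIGINAL_VOCAB, 50240)      # 13,056 weights (17 tokens)
```

For a monolingual model like this one the saving from rounding alone is small; the larger reductions come from dropping whole languages from a multilingual vocabulary.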
Source: [huggingface](https://huggingface.co/blog/lbourdois/introduction-to-trimming)
Key points
- Hugging Face introduced trimming, a technique that creates lighter models without performance loss.
- Trimming requires no retraining and can be run on a simple CPU.
- Trimming focuses exclusively on the parts of the architecture related to vocabulary.
- Trimming removes tokens from the model's original vocabulary and updates the tokenizer.
- The technique is particularly useful for multilingual models.
- A vocabulary size that is not a multiple of 8 or 64 can make GPU usage less efficient.
- Karpathy's observations indicate that using a vocabulary size that is a multiple of 8 or 64 can speed up model training by 25%.