Amazon SageMaker AI Helps Train Azerbaijani Language Model
Azercell Telecom used Amazon SageMaker AI to develop an Azerbaijani large language model, achieving 23% higher training throughput and 58% lower GPU memory usage in six weeks.
Azercell Telecom LLC, Azerbaijan’s leading telecommunications provider, collaborated with the AWS Generative AI Innovation Center to build an Azerbaijani large language model (LLM) on Amazon SageMaker AI. The project addressed the challenges of adapting foundation models (FMs) to a morphologically rich language with limited training data and no existing blueprint for efficient LLM training in Azerbaijani. Over six weeks, the team established a production-ready framework that delivered 23% higher training throughput and 58% lower peak GPU memory usage through kernel-level optimizations on an ml.p5.48xlarge instance. A custom tokenizer also cut the number of tokens per word roughly in half, a 2× efficiency gain that effectively doubles the amount of Azerbaijani text that fits within the model’s context window.
The solution follows three sequential stages: tokenizer development, continued pre-training (CPT), and supervised fine-tuning with Low-Rank Adaptation (LoRA). Stage 1 created an efficient tokenizer for Azerbaijani, which halved the number of tokens per word compared to baseline English-optimized tokenizers. Stage 2 adapted an FM (Llama 3.2 1B) to Azerbaijani using distributed training and Liger Kernel optimizations on Amazon SageMaker AI training jobs. Stage 3 turned the pre-trained model into a conversational assistant through LoRA, which trains small low-rank adapter matrices instead of the full weight matrices and so significantly reduces the number of trainable parameters. Training jobs were launched from Amazon SageMaker Unified Studio; each job provisioned fresh Amazon EC2 instances and terminated them on completion to avoid idle cluster costs.
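To make the Stage 3 parameter savings concrete, the following is a minimal sketch of the arithmetic behind LoRA for a single weight matrix. The matrix width (2048) and the rank (16) are illustrative assumptions, not values reported in the source, and the helper names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative values only (assumptions, not from the source):
# one square projection matrix of width 2048 and a LoRA rank of 16.
D_IN, D_OUT, RANK = 2048, 2048, 16

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {RANK}): {lora:,} trainable weights")
print(f"LoRA share of full: {lora / full:.2%}")
```

Under these assumed dimensions the adapter holds roughly 1.6% of the weights that full fine-tuning would update for that matrix. The actual reduction in the project depends on the rank and on which layers were adapted, neither of which is stated here.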
For the Azerbaijani tokenizer, the team trained a custom Byte-Level Byte-Pair Encoding (BBPE) tokenizer with a vocabulary of 100k tokens. This improved encoding efficiency by about 2×, reducing the average number of tokens per word from 3.22 to 1.59. The custom tokenizer also achieved a Bits-Per-Byte (BPB) score of 0.5795 on the validation set, compared to the baseline’s 0.6830. Because lower BPB is better, this indicates that the efficiency gain came without a quality trade-off. The framework’s modular architecture allowed each stage to be optimized independently, with tokenizer improvements carrying over to the subsequent training phases. *Source: [awsml](https://aws.amazon.com/blogs/machine-learning/training-azerbaijani-language-models-on-amazon-sagemaker-ai/)*
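The tokenizer figures above can be related to each other with a few lines of arithmetic. The sketch below uses the reported tokens-per-word values (3.22 and 1.59) and a commonly used definition of bits-per-byte (total negative log-likelihood converted from nats to bits, divided by the number of UTF-8 bytes). How the source computed its BPB scores is not stated, so that formula is an assumption, as are the 8192-token context length and the function names.

```python
import math


def words_in_context(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of whole words that fit in a context window."""
    return math.floor(context_tokens / tokens_per_word)


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


# Tokens-per-word values reported in the article.
BASELINE_TPW = 3.22
CUSTOM_TPW = 1.59

# Hypothetical context length, chosen only for illustration.
CONTEXT_TOKENS = 8192

print(f"efficiency gain: {BASELINE_TPW / CUSTOM_TPW:.2f}x")
print("words per context (baseline):", words_in_context(CONTEXT_TOKENS, BASELINE_TPW))
print("words per context (custom):  ", words_in_context(CONTEXT_TOKENS, CUSTOM_TPW))
```

Because BPB is normalized by bytes rather than by tokens, it allows a comparison between models that use different tokenizers, which is why it is a reasonable check that fewer tokens per word did not come at the cost of modeling quality.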
Key points
- Azercell Telecom used Amazon SageMaker AI to develop an Azerbaijani large language model.
- The framework achieved 23% higher training throughput and 58% lower peak GPU memory usage.
- A custom tokenizer improved encoding efficiency by 2×, reducing tokens per word from 3.22 to 1.59.
- The framework’s modular architecture allows independent optimization of each training stage.
- Training jobs provisioned fresh Amazon EC2 instances to avoid idle cluster costs.
- The custom tokenizer achieved a Bits-Per-Byte (BPB) score of 0.5795 on the validation set.
- The solution follows three sequential stages: tokenizer development, continued pre-training, and supervised fine-tuning with LoRA.