# Training Configuration Files

This directory contains configuration files for different model sizes and use cases.

## Available Configurations

### Small Models (Testing)

- `training_config.yaml` - Default configuration for small models (DialoGPT-small)
- Memory: ~1GB VRAM
- Batch size: 8
- No quantization

### Medium Models (8B)

- `training_config_large.yaml` - Configuration for 8B models (Llama-3.2-8B)
- Memory: ~12GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 16-64
- 4-bit quantization enabled

### Large Models (13B)

- `training_config_13b.yaml` - Configuration for 13B models
- Memory: ~16GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 32-128
- Higher LoRA ranks (32-128)

### Extra Large Models (70B)

- `training_config_70b.yaml` - Configuration for 70B models
- Memory: ~40GB+ VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 64-256
- Maximum LoRA ranks (64-256)
- Multi-GPU support with FSDP

## Configuration Parameters

The key parameters used across these configs are described below; an illustrative sketch of how they fit together in a config file is given at the end of this README.

### Model Settings

- `load_in_4bit`: Enable 4-bit quantization (recommended for large models)
- `gradient_checkpointing`: Recompute activations during the backward pass to trade compute for memory
- `use_flash_attention_2`: Faster attention computation if available

### Adapter Settings

- `r`: LoRA rank (higher rank = more trainable parameters and more capacity)
- `lora_alpha`: LoRA scaling factor (typically 2x the rank)
- `init_lora_weights`: Set to `true` for identity initialization

### Training Settings

- `per_device_batch_size`: Usually 1 for large models
- `gradient_accumulation_steps`: Multiplies the per-device batch size to give the effective batch size
- `learning_rate`: Lower for larger models
- `bf16`: Use bfloat16 for better numerical stability

## Usage

```bash
# For 8B models
python scripts/train_progressive.py --config config/training_config_large.yaml

# For 13B models
python scripts/train_progressive.py --config config/training_config_13b.yaml

# For 70B models (requires multiple GPUs)
python scripts/train_progressive.py --config config/training_config_70b.yaml
```

## Memory Requirements

| Model Size | VRAM (4-bit) | VRAM (16-bit) | Recommended GPUs |
|------------|--------------|---------------|------------------|
| 8B         | 12-16GB      | 32GB          | 1x RTX 4090      |
| 13B        | 16-20GB      | 52GB          | 1x A100          |
| 70B        | 40-60GB      | 140GB         | 2x A100          |

## Tips for Large Models

1. **Start with smaller models** to validate your approach
2. **Use gradient checkpointing** to reduce memory usage
3. **Monitor GPU memory** during training
4. **Use lower learning rates** for stability
5. **Consider a multi-GPU setup** for 70B+ models
6. **Enable flash attention** if available for speed

## Troubleshooting

- **OOM errors**: Reduce batch size or enable gradient checkpointing
- **Slow training**: Enable flash attention and use bf16
- **Poor convergence**: Adjust learning rate or warmup steps
- **Multi-GPU issues**: Check FSDP configuration
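
## Example Configuration Sketch

The exact schema is defined by the YAML files in this directory; the fragment below is only an illustrative sketch that groups the parameters documented above. The `model`/`adapter`/`training` section names and the specific values are assumptions for illustration, not copied from the shipped configs.

```yaml
# Illustrative only - see training_config_large.yaml for the real schema and values.
model:
  load_in_4bit: true            # 4-bit quantization, recommended for 8B+ models
  gradient_checkpointing: true  # trade extra compute for lower activation memory
  use_flash_attention_2: true   # faster attention if the kernel is available

adapter:
  r: 16                         # LoRA rank; higher = more trainable parameters
  lora_alpha: 32                # scaling factor, typically 2x the rank
  init_lora_weights: true       # identity initialization

training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 32   # effective batch size = 1 x 32 = 32 per GPU
  learning_rate: 2.0e-4
  bf16: true                        # bfloat16 for numerical stability
```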
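
If a run hits OOM, the usual first adjustment (see Troubleshooting above) is to lower `per_device_batch_size`, raise `gradient_accumulation_steps` by the same factor so the effective batch size is unchanged, and enable `gradient_checkpointing`. The values below are examples only:

```yaml
# Illustrative OOM mitigation - same keys as documented above, example values.
model:
  gradient_checkpointing: true    # recompute activations instead of storing them
training:
  per_device_batch_size: 1        # was 4: lower the per-device batch to cut peak memory
  gradient_accumulation_steps: 32 # was 8: raised 4x so the effective batch size (32) is unchanged
```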