# Training Configuration Files

This directory contains configuration files for different model sizes and use cases.

## Available Configurations

### Small Models (Testing)

- `training_config.yaml` - Default configuration for small models (DialoGPT-small)
- Memory: ~1GB VRAM
- Batch size: 8
- No quantization

### Medium Models (8B)

- `training_config_large.yaml` - Configuration for 8B models (Llama-3.2-8B)
- Memory: ~12GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 16-64
- 4-bit quantization enabled
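As a rough orientation, the 8B configuration combines the parameters documented under "Configuration Parameters" below along these lines (the `model:`/`training:` nesting and the concrete values are illustrative, not copied from the actual file):

```yaml
# Illustrative excerpt only -- the real training_config_large.yaml may differ.
model:
  load_in_4bit: true            # 4-bit quantization to fit in ~12GB VRAM
  gradient_checkpointing: true  # trade compute for memory
training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 32   # anywhere in the documented 16-64 range
```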
### Large Models (13B)

- `training_config_13b.yaml` - Configuration for 13B models
- Memory: ~16GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 32-128
- Higher LoRA ranks (32-128)

### Extra Large Models (70B)

- `training_config_70b.yaml` - Configuration for 70B models
- Memory: ~40GB+ VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 64-256
- Maximum LoRA ranks (64-256)
- Multi-GPU support with FSDP

## Configuration Parameters

### Model Settings

- `load_in_4bit`: Enable 4-bit quantization (recommended for large models)
- `gradient_checkpointing`: Trade compute for memory
- `use_flash_attention_2`: Faster attention computation if available
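A minimal sketch of how these keys might appear in a config file (the `model:` grouping is an assumption; only the key names come from this list):

```yaml
# Hypothetical model section; adjust to the actual config layout.
model:
  load_in_4bit: true            # quantize weights to 4-bit for large models
  gradient_checkpointing: true  # recompute activations to save memory
  use_flash_attention_2: true   # enable only if flash-attn is installed
```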
### Adapter Settings

- `r`: LoRA rank (higher ranks add trainable parameters and capacity, at the cost of memory)
- `lora_alpha`: LoRA scaling factor (typically 2x the rank)
- `init_lora_weights`: Set to `true` for identity initialization
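For example, a 13B-scale run might use an adapter block along these lines (the `adapter:` grouping is an assumption; rank and alpha are picked from the ranges listed above):

```yaml
# Hypothetical adapter section; alpha follows the "about 2x the rank" rule of thumb.
adapter:
  r: 64                    # LoRA rank, e.g. from the 32-128 range used for 13B
  lora_alpha: 128          # scaling factor, typically 2x the rank
  init_lora_weights: true  # adapter starts as an identity (no-op) update
```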
### Training Settings

- `per_device_batch_size`: Usually 1 for large models
- `gradient_accumulation_steps`: Multiplies the per-device batch size (effective batch size = per-device batch size x accumulation steps x number of GPUs)
- `learning_rate`: Lower for larger models
- `bf16`: Use bfloat16 for better numerical stability
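Putting these together, a large-model training block might look roughly like this (values and nesting are illustrative, not taken from the shipped configs):

```yaml
# Hypothetical training section for a large model.
training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 64   # effective batch size: 1 x 64 x num_gpus
  learning_rate: 1.0e-4             # lower learning rates for larger models
  bf16: true                        # bfloat16 for numerical stability
```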
## Usage

```bash
# For 8B models
python scripts/train_progressive.py --config config/training_config_large.yaml

# For 13B models
python scripts/train_progressive.py --config config/training_config_13b.yaml

# For 70B models (requires multiple GPUs)
python scripts/train_progressive.py --config config/training_config_70b.yaml
```

## Memory Requirements

| Model Size | VRAM (4-bit) | VRAM (16-bit) | GPUs Recommended |
|------------|--------------|---------------|------------------|
| 8B | 12-16GB | 32GB | 1x RTX 4090 |
| 13B | 16-20GB | 52GB | 1x A100 |
| 70B | 40-60GB | 140GB | 2x A100 |

## Tips for Large Models

1. **Start with smaller models** to validate your approach
2. **Use gradient checkpointing** to reduce memory usage
3. **Monitor GPU memory** during training
4. **Use lower learning rates** for stability
5. **Consider multi-GPU setup** for 70B+ models
6. **Enable flash attention** if available for speed

## Troubleshooting

- **OOM errors**: Reduce batch size or enable gradient checkpointing
- **Slow training**: Enable flash attention, use bf16
- **Poor convergence**: Adjust learning rate or warmup steps
- **Multi-GPU issues**: Check FSDP configuration
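For the OOM case in particular, the usual first adjustments map directly onto the parameters above, for example (hypothetical snippet; the exact nesting depends on the config file you are editing):

```yaml
# Hypothetical OOM mitigation using the keys documented in this README.
model:
  load_in_4bit: true                # quantize weights
  gradient_checkpointing: true      # largest single memory saving
training:
  per_device_batch_size: 1          # minimum per-device batch
  gradient_accumulation_steps: 128  # preserve the effective batch size
```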