# Training Configuration Files
This directory contains configuration files for different model sizes and use cases.
## Available Configurations

### Small Models (Testing)

`training_config.yaml` - Default configuration for small models (DialoGPT-small)

- Memory: ~1GB VRAM
- Batch size: 8
- No quantization
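
To make this concrete, here is a minimal YAML sketch of these settings. The section names and nesting are assumptions for illustration; consult `training_config.yaml` itself for the real layout.

```yaml
# Illustrative sketch only -- section names and nesting are assumed, not copied from the repo.
model:
  load_in_4bit: false           # small models fit in ~1GB VRAM without quantization
  gradient_checkpointing: false # not needed at this scale
training:
  per_device_batch_size: 8
```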
### Medium Models (8B)

`training_config_large.yaml` - Configuration for 8B models (Llama-3.2-8B)

- Memory: ~12GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 16-64
- 4-bit quantization enabled
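
In the same assumed layout, the 8B configuration mainly changes quantization and batching, roughly:

```yaml
# Illustrative sketch only -- values drawn from the ranges above, nesting assumed.
model:
  load_in_4bit: true               # required to fit in ~12GB VRAM
  gradient_checkpointing: true
training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 16  # anywhere in the 16-64 range noted above
```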
### Large Models (13B)

`training_config_13b.yaml` - Configuration for 13B models

- Memory: ~16GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 32-128
- Higher LoRA ranks (32-128)
### Extra Large Models (70B)

`training_config_70b.yaml` - Configuration for 70B models

- Memory: ~40GB+ VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 64-256
- Maximum LoRA ranks (64-256)
- Multi-GPU support with FSDP
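
A hedged sketch of the 70B-scale settings described above; the FSDP key and the nesting are placeholders, so check `training_config_70b.yaml` and the `*_fsdp.yaml` variants for the actual schema.

```yaml
# Illustrative sketch only -- nesting and the FSDP key are assumed, not copied from the repo.
model:
  load_in_4bit: true
  gradient_checkpointing: true
adapter:
  r: 64                            # this README suggests ranks of 64-256 at 70B scale
  lora_alpha: 128                  # roughly 2x the rank
training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 64  # anywhere in the 64-256 range noted above
  bf16: true
  fsdp: full_shard auto_wrap       # placeholder; the real FSDP settings may live elsewhere
```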
## Configuration Parameters

### Model Settings

- `load_in_4bit`: Enable 4-bit quantization (recommended for large models)
- `gradient_checkpointing`: Trade compute for memory
- `use_flash_attention_2`: Faster attention computation if available
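
As a sketch, these keys might sit together in a `model` section like the following (the nesting is an assumption; verify against the shipped YAML files):

```yaml
# Hypothetical nesting -- verify against the shipped config files.
model:
  load_in_4bit: true            # 4-bit quantization for large models
  gradient_checkpointing: true  # recompute activations to save memory
  use_flash_attention_2: true   # only if FlashAttention-2 is installed
```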
### Adapter Settings

- `r`: LoRA rank (higher = more trainable parameters and more capacity)
- `lora_alpha`: LoRA scaling factor (typically 2x the rank)
- `init_lora_weights`: Set to `true` for identity initialization
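
A corresponding sketch for the adapter section, using values from the 13B range above (nesting assumed):

```yaml
# Hypothetical nesting -- verify against the shipped config files.
adapter:
  r: 32                    # LoRA rank; 13B configs use 32-128 per this README
  lora_alpha: 64           # scaling factor, typically 2x the rank
  init_lora_weights: true  # identity initialization
```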
### Training Settings

- `per_device_batch_size`: Usually 1 for large models
- `gradient_accumulation_steps`: Multiplies the effective batch size (effective batch = per-device batch × accumulation steps × number of GPUs)
- `learning_rate`: Lower for larger models
- `bf16`: Use bfloat16 for better numerical stability
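
And a sketch for the training section; the learning rate shown is an example value, not one taken from the shipped configs:

```yaml
# Hypothetical nesting -- verify against the shipped config files.
training:
  per_device_batch_size: 1         # keep at 1 for large models
  gradient_accumulation_steps: 32  # effective batch per GPU = 1 x 32
  learning_rate: 1.0e-4            # example value; use lower rates for larger models
  bf16: true                       # bfloat16 for numerical stability
```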
## Usage

```bash
# For 8B models
python scripts/train_progressive.py --config config/training_config_large.yaml

# For 13B models
python scripts/train_progressive.py --config config/training_config_13b.yaml

# For 70B models (requires multiple GPUs)
python scripts/train_progressive.py --config config/training_config_70b.yaml
```
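
Multi-GPU runs (e.g., the 70B config with FSDP) normally go through a distributed launcher. How `train_progressive.py` expects to be launched is not documented here, so treat this `torchrun` invocation as an assumed pattern rather than the project's official command:

```bash
# Assumed launch pattern for 2 GPUs -- adjust if the project uses accelerate or its own launcher.
torchrun --nproc_per_node=2 scripts/train_progressive.py --config config/training_config_70b.yaml
```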
## Memory Requirements
| Model Size | VRAM (4-bit) | VRAM (16-bit) | GPUs Recommended |
|---|---|---|---|
| 8B | 12-16GB | 32GB | 1x RTX 4090 |
| 13B | 16-20GB | 52GB | 1x A100 |
| 70B | 40-60GB | 140GB | 2x A100 |
## Tips for Large Models
- Start with smaller models to validate your approach
- Use gradient checkpointing to reduce memory usage
- Monitor GPU memory during training
- Use lower learning rates for stability
- Consider multi-GPU setup for 70B+ models
- Enable flash attention if available for speed
## Troubleshooting
- OOM errors: Reduce batch size or enable gradient checkpointing (see the example config below)
- Slow training: Enable flash attention, use bf16
- Poor convergence: Adjust learning rate or warmup steps
- Multi-GPU issues: Check FSDP configuration
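
For OOM errors in particular, the usual first moves map onto a few of the keys listed under Configuration Parameters; a hedged example (nesting assumed):

```yaml
# Illustrative OOM mitigation -- nesting assumed; keep the effective batch size by raising accumulation.
model:
  load_in_4bit: true
  gradient_checkpointing: true     # usually the biggest single memory saver
training:
  per_device_batch_size: 1         # reduce the per-device batch first
  gradient_accumulation_steps: 64  # raise this to preserve the effective batch size
```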