# Training Configuration Files

This directory contains configuration files for different model sizes and use cases.

## Available Configurations

### Small Models (Testing)

- `training_config.yaml` - Default configuration for small models (DialoGPT-small)
- Memory: ~1GB VRAM
- Batch size: 8
- No quantization

### Medium Models (8B)

- `training_config_large.yaml` - Configuration for 8B models (Llama-3.2-8B)
- Memory: ~12GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 16-64
- 4-bit quantization enabled

### Large Models (13B)

- `training_config_13b.yaml` - Configuration for 13B models
- Memory: ~16GB VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 32-128
- Higher LoRA ranks (32-128)

### Extra Large Models (70B)

- `training_config_70b.yaml` - Configuration for 70B models
- Memory: ~40GB+ VRAM with 4-bit quantization
- Batch size: 1, gradient accumulation: 64-256
- Maximum LoRA ranks (64-256)
- Multi-GPU support with FSDP

## Configuration Parameters

The key parameters used across these configs are described below; an illustrative sketch of how they fit together in a config file is given at the end of this README.

### Model Settings

- `load_in_4bit`: Enable 4-bit quantization (recommended for large models)
- `gradient_checkpointing`: Recompute activations during the backward pass to trade compute for memory
- `use_flash_attention_2`: Faster attention computation if available

### Adapter Settings

- `r`: LoRA rank (higher rank = more trainable parameters and more capacity)
- `lora_alpha`: LoRA scaling factor (typically 2x the rank)
- `init_lora_weights`: Set to `true` for identity initialization

### Training Settings

- `per_device_batch_size`: Usually 1 for large models
- `gradient_accumulation_steps`: Multiplies the per-device batch size to give the effective batch size
- `learning_rate`: Lower for larger models
- `bf16`: Use bfloat16 for better numerical stability

## Usage

```bash
# For 8B models
python scripts/train_progressive.py --config config/training_config_large.yaml

# For 13B models
python scripts/train_progressive.py --config config/training_config_13b.yaml

# For 70B models (requires multiple GPUs)
python scripts/train_progressive.py --config config/training_config_70b.yaml
```

## Memory Requirements

| Model Size | VRAM (4-bit) | VRAM (16-bit) | Recommended GPUs |
|------------|--------------|---------------|------------------|
| 8B         | 12-16GB      | 32GB          | 1x RTX 4090      |
| 13B        | 16-20GB      | 52GB          | 1x A100          |
| 70B        | 40-60GB      | 140GB         | 2x A100          |

## Tips for Large Models

1. **Start with smaller models** to validate your approach
2. **Use gradient checkpointing** to reduce memory usage
3. **Monitor GPU memory** during training
4. **Use lower learning rates** for stability
5. **Consider a multi-GPU setup** for 70B+ models
6. **Enable flash attention** if available for speed

## Troubleshooting

- **OOM errors**: Reduce batch size or enable gradient checkpointing
- **Slow training**: Enable flash attention and use bf16
- **Poor convergence**: Adjust learning rate or warmup steps
- **Multi-GPU issues**: Check FSDP configuration
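
## Example Configuration Sketch

The exact schema is defined by the YAML files in this directory; the fragment below is only an illustrative sketch that groups the parameters documented above. The `model`/`adapter`/`training` section names and the specific values are assumptions for illustration, not copied from the shipped configs.

```yaml
# Illustrative only - see training_config_large.yaml for the real schema and values.
model:
  load_in_4bit: true            # 4-bit quantization, recommended for 8B+ models
  gradient_checkpointing: true  # trade extra compute for lower activation memory
  use_flash_attention_2: true   # faster attention if the kernel is available

adapter:
  r: 16                         # LoRA rank; higher = more trainable parameters
  lora_alpha: 32                # scaling factor, typically 2x the rank
  init_lora_weights: true       # identity initialization

training:
  per_device_batch_size: 1
  gradient_accumulation_steps: 32   # effective batch size = 1 x 32 = 32 per GPU
  learning_rate: 2.0e-4
  bf16: true                        # bfloat16 for numerical stability
```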
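
If a run hits OOM, the usual first adjustment (see Troubleshooting above) is to lower `per_device_batch_size`, raise `gradient_accumulation_steps` by the same factor so the effective batch size is unchanged, and enable `gradient_checkpointing`. The values below are examples only:

```yaml
# Illustrative OOM mitigation - same keys as documented above, example values.
model:
  gradient_checkpointing: true    # recompute activations instead of storing them
training:
  per_device_batch_size: 1        # was 4: lower the per-device batch to cut peak memory
  gradient_accumulation_steps: 32 # was 8: raised 4x so the effective batch size (32) is unchanged
```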