# Configuration Guide
Learn how to configure your machine learning jobs in Trainwave for optimal performance and cost-efficiency.
## Quick Start

Here's a complete example of a `trainwave.toml` file for a PyTorch training job:
```toml
# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"

# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200 # 50GB

# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"

# Optional Settings
# (top-level keys must appear before the [env_vars] table in TOML)
expires = "4h"
compliance_soc2 = true
exclude_gitignore = true
exclude_regex = "data/raw/.*"

# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```
## Configuration Options

### Required Fields

| Option | Type | Description | Example |
|---|---|---|---|
| `name` | String | Job name (doesn't need to be unique) | `"mnist-training"` |
| `project` | String | Project ID | `"p-abc123"` |
| `setup_command` | String | Environment setup command | `"pip install -r requirements.txt"` |
| `run_command` | String | Training command | `"python train.py"` |
| `image` | String | Docker image | `"trainwave/pytorch:2.3.1"` |
| `hdd_size_mb` | Integer | Disk space in MB | `51200` (50 GB) |
### Optional Fields

| Option | Type | Description | Example |
|---|---|---|---|
| `description` | String | Job description | `"Training MNIST classifier"` |
| `expires` | String | Auto-termination time | `"4h"`, `"1d"`, `"30m"` |
| `env_vars` | Object | Environment variables | See examples below |
| `exclude_gitignore` | Boolean | Respect `.gitignore` | `true` |
| `exclude_regex` | String | File exclusion pattern | `"data/raw/.*"` |
| `memory_gb` | Integer | RAM in GB | `16` |
| `cpu_cores` | Integer | CPU core count | `4` |
| `gpus` | Integer | GPU count | `1` |
| `gpu_type` | String | GPU model | `"RTX A5000"` |
| `compliance_soc2` | Boolean | SOC2 compliance | `true` |
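The `expires` values combine a number with a unit suffix (`m`, `h`, `d`). As a minimal sketch of how such duration strings map to concrete time spans (illustrative only, not Trainwave's actual parser):

```python
import re

# Seconds per supported unit suffix (assumed: m = minutes, h = hours, d = days)
_UNITS = {"m": 60, "h": 3600, "d": 86400}

def parse_expires(value: str) -> int:
    """Parse a duration like '4h', '1d', or '30m' into seconds."""
    match = re.fullmatch(r"(\d+)([mhd])", value.strip())
    if not match:
        raise ValueError(f"invalid duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]

print(parse_expires("4h"))   # 14400
print(parse_expires("30m"))  # 1800
```

So `expires = "4h"` terminates the job after four hours of wall-clock time, regardless of training progress.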
## Common Configurations

### 1. Basic PyTorch Training

```toml
name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240 # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"
```
### 2. Distributed Training

```toml
name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400 # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"

[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"
```
### 3. Large Language Model Training

```toml
name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000 # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login --token ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"

[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```
## Environment Variables

### Local Variable Interpolation

Use `${VAR_NAME}` to reference local environment variables:

```toml
[env_vars]
API_KEY = "${MY_API_KEY}"      # uses MY_API_KEY from your environment
DATABASE_URL = "${DB_URL}"     # uses DB_URL from your environment
```
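Conceptually, interpolation resolves each `${VAR}` placeholder from your local shell environment at submission time. A hedged sketch of that behavior (not Trainwave's actual implementation):

```python
import os
import re

def interpolate(value: str, env=os.environ) -> str:
    """Replace each ${VAR} placeholder with the variable's value
    from the local environment; fail loudly if a variable is unset."""
    def resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", resolve, value)

os.environ["MY_API_KEY"] = "sk-123"          # example value
print(interpolate("${MY_API_KEY}"))          # sk-123
```

The key point is that the secret never appears in `trainwave.toml` itself; only the placeholder does.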
### Fixed Values

Set fixed values directly:

```toml
[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"
```
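Because `env_vars` values always arrive as strings, the training script should cast them to the types it needs. A sketch of how a `train.py` might consume the fixed values above (the defaults shown are assumptions for illustration):

```python
import os

# Environment variables are always strings; cast explicitly.
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))
debug = os.environ.get("DEBUG", "false").lower() == "true"

print(batch_size, learning_rate, debug)
```

Keeping the defaults in the script lets it also run locally without the Trainwave environment.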
## Resource Optimization

### GPU Selection

Choose the right GPU based on your needs:

| GPU Type | Best For | Example Use Case |
|---|---|---|
| RTX 3080 | Small to medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |
### Memory Configuration

Optimize memory usage:

```toml
# For memory-intensive workloads
memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

# For distributed training
memory_gb = 64
env_vars.NCCL_P2P_DISABLE = "1" # if peer-to-peer transfers cause issues
```
### Storage Management

Control what gets uploaded:

```toml
# Exclude common development files
exclude_gitignore = true

# Exclude specific patterns (a single regex; join alternatives with |)
exclude_regex = 'data/raw/.*|.*\.tmp|logs/.*'

# Specify minimum storage
hdd_size_mb = 51200 # 50GB
```
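To see what a pattern like the one above excludes, you can test it against your project paths locally. An illustrative sketch (the assumption here is that paths are matched as regexes relative to the project root; this is not Trainwave's actual upload code):

```python
import re

# The same alternation pattern as in exclude_regex above
exclude = re.compile(r"data/raw/.*|.*\.tmp|logs/.*")

paths = ["train.py", "data/raw/shard-000.bin", "checkpoint.tmp", "logs/run1.txt"]
uploaded = [p for p in paths if not exclude.fullmatch(p)]
print(uploaded)  # ['train.py']
```

Note that `*.tmp` glob syntax is not a valid regex; the regex equivalent is `.*\.tmp`.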
## Best Practices

1. **Resource Allocation**
   - Start with the minimum required resources
   - Scale up based on monitoring data
   - Use `expires` to prevent runaway costs

2. **Environment Setup**
   - Keep `setup_command` idempotent
   - Use `requirements.txt` with pinned versions
   - Cache heavy downloads when possible

3. **Security**
   - Use environment variables for secrets
   - Enable `compliance_soc2` for sensitive data
   - Rotate API keys regularly

4. **Performance**
   - Match GPU type to workload
   - Configure appropriate memory limits
   - Use distributed training for large models
## Troubleshooting

### Common Issues

**Out of Memory**

```toml
# Increase memory and add a CUDA allocator config
memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```

**Slow Training**

```toml
# Upgrade the GPU and increase CPU cores
gpu_type = "A100"
cpu_cores = 8
```

**Upload Timeout**

```toml
# Exclude unnecessary files
exclude_gitignore = true
exclude_regex = "data/raw/.*"
```
## Advanced Topics

### Custom Docker Images

Use your own Docker images:

```toml
image = "your-registry.com/your-image:tag"

[env_vars]
DOCKER_USERNAME = "${DOCKER_USER}"
DOCKER_PASSWORD = "${DOCKER_PASS}"
```
### Multi-node Training

Configure for multiple nodes:

```toml
name = "multi-node-training"
gpus = 8
cpu_cores = 32
run_command = "torchrun --nproc_per_node=8 train.py"

[env_vars]
MASTER_ADDR = "localhost" # replace with the rank-0 node's address for true multi-node runs
MASTER_PORT = "29500"
WORLD_SIZE = "8"
```
## Support
- Configuration issues: support@trainwave.ai
- Resource requests: resources@trainwave.ai
- Custom solutions: enterprise@trainwave.ai