# Configuration Guide

Learn how to configure your machine learning jobs in Trainwave using `trainwave.toml`.

## Quick Start

Here’s a complete example of a `trainwave.toml` file for a PyTorch training job:
```toml
# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"

# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200 # 50GB

# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"

# Optional Settings
expires = "4h"
exclude_gitignore = true
exclude_regex = "data/raw/.*"

# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```

## Configuration Options

### Required Fields
| Option | Type | Description | Example |
|---|---|---|---|
| `name` | String | Job name (doesn’t need to be unique) | `"mnist-training"` |
| `project` | String | Project ID | `"p-abc123"` |
| `setup_command` | String | Environment setup command | `"pip install -r requirements.txt"` |
| `run_command` | String | Training command | `"python train.py"` |
| `image` | String | Docker image | `"trainwave/pytorch:2.3.1"` |
| `hdd_size_mb` | Integer | Disk space in MB | `51200` (50GB) |
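Assembled, the required fields alone give a minimal working `trainwave.toml` (the name, project ID, and commands here are illustrative):

```toml
name = "mnist-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 51200 # 50GB
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"
```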
### Optional Fields
| Option | Type | Description | Example |
|---|---|---|---|
| `description` | String | Job description | `"Training MNIST classifier"` |
| `expires` | String | Auto-termination time | `"4h"`, `"1d"`, `"30m"` |
| `env_vars` | Object | Environment variables | See examples below |
| `exclude_gitignore` | Boolean | Respect .gitignore | `true` |
| `exclude_regex` | String | File exclusion pattern | `"data/raw/.*"` |
| `memory_gb` | Integer | RAM in GB | `16` |
| `cpu_cores` | Integer | CPU core count | `4` |
| `gpus` | Integer | GPU count | `1` |
| `gpu_type` | String | GPU model | `"RTX A5000"` |
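For instance, the cost- and upload-control options can be combined in one config (values illustrative):

```toml
description = "Training MNIST classifier"
expires = "4h"                # auto-terminate after four hours
exclude_gitignore = true      # skip files ignored by git
exclude_regex = "data/raw/.*" # skip raw data during upload
```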
## Common Configurations

### Basic PyTorch Training
```toml
name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240 # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"
```

### Distributed Training
```toml
name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400 # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"

[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"
```

### Large Language Model Training
```toml
name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000 # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login --token ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"

[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```

## Environment Variables
### Local Variable Interpolation

Use `${VAR_NAME}` to reference environment variables from your local shell at launch time:

```toml
[env_vars]
API_KEY = "${MY_API_KEY}"    # Uses MY_API_KEY from your local environment
DATABASE_URL = "${DB_URL}"   # Uses DB_URL from your local environment
```

### Fixed Values
Set fixed values directly:

```toml
[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"
```

## Resource Optimization
### GPU Selection

Choose the right GPU based on your needs:
| GPU Type | Best For | Example Use Case |
|---|---|---|
| RTX 3080 | Small-medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |
### Storage Management

Control which files get uploaded to the job:
```toml
# Exclude common development files
exclude_gitignore = true

# Exclude specific patterns
exclude_regex = "data/raw/.*"

# Allocate enough storage for your data and artifacts
hdd_size_mb = 51200 # 50GB
```

## Best Practices
- **Resource Allocation**
  - Start with the minimum required resources and scale up
  - Use `expires` to prevent runaway costs
- **Environment Setup**
  - Keep `setup_command` idempotent
  - Use `requirements.txt` with pinned versions
- **Security**
  - Use environment variables for secrets; never hardcode them
  - Regularly rotate API keys
- **Performance**
  - Match GPU type to workload
  - Use distributed training for large models
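Following these practices, a `setup_command` with pinned dependency versions (version numbers illustrative) stays reproducible and safe to re-run:

```toml
setup_command = """
pip install torch==2.3.1 transformers==4.41.0
wandb login ${WANDB_API_KEY}
"""
```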
## Troubleshooting

### Out of Memory

```toml
memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
```

### Slow Training

```toml
gpu_type = "A100"
cpu_cores = 8
```

### Upload Timeout

```toml
exclude_gitignore = true
exclude_regex = "data/raw/.*"
```