
Configuration Guide

Learn how to configure your machine learning jobs in Trainwave for optimal performance and cost-efficiency.

Quick Start

Here’s a complete example of a trainwave.toml file for a PyTorch training job:

# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"
 
# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200  # 50GB
 
# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"
 
# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
 
# Optional Settings
expires = "4h"
compliance_soc2 = true
exclude_gitignore = true
exclude_regex = "data/raw/.*"

Configuration Options

Required Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| name | String | Job name (doesn't need to be unique) | "mnist-training" |
| project | String | Project ID | "p-abc123" |
| setup_command | String | Environment setup command | "pip install -r requirements.txt" |
| run_command | String | Training command | "python train.py" |
| image | String | Docker image | "trainwave/pytorch:2.3.1" |
| hdd_size_mb | Integer | Disk space in MB | 51200 (50GB) |
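Note that hdd_size_mb uses binary megabytes, so the examples in this guide multiply gigabytes by 1024. A small sketch of the conversion:

```python
def gb_to_hdd_size_mb(gb: int) -> int:
    """Convert a disk size in GB to an hdd_size_mb value (1 GB = 1024 MB)."""
    return gb * 1024

print(gb_to_hdd_size_mb(50))   # 51200, as in the Quick Start example
print(gb_to_hdd_size_mb(500))  # 512000, as in the LLM example below
```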

Optional Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| description | String | Job description | "Training MNIST classifier" |
| expires | String | Auto-termination time | "4h", "1d", "30m" |
| env_vars | Object | Environment variables | See examples below |
| exclude_gitignore | Boolean | Respect .gitignore | true |
| exclude_regex | String | File exclusion pattern | "data/raw/.*" |
| memory_gb | Integer | RAM in GB | 16 |
| cpu_cores | Integer | CPU core count | 4 |
| gpus | Integer | GPU count | 1 |
| gpu_type | String | GPU model | "RTX A5000" |
| compliance_soc2 | Boolean | SOC2 compliance | true |
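The expires values shown above follow an integer-plus-unit format (m, h, d). A hypothetical parser, included only to make the format precise; this helper is illustrative and not part of Trainwave:

```python
import re

UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}

def expires_to_minutes(value: str) -> int:
    """Parse an expires string like '4h', '1d', or '30m' into minutes."""
    match = re.fullmatch(r"(\d+)([mhd])", value)
    if match is None:
        raise ValueError(f"unrecognized expires value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_MINUTES[unit]

print(expires_to_minutes("4h"))  # 240
```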

Common Configurations

1. Basic PyTorch Training

name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240  # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"

2. Distributed Training

name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400  # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"
 
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"

3. Large Language Model Training

name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000  # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login --token ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"
 
[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Environment Variables

Local Variable Interpolation

Use ${VAR_NAME} to reference local environment variables:

[env_vars]
API_KEY = "${MY_API_KEY}"  # Uses MY_API_KEY from your environment
DATABASE_URL = "${DB_URL}"  # Uses DB_URL from your environment
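Conceptually, interpolation is a simple substitution: each ${VAR_NAME} placeholder is replaced with that variable's value from your local shell environment at submission time. A minimal sketch of that behavior, assuming plain substitution with no escaping rules:

```python
import re

def interpolate(value: str, env: dict[str, str]) -> str:
    """Expand ${VAR_NAME} references using values from the given environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env[m.group(1)], value)

local_env = {"MY_API_KEY": "sk-local-123"}
print(interpolate("${MY_API_KEY}", local_env))  # sk-local-123
```

Make sure the referenced variables are exported in the shell you submit from; an unset variable cannot be expanded.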

Fixed Values

Set fixed values directly:

[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"
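Inside your job, these values arrive as ordinary environment variables, always as strings, so cast them in your training script. A hypothetical snippet from train.py (the variable names match the example above; the defaults are illustrative):

```python
import os

# Read the fixed values set under [env_vars]; environment variables are
# always strings, so cast them to the types the script needs.
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))
debug = os.environ.get("DEBUG", "false").lower() == "true"

print(batch_size, learning_rate, debug)
```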

Resource Optimization

GPU Selection

Choose the right GPU based on your needs:

| GPU Type | Best For | Example Use Case |
| --- | --- | --- |
| RTX 3080 | Small-medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |
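When sizing a GPU, a common back-of-the-envelope check is weight memory: parameter count times bytes per parameter. This is a rough rule of thumb only; optimizer state, gradients, and activations can multiply the real training footprint several times over:

```python
def model_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just to hold model weights (fp16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1024**3

print(round(model_vram_gb(7e9), 1))  # ~13.0 GB of weights for a 7B fp16 model
```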

Memory Configuration

Optimize memory usage:

# For memory-intensive workloads
memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
 
# For distributed training
memory_gb = 64
env_vars.NCCL_P2P_DISABLE = "1"  # If P2P causes issues

Storage Management

Control what gets uploaded:

# Exclude common development files
exclude_gitignore = true
 
# Exclude specific patterns
exclude_regex = """
data/raw/.*
.*\.tmp
logs/.*
"""
 
# Specify minimum storage
hdd_size_mb = 51200  # 50GB
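Remember that exclude_regex takes regular expressions, not shell globs (so use .*\.tmp rather than *.tmp). You can preview which files a pattern would skip, assuming patterns match repository-relative paths:

```python
import re

# Preview which repository-relative paths an exclusion pattern would skip.
pattern = re.compile(r"data/raw/.*")

paths = ["data/raw/shard-000.bin", "data/processed/train.csv", "train.py"]
excluded = [p for p in paths if pattern.fullmatch(p)]
print(excluded)  # only the data/raw file is excluded
```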

Best Practices

  1. Resource Allocation

    • Start with minimum required resources
    • Scale up based on monitoring data
    • Use expires to prevent runaway costs
  2. Environment Setup

    • Keep setup_command idempotent
    • Use requirements.txt with fixed versions
    • Cache heavy downloads when possible
  3. Security

    • Use environment variables for secrets
    • Enable compliance_soc2 for sensitive data
    • Regularly rotate API keys
  4. Performance

    • Match GPU type to workload
    • Configure appropriate memory limits
    • Use distributed training for large models

Troubleshooting

Common Issues

  1. Out of Memory

    # Increase memory and add CUDA config
    memory_gb = 32
    env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
  2. Slow Training

    # Upgrade GPU and increase CPU cores
    gpu_type = "A100"
    cpu_cores = 8
  3. Upload Timeout

    # Exclude unnecessary files
    exclude_gitignore = true
    exclude_regex = "data/raw/.*"

Advanced Topics

Custom Docker Images

Use your own Docker images:

image = "your-registry.com/your-image:tag"
 
[env_vars]
DOCKER_USERNAME = "${DOCKER_USER}"
DOCKER_PASSWORD = "${DOCKER_PASS}"

Multi-node Training

Configure for multiple nodes:

name = "multi-node-training"
gpus = 8
cpu_cores = 32
run_command = "torchrun --nproc_per_node=8 train.py"
 
[env_vars]
MASTER_ADDR = "localhost"  # for true multi-node runs, set this to the rank-0 node's address
MASTER_PORT = "29500"
WORLD_SIZE = "8"
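Each worker launched by torchrun receives its coordinates through environment variables (RANK, LOCAL_RANK, WORLD_SIZE). A hypothetical snippet from train.py showing how a script reads them; the defaults let the same script also run standalone:

```python
import os

# torchrun exports these for every worker process it launches.
rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

is_main = rank == 0  # rank 0 typically handles logging and checkpointing
print(f"worker {rank}/{world_size} (gpu {local_rank}), main={is_main}")
```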
