
Configuration Guide

Learn how to configure your machine learning jobs in Trainwave for optimal performance and cost-efficiency.

Quick Start

Here’s a complete example of a trainwave.toml file for a PyTorch training job:

# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"
 
# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200  # 50GB
 
# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"
 
# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
 
# Optional Settings
expires = "4h"
compliance_soc2 = true
exclude_gitignore = true
exclude_regex = "data/raw/.*"

Configuration Options

Required Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| name | String | Job name (doesn't need to be unique) | "mnist-training" |
| project | String | Project ID | "p-abc123" |
| setup_command | String | Environment setup command | "pip install -r requirements.txt" |
| run_command | String | Training command | "python train.py" |
| image | String | Docker image | "trainwave/pytorch:2.3.1" |
| hdd_size_mb | Integer | Disk space in MB | 51200 (50GB) |
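Note that hdd_size_mb uses binary megabytes, so the examples in this guide multiply gigabytes by 1024. A small sketch of the conversion:

```python
def gb_to_hdd_size_mb(gb: int) -> int:
    """Convert a disk size in GB to an hdd_size_mb value (1 GB = 1024 MB)."""
    return gb * 1024

print(gb_to_hdd_size_mb(50))   # 51200, as in the Quick Start example
print(gb_to_hdd_size_mb(500))  # 512000, as in the LLM example below
```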

Optional Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| description | String | Job description | "Training MNIST classifier" |
| expires | String | Auto-termination time | "4h", "1d", "30m" |
| env_vars | Object | Environment variables | See examples below |
| exclude_gitignore | Boolean | Respect .gitignore | true |
| exclude_regex | String | File exclusion pattern | "data/raw/.*" |
| memory_gb | Integer | RAM in GB | 16 |
| cpu_cores | Integer | CPU core count | 4 |
| gpus | Integer | GPU count | 1 |
| gpu_type | String | GPU model | "RTX A5000" |
| compliance_soc2 | Boolean | SOC2 compliance | true |
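The expires values shown above follow an integer-plus-unit format (m, h, d). A hypothetical parser, included only to make the format precise; this helper is illustrative and not part of Trainwave:

```python
import re

UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}

def expires_to_minutes(value: str) -> int:
    """Parse an expires string like '4h', '1d', or '30m' into minutes."""
    match = re.fullmatch(r"(\d+)([mhd])", value)
    if match is None:
        raise ValueError(f"unrecognized expires value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_MINUTES[unit]

print(expires_to_minutes("4h"))  # 240
```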

Common Configurations

1. Basic PyTorch Training

name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240  # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"

2. Distributed Training

name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400  # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"
 
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"

3. Large Language Model Training

name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000  # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login --token ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"
 
[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Environment Variables

Local Variable Interpolation

Use ${VAR_NAME} to reference local environment variables:

[env_vars]
API_KEY = "${MY_API_KEY}"  # Uses MY_API_KEY from your environment
DATABASE_URL = "${DB_URL}"  # Uses DB_URL from your environment
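Conceptually, interpolation is a simple substitution: each ${VAR_NAME} placeholder is replaced with that variable's value from your local shell environment at submission time. A minimal sketch of that behavior, assuming plain substitution with no escaping rules:

```python
import re

def interpolate(value: str, env: dict[str, str]) -> str:
    """Expand ${VAR_NAME} references using values from the given environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env[m.group(1)], value)

local_env = {"MY_API_KEY": "sk-local-123"}
print(interpolate("${MY_API_KEY}", local_env))  # sk-local-123
```

Make sure the referenced variables are exported in the shell you submit from; an unset variable cannot be expanded.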

Fixed Values

Set fixed values directly:

[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"
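Inside your job, these values arrive as ordinary environment variables, always as strings, so cast them in your training script. A hypothetical snippet from train.py (the variable names match the example above; the defaults are illustrative):

```python
import os

# Read the fixed values set under [env_vars]; environment variables are
# always strings, so cast them to the types the script needs.
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))
debug = os.environ.get("DEBUG", "false").lower() == "true"

print(batch_size, learning_rate, debug)
```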

Resource Optimization

GPU Selection

Choose the right GPU based on your needs:

| GPU Type | Best For | Example Use Case |
| --- | --- | --- |
| RTX 3080 | Small-medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |
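When sizing a GPU, a common back-of-the-envelope check is weight memory: parameter count times bytes per parameter. This is a rough rule of thumb only; optimizer state, gradients, and activations can multiply the real training footprint several times over:

```python
def model_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just to hold model weights (fp16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1024**3

print(round(model_vram_gb(7e9), 1))  # ~13.0 GB of weights for a 7B fp16 model
```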

Memory Configuration

Optimize memory usage:

# For memory-intensive workloads
memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
 
# For distributed training
memory_gb = 64
env_vars.NCCL_P2P_DISABLE = "1"  # If P2P causes issues

Storage Management

Control what gets uploaded:

# Exclude common development files
exclude_gitignore = true
 
# Exclude specific patterns
exclude_regex = """
data/raw/.*
.*\.tmp
logs/.*
"""
 
# Specify minimum storage
hdd_size_mb = 51200  # 50GB
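Remember that exclude_regex takes regular expressions, not shell globs (so use .*\.tmp rather than *.tmp). You can preview which files a pattern would skip, assuming patterns match repository-relative paths:

```python
import re

# Preview which repository-relative paths an exclusion pattern would skip.
pattern = re.compile(r"data/raw/.*")

paths = ["data/raw/shard-000.bin", "data/processed/train.csv", "train.py"]
excluded = [p for p in paths if pattern.fullmatch(p)]
print(excluded)  # only the data/raw file is excluded
```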

Best Practices

  1. Resource Allocation

    • Start with minimum required resources
    • Scale up based on monitoring data
    • Use expires to prevent runaway costs
  2. Environment Setup

    • Keep setup_command idempotent
    • Use requirements.txt with fixed versions
    • Cache heavy downloads when possible
  3. Security

    • Use environment variables for secrets
    • Enable compliance_soc2 for sensitive data
    • Regularly rotate API keys
  4. Performance

    • Match GPU type to workload
    • Configure appropriate memory limits
    • Use distributed training for large models

Troubleshooting

Common Issues

  1. Out of Memory

    # Increase memory and add CUDA config
    memory_gb = 32
    env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
  2. Slow Training

    # Upgrade GPU and increase CPU cores
    gpu_type = "A100"
    cpu_cores = 8
  3. Upload Timeout

    # Exclude unnecessary files
    exclude_gitignore = true
    exclude_regex = "data/raw/.*"

Advanced Topics

Custom Docker Images

Use your own Docker images:

image = "your-registry.com/your-image:tag"
 
[env_vars]
DOCKER_USERNAME = "${DOCKER_USER}"
DOCKER_PASSWORD = "${DOCKER_PASS}"

Multi-node Training

Configure for multiple nodes:

name = "multi-node-training"
gpus = 8
cpu_cores = 32
run_command = "torchrun --nproc_per_node=8 train.py"
 
[env_vars]
MASTER_ADDR = "localhost"  # for true multi-node runs, set this to the rank-0 node's address
MASTER_PORT = "29500"
WORLD_SIZE = "8"
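Each worker launched by torchrun receives its coordinates through environment variables (RANK, LOCAL_RANK, WORLD_SIZE). A hypothetical snippet from train.py showing how a script reads them; the defaults let the same script also run standalone:

```python
import os

# torchrun exports these for every worker process it launches.
rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

is_main = rank == 0  # rank 0 typically handles logging and checkpointing
print(f"worker {rank}/{world_size} (gpu {local_rank}), main={is_main}")
```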
