Configuration Guide

Learn how to configure your machine learning jobs in Trainwave using trainwave.toml.

Quick Start

Here’s a complete example of a trainwave.toml file for a PyTorch training job:

# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"

# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200  # 50GB

# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"

# Optional Settings
# Top-level keys must appear before any [table] header; otherwise TOML
# treats them as members of that table.
expires = "4h"
exclude_gitignore = true
exclude_regex = "data/raw/.*"

# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Configuration Options

Required Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| name | String | Job name (doesn’t need to be unique) | "mnist-training" |
| project | String | Project ID | "p-abc123" |
| setup_command | String | Environment setup command | "pip install -r requirements.txt" |
| run_command | String | Training command | "python train.py" |
| image | String | Docker image | "trainwave/pytorch:2.3.1" |
| hdd_size_mb | Integer | Disk space in MB | 51200 (50GB) |

Optional Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| description | String | Job description | "Training MNIST classifier" |
| expires | String | Auto-termination time | "4h", "1d", "30m" |
| env_vars | Object | Environment variables | See examples below |
| exclude_gitignore | Boolean | Respect .gitignore | true |
| exclude_regex | String | File exclusion pattern | "data/raw/.*" |
| memory_gb | Integer | RAM in GB | 16 |
| cpu_cores | Integer | CPU core count | 4 |
| gpus | Integer | GPU count | 1 |
| gpu_type | String | GPU model | "RTX A5000" |
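
The `expires` examples above all follow a number-plus-unit pattern. The sketch below shows one way to turn such a value into a duration, for instance to sanity-check it locally. It assumes only the three units shown in the examples (`m`, `h`, `d`); the formats Trainwave actually accepts may differ, and `parse_expires` is a hypothetical helper.

```python
import re
from datetime import timedelta

# Unit suffixes seen in the documented examples, mapped to timedelta keywords.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_expires(value: str) -> timedelta:
    """Convert a string such as "4h", "1d", or "30m" into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", value.strip())
    if match is None:
        raise ValueError(f"unrecognized expires value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


if __name__ == "__main__":
    for example in ("4h", "1d", "30m"):
        print(example, "->", parse_expires(example))
```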

Common Configurations

Basic PyTorch Training

name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240  # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"

Distributed Training

name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400  # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"

[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"
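
When `train.py` is started by `torchrun`, each worker process receives its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. A minimal sketch of reading them follows; `DistributedContext` and `read_distributed_context` are hypothetical names, and the single-process defaults are an assumption so the script also runs without `torchrun`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedContext:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (usually the GPU index)
    world_size: int  # total number of processes in the job


def read_distributed_context(environ=os.environ) -> DistributedContext:
    """Read the variables torchrun sets, defaulting to a single-process run."""
    return DistributedContext(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    print(read_distributed_context())
```

With `--nproc_per_node=4` on a single node, the four workers would see `LOCAL_RANK` values 0 through 3 and a `WORLD_SIZE` of 4.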

Large Language Model Training

name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000  # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login --token ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"

[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Environment Variables

Local Variable Interpolation

Use ${VAR_NAME} to reference environment variables from your local shell at launch time:

[env_vars]
API_KEY = "${MY_API_KEY}"      # Uses MY_API_KEY from your local environment
DATABASE_URL = "${DB_URL}"     # Uses DB_URL from your local environment
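
Because these placeholders are resolved from your local shell, it can help to confirm that every referenced variable is actually set before launching. The following is a sketch under assumptions: `unresolved_variables` is a hypothetical helper, and the `${NAME}` pattern it looks for mirrors the syntax shown above rather than Trainwave's exact parsing rules.

```python
import os
import re

# Matches ${NAME} placeholders as written in the examples above.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unresolved_variables(env_vars: dict[str, str], environ=os.environ) -> list[str]:
    """Return local variable names referenced as ${NAME} but not set locally."""
    missing: list[str] = []
    for value in env_vars.values():
        for name in _PLACEHOLDER.findall(value):
            if name not in environ and name not in missing:
                missing.append(name)
    return missing


if __name__ == "__main__":
    env_vars = {"API_KEY": "${MY_API_KEY}", "DATABASE_URL": "${DB_URL}"}
    # Lists whichever of MY_API_KEY and DB_URL are not exported in this shell.
    print(unresolved_variables(env_vars))
```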

Fixed Values

Set fixed values directly:

[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"
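
Environment variables always reach the process as strings, so the training script has to convert them. A minimal sketch of how `train.py` might read the values above; `read_hyperparameters` and its fallback defaults are illustrative assumptions, not something Trainwave provides.

```python
import os


def read_hyperparameters(environ=os.environ) -> dict:
    """Convert string environment variables into typed hyperparameters."""
    return {
        "batch_size": int(environ.get("BATCH_SIZE", "32")),
        "learning_rate": float(environ.get("LEARNING_RATE", "0.001")),
        # Only the literal string "true" (any case) enables debug mode.
        "debug": environ.get("DEBUG", "false").strip().lower() == "true",
    }


if __name__ == "__main__":
    print(read_hyperparameters())
```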

Resource Optimization

GPU Selection

Choose the right GPU based on your needs:

| GPU Type | Best For | Example Use Case |
| --- | --- | --- |
| RTX 3080 | Small-medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |

Storage Management

Control which files get uploaded to the job:

# Exclude common development files
exclude_gitignore = true

# Exclude specific patterns
exclude_regex = "data/raw/.*"

# Allocate enough storage for your data and artifacts
hdd_size_mb = 51200  # 50GB
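
Before uploading, you can preview which files a pattern would exclude. The sketch below assumes the pattern is matched from the start of each project-relative path using Python's `re` syntax; how Trainwave actually applies `exclude_regex` is not specified here, so treat the result as an approximation. `split_by_exclude` is a hypothetical helper.

```python
import re


def split_by_exclude(paths: list[str], exclude_regex: str) -> tuple[list[str], list[str]]:
    """Split relative paths into (kept, excluded) according to the pattern."""
    pattern = re.compile(exclude_regex)
    kept: list[str] = []
    excluded: list[str] = []
    for path in paths:
        # Assumption: the pattern is anchored at the start of the relative path.
        (excluded if pattern.match(path) else kept).append(path)
    return kept, excluded


if __name__ == "__main__":
    files = ["train.py", "data/raw/a.csv", "data/processed/b.csv"]
    kept, excluded = split_by_exclude(files, "data/raw/.*")
    print("kept:", kept)
    print("excluded:", excluded)
```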

Best Practices

  1. Resource Allocation
    • Start with the minimum required resources and scale up
    • Use expires to prevent runaway costs
  2. Environment Setup
    • Keep setup_command idempotent
    • Use requirements.txt with pinned versions
  3. Security
    • Use environment variables for secrets; never hardcode them
    • Regularly rotate API keys
  4. Performance
    • Match GPU type to workload
    • Use distributed training for large models
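
As an illustration of the pinned-versions advice, the following sketch lists requirement lines that lack an exact `==` pin. It is a simplified check under stated assumptions: `unpinned_requirements` is a hypothetical helper, it skips option lines such as `-r other.txt`, and it does not handle every form pip accepts (URLs, environment markers, hashes).

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned: list[str] = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r" or "--index-url".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = "torch==2.3.1\ntransformers>=4.40\n# logging\nwandb\n"
    print(unpinned_requirements(sample))
```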

Troubleshooting

Out of Memory

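Increase system RAM with memory_gb. For GPU out-of-memory errors caused by fragmentation, cap the CUDA caching allocator's split size:

memory_gb = 32

[env_vars]
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"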

Slow Training

Try a faster GPU, and add CPU cores if data loading may be the bottleneck:

gpu_type = "A100"
cpu_cores = 8

Upload Timeout

Reduce the upload size by excluding ignored files and large raw data:

exclude_gitignore = true
exclude_regex = "data/raw/.*"

Support

support@trainwave.ai
