
Configuration Guide

Learn how to configure your machine learning jobs in Trainwave using trainwave.toml.

Quick Start

Here’s a complete example of a trainwave.toml file for a PyTorch training job:

# Basic Information
name = "bert-finetuning"
project = "p-abc123"
description = "Fine-tuning BERT for text classification"
 
# Resource Configuration
gpu_type = "RTX A5000"
gpus = 1
cpu_cores = 4
memory_gb = 16
hdd_size_mb = 51200  # 50GB
 
# Runtime Configuration
image = "trainwave/pytorch:2.3.1"
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "python train.py"
 
# Environment Variables
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"
 
# Optional Settings
expires = "4h"
exclude_gitignore = true
exclude_regex = "data/raw/.*"

Configuration Options

Required Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| name | String | Job name (doesn't need to be unique) | "mnist-training" |
| project | String | Project ID | "p-abc123" |
| setup_command | String | Environment setup command | "pip install -r requirements.txt" |
| run_command | String | Training command | "python train.py" |
| image | String | Docker image | "trainwave/pytorch:2.3.1" |
| hdd_size_mb | Integer | Disk space in MB | 51200 (50 GB) |
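Since hdd_size_mb is specified in megabytes, gigabyte figures have to be multiplied out. A one-line helper, assuming the binary 1 GB = 1024 MB convention used by the examples in this guide:

```python
def gb_to_hdd_size_mb(gb: int) -> int:
    """Convert a GB figure to the MB value hdd_size_mb expects (1 GB = 1024 MB)."""
    return gb * 1024

print(gb_to_hdd_size_mb(50))   # -> 51200
print(gb_to_hdd_size_mb(500))  # -> 512000
```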

Optional Fields

| Option | Type | Description | Example |
| --- | --- | --- | --- |
| description | String | Job description | "Training MNIST classifier" |
| expires | String | Auto-termination time | "4h", "1d", "30m" |
| env_vars | Object | Environment variables | See examples below |
| exclude_gitignore | Boolean | Respect .gitignore | true |
| exclude_regex | String | File exclusion pattern | "data/raw/.*" |
| memory_gb | Integer | RAM in GB | 16 |
| cpu_cores | Integer | CPU core count | 4 |
| gpus | Integer | GPU count | 1 |
| gpu_type | String | GPU model | "RTX A5000" |
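The expires values above use a short duration syntax ("30m", "4h", "1d"). Trainwave's own parser isn't shown here; the sketch below is one plausible reading of the format, assuming a single integer followed by an m, h, or d suffix.

```python
# Seconds per supported unit; only m/h/d appear in this guide's examples.
UNITS = {"m": 60, "h": 3600, "d": 86400}

def expires_to_seconds(value: str) -> int:
    """Convert a duration like "4h" to seconds (illustrative, not Trainwave's parser)."""
    number, unit = int(value[:-1]), value[-1]
    if unit not in UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return number * UNITS[unit]

print(expires_to_seconds("30m"))  # -> 1800
print(expires_to_seconds("4h"))   # -> 14400
print(expires_to_seconds("1d"))   # -> 86400
```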

Common Configurations

Basic PyTorch Training

name = "mnist-basic"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 10240  # 10GB
gpu_type = "RTX 3080"
gpus = 1
setup_command = "pip install -r requirements.txt"
run_command = "python train.py"

Distributed Training

name = "bert-distributed"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 102400  # 100GB
gpu_type = "A100"
gpus = 4
cpu_cores = 16
memory_gb = 64
setup_command = """
pip install -r requirements.txt
wandb login ${WANDB_API_KEY}
"""
run_command = "torchrun --nproc_per_node=4 train.py"
 
[env_vars]
WANDB_API_KEY = "${WANDB_API_KEY}"
MASTER_PORT = "29500"

Large Language Model Training

name = "llm-training"
project = "p-abc123"
image = "trainwave/pytorch:2.3.1"
hdd_size_mb = 512000  # 500GB
gpu_type = "A100"
gpus = 8
cpu_cores = 32
memory_gb = 256
setup_command = """
pip install -r requirements.txt
huggingface-cli login ${HF_TOKEN}
"""
run_command = "python train.py --model gpt3"
 
[env_vars]
HUGGINGFACE_TOKEN = "${HF_TOKEN}"
WANDB_API_KEY = "${WANDB_API_KEY}"
PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Environment Variables

Local Variable Interpolation

Use ${VAR_NAME} to reference environment variables from your local shell at launch time:

[env_vars]
API_KEY = "${MY_API_KEY}"       # Uses MY_API_KEY from your local environment
DATABASE_URL = "${DB_URL}"      # Uses DB_URL from your local environment
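Conceptually, this interpolation works like shell expansion: each ${VAR} placeholder is replaced with the matching value from your local environment before the job launches. A rough Python sketch of that behavior (the real substitution happens inside the Trainwave client, and how it handles unset variables is an assumption here):

```python
import re

def interpolate(env_vars: dict[str, str], environ: dict[str, str]) -> dict[str, str]:
    """Replace ${NAME} placeholders using `environ` (a stand-in for os.environ).

    Unset variables are replaced with an empty string in this sketch.
    """
    pattern = re.compile(r"\$\{(\w+)\}")
    return {
        key: pattern.sub(lambda m: environ.get(m.group(1), ""), value)
        for key, value in env_vars.items()
    }

local_env = {"MY_API_KEY": "sk-123", "DB_URL": "postgres://localhost/db"}
print(interpolate({"API_KEY": "${MY_API_KEY}", "DATABASE_URL": "${DB_URL}"}, local_env))
```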

Fixed Values

Set fixed values directly:

[env_vars]
BATCH_SIZE = "32"
LEARNING_RATE = "0.001"
DEBUG = "true"

Resource Optimization

GPU Selection

Choose the right GPU based on your needs:

| GPU Type | Best For | Example Use Case |
| --- | --- | --- |
| RTX 3080 | Small-medium models | MNIST, CIFAR, small CNNs |
| RTX A5000 | Medium models | BERT, ResNet, medium-scale training |
| A100 | Large models | GPT, T5, large-scale distributed training |

Storage Management

Control which files get uploaded to the job:

# Exclude common development files
exclude_gitignore = true
 
# Exclude specific patterns
exclude_regex = "data/raw/.*"
 
# Allocate enough storage for your data and artifacts
hdd_size_mb = 51200  # 50GB
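To preview which files a pattern like data/raw/.* would exclude, you can test it locally with Python's re module. Whether Trainwave anchors the pattern at the start of each repo-relative path is an assumption in this sketch, so verify against an actual upload.

```python
import re

def excluded(paths: list[str], pattern: str) -> list[str]:
    """Return the paths matched (and therefore skipped) by `pattern`.

    Assumes the regex is matched against the start of each
    repo-relative path, as re.match does.
    """
    rx = re.compile(pattern)
    return [p for p in paths if rx.match(p)]

files = ["train.py", "data/raw/dump.csv", "data/processed/train.parquet"]
print(excluded(files, r"data/raw/.*"))  # -> ['data/raw/dump.csv']
```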

Best Practices

1. Resource Allocation
   • Start with the minimum required resources and scale up
   • Use expires to prevent runaway costs
2. Environment Setup
   • Keep setup_command idempotent
   • Use requirements.txt with pinned versions
3. Security
   • Use environment variables for secrets; never hardcode them
   • Regularly rotate API keys
4. Performance
   • Match GPU type to workload
   • Use distributed training for large models

Troubleshooting

Out of Memory

Increase job RAM and cap the CUDA allocator's split size:

memory_gb = 32
env_vars.PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:512"

Slow Training

Move to a faster GPU and give the data-loading pipeline more CPU cores:

gpu_type = "A100"
cpu_cores = 8

Upload Timeout

Shrink the upload by excluding files the job doesn't need:

exclude_gitignore = true
exclude_regex = "data/raw/.*"

Support

For further help, contact support@trainwave.ai.