Configuration: Fine-Tune Your Training Jobs
Trainwave’s flexible configuration system allows you to precisely define the environment and resources required for your machine learning jobs. By specifying parameters in your trainwave.toml file, you can optimize your training process for efficiency and cost-effectiveness.
How Configuration Works
Trainwave’s intelligent matching system analyzes your configuration parameters to find the best available resources that meet your needs. You can specify as many or as few parameters as you like, but each additional requirement narrows the pool of matching resources, which can affect pricing.
Configuration Options
Here’s a comprehensive guide to the available configuration options:
Option | Required | Type | Description |
---|---|---|---|
name | True | String | The name of your job. This does not need to be unique. |
project | True | String | The ID of the project this job belongs to (e.g., p-jsefhsee). |
setup_command | True | String | The command to be executed first to set up your environment. This could involve installing dependencies or running a setup script. |
run_command | True | String | The command to start your training process (e.g., python train.py). This can also point to a bash script. |
expires | False | String | Automatically stop the job after a specified duration (e.g., 1h for 1 hour, 10m for 10 minutes, 1d for 1 day). This helps prevent runaway costs. |
env_vars | False | Object | Define environment variables for your job. You can set fixed values (e.g., env_vars.ENV_VAR = "abc") or use interpolation to access your local environment variables (e.g., env_vars.ENV_VAR = "${MY_LOCAL_ENV_VAR}"). |
exclude_gitignore | False | Bool | Set to true to prevent files and folders specified in your .gitignore file from being uploaded to the training environment. This can help reduce upload times and storage costs. |
exclude_regex | False | String | Use a regular expression to exclude specific files or folders from being uploaded (e.g., exclude_regex = "data.*" to exclude any file or folder starting with “data”). |
image | True | String | The Docker image to use for your job. See the Images documentation for available options. |
hdd_size_mb | True | Integer | The required disk space in MB (maximum 500GB). |
memory_mb | False | Integer | The minimum amount of RAM, in MB, required for your job. |
cpus | False | Integer | The minimum number of CPUs required. |
gpus | False | Integer | The number of GPUs required. |
gpu_type | False | String | The specific type of GPU required. Refer to the GPU Types documentation for available options. |
compliance_soc2 | False | Bool | Set to true if you require SOC 2 compliant data centers for your job. |
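As a rough illustration of the required options in the table, a minimal trainwave.toml might look like the sketch below. It assumes all options are top-level keys. The job name, setup command, and disk size are hypothetical values, and the image value is a placeholder to be replaced with an entry from the Images documentation.

```toml
# Minimal sketch: only the options marked Required = True.
name = "example-job"                               # hypothetical; does not need to be unique
project = "p-jsefhsee"                             # example project ID from the table
setup_command = "pip install -r requirements.txt"  # hypothetical setup step
run_command = "python train.py"                    # example command from the table
image = "<image-name>"                             # placeholder; see the Images documentation
hdd_size_mb = 20480                                # hypothetical: 20 GB expressed in MB
```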
Optimizing Your Configuration:
- Start with essential parameters: Begin by defining the core requirements for your job (name, project, setup_command, run_command, image, hdd_size_mb).
- Add constraints as needed: Introduce additional parameters like gpus, gpu_type, memory_mb, and cpus to fine-tune your resource allocation.
- Use exclusion rules: Utilize exclude_gitignore or exclude_regex to avoid uploading unnecessary files and improve efficiency.
- Consider expires for cost control: Set an expiration time for your jobs to prevent unexpected expenses.
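The optional settings mentioned in these tips could be added to a trainwave.toml roughly as follows. This is a sketch, not a recommended configuration. The numeric values and environment variable names are hypothetical, and the gpu_type value is a placeholder to be replaced with an entry from the GPU Types documentation.

```toml
# Optional resource constraints (all values hypothetical).
gpus = 1
gpu_type = "<gpu-type>"    # placeholder; see the GPU Types documentation
memory_mb = 32768          # minimum RAM: 32 GB expressed in MB
cpus = 8                   # minimum CPU count

# Exclusion rules to keep uploads small.
exclude_gitignore = true   # skip anything listed in .gitignore
exclude_regex = "data.*"   # example pattern from the options table

# Cost control: stop the job automatically after one hour.
expires = "1h"

# Environment variables: one fixed value, one interpolated from the local shell.
env_vars.LEARNING_RATE = "0.001"               # hypothetical variable name
env_vars.API_TOKEN = "${MY_LOCAL_API_TOKEN}"   # hypothetical local variable
```

Leaving out a constraint such as gpu_type keeps more resources eligible for matching, so it may be worth adding these keys one at a time.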
By carefully configuring your training jobs, you can ensure optimal performance, efficient resource utilization, and predictable costs with Trainwave.