Overview

From Energy and Carbon Considerations of Fine-Tuning BERT: We find that pre-training BERT is equivalent to anywhere from 400 (MNLI) to 45,000 (RTE) fine-tuning runs depending on the dataset size, and that the number of training tokens is a reasonable heuristic for estimating fine-tuning energy use. The “true” number of training tokens seen, accounting for dynamic padding of sequences to the maximum length in a batch, is a better predictor than relying on the mean or median number of tokens per example. Further comparison of fine-tuning and inference energy intensity across tasks confirms that example sequence length holds a much stronger influence on energy intensity in the fine-tuning phase than in the inference phase, in alignment with expectations from previous work.

We find that, controlling for hardware, energy consumption scales most predictably with wall clock time and number of tokens encountered during training (including the pad tokens added to sequences to match the maximum sequence length in a batch).
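As a concrete illustration of this heuristic, the sketch below counts training tokens the way dynamic padding sees them: every sequence in a batch is counted at the length of the longest sequence in that batch. The function name and batch layout are illustrative, and the total should be multiplied by the number of epochs for a full training run.

from typing import Iterable, List

def padded_token_count(batches: Iterable[List[List[int]]]) -> int:
    """Count tokens seen during training, including the pad tokens added so
    every sequence matches the longest sequence in its batch (dynamic padding)."""
    total = 0
    for batch in batches:
        max_len = max(len(seq) for seq in batch)
        total += max_len * len(batch)  # every sequence is padded up to max_len
    return total

# Two small batches of token-id sequences of uneven length.
batches = [
    [[101, 7592, 102], [101, 102]],              # padded to 3 tokens each -> 6
    [[101, 2023, 2003, 102], [101, 2003, 102]],  # padded to 4 tokens each -> 8
]
print(padded_token_count(batches))  # 14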

Disclosure of fine-tuning costs

To assess the environmental impact of fine-tuning a model, developers should disclose the technical infrastructure used for fine-tuning and the duration of the training run; a machine-readable sketch of these fields follows the lists below.

Infrastructure data:

  • Fine-tuning cluster details
  • Managed service used (e.g. AWS Bedrock)
  • Physical location of the datacenter where the fine-tuning occurred

Operational data:

  • Base model
  • Total fine-tuning time
  • GPU and CPU utilization during fine-tuning
  • Total fine-tuning tokens (including padding), if the total time is not available (for instance, when using a managed service)
  • Start time

Usage data:

  • Expected use life in days
  • Expected inferences per day
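To make these disclosures easy to aggregate and compare, the fields above can be captured in a structured record. The dataclass below is an illustrative sketch, not a prescribed schema; the field names and types are assumptions.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FineTuningDisclosure:
    # Infrastructure data
    cluster_details: Optional[str] = None        # e.g. "1x HPE Apollo 6500, 4x Nvidia A100 80GB"
    managed_service: Optional[str] = None        # e.g. "AWS Bedrock"
    datacenter_location: Optional[str] = None    # e.g. "AWS US West (Oregon)"
    # Operational data
    base_model: Optional[str] = None             # e.g. "Llama 2"
    total_time_hours: Optional[float] = None
    avg_gpu_utilization: Optional[float] = None  # 0.0 - 1.0
    avg_cpu_utilization: Optional[float] = None  # 0.0 - 1.0
    total_tokens_incl_padding: Optional[int] = None
    start_time: Optional[datetime] = None
    # Usage data
    expected_use_life_days: Optional[float] = None
    expected_inferences_per_day: Optional[float] = None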

Example disclosure

Component                  Disclosed data
Base model                 Llama 2
GPU                        Nvidia A100 80GB
Server                     HPE Apollo 6500 Gen10 Plus
Number of GPUs             4
Number of servers          1
Server location            AWS US West (Oregon)
Total reserved time        12 hours
Average CPU utilization    12%
Average GPU utilization    47%

Normalization of disclosed data

When the disclosed data is missing or incomplete, we fill the gaps with predicted or heuristic values.

Missing data point         Mechanism to replace
GPU model                  Use the most common GPU for the training year (for instance, the Nvidia A100 for 2022)
Server model               Use the most common server or instance type for the training year
Cluster size               Assume 1 server for fine-tuning
Location                   Use the US as a relatively high-carbon country
Datacenter PUE             Use the location average
Datacenter WUE             Use the location average
Total fine-tuning time     Predict from the number of tokens and the model
Start time                 Use the published model date minus the total reserved time
GPU and CPU utilization    Predict from the model
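A minimal sketch of how these replacement heuristics could be applied programmatically is shown below. The defaults and field names are placeholders; the PUE/WUE lookups and the token-based time predictor are left as comments because they depend on external data.

# Illustrative year -> hardware defaults, mirroring the "most common for the
# training year" heuristic above (placeholder values, not a maintained list).
DEFAULT_GPU_BY_YEAR = {2022: "Nvidia A100"}

def normalize_disclosure(disclosed: dict, training_year: int) -> dict:
    """Fill gaps in a fine-tuning disclosure using the heuristics in the table above."""
    normalized = dict(disclosed)
    normalized.setdefault("gpu_model", DEFAULT_GPU_BY_YEAR.get(training_year))
    normalized.setdefault("cluster_size", 1)            # assume a single server
    normalized.setdefault("location", "United States")  # relatively high-carbon default
    # Datacenter PUE/WUE would be filled from location averages, and the total
    # fine-tuning time and utilization predicted from the token count and model.
    return normalized

print(normalize_disclosure({"base_model": "Llama 2", "tokens": 48_123}, 2022))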

Example normalization: AWS Bedrock fine-tuning

When a managed service is used, we need to make some assumptions about the underlying execution environment.

Component          Disclosed data
Base model         Llama 2
Managed service    AWS Bedrock
Region             US West (Oregon)
Start time         July 6, 2024 17:01
Tokens             48,123

TODO: model a standard AWS instance for this use case and document the token-to-time prediction.
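Until that instance model and predictor are documented, the token-to-time step could be sketched as below; the throughput and GPU count are placeholder assumptions, not measured values for Bedrock.

def estimate_finetuning_hours(total_tokens: int,
                              tokens_per_second_per_gpu: float = 1_500.0,  # placeholder throughput
                              num_gpus: int = 4) -> float:                 # assumed single 4-GPU server
    """Rough token -> time estimate when only a token count is disclosed."""
    seconds = total_tokens / (tokens_per_second_per_gpu * num_gpus)
    return seconds / 3600.0

# The Bedrock example above disclosed 48,123 tokens.
print(f"{estimate_finetuning_hours(48_123):.4f} hours")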

Calculation of carbon emissions and water use

Use the same calculations outlined in Training.
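The authoritative formulas are in the Training section; as a reminder of their general shape, the sketch below turns the disclosed example above into an energy estimate and then into carbon and water figures. The GPU power, PUE, grid carbon intensity, and WUE values are placeholder assumptions, and CPU and other server components are ignored for brevity.

def finetuning_footprint(gpu_count: int,
                         gpu_power_kw: float,     # e.g. ~0.4 kW TDP for an A100 80GB SXM
                         gpu_utilization: float,  # 0.0 - 1.0
                         hours: float,
                         pue: float,
                         grid_kgco2e_per_kwh: float,
                         wue_l_per_kwh: float):
    """Sketch of an energy -> carbon/water estimate from disclosed data."""
    it_energy_kwh = gpu_count * gpu_power_kw * gpu_utilization * hours
    carbon_kg = it_energy_kwh * pue * grid_kgco2e_per_kwh  # facility energy x grid intensity
    water_l = it_energy_kwh * wue_l_per_kwh                # WUE is defined per kWh of IT energy
    return carbon_kg, water_l

# Example disclosure above: 4 GPUs at 47% utilization for 12 hours.
# The PUE, carbon intensity and WUE below are placeholders, not Oregon values.
carbon_kg, water_l = finetuning_footprint(4, 0.4, 0.47, 12, 1.2, 0.35, 1.8)
print(f"{carbon_kg:.2f} kgCO2e, {water_l:.2f} L H2O")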

Amortization of fine-tuning impact across use life

To amortize the fine-tuning impact, we need to estimate the number of inferences that the model will perform during its use life. This applies both to fine-tuning a base model and to fine-tuning a previously fine-tuned model (also known as continuous fine-tuning), except that in the latter case the use life should be considered the time until the next fine-tuning run (e.g. one day).

EmissionsPerInference(fine-tuning) = Em(fine-tuning) / (inferences per day) / (use life days)

Example

A model is fine-tuned daily, emitting 12.8 kgCO2e and consuming 18.3 L of H2O per run. On average, the model performs 1000 inferences a day.

EmPerInf(fine-tuning)  = (12.8 kgCO2e) / (1000 inf/d) / (1 d)
                       = 12.8 gCO2e/inf

H2OPerInf(fine-tuning) = (18.3 L H2O) / (1000 inf/d) / (1 d)
                       = 18.3 mL H2O/inf
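The same amortization can be expressed as a small helper; the sketch below reproduces the worked example.

def per_inference_impact(total_impact: float,
                         inferences_per_day: float,
                         use_life_days: float) -> float:
    """Amortize a fine-tuning impact (carbon or water) over the expected inferences."""
    return total_impact / (inferences_per_day * use_life_days)

# Daily fine-tuning (use life of 1 day), 1000 inferences per day.
em_per_inf = per_inference_impact(12.8, 1000, 1)   # kgCO2e -> 0.0128 kgCO2e/inf
h2o_per_inf = per_inference_impact(18.3, 1000, 1)  # L H2O  -> 0.0183 L H2O/inf
print(f"{em_per_inf * 1000:.1f} gCO2e/inf, {h2o_per_inf * 1000:.1f} mL H2O/inf")
# -> 12.8 gCO2e/inf, 18.3 mL H2O/inf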