Overview

We need to model the environmental impact of the computation used to train the model. The training costs of genAI models are often disclosed in some form; these costs must be normalized and then amortized across the expected use life of the model.

The key components of the training cost of the model include:

  • The operational energy used to train the final model
  • The operational energy used to train intermediate and preliminary models
  • The embodied emissions from the cluster used to train all models, including hardware reserved but not actively deployed
  • (Allocated emissions from generation of the training and test data)
  • (Allocated operational and embodied emissions of development and testing infrastructure)

Carbon Emissions and Large Neural Network Training includes a table that shows the impact on training emissions of changing the model, the datacenter, the GPU, and the energy grid.

Disclosure of training costs

To fully assess environmental impact, model developers should disclose the technical infrastructure used for training and how this infrastructure was engaged during the training process.

Infrastructure data:

  • Training cluster details (GPU model, server model, number of GPUs, number of servers)
  • Training cluster location(s) - if hosted in a cloud, which region(s)

Operational data:

  • Total reserved/owned wall clock time for intermediate and final model training
  • CPU utilization % for model training
  • GPU hours for intermediate model training
  • GPU hours for final model training

Example disclosure

As an example of a relatively complete disclosure, see Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2022. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model: “We estimate that BLOOM’s final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes ranging from equipment manufacturing to energy-based operational consumption.”

Infrastructure data

Component | Disclosed data
GPU | Nvidia A100 80GB
Server | HPE Apollo 6500 Gen10 Plus
Number of GPUs | 384
Number of servers | 48
Training location | France

Training data

Component | Disclosed data
Total reserved time | 118 days
Reservation start time | January 2022 (?)
GPU hours for final model | 1,082,990

Normalization of disclosed data

When disclosed data is missing or incomplete, we need to use predictive or heuristic data to fill in the gaps.

Missing data point | Mechanism to replace
GPU model | Use the most common GPU for the training year (for instance, 2022 is Nvidia A100)
Server model | Use the most common server or instance type for the training year
GPUs used | Use the average cluster size for similar models
Servers used | Divide GPUs used by average GPUs per server
Location | Use the US as a relatively high-carbon country
Datacenter PUE | Use location average
Datacenter WUE | Use location average
Total reserved time | Use the average ratio of reserved time to GPU hours
Reservation start time | Use the published model date minus the total reserved time
GPU hours for final model | Predict using parameters and architecture per OpenCarbonEval
GPU hours for intermediate models | Predict based on ratio of final to intermediate for other disclosed models
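As an illustration of how these fallbacks might be applied, the sketch below fills gaps in a disclosure record with defaults before any calculation runs. The field names and default values are placeholders, not recommendations:

# Illustrative sketch: fill gaps in a disclosure with heuristic defaults.
# Field names and default values are placeholders, not recommendations.
FALLBACKS_2022 = {
    "gpu_model": "Nvidia A100",   # most common GPU for the training year
    "location": "US",             # relatively high-carbon default
    "datacenter_pue": 1.55,       # placeholder location average
    "datacenter_wue": 1.8,        # placeholder location average, L/kWh
}

def normalize(disclosed: dict) -> dict:
    # Keep every disclosed (non-None) value; fall back to defaults otherwise.
    return {**FALLBACKS_2022, **{k: v for k, v in disclosed.items() if v is not None}}

print(normalize({"gpu_model": "Nvidia A100 80GB", "location": "France", "datacenter_pue": None}))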

Example normalization: BLOOM 176B

The BLOOM paper includes most of the required parameters for carbon emissions. However, it does not include data for water consumption. We use the fallbacks for the missing data points in the calculations below.

Some assumptions:

  • The 35.8 tonnes of CO2 for the intermediate models implies reservation and usage proportional to the final model (1.45x the final model, so a 2.45x multiplier on the totals below)

Normalized training data

Component | Normalized data
Total reserved time | 289 days (118 x 2.45)
Reservation start time | August 2021 (finish date of June 2022 minus 289 days)
GPU hours for final and intermediate models | 2,653,326 (1,082,990 x 2.45)

Calculation of carbon emissions

To calculate CO2 emissions, we use the Software Carbon Intensity (SCI) formula. We need a few data points, shown in the worked example below.

The training emissions will be:

Total emissions = (embodied emissions) + (usage emissions)

Embodied emissions = (cluster embodied emissions per hour) x (training time)

Usage emissions = (usage energy per GPU-hour) x (total GPU hours) x (average grid intensity during training)
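A minimal sketch of this calculation in Python, assuming embodied emissions are expressed in kgCO2e and grid intensity in gCO2e/kWh; the function name and parameters are illustrative:

def training_emissions_kgco2e(
    cluster_embodied_kg_per_hour,   # embodied emissions amortized per hour of cluster use
    training_wall_clock_hours,      # total reserved time for intermediate + final training
    energy_per_gpu_hour_kwh,        # GPU power draw x datacenter PUE
    total_gpu_hours,
    grid_intensity_g_per_kwh,       # average grid intensity during the training window
):
    embodied = cluster_embodied_kg_per_hour * training_wall_clock_hours
    usage = energy_per_gpu_hour_kwh * total_gpu_hours * grid_intensity_g_per_kwh / 1000
    return embodied + usage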

Example carbon calculation

For the final BLOOM model described above:

Component | Value
Server embodied emissions | 2500 kgCO2e for a similar model
GPU embodied emissions | 318 kgCO2e
Usage energy per GPU | 428 W
Datacenter PUE | 1.1 (Google average)
Grid intensity | 57 gCO2e/kWh (France)
Server use life | 4 years, given the rapid pace of change in the GPU market
Projected utilization | 95%, given intense demand for GPUs

Modeling the cluster produces:

E(gpu-h) = (.428 kW) x (1.1)
         = .471 kW

EmbEm(h) = ((384 GPUs) x (318 kgCO2e/GPU) +
             (48 servers) x (2500 kgCO2e/server))
            / (4 years)
            / (8760 hours/year)
            / (95% utilization)
         = 7.27 kgCO2e/hour

Training emissions based on the normalized data:

EmbEm(training) = (7.27 kgCO2e/h) x (289 d) x (24 h/d)
                = 50,425 kgCO2e

OpEm(training)  = (.471 kW) x (2653326 h) x (57 gCO2e/kWh)
                = 71,234 kgCO2e

Em(training)    = 121,659 kgCO2e (121.7 mtCO2e)

Note that this calculation produces a higher estimate for embodied emissions on the final model (20.6 mtCO2e) than the 11.2 mtCO2e in the BLOOM paper referenced above for three reasons. First, the embodied emissions for the A100 are higher based on a more detailed paper. Second, we use a shorter use life as hardware efficiency is increasing extremely quickly in the AI space and these servers will be obsolete more quickly than general-purpose servers. Third, we use a higher utilization number based on increased demand for GPUs.
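For reference, the same arithmetic expressed as a short Python snippet using the normalized figures above; it reproduces the hand calculation to within rounding:

# Normalized BLOOM assumptions from this section.
GPUS, SERVERS = 384, 48
GPU_EMBODIED_KG, SERVER_EMBODIED_KG = 318, 2500
USE_LIFE_YEARS, UTILIZATION = 4, 0.95
GPU_POWER_KW, PUE = 0.428, 1.1
GRID_G_PER_KWH = 57                      # France
RESERVED_HOURS = 289 * 24                # final + intermediate models
TOTAL_GPU_HOURS = 2_653_326

energy_per_gpu_hour = GPU_POWER_KW * PUE                  # ~0.471 kWh per GPU-hour
embodied_per_hour = (GPUS * GPU_EMBODIED_KG + SERVERS * SERVER_EMBODIED_KG) / (
    USE_LIFE_YEARS * 8760 * UTILIZATION)                  # ~7.27 kgCO2e/hour

embodied_kg = embodied_per_hour * RESERVED_HOURS          # ~50,400 kgCO2e
usage_kg = energy_per_gpu_hour * TOTAL_GPU_HOURS * GRID_G_PER_KWH / 1000  # ~71,200 kgCO2e
print(round(embodied_kg + usage_kg))                      # ~121,700 kgCO2e total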

Calculation of water impact

The water impact of training includes:

  • The water consumed to cool the servers during the training period (using the “water usage effectiveness” or WUE of the datacenter)
  • The water consumed to produce the electricity used by the servers
  • The water consumed to produce the chips and servers

The water impact is calculated by:

Datacenter water consumption = (Usage energy per GPU) x (total GPU hours) x (datacenter WUE)
Electricity water consumption = (Usage energy per GPU) x (total GPU hours) x (electricity WUE)
Manufacturing water consumption = (Cluster embodied H2O per hour) x (total training hours)

TODO - update above to use cluster metrics and include PUE for the scope 2 number
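A sketch of these water formulas as currently written (per the TODO, PUE is not yet applied to the electricity term); the parameter names are illustrative:

def training_water_liters(
    gpu_power_kw,                  # usage energy per GPU
    total_gpu_hours,
    datacenter_wue_l_per_kwh,      # cooling water per kWh at the datacenter
    electricity_wue_l_per_kwh,     # water consumed to generate each kWh
    cluster_embodied_l_per_hour,   # manufacturing water amortized per hour of use
    training_wall_clock_hours,
):
    energy_kwh = gpu_power_kw * total_gpu_hours
    return {
        "datacenter": energy_kwh * datacenter_wue_l_per_kwh,
        "electricity": energy_kwh * electricity_wue_l_per_kwh,
        "manufacturing": cluster_embodied_l_per_hour * training_wall_clock_hours,
    }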

Example of water impact calculation for BLOOM 176B

Component | Value
Datacenter WUE | 1.8 L/kWh (US average)
Electricity WUE | 3.67 L/kWh (note that 2022 nuclear data from France indicates lower numbers than the WRI report)
Manufacturing WUE | 412 L/GPU

Modeling the cluster produces:

EmbH2O(h) = ((384 GPUs) x (412 L/GPU))
            / (4 years)
            / (8760 hours/year)
            / (95% utilization)
          = 4.75 L/h

This produces:

Datacenter H2O = (0.428 kW) x (2653326 hours) x 1.8 L/kWh = 2,044 kL
Electricity H2O = (0.428 kW) x (2653326 hours) x 3.67 L/kWh = 4,168 kL
Manufacturing H2O = (2832 h) x (4.75 L/h) = 13.5 kL
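A quick numeric check of these figures (kL = thousands of liters); note that the manufacturing term uses the final model's 2,832 wall clock hours, as above:

energy_kwh = 0.428 * 2_653_326          # total GPU energy, no PUE applied yet
print(energy_kwh * 1.8 / 1000)          # ~2,044 kL datacenter cooling
print(energy_kwh * 3.67 / 1000)         # ~4,168 kL electricity generation
print(2832 * 4.75 / 1000)               # ~13.5 kL manufacturing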

Amortization of impact across use life

A general-purpose model is likely to be used heavily for a period of time and then made obsolete by newer models that are more effective and/or more efficient. Specialized models may have longer use lives. Open source models enable fine-tuning, which creates stickiness for ongoing use.

Each model should have a projected use life and track the actual and projected number of inferences for each month during that use life. As actual inference numbers are calculated each month, the projections for the remaining use life should be updated. Since actual inference numbers are sensitive, the model developer could publish the percent of amortized training cost remaining.

Amortization schedule tables

Initial amortization schedule

  • PI means the total projected inferences
  • N means the total use life in months
  • TC means the total training cost

Data point | Month 1 | Month 2 | Month 3 | Month N
Remaining use life | N | N - 1 | N - 2 | 0
Training cost remaining (TCR) | TC | TC - TC / N | TC - 2 x TC / N | 0
Projected inferences remaining (PCR) | PI | PI x (N - 1) / N | PI x (N - 2) / N | 0
Training cost per inference (TPI) | TCR1 / PCR1 | TCR2 / PCR2 | TCR3 / PCR3 | 0
Training cost “billed” (TCB) | PI / N x TPI1 | TCB1 + PI / N x TPI2 | TCB2 + PI / N x TPI3 | TC

Amortization schedule after month 1

Data point | Month 1 | Month 2 | Month 3 | Month N
Remaining use life | N | N - 1 | N - 2 | 0
Training cost remaining (TCR) | TC | TC - TCB1 | TC - TCB2 | 0
Projected inferences remaining (PCR) | PI | AI1 x (N - 1) | AI1 x (N - 2) | 0
Training cost per inference (TPI) | TCR1 / PCR1 | TCR2 / PCR2 | TCR3 / PCR3 | TC / PI
Actual inferences | AI1 | | |
Training cost “billed” (TCB) | AI1 x TPI1 | TCB1 + AI2 x TPI2 | TCB2 + AI3 x TPI3 | TC
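A sketch of how this schedule could be generated, under two assumptions that are ours rather than fixed by the tables: months without actuals are billed against a flat projection, and the remaining inferences are re-projected from the latest observed monthly rate (as the second table does with AI1):

def amortize(tc, pi, n_months, actual_by_month):
    # tc: total training cost, pi: total projected inferences,
    # n_months: use life in months, actual_by_month: actual inferences observed so far.
    tcr = tc                                  # training cost remaining (TCR)
    pcr = pi                                  # projected inferences remaining (PCR)
    tcb = 0.0                                 # cumulative training cost "billed" (TCB)
    rows = []
    for month in range(1, n_months + 1):
        tpi = tcr / pcr if pcr else 0.0       # training cost per inference (TPI)
        # Use the actual count when known, otherwise bill against the flat projection.
        if month <= len(actual_by_month):
            ai = actual_by_month[month - 1]
        else:
            ai = pcr / (n_months - month + 1)
        billed = min(ai * tpi, tcr)           # never bill more than remains
        tcb += billed
        tcr -= billed
        pcr = ai * (n_months - month)         # re-project from the latest monthly rate
        rows.append((month, tpi, billed, tcr, pcr, tcb))
    return rows

# Example: TC = 46 mt (as grams), PI = 98T, N = 14, 11T actual inferences in month 1
# (matching the amortization example later in this section).
for row in amortize(46e6, 98e12, 14, [11e12])[:3]:
    print(row)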

Open Questions

  • How to estimate projected inferences and use life for a new model (vs an upgrade)
  • How to estimate actual inferences for a closed model that doesn’t disclose usage
  • How to estimate actual inferences for an open source model like LLAMA where the use is not happening in a centralized fashion

Example of amortization

Traffic to ChatGPT was relatively flat from April 2023 to April 2024, averaging around 50M visits a day. Assuming 5 queries per visit and 925 inferences per query, this would represent 7T inferences per month.
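A quick check of that arithmetic:

# 50M visits/day x 5 queries/visit x 925 inferences/query x ~30 days
print(50e6 * 5 * 925 * 30)   # ~6.9e12, roughly 7T inferences per month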

With the impending release of GPT-4o, a reasonable projection would be that traffic would continue at the same rate, and that the model lifecycle would be around a year given the 14-month gap between GPT-4 and GPT-4o.

The initial amortization schedule would look like:

Data point | Month 1 | Month 2 | Month 3 | Month 14
Remaining use life | 14 | 13 | 12 | 0
Training cost remaining (TCR) | 46 mt | 43 mt | 39 mt | 0
Projected inferences remaining (PCR) | 98T | 91T | 84T | 0
Training cost per inference (TPI) | .46 g/Mq | .46 g/Mq | .46 g/Mq | 0
Training cost “billed” (TCB) | 3 mt | 3 mt | 3 mt | 0

What actually happened was that traffic to chatgpt.com increased by 55% in June 2024 after the release of GPT-4o.

Data point | Month 1 | Month 2 | Month 3 | Month 14
Remaining use life | 14 | 13 | 12 | 0
Training cost remaining (TCR) | 46 mt | 41 mt | 38 mt | 0
Projected inferences remaining (PCR) | 98T | 143T | 132T | 0
Training cost per inference (TPI) | .46 g/Mq | .28 g/Mq | .28 g/Mq | 0
Actual inferences | 11T | | |
Training cost “billed” (TCB) | 5 mt | | |

The spike in traffic means that the model was effectively overbilled in month 1 relative to the flat schedule. The projected training cost per inference for the remaining months drops sharply, thanks to the higher projected volume and the lower remaining training cost.
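A quick check of the re-projected cost per inference after month 1, using the table values above:

tcr_month2 = 41e6                      # grams remaining (41 mt)
pcr_month2 = 11e12 * 13                # 11T/month x 13 months remaining = 143T
print(tcr_month2 / pcr_month2 * 1e6)   # ~0.29 g per million inferences (table rounds to .28), down from .46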