Overview
Inference is the process of an AI model responding to a request. A common example is an AI chatbot. The request and the response consist of text, which is encoded as “tokens” (words or sub-words). Other common AI use cases include image generation, image modification, and video generation. Typically, a model is hosted on a cluster of servers or cloud instances, with the goal of maximizing throughput (tokens or responses per second) under a given latency (time to respond) constraint. A consumer of AI inference can host a model locally, host a model on a cloud service (e.g. AWS, Together.ai), or use a managed service provider endpoint (e.g. OpenAI, Google Vertex AI). Scope3’s methodology for inference takes in a model, input and output parameters, and information about your inference service. It then estimates the energy required and the environmental impact of that inference.

Inference duration prediction
The key driver of energy usage and environmental impact is the duration of the inference request - how long the computation takes. Many factors influence this duration; in particular, the model used (including the model’s precision, in bytes), the size of the input and output, and the hardware configuration have a large influence. Many additional factors determine how a model is executed, including batching strategies, paged attention, and optimizations such as quantization. The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference, with the flexibility to incorporate future optimizations as they appear.

Text-to-Text inference duration prediction
Baseten provides a good summary of LLM inference duration (for a deeper dive, see kipply’s post on inference arithmetic). It can be broken down into two parts:
- Input token (“prefill”) duration - the time required to process all of the input tokens. Prefill is compute-bound: each input token requires roughly $2P$ floating-point operations.
- Output token (“decode”) duration - the time required to generate all of the output tokens. Decode is memory-bound: each output token requires reading all of the model weights from GPU memory.

We predict the total duration with a linear regression over these two terms:

$$t_{\text{inference}} = \alpha \cdot \frac{2 \cdot P \cdot n_{\text{input}}}{C} + \beta \cdot \frac{P \cdot d \cdot n_{\text{output}}}{B}$$

where:
- $P$ is the number of parameters in the model
- $n_{\text{input}}$ and $n_{\text{output}}$ are the number of input and output tokens
- $d$ is the precision of the model parameters, in bytes (e.g. 2, for FP16)
- $C$ is the computational capacity of the GPU at the given precision, in FLOPS
- $B$ is the memory bandwidth of the GPU
- $\alpha$ and $\beta$ are the learned coefficients of the linear regression
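A minimal sketch of this prediction, assuming the formula above; the hardware figures and regression coefficients in the example are illustrative placeholders, not Scope3’s fitted values:

```python
def predict_inference_duration_s(
    n_params: float,           # P: number of model parameters
    n_input: int,              # number of input (prefill) tokens
    n_output: int,             # number of output (decode) tokens
    precision_bytes: float,    # d: bytes per parameter (e.g. 2 for FP16)
    gpu_flops: float,          # C: GPU compute capacity at that precision (FLOP/s)
    gpu_mem_bandwidth: float,  # B: GPU memory bandwidth (bytes/s)
    alpha: float = 1.0,        # learned coefficient for the prefill term
    beta: float = 1.0,         # learned coefficient for the decode term
) -> float:
    """Predicted request duration in seconds."""
    # Prefill is compute-bound: ~2 FLOPs per parameter per input token.
    prefill_s = (2 * n_params * n_input) / gpu_flops
    # Decode is memory-bound: each output token reads all model weights once.
    decode_s = (n_params * precision_bytes * n_output) / gpu_mem_bandwidth
    return alpha * prefill_s + beta * decode_s


# Example: a 70B-parameter model at FP16 on a single H100-class GPU
# (~990 TFLOP/s FP16 compute, ~3.35 TB/s memory bandwidth - assumed figures).
duration_s = predict_inference_duration_s(
    n_params=70e9, n_input=512, n_output=256,
    precision_bytes=2, gpu_flops=9.9e14, gpu_mem_bandwidth=3.35e12,
)
print(f"predicted duration: {duration_s:.1f} s")
```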
Energy calculation
Given the predicted inference duration, we can calculate the energy use per inference:

$$E_{\text{inference}} = t_{\text{inference}} \cdot \left( W_{\text{idle}} + n_{\text{GPU}} \cdot W_{\text{GPU}} + n_{\text{CPU}} \cdot W_{\text{CPU}} \right)$$

where:
- $W_{\text{idle}}$, $W_{\text{GPU}}$, and $W_{\text{CPU}}$ are the power consumption of the idle server, GPU, and CPU respectively
- $n_{\text{GPU}}$ and $n_{\text{CPU}}$ are the number of GPUs and CPUs on the server
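A sketch of this calculation, assuming the full server power draw is attributed to the single request for its whole duration (in practice batching spreads this across concurrent requests); the power figures below are assumptions:

```python
def inference_energy_wh(
    duration_s: float,  # predicted inference duration (s)
    p_idle_w: float,    # idle power draw of the server (W)
    p_gpu_w: float,     # active power draw per GPU (W)
    p_cpu_w: float,     # active power draw per CPU (W)
    n_gpu: int,         # number of GPUs on the server
    n_cpu: int,         # number of CPUs on the server
) -> float:
    """Energy attributed to one inference request, in watt-hours."""
    total_power_w = p_idle_w + n_gpu * p_gpu_w + n_cpu * p_cpu_w
    return total_power_w * duration_s / 3600.0


# Example: an 8-GPU, 2-CPU server running the ~10.8 s request predicted above.
energy_wh = inference_energy_wh(
    duration_s=10.8, p_idle_w=800, p_gpu_w=700, p_cpu_w=300, n_gpu=8, n_cpu=2,
)
print(f"energy per inference: {energy_wh:.1f} Wh")
```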
Cluster energy use
The cluster is defined by:
- Cluster details
- Cluster location(s) - if hosted in a cloud, which region(s)
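As an illustration, a cluster might be described with a structure like the following; the field names are assumptions for the sketch, not Scope3’s schema:

```python
from dataclasses import dataclass, field


@dataclass
class InferenceCluster:
    """Illustrative cluster description used by the energy/emissions sketches."""
    gpu_type: str               # e.g. "H100"
    gpus_per_server: int
    cpus_per_server: int
    server_count: int
    cloud_regions: list[str] = field(default_factory=list)  # e.g. ["us-east-1"]


cluster = InferenceCluster(
    gpu_type="H100", gpus_per_server=8, cpus_per_server=2,
    server_count=4, cloud_regions=["us-east-1"],
)
```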
Calculating emissions and water usage for an inference request
Calculate embodied emissions and water usage using the predicted inference duration (see Server cluster for more information about embodied emissions), accounting for the following (combined in the sketch below):
- The base model used, including the current amortized training cost per inference
- The current amortized fine-tuning cost per inference, if applicable
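A sketch of how these terms might combine into a per-request carbon figure, assuming the operational energy from above, a regional grid intensity, and per-second amortized embodied emissions; the function and parameter names are assumptions, and water usage would follow the same pattern with water-intensity factors:

```python
def inference_emissions_gco2e(
    energy_wh: float,                # operational energy for the request (Wh)
    grid_gco2e_per_kwh: float,       # carbon intensity of the cluster's region(s)
    embodied_gco2e_per_s: float,     # server embodied emissions amortized per second of use
    duration_s: float,               # predicted inference duration (s)
    training_gco2e: float = 0.0,     # amortized base-model training cost per inference
    fine_tuning_gco2e: float = 0.0,  # amortized fine-tuning cost per inference, if any
) -> float:
    """Total gCO2e attributed to one inference request."""
    operational = (energy_wh / 1000.0) * grid_gco2e_per_kwh
    embodied = embodied_gco2e_per_s * duration_s
    return operational + embodied + training_gco2e + fine_tuning_gco2e


# Example with assumed factors: 21 Wh of energy on a ~400 gCO2e/kWh grid.
print(inference_emissions_gco2e(
    energy_wh=21.0, grid_gco2e_per_kwh=400.0,
    embodied_gco2e_per_s=0.01, duration_s=10.8,
    training_gco2e=0.5, fine_tuning_gco2e=0.05,
))
```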
Closed Model Parameter Estimation
Many providers do not disclose the number of parameters in their models. For LLMs with unknown parameter counts, we use a linear regression model to produce a best estimate. It is well-documented that the cost a provider charges for inference is a good predictor of model size. For each popular provider, we fit a linear regression model to predict the number of parameters from the cost, on a log-log scale:

$$\log(P_{m,p}) = \alpha_p \cdot \log(c_{m,p}) + \beta_p$$

where:
- $P_{m,p}$ is the number of parameters in model $m$ from provider $p$
- $c_{m,p}$ is the cost of inference for model $m$ from provider $p$
- $\alpha_p$ and $\beta_p$ are the learned coefficients of the linear regression
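A minimal sketch of fitting and applying this regression with ordinary least squares on log-transformed values; the (cost, parameter count) pairs below are made-up illustrations, not real provider data:

```python
import math


def fit_log_log(costs: list[float], params: list[float]) -> tuple[float, float]:
    """Fit log(params) = alpha * log(cost) + beta by ordinary least squares."""
    xs = [math.log(c) for c in costs]
    ys = [math.log(p) for p in params]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    alpha = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    beta = y_mean - alpha * x_mean
    return alpha, beta


def estimate_params(cost: float, alpha: float, beta: float) -> float:
    """Best-estimate parameter count for an undisclosed model, given its price."""
    return math.exp(alpha * math.log(cost) + beta)


# Made-up (cost per 1M tokens, parameter count) pairs for one provider.
known = [(0.2, 8e9), (0.6, 24e9), (1.2, 70e9), (3.0, 180e9)]
alpha, beta = fit_log_log([c for c, _ in known], [p for _, p in known])
print(f"estimated parameters at $1.00/1M tokens: {estimate_params(1.0, alpha, beta):.3g}")
```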