Inference is the process of an AI model responding to a request. A common example is an AI chatbot: the request and the response consist of text, which is encoded in “tokens” (words or sub-words). Other common AI use cases include image generation, image modification, and video generation. Typically a model is hosted on a cluster of servers or cloud instances, with the goal of maximizing throughput (tokens or responses per second) under a given latency (time to respond) constraint. A consumer of AI inference can host a model locally, host a model on a cloud service (e.g. AWS, Together.ai), or use a managed service provider endpoint (e.g. OpenAI, Google Vertex AI).

Scope3’s methodology for inference takes in a model, input and output parameters, and information about your inference service. It then estimates the energy required and the environmental impact of that inference.
The key driver of energy usage and environmental impact is the duration of the inference request - how long the computation takes. Many factors influence this duration.
In particular, the model used (including the model’s precision, in bytes), the size of the input and output, and the hardware configuration have a large influence. There are also many additional factors that determine how a model is executed, including batching strategies, paged attention, and optimizations like quantization.
The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference, with the flexibility to incorporate future optimizations as they emerge.
The duration of an inference request has two components:
Input token (“prefill”) duration - the time required to process all of the input tokens
Output token (“decode”) duration - the time required to generate all of the output tokens
Prefill duration is compute bound, whereas decode duration is memory bound (a worked example follows the definitions below):

$$t_{in} = \frac{2 \cdot P_m \cdot n_{in}}{A_f(b)}$$

$$t_{out} = \frac{b \cdot P_m \cdot n_{out}}{A_{bw}}$$

where:
$P_m$ is the number of parameters in the model
$n_{in}$ and $n_{out}$ are the number of input and output tokens
$b$ is the precision of the model parameters, in bytes (e.g. 2, for FP16)
$A_f(b)$ is the computational capacity of the GPU, in FLOPS (floating-point operations per second), for the given precision
$A_{bw}$ is the memory bandwidth of the GPU
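As a concrete illustration of these two expressions, the sketch below computes prefill and decode durations directly. The function names and the example hardware figures (FP16 compute capacity and memory bandwidth) are illustrative assumptions, not values from the methodology.

```python
def prefill_duration_s(n_in: int, p_m: float, a_flops: float) -> float:
    """Compute-bound prefill time: t_in = 2 * P_m * n_in / A_f(b).

    a_flops is the accelerator's compute capacity (FLOP/s) at the model's precision.
    """
    return 2 * p_m * n_in / a_flops


def decode_duration_s(n_out: int, p_m: float, b_bytes: float, a_bw: float) -> float:
    """Memory-bound decode time: t_out = b * P_m * n_out / A_bw.

    a_bw is the accelerator's memory bandwidth in bytes per second.
    """
    return b_bytes * p_m * n_out / a_bw


# Illustrative example: a 70B-parameter model at FP16 (b = 2 bytes) on a GPU
# with roughly 1e15 FLOP/s of FP16 compute and 2e12 bytes/s of memory bandwidth.
t_in = prefill_duration_s(n_in=1000, p_m=70e9, a_flops=1e15)
t_out = decode_duration_s(n_out=200, p_m=70e9, b_bytes=2, a_bw=2e12)
print(f"prefill ~ {t_in:.3f} s, decode ~ {t_out:.2f} s, total ~ {t_in + t_out:.2f} s")
```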
The total duration is the sum:

$$t_{total} = t_{in} + t_{out}$$

These expressions are theoretical. Therefore we fit a linear regression to benchmarked inference speed data to predict real-world inference duration (a sketch of this fit follows the definition below):

$$\hat{t}_{total} = \beta_{in} \cdot t_{in} + \beta_{out} \cdot t_{out}$$

where:
$\beta_{in}$ and $\beta_{out}$ are the learned coefficients of the linear regression
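A minimal sketch of this fit, assuming a benchmark dataset of measured request durations alongside the theoretical $t_{in}$ and $t_{out}$ values. The numbers are synthetic placeholders, and the use of NumPy’s least-squares solver is an implementation choice for illustration, not necessarily how Scope3 fits the regression.

```python
import numpy as np

# Each row holds the theoretical prefill and decode durations (seconds)
# for one benchmarked request; the values here are synthetic placeholders.
X = np.array([
    [0.14, 14.0],
    [0.05, 3.5],
    [0.30, 28.0],
    [0.02, 1.2],
])
# Measured end-to-end durations (seconds) for the same requests.
y = np.array([15.3, 3.9, 30.5, 1.4])

# Fit t_hat = beta_in * t_in + beta_out * t_out (no intercept term).
(beta_in, beta_out), *_ = np.linalg.lstsq(X, y, rcond=None)

def predicted_duration_s(t_in: float, t_out: float) -> float:
    """Predicted real-world duration from the fitted coefficients."""
    return beta_in * t_in + beta_out * t_out
```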
Many providers do not disclose the number of model parameters. For LLMs with unknown parameter counts, we use a linear regression model to produce a best estimate.
It is well-documented that the cost a provider charges for inference is a good predictor of model size.
For each popular provider, we fit a linear regression model to predict the number of parameters based on the cost, on a log-log scale (a sketch follows the definitions below):

$$\widehat{\ln(P_{mp})} = \beta_0^p + \beta_1^p \cdot \ln(c_{mp})$$

where:
$P_{mp}$ is the number of parameters in model $m$ from provider $p$
$c_{mp}$ is the cost of inference for model $m$ from provider $p$
$\beta_0^p$ and $\beta_1^p$ are the learned coefficients of the linear regression
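A sketch of one provider’s fit, assuming a handful of that provider’s models with publicly known parameter counts and inference prices. The figures are made up for illustration, and the use of scipy.stats.linregress is an implementation choice, not necessarily Scope3’s.

```python
import numpy as np
from scipy.stats import linregress

# For one provider: models with known parameter counts and inference prices.
# These figures are made-up placeholders, not real provider data.
params = np.array([7e9, 13e9, 70e9, 180e9])   # P_mp: known parameter counts
cost = np.array([0.2, 0.4, 1.8, 5.0])         # c_mp: e.g. $ per 1M tokens

# Fit ln(P_mp) = beta0 + beta1 * ln(c_mp) on a log-log scale.
fit = linregress(np.log(cost), np.log(params))

def predict_parameters(c: float) -> float:
    """Estimate a model's parameter count from its inference cost."""
    return float(np.exp(fit.intercept + fit.slope * np.log(c)))

# Example: estimate the size of a model priced at $1.00 per 1M tokens.
print(f"estimated parameters: {predict_parameters(1.0):.3g}")
```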
Once we have a prediction for each model-provider pair, we average the predictions, using the inverse of the squared standard error of each prediction as its weight, to get a final prediction per model:

$$\hat{P}_m = \frac{\sum_p w_{mp} \cdot \hat{P}_{mp}}{\sum_p w_{mp}}, \qquad w_{mp} = \frac{1}{SE_{mp}^2}$$
Instead of using the point estimate, we use the 95% confidence interval to produce a conservative estimate of the number of model parameters. This interval is constructed from the weighted standard error of the combined prediction, which is derived from the standard errors of the individual predictions, as above.
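The aggregation step can be sketched as follows, assuming each provider-level regression yields a point prediction of the parameter count together with a standard error. The normal-approximation 1.96 multiplier and the choice of the upper end of the 95% interval as the conservative value are assumptions made for this example.

```python
import numpy as np

def combine_estimates(preds: np.ndarray, ses: np.ndarray) -> tuple[float, float]:
    """Inverse-variance weighted average of per-provider parameter predictions.

    preds: per-provider predictions (P_mp hat) of the parameter count
    ses:   standard errors of those predictions (SE_mp)
    Returns the combined estimate (P_m hat) and its weighted standard error.
    """
    w = 1.0 / ses**2                        # w_mp = 1 / SE_mp^2
    p_hat = np.sum(w * preds) / np.sum(w)   # weighted average of predictions
    se_hat = np.sqrt(1.0 / np.sum(w))       # standard error of the weighted mean
    return float(p_hat), float(se_hat)

# Illustrative per-provider predictions (parameter counts) and standard errors.
preds = np.array([65e9, 72e9, 70e9])
ses = np.array([10e9, 5e9, 8e9])

p_hat, se_hat = combine_estimates(preds, ses)

# Conservative estimate: upper end of the 95% confidence interval
# (assumed here to use the normal-approximation multiplier of 1.96).
conservative = p_hat + 1.96 * se_hat
print(f"point estimate: {p_hat:.3g}, conservative estimate: {conservative:.3g}")
```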