# Inference

> Methodology for calculating the impact of running AI inference

## Overview

Inference is the process of an AI model responding to a request.

A common example is an AI chatbot. The request and the response consist of text, which is encoded in "tokens" (words or sub-words). Other common AI use cases include image generation, image modification, and video generation.

Typically a model is hosted on a cluster of servers or cloud instances, with the goal of maximizing throughput (tokens or responses per second) for a given latency (time to respond) constraint.

A consumer of AI inference can host a model locally, host a model on a cloud service (e.g. AWS, Together.ai), or use a managed service provider endpoint (e.g. OpenAI, Google Vertex AI).

Scope3's methodology for inference takes in a model, input and output parameters, and information about your inference service. It then estimates the energy required and environmental impact for that inference.

## Inference duration prediction

The main driver of energy usage and environmental impact is the duration of the inference request, i.e. how long the computation takes. Many factors influence the duration.
In particular, the model used (including the precision of its parameters, in bytes), the size of the input and output, and the hardware configuration have a large influence.

There are also many additional factors that determine how a model is executed including [batching strategies](https://www.anyscale.com/blog/continuous-batching-llm-inference#continuous-batching),
[paged attention](https://blog.vllm.ai/2023/06/20/vllm.html), and optimizations like quantization.
The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference with the flexibility to include future optimizations as they appear.

### Text-to-Text inference duration prediction

[Baseten](https://www.baseten.co/blog/llm-transformer-inference-guide/) summarizes LLM inference duration.
(For a deeper dive, see [kipply's post on inference arithmetic](https://kipp.ly/transformer-inference-arithmetic/)).
LLM inference duration can be broken down into two parts:

1. Input token ("prefill") duration - the time required to process all of the input tokens
2. Output token ("decode") duration - the time required to generate all of the output tokens

Prefill duration is compute bound whereas decode duration is memory bound:

$$
t_{in} = \frac{2 * P_m * n_{in}}{A_{f}(b)}
$$

$$
t_{out} = \frac{b * P_m * n_{out}}{A_{bw}}
$$

where:

* $P_m$ is the number of parameters in the model
* $n_{in}$ and $n_{out}$ are the number of input and output tokens
* $b$ is the precision of the model parameters, in bytes (e.g. 2, for FP16)
* $A_{f}(b)$ is the computational capacity of the GPU, in FLOP/s, for the given precision
* $A_{bw}$ is the memory bandwidth of the GPU, in bytes per second

The total duration is the sum:

$$
t_{total} = t_{in} + t_{out}
$$
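The two expressions above can be sketched in a few lines of Python. This is a minimal illustration, not Scope3's implementation: the function names and the hardware figures in the example are hypothetical, and the units are assumed to be FLOP/s for $A_{f}(b)$ and bytes per second for $A_{bw}$ so that the results come out in seconds.

```python
def prefill_seconds(params: float, n_in: int, flops_per_second: float) -> float:
    """Compute-bound prefill time: 2 * P_m * n_in / A_f(b)."""
    return 2 * params * n_in / flops_per_second


def decode_seconds(params: float, n_out: int, bytes_per_param: float,
                   bandwidth_bytes_per_second: float) -> float:
    """Memory-bound decode time: b * P_m * n_out / A_bw."""
    return bytes_per_param * params * n_out / bandwidth_bytes_per_second


# Hypothetical inputs: a 7e9-parameter model at FP16 (b = 2 bytes) on hardware
# assumed to deliver 312e12 FLOP/s and 2.0e12 bytes/s of memory bandwidth.
t_in = prefill_seconds(params=7e9, n_in=1000, flops_per_second=312e12)
t_out = decode_seconds(params=7e9, n_out=200, bytes_per_param=2,
                       bandwidth_bytes_per_second=2.0e12)
t_total = t_in + t_out
print(f"prefill={t_in:.4f}s decode={t_out:.4f}s total={t_total:.4f}s")
```

With these illustrative numbers the decode term dominates the total, which is typical when the output is more than a handful of tokens.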

These expressions are theoretical, so we fit a linear regression to benchmarked [inference speed data](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md)
to predict real-world inference duration:

$$
\hat{t_{total}} = \beta_{in} * t_{in} + \beta_{out} * t_{out}
$$

where:

* $\beta_{in}$ and $\beta_{out}$ are the learned coefficients of the linear regression
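As a rough sketch of this fitting step, the snippet below solves the two-predictor least-squares problem without an intercept, matching the form of the equation above. The benchmark rows are synthetic and the function name is hypothetical; the actual fitting procedure and data preparation may differ.

```python
def fit_no_intercept(t_in, t_out, measured):
    """Least-squares fit of measured ~ beta_in * t_in + beta_out * t_out."""
    # Sums that make up the 2x2 normal equations.
    s_ii = sum(a * a for a in t_in)
    s_oo = sum(b * b for b in t_out)
    s_io = sum(a * b for a, b in zip(t_in, t_out))
    s_iy = sum(a * y for a, y in zip(t_in, measured))
    s_oy = sum(b * y for b, y in zip(t_out, measured))
    det = s_ii * s_oo - s_io * s_io
    # Cramer's rule for the two coefficients.
    beta_in = (s_iy * s_oo - s_oy * s_io) / det
    beta_out = (s_oy * s_ii - s_iy * s_io) / det
    return beta_in, beta_out


# Synthetic "benchmarks": theoretical durations and measured durations that
# were generated with known coefficients (1.5 and 1.2) for illustration.
theory_in = [0.01, 0.02, 0.04, 0.08]
theory_out = [0.5, 1.0, 0.7, 1.4]
observed = [1.5 * a + 1.2 * b for a, b in zip(theory_in, theory_out)]

beta_in, beta_out = fit_no_intercept(theory_in, theory_out, observed)
predicted = beta_in * 0.03 + beta_out * 0.9  # prediction for a new request
```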

## Energy calculation

Given the predicted inference duration, we can calculate the energy use per inference:

$$
E = \hat{t_{total}} * \left[ P_{idle} + \left(N_{gpu} * P_{gpu}\right) + \left(N_{cpu} * P_{cpu}\right) \right]
$$

where:

* $P_{idle}$, $P_{gpu}$, and $P_{cpu}$ are the power consumption of the idle server, GPU and CPU respectively
* $N_{gpu}$ and $N_{cpu}$ are the number of GPUs and CPUs on the server
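The energy equation translates directly into code. The sketch below assumes power values in watts and duration in seconds, and converts the result to watt-hours; the function name and the example server configuration are hypothetical.

```python
def energy_per_inference_wh(duration_s: float, p_idle_w: float,
                            n_gpu: int, p_gpu_w: float,
                            n_cpu: int, p_cpu_w: float) -> float:
    """E = t_total * [P_idle + N_gpu * P_gpu + N_cpu * P_cpu], in watt-hours."""
    power_w = p_idle_w + n_gpu * p_gpu_w + n_cpu * p_cpu_w
    joules = duration_s * power_w  # watts * seconds
    return joules / 3600.0         # 1 Wh = 3600 J


# Hypothetical server: 200 W idle, 8 GPUs at 700 W each, 2 CPUs at 350 W each,
# serving a request with a predicted duration of 1.8 s.
energy_wh = energy_per_inference_wh(1.8, 200.0, 8, 700.0, 2, 350.0)
print(f"{energy_wh:.3f} Wh")
```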

### Cluster energy use

The cluster is defined by:

* [Cluster](/cluster) details
* Cluster location(s) - if hosted in a cloud, which region(s)

## Calculating emissions and water usage for an inference request

Calculate embodied emissions and water usage using predicted inference duration (see [Server cluster](/cluster) for more information about embodied emissions):

```
embodied_emissions_per_second = EmbEm(1) / 3600
embodied_emissions_per_inference = predicted_inference_duration x embodied_emissions_per_second

embodied_h2o_per_second = EmbH2O(1) / 3600
embodied_h2o_per_inference = predicted_inference_duration x embodied_h2o_per_second
```

Calculate the training and fine-tuning emissions using:

* The base model used including the current [amortized training cost per inference](/training#amortization-of-impact-across-use-life)
* The current [amortized fine-tuning cost per inference](/fine_tuning#amortization-of-fine-tuning-impact-across-use-life) if applicable

```
training_emissions_per_inference = training_emissions_per_inference + fine_tuning_emissions_per_inference
training_h2o_per_inference = training_h2o_per_inference + fine_tuning_h2o_per_inference
```

Sum all emissions and water use to produce total impact for an inference request:

```
inference_emissions = usage_energy_per_inference x average_grid_intensity +
                    embodied_emissions_per_inference +
                    training_emissions_per_inference

inference_h2o_impact = usage_energy_per_inference x cluster_WUE +
                    embodied_h2o_per_inference +
                    training_h2o_per_inference
```
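The pseudocode in this section could be combined into a single function along the following lines. This is a sketch under stated assumptions: `EmbEm(1)` and `EmbH2O(1)` are taken to mean embodied impact for one hour of cluster use (hence the division by 3600), all names are hypothetical, and the caller is responsible for using consistent units (for example kWh, gCO2e/kWh, and L/kWh).

```python
from dataclasses import dataclass


@dataclass
class InferenceImpact:
    emissions: float
    water: float


def inference_impact(duration_s, usage_energy, grid_intensity, cluster_wue,
                     embodied_emissions_per_hour, embodied_h2o_per_hour,
                     training_emissions=0.0, training_h2o=0.0):
    """Sum usage, embodied, and amortized training impact for one request."""
    # Assumption: embodied figures are per hour, so scale to the request duration.
    embodied_emissions = duration_s * embodied_emissions_per_hour / 3600.0
    embodied_h2o = duration_s * embodied_h2o_per_hour / 3600.0
    emissions = (usage_energy * grid_intensity
                 + embodied_emissions + training_emissions)
    water = usage_energy * cluster_wue + embodied_h2o + training_h2o
    return InferenceImpact(emissions=emissions, water=water)


# Illustrative values only; training terms already include any fine-tuning share.
impact = inference_impact(duration_s=3.6, usage_energy=0.002,
                          grid_intensity=400.0, cluster_wue=1.8,
                          embodied_emissions_per_hour=100.0,
                          embodied_h2o_per_hour=50.0,
                          training_emissions=0.05, training_h2o=0.01)
```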

## Closed Model Parameter Estimation

Many providers do not disclose the number of model parameters. For LLMs with unknown parameter counts, we use a linear regression model to produce a best estimate.
It is [well-documented](https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller) that the cost a provider charges for inference is a good predictor of model size.
For each popular provider, we fit a linear regression model to predict the number of parameters based on the cost, on a log-log scale:

$$
\widehat{\ln(P_m^p)} = \beta_{0}^p + \beta_{1}^p * \ln(c_m^p)
$$

where:

* $P_m^p$ is the number of parameters in model $m$ from provider $p$
* $c_m^p$ is the cost of inference for model $m$ from provider $p$
* $\beta_{0}^p$ and $\beta_{1}^p$ are the learned coefficients of the linear regression
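For a single provider, the log-log fit can be sketched with ordinary least squares on the log-transformed values. The data below is synthetic, the function names are hypothetical, and converting back from the log scale with a plain exponential (no bias correction) is an assumption of this sketch.

```python
import math


def fit_log_log(costs, params):
    """Fit ln(params) = beta0 + beta1 * ln(cost) by ordinary least squares."""
    xs = [math.log(c) for c in costs]
    ys = [math.log(p) for p in params]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def predict_params(beta0, beta1, cost):
    """Predicted parameter count for a model with the given inference cost."""
    return math.exp(beta0 + beta1 * math.log(cost))


# Synthetic open-model data for one provider: params = 1e10 * cost ** 0.8
known_costs = [0.2, 0.5, 1.0, 5.0]
known_params = [1e10 * c ** 0.8 for c in known_costs]

beta0, beta1 = fit_log_log(known_costs, known_params)
estimate = predict_params(beta0, beta1, cost=2.0)
```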

Once we have a prediction for each model-provider pair, we average the predictions, weighting each by the inverse of its squared [standard error of the prediction](https://people.duke.edu/~rnau/mathreg.htm), to get a final prediction per model:

$$
\hat{P_m} = \frac{\sum_p{w_m^p * \hat{P_m^p}}}{\sum_p{w_m^p}}
$$

$$
w_m^p = \frac{1}{{SE_m^p}^2}
$$
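The weighted average above is a standard inverse-variance weighting and can be sketched as follows. The function name and the example numbers are hypothetical, and the standard errors are assumed to be expressed on the same scale as the predictions.

```python
def weighted_parameter_estimate(predictions, standard_errors):
    """Inverse-variance weighted mean: sum(w * P) / sum(w), with w = 1 / SE^2."""
    weights = [1.0 / (se ** 2) for se in standard_errors]
    weighted_sum = sum(w * p for w, p in zip(weights, predictions))
    return weighted_sum / sum(weights)


# Two hypothetical provider-level predictions for the same model.
# The second has twice the standard error, so it gets a quarter of the weight.
p_hat = weighted_parameter_estimate([70e9, 100e9], [1.0, 2.0])
print(f"{p_hat:.3e}")
```

Predictions with smaller standard errors pull the final estimate toward themselves, so a provider whose pricing tracks model size closely has more influence.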

### Conservative Estimates

Instead of using the point estimate prediction, we use the 95% confidence interval to produce a conservative estimate of the number of model parameters.
The method uses the weighted standard error of the prediction, which uses the standard errors from each individual prediction, as above.
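One possible reading of this step is sketched below. It rests on several assumptions that the text does not confirm: that "conservative" means the upper bound of the interval, that the combined standard error is the usual inverse-variance pooled value $\sqrt{1 / \sum_p w_m^p}$, and that a normal critical value of 1.96 is used. All names are hypothetical.

```python
import math


def conservative_parameter_estimate(predictions, standard_errors, z=1.96):
    """Upper bound of an approximate 95% interval around the weighted mean."""
    weights = [1.0 / (se ** 2) for se in standard_errors]
    total_weight = sum(weights)
    point = sum(w * p for w, p in zip(weights, predictions)) / total_weight
    # Assumed pooled standard error for an inverse-variance weighted mean.
    pooled_se = math.sqrt(1.0 / total_weight)
    return point + z * pooled_se


# Hypothetical provider-level predictions and standard errors (in parameters).
upper = conservative_parameter_estimate([70e9, 80e9], [3e9, 4e9])
print(f"{upper:.4e}")
```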

## Various Additional Research Findings

Inference speed may be limited by the framework or by compute. Maximum performance will be achieved when models are not limited by [framework overhead](https://arxiv.org/pdf/2302.06117). In an optimal scenario, inference can use the full capacity and power of a GPU, as described by [Towards Pareto Optimal Throughput in Small Language Model Serving](https://arxiv.org/pdf/2404.03353).
