How to estimate and reduce the carbon footprint of machine learning models

Two ways to easily estimate the carbon footprint of machine learning models and 17 ideas for how you might reduce it

Published in Towards Data Science · 22 min read · Dec 1, 2022

Photo by Appolinary Kalashnikova on Unsplash

The environmental impact of machine learning models is receiving increasing attention, though mostly from academia. There, the conversation tends to focus on the carbon footprint of language models, which are not necessarily representative of the machine learning field in general, and too little attention is paid to the environmental impact of the operations phase of machine learning models.

In addition, existing material on the environmental impact of machine learning puts too little emphasis on how that impact can actually be estimated and reduced. This article is an attempt to address these issues and is written for practitioners and researchers alike who do hands-on machine learning. Although this post was written with machine learning in mind, some of the content is also applicable to general software engineering.

The article is structured as follows:

First, we’ll take a look at some concrete examples of carbon emissions from machine learning.
Then, I’ll present two tools that can be used to estimate the carbon footprint of a machine learning model.
In section 3, I’ll present 17 ideas for how you might reduce the carbon footprint of your machine learning related work.
Finally, I’ll present some machine learning sustainability considerations that are not addressed directly here, but which are nonetheless important.

Before we begin, I want to emphasise that this post was not written to point fingers or moralise. The aim is simply to present information that you may or may not find relevant to your daily activities.

1. The environmental impact of data science and machine learning

All software, from the apps that run on our phones to the data science pipelines that run in the cloud, consumes electricity. As long as not all of our electricity is generated from renewable energy sources, electricity consumption will have a carbon footprint. This is why machine learning models can have a carbon footprint.

I will use the term “carbon footprint” to refer to the amount of CO₂e emissions, where “e” is short for “equivalents.” Since other gases such as methane, nitrous oxide or even water vapor also have a warming effect, a standardised measure for describing how much warming a given amount of gas will have is often provided in CO₂-equivalents (CO₂e) for simplification purposes.

To my knowledge, as of this writing, no reports exist that attempt to estimate the total carbon footprint of machine learning as a field. This conclusion has recently been echoed by [1].

However, multiple attempts have been made at estimating the total electricity consumption of the global data center industry. Estimates of global energy consumption from data centers reported in 2015 and 2016 ranged from 3–5 % of global electricity consumption [2][3]. More recently, some claim data centers account for 1 % of global electricity use [4][5]. It is my impression from having reviewed some of the literature on this topic that the 1 % estimate is more accurate than the 5 % estimate.

If we take a look at the energy consumption from machine learning at the organisational level, Google says that 15 % of the company’s total energy consumption went towards machine learning related computing across research, development, and production [6]. NVIDIA has estimated that 80–90% of the machine learning workload is inference processing [7]. Similarly, Amazon Web Services has stated that 90% of the machine learning demand in the cloud is for inference [8]. This is much higher than the estimates put forward by an unnamed large cloud compute provider in a recent OECD report [1]. This provider estimates that 7–10% of enterprise customers’ total spend on compute infrastructure goes towards AI applications, with 3–4.5% used for training machine learning models and 4–4.5% spent on inference.

Let’s now take a look at some concrete examples of the carbon footprint of machine learning models.

Table 1 shows an overview of the estimated energy consumption and carbon footprint of some large language models. The carbon footprints range from 3.2 tons to 552 tons of CO₂e. Now you may notice that the energy consumption of PaLM is significantly larger than that of GPT-3, although GPT-3’s estimated carbon footprint is more than twice that of PaLM. This is likely due to differences in the carbon intensity of the electricity used to train the models. To put the carbon footprint of these large language models into perspective, the average American generates 16.4 tons of CO₂e emissions in a year [9] and the average Dane generates 11 tons of CO₂e emissions [10]. So the carbon footprint of GPT-3 is roughly that of the annual carbon footprint of 50 Danes.

Table 1: Energy consumption of 7 large deep learning models. Adapted from [6] and [11]

Looking at a few examples of the carbon footprint of running inference with language models, Facebook has estimated that the carbon footprint of their Transformer-based Universal Language Model for text translation is dominated by the inference phase, which uses far more resources (65%) than training (35%) [40]. The average carbon footprint of ML training tasks at Facebook is 1.8 times larger than that of Meena, a model used in modern conversational agents, and 0.3 times that of GPT-3.

BLOOM, the open source language model by BigScience, consumed 914 kWh of electricity and emitted 360 kg of CO₂e over an 18 day period in which it handled 230,768 requests, corresponding to a footprint of roughly 1.56 gCO₂e per request [41].

These large language models are extreme cases in terms of their carbon footprint, and very few data scientists will ever be involved in training such large models from scratch. So let’s now consider a case that is likely to be closer to the everyday life of most data scientists and ML engineers. I recently trained a Transformer model for time series forecasting for 3 hours on a GPU in Frankfurt. This consumed 0.57 kWh of electricity and had a carbon footprint of 340 grams of CO₂e, which corresponds to roughly 0.003 % of the annual carbon footprint of one Dane.

Historically, the amount of compute needed to obtain state of the art results has doubled every 3.4 months [12], so the problem may compound, a worry that is echoed in [13], although some expect energy consumption to decrease due to algorithmic and hardware improvements [6].

To sum up this section, it is difficult to paint a full picture of the carbon footprint of machine learning as a field due to a lack of data, but specific models, particularly language models, can have large carbon footprints, and some organisations spend a relatively large amount of their total energy usage on machine learning related activities.

2. How to estimate the carbon footprint of a machine learning model

Before we dive into the specifics of the tools that can estimate the carbon footprint of your machine learning models, it is helpful to familiarise ourselves with a formula for computing carbon footprint. It is strikingly simple:

Carbon footprint = E * C

where E is the number of electricity units consumed during some computational procedure. This can be quantified in kilowatt-hours (kWh). C is the amount of CO₂e emitted from producing one such unit of electricity. This can be quantified as kg of CO₂e emitted per kilowatt-hour of electricity and is sometimes referred to as the carbon intensity of electricity. The carbon intensity varies between geographic regions, because the energy sources vary between regions. Some regions have a lot of renewable energy, some have less. Given this equation, we can now see that any tool that estimates the carbon footprint of some computational procedure must measure or estimate E and C.

Several tools exist for estimating the carbon footprint of machine learning models. It has been my experience that these tools fall into one of two categories:

Tools that estimate carbon footprint from estimates of E (energy consumption)
Tools that estimate carbon footprint from measurements of E (energy consumption)

In this post we’ll take a closer look at two such tools:

ML CO2 Impact, which relies on estimates of E and thus falls into category 1 above
CodeCarbon, which relies on measurements of E and thus falls into category 2 above

Note that other software packages, e.g. carbontracker [14] and experiment-impact-tracker [15] provide similar functionality to CodeCarbon, but I’ve chosen to focus on CodeCarbon as this package seems to be continuously updated and expanded, whereas the most recent commits to carbontracker and experiment-impact-tracker were made long ago.

It must be noted that the tools presented in this post only estimate the carbon footprint of the electricity used for some computational procedure. They do not, for instance, take into account the emissions associated with manufacturing the hardware on which the code was run.

2.1. Estimating machine learning model carbon footprint with ML CO2 Impact

The free web-based tool ML CO2 Impact [16] estimates the carbon footprint of a machine learning model by estimating the electricity consumption of the training procedure. To obtain a carbon footprint estimate with this tool, all you have to do is input the following parameters:

Hardware type (e.g. A100 PCIe 40/80 GB)
Number of hours the hardware was used
Which cloud provider was used
In which cloud region the compute took place (e.g. “europe-north1”)

The tool then outputs how many kilograms of CO₂e your machine learning model emitted. It is calculated as:

Power consumption * Time * Carbon Produced Based on the Local Power Grid, e.g.:

250W x 100h = 25 kWh x 0.56 kg eq. CO2/kWh = 14 kg CO₂e
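The calculation above can be sketched in a few lines of Python. This is a minimal sketch of the formula only, not ML CO2 Impact's actual implementation; the function and argument names are my own, and the numbers are the ones from the example above.

```python
def carbon_footprint_kg(power_watts: float, hours: float, kg_co2e_per_kwh: float) -> float:
    """Estimate emissions as E * C: energy used (kWh) times carbon intensity (kg CO2e/kWh)."""
    energy_kwh = power_watts / 1000 * hours  # E: watts -> kilowatts, multiplied by time
    return energy_kwh * kg_co2e_per_kwh      # E * C


# Values from the example above: 250 W for 100 hours on a 0.56 kg CO2e/kWh grid
print(round(carbon_footprint_kg(250, 100, 0.56), 2))  # 14.0
```

Swapping in the carbon intensity of another cloud region shows how much the same job would have emitted elsewhere.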

The tool also shows what the emission would have been in a cloud region with a lower carbon intensity.

The benefit of a tool like ML CO2 Impact is that it can be used post-hoc to estimate the carbon footprint of your own or other people’s models, and you don’t have to edit your scripts to compute the estimates.

The downside of a tool like ML CO2 Impact is that it relies on estimates of energy consumption, which naturally means that its carbon footprint estimates can be off. In fact, such tools can be off by a ratio of 2.42 as illustrated by Figure 1 below.

2.2. Estimating machine learning model carbon footprint with CodeCarbon

CodeCarbon [18] is a software package that is available for Python amongst other languages and can be installed by running pip install codecarbon from your command prompt. CodeCarbon computes the carbon footprint of your code like this:

At fixed intervals, e.g. 15 seconds, CodeCarbon directly measures the electricity consumption of the GPU, CPU and RAM on which your code is executed. Note that CodeCarbon works both on your local machine and on cloud machines. The package also monitors the duration of your code execution and uses this information to compute the total electricity consumption of your code. Then, CodeCarbon retrieves information about the carbon intensity of the electricity in the geographic location of your hardware. If you train in the cloud, CodeCarbon automatically retrieves information about the location of your cloud instance. The carbon intensity of the electricity is then multiplied by the amount of electricity consumed by the code to provide an estimate of the total carbon emissions of the electricity consumed by the code.

The tool can be used in your code in several ways. One way is to initialise a tracker object. When the tracker object is stopped, CodeCarbon’s default behavior is to save the results to a .csv file, which contains information about how much electricity in kWh your code consumed and how much CO₂e in kg this electricity emitted. Instead of writing to the file system, the information can be sent to a logger [30]. Suppose you have a function, train_model(), which executes model training. You can then use CodeCarbon like this:

Another way is to use CodeCarbon as a context manager like this:

Finally, CodeCarbon can be used as a function decorator:

Note that if the constructor argument log_level is set to its default, CodeCarbon will print out several lines of text every time it pings your hardware for its energy consumption. This will quickly drown out other information that you may be interested in viewing in your terminal during model training. If you instead set log_level="error", CodeCarbon will only print to the terminal if it encounters an error.

It is also possible to visualise the energy consumption and emissions, and the tool can also recommend cloud regions with lower carbon intensity [19].

2.2.1. Additional information about CodeCarbon methodology

Carbon intensity (C) is a weighted average of the emissions produced by the energy sources (e.g. coal, wind) used in the energy grid in which the computation is performed. The energy sources that are used in a local energy grid are called the energy mix. This means that the carbon footprint that CodeCarbon reports is not the actual carbon footprint, but an estimate. The carbon intensity of electricity varies throughout the day, so a more accurate approach to computing carbon footprint would be to use real-time electricity carbon intensity. The package assumes the same carbon intensity of an energy source regardless of where in the world the energy is produced, i.e. the carbon intensity of coal is considered to be the same in both Japan and Germany for instance. All renewable energy is ascribed a carbon intensity of 0.

Power Consumed (E) is measured in kWh and is obtained by tracking the power supply to the hardware at frequent time intervals. The default interval is 15 seconds, but it can be configured with the measure_power_secs constructor argument.

3. How to reduce the carbon footprint of your machine learning models

Before we take a closer look at how you can reduce the carbon footprint of your machine learning models, let’s make sure we’re on the same page terminology-wise.

In the following, I will sometimes distinguish between the carbon footprint from training a machine learning model and the carbon footprint from using the machine learning model. The former will be referred to as “model training” or “model development.” The latter will be referred to as the “operations” phase of the machine learning model’s life cycle or using the model for inference.

I also want to point out that I deliberately do not touch upon carbon offsets. This is not because I am against offsets, but because I want to focus on reducing carbon emissions through elimination. Check out Asim Hussain’s great talk about carbon footprint terminology for more on the distinction between offsets and elimination [20].

3.1. Begin to estimate the footprint of your work

The first recommendation is simple, and it’s not about how you can reduce the footprint of a single model, but how you may reduce the footprint of your work in general: Begin estimating the carbon footprint of your models. If you do so, you will be able to factor in this information during model selection. I was recently in a situation where my best model obtained an MAE that was 13 % lower than the second best model, but at the same time, the best model had a carbon footprint that was roughly 9,000 % larger. Should you trade a decrease in model error for a large increase in carbon footprint? Well, that’s obviously very context dependent, and ultimately it should probably be decided by the business based upon data provided by you, the data scientist or ML engineer.

3.2. Specify a model carbon footprint budget

Page weight budget is a term from web development. A page weight budget specifies how much a website is allowed to “weigh” in kilobytes of files. Specifically, it is the size of files transferred over the internet when a webpage is loaded [21]. It is important that the page weight budget is specified before development begins and the budget should act as a guiding star in the whole process from design to implementation.

In data science, we could cultivate a similar concept, for instance by setting a limit on how much we will allow a machine learning model to emit during its lifetime. A simpler metric would be the carbon footprint per inference.
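As a rough illustration, a per-inference budget check could look like the sketch below. The function names and the budget value are hypothetical; the emissions and request counts are the BLOOM figures cited in section 1.

```python
def footprint_per_inference_g(total_emissions_kg: float, n_requests: int) -> float:
    """Average emissions per request in grams of CO2e."""
    return total_emissions_kg * 1000 / n_requests


def within_budget(total_emissions_kg: float, n_requests: int, budget_g: float) -> bool:
    """True if the average footprint per request stays at or below the budget."""
    return footprint_per_inference_g(total_emissions_kg, n_requests) <= budget_g


# BLOOM deployment figures from section 1: 360 kg CO2e over 230,768 requests
per_request = footprint_per_inference_g(360, 230_768)
print(f"{per_request:.2f} g CO2e per request")  # 1.56 g CO2e per request

# Hypothetical budget of 2 g CO2e per request
print(within_budget(360, 230_768, budget_g=2.0))  # True
```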

3.3. Don’t build a sportscar if a skateboard can take you where you want to go

This next recommendation may sound like a no-brainer, but I think it’s worth reminding ourselves every now and then that you don’t always need very complex solutions for your problem.

When you begin a new project, start by computing a reasonable, but cheap baseline. For time series forecasting, such a baseline could be to use the value at t-n (n could be 24 if your data has an hourly resolution and exhibits daily seasonality) as the predicted value for t. In natural language processing, a reasonable baseline may be some heuristics implemented as regex. And in regression, a good baseline might be to use the mean of your target variable, potentially grouped by some other variable.
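The time series baseline described above can be sketched as follows. This is a minimal, dependency-free sketch; the function names are my own and the toy data is made up.

```python
def seasonal_naive_forecast(history: list[float], horizon: int, season_length: int = 24) -> list[float]:
    """Predict each future step t with the observed value at t - season_length."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    # Repeat the last observed season for as many steps as requested
    return [last_season[step % season_length] for step in range(horizon)]


def mean_absolute_error(actual: list[float], predicted: list[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy hourly series with a perfectly repeating daily pattern (made-up numbers)
day = [float(hour % 12) for hour in range(24)]
history = day * 3
forecast = seasonal_naive_forecast(history, horizon=24)
print(mean_absolute_error(day, forecast))  # 0.0
```

On real data the error will of course not be zero, but the resulting MAE gives you a number that any more complex and more energy-hungry model has to beat.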

Once you have a reasonable baseline, you can compare against this to see if more complex solutions are worth the extra work and carbon emissions.

3.3. Test what happens if you go small

If you decide that a neural network architecture is suitable for your problem, don’t blindly select a feed forward layer dimension of 2048 because that’s what the research papers do. I was recently able to outperform a large model on accuracy with a model that had very small layers (less than 256 neurons in each layer, sometimes much less). The two models had the same number and types of layers, but one just had very small layers. On top of being more accurate, it was also faster during inference.

3.4. Train models where the energy is cleaner

If you’re using a cloud provider like Google Cloud, AWS or Azure, you can choose in which region to run your computational procedure. Research has shown that emissions can be reduced by up to 30x just by running experiments in regions powered by more renewable energy sources [22]. Table 2 below shows how many kg CO2 eq are emitted by using 100 hours of compute on an A100 GPU on the Google Cloud Platform in various regions. It is clear that the carbon intensity of electricity varies greatly between regions.

3.5. Train models when the energy is cleaner

The carbon intensity of electricity can vary from day to day and even from hour to hour, as illustrated by Figures 2 and 3 below, which show the average carbon intensity in g/kWh for each hour of the day and each day of the week respectively in eastern Denmark in the period from January 1, 2022 to October 7, 2022. Contrary to popular belief, this data shows that the electricity is cleaner around noon than at night.

You can reduce the carbon footprint of your work by scheduling heavy workloads for those periods where the energy is cleaner. If you’re not in an extreme hurry to train a new model, a simple idea is to start your model training when the carbon intensity of electricity in your cloud region is below a certain threshold, and then put the training on hold when the carbon intensity is above some threshold. With PyTorch, you can easily save and load your model. You can hardcode or manually provide the hours of the day when your region has clean energy, or you can obtain the data from a paid service like Electricity Map, which provides access to real time data and forecasts on the carbon intensity of electricity in various regions.
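The start-and-pause logic described above can be sketched as follows. This is only a sketch: get_carbon_intensity, train_one_epoch and save_checkpoint are hypothetical callables that you would supply yourself (for instance a call to a carbon intensity data service and your own PyTorch training and saving code), and the threshold values are made up.

```python
import time
from typing import Callable


def carbon_aware_training(
    n_epochs: int,
    get_carbon_intensity: Callable[[], float],  # hypothetical: returns g CO2e/kWh for your region
    train_one_epoch: Callable[[int], None],     # hypothetical: runs one epoch of training
    save_checkpoint: Callable[[int], None],     # hypothetical: persists the model state
    start_below: float = 150.0,                 # made-up threshold: (re)start below this intensity
    pause_above: float = 250.0,                 # made-up threshold: pause above this intensity
    poll_seconds: float = 900.0,
) -> None:
    """Train epoch by epoch, pausing while the grid's carbon intensity is high."""
    running = False
    epoch = 0
    while epoch < n_epochs:
        intensity = get_carbon_intensity()
        if running and intensity > pause_above:
            save_checkpoint(epoch)  # put training on hold
            running = False
        elif not running and intensity < start_below:
            running = True          # energy is clean enough to (re)start
        if running:
            train_one_epoch(epoch)
            epoch += 1
        else:
            time.sleep(poll_seconds)  # wait before checking the grid again
```

Using two thresholds rather than one avoids rapidly toggling between training and pausing when the carbon intensity hovers around a single cut-off.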

Fig. 2. Average carbon intensity of electricity in eastern Denmark per hour of the day. Graph by author. Data from https://www.energidataservice.dk/tso-electricity/CO2Emis

Fig. 3. Average carbon intensity of electricity in eastern Denmark per day of the week. Graph by author. Data from https://www.energidataservice.dk/tso-electricity/CO2Emis

3.6. Optimise hyperparameters for energy consumption and model accuracy

When you run your hyperparameter optimisation procedure, you can define a dual-objective optimisation problem in which you optimise for both model accuracy and energy consumption. This will reduce the model’s energy consumption, and thus its carbon footprint, in the operations phase. This was done by Derczynski et al. [23], who specified a combined measure of perplexity (an accuracy metric in NLP) and energy consumption in a masked language modelling task using a Transformer architecture and Bayesian hyperparameter optimisation. To identify the models that could not be improved further on one objective without sacrificing the other, they identified the Pareto-optimal models. They found that the most important parameters appeared to be the number of hidden layers, the activation function and the position embedding type.
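To make the notion of Pareto-optimal models concrete, here is a minimal sketch of how they could be identified from a set of results. It is not the procedure used in [23]; the function name, configuration names and numbers are all made up for illustration.

```python
def pareto_optimal(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return the models that no other model beats on both objectives.

    Each value is (perplexity, energy_kwh); lower is better for both.
    """
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        # a dominates b if it is no worse on both objectives and strictly better on one
        return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

    return [
        name
        for name, scores in models.items()
        if not any(dominates(other, scores) for other in models.values())
    ]


# Made-up (perplexity, energy in kWh) results for four hyperparameter configurations
results = {
    "config_a": (12.0, 5.0),
    "config_b": (10.0, 9.0),
    "config_c": (13.0, 6.0),  # worse than config_a on both objectives
    "config_d": (9.5, 14.0),
}
print(pareto_optimal(results))  # ['config_a', 'config_b', 'config_d']
```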

3.7. Be mindful of your activation function

The selection of an activation function can greatly influence the time your model takes to train. As seen in Fig. 4 below, Derczynski [24] demonstrated that the time it took to train an image classification model on the MNIST dataset to 90 % accuracy varied from a few seconds to more than 500 seconds. Aside from demonstrating that the choice of activation function influences training time, Derczynski also found that

activation function choice appears to have more effect in situations where inference is performed over smaller sets at a time
applications should be analysed and tuned on the target hardware if one is to avoid particularly costly activation functions

3.8. Distill large models

By distilling large models, you can reduce the model’s carbon footprint in production. Model distillation can be thought of as the process of transferring knowledge from a larger model to a smaller model. Sometimes this is achieved by training a smaller model to imitate the behavior of a larger, already trained model.

One example of a distilled pre-trained language model is DistilBERT. Compared to its un-distilled version, BERT, DistilBERT is 40 % smaller in terms of its number of parameters and 60 % faster in inference while maintaining 97 % of its language understanding as measured by GLUE [25]. Model distillation has also been applied successfully to image recognition [26], and I would venture it can also be used in other domains, for instance in time series forecasting with neural networks.

Model distillation can be an effective way to produce a more computationally efficient model without sacrificing accuracy. But if you apply model distillation with the objective of reducing the model’s carbon footprint, make sure that the added carbon emissions that result from the distillation procedure do not outweigh the emissions savings you will obtain during the model’s time in production.
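One way to reason about this trade-off is a simple break-even calculation. The sketch below is my own illustration with made-up numbers, and it assumes that per-inference emissions stay constant over the model's lifetime.

```python
import math


def break_even_inferences(distillation_kg: float, large_g_per_inference: float, small_g_per_inference: float) -> int:
    """Number of inferences after which distillation has paid back its own emissions."""
    saving_g = large_g_per_inference - small_g_per_inference
    if saving_g <= 0:
        raise ValueError("the distilled model must emit less per inference")
    return math.ceil(distillation_kg * 1000 / saving_g)


# Made-up numbers: distillation emits 50 kg CO2e and cuts
# per-inference emissions from 1.5 g to 0.5 g CO2e
print(break_even_inferences(50, 1.5, 0.5))  # 50000
```

If the model is expected to serve fewer requests than the break-even number during its time in production, distillation would increase the total footprint rather than reduce it.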

3.9. Don’t blindly throw more compute at your problem

I suppose most of us would intuitively think that reducing the execution time of some procedure reduces its carbon footprint. It is my understanding that this is the case if you speed up your code by simply writing better code that executes faster. However, speeding up your programs by throwing more compute power at them will only make them greener to a certain extent, at least according to the results of the study from which Fig. 5 originates [27].

The authors measured the processing time and the carbon footprint of running a particle physics simulation on a CPU with a varying number of cores. Their results are displayed in Fig. 5. The green line shows the running time, i.e. how long it took to run the simulation. The orange line shows the carbon footprint from running the simulation. It is worth noting that when they doubled the number of CPU cores used in the simulation from 30 to 60, execution time barely decreased, and the carbon footprint increased from around 300 gCO₂e to more than 450 gCO₂e.

From this the authors conclude that generally, if the reduction in running time is lower than the relative increase in the number of cores, distributing the computations will worsen the carbon footprint. For any parallelised computation, there is likely to be a specific optimal number of cores for minimal GHG emissions.
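The authors' rule of thumb can be illustrated with a back-of-the-envelope sketch. It rests on a simplifying assumption of my own, namely that energy use scales with the number of cores multiplied by the running time, and the timings are made up rather than taken from the study.

```python
def footprint_ratio(cores_before: int, time_before: float, cores_after: int, time_after: float) -> float:
    """Relative change in footprint, assuming energy scales with cores * running time."""
    return (cores_after * time_after) / (cores_before * time_before)


# Made-up timings: doubling the cores from 30 to 60 only cuts
# the running time from 100 to 80 minutes
ratio = footprint_ratio(30, 100.0, 60, 80.0)
print(ratio)  # 1.6, i.e. a larger footprint despite the shorter running time
```

A ratio above 1 means that the extra cores cost more than the shorter running time saves.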

3.10. Stop model training early if your model underperforms or performance plateaus

Another thing you might do to reduce your carbon footprint is to make sure that you’re not wasting resources on training a model that either does not converge or has already converged and likely won’t improve any further. A tool for achieving this is the early stopping module from pytorchtools [28] by Bjarte Sunde. It is very easy to use. You create an EarlyStopping object, and after each epoch you pass the model’s average loss to the object along with the model. If the loss has improved, the EarlyStopping object makes a checkpoint of the model, which means it saves the model parameters at that given epoch. If the loss has not improved for a number of epochs defined through the constructor argument “patience”, the training loop is exited.
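To show the mechanism, here is a simplified, dependency-free sketch of early stopping. It is not the pytorchtools implementation: the class below is my own, it takes a plain dictionary instead of a PyTorch model, and the loss values are made up.

```python
import copy


class EarlyStopping:
    """Simplified early stopping; not the pytorchtools implementation."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None  # checkpoint of the best parameters seen so far
        self.epochs_without_improvement = 0
        self.early_stop = False

    def __call__(self, loss: float, model_state: dict) -> None:
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = copy.deepcopy(model_state)  # checkpoint
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                self.early_stop = True


# Made-up validation losses that plateau after the third epoch
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(losses):
    stopper(loss, {"epoch": epoch})  # a real loop would pass the model's parameters
    if stopper.early_stop:
        break

print(epoch, stopper.best_loss)  # 5 0.6
```

In this toy run, training stops after epoch 5 instead of running all 8 epochs, and the checkpoint from epoch 2 holds the best parameters.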

3.11. Don’t use grid search for hyperparameter tuning

Hyperparameters are the “annoying” [29, p. 7], but very important knobs that control the behavior of a training algorithm. They have a significant effect on the performance of your machine learning model and must thus be carefully tuned. One way of doing so is by specifying a set of values to try out for each hyperparameter and training a model on every possible combination in the set. This is called grid search and is very computationally expensive for large grids because a model is trained for each possible combination. Luckily, more efficient approaches exist, such as random search and Bayesian optimisation. On top of being more computationally efficient, they have also been demonstrated to lead to better model performance [30]. Hyperopt [31], Bayesian Optimization [32] and Ray Tune [33] are some Python packages that let you do efficient hyperparameter tuning.
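The difference from grid search can be seen in a minimal random search sketch using only the standard library. The search space and the toy objective are hypothetical; in practice the objective would train a model and return its validation loss, and one of the packages above would usually be a better choice than rolling your own.

```python
import random


def random_search(objective, search_space: dict, n_trials: int, seed: int = 0):
    """Evaluate n_trials randomly sampled configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample one value per hyperparameter instead of enumerating every combination
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space: a full grid would need 4 * 4 * 3 = 48 training runs
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_size": [32, 64, 128, 256],
    "n_layers": [1, 2, 3],
}


def toy_objective(config: dict) -> float:
    """Stand-in for 'train a model and return its validation loss'."""
    return (
        abs(config["learning_rate"] - 1e-3) * 100
        + abs(config["hidden_size"] - 64) / 64
        + config["n_layers"] * 0.1
    )


best_config, best_score = random_search(toy_objective, space, n_trials=10)
print(best_config, best_score)
```

Here only 10 configurations are trained instead of 48, which is where the compute and carbon savings come from.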

3.12. Train in a specialised data center

Cloud data centers can be 1.4–2x more energy efficient than typical data centers, and ML specific hardware can be 2–5x more efficient than off-the-shelf systems. For instance, Google’s custom TPU v2 processor used 1.3x and 1.2x less energy for training some large language models compared to Nvidia’s Tesla P100 processor [17]. Using highly efficient data centers will reduce the carbon footprint from both the development and operations phase.

3.13. Debug on small scale examples

An easy way to reduce the carbon footprint of your machine learning model development is to reduce the number of times your model training procedure runs. By experimenting with small subsets of your data before starting the whole training procedure, you can catch bugs more quickly and reduce your energy consumption during model development.

3.14. Use pre-trained language models whenever possible

This is a no-brainer. Don’t train large language models from scratch unless you absolutely have to. Instead, fine tune the ones that are already available, for instance through Huggingface [34]. As a Scandinavian and an open-source fan, I want to give a shout-out to my board colleague Dan Saatrup’s ScandEval, which benchmarks a large number of open-source Scandinavian language models [35].

3.15. Make your service available only at specific times

A recent article suggests that the energy consumption of a cloud instance on which a large language model is deployed is relatively high even when the model is not actively processing requests [41]. Therefore, consider whether your service must be available 24/7/365 or if you can shut it down sometimes. If, for instance, you have developed an ML-based service for internal use in an organisation, you may in some cases reasonably expect that no one will use the service at night. The compute resource on which the service is running can therefore be shut down during some hours of the night and be spun up again in the morning.

3.16. Use sparsely activated neural networks

Neural networks can require a lot of compute to train, but sparsely activated networks can yield the same performance at a lower energy consumption. A sparse network can be defined as one in which only a percentage of the possible connections exist [36].

Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters [17, p.1]

In my opinion, sparse neural networks are a complex topic, and there are other fruits on the machine-learning-model-decarbonisation tree that hang lower, but here are a few resources to get started if you want to explore this further:

Improving Sparse Training with RigL (ai.googleblog.com)
Accelerating Neural Networks on Mobile and Web with Sparse Inference (ai.googleblog.com)

3.17. Move data to a low-energy storage solution

Granted, this is not a recommendation that is specific to machine learning models, but it is an idea for how you might reduce the carbon footprint of your entire application. Some data storage solutions are more energy efficient than others. AWS Glacier is an example of an efficient data archiving solution [37, ~02:50]. It takes a while to retrieve data from a solution like AWS Glacier, but it is more energy efficient than faster storage solutions.

4. Additional sustainability considerations for machine learning practitioners and researchers

In this section I will briefly point out some additional sustainability considerations that are not addressed directly in this article.

If we as machine learning practitioners and researchers, by increasing our awareness of the carbon footprint of our software, incentivise data center owners and operators to continuously replace their hardware with more efficient hardware, the carbon emission savings from running on this efficient hardware might be outweighed by the embodied emissions of the new hardware.

In addition, carbon footprint is not the only factor related to the environmental sustainability of machine learning — or the wider IT industry for that matter — that deserves attention.

Other considerations include biodiversity assessments and the impacts of machine learning on other planetary boundaries (e.g. land system change and freshwater use), direct natural resource impacts from manufacturing, transport and end-of-life impacts, and indirect impacts from machine learning applications [1].

We also must not forget that machine learning models run on hardware that is the result of a complicated value chain. Environmental impacts along the value chain include soil contamination, deforestation, erosion, biodiversity degradation, toxic waste disposal, groundwater pollution, water use, radioactive waste and air pollution [38].

The above are all environment-related impacts of machine learning, but other impacts exist as well such as social impacts [39].

Conclusion

This article demonstrates that while it is difficult to determine the environmental impact of machine learning as a field, it is easy for practitioners to estimate the carbon footprint of their machine learning models with either CodeCarbon or ML CO2 Impact. The article also presents 17 concrete ideas for how you might reduce the carbon footprint of machine learning models. Some of these ideas are low-hanging fruit that is easily implemented, while others require more expertise.

Also, make sure to check out the Danish Data Science Community’s Sustainable Data Science guide for more resources on sustainable data science and the environmental impact of machine learning.
