Carbon Footprint of LLM Fine Tuning — A Case Study

I got surprising results when I measured the carbon emissions from instruction fine tuning an LLM

Kasper Groes Albin Ludvigsen
Towards Data Science



I recently LoRA fine-tuned a Danish LLM called Munin-7b-alpha on an instruction fine-tuning dataset called SkoleGPT-instruct. During the fine-tuning procedure, I measured the energy consumption and computed the carbon footprint. In this article, I present the surprising results. You can find the model here.

Introduction

Munin-7b-alpha is a pre-trained model (a so-called foundation model) that has been trained solely to generate text. To be suitable for a chat setup, a pre-trained model needs to be good at following instructions, which requires a subsequent training step called instruction fine-tuning.

As opposed to pre-training, which requires massive amounts of unlabeled text data on which the model trains in a self-supervised fashion, instruction fine-tuning requires a relatively modest amount of data, which must in turn be carefully curated and annotated.

It is such a fine-tuning procedure that I report on in this article.

Methodology

Munin-7b-alpha has 7 billion parameters, and the instruction dataset I used consists of 21,300 samples, i.e. 21,300 examples of a prompt and a good answer.

Using a slightly adapted version of this fantastic model fine-tuning notebook, I trained a LoRA adapter for 1 epoch, i.e. I showed the model each sample once.
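For reference, a LoRA setup of this kind is commonly configured with Hugging Face's peft library. This is a minimal configuration sketch; the hyperparameter values below (rank, alpha, dropout, target modules) are illustrative assumptions, not the exact values from the notebook I used:

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters -- not the exact values from my run.
lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor applied to the update
    lora_dropout=0.05,       # dropout on the LoRA layers during training
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",   # decoder-only text generation
)
```

A config like this is then passed to `get_peft_model` (or a trainer that accepts it), which freezes the base model and injects the trainable low-rank adapters.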

LoRA (low-rank adaptation) is an efficient fine-tuning technique for adapting LLMs to specific tasks. Hugging Face provides a succinct description of the technique:

“Low-Rank Adaptation (LoRA) is a PEFT [parameter efficient fine tuning] method that decomposes a large matrix into two smaller low-rank matrices in the attention layers. This drastically reduces the number of parameters that need to be fine-tuned.”
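To make the quote concrete, here is a minimal numpy sketch of the idea, with dimensions far smaller than a real 7B model: instead of updating a full d×d weight matrix, LoRA trains two small factors whose product forms a rank-r update added to the frozen weights.

```python
import numpy as np

d, r = 512, 8                        # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialised, so the update starts at 0

def forward(x):
    # Frozen weights plus the low-rank update B @ A
    return x @ (W + B @ A).T

full_params = d * d                  # parameters if we fine-tuned W directly
lora_params = 2 * d * r              # parameters LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.4%}")  # -> 3.1250%
```

Even at this toy scale, the trainable fraction is about 3%; for a 7B model with a low rank, the fraction is far smaller still, which is what makes the technique so cheap.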

The model trained on a single Nvidia RTX A4000 GPU, a workstation-class GPU with 16 GB of memory – just enough memory for LoRA fine-tuning of this model.

I measured energy consumption with the Python package CodeCarbon. CodeCarbon is an extremely lightweight and easy-to-use package that lets you measure the energy consumption of a Python script, function, or method with just two lines of code.

Aside from energy consumption, CodeCarbon also estimates the carbon footprint of the energy your computing procedure consumes, but the numbers appeared inaccurate to me. This is likely because CodeCarbon uses a hardcoded average carbon intensity (CO2e per kWh produced) for your geographic region and not a near-real-time carbon intensity. So I went to a website called Energi Data Service, which lets you download fine-grained electricity emissions data from the Danish grid. By multiplying the energy consumption measurements obtained with CodeCarbon by the carbon intensity of electricity in the grid during the hours the model trained, I obtained the carbon footprint of the training.
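In code, this calculation is just a weighted sum over the hours of the run. A sketch with made-up hourly numbers (the real intensity values come from Energi Data Service):

```python
# Hypothetical hourly breakdown of the ~4-hour run (illustrative numbers).
hourly_energy_kwh = [0.180, 0.170, 0.170, 0.174]  # CodeCarbon measurements; sums to 0.694 kWh
hourly_intensity = [80.0, 85.0, 81.0, 84.1]       # grid carbon intensity, g CO2e per kWh

# Multiply each hour's energy by that hour's carbon intensity and sum.
emissions_g = sum(e * i for e, i in zip(hourly_energy_kwh, hourly_intensity))
print(f"{emissions_g:.1f} g CO2e")  # -> 57.3 g CO2e
```

Using hourly intensities rather than a single yearly average matters on a grid like Denmark's, where the wind and solar share (and hence the carbon intensity) swings considerably over a day.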

Results

The fine-tuning process took just shy of 4 hours and consumed a total of 0.694 kWh – the combined GPU, CPU, and RAM consumption, as estimated with the Python package CodeCarbon.

During the hours the model trained, the average CO2e emissions per kWh produced was 82.5 g as per Energi Data Service (license: “The Licensor grants you a worldwide, free, non-exclusive and otherwise unrestricted licence to use the Data” [1]).

Thus, the fine-tuning emitted a minuscule 57 grams of CO2e (0.694 kWh * 82.5 g/kWh ≈ 57.3 g).

For comparison, the average Dane emits 11 tons of CO2e per year.

A research study found that generating a single image with generative AI consumes 2.9 Wh on average [2]. So for the amount of energy it took to instruction fine-tune the LLM, you could generate a mere 239 images.
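The comparison is simple arithmetic:

```python
energy_wh = 0.694 * 1000    # fine-tuning energy, converted from kWh to Wh
wh_per_image = 2.9          # average energy per generated image [2]

print(round(energy_wh / wh_per_image))  # -> 239
```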

If you’re wondering if such a short and efficient fine-tuning procedure yielded a better model, the answer is a clear “yes”:

According to the ScandEval leaderboard on natural language generation, the pre-trained model scores an average of 43.44 on Danish tasks and the fine-tuned model scores an average of 47.55 – a gain of 9.46 percent. As of this writing, that's the difference between 5th and 7th place on the leaderboard.
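The relative gain is computed from the two leaderboard averages:

```python
base_score, tuned_score = 43.44, 47.55   # ScandEval averages on Danish tasks

gain_pct = (tuned_score - base_score) / base_score * 100
print(f"{gain_pct:.2f}%")  # -> 9.46%
```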

Discussion

It’s surprising to me that it did not require more compute, energy, and emissions to perform the fine tuning.

I expect my findings to scale roughly linearly with the number of samples, holding other variables constant (e.g. using a similar GPU, the same training method, etc.). That is, if you fine-tune on twice as many samples, or for double the number of epochs, I expect the energy consumption to double.

The energy consumption would likely be significantly higher for a 70-billion-parameter model, thus leading to higher emissions, but the emissions would probably still be very modest in the grand scheme of things.

Further, the energy consumption would likely be higher if I hadn’t used LoRA.

Conclusion

The instruction fine-tuning technique LoRA is indeed efficient – in terms of how long it takes, how much compute (e.g. GPU memory) you need, and how much carbon it emits.

Instruction fine-tuning a 7B LLM with LoRA on 21,300 samples for one epoch took four hours and emitted 57 grams of CO2e – a tiny amount.

That’s it! I hope you enjoyed the story. Let me know what you think!


Follow me for more on AI and sustainability and subscribe to get my stories via email when I publish.

I also sometimes write about time series forecasting.

And feel free to connect on LinkedIn.

References

[1] https://www.energidataservice.dk/Conditions_for_use_of_Danish_public_sector_data-License_for_use_of_data_in_ED.pdf

[2] Luccioni et al., “Power Hungry Processing: Watts Driving the Cost of AI Deployment?”
