Generative AI Series

Fine Tuning LLM: Parameter Efficient Fine Tuning (PEFT) — LoRA & QLoRA — Part 2

Parameter Efficient Fine Tuning — LoRA, QLoRA — Hands-On

A B Vijay Kumar
10 min read · Sep 1, 2023


In this blog, we will implement the idea behind Parameter Efficient Fine Tuning (PEFT) and explore LoRA and QLoRA, two of the most important PEFT methods. We will also be exploring “Weights and Biases” for capturing the training metrics. We will be fine-tuning a small Salesforce codegen 350M-parameter model to improve its ability to generate Python code.

In Part 1, we discussed how LoRA introduces modularity and reduces training time by allowing us to enhance the base model using an adapter module with significantly lower dimensions. QLoRA takes this approach a step further by reducing the memory footprint of the base model. This is achieved through quantization, which involves converting the weights from the 32-bit floating-point format to smaller data types like 8-bit or 4-bit.
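To get a feel for why quantization matters, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The parameter count of 350 million is an assumption taken from the model name, and the estimate ignores the small overhead of the quantization constants, activations, and optimizer state.

```python
def weight_memory_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


# Assumption: roughly 350 million parameters, as the model name suggests
NUM_PARAMS = 350_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_mb(NUM_PARAMS, bits):.0f} MB")
```

Going from 32-bit to 4-bit weights cuts this part of the memory budget by a factor of eight, which is what makes fine-tuning on a small GPU practical.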

In this blog post, we will take the Salesforce codegen 350m model and fine-tune it for generating complex Python code. To improve its performance, we will fine-tune it on an Alpaca-style Python instruction dataset.

Let’s begin by setting up the “Hugging Face” account and the “Weights and Biases” account. We will use Weights and Biases to capture training metrics.

First, create a Hugging Face account and generate an access token with Write permissions. This token will allow us to save our trained model on the Hugging Face Hub. We will use this token for logging in to Hugging Face to pull the base model and push the trained model.

We will be using Weights and Biases to capture the metrics of our training and experiments. To create the account, go to https://wandb.ai/abvijay/projects. Add a new project (I added python-fine-tuning (https://wandb.ai/abvijay/python-fine-tuning)). You will find the API key in the base dashboard workspace, or you can find it under “Quick help” in the top-right corner.

Now let's start coding.

Installing Dependencies

The following code shows the various dependencies and imports that we will need:

!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops
!pip install -q wandb

from datasets import load_dataset
from random import randrange

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, AutoPeftModelForCausalLM

from trl import SFTTrainer

from huggingface_hub import login

import wandb

Let us understand why we need these various dependencies

  • trl: This Python package, “Transformer Reinforcement Learning”, provides tools for training transformer models, including supervised fine-tuning and reinforcement learning. We will use our instruction dataset to perform supervised fine-tuning of the model, using the SFTTrainer object.
  • 🤗 transformers: This package provides all the APIs for downloading and working with various pre-trained models that are in the huggingface model hub. In our example, we will be downloading Salesforce/codegen-350M-mono. We will also be using BitsAndBytesConfig from transformers to configure quantization (backed by the bitsandbytes library), and AutoTokenizer for creating a tokenizer for the pre-trained model.
  • 🤗 accelerate: This is another very powerful huggingface package that hides the complexity of writing and managing the code needed to run on multiple GPUs, TPUs, or with fp16.
  • 🤗 peft: This package provides all the APIs we will need to perform the LoRA technique.
  • 🤗 datasets: This huggingface package provides access to the various datasets in the huggingface hub.
  • wandb: This library is the Weights and Biases client, which we use to capture various metrics during the fine-tuning process.

In the following code, we are setting the values of the model, dataset name, and the device map

model_name = "Salesforce/codegen-350M-mono"
dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
device_map = {"": 0}

LoRA Configuration

We can define the configuration for LoRA using LoraConfig. The huggingface PEFT library supports various other PEFT methods, such as prefix tuning, P-tuning, and prompt tuning. Since we are using the LoRA method, we are using the LoraConfig class. The code below shows the configuration we will use.

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

The LoraConfig has the following attributes.

  • lora_alpha: scaling factor for the low-rank update. The adapter output is scaled by lora_alpha / r before it is added to the output of the frozen base weights. We have set it to 16. You can find more details of this in the LoRA paper here.
  • lora_dropout: dropout probability of the LoRA layers. This parameter is used to avoid overfitting. Dropout randomly zeroes some of the activations entering the LoRA layers during training, which reduces the dependency on any single unit. We are setting this to 0.1, which means each unit has a 10% chance of being dropped.
  • r: This is the rank (the inner dimension) of the low-rank matrices. Refer to Part 1 of this blog for more details. In this case, we are setting this to 64, which means that for a 512x512 weight matrix, for example, the LoRA adapter consists of a 512x64 and a 64x512 matrix.
  • bias: We will not be training the bias in this example, so we are setting that to “none”. If we have to train the biases, we can set this to “all”, or if we want to train only the LoRA biases then we can use “lora_only”.
  • task_type: Since we are using the Causal language model, the task type we set to CAUSAL_LM.

This configuration is used to create a LoRA adapter on top of the base model.
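The effect of r and lora_alpha can be made concrete with a little arithmetic. The sketch below counts the trainable parameters one adapter adds to a single weight matrix and computes the scaling factor lora_alpha / r. The 512x512 layer size is purely illustrative and is not read from the model.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r


d_out, d_in = 512, 512   # illustrative layer size (an assumption)
r, lora_alpha = 64, 16   # values from the LoraConfig above

full = d_out * d_in
adapter = lora_param_count(d_out, d_in, r)
print(f"full matrix: {full} params, LoRA adapter: {adapter} params "
      f"({adapter / full:.0%} of the full matrix)")
print(f"scaling = lora_alpha / r = {lora_scaling(lora_alpha, r)}")
```

A smaller r shrinks the adapter further. Because the update is scaled by lora_alpha / r, changing r without changing lora_alpha also changes how strongly the adapter influences the output.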

We need to now define the QLoRA configuration. We define it using BitsAndBytesConfig. The following code shows the QLoRA configurations

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)

We will be setting the following attributes. (Refer to Part 1 of the blog for more details on quantization before you try to understand these configuration parameters.)

  • load_in_4bit: we are loading the base model with a 4-bit quantization, so we are setting this value to True.
  • bnb_4bit_use_double_quant: We also want double quantization so that even the quantization constant is quantized. So we are setting this to True.
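  • bnb_4bit_quant_type: We are setting this to nf4 (4-bit NormalFloat), the data type introduced in the QLoRA paper.
  • bnb_4bit_compute_dtype: We are setting the compute data type to float16. The weights are stored in 4 bits but are dequantized to this type for computation.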
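To illustrate what "quantization constant" means in the bullets above, here is a deliberately simplified sketch of block-wise quantization to 4 bits. It uses plain absmax scaling to evenly spaced integers, which is not what NF4 does (NF4 uses a non-uniform set of 16 levels), and it is not the bitsandbytes implementation. The point is only that each block of weights stores one scale value, and double quantization is about compressing those per-block scales as well.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integers; returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed codes
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0  # the per-block quantization constant
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the block's scale."""
    return [c * scale for c in codes]


def quantize_blockwise(weights, block_size=4, bits=4):
    """Split the weights into blocks and quantize each block with its own scale."""
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


# Toy weights, made up for illustration
weights = [0.12, -0.40, 0.33, 0.05, 1.50, -0.75, 0.20, -1.10]
blocks = quantize_blockwise(weights, block_size=4)
restored = [v for codes, scale in blocks for v in dequantize_block(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

for codes, scale in blocks:
    print(f"codes={codes} scale={scale:.4f}")
print(f"max round-trip error: {max_error:.4f}")
```

Using one scale per small block keeps a single large weight from degrading the precision of all the others, at the cost of storing many scales.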
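
Next, we log in to the Hugging Face Hub and to Weights and Biases, and set the Weights and Biases project name.

from huggingface_hub import notebook_login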
# Log in to HF Hub
notebook_login()

wandb.login()
%env WANDB_PROJECT=python-fine-tuning

Let's now look at the instruction set that we will be using to fine-tune the model. We will be using an 18k alpaca Python instruction set on huggingface.

iamtarun/python_code_instructions_18k_alpaca

The following screenshot shows the structure of the instruction set

We will have to define a function to process these rows. This function will be called by the supervised fine-tuning trainer to convert each row of the instruction dataset into the required prompt text. The following function defines how to convert each row in this dataset into a single string containing the instruction, input, and output.

def prompt_instruction_format(sample):
    return f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""
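To see what the trainer actually receives, the function can be called on a single row. The row below is hypothetical, made up for illustration rather than taken from the dataset. Only the three field names come from the function above.

```python
def prompt_instruction_format(sample):
    return f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""


# Hypothetical row, made up for illustration (not taken from the dataset)
sample = {
    "instruction": "Write a function that adds two numbers.",
    "input": "2, 3",
    "output": "def add(a, b):\n    return a + b",
}

formatted = prompt_instruction_format(sample)
print(formatted)
```

At inference time the same template is used, but it stops after the "### Response:" line so that the model fills in the code.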

Let's now load the dataset by calling load_dataset()

# Assumption: the original does not define `split`; "train" is used here
split = "train"
dataset = load_dataset(dataset_name, split=split)

We can load the model using AutoModelForCausalLM, passing the QLoRA configuration so that the model is loaded in quantized form.

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map=device_map,
)
model.config.pretraining_tp = 1

Let's define the tokenizer, for the model, using huggingface AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
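Setting the pad token to the end-of-sequence token and padding on the right means that, whenever sequences in a batch have different lengths, the shorter ones are filled out at the end. The toy sketch below shows the idea on plain lists of token ids. The ids and the pad_right helper are hypothetical and only stand in for what the tokenizer does internally.

```python
def pad_right(sequences, pad_id):
    """Pad every token-id list on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


EOS_ID = 0  # hypothetical id, stands in for tokenizer.eos_token_id
batch = [[11, 12, 13, 14], [21, 22]]

input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```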

Let's now define the various training arguments. These arguments will be used by the trainer to fine-tune the model. Let's go through each of these arguments in detail


# Assumed value: the original does not show where finetunes_model_name is defined
finetunes_model_name = "codegen-350M-mono-python-18k-alpaca"

trainingArgs = TrainingArguments(
    output_dir=finetunes_model_name,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=5,
    save_strategy="epoch",
    learning_rate=2e-4,
    weight_decay=0.001,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="cosine",
    disable_tqdm=True,
    report_to="wandb",
    seed=42
)

  • output_dir: Output directory where the model predictions and checkpoints will be stored
  • num_train_epochs=3: Number of training epochs
  • per_device_train_batch_size=4: Batch size per GPU for training
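  • gradient_accumulation_steps=2: Number of steps over which gradients are accumulated before each optimizer update. With a per-device batch size of 4, this gives an effective batch size of 8 per GPU.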
  • gradient_checkpointing=True: Enable gradient checkpointing. Gradient checkpointing is a technique used to reduce memory consumption during the training of deep neural networks, especially in situations where memory usage is a limiting factor. Gradient checkpointing selectively re-computes intermediate activations during the backward pass instead of storing them all, thus performing some extra computation to reduce memory usage.
  • optim="paged_adamw_32bit": Optimizer to use. We will be using paged_adamw_32bit
  • logging_steps=5: Log the training progress every 5 steps.
  • save_strategy="epoch": Save a checkpoint after every epoch
  • learning_rate=2e-4: Learning rate
  • weight_decay=0.001: Weight decay is a regularization technique used while training the models, to prevent overfitting by adding a penalty term to the loss function. Weight decay works by adding a term to the loss function that penalizes large values of the model’s weights.
  • max_grad_norm=0.3: This parameter sets the maximum gradient norm for gradient clipping.
  • warmup_ratio=0.03: The warm-up ratio is a value that determines what fraction of the total training steps or epochs will be used for the warm-up phase. In this case, we are setting it to 3%. Warm-up refers to a specific learning rate scheduling strategy that gradually increases the learning rate from its initial value to its full value over a certain number of training steps or epochs.
  • lr_scheduler_type="cosine": Learning rate schedulers are used to adjust the learning rate dynamically during training to help improve convergence and model performance. We will be using the cosine type for the learning rate scheduler.
  • report_to="wandb": We want to report our metrics to Weights and Biases
  • seed=42: This is the random seed that is set during the beginning of the training.
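The interaction between learning_rate, warmup_ratio, and the cosine scheduler is easier to see with numbers. The following is a simplified sketch of a linear warm-up followed by cosine decay, written in plain Python. It is not the scheduler code used by the trainer, and the total of 1000 steps is a hypothetical value chosen only for the illustration.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical number of optimizer steps

# Effective batch size per GPU from the arguments above
effective_batch_size = 4 * 2  # per_device_train_batch_size * gradient_accumulation_steps
print(f"effective batch size per GPU: {effective_batch_size}")

for step in (0, 10, 30, 250, 500, 750, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

The learning rate starts at zero, reaches 2e-4 at the end of the warm-up phase, and then falls smoothly for the rest of training.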

Let's now create a trainer object. We will be passing the LoRA configurations here so that the training will be done on the low-rank adapter, not the base model.

# Create the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=2048,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=prompt_instruction_format,
    args=trainingArgs,
)

While initializing the SFTTrainer class, we are passing the base model to be trained, the training dataset, the PEFT configuration, the tokenizer, and the function used to convert each training row into a “prompt”. We have also set packing to True, so multiple short examples are packed into each sequence of up to max_seq_length tokens. Finally, we are passing the training arguments that we initialized in the previous step.
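Packing is worth a short illustration. The idea is to concatenate the tokenized examples, separated by an end-of-sequence token, and cut the resulting stream into fixed-length chunks so that little of each training sequence is wasted on padding. The sketch below shows the concept on toy token ids. It is a simplified illustration and not the trl implementation, and the pack_sequences helper and the ids are hypothetical.

```python
def pack_sequences(tokenized_samples, eos_id, seq_len):
    """Concatenate samples (each followed by EOS) and cut into full chunks of seq_len."""
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(eos_id)
    n_full = len(stream) // seq_len  # a trailing partial chunk is dropped in this sketch
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


EOS_ID = 0  # hypothetical end-of-sequence id
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

packed = pack_sequences(samples, eos_id=EOS_ID, seq_len=4)
print(packed)
```

Note that a chunk can contain the end of one example and the start of the next, which is why the separator token matters.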

We can start the training by calling train() method.


trainer.train()

The above screenshot shows the output of the training. You can also see the Weights and Biases link and the various metrics that are captured on the loss and training.

The following screenshots show the dashboards on the Weights and Biases site, which show the various training metrics and the performance metrics, such as GPU usage and memory usage.

Once the training is done, we will save the model and stop collecting the metrics.

#stop reporting to wandb
wandb.finish()
# save model
trainer.save_model()

Now that we have a trained adapter, we need to merge it with the base model. We can do that by calling merge_and_unload(). This method will merge the LoRA layers into the base model. The following code shows the merging and then uploading the merged model to the huggingface hub.

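# Assumed step (not shown in the original): reload the saved adapter in float16
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    trainingArgs.output_dir, torch_dtype=torch.float16, device_map=device_map)

# Merge LoRA with the base model and save the merged model
merged = trained_model.merge_and_unload()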
merged.save_pretrained("merged",safe_serialization=True)
tokenizer.save_pretrained("merged")

#push merged model to the hub
merged.push_to_hub("codegen-350M-mono-python-18k-alpaca")
tokenizer.push_to_hub("codegen-350M-mono-python-18k-alpaca")
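Conceptually, merging folds the low-rank update into the frozen weight, so that the adapter no longer needs to be applied as a separate step at inference time. The sketch below demonstrates this on a tiny hand-made example in plain Python. The matrices, the rank of 1, and the helper functions are all hypothetical, and this is not how the PEFT library implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scaling = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy values, made up for illustration
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen 3x3 base weight
B = [[0.5], [-1.0], [2.0]]                               # 3 x r
A = [[1.0, 2.0, -1.0]]                                   # r x 3
r, lora_alpha = 1, 2
x = [1.0, 2.0, 3.0]

# Adapter kept separate: base output plus the scaled low-rank path
scaling = lora_alpha / r
unmerged_out = [b + scaling * a
                for b, a in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged into the weight: a single matrix-vector product
merged_out = matvec(merge_lora(W, A, B, lora_alpha, r), x)

print(unmerged_out)
print(merged_out)
```

Both paths give the same output, which is why the merged model can be saved and served like any ordinary model.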

The following screenshot shows the output of merging and uploading the merged model to the huggingface hub

Once it is uploaded, you should be able to find it, in your huggingface account. Below is the screenshot of my model, on the huggingface hub.

Let's now run some inferences on the trained model, and test whether it really improved the efficacy.

instruction="Write a Python program to generate a Markov chain given a text input."
input="Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'"

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():

print(f"-------------------------\n\n")
print(f"Prompt:\n{prompt}\n")
print(f"-------------------------\n\n")

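print(f"Before Training Response :")
# Note: if `model` is the object that was passed to SFTTrainer, it may already carry
# the trained LoRA layers; for a clean baseline, use a freshly loaded base model
output_before = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)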
print(f"Generated instruction:\n{tokenizer.batch_decode(output_before.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"-------------------------\n\n")

print(f"After Training Response :")
outputs = merged.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)

print(f"-------------------------\n\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"-------------------------\n\n")
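The generate() calls above use do_sample=True with temperature=0.5 and top_p=0.9. The toy sketch below shows what these two settings do to a next-token distribution. It works on three made-up logits in plain Python and is only an illustration of the idea, not the transformers implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature below 1 sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

probs = softmax_with_temperature(logits, temperature=0.5)
candidates = top_p_filter(probs, top_p=0.9)

print([round(p, 4) for p in probs])
print({idx: round(p, 4) for idx, p in candidates.items()})
```

The next token is then sampled from the surviving candidates. Because sampling is random, two runs of the same prompt can produce different code.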

We will run the same prompt on both the base model and the trained and merged model. The following screenshots show the two outputs; in this run, the Python code generated after fine-tuning was noticeably better.

Output of the Base Model
Output of the Trained and Merged Model

Let's now clean up the memory and perform garbage collection. The following screenshot shows the GPU RAM usage, before the garbage collection.

GPU RAM before the cleanup

import gc
# clear the VRAM (names match the objects created earlier in this post)
del model
del trained_model
del merged
del trainer
torch.cuda.empty_cache()
gc.collect()

The following screenshot shows the output and the GPU RAM after executing the code.

Output of the cleanup
GPU RAM after cleanup

As you can see, we are able to clean up the GPU RAM. It is good practice to do this after training, before we run inferences.

There you go, we are able to use the PEFT LoRA and QLoRA techniques to fine-tune a model with a simple T4 GPU on Google Colab. You can get the full source code on my GitHub link here.

Hope this has been helpful. I enjoyed learning and sharing my learnings. Please feel free to share your feedback or any mistakes/comments.

I will be back with more techniques. In the meantime, have fun...

See you soon ;-)

A B Vijay Kumar

IBM Fellow, Master Inventor, Mobile, RPi & Cloud Architect & Full-Stack Programmer