Classifying Banking Intent from Customer Queries¶

Introduction¶

This project compares two text classification models that predict customer intent from real-world banking queries. The end-to-end workflow demonstrates foundational deep learning techniques alongside a modern transformer-based fine-tuning approach for natural language tasks.

We'll build and compare two models:

  • A baseline Multi-Layer Perceptron (MLP) classifier.
  • A fine-tuned RoBERTa model using LoRA.

This comparison will highlight the trade-off between training efficiency and model performance: we expect the MLP to train significantly faster, but the fine-tuned RoBERTa to achieve superior classification performance.

Project Goals¶

  • Explore and visualize the Banking77 dataset.
  • Implement a baseline MLP classifier.
  • Fine-tune a pre-trained RoBERTa model using LoRA.
  • Compare model performances (overall accuracy, mean F1-scores).

Dataset¶

Banking77 is a text dataset of banking customer queries, each labeled with an intent corresponding to a specific banking action. Queries are short, natural-language texts that customers might type into a chatbot.

  • 77 unique intents (classification labels).
  • 13,083 queries (10,003 for training and 3,080 for testing).
  • Source: https://huggingface.co/datasets/PolyAI/banking77

Example query:

{
  'label': 11, # integer label corresponding to "card_arrival" intent
  'text': 'I am still waiting on my card?'
}

Import libraries and set up workspace¶

# Import essential libraries for data handling, visualization, and machine learning
import pandas as pd
import numpy as np
import datasets
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and metrics from sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, classification_report


# PyTorch for neural network implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Hugging Face tools for transformers and PEFT (LoRA)
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments

import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configure pandas to display all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Load and verify data¶

Load the training and testing data to verify their existence and check for data quality issues, such as missing queries or intent labels.

# Load training dataset
train_df = pd.read_csv("./datasets/banking77_train.csv")
# Display the first few rows of the training data
train_df.head()
text category
0 I am still waiting on my card? card_arrival
1 What can I do if my card still hasn't arrived ... card_arrival
2 I have been waiting over a week. Is the card s... card_arrival
3 Can I track my card while it is in the process... card_arrival
4 How do I know if I will get my card, or if it ... card_arrival
# Load testing dataset
test_df = pd.read_csv("./datasets/banking77_test.csv")
# Display the first few rows of the testing data
test_df.head()
text category
0 How do I locate my card? card_arrival
1 I still have not received my new card, I order... card_arrival
2 I ordered a card but it has not arrived. Help ... card_arrival
3 Is there a way to know when my card will arrive? card_arrival
4 My card has not arrived yet. card_arrival
# Summary statistics for training set to check for anomalies
print(train_df.describe())
                                  text                  category
count                            10003                     10003
unique                           10003                        77
top     I am still waiting on my card?  card_payment_fee_charged
freq                                 1                       187
# Check for any missing values in the training set
print(train_df.isna().sum())
text        0
category    0
dtype: int64
# Summary statistics for test set to ensure consistency
print(test_df.describe())
                            text      category
count                       3080          3080
unique                      3080            77
top     How do I locate my card?  card_arrival
freq                           1            40
# Check for missing values in the test set
print(test_df.isna().sum())
text        0
category    0
dtype: int64

Distribution of class intent labels¶

Let's visualize the distribution of intent labels (counts for each class).

# Plot the distribution of categories to visualize class balance
plt.figure(figsize=(16,4))
train_df["category"].value_counts().plot(kind='bar')
plt.title("Distribution of Intent Labels")
plt.xlabel("category")
plt.ylabel("Count")
plt.show()

The distribution shows signs of class imbalance among intent labels, particularly in the right tail; we may need to address this later should our models underperform.

Explore and visualize data¶

The 10 most frequent intents share a few common themes:

  • Fee-related complaints: payment fees, transfer fees, and withdrawal charges.
  • Transaction errors: charged twice, wrong amount received, failed transactions.
  • Balance update issues: balance not updated after deposits or transfers.
# Calculate text length (in characters) for each query in the training set
train_df["text_length"] = train_df.text.apply(len)
# Aggregate text length metrics by category
text_length_dist = train_df.groupby("category")["text_length"].agg(['count', 'min', 'median', 'max'])

# Display top 10 categories by query count
print("=" * 18 + " Text Length - Top 10 Intent Categories " + "=" * 18)
print(text_length_dist.sort_values("count", ascending=False).head(10))
print()
================== Text Length - Top 10 Intent Categories ==================
                                                  count  min  median  max
category                                                                 
card_payment_fee_charged                            187   23    53.0  213
direct_debit_payment_not_recognised                 182   17    60.0  268
balance_not_updated_after_cheque_or_cash_deposit    181   25    65.0  202
wrong_amount_of_cash_received                       180   15    54.5  254
cash_withdrawal_charge                              177   19    51.0  255
transaction_charged_twice                           175   19    51.0  339
declined_cash_withdrawal                            173   22    49.0  207
transfer_fee_charged                                172   18    55.0  409
transfer_not_received_by_recipient                  171   20    60.0  268
balance_not_updated_after_bank_transfer             171   23    58.0  202

# Display bottom 10 categories by query count
print("=" * 6 + " Text Length - Bottom 10 Intent Categories " + "=" * 6)
print(text_length_dist.sort_values("count", ascending=False).tail(10))
====== Text Length - Bottom 10 Intent Categories ======
                             count  min  median  max
category                                            
get_disposable_virtual_card     97   22    43.0  156
top_up_limits                   97   13    33.0   80
receiving_money                 95   22    49.0  103
atm_support                     87   16    35.0   72
compromised_card                86   30    61.5  321
lost_or_stolen_card             82   18    39.0  210
card_swallowed                  61   15    41.0  141
card_acceptance                 59   20    34.0   58
virtual_card_not_working        41   28    44.0  123
contactless_not_working         35   20    48.0  143

The 10 least frequent intents, on the other hand, involve various issues with a physical or virtual card: lost or stolen card, card swallowed, virtual card not working, contactless not working.

# Plot text length distribution by intent
plt.figure(figsize=(16, 16))
sns.boxplot(data=train_df, y="category", x="text_length")
plt.title('Text Length Distribution by Intent')
plt.xlabel('Text Length (characters)')
plt.ylabel('Intent Category')
plt.tight_layout()
plt.show()

Looking at the text length distribution:

  • Intents are fairly uniform in their medians and interquartile ranges, suggesting that customers use concise language when reporting issues, asking questions, or making requests.
  • Outliers are present in almost all intents, suggesting that some customers provide extensive details in their text inputs.

Word Clouds¶

Next, we'll look at word frequency for the 10 largest intent categories to see if there are any observable patterns between word choice and intent category.

# Get the top 10 categories by frequency
top_10_intents = train_df['category'].value_counts().nlargest(10).index

fig, axes = plt.subplots(2, 5, figsize=(25, 12))
axes = axes.flatten()

for i, intent in enumerate(top_10_intents):
    # Filter and join text for the intent
    words = " ".join(train_df[train_df.category == intent]['text'])
    
    # Generate word cloud
    wordcloud = WordCloud(width=600, height=600, background_color='white', max_words=20).generate(words)
    
    # Plot on the corresponding axis
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].set_title(f"Category: {intent}")
    axes[i].axis("off")

plt.tight_layout()
plt.show()

Several interesting themes emerge from these word clouds:

  • Strong lexical alignment with intent labels: dominant words closely mirror the intent name itself.
  • Strong vocabulary overlap across intents: words like charged, transfer, fee, and money are very common.
  • Fine-grained distinctions: Some intents differ by a few critical words like cash vs card.

These observations suggest that distinguishing intents may depend more on specific keywords than on deep banking knowledge, so an MLP might achieve decent accuracy. However, while intent categories are keyword-rich, the heavy vocabulary overlap limits the MLP's ability to separate classes, presenting an opportunity for a fine-tuned RoBERTa to excel by using its contextual embeddings.

Multi-Layer Perceptron (MLP)¶

We start by creating a Multi-Layer Perceptron (MLP) to serve as a baseline text classifier.

The purpose of this baseline is to:

  • Create a benchmark to compare against more advanced models (e.g., transformers).
  • Provide a strong non-contextual reference point for text classification performance.

Encode Class Labels with LabelEncoder¶

  • Each unique intent category is mapped to a unique integer ID.
  • Fit on the training dataset and apply to the test dataset.
encoder = LabelEncoder()

y_train = encoder.fit_transform(train_df.category.astype(str).values)
y_test = encoder.transform(test_df.category.astype(str).values)

num_classes = len(encoder.classes_)

# Confirm that all 77 classes were encoded
print(num_classes)
77
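As a quick illustration of the encode/decode round trip (toy labels, not the full dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels to illustrate the round trip; classes are sorted alphabetically
enc = LabelEncoder()
ids = enc.fit_transform(["card_arrival", "top_up_failed", "card_arrival"])

print(ids)                         # [0 1 0]
print(enc.inverse_transform(ids))  # ['card_arrival' 'top_up_failed' 'card_arrival']
```

The same `inverse_transform` call is what lets us map predicted integer IDs back to human-readable intent names later.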

Extract and Vectorize Text Features with TfidfVectorizer¶

To train the MLP, we'll represent each text as TF-IDF (Term Frequency-Inverse Document Frequency) features.

  • TF-IDF converts each text into vectors that represent how important a word is in a given text relative to the full text corpus.

Preprocessing steps:

  1. Fit and transform on the training dataset.
  2. Transform the testing dataset using learned features from the training dataset.
tfidf = TfidfVectorizer(
    lowercase=True,
    ngram_range = (1,2),
    min_df = 2,
    max_df = 0.95,
    max_features = 50000,
    strip_accents = "unicode"
)

X_train = tfidf.fit_transform(train_df["text"].astype(str).values)
X_test = tfidf.transform(test_df["text"].astype(str).values)

input_size = X_train.shape[1]
print("Input Size: ", input_size)
Input Size:  10292

Create dataset class for PyTorch processing¶

Because PyTorch's DataLoader expects data to be served through a Dataset class, we define a custom dataset class that:

  • Wraps sparse TF-IDF feature matrices.
  • Converts sparse rows into dense tensors.
  • Returns input–label pairs for training.
class TfidfDataset(Dataset):
    def __init__(self, X_sparse, y):
        self.X = X_sparse
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        x = self.X[idx].toarray().astype(np.float32).squeeze(0)
        y = np.int64(self.y[idx])
        return torch.from_numpy(x), torch.tensor(y)

train_ds = TfidfDataset(X_train, y_train)
test_ds = TfidfDataset(X_test, y_test)

# Create DataLoaders
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)

Build MLP Architecture¶

For our baseline we'll use a simple architecture consisting of:

  • An input layer matching the dimensionality of the TF-IDF vectors.
  • Two hidden layers with ReLU activations to introduce non-linearities.
  • An output layer that produces predicted outputs (logits) for each of the 77 intent classes.
class SimpleMLP(nn.Module):
    def __init__(self, input_size: int, hidden_size=256, output_size=77):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x
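Given the TF-IDF input size of 10,292 reported above, the trainable-parameter count of this architecture works out as follows (plain arithmetic):

```python
input_size, hidden, classes = 10292, 256, 77

fc1 = input_size * hidden + hidden  # weights + biases
fc2 = hidden * hidden + hidden
fc3 = hidden * classes + classes

print(fc1 + fc2 + fc3)  # 2720589 trainable parameters, dominated by fc1
```

Nearly all of the capacity sits in the first layer, which is typical for wide sparse-feature inputs.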

Instantiate MLP, Loss Function, and Optimizer¶

We'll create an instance of the MLP and initialize the following:

  • Cross-Entropy Loss is selected for multi-class classification.
  • AdamW optimizer to update model parameters using adaptive learning rates with weight decay.

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = SimpleMLP(
    input_size=input_size,
    hidden_size = 256,
    output_size=num_classes).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
Using device: cuda

Train the MLP on the Training Set¶

Next, we create a training function for the MLP that includes:

  • Training parameters (number of epochs).
  • Looping through the training batches.
  • Calculating the loss through the forward pass.
  • Updating weight values through the backward pass.
  • Calculating training performance metrics per epoch.
# Set random seed
torch.manual_seed(42)

def train_model(model, train_loader, loss_fn, optimizer, num_epochs, device):
    for epoch in range(num_epochs):
        epoch_loss=0
        correct = 0
        total = 0
        num_batches = 0

        for batch_X, batch_y in train_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            # Forward Pass
            logits = model(batch_X)
            loss = loss_fn(logits, batch_y)

            # Backward Pass
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            # Track loss and training predictions
            epoch_loss += loss.item()
            num_batches += 1
            preds = logits.argmax(dim=1)
            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)

        # Metrics
        avg_loss = epoch_loss / num_batches
        accuracy = correct / total

        print(
            f"Epoch [{epoch + 1}/{num_epochs}] | "
            f"Loss: {avg_loss:.4f} | "
            f"Accuracy: {accuracy:.4f}"
        )         

# Train model
train_model(model=model,
            train_loader=train_loader,
            loss_fn=criterion,
            optimizer=optimizer,
            num_epochs=10,
            device=device)
Epoch [1/10] | Loss: 4.2509 | Accuracy: 0.0753
Epoch [2/10] | Loss: 3.3632 | Accuracy: 0.3640
Epoch [3/10] | Loss: 1.5487 | Accuracy: 0.7505
Epoch [4/10] | Loss: 0.5909 | Accuracy: 0.9001
Epoch [5/10] | Loss: 0.2978 | Accuracy: 0.9485
Epoch [6/10] | Loss: 0.1804 | Accuracy: 0.9708
Epoch [7/10] | Loss: 0.1162 | Accuracy: 0.9827
Epoch [8/10] | Loss: 0.0836 | Accuracy: 0.9906
Epoch [9/10] | Loss: 0.0615 | Accuracy: 0.9931
Epoch [10/10] | Loss: 0.0450 | Accuracy: 0.9956

Evaluate MLP on the Testing Dataset¶

Now we'll generate predictions on the testing set.

def predict(model, dataloader, device):
    model.eval()

    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for batch_X, batch_y in dataloader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            logits = model(batch_X)
            preds = logits.argmax(dim=1)

            all_predictions.append(preds.cpu().numpy())
            all_labels.append(batch_y.cpu().numpy())

    y_pred = np.concatenate(all_predictions)
    y_true = np.concatenate(all_labels)
    return y_pred, y_true

# Generate test predictions
mlp_preds, mlp_true = predict(
    model=model,
    dataloader=test_loader,
    device=device)

Classification Report: MLP¶

Using our predictions we can create a classification report to evaluate for each intent class:

  • Precision
  • Recall
  • F1-Score
mlp_accuracy = accuracy_score(mlp_true, mlp_preds)
mlp_macro_f1 = f1_score(mlp_true, mlp_preds, average='macro')
mlp_weighted_f1 = f1_score(mlp_true, mlp_preds, average='weighted')
mlp_report = classification_report(mlp_true, mlp_preds, target_names=encoder.classes_)

print("=== Baseline MLP Performance ===")
print(f"Overall Accuracy Score: {mlp_accuracy:.4f}")
print(f"Macro F1: {mlp_macro_f1:.4f}")
print(f"Weighted F1: {mlp_weighted_f1:.4f}")
print("\nClassification report:")
print(mlp_report)
=== Baseline MLP Performance ===
Overall Accuracy Score: 0.8779
Macro F1: 0.8782
Weighted F1: 0.8782

Classification report:
                                                  precision    recall  f1-score   support

                           Refund_not_showing_up       0.93      0.95      0.94        40
                                activate_my_card       0.93      0.97      0.95        40
                                       age_limit       0.98      1.00      0.99        40
                         apple_pay_or_google_pay       1.00      0.97      0.99        40
                                     atm_support       0.88      0.93      0.90        40
                                automatic_top_up       1.00      0.90      0.95        40
         balance_not_updated_after_bank_transfer       0.69      0.68      0.68        40
balance_not_updated_after_cheque_or_cash_deposit       0.92      0.90      0.91        40
                         beneficiary_not_allowed       0.95      0.97      0.96        40
                                 cancel_transfer       0.86      0.95      0.90        40
                            card_about_to_expire       0.98      1.00      0.99        40
                                 card_acceptance       0.71      0.88      0.79        40
                                    card_arrival       0.85      0.85      0.85        40
                          card_delivery_estimate       0.92      0.82      0.87        40
                                    card_linking       0.97      0.90      0.94        40
                                card_not_working       0.63      0.90      0.74        40
                        card_payment_fee_charged       0.83      0.88      0.85        40
                     card_payment_not_recognised       0.87      0.82      0.85        40
                card_payment_wrong_exchange_rate       0.88      0.93      0.90        40
                                  card_swallowed       0.92      0.85      0.88        40
                          cash_withdrawal_charge       1.00      0.93      0.96        40
                  cash_withdrawal_not_recognised       0.88      0.93      0.90        40
                                      change_pin       0.90      0.95      0.93        40
                                compromised_card       0.86      0.78      0.82        40
                         contactless_not_working       0.76      0.70      0.73        40
                                 country_support       0.90      0.90      0.90        40
                           declined_card_payment       0.73      0.90      0.81        40
                        declined_cash_withdrawal       0.82      0.90      0.86        40
                               declined_transfer       0.96      0.62      0.76        40
             direct_debit_payment_not_recognised       0.83      0.88      0.85        40
                          disposable_card_limits       0.92      0.90      0.91        40
                           edit_personal_details       0.98      1.00      0.99        40
                                 exchange_charge       0.95      0.93      0.94        40
                                   exchange_rate       0.90      0.95      0.93        40
                                exchange_via_app       0.89      0.97      0.93        40
                       extra_charge_on_statement       0.92      0.88      0.90        40
                                 failed_transfer       0.69      0.85      0.76        40
                           fiat_currency_support       0.97      0.78      0.86        40
                     get_disposable_virtual_card       0.87      0.85      0.86        40
                               get_physical_card       0.92      0.88      0.90        40
                              getting_spare_card       0.95      0.90      0.92        40
                            getting_virtual_card       0.90      0.88      0.89        40
                             lost_or_stolen_card       0.87      0.85      0.86        40
                            lost_or_stolen_phone       0.95      0.97      0.96        40
                             order_physical_card       0.87      0.85      0.86        40
                              passcode_forgotten       1.00      0.93      0.96        40
                            pending_card_payment       0.95      0.93      0.94        40
                         pending_cash_withdrawal       1.00      0.97      0.99        40
                                  pending_top_up       0.89      0.82      0.86        40
                                pending_transfer       0.79      0.75      0.77        40
                                     pin_blocked       0.94      0.82      0.88        40
                                 receiving_money       0.88      0.93      0.90        40
                                  request_refund       1.00      0.90      0.95        40
                          reverted_card_payment?       0.85      0.97      0.91        40
                  supported_cards_and_currencies       0.78      0.97      0.87        40
                               terminate_account       0.93      0.95      0.94        40
                  top_up_by_bank_transfer_charge       0.91      0.72      0.81        40
                           top_up_by_card_charge       0.90      0.93      0.91        40
                        top_up_by_cash_or_cheque       0.94      0.80      0.86        40
                                   top_up_failed       0.70      0.93      0.80        40
                                   top_up_limits       0.91      0.97      0.94        40
                                 top_up_reverted       0.97      0.85      0.91        40
                              topping_up_by_card       0.85      0.72      0.78        40
                       transaction_charged_twice       0.93      1.00      0.96        40
                            transfer_fee_charged       0.77      0.90      0.83        40
                           transfer_into_account       0.91      0.72      0.81        40
              transfer_not_received_by_recipient       0.68      0.80      0.74        40
                                 transfer_timing       0.88      0.75      0.81        40
                       unable_to_verify_identity       0.91      0.72      0.81        40
                              verify_my_identity       0.82      0.57      0.68        40
                          verify_source_of_funds       0.87      1.00      0.93        40
                                   verify_top_up       0.97      0.93      0.95        40
                        virtual_card_not_working       0.97      0.72      0.83        40
                              visa_or_mastercard       0.97      0.93      0.95        40
                             why_verify_identity       0.61      0.95      0.75        40
                   wrong_amount_of_cash_received       0.95      0.93      0.94        40
         wrong_exchange_rate_for_cash_withdrawal       0.92      0.85      0.88        40

                                        accuracy                           0.88      3080
                                       macro avg       0.89      0.88      0.88      3080
                                    weighted avg       0.89      0.88      0.88      3080
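Macro and weighted F1 are nearly identical here because the test set is perfectly balanced (40 queries per intent). On an imbalanced toy example (hypothetical labels) the two diverge:

```python
from sklearn.metrics import f1_score

# Imbalanced toy problem: class 0 dominates, the rare class 1 is half-missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

macro = f1_score(y_true, y_pred, average="macro")        # each class counts equally
weighted = f1_score(y_true, y_pred, average="weighted")  # classes weighted by support

print(f"macro={macro:.3f} weighted={weighted:.3f}")  # weighted > macro here
```

Macro F1 is the more demanding metric on imbalanced data, since poor performance on a rare class drags it down as much as on a common one.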

Finetuning RoBERTa Transformer with LoRA¶

Next, we'll finetune a pretrained RoBERTa transformer using LoRA (Low-Rank Adaptation) in an attempt to improve our ability to predict intent.

  • RoBERTa provides contextualized token embeddings that capture semantic differences between intent classes which our baseline MLP may have struggled to separate.
  • LoRA injects small trainable adapters that significantly reduce the computational cost of finetuning while preserving performance.
  • We expect finetuning to still be slower than training the MLP, but expect LoRA-RoBERTa's classification performance to be better.
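The mechanism behind LoRA can be sketched in NumPy: the frozen weight W is augmented by a low-rank update scaled by alpha/r (a conceptual sketch, using the hidden size, rank, and scaling we configure later):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 32, 64  # hidden size, LoRA rank, scaling factor

x = rng.standard_normal(d)
W = rng.standard_normal((d, d))         # frozen pretrained weight (never updated)
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialised

# LoRA forward pass: base output plus the scaled low-rank correction
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapted layer initially matches the frozen one exactly
assert np.allclose(y, W @ x)

# Trainable parameters per adapted projection: 2*d*r instead of d*d
print(2 * d * r, "vs", d * d)  # 49152 vs 589824
```

Only A and B receive gradients, so each adapted projection trains roughly 8% of the parameters a full finetune would touch.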
train = pd.read_csv("datasets/banking77_train.csv")
test = pd.read_csv("datasets/banking77_test.csv")
train.head()
text category
0 I am still waiting on my card? card_arrival
1 What can I do if my card still hasn't arrived ... card_arrival
2 I have been waiting over a week. Is the card s... card_arrival
3 Can I track my card while it is in the process... card_arrival
4 How do I know if I will get my card, or if it ... card_arrival

Encode Class Labels with LabelEncoder¶

train = train.dropna(subset=["text", "category"]).copy()
test = test.dropna(subset=["text", "category"]).copy()

label_encoder = LabelEncoder()
train["labels"] = label_encoder.fit_transform(train["category"].astype(str))
test["labels"]  = label_encoder.transform(test["category"].astype(str))

num_classes = len(label_encoder.classes_)
print("num_classes:", num_classes)

training_set = datasets.Dataset.from_pandas(train[["text", "labels"]], preserve_index=False)
test_set = datasets.Dataset.from_pandas(test[["text", "labels"]], preserve_index=False)
num_classes: 77

Tokenization Using RoBERTa Tokenizer¶

Here we use the RoBERTa tokenizer roberta-base to convert raw text queries into subword-based tokens:

  • Truncate text to a maximum sequence length of 256 tokens.
  • Apply dynamic padding using DataCollatorWithPadding, which pads sequences based on the longest sequence within each batch.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

tokenized_training_set = training_set.map(tokenize_function, batched=True)
tokenized_test_set     = test_set.map(tokenize_function, batched=True)

cols_to_keep = {"input_ids", "attention_mask", "labels"}
tokenized_training_set = tokenized_training_set.remove_columns(
    [c for c in tokenized_training_set.column_names if c not in cols_to_keep]
)
tokenized_test_set = tokenized_test_set.remove_columns(
    [c for c in tokenized_test_set.column_names if c not in cols_to_keep]
)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Map: 100%|█████████████████████████████████████████████████████████████| 10003/10003 [00:00<00:00, 57454.96 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████| 3080/3080 [00:00<00:00, 77901.34 examples/s]
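A minimal pure-Python sketch of what dynamic padding does (DataCollatorWithPadding additionally builds tensors; the token ids here are illustrative, though roberta-base's pad token id really is 1):

```python
def pad_batch(batch, pad_id=1):
    """Pad every sequence in the batch to the batch's own longest length."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]])
print(ids)   # [[0, 713, 2, 1, 1], [0, 713, 16, 10, 2]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps most batches short, which saves both memory and compute.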

RoBERTa Transformer Finetuning and Evaluation¶

Next, we configure the transformer finetuning process using Hugging Face's TrainingArguments.

Set Up Finetuning Configuration¶

training_args = TrainingArguments(
    output_dir="./temp_results",
    save_strategy="no",
    logging_dir="./logs",
    eval_strategy="epoch",
    logging_strategy="epoch",
    report_to="none",

    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
)
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.
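The learning-rate schedule configured above (linear warmup over the first 10% of steps, then cosine decay) can be sketched in plain Python; this is a simplification of the schedule the Trainer builds, not transformers' exact implementation:

```python
import math

def lr_at_step(step, total_steps, lr_max=2e-4, warmup_ratio=0.1):
    """Linear warmup to lr_max, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(lr_at_step(0, total))     # 0.0 -- warmup starts from zero
print(lr_at_step(100, total))   # peak (2e-4) at the end of warmup
print(lr_at_step(1000, total))  # ~0.0 -- fully decayed
```

Warmup avoids large, destabilizing updates while the freshly initialized classification head is still random; the cosine tail then anneals the rate smoothly toward zero.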

Configure LoRA for Parameter-Efficient Finetuning¶

Next, we configure and apply LoRA to efficiently finetune RoBERTa using the PEFT library.
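Before applying it, the trainable-parameter budget is worth deriving by hand. With r=32 adapters on query/key/value in each of roberta-base's 12 layers (hidden size 768), plus the freshly initialised classification head that stays fully trainable, the count works out as follows (arithmetic sketch; the head layout assumes roberta-base's standard classification head, a dense layer plus an output projection):

```python
hidden, r, layers, classes = 768, 32, 12, 77

# Each adapted projection gets A (r x hidden) and B (hidden x r), no biases
lora = 3 * layers * (r * hidden + hidden * r)  # query/key/value in every layer

# The new classification head is trained in full: dense + output projection
head = (hidden * hidden + hidden) + (hidden * classes + classes)

print(lora, head, lora + head)  # 1769472 649805 2419277
```

The total matches the 2,419,277 trainable parameters that print_trainable_parameters() reports below, about 1.9% of the full model.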

torch.manual_seed(42)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=num_classes
).to(device)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "key", "value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.to(device)
Loading weights: 100%|█| 197/197 [00:00<00:00, 1175.45it/s, Materializing param=roberta.encoder.layer.11.output.dense.w
RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
classifier.out_proj.weight      | MISSING    | 
classifier.out_proj.bias        | MISSING    | 
classifier.dense.weight         | MISSING    | 
classifier.dense.bias           | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
trainable params: 2,419,277 || all params: 127,124,122 || trainable%: 1.9031
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): RobertaForSequenceClassification(
      (roberta): RobertaModel(
        (embeddings): RobertaEmbeddings(
          (word_embeddings): Embedding(50265, 768, padding_idx=1)
          (token_type_embeddings): Embedding(1, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (position_embeddings): Embedding(514, 768, padding_idx=1)
        )
        (encoder): RobertaEncoder(
          (layer): ModuleList(
            (0-11): 12 x RobertaLayer(
              (attention): RobertaAttention(
                (self): RobertaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (key): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (value): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): RobertaSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): RobertaIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
                (intermediate_act_fn): GELUActivation()
              )
              (output): RobertaOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
      )
      (classifier): ModulesToSaveWrapper(
        (original_module): RobertaClassificationHead(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (out_proj): Linear(in_features=768, out_features=77, bias=True)
        )
        (modules_to_save): ModuleDict(
          (default): RobertaClassificationHead(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (out_proj): Linear(in_features=768, out_features=77, bias=True)
          )
        )
      )
    )
  )
)
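The `trainable params` figure reported above can be cross-checked by hand from the printed structure: each of the 12 encoder layers adapts its query, key, and value projections with rank-32 factors (`lora_A`: 768×32, `lora_B`: 32×768, no biases), and the classification head is trained in full via `modules_to_save`. A quick sanity check, with the dimensions read directly off the printout:

```python
# Reconstruct the trainable-parameter count from the printed architecture
d, r, n_layers, n_classes = 768, 32, 12, 77

# Each adapted attention matrix (query/key/value) adds two low-rank
# factors: lora_A (d x r) and lora_B (r x d), both without bias
lora_params = n_layers * 3 * (d * r + r * d)

# The classification head (modules_to_save) is fully trainable:
# dense (d x d weights + d biases) and out_proj (d x n_classes + n_classes)
head_params = (d * d + d) + (d * n_classes + n_classes)

print(lora_params + head_params)  # 2419277, matching the Trainer report
```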

Finetune and Evaluate the LoRA-RoBERTa Model¶

torch.manual_seed(42)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_training_set,
    eval_dataset=tokenized_test_set,
    processing_class=tokenizer,
    data_collator=data_collator,
)

trainer.train()
eval_metrics = trainer.evaluate()
print(eval_metrics)
[6260/6260 04:26, Epoch 10/10]
Epoch  Training Loss  Validation Loss
    1       2.867155         0.820623
    2       0.588901         0.377281
    3       0.340138         0.305643
    4       0.246103         0.275706
    5       0.183561         0.265762
    6       0.139964         0.271889
    7       0.110447         0.259594
    8       0.087138         0.250638
    9       0.076306         0.251792
   10       0.070241         0.250426

[97/97 00:02]
{'eval_loss': 0.25042590498924255, 'eval_runtime': 2.2513, 'eval_samples_per_second': 1368.102, 'eval_steps_per_second': 43.086, 'epoch': 10.0}

Save Finetuned LoRA-RoBERTa¶

After finetuning completes, we merge the LoRA adapter weights into the base model with `merge_and_unload()` and save the merged model along with the tokenizer. Because the adapters are folded into the base weights, the checkpoint can later be loaded as a plain `RobertaForSequenceClassification` without the PEFT library.

save_dir = "./finetuned_roberta_lora_model"

model = model.merge_and_unload()
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
Writing model shards: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.69it/s]
('./finetuned_roberta_lora_model\\tokenizer_config.json',
 './finetuned_roberta_lora_model\\tokenizer.json')

Load the Finetuned LoRA-RoBERTa¶

path = "./finetuned_roberta_lora_model"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path).to(device)
model.eval()
Loading weights: 100%|█| 201/201 [00:00<00:00, 1129.77it/s, Materializing param=roberta.encoder.layer.11.output.dense.w
RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=77, bias=True)
  )
)

Evaluate the Finetuned LoRA-RoBERTa on the Testing Dataset¶

Next we generate predictions on the test set with the finetuned LoRA-RoBERTa model; we will use these later to evaluate the model and compare it against the MLP baseline from above.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

prediction_args = TrainingArguments(
    output_dir="./temp_predictions", 
    per_device_eval_batch_size=32,    
    report_to="none",                  
)

predictor = Trainer(
    model=model,
    args=prediction_args, 
    processing_class=tokenizer,
    data_collator=data_collator,
)

# Generate testing predictions
roberta_pred_output = predictor.predict(tokenized_test_set)
roberta_y_pred = np.argmax(roberta_pred_output.predictions, axis=1)
roberta_y_true = roberta_pred_output.label_ids
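The integer predictions above can be mapped back to human-readable intent names with the fitted `label_encoder`. A minimal self-contained sketch of the pattern, using toy stand-in labels rather than the real 77 intents:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Banking77 label encoder fitted earlier
label_encoder = LabelEncoder().fit(["card_arrival", "change_pin", "top_up_failed"])

# Pretend these are argmax'd model predictions (integer class ids)
y_pred = np.array([0, 2, 1, 0])

print(label_encoder.inverse_transform(y_pred))
# ['card_arrival' 'top_up_failed' 'change_pin' 'card_arrival']
```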

Classification Report: Finetuned LoRA-RoBERTa¶

lora_roberta_accuracy = accuracy_score(roberta_y_true, roberta_y_pred)
lora_roberta_macro_f1 = f1_score(roberta_y_true, roberta_y_pred, average='macro')
lora_roberta_weighted_f1 = f1_score(roberta_y_true, roberta_y_pred, average='weighted')
lora_roberta_report = classification_report(roberta_y_true, roberta_y_pred, target_names=label_encoder.classes_)

print("=== Finetuned RoBERTa + LoRA Performance ===")
print(f"Overall Accuracy Score: {lora_roberta_accuracy:.4f}")
print(f"Macro F1: {lora_roberta_macro_f1:.4f}")
print(f"Weighted F1: {lora_roberta_weighted_f1:.4f}")
print("\nClassification report:")
print(lora_roberta_report)
=== Finetuned RoBERTa + LoRA Performance ===
Overall Accuracy Score: 0.9367
Macro F1: 0.9367
Weighted F1: 0.9367

Classification report:
                                                  precision    recall  f1-score   support

                           Refund_not_showing_up       0.97      0.95      0.96        40
                                activate_my_card       0.97      0.97      0.97        40
                                       age_limit       1.00      1.00      1.00        40
                         apple_pay_or_google_pay       1.00      1.00      1.00        40
                                     atm_support       0.98      1.00      0.99        40
                                automatic_top_up       1.00      0.97      0.99        40
         balance_not_updated_after_bank_transfer       0.82      0.78      0.79        40
balance_not_updated_after_cheque_or_cash_deposit       1.00      0.90      0.95        40
                         beneficiary_not_allowed       0.88      0.88      0.88        40
                                 cancel_transfer       1.00      0.97      0.99        40
                            card_about_to_expire       0.97      0.97      0.97        40
                                 card_acceptance       0.97      0.93      0.95        40
                                    card_arrival       0.90      0.88      0.89        40
                          card_delivery_estimate       0.90      0.93      0.91        40
                                    card_linking       1.00      1.00      1.00        40
                                card_not_working       0.88      0.95      0.92        40
                        card_payment_fee_charged       0.88      0.95      0.92        40
                     card_payment_not_recognised       0.92      0.90      0.91        40
                card_payment_wrong_exchange_rate       0.97      0.95      0.96        40
                                  card_swallowed       0.97      0.88      0.92        40
                          cash_withdrawal_charge       0.95      0.95      0.95        40
                  cash_withdrawal_not_recognised       0.88      0.95      0.92        40
                                      change_pin       0.93      1.00      0.96        40
                                compromised_card       0.90      0.93      0.91        40
                         contactless_not_working       1.00      0.93      0.96        40
                                 country_support       0.93      1.00      0.96        40
                           declined_card_payment       0.81      0.95      0.87        40
                        declined_cash_withdrawal       0.82      1.00      0.90        40
                               declined_transfer       0.97      0.75      0.85        40
             direct_debit_payment_not_recognised       0.94      0.85      0.89        40
                          disposable_card_limits       0.93      0.93      0.93        40
                           edit_personal_details       1.00      1.00      1.00        40
                                 exchange_charge       1.00      0.90      0.95        40
                                   exchange_rate       0.91      1.00      0.95        40
                                exchange_via_app       0.91      0.97      0.94        40
                       extra_charge_on_statement       0.95      0.97      0.96        40
                                 failed_transfer       0.88      0.93      0.90        40
                           fiat_currency_support       0.90      0.93      0.91        40
                     get_disposable_virtual_card       0.94      0.85      0.89        40
                               get_physical_card       0.97      0.97      0.97        40
                              getting_spare_card       0.97      0.97      0.97        40
                            getting_virtual_card       0.83      0.97      0.90        40
                             lost_or_stolen_card       0.84      0.95      0.89        40
                            lost_or_stolen_phone       0.97      0.95      0.96        40
                             order_physical_card       0.92      0.90      0.91        40
                              passcode_forgotten       0.98      1.00      0.99        40
                            pending_card_payment       0.97      0.95      0.96        40
                         pending_cash_withdrawal       0.97      0.95      0.96        40
                                  pending_top_up       0.93      0.95      0.94        40
                                pending_transfer       0.86      0.80      0.83        40
                                     pin_blocked       0.97      0.90      0.94        40
                                 receiving_money       0.93      0.93      0.93        40
                                  request_refund       0.93      0.97      0.95        40
                          reverted_card_payment?       0.86      0.90      0.88        40
                  supported_cards_and_currencies       0.88      0.95      0.92        40
                               terminate_account       0.98      1.00      0.99        40
                  top_up_by_bank_transfer_charge       0.90      0.95      0.93        40
                           top_up_by_card_charge       0.95      0.95      0.95        40
                        top_up_by_cash_or_cheque       0.93      0.95      0.94        40
                                   top_up_failed       0.93      0.93      0.93        40
                                   top_up_limits       1.00      1.00      1.00        40
                                 top_up_reverted       0.97      0.85      0.91        40
                              topping_up_by_card       0.89      0.82      0.86        40
                       transaction_charged_twice       0.95      1.00      0.98        40
                            transfer_fee_charged       1.00      0.90      0.95        40
                           transfer_into_account       0.92      0.90      0.91        40
              transfer_not_received_by_recipient       0.82      0.90      0.86        40
                                 transfer_timing       0.86      0.90      0.88        40
                       unable_to_verify_identity       1.00      0.95      0.97        40
                              verify_my_identity       0.95      0.95      0.95        40
                          verify_source_of_funds       1.00      1.00      1.00        40
                                   verify_top_up       1.00      1.00      1.00        40
                        virtual_card_not_working       1.00      0.88      0.93        40
                              visa_or_mastercard       1.00      0.90      0.95        40
                             why_verify_identity       0.93      0.97      0.95        40
                   wrong_amount_of_cash_received       1.00      0.93      0.96        40
         wrong_exchange_rate_for_cash_withdrawal       0.97      0.95      0.96        40

                                        accuracy                           0.94      3080
                                       macro avg       0.94      0.94      0.94      3080
                                    weighted avg       0.94      0.94      0.94      3080

Compare Performances¶

Next we compare the performance of the baseline MLP against the finetuned LoRA-RoBERTa by re-generating the classification reports as dictionaries and building a comparison DataFrame.

mlp_report_dict = classification_report(mlp_true, mlp_preds, target_names=encoder.classes_, output_dict=True)
roberta_report_dict = classification_report(roberta_y_true, roberta_y_pred, target_names=label_encoder.classes_, output_dict=True)

# Extract per-class metrics from classification reports
def extract_class_metrics(report_dict):
    class_metrics = {}
    for intent, metrics in report_dict.items():
        # Skip aggregate metrics (accuracy, macro avg, weighted avg)
        if intent not in ['accuracy', 'macro avg', 'weighted avg']:
            class_metrics[intent] = {
                'precision': metrics['precision'],
                'recall': metrics['recall'],
                'f1-score': metrics['f1-score'],
                'support': metrics['support']
            }
    return pd.DataFrame(class_metrics).T

# Create DataFrames for each model
mlp_df = extract_class_metrics(mlp_report_dict)
roberta_df = extract_class_metrics(roberta_report_dict)

# Build comparison DataFrame: per-class precision/recall/F1 and RoBERTa-minus-MLP differences
comparison_df = pd.DataFrame({
     'MLP_Precision': mlp_df['precision'],
     'RoBERTa_Precision': roberta_df['precision'],
     'MLP_Recall': mlp_df['recall'],
     'RoBERTa_Recall': roberta_df['recall'],
     'MLP_F1': mlp_df['f1-score'],
     'RoBERTa_F1': roberta_df['f1-score'],
     'Precision_Diff': roberta_df['precision'] - mlp_df['precision'],
     'Recall_Diff': roberta_df['recall'] - mlp_df['recall'],
     'F1_Diff': roberta_df['f1-score'] - mlp_df['f1-score'],
})

# Sort by F1 difference
comparison_df = comparison_df.sort_values('F1_Diff', ascending=False)
comparison_df.head(10)
MLP_Precision RoBERTa_Precision MLP_Recall RoBERTa_Recall MLP_F1 RoBERTa_F1 Precision_Diff Recall_Diff F1_Diff
verify_my_identity 0.821429 0.950000 0.575 0.950 0.676471 0.950000 0.128571 0.375 0.273529
contactless_not_working 0.756757 1.000000 0.700 0.925 0.727273 0.961039 0.243243 0.225 0.233766
why_verify_identity 0.612903 0.928571 0.950 0.975 0.745098 0.951220 0.315668 0.025 0.206121
card_not_working 0.631579 0.883721 0.900 0.950 0.742268 0.915663 0.252142 0.050 0.173395
unable_to_verify_identity 0.906250 1.000000 0.725 0.950 0.805556 0.974359 0.093750 0.225 0.168803
card_acceptance 0.714286 0.973684 0.875 0.925 0.786517 0.948718 0.259398 0.050 0.162201
failed_transfer 0.693878 0.880952 0.850 0.925 0.764045 0.902439 0.187075 0.075 0.138394
top_up_failed 0.698113 0.925000 0.925 0.925 0.795699 0.925000 0.226887 0.000 0.129301
transfer_not_received_by_recipient 0.680851 0.818182 0.800 0.900 0.735632 0.857143 0.137331 0.100 0.121511
top_up_by_bank_transfer_charge 0.906250 0.904762 0.725 0.950 0.805556 0.926829 -0.001488 0.225 0.121274
print("====================== 10 Highest F1 Diff ======================")
print(comparison_df[['MLP_F1', 'RoBERTa_F1', 'F1_Diff']].head(10))
print()
print("==================== 10 Lowest F1 Diff =====================")
print(comparison_df[['MLP_F1', 'RoBERTa_F1', 'F1_Diff']].tail(10))
====================== 10 Highest F1 Diff ======================
                                      MLP_F1  RoBERTa_F1   F1_Diff
verify_my_identity                  0.676471    0.950000  0.273529
contactless_not_working             0.727273    0.961039  0.233766
why_verify_identity                 0.745098    0.951220  0.206121
card_not_working                    0.742268    0.915663  0.173395
unable_to_verify_identity           0.805556    0.974359  0.168803
card_acceptance                     0.786517    0.948718  0.162201
failed_transfer                     0.764045    0.902439  0.138394
top_up_failed                       0.795699    0.925000  0.129301
transfer_not_received_by_recipient  0.735632    0.857143  0.121511
top_up_by_bank_transfer_charge      0.805556    0.926829  0.121274

==================== 10 Lowest F1 Diff =====================
                           MLP_F1  RoBERTa_F1   F1_Diff
getting_virtual_card     0.886076    0.896552  0.010476
request_refund           0.947368    0.951220  0.003851
top_up_reverted          0.906667    0.906667  0.000000
lost_or_stolen_phone     0.962963    0.962025 -0.000938
visa_or_mastercard       0.948718    0.947368 -0.001350
cash_withdrawal_charge   0.961039    0.950000 -0.011039
card_about_to_expire     0.987654    0.975000 -0.012654
pending_cash_withdrawal  0.987342    0.962025 -0.025316
reverted_card_payment?   0.906977    0.878049 -0.028928
beneficiary_not_allowed  0.962963    0.875000 -0.087963

Performance Comparison: MLP and the Finetuned LoRA-RoBERTa¶

Both models achieve fairly strong performance on the Banking77 dataset.

The MLP establishes a solid baseline with an overall accuracy of 87.79%. As expected, the finetuned LoRA-RoBERTa outperforms the MLP across all performance metrics, with substantial gains in overall accuracy and in the macro/weighted F1 scores:

Model           Accuracy  Macro F1  Weighted F1
MLP (baseline)  0.8779    0.8782    0.8782
LoRA-RoBERTa    0.9367    0.9367    0.9367

Macro F1 Scores¶

Since Banking77 is multi-class (77 intents) and the test set is balanced (40 samples per class), macro F1 is the primary metric we use to compare models. Macro F1 is the unweighted average of the per-class F1 scores: every intent contributes equally, regardless of its frequency.
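To make the definition concrete, here is a tiny synthetic example (toy labels, not Banking77 data) showing that sklearn's macro F1 is exactly the unweighted mean of the per-class F1 scores:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 3-class example; labels are illustrative only
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)  # per-class F1: 0.75, 0.80, 0.6667
macro = f1_score(y_true, y_pred, average="macro")

# Macro F1 is the unweighted mean of the per-class scores
assert np.isclose(macro, per_class.mean())
print(round(macro, 4))  # 0.7389
```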

The finetuned LoRA-RoBERTa model achieves a +5.85-point gain in macro F1 over the MLP (93.67% vs. 87.82%), with improvements spread across nearly all 77 intent classes.

This matters in production, where:

  • Less frequent but high-stakes queries such as compromised_card and lost_or_stolen_card are identified more reliably.
  • Customers receive a consistent experience across all types of banking queries.

Intent-Level F1 Score Performance¶

The largest F1 score improvements center on two themes:

  1. Identity Verification

    • verify_my_identity: LoRA-RoBERTa (95.00%) outperforming the MLP (67.65%).
    • why_verify_identity: LoRA-RoBERTa (95.12%) outperforming the MLP (74.51%).
    • unable_to_verify_identity: LoRA-RoBERTa (97.44%) outperforming the MLP (80.56%).
  2. Card Issues

    • contactless_not_working: LoRA-RoBERTa (96.10%) outperforming the MLP (72.73%).
    • card_not_working: LoRA-RoBERTa (91.57%) outperforming the MLP (74.23%).

Observing intents where F1 scores did not improve:

  • 6 intents had small performance losses (<3 points) when moving from the MLP to LoRA-RoBERTa, while 1 intent (beneficiary_not_allowed) saw a larger loss (~8.8 points).
  • However, both models still perform well (F1 scores >= 85%) on each of these 7 intents.
  • The minor MLP advantages are most likely due to random variation or slight overfitting by the larger model.

Summary¶

  • RoBERTa + LoRA achieved 93.67% accuracy and 93.67% macro F1, a strong result on the Banking77 multi-class classification task.
  • The baseline MLP achieved 87.79% accuracy and 87.82% macro F1, demonstrating that a simpler traditional architecture can still provide reasonable results.
  • Despite the longer training time, the 5.85-point improvement in macro F1 justifies the additional complexity of the transformer-based model for production deployment.