LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?


In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Let’s get right to it.

LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?
Image by Author

Introduction

Machine learning models built with frameworks like scikit-learn can work with unstructured data such as text, as long as the raw text is first converted into a numerical representation that algorithms can process.

This article compares three well-known text representation approaches (Bag-of-Words, TF-IDF, and LLM-generated embeddings) analytically and through examples, in the context of downstream machine learning modeling with scikit-learn.

For a glimpse of text representation approaches, including an introduction to the three used in this article, we recommend you take a look at this article and this one.

The article first walks you through a Python example using the BBC news dataset, a labeled dataset of a few thousand news articles categorized into five types. We will obtain the three target representations for each text, build and compare several text classifiers, and then build and compare some clustering models. After that, we adopt a more general, analytical perspective to discuss which approach is better, and when to use one or another.

Setup and Getting Text Representations

First, we import all the modules and libraries we will need, set up some configurations, and load the BBC news dataset:
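The original code listing is not reproduced here, so below is a minimal sketch of what the setup might look like. Since the BBC news CSV is not bundled with this article, the snippet uses a tiny in-memory stand-in dataframe with the same assumed column names (`text` and `category`); swap in your actual copy of the dataset.

```python
import pandas as pd

RANDOM_STATE = 42  # fixed seed for reproducible splits and models later on

# In the article we load the BBC news dataset (2225 articles, 5 categories).
# The file name and column names below are assumptions; adjust to your copy:
# df = pd.read_csv("bbc-news-data.csv")  # expected columns: "text", "category"

# Tiny in-memory stand-in so this sketch runs without the CSV:
df = pd.DataFrame({
    "text": [
        "The government announced a new budget plan today.",
        "The striker scored twice in the championship final.",
        "Shares rallied after the company reported record profits.",
        "The band released a chart-topping new album.",
        "Researchers unveiled a faster mobile processor.",
        "Parliament debated the proposed election reforms.",
        "The tennis star won her third grand slam title.",
        "The tech firm launched an updated smartphone.",
    ],
    "category": ["politics", "sport", "business", "entertainment",
                 "tech", "politics", "sport", "tech"],
})
print(df.shape)
```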

At the time of writing, the dataset version we are using contains 2225 instances, that is, 2225 news-article documents.

Since we will train some supervised machine learning models for classification later on, before obtaining the three representations for our text data, we separate the input texts from their labels and split the whole dataset into training and test subsets:
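As above, the article's listing is missing, so here is a hedged sketch of the split using scikit-learn's `train_test_split` on a small stand-in corpus. On the full BBC dataset you would typically also pass `stratify=labels` to preserve category proportions; that is skipped here because the toy sample has singleton classes.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the texts and labels extracted from the BBC dataframe
texts = ["Budget vote passes", "Striker scores twice", "Markets rally",
         "New album tops charts", "Chip maker unveils processor", "MPs debate reform"]
labels = ["politics", "sport", "business", "entertainment", "tech", "politics"]

# Hold out a third of the documents for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)  # add stratify=labels on the real dataset
print(len(X_train), len(X_test))
```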

Representation 1: Bag-of-Words (BoW)
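The Bag-of-Words listing is not shown; a minimal sketch with scikit-learn's `CountVectorizer` (raw token counts, vocabulary learned on the training split only) might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for X_train / X_test
train_docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats play"]
test_docs = ["the cat and the dog"]

bow = CountVectorizer()                        # raw token counts per document
X_train_bow = bow.fit_transform(train_docs)    # learn the vocabulary on training data only
X_test_bow = bow.transform(test_docs)          # reuse that vocabulary at test time
print(X_train_bow.shape)                       # (n_documents, vocabulary_size)
```

Fitting the vectorizer on the training split alone, then only transforming the test split, avoids leaking test-time vocabulary into training.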

Representation 2: TF-IDF
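Again without the original listing, a sketch of the TF-IDF representation with `TfidfVectorizer` follows. The `stop_words` and `sublinear_tf` options are typical choices, not taken from the article; the defaults also work fine as a first pass.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the market rallied today", "the team lost the match",
              "profits grew this quarter"]
test_docs = ["the market fell today"]

# TF-IDF down-weights terms that appear across many documents,
# so discriminative words get higher weights than common ones
tfidf = TfidfVectorizer(stop_words="english", sublinear_tf=True)
X_train_tfidf = tfidf.fit_transform(train_docs)  # fit on training data only
X_test_tfidf = tfidf.transform(test_docs)
print(X_train_tfidf.shape)
```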

Representation 3: LLM Embeddings
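The embeddings listing is also missing. A common way to get dense, semantically meaningful vectors is the `sentence-transformers` library with a small pretrained model such as `all-MiniLM-L6-v2`; both the library choice and model name are assumptions, not confirmed by the article. The sketch below falls back to a TF-IDF + SVD approximation (clearly not a real LLM embedding) so it still runs in environments without the library or network access.

```python
import numpy as np

def embed_texts(texts):
    """Return one dense vector per input text."""
    try:
        # Typical approach (assumed): a small pretrained sentence-embedding model
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        return np.asarray(model.encode(texts))
    except Exception:
        # Offline fallback so this sketch runs anywhere: TF-IDF compressed
        # with truncated SVD. NOT an LLM embedding; for illustration only.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        X = TfidfVectorizer().fit_transform(texts)
        k = max(1, min(8, X.shape[1] - 1, len(texts) - 1))
        return TruncatedSVD(n_components=k, random_state=42).fit_transform(X)

docs = ["stocks fell sharply", "the team won the cup",
        "new phone released", "election results announced"]
X_emb = embed_texts(docs)
print(X_emb.shape)  # one dense row per document
```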

Comparison 1: Text Classification

That was a thorough preparatory stage! Now we are ready for a first comparison example, focused on training several types of machine learning classifiers and comparing how each type of classifier performs when trained on one text representation or another.

In a nutshell, the code provided below will:

  1. Consider three classifier types: logistic regression, random forests, and support vector machines (SVM).
  2. Train and evaluate all 3 × 3 = 9 classifier and representation combinations, using two evaluation metrics: accuracy and F1 score.
  3. List and visualize the results obtained from each model type and text representation approach used.
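The article's training loop is not reproduced; the sketch below follows the same recipe on a toy two-class corpus (the real code uses the five-class BBC splits and all three representations). Embeddings are omitted here to keep the snippet dependency-free, so it covers 2 × 3 combinations rather than the article's 9.

```python
import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Toy two-class corpus standing in for the five-class BBC data
sport = [f"match {i} the team won the game and scored goals" for i in range(10)]
biz = [f"report {i} the company profits rose and shares gained" for i in range(10)]
texts, labels = sport + biz, ["sport"] * 10 + ["business"] * 10
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42)

# Vectorize once per representation, fitting on the training split only
representations = {}
for name, vec in [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    representations[name] = (vec.fit_transform(X_tr), vec.transform(X_te))

classifiers = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": lambda: LinearSVC(),
}

results = []
for rep_name, (Xtr, Xte) in representations.items():
    for clf_name, make in classifiers.items():
        clf = make()
        t0 = time.perf_counter()
        clf.fit(Xtr, y_tr)                      # time the training step only
        train_time = time.perf_counter() - t0
        pred = clf.predict(Xte)
        results.append({
            "representation": rep_name, "classifier": clf_name,
            "accuracy": accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred, average="macro"),
            "train_time_s": round(train_time, 4),
        })

for r in results:
    print(r)
```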

Output:

Input code for visualizing results:
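The visualization code is also missing; a plausible sketch is a grouped bar chart of accuracy per representation and classifier with matplotlib. The accuracy values below are placeholders for illustration, not the article's actual results.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

reps = ["bow", "tfidf", "embeddings"]
clfs = ["logreg", "rf", "svm"]
# Placeholder values in the shape produced by the training loop:
# rows = representations, columns = classifiers
accuracy = np.array([[0.95, 0.93, 0.96],
                     [0.98, 0.95, 0.99],
                     [0.96, 0.94, 0.97]])

x = np.arange(len(reps))
width = 0.25
fig, ax = plt.subplots(figsize=(7, 4))
for j, clf in enumerate(clfs):
    # offset each classifier's bars so the three groups sit side by side
    ax.bar(x + (j - 1) * width, accuracy[:, j], width, label=clf)
ax.set_xticks(x)
ax.set_xticklabels(reps)
ax.set_ylabel("accuracy")
ax.set_title("Accuracy by representation and classifier")
ax.legend()
fig.tight_layout()
fig.savefig("classifier_comparison.png")
```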

Comparing classifiers trained on different text representations

Let’s take these results with a pinch of salt, as they are specific to the dataset and model types trained, and by no means generalizable. TF-IDF combined with an SVM classifier led to the best accuracy (0.987), while LLM embeddings with SVM yielded the fastest model to train (0.15s). Meanwhile, the best overall combination in terms of performance-speed balance is logistic regression with TF-IDF, with a nearly perfect accuracy of 0.984 and a very fast training time of 0.52s.

Why did LLM embeddings, supposedly the most advanced of the three text representation approaches, not deliver the best performance? There are several reasons. First, the five classes (news categories) in the BBC news dataset are strongly word-discriminative: each category uses distinctive vocabulary, so the classes are easily separable and simpler representations like TF-IDF capture these patterns very well. This also means there is little need for the deep semantic understanding that LLM embeddings provide; in fact, it can sometimes be counterproductive and lead to overfitting. In addition, because the news types are nearly separable, simpler linear models work great compared to complex ones like random forests.

If we had a more challenging, real-world dataset than BBC news, with issues like noise, paraphrasing, slang, or even cross-lingual data, LLM embeddings would probably outperform the other two representations.

Regarding Bag-of-Words: in this scenario it only marginally outperforms the other representations in inference speed, so it is mainly recommended for very simple tasks requiring maximum interpretability, or as a baseline before trying other strategies.

Comparison 2: Document Clustering

We will consider a second scenario: applying k-means clustering with k=5 and comparing cluster quality across the three text representation schemes. Notice in the code below that, since clustering is an unsupervised task that requires neither labels nor a train-test split, we regenerate all three representations for the whole dataset.
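With the original listing missing, here is a hedged sketch of the clustering comparison on a small three-topic toy corpus. The article clusters all 2225 BBC documents with k=5 (one cluster per category) and also scores the embedding representation; the toy version uses k=3 and the two sparse representations so it runs self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy stand-in corpus: three topics, five short documents each
topics = {
    "sport": ["the team won the match", "the striker scored two goals",
              "the coach praised the players", "fans cheered the final score",
              "the league title race tightened"],
    "business": ["shares rose after strong profits", "the company cut its revenue forecast",
                 "markets rallied on the earnings report", "investors sold the stock",
                 "the bank raised interest rates"],
    "tech": ["the new phone has a faster chip", "developers released a software update",
             "the startup launched an ai product", "researchers built a quantum computer",
             "the browser fixed a security flaw"],
}
texts = [t for docs in topics.values() for t in docs]
labels = [name for name, docs in topics.items() for _ in docs]

k = len(topics)  # the article uses k=5, one cluster per BBC news category
for name, X in {
    "bow": CountVectorizer().fit_transform(texts),
    "tfidf": TfidfVectorizer().fit_transform(texts),
}.items():
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    # ARI compares the found clusters against the true topic labels;
    # 1.0 is perfect agreement, ~0.0 is chance level
    ari = adjusted_rand_score(labels, km.fit_predict(X))
    print(f"{name}: ARI={ari:.3f}")
```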

Output:

Code for visualizing results:
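A plausible sketch of the results chart: a simple bar plot of ARI per representation. The 0.899 for embeddings matches the figure quoted in the article; the other two values are placeholders for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# ARI per representation (only the embeddings score is from the article)
scores = {"bow": 0.45, "tfidf": 0.55, "embeddings": 0.899}

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylabel("Adjusted Rand Index")
ax.set_title("Clustering quality by text representation")
fig.tight_layout()
fig.savefig("clustering_comparison.png")
```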

Clustering results with three text representations

LLM embeddings won this time, with an ARI score of 0.899, showing strong alignment between the clusters found and the true document categories. This is largely because clustering is unsupervised: unlike classification, there is no label signal to guide the model, so the semantic understanding provided by embeddings becomes far more important for capturing patterns, even on simpler datasets.

Summary

Simpler, well-behaved datasets like BBC news are a great example of a problem where advanced and LLM-based representations like embeddings do not always win. Traditional natural language processing approaches for text representation may excel in problems with clear class boundaries, linear separability, and clean, formal text without noisy patterns.

In sum, when tackling real-world machine learning projects, always consider starting with simpler baselines and keyword-based representations like TF-IDF before jumping straight to the most advanced, state-of-the-art strategies. The simpler your problem, the lighter the outfit you need to dress it in that perfect machine learning look!


