Cluster Benchmarks #
Or: Trying to save myself some boilerplate when picking an embedding+clustering model combination for the 8th time this month #
Purpose #
This is a script that runs multiple text embedding/clustering workloads through a task queue, making it easier (and more fault tolerant) to evaluate several model combinations in one pass.
Dataset #
The datasets available to the current project are the emotion, news, and bitext datasets listed in config.json, plus an optional user-specified CSV.
Pipeline #
Each workload is composed of the main components of any embedding/clustering analysis, plus some optional extras:
- Dataset: A list of texts to be analyzed
- Embedding model: Converts the texts to embedding vectors, effectively storing the meaning of the texts as numbers
- Optional dimensionality reduction: Condenses the embedding vectors to a smaller size while minimizing loss of information contained in the original embeddings
- Clustering algorithm: Groups the embedding vectors into clusters of similar texts, where similarity is defined by the distance between any 2 embeddings
- Optional evaluation metrics: Measures how “good” an individual cluster or clustering algorithm is at grouping similar datapoints together
import json
import logging
import os
import time

import pandas as pd

logger = logging.getLogger(__name__)


def single_pipeline(job: dict, sample: int, output_dir: str) -> None:
    """
    Runs one iteration of the clustering process, encompassing:
    - data readin
    - text embedding
    - dimensionality reduction
    - clustering
    - cluster evaluation
    - results export

    Args:
        job (dict): dict containing dataset, embedding model, dimension reduction model, clustering model, and evaluation metrics
        sample (int): number of datapoints to use from the provided dataset. If set to 0 or None, all datapoints will be used
        output_dir (str): folder relative to cwd to write results to. Must exist prior to running the pipeline
    """
    start = time.time()
    dataset = job["dataset"]
    embedder = job["embedder"]
    reducer = job["reducer"]
    clusterer = job["clusterer"]
    evaluators = job["evaluators"]

    # sample=0 means "use everything"; slicing with None returns the full list
    if not sample:
        sample = None
    data = dataset()[:sample]

    embeddings = embedder(data)
    # a reducer with a falsy name (the "none" option) is skipped
    if reducer.name:
        embeddings = reducer(embeddings)
    labels = clusterer(embeddings)

    df = pd.DataFrame({"text": data, "cluster": labels}).sort_values("cluster")
    clusters = df.groupby("cluster").agg(lambda x: list(x)).to_dict()
    evals = {
        e.name: e(embeddings=embeddings, labels=labels, clusters=clusters)
        for e in evaluators
    }

    # filenames are built from the component names, sanitized for "/" and "."
    cluster_file = f"{dataset.name}_{embedder.name}_{reducer.name}_{clusterer.name}_clusters".replace(
        "/", "_"
    ).replace(".", "_")
    eval_file = (
        f"{dataset.name}_{embedder.name}_{reducer.name}_{clusterer.name}_evals".replace(
            "/", "_"
        ).replace(".", "_")
    )
    with open(os.path.join(".", output_dir, f"{cluster_file}.json"), "w") as f:
        json.dump(clusters, f, indent=4)
    logger.info(f"job clusters saved to {cluster_file}.json")
    if evals:
        with open(os.path.join(".", output_dir, f"{eval_file}.json"), "w") as f:
            json.dump(evals, f, indent=4)
        logger.info(f"job evals saved to {eval_file}.json")
    logger.info(f"job took {time.time() - start:.2f}s")
Configuration #
Workload specification is done through config.json, which is mostly made up of dataset/model names with boolean flags. In main.py, config.json is read in and one job is created per unique combination of dataset, embedding model, clustering algorithm, and dimensionality reduction model, so the total job count is the product of the enabled options in each category and can balloon quickly.
Running multiple evaluation metrics is not multiplicative, e.g.:
- If you have 1 embedding model, 1 clustering model, 1 dimension reduction model, 1 dataset, and no evaluation metrics, you will need 1 job.
- If you have all of that with 4 evaluation metrics, you will also only need 1 job.
{
"sample": 0, // number of datapoints to sample, set to 0 for all datapoints
"output_folder": "outputs", // dir to save outputs to
"datasets": {
"emotion": false,
"news": false,
"bitext": false,
"csv": "" // user specified csv dataset, relative to cwd
},
"embed": {
"sif": false, // Smooth Inverse Frequency, implemented in sif.py
"tfidf": false, // sklearn TFIDF Vectorizer
// Following are from huggingface.co
"average_word_embeddings_levy_dependency": false,
"average_word_embeddings_komninos": false,
"average_word_embeddings_glove.840B.300d": false,
"sentence-t5-base": false,
"LaBSE": false,
"gtr-t5-base": false,
"stsb-roberta-base-v2": false,
"all-MiniLM-L12-v2": false,
"stsb-mpnet-base-v2": false,
"gte-base": false,
"modernbert-embed-base": false,
"e5-base-v2": false,
"bge-base-en-v1.5": false,
"nomic-embed-text-v1.5": false,
"potion-base-2M": false,
"potion-base-8M": false,
// Following have been pre-distilled from huggingface.co models using model2vec and saved to disk in ./dat/models/model2vec
"model2vec/LaBSE": false,
"model2vec/bge-base-en-v1.5": false,
"model2vec/e5-base-v2": false,
"model2vec/gte-base": false,
"model2vec/all-MiniLM-L12-v2": false,
"model2vec/modernbert-embed-base": false,
"model2vec/stsb-roberta-base-v2": false
},
"cluster": {
// From sklearn.cluster
"affinitypropagation": false,
"agglomerative": false,
"birch": false,
"hdbscan": false,
"meanshift": false,
"optics": false,
"spectral": false
},
"eval": {
// From sklearn.metrics
"descriptives": false,
"silhouette": false,
"daviesbouldin": false,
"calinskiharabasz": false
},
"dimredux": {
"umap": false, // umap-learn
"pacmap": false, // pacmap
"none": true // do not reduce embedding dimensions, should be true by default
}
}
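For reference, the expansion in main.py boils down to a cartesian product over the enabled options. A minimal sketch, assuming a valid (comment-free) config.json and leaving out the step that maps names to the actual dataset/model callables:

# Sketch of the config-to-jobs expansion (assumed; main.py's actual registries may differ).
# One job per enabled dataset x embedder x reducer x clusterer combination; evaluators ride along.
import itertools
import json

with open("config.json") as f:
    config = json.load(f)

def enabled(section):
    """Names of all options switched on (the 'csv' entry holds a path, but truthiness covers both)."""
    return [name for name, flag in config[section].items() if flag]

jobs = []
for dataset, embedder, reducer, clusterer in itertools.product(
    enabled("datasets"), enabled("embed"), enabled("dimredux"), enabled("cluster")
):
    jobs.append({
        "dataset": dataset,
        "embedder": embedder,
        "reducer": reducer,
        "clusterer": clusterer,
        "evaluators": enabled("eval"),  # metrics don't multiply the job count
    })
print(f"built {len(jobs)} jobs from config.json")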
Usage #
The script uses offline resources as much as possible, so you'll need to clone the model repos you want to use into ./data/models. You could modify it to download models from huggingface.co each time, but if there are a couple of models you'll be using with any frequency, storing them on disk is the way to go.
# run main.py to populate the queue based on config.json
$ python3 main.py
# initialize an rq worker to start processing workloads, in "burst mode" to exit once the queue is empty
$ rq worker -b
Job results are written to the specified output folder relative to cwd, with each job exporting one *_clusters.json file and one *_evals.json file (if evaluators are specified). Clustering results are output as:
{
"0": [
"text1",
"text2",
...
],
"1": [
"text3",
"text4",
...
],
...
}
And evaluations (if specified) are output as:
{
"metric1": 0.01,
"grouped_metrics": {
"val1": 10,
"val2": 0,
...
},
...
}
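If you'd rather skim results in Python than open the JSON files by hand, a quick loader might look like the following (assuming the default "outputs" folder and the documented cluster format above):

# Quick-and-dirty results review loop (assumed output folder and filename pattern).
import glob
import json
import os

for path in sorted(glob.glob(os.path.join("outputs", "*_clusters.json"))):
    with open(path) as f:
        clusters = json.load(f)
    print(os.path.basename(path))
    for label, texts in clusters.items():
        # show the cluster size and a couple of example texts
        print(f"  cluster {label}: {len(texts)} texts, e.g. {texts[:2]}")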
Once the results are written, a typical workflow will probably consist of reviewing the clusters for consistency/intelligibility and noting which components produce desirable results for your specific use case. While I haven't built in full parameter specification for the variety of clustering algorithms, embedding models, and dimensionality reduction algorithms, you should be able to tweak those in the code without breaking things too much. Think of this as a jumping-off point for any fine-tuning you may need to do rather than a comprehensive do-it-all tool.
Manual Evaluations #
I did some manual evaluation of the available embedding models and my go-to clustering models, just to get a sense of what I should actually recommend for use in production settings. This was mostly to see the differences in embedding models, as I feel reasonably confident in which clustering models have been useful in my day to day work.
Metrics #
For the manual evaluation, I clustered the same dataset (200 utterances, 100 from the emotion dataset and 100 from bitext) with each embedder and clusterer. I came up with 3 qualitative metrics to capture the things I've found to be desirable in NLP clustering outputs:
- Granularity: How specific are clusters to a particular topic or handful of topics?
- Intelligibility: How human-identifiable are the topics represented by each cluster?
- Containment: How well is a particular topic confined to 1 or a few clusters?
I reviewed the clustering results and rated each embedder/clusterer combination on a scale of -1, 0, or 1 for low, medium, and high. These ratings were of course subjective, but over the full set of combinations they should capture something about the result quality, especially alongside the calculated metrics included from the evaluators.
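Tallying those ratings across dozens of combinations is easier in a table. A small sketch, assuming a hypothetical ratings.csv with one row per embedder/clusterer combination (the actual review was done by hand, so no such file ships with the repo):

# Hypothetical aggregation of the manual -1/0/1 ratings. ratings.csv is assumed,
# with columns: embedder, clusterer, granularity, intelligibility, containment.
import pandas as pd

ratings = pd.read_csv("ratings.csv")

# Drop HDBSCAN rows, mirroring the summary in the Results section
ratings = ratings[ratings["clusterer"] != "hdbscan"]

metrics = ["granularity", "intelligibility", "containment"]
summary = ratings.groupby("embedder")[metrics].mean()
summary["overall"] = summary[metrics].mean(axis=1)
print(summary.sort_values("overall", ascending=False))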
Results #
The clusterer results were pretty surprising to me: HDBSCAN (at least with its default settings, and yes, I know you can greatly adjust the output using the model parameters) performed pretty poorly across the board, while affinity propagation was much more consistent and tended to produce very good results out of the box. Model2Vec distilled models showed an overall drop in clustering quality, which wasn't surprising given the performance vs. model size tradeoff that comes with distillation. Finally, and at least for this evaluation most importantly, come the embedder results. The summary below ignores HDBSCAN results since they were generally pretty unusable.
For granularity, the high performers were:
- gte-base
- gtr-t5-base
- LaBSE
- modernbert-embed-base
- nomic-embed-text-v1_5
- sentence-t5-base
- stsb-mpnet-base-v2
- stsb-roberta-base-v2
- tfidf
For intelligibility:
- bge-base-en-v1_5
- model2vec_bge-base-en-v1_5
- potion-base-8M
- gte-base
- gtr-t5-base
- LaBSE
- modernbert-embed-base
- nomic-embed-text-v1_5
- sentence-t5-base
- stsb-roberta-base-v2
For containment:
- modernbert-embed-base
- sentence-t5-base
- stsb-roberta-base-v2
And high performers across all categories:
- modernbert-embed-base
- sentence-t5-base
- stsb-roberta-base-v2
I think it will generally be hard to go wrong with any of these models in NLP clustering tasks. That being said, sentence-t5-base is less than half the size on disk of stsb-roberta-base-v2 and a third the size of modernbert-embed-base, so I know which I'll be using by default.