Skip to main content
  1. My Projects/

Priors

·349 words·2 mins·

Report
Python
NLP
Sentence embedding


ryancahildebrandt/priors

An experiment combining pretrained and bag of words embedding approaches for text classification

HTML
0
0


Mixing Up Sentence Embeddings
#

Using bag of words embeddings to incorporate prior knowledge into pretrained sentence embeddings
#


Open in gitpod
This project contains 0% LLM-generated content

Purpose
#

This project is an experiment on the potential usefulness of combining two commonly used sentence/document embedding approaches, bag of words models (tf-idf, hash, count) and pretrained sentence embedding models (RoBERTa, miCSE, sentence-t5). These embeddings are frequently used in text labeling or classification tasks, which depend on information contained in the embedding vectors. Different embedding approaches encode different sorts and amounts of information, and as a result embedding method can drastically change the outcome of a given task. Where neural network based sentence embeddings can more accurately encode the intent, context, and word similarity in an utterance, bag of words models encode specific words explicitly. Traditionally, these methods have not been combined to create a single embedding vector for a variety of reasons, not least of which is the assumption that neural network based embeddings encode all of the information of bag of words models and then some. While it’s true that neural network embeddings are much more flexible in the types of information they encode and often outperform bag of words models on more complex tasks, that doesn’t necessarily mean that bag of words models can’t encode useful information for a specific task. This is where prior knowledge enters the equation and forms the central question for this experiment:

  • If we have an expectation that for a given problem (text classification in this case), certain words are likely to be informative features in a classification model, is there any benefit in supplementing neural network embeddings with said feature in the form of additional bag of words based embeddings?

Dataset
#

The datasets used for the current project were pulled from the following:


Outputs
#

Ryan Hildebrandt
Author
Ryan Hildebrandt
Data Scientist, etc.