NLP Text Pre-processing Techniques

Vaishnavi Abirami
9 min read · Feb 26, 2024


Text preprocessing and feature extraction are essential steps in natural language processing (NLP): they prepare raw text data for analysis and modeling, and they enable machines to effectively understand, analyze, and derive insights from that data.

Text Preprocessing involves transforming raw text into a format that is suitable for further analysis. This typically includes steps such as:
1. Tokenization: Breaking down text into smaller units, such as words or characters.
2. Lowercasing: Converting all text to lowercase to ensure consistency.
3. Removing Punctuation: Eliminating punctuation marks that do not contribute significantly to the meaning of the text.
4. Removing Stopwords: Removing common words (e.g., “the”, “is”, “and”) that occur frequently but often do not carry meaningful information.
5. Stemming or Lemmatization: Normalizing words to their root form to reduce inflectional forms and improve analysis accuracy.
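As a quick illustration, here is a minimal sketch of these steps using NLTK. It assumes the required NLTK resources can be downloaded; stemming (e.g., PorterStemmer) could be swapped in for lemmatization.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required resources once (newer NLTK versions also need "punkt_tab").
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

def preprocess(text):
    tokens = word_tokenize(text.lower())                          # 1. tokenize, 2. lowercase
    tokens = [t for t in tokens if t not in string.punctuation]   # 3. remove punctuation
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]           # 4. remove stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]              # 5. lemmatize

print(preprocess("The cats are sitting on the mats!"))
# e.g. ['cat', 'sitting', 'mat']
```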

Feature Extraction involves transforming text data into numerical features that can be understood by machine learning algorithms. Some common techniques include:
1. Bag of Words (BoW): Representing text data as a matrix where rows correspond to documents and columns correspond to unique words in the corpus, with each cell indicating the frequency of a word in a document. Bag of Words is a special case of the n-gram model where n=1.

Consider the following small corpus of three documents (illustrative, for example):

Document 1: "The cat sat on the mat."
Document 2: "The dog barked at the cat."
Document 3: "The cat jumped off the mat."

The BoW representation of this corpus is a document-term matrix: one row per document, one column per unique word in the corpus, and each cell holding the count of that word in that document (a code sketch follows below).

This matrix would then be given as input to a classification algorithm to perform some classification task.
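Here is a minimal sketch of how this document-term matrix can be produced with scikit-learn's CountVectorizer; the three example sentences above are illustrative, not necessarily the exact corpus of the original figure.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat jumped off the mat.",
]

vectorizer = CountVectorizer()            # defaults: lowercase, word-level tokens
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out()) # the vocabulary (columns of the matrix)
print(bow_matrix.toarray())
# Each row is a document, each column a vocabulary word,
# and each cell the count of that word in that document.
```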

The Bag of Words model has some disadvantages:

  • Loss of Sequence Information: BoW disregards the order of words and treats each document or sentence as a collection of independent words, so sequence information is lost.
  • Sparsity: It often leads to high-dimensional and sparse representations, especially with a large vocabulary, which increases computational complexity and memory requirements.
  • Semantic Gap: BoW struggles to capture semantic relationships between words and cannot differentiate between words used in different contexts.

Here is sample code for generating BoW vectors for a Reddit questions dataset.
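A minimal sketch, assuming the dataset is a CSV file; the file name reddit_questions.csv and the question column are hypothetical placeholders to adjust for your data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical file and column names; adjust to your dataset.
df = pd.read_csv("reddit_questions.csv")
questions = df["question"].astype(str)

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X_bow = vectorizer.fit_transform(questions)

print(X_bow.shape)  # (number of questions, vocabulary size)
```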

2. Term Frequency-Inverse Document Frequency (TF-IDF): Similar to BoW but with weighting to give more importance to rare words and less importance to common words.

Term Frequency:
Term frequency (TF) is used in information retrieval and shows how frequently a term (word) occurs in a document. It indicates the significance of a particular term within that document: the number of times a word w occurs in document d relative to the total number of words in d.

TF can be interpreted as the probability of finding the word w in document d.

TF(w, d) = (number of times w occurs in d) / (total number of words in d)

Inverse Document Frequency:

The inverse document frequency is a measure of how much information the word provides, i.e., if it’s common or rare across all documents. It is used to calculate the weight of rare words across all documents in the corpus. The words that occur rarely in the corpus have a high IDF score. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

IDF(term) = log(total number of documents / number of documents containing the term)

TF-IDF is then calculated as the product of the two:

TF-IDF(w, d) = TF(w, d) × IDF(w)

For the corpus used in the BoW example above, we can compute TF, IDF, and then TF-IDF for each word.

A common word such as "the" occurs in all three documents, so its IDF is log(3/3) = 0 and its TF-IDF is therefore zero, which shows that such words are not significant. Words such as sat, on, dog, barked, at, jumped, off, and mat occur in only one or two documents, so they receive non-zero IDF scores and higher significance.

Here is sample code for generating TF-IDF vectors for a Reddit questions dataset.
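As with BoW, a minimal sketch assuming the same hypothetical reddit_questions.csv file with a question column:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical file and column names; adjust to your dataset.
df = pd.read_csv("reddit_questions.csv")
questions = df["question"].astype(str)

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_tfidf = vectorizer.fit_transform(questions)

print(X_tfidf.shape)          # (number of questions, vocabulary size)
print(X_tfidf[0].toarray())   # TF-IDF weights for the first question
```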

3. Word Embeddings: Representing words as dense vectors in a continuous vector space, typically learned from large text corpora using techniques like Word2Vec, GloVe, or BERT. Word embeddings capture semantic relationships between words and are often used as features in NLP tasks.

Just as the features of a shape (for example, a rectangle) can be collected into a feature vector, each word can be represented by a feature vector of its own.
This is how the man-woman pair relates to the king-queen pair: the main difference captured by the embeddings is the gender feature.

For this example, we will consider Word2Vec.

Word2Vec encompasses not just one algorithm but rather a spectrum of model architectures and optimizations utilized for acquiring word embeddings from extensive datasets. The embeddings derived via Word2Vec have exhibited efficacy across various downstream natural language processing tasks.

The classic word2vec model produces a 300-dimensional feature vector for each word (the dimensionality is a hyper-parameter; 300 is the size used by the well-known pretrained Google News vectors). When a corpus is fed into the word2vec model, it produces word embeddings: numerical representations in which each dimension loosely corresponds to some latent feature of the word. Cosine similarity between these vectors is then used to measure how similar two words are.

E_apple and E_orange represent the 300-dimensional vectors of those two words in the vocabulary; the dimensions of each vector capture the word's features.
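As a quick illustration, here is a minimal sketch using gensim's downloader and the pretrained 300-dimensional Google News Word2Vec vectors (the queried words are only illustrative, and the model download is large):

```python
import gensim.downloader as api

# Load the pretrained 300-dimensional Word2Vec vectors trained on Google News.
model = api.load("word2vec-google-news-300")

print(model["apple"].shape)                 # (300,)
print(model.similarity("apple", "orange"))  # cosine similarity between the two word vectors
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# the top result is typically "queen", mirroring the man-woman / king-queen relationship
```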

The original Word2Vec papers introduced two approaches for crafting word representations:

  1. Continuous bag-of-words model: This model forecasts the central word grounded on the neighboring contextual words. The context comprises a few words preceding and succeeding the present (central) word. This design is termed a bag-of-words model since the sequence of words in the context holds no significance.
  2. Continuous skip-gram model: Here, the model predicts words located within a specified range both before and after the current word within the same sentence.

In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. While in the Skip-gram model, the distributed representation of the input word is used to predict the context.

A prerequisite for any neural network, or any supervised training technique, is labeled training data. So how do you train a neural network to predict word embeddings when you don't have any labeled data, i.e., words and their corresponding embeddings?

Skip-gram Model

We'll do so by creating a "fake" task for the neural network to train on. We won't be interested in the inputs and outputs of this network; rather, the goal is just to learn the weights of the hidden layer, which are the "word vectors" we're trying to learn.

The fake task for Skip-gram model would be, given a word, we’ll try to predict its neighbouring words. We’ll define a neighbouring word by the window size — a hyper-parameter.

Here, orange is the context (source) word and the targets are its surrounding words within some window size.

Given the sentence:
"I would like some orange juice with some white chocolate chips."
and a window size of 2, if the context word is orange, its neighboring words will be (juice, with, some, like). Our input and target word pairs would be (orange, juice), (orange, with), (orange, some), (orange, like).
Also note that within the sampling window, the proximity of a word to the source word plays no role, so juice, with, some, and like are all treated the same while training.
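A tiny sketch of how such (source, target) training pairs can be generated for a given window size (plain Python, for illustration only):

```python
def skipgram_pairs(sentence, window_size=2):
    """Generate (source, target) pairs for the skip-gram 'fake' task."""
    tokens = sentence.lower().replace(".", "").split()
    pairs = []
    for i, source in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((source, tokens[j]))
    return pairs

sentence = "I would like some orange juice with some white chocolate chips."
print([p for p in skipgram_pairs(sentence) if p[0] == "orange"])
# [('orange', 'like'), ('orange', 'some'), ('orange', 'juice'), ('orange', 'with')]
```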

The dimension of the input vector will be 1xV, where V is the number of words in the vocabulary, i.e., a one-hot representation of the word. The single hidden layer's weight matrix will have dimension VxE, where E is the size of the word embedding and is a hyper-parameter. The output of the hidden layer will have dimension 1xE, which we feed into a softmax layer. The dimension of the output layer will be 1xV, where each value in the vector is the probability score of the target word at that position.

The backpropagation for all training samples corresponding to a source word is done in one backward pass. So for the source word orange, we complete the forward pass for all 4 target words (juice, with, some, like). We then calculate the error vectors (each 1xV) corresponding to each target word. We now have four 1xV error vectors and perform an element-wise sum to get a single 1xV vector. The weights of the hidden layer are updated based on this cumulative 1xV error vector.
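A minimal NumPy sketch of these dimensions and the cumulative error (V and E are small illustrative values and the weights are random; this is not an optimized implementation):

```python
import numpy as np

V, E = 11, 8                      # vocabulary size and embedding size (illustrative)
W_hidden = np.random.rand(V, E)   # VxE: rows are the word vectors we want to learn
W_output = np.random.rand(E, V)   # ExV: output layer weights

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

source_idx = 4                            # index of the source word, e.g. "orange"
x = np.zeros(V); x[source_idx] = 1.0      # 1xV one-hot input

h = x @ W_hidden                  # 1xE hidden activation (= the source word's vector)
y_pred = softmax(h @ W_output)    # 1xV probability distribution over the vocabulary

target_indices = [2, 3, 5, 6]     # e.g. "like", "some", "juice", "with"
errors = np.zeros(V)
for t in target_indices:
    y_true = np.zeros(V); y_true[t] = 1.0
    errors += (y_pred - y_true)   # element-wise sum of the 1xV error vectors

# The cumulative error is then backpropagated to update W_output and W_hidden.
print(errors.shape)               # (11,)
```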

There is still a computational issue with the softmax: for every training example, its denominator requires summing the dot products over all the words in the vocabulary.

The solution to this is negative sampling, which simplifies the learning problem: given a context word, predict whether some other word is one of its target words or just a randomly sampled word.

Negative sampling overcomes the computation problem we just faced with the softmax.

We treat x as a (context, word) pair, and the label (target) indicates whether the word is a true target word for that context or not. Computation becomes much faster by using negative sampling with k+1 logistic regression classifiers (one true pair plus k randomly sampled negative pairs) rather than a full softmax over the vocabulary.
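A rough sketch of how such labeled pairs could be constructed (the sampling here is uniform for simplicity; word2vec actually draws negatives from a smoothed unigram distribution):

```python
import random

vocabulary = ["i", "would", "like", "some", "orange", "juice",
              "with", "white", "chocolate", "chips"]

def negative_samples(context, true_target, k=4):
    """Build one positive (label 1) and k negative (label 0) training examples."""
    examples = [((context, true_target), 1)]
    while len(examples) < k + 1:
        candidate = random.choice(vocabulary)
        if candidate not in (context, true_target):
            examples.append(((context, candidate), 0))
    return examples

for pair, label in negative_samples("orange", "juice", k=4):
    print(pair, label)
# ('orange', 'juice') 1, followed by 4 randomly sampled pairs labeled 0
```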

Intuitions of Negative Sampling

  • The learned word embeddings will contain information about word co-occurrence relationships.
  • Embeddings are learned in such a way that they capture which words tend to appear together (in a window) and which do not.
  • Semantics can be described by word co-occurrence: “A word is characterized by the company it keeps.”

CBOW

CBOW is somewhat similar to skip-gram, in the sense that we still take pairs of words and teach the model that they co-occur, but instead of summing the error vectors, we combine the input context words for the same target word.

The dimensions of our hidden layer and output layer remain the same; only the dimension of the input layer and the calculation of the hidden layer activations change. If we have 4 context words for a single target word, we will have four 1xV input vectors. Each is multiplied with the VxE hidden layer weight matrix, returning a 1xE vector. All four 1xE vectors are averaged element-wise to obtain the final activation, which is then fed into the softmax layer.
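A short NumPy sketch of the CBOW hidden activation, reusing the illustrative shapes from the skip-gram sketch above:

```python
import numpy as np

V, E = 11, 8
W_hidden = np.random.rand(V, E)

context_indices = [2, 3, 5, 6]    # e.g. "like", "some", "juice", "with"

# One 1xV one-hot vector per context word.
context_vectors = np.zeros((len(context_indices), V))
for row, idx in enumerate(context_indices):
    context_vectors[row, idx] = 1.0

# Multiply each by the VxE weight matrix, then average element-wise.
hidden = (context_vectors @ W_hidden).mean(axis=0)
print(hidden.shape)   # (8,) hidden activation, fed into the softmax layer to predict the target word
```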

Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
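In practice, both variants can be trained with gensim; here is a minimal sketch on a toy tokenized corpus (the sentences are illustrative only):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus, illustrative only.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "cat", "jumped", "off", "the", "mat"],
]

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram_model.wv["cat"].shape)          # (50,)
print(cbow_model.wv.most_similar("cat", topn=3))
```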

Here is the Kaggle notebook for the techniques mentioned above.
