TEXT CLASSIFICATION USING ML
Natural language processing is a vast area of research at the intersection of machine learning, linguistics, and computer science. Text classification is one of the applications of natural language processing in which textual documents are categorized into several categories of interest. Some interesting applications of text classification include spam detection, sentiment analysis, and classifying articles into different topics.
Before we dive into how text is classified using machine learning, we first need to understand how automatic text classifiers are created and why we need machine learning for text classification. Previously, text classifiers were created manually using rule sets such as a series of if-else statements. However, programming a rule-based classifier is a time-consuming and painful task, as it can require hundreds or thousands of rules. Furthermore, the rule-based approach works only if the programmer knows all the situations under which decisions must be made. Natural language, however, does not rely on a limited set of conditions; rather, it is based on subtle differences in meaning, tone, and expression. It is not humanly possible to write rules capturing all such linguistic nuances. This is where machine learning and artificial intelligence can help. Machine learning algorithms learn patterns from data and can therefore classify text efficiently and effectively, even when datasets are large.
Let’s understand this via an example. Suppose we have a simple sentence, say “I hate this movie”, and we need to classify its sentiment as either positive or negative. Since the term ‘hate’ makes it obvious that the sentence is negative, you can easily write a rule for it. However, consider a more complex case such as:
“I really loved the design of this hat. It looked exactly like the picture on the website, and I was so excited to wear it. However, I put it in the washing machine and dryer, and it got COMPLETELY ruined! The whole hat pilled and the pompom on the top is no longer soft and fluffy.”
Here the review praises the design but complains about durability, so the overall sentiment is negative despite positive words such as ‘loved’ and ‘excited’. Writing explicit rules for every such mixture of cues quickly becomes intractable, which is why we instead let a machine learning model learn these patterns from labeled examples.
Types of machine learning:
There are two types of machine learning:
1. Supervised machine learning
2. Unsupervised machine learning
In supervised machine learning, a predictive model is trained on labeled training data, while in unsupervised learning the model learns patterns from unlabeled data. In this article, we shall focus on supervised learning.
Steps for building a supervised text classifier:
The general steps for training a supervised text classifier are as follows:
1. Data preprocessing
2. Vectorization
3. Training a machine learning algorithm
4. Making predictions
Let’s look into each of these steps in detail.
1. Data Preprocessing
The first step is data preprocessing. Textual data can be preprocessed in various ways. A few of the most common techniques of preprocessing include:
i. Lowercasing
ii. Tokenization
iii. Stop words removal
iv. Lemmatization
i. Lowercasing
In lowercasing, all the uppercase letters in the raw text are converted into lowercase letters. Lowercasing is important to ensure uniformity in the text, as we do not want our model to treat two identical words written in different letter cases, such as ‘nlp’ and ‘NLP’, as different words. Furthermore, we assume that the meaning of the text is not influenced by letter case.
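Lowercasing is a one-liner in Python; a minimal sketch:

# Convert all characters to lowercase so that 'EXCITED' and 'excited'
# become the same token
text = "I was so EXCITED to wear it"
print(text.lower())  # i was so excited to wear it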
ii. Tokenization
Tokenization is the process of splitting raw text, documents, or phrases into smaller units such as words. The text is then represented as a sequence of these tokens. For example:
Text: “I was so excited to wear it”
Tokenized Text: [‘i’, ‘was’, ‘so’, ‘excited’, ‘to’, ‘wear’, ‘it’]
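As a sketch, one common option is NLTK’s word_tokenize (a plain text.split() would also work for simple cases):

import nltk
nltk.download('punkt')  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize

tokens = word_tokenize("i was so excited to wear it")
print(tokens)  # ['i', 'was', 'so', 'excited', 'to', 'wear', 'it']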
iii. Stop words Removal
Stop words are very commonly used words such as ‘a’, ‘the’, ‘are’, ‘have’, and ‘to’. These words do not help in distinguishing one document from another; therefore, we clean our text by removing them.
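A sketch using NLTK’s built-in English stop word list, applied to the tokens from the previous step:

import nltk
nltk.download('stopwords')  # stop word lists, downloaded once
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['i', 'was', 'so', 'excited', 'to', 'wear', 'it']
print([t for t in tokens if t not in stop_words])  # ['excited', 'wear']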
iv. Lemmatization
In natural language, words take different inflectional forms; for example, the word ‘excite’ may appear as ‘excited’, ‘exciting’, or ‘excitement’. The inherent meaning of all these forms is essentially the same. Therefore, we reduce all the inflectional forms of a word to its base form, called the lemma. This process of reducing words to their base forms is called lemmatization.
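A sketch using NLTK’s WordNet lemmatizer; note that it needs a part-of-speech hint (pos='v' for verb here), since by default it treats every word as a noun and would leave ‘excited’ unchanged:

import nltk
nltk.download('wordnet')  # WordNet data, downloaded once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('excited', pos='v'))  # excite
print(lemmatizer.lemmatize('wear', pos='v'))     # wear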
After stop words removal and lemmatization, our example text is now converted into the preprocessed text given below:
Preprocessed Text: [‘excite’, ‘wear’]
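Putting the four steps together, a minimal preprocessing pipeline (using the NLTK pieces sketched above and, for simplicity, lemmatizing every token as a verb) reproduces this result:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource)

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # lowercase + tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t, pos='v') for t in tokens]

print(preprocess("I was so excited to wear it"))  # ['excite', 'wear']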
2. Vectorization
Now that we have the preprocessed text, we need to transform it into numerical features called vectors. This process is called feature extraction or vectorization. There are different techniques of vectorization, such as bag of words, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings.
i. Vectorization using bag of words
Bag of words is the simplest feature extraction technique. It first constructs a vocabulary containing the list of words in the training data. Using this vocabulary, it creates a vector for each document containing the count of each vocabulary word in that document.
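A minimal sketch with scikit-learn’s CountVectorizer, on a two-document toy corpus invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["i loved the hat", "the hat got ruined"]  # toy training data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # build vocabulary, count words
print(vectorizer.get_feature_names_out())  # ['got' 'hat' 'loved' 'ruined' 'the']
print(X.toarray())                         # [[0 1 1 0 1]
                                           #  [1 1 0 1 1]]

(CountVectorizer’s default tokenizer drops single-character tokens, which is why ‘i’ does not appear in the vocabulary.)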
ii. Vectorization using Term Frequency Inverse Document Frequency
Words occurring frequently across almost all documents, such as ‘and’, ‘what’, and ‘this’, typically carry little information. TF-IDF computes the relative importance of each word for a document by assigning it a weight. Words that appear in many documents of the corpus are given less weight, while words that occur in relatively few documents are given more weight. TF-IDF is calculated by multiplying two metrics, term frequency and inverse document frequency. Term frequency is simply the count of a word in a document, while inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result.
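In formula form, tfidf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t. A short sketch follows; note that scikit-learn’s TfidfVectorizer uses a smoothed, L2-normalized variant of this formula, so its numbers differ slightly from the textbook definition:

import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Textbook IDF: a word appearing in every document gets weight zero
N, df = 2, 2
print(math.log(N / df))  # 0.0

docs = ["i loved the hat", "the hat got ruined"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)         # smoothed variant
print(X.toarray().round(2))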
iii. Vectorization using word embeddings
Both bag of words and TF-IDF are based on word frequencies and ignore word order and meaning. However, natural language understanding depends not on word occurrences alone but on the underlying meaning of the text. Word embeddings are an effective vectorization approach that overcomes these limitations and provides a semantic representation of the text. An embedding model predicts the context of a word from the nearby words in the corpus, capturing semantic and syntactic relationships between words. In this technique, words that are similar in meaning lie close to each other in the vector space. There are different models for creating word embeddings, or word vectors, such as word2vec, fastText, and GloVe.
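A minimal word2vec sketch with the gensim library (4.x API); a real model needs a far larger corpus than these toy sentences, so the resulting vectors here are purely illustrative:

from gensim.models import Word2Vec

sentences = [['i', 'loved', 'the', 'hat'],
             ['the', 'hat', 'got', 'ruined'],
             ['i', 'loved', 'the', 'movie']]  # toy tokenized corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv['hat'][:5])           # first 5 dimensions of the 'hat' vector
print(model.wv.most_similar('hat'))  # nearest words in the vector space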
3. Training a machine learning algorithm
Once we have the preprocessed text converted into vectors using any of the vectorization techniques, we can pass these vectors to a machine learning algorithm. There are many machine learning algorithms; one of the simplest is Naïve Bayes, which is used when the output variable is discrete. The mechanics of this algorithm are based on Bayes’ theorem, which is given below:
P(y|X) = P(X|y) * P(y) / P(X)
Here X represents the document features and y represents the class. In simple English, given the input features X, the equation determines the probability of class y. Since P(X) is the same for every class, the class that maximizes P(X|y) * P(y) is chosen as the predicted class of a given document.
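A sketch of training with scikit-learn’s Multinomial Naïve Bayes on bag-of-words counts; the four labeled examples are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["i hate this movie", "completely ruined",
               "loved the design", "excited to wear it"]
train_labels = ["negative", "negative", "positive", "positive"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # bag-of-words vectors
clf = MultinomialNB()
clf.fit(X_train, train_labels)  # estimates P(y) and P(X|y) from the counts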
4. Making predictions
Once the model is trained on the given data, it can be used to make predictions on unseen data.
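Continuing the sketch above, unseen text must be vectorized with the same fitted vectorizer before calling predict:

X_new = vectorizer.transform(["i hate this hat"])  # reuse the fitted vocabulary
print(clf.predict(X_new))        # ['negative'] with this toy training set
print(clf.predict_proba(X_new))  # estimated P(y|X) for each class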