Text Classification with scikit-learn: A Beginner's Guide

Suhas Bhairav
Feb 5, 2024

Being able to analyze and categorize text automatically is useful wherever documents arrive faster than people can read them. Text classification, a fundamental task in natural language processing (NLP), assigns predefined categories or labels to free-text documents. In this blog post, we'll explore how to perform text classification using scikit-learn, a popular machine learning library for Python.

Introduction to Text Classification

Text classification, also known as text categorization or document classification, is a supervised learning task: we train a model on documents whose categories are already known, then use it to assign one or more of those predefined categories to new documents. Applications include spam detection, sentiment analysis, and topic labeling.

The Dataset: 20 Newsgroups

For this tutorial, we'll use the 20 Newsgroups dataset, a classic benchmark widely used for text classification tasks. The original collection is usually described as approximately 20,000 newsgroup documents across 20 different topics; the version that scikit-learn downloads contains about 18,800 posts. Each document belongs to exactly one of the 20 categories.
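
Before training, it helps to check how many documents fall into each category. The sketch below assumes the object returned by `fetch_20newsgroups()` exposes `data`, `target`, and `target_names`. The helper `summarize` and the three-document stand-in are hypothetical, used here so the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(dataset):
    """Return the document count and a per-category count.

    Expects an object with `data`, `target`, and `target_names`
    attributes, where each entry of `target` indexes `target_names`.
    """
    counts = Counter(dataset.target_names[i] for i in dataset.target)
    return len(dataset.data), dict(counts)


# Hypothetical stand-in that mimics the attribute names of the real dataset.
toy = SimpleNamespace(
    data=["the goalie made a save", "my video card is slow", "overtime win"],
    target=[0, 1, 0],
    target_names=["rec.sport.hockey", "comp.graphics"],
)

total, per_category = summarize(toy)
print(total, per_category)
```

With the real dataset, calling `summarize(fetch_20newsgroups())` should give the same kind of summary across all 20 categories.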

Code Implementation

Let's dive into the Python code to perform text classification using scikit-learn:

# Workaround for SSL certificate errors when downloading the dataset.
# It disables HTTPS certificate verification for the whole process, so keep it
# only if the download fails without it and you trust your network.
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# Importing necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load dataset (by default only the "train" subset; pass subset="all" for every post)
dataset = fetch_20newsgroups()
X, y = dataset.data, dataset.target

# Split into training (75%) and testing (25%) sets; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=45)

# Convert documents into TF-IDF feature vectors, dropping English stop words.
# The vocabulary is learned from the training set only and reused on the test set.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train classifier (Logistic Regression)
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Making predictions
pred = clf.predict(X_test)

# Evaluating the model: per-class precision, recall, and F1-score
print(metrics.classification_report(y_test, pred))
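
As an optional variation, not part of the code above, the vectorizer and the classifier can be chained into a single estimator with scikit-learn's `make_pipeline`. The pipeline then takes raw strings for both fitting and prediction. This is a minimal sketch: the four training sentences, their labels, and the two new documents are made up so the example runs without downloading the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroup posts.
train_docs = [
    "the team won the hockey game last night",
    "the goalie made a great save in overtime",
    "new graphics card renders images faster",
    "the monitor driver update fixed the display",
]
train_labels = ["sport", "sport", "hardware", "hardware"]

# Chain feature extraction and classification into one estimator.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_docs, train_labels)

# The pipeline vectorizes new text with the vocabulary learned during fit.
new_docs = ["the hockey team lost the game", "my graphics card driver crashed"]
predictions = model.predict(new_docs)
print(list(predictions))
```

A practical benefit of this layout is that the vectorizer can only ever be fitted on the data passed to `fit`, which makes it harder to leak test documents into the features by accident.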

Understanding the Code

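1. Importing Libraries: We import the scikit-learn modules for dataset loading, data splitting, feature extraction (`TfidfVectorizer`), model training (`LogisticRegression`), and evaluation metrics. The `ssl` lines at the top turn off HTTPS certificate verification so the dataset download does not fail on machines with certificate problems; leave them out if the download works without them.

2. Loading Dataset: We load the 20 Newsgroups dataset using the `fetch_20newsgroups()` function provided by scikit-learn. Called without arguments, it returns only the training subset (about 11,300 posts); pass `subset="all"` to load every post.

3. Splitting Dataset: The dataset is split into training and testing sets using the `train_test_split()` function. With `test_size=0.25`, a quarter of the documents are held out for testing, and `random_state=45` makes the split reproducible.

4. Feature Extraction: We use `TfidfVectorizer` to convert text documents into numerical feature vectors, dropping common English stop words. The vectorizer is fitted on the training set only and then applied to the test set, so nothing from the test documents leaks into the features.

5. Training Classifier: We train a Logistic Regression classifier on the TF-IDF features of the training data.

6. Making Predictions: We use the trained classifier to predict a category for each document in the testing data.

7. Model Evaluation: Finally, we evaluate the model's performance with `classification_report()`, which prints precision, recall, and F1-score for each category.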
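
Two of the steps above, feature extraction and evaluation, are easier to follow with the arithmetic written out. The sketch below is plain Python, not scikit-learn's implementation. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, and L2 normalization per document. Tokenization is a simple lowercase whitespace split, which is a simplification. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed inverse document frequency.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document vector to unit length (L2 norm).
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1-score for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data, not taken from 20 Newsgroups.
docs = ["hockey game tonight", "hockey playoffs", "graphics card driver"]
weights = tfidf_weights(docs)
print(weights[0])

y_true = ["sport", "sport", "hardware", "hardware"]
y_pred = ["sport", "hardware", "hardware", "hardware"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "sport")
print(precision, recall, f1)
```

In the first document, "hockey" receives a lower weight than "game" because it also appears in the second document, which is the point of the IDF term. For the class "sport", one of the two true sport documents is found and nothing is wrongly labeled sport, so precision is 1.0, recall is 0.5, and the F1-score is about 0.667.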

Conclusion

In this blog post, we've demonstrated how to perform text classification using the scikit-learn library in Python. We've covered loading the dataset, splitting it into training and testing sets, extracting TF-IDF features, training a Logistic Regression model, making predictions, and evaluating the model's performance. Text classification has numerous real-world applications, and scikit-learn lets us implement the whole workflow in a few dozen lines of code.

The code for this post is available on GitHub.
