Text Classification with scikit-learn: A Beginner's Guide
- Suhas Bhairav

- Feb 5, 2024
In today's data-driven world, the ability to analyze and categorize text data is invaluable. Text classification, a fundamental task in natural language processing (NLP), involves automatically assigning predefined categories or labels to free-text documents. In this blog post, we'll explore how to perform text classification using scikit-learn, a popular machine learning library in Python.
Introduction to Text Classification
Text classification, also known as text categorization or document classification, is a supervised learning task: we train a model on documents whose categories are already known, then use it to assign new documents to one or more of those predefined categories. Common applications include spam detection, sentiment analysis, and topic labeling.
The Dataset: 20 Newsgroups
For this tutorial, we'll use the 20 Newsgroups dataset, a classic benchmark dataset widely used for text classification tasks. It consists of approximately 20,000 newsgroup documents across 20 different topics. Each document belongs to one of the predefined categories.
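To make the dataset's structure concrete, here is a minimal sketch of how one might count documents per category. The helper names `summarize` and `load_and_summarize` are hypothetical, not part of scikit-learn, and the `toy` object is an invented stand-in with the same three attributes (`data`, `target`, `target_names`) as the object `fetch_20newsgroups()` returns, so the example runs without a download.

```python
from collections import Counter
from types import SimpleNamespace

from sklearn.datasets import fetch_20newsgroups


def summarize(bunch):
    """Return (number of documents, {category name: document count})."""
    counts = Counter(bunch.target_names[int(i)] for i in bunch.target)
    return len(bunch.data), dict(counts)


def load_and_summarize(subset="all"):
    # Hypothetical helper: downloads the data on first use, then reads the local cache.
    return summarize(fetch_20newsgroups(subset=subset))


# Invented offline stand-in, so the sketch runs without touching the network.
toy = SimpleNamespace(
    data=["puck drops at seven", "shuttle docks tomorrow", "overtime win"],
    target=[0, 1, 0],
    target_names=["rec.sport.hockey", "sci.space"],
)
print(summarize(toy))  # (3, {'rec.sport.hockey': 2, 'sci.space': 1})
```

Calling `load_and_summarize()` on a machine with network access should report the real category counts; `subset` also accepts `"train"` and `"test"`.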
Code Implementation
Let's dive into the Python code to perform text classification using scikit-learn:

    import ssl

    # Importing necessary libraries
    from sklearn import metrics
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Workaround: skip HTTPS certificate verification so the dataset download
    # does not fail on machines with certificate problems. This weakens
    # security, so remove it if the download works without it.
    ssl._create_default_https_context = ssl._create_unverified_context

    # Load dataset
    dataset = fetch_20newsgroups()
    X, y = dataset.data, dataset.target

    # Splitting dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=45
    )

    # Convert dataset into feature vectors using TF-IDF Vectorizer
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    # Train classifier (Logistic Regression)
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Making predictions
    pred = clf.predict(X_test)

    # Evaluating the model
    print(metrics.classification_report(y_test, pred))

Understanding the Code
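1. Importing Libraries: We import the scikit-learn modules for dataset loading, data splitting, feature extraction (`TfidfVectorizer`), model training (`LogisticRegression`), and evaluation metrics. The two `ssl` lines disable HTTPS certificate verification as a workaround for download errors; they are not needed where the download already works.
2. Loading Dataset: We load the data with the `fetch_20newsgroups()` function provided by scikit-learn. Called without arguments, it returns only the predefined training subset (11,314 posts); pass `subset="all"` to load every post.
3. Splitting Dataset: The `train_test_split()` function holds out 25% of the documents for testing (`test_size=0.25`), and `random_state=45` makes the split reproducible.
4. Feature Extraction: `TfidfVectorizer` converts each text document into a numerical feature vector, giving more weight to words that are frequent in a document but rare across the corpus and dropping common English stop words. It is fitted on the training set only and then applied to the test set, so nothing from the test documents leaks into the features.
5. Training Classifier: We train a Logistic Regression classifier on the training vectors and their labels.
6. Making Predictions: We use the trained classifier to predict a category for each document in the testing set.
7. Model Evaluation: Finally, `classification_report()` prints precision, recall, and F1-score for each category, along with overall averages.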
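The feature extraction and training steps above can also be chained into a single object. The following is a minimal sketch, not the code from this post: it assumes scikit-learn's `make_pipeline` helper, uses a tiny corpus invented for illustration in place of the newsgroup posts, and raises `max_iter` as a precaution against convergence warnings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration) standing in for the newsgroup posts.
train_docs = [
    "the team won the hockey game",
    "a late goal decided the hockey match",
    "the goalie saved every shot in the game",
    "the rocket launched into orbit",
    "nasa plans a new mission to orbit the moon",
    "the probe sent images from orbit around mars",
]
train_labels = ["sport", "sport", "sport", "space", "space", "space"]

# fit() learns the TF-IDF vocabulary and the classifier weights from the
# training text only; predict() reuses that vocabulary on unseen text.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_docs, train_labels)

new_docs = ["hockey game tonight", "rocket reaches orbit"]
predictions = model.predict(new_docs)
print(predictions.tolist())
```

Because the pipeline accepts raw strings, the vectorizer can never be fitted on test documents by accident, and the same object could be passed to tools such as `cross_val_score`.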
Conclusion
In this blog post, we've demonstrated how to perform text classification using the scikit-learn library in Python. We've covered loading the dataset, splitting it into training and testing sets, extracting TF-IDF features, training a model, making predictions, and evaluating the model's performance. Text classification is a powerful technique with numerous real-world applications, and scikit-learn provides a user-friendly interface for implementing it.
The link to the GitHub code is here.


