Text Classification Project
This project demonstrates a basic text classification algorithm using Python and common NLP libraries. The goal is to classify text into different categories, specifically positive and negative sentiments.
Directory Structure
text_classification/
├── data/
│ └── text_data.csv
├── models/
│ └── text_classifier.pkl
├── preprocess.py
├── train.py
└── classify.py
- data/text_data.csv: Contains the dataset used for training and testing the model.
- models/text_classifier.pkl: Stores the trained text classification model.
- preprocess.py: Contains the code for preprocessing the text data.
- train.py: Script for training the text classification model.
- classify.py: Script for classifying new text using the trained model.
Dataset
The dataset (text_data.csv
) contains 200 entries with two columns: text
and label
. The text
column contains the text to be classified, and the label
column contains the corresponding category (positive or negative).
Preprocessing
The preprocess.py
file contains functions to preprocess the text data. It includes removing non-alphabetic characters, converting text to lowercase, removing stopwords, and lemmatizing the words.
Training the Model
The train.py
script is used to train the text classification model. It performs the following steps:
- Load the dataset.
- Preprocess the text data.
- Split the dataset into training and testing sets.
- Create a pipeline with
TfidfVectorizer
andMultinomialNB
. - Train the model.
- Evaluate the model and print the accuracy and classification report.
- Save the trained model to
models/text_classifier.pkl
.
Running the Training Script
To train the model, run:
Classifying New Text
The classify.py
script is used to classify new text using the trained model. It performs the following steps:
- Load the trained model.
- Preprocess the input text.
- Predict the class of the input text.
Running the Classification Script
To classify new text, run:
Dependencies
The project requires the following Python libraries:
- pandas
- scikit-learn
- nltk
- joblib
You can install the dependencies using:
Example Usage
# Example usage of the classify.py script
if __name__ == "__main__":
new_text = "I enjoy working on machine learning projects."
print(f'Text: "{new_text}"')
print(f'Predicted Class: {classify_text(new_text)}')
This project provides a simple text classification model using Naive Bayes with TF-IDF vectorization. You can expand it by using more advanced models or preprocessing techniques based on your requirements.