Email Spam Detection

AIM

To develop a machine learning-based system that classifies email content as spam or ham (not spam).

Dataset: https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification

Notebook: https://www.kaggle.com/code/inshak9/email-spam-detection

LIBRARIES NEEDED

  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • seaborn

DESCRIPTION

What is the requirement of the project?

  • A robust system to detect spam emails is essential to combat increasing spam content.
  • It improves user experience by automatically filtering unwanted messages.
Why is it necessary?
  • Spam emails consume time and resources and may pose security risks such as phishing.
  • Filtering them helps organizations and individuals streamline their email communication.
How is it beneficial and used?
  • Provides a quick and automated solution for spam classification.
  • Used in email services, IT systems, and anti-spam software to filter messages.
How did you start approaching this project?
  • Analyzed the dataset and prepared features.
  • Implemented various machine learning models for comparison.
Additional resources used
  • Documentation from scikit-learn
  • Blog: Introduction to Spam Classification with ML

EXPLANATION

DETAILS OF THE DIFFERENT FEATURES

The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham.

| Feature            | Description                                   |
|--------------------|-----------------------------------------------|
| word_freq_x        | Frequency of specific words in the email body |
| capital_run_length | Length of runs of consecutive capital letters |
| char_freq          | Frequency of special characters like ; and $  |
| is_spam            | Target variable (1 = Spam, 0 = Ham)           |

WHAT I HAVE DONE

Initial data exploration and understanding
  • Loaded the dataset using pandas.
  • Explored dataset features and target variable distribution.
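The exploration step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: it builds a small synthetic DataFrame in place of the Kaggle file (which would normally be loaded with `pd.read_csv`), and the column names are assumptions borrowed from the feature table above. The `summarize` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: the real file would be
# loaded with pd.read_csv and may use different column names).
rng = np.random.default_rng(0)
n = 200
is_spam = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "word_freq_x": rng.random(n) + 0.5 * is_spam,
    "capital_run_length": rng.integers(1, 20, size=n) + 10 * is_spam,
    "char_freq": rng.random(n) * 0.2 + 0.1 * is_spam,
    "is_spam": is_spam,
})

def summarize(frame, target="is_spam"):
    """Return the frame shape, per-class counts, and per-class shares."""
    counts = frame[target].value_counts().sort_index()
    shares = counts / len(frame)
    return frame.shape, counts, shares

shape, counts, shares = summarize(df)
print(shape)            # rows and columns
print(counts)           # how many ham (0) and spam (1) rows
print(shares.round(3))  # class balance as fractions
```

Checking the class balance early matters because a heavily imbalanced target makes accuracy alone a misleading metric.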

Data cleaning and preprocessing
  • Checked for missing values.
  • Standardized features using scaling techniques.
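A minimal sketch of these two checks, assuming the scaling was done with scikit-learn's `StandardScaler` (the source only says "scaling techniques"). The tiny DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values; column names follow the feature table.
X = pd.DataFrame({
    "word_freq_x": [0.0, 0.5, 1.2, 0.3],
    "capital_run_length": [1.0, 40.0, 7.0, 3.0],
    "char_freq": [0.00, 0.30, 0.05, 0.10],
})

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = X.isna().sum()
print(missing_per_column)

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

With a train/test split, the scaler would usually be fitted on the training portion only and then applied to the test portion, so that test statistics do not leak into training.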

Feature engineering and selection
  • Extracted relevant features for spam classification.
  • Used correlation matrix to select significant features.
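One way correlation-based selection might look is shown below. The data is synthetic, the `select_by_correlation` helper is hypothetical, and the 0.2 threshold is an arbitrary assumption; the project's actual cut-off is not stated.

```python
import numpy as np
import pandas as pd

# Synthetic data: two features related to the target and one pure-noise column.
rng = np.random.default_rng(42)
n = 500
is_spam = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "word_freq_x": is_spam + rng.normal(0, 0.5, n),
    "char_freq": 0.5 * is_spam + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
    "is_spam": is_spam,
})

def select_by_correlation(frame, target, threshold):
    """Keep features whose absolute Pearson correlation with the target
    is at least `threshold`, strongest first."""
    corr = frame.corr()[target].drop(target).abs()
    return corr[corr >= threshold].sort_values(ascending=False)

selected = select_by_correlation(df, "is_spam", threshold=0.2)
print(selected)
```

Pearson correlation only captures linear relationships, so a feature dropped this way may still be useful to a non-linear model such as Random Forest.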

Model training and evaluation
  • Trained models: KNN, Naive Bayes, SVM, Random Forest, and AdaBoost.
  • Evaluated models using accuracy, precision, and recall.
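The training and evaluation loop could be sketched as below. It uses `make_classification` data instead of the real dataset, default hyperparameters, and `GaussianNB` as the Naive Bayes variant, all of which are assumptions. AdaBoost is included because it appears in the results table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the spam features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Distance- and margin-based models get a scaler in front of them.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)  # predictions on data not seen during fit
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The numbers printed here come from synthetic data and will not match the project's reported metrics.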

Model optimization and fine-tuning
  • Tuned hyperparameters using GridSearchCV.
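A small GridSearchCV sketch, tuning KNN as an example. The choice of model, the parameter grid, and the 5-fold setting are assumptions for illustration; the source does not say which models or grids were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training split.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid: 3 neighbor counts x 2 weighting schemes = 6 candidates.
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Putting the scaler inside the pipeline keeps validation folds out of the scaling statistics, which gives a less optimistic cross-validation score than scaling the whole dataset first.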

Validation and testing
  • Tested models on unseen data to check performance.


PROJECT TRADE-OFFS AND SOLUTIONS

  • Accuracy vs. Training Time:
      • Random Forest took longer to train than Naive Bayes but achieved higher accuracy (95% vs. 92%).
  • Complexity vs. Interpretability:
      • Simpler models like Naive Bayes were more interpretable but slightly less accurate.
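The accuracy versus training time trade-off can be measured with a simple timer around `fit`. This is a sketch on synthetic data; the `time_fit` helper is hypothetical, and absolute timings depend on the machine and dataset size.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def time_fit(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (training seconds, test accuracy)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(X_test, y_test)

# Synthetic data standing in for the spam features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

timings = {
    "Naive Bayes": time_fit(GaussianNB(), X_train, y_train, X_test, y_test),
    "Random Forest": time_fit(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X_train, y_train, X_test, y_test),
}

for name, (seconds, accuracy) in timings.items():
    print(f"{name}: {seconds:.4f}s to train, accuracy {accuracy:.3f}")
```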

SCREENSHOTS

Project structure or tree diagram

  graph LR
    A[Start] --> B[Load Dataset];
    B --> C[Preprocessing];
    C --> D[Train Models];
    D --> E{Compare Performance};
    E -->|Best Model| F[Deploy];
    E -->|Retry| C;

Visualizations and EDA of different features

[Figure: Correlation]

Model performance graphs

[Figure: Comparison]


MODELS USED AND THEIR EVALUATION METRICS

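| Model         | Accuracy | Precision | Recall |
|---------------|----------|-----------|--------|
| KNN           | 90%      | 89%       | 88%    |
| Naive Bayes   | 92%      | 91%       | 90%    |
| SVM           | 94%      | 93%       | 91%    |
| Random Forest | 95%      | 94%       | 93%    |
| AdaBoost      | 97%      | 97%       | 100%   |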
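For reference, the three metrics in the table are defined from confusion-matrix counts, with spam as the positive class. The sketch below uses hypothetical counts that are not taken from this project; the `classification_metrics` helper is illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: spam correctly flagged      fp: ham wrongly flagged as spam
    fn: spam that slipped through   tn: ham correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for a 200-email test set.
acc, prec, rec = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# prints: accuracy=0.925 precision=0.900 recall=0.947
```

A recall of 100% means no spam email in the test set was missed (zero false negatives), while precision below 100% means some ham was still flagged as spam.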

MODELS COMPARISON GRAPHS

[Figure: Accuracy Graph]


CONCLUSION

WHAT YOU HAVE LEARNED

Insights gained from the data

  • Which features are used has a significant impact on spam detection performance.
  • Simple models like Naive Bayes can achieve competitive performance.
Improvements in understanding machine learning concepts
  • Gained hands-on experience with classification models and model evaluation techniques.
Challenges faced and how they were overcome
  • Balancing accuracy against training time was challenging; this was addressed through model tuning.

USE CASES OF THIS MODEL

Email Service Providers - Automated filtering of spam emails for improved user experience.

Enterprise Email Security - Used in enterprise software to detect phishing and spam emails.


FEATURES PLANNED BUT NOT IMPLEMENTED

  • Integration of deep learning models (LSTM) for improved accuracy.

DEVELOPER

Insha Khan

LinkedIn GitHub

Happy Coding 🤓

Show some ❤️ by starring 🌟 this repository!