sms-spam-classifier

SMS Spam Detector


Project Overview

A machine learning-based web application that classifies SMS messages as spam or not spam (ham). The project covers the full machine learning pipeline, from data preprocessing to model deployment with Streamlit.

Live demo: SMS Spam Detector App

Features

Technologies Used

Project Structure

How to Run Locally

  1. Clone this repository
  2. Install dependencies: pip install -r requirements.txt
  3. Run the Streamlit app: streamlit run app.py

Machine Learning Pipeline

  1. Data Preprocessing:
    • Removed special characters and numbers
    • Converted text to lowercase
    • Tokenized messages
    • Removed stop words
    • Applied stemming to reduce words to their root form
  2. Exploratory Data Analysis:
    • Cleaned data: removed unnecessary columns and duplicates
    • Graphs and heatmap: Explored trends and feature correlations using graphs and a heatmap
    • Word clouds: Used word clouds to visualise the most common words in each class
    • Found class imbalance: more ham than spam
    • Key insight: Spam messages are typically longer than ham
  3. Feature Engineering:
    • Used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
    • Experimented with different max_features settings for TF-IDF
  4. Model Selection:
    • Tested models:
      • SVC
      • KNeighbors
      • MultinomialNB
      • DecisionTree
      • LogisticRegression
      • RandomForest
      • AdaBoost
      • BaggingClassifier
      • ExtraTrees
      • GradientBoosting
      • XGBoost
    • Also experimented with ensemble methods:
      • VotingClassifier
      • StackingClassifier
    • Final model chosen: MultinomialNB
  5. Model Evaluation:
    • Used accuracy and precision as key metrics
    • Compared performance across different models and configurations
    • Used train-test split for evaluation (80% train, 20% test)
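
The pipeline steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the toy messages, the small stop word set, and the `naive_stem` helper are hypothetical stand-ins (a real setup would more likely use a full stop word list and a proper stemmer), and `max_features=3000` is just one example setting. It assumes scikit-learn is installed.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in stop word list (assumption); a fuller list would be used in practice.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "at",
              "in", "on", "and", "i", "we", "are", "it", "was"}


def naive_stem(word: str) -> str:
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message: str) -> str:
    """Lowercase, drop non-letters, tokenize, remove stop words, stem."""
    letters_only = re.sub(r"[^a-z\s]", " ", message.lower())
    tokens = [t for t in letters_only.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


# Hypothetical toy data: 1 = spam, 0 = ham.
spam = [
    "WINNER!! Claim your free prize now, call 09061701461",
    "Free entry in a weekly competition, text WIN to 80086",
    "Urgent! You have won a free cash prize, call now",
    "Claim your free ringtone now, text FREE to 85555",
    "Congratulations, you won a free holiday, call to claim",
]
ham = [
    "Are we still meeting for lunch today?",
    "I will call you when I get home",
    "Can you pick up some milk on the way back?",
    "See you at the meeting tomorrow morning",
    "Thanks for dinner, it was lovely",
]
texts = [preprocess(m) for m in spam + ham]
labels = [1] * len(spam) + [0] * len(ham)

# 80% train / 20% test, stratified so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(max_features=3000), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Precision is tracked alongside accuracy because a false positive here means a legitimate message gets flagged as spam, which accuracy alone can hide on an imbalanced dataset.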

Future Improvements

  1. Hyperparameter tuning: Systematically tune hyperparameters for top-performing models

  2. Advanced NLP techniques: Incorporate word embeddings or transformer-based models for potentially improved performance

  3. Multilingual support: Add the ability to classify messages written in different languages
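
As a rough idea of what the hyperparameter tuning item could look like, here is a minimal sketch using scikit-learn's `GridSearchCV` over a TF-IDF plus MultinomialNB pipeline. The toy messages, the step names `tfidf` and `nb`, and the grid values are all assumptions chosen for illustration; a real search would run on the full dataset with a grid informed by earlier experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham (six of each so 3-fold CV works).
messages = [
    "claim your free prize now",
    "free entry text win to enter",
    "urgent you have won cash call now",
    "free ringtone text free now",
    "congratulations you won a free holiday",
    "win a free phone call now",
    "are we still meeting for lunch",
    "i will call you when i get home",
    "can you pick up some milk",
    "see you at the meeting tomorrow",
    "thanks for dinner it was lovely",
    "running late be there in ten minutes",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__max_features": [500, 3000],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# Score on precision; integer cv uses stratified folds for classifiers.
search = GridSearchCV(pipeline, param_grid, scoring="precision", cv=3)
search.fit(messages, labels)

print("best parameters:", search.best_params_)
print("best cross-validated precision:", round(search.best_score_, 3))
```

Tuning the vectorizer and the classifier together in one pipeline keeps the TF-IDF vocabulary fitted only on each training fold, which avoids leaking information from the validation fold.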

Contributing

Contributions to improve the project are welcome. Please feel free to submit a Pull Request.


Connect with Me: All my socials

Project Link: GitHub Repository URL