SMS Spam Detector

Project Overview
A machine learning web application that classifies SMS messages as spam or not spam. This project demonstrates the full pipeline of a machine learning project, from data preprocessing to model deployment with Streamlit.
Live demo: SMS Spam Detector App
Features
- Text input for SMS messages
- Real-time classification of messages as spam or not spam
- Simple and intuitive user interface
Technologies Used
- Python
- Scikit-learn for the machine learning models
- NLTK for natural language processing
- Streamlit for web application development
- Pickle for serializing the trained model and vectorizer
Project Structure
- app.py: Main Streamlit application file
- model.pkl: Serialized machine learning model
- vectorizer.pkl: Serialized text vectorizer
- requirements.txt: List of Python dependencies
- email_spam_classifier.ipynb: Notebook with exploratory data analysis, model training, and evaluation
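For orientation, here is a minimal sketch of how app.py could load the pickled vectorizer and model and classify a message in Streamlit; the actual application may differ, and the label encoding assumed here (1 = spam) is illustrative.

```python
# Minimal sketch of a Streamlit spam-classification app (assumed structure, not the exact app.py).
import pickle
import streamlit as st

# Load the serialized TF-IDF vectorizer and trained model
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("SMS Spam Detector")
message = st.text_area("Enter an SMS message")

if st.button("Predict"):
    # The real app would first apply the same text preprocessing used during training
    features = vectorizer.transform([message])
    prediction = model.predict(features)[0]   # assumed encoding: 1 = spam, 0 = ham
    st.header("Spam" if prediction == 1 else "Not Spam")
```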
How to Run Locally
- Clone this repository
- Install dependencies: pip install -r requirements.txt
- Run the Streamlit app: streamlit run app.py
Machine Learning Pipeline
- Data Preprocessing (a code sketch follows this list):
  - Removed special characters and numbers
  - Converted text to lowercase
  - Tokenized messages
  - Removed stop words
  - Applied stemming to reduce words to their root form
- Exploratory Data Analysis (an EDA sketch follows this list):
  - Cleaned the data: removed unnecessary columns and duplicates
  - Graphs and heatmaps: explored trends and correlations between features
  - Word clouds: visualized the most common words in spam and ham messages
  - Found class imbalance: more ham than spam messages
  - Key insight: spam messages are typically longer than ham messages
- Feature Engineering:
  - Used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
  - Experimented with different max_features settings for TF-IDF
- Model Selection:
  - Tested models:
    - SVC
    - KNeighbors
    - MultinomialNB
    - DecisionTree
    - LogisticRegression
    - RandomForest
    - AdaBoost
    - BaggingClassifier
    - ExtraTrees
    - GradientBoosting
    - XGBoost
  - Also experimented with ensemble methods:
    - VotingClassifier
    - StackingClassifier
  - Final model chosen: MultinomialNB
- Model Evaluation:
  - Used accuracy and precision as the key metrics
  - Compared performance across models and configurations
  - Evaluated on an 80/20 train-test split (an end-to-end training sketch follows this list)
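For reference, a minimal sketch of the preprocessing steps above using NLTK; the exact implementation in the notebook may differ.

```python
# Sketch of the preprocessing steps: lowercase, strip non-letters, tokenize, remove stop words, stem.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by word_tokenize on newer NLTK versions

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    text = text.lower()                                   # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop special characters and numbers
    tokens = nltk.word_tokenize(text)                     # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    tokens = [stemmer.stem(t) for t in tokens]            # stem to root form
    return " ".join(tokens)

print(preprocess("WINNER!! You have won a free prize, call now to claim."))
```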
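A sketch of the exploratory analysis described above (class balance, message length by class, and a spam word cloud); the dataset path, column names, and label values are illustrative assumptions.

```python
# EDA sketch: class imbalance, message length comparison, and a word cloud of spam messages.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("spam.csv")                      # assumed columns: 'label' ('spam'/'ham') and 'text'
df = df.drop_duplicates()                         # duplicates were removed during cleaning

print(df["label"].value_counts())                 # shows the ham/spam class imbalance

df["num_chars"] = df["text"].apply(len)           # message length in characters
print(df.groupby("label")["num_chars"].mean())    # spam tends to be longer than ham

# Word cloud of the most common words in spam messages
spam_text = " ".join(df[df["label"] == "spam"]["text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(spam_text)
plt.imshow(wc)
plt.axis("off")
plt.show()
```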
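Finally, a sketch of how the TF-IDF features, model comparison, and 80/20 evaluation could be wired together in scikit-learn; the max_features value, random seed, and dataset columns are assumptions, and preprocess() refers to the preprocessing sketch above.

```python
# Sketch of vectorization, model comparison, and evaluation (illustrative settings, not the notebook's exact code).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

df = pd.read_csv("spam.csv")                        # assumed columns: 'label' ('spam'/'ham') and 'text'
df["clean_text"] = df["text"].apply(preprocess)     # preprocess() from the sketch above

# TF-IDF features; max_features was one of the settings experimented with
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df["clean_text"])
y = (df["label"] == "spam").astype(int)             # encode spam as 1, ham as 0

# 80/20 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Compare a few of the candidate models on accuracy and precision
models = {
    "SVC": SVC(),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"precision={precision_score(y_test, y_pred):.3f}")
```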
Future Improvements
- Hyperparameter tuning: systematically tune hyperparameters for the top-performing models
- Advanced NLP techniques: incorporate word embeddings or transformer-based models for potentially improved performance
- Multilingual support: extend classification to messages in other languages
Contributing
Contributions to improve the project are welcome. Please feel free to submit a Pull Request.
Connect with Me: All my socials
Project Link: GitHub Repository URL