Project: Spam/Ham Email Classification using Logistic Regression
This project built and iteratively improved a binary classifier for email spam detection. Working with a real-world email dataset, it focused on feature engineering, classification, and cross-validation to distinguish spam emails from ham (non-spam) emails.
1/10/2025 · 1 min read
Part 1: Initial Spam/Ham Classification
Objective:
Created a basic spam detection model to classify emails using logistic regression with word-based features.
Key Activities and Findings:
Data Preprocessing:
Cleaned a dataset of 8,348 labeled emails and handled missing data.
Converted email content to lowercase for consistency.
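The two preprocessing steps above can be sketched with pandas. The frame below is a toy stand-in for the real 8,348-email dataset; the column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the labeled email dataset (column names assumed)
emails = pd.DataFrame({
    "email": ["WIN a FREE prize now!!!", None, "Meeting at noon tomorrow"],
    "spam": [1, 1, 0],
})

# Handle missing data: drop rows whose email body is absent
emails = emails.dropna(subset=["email"]).reset_index(drop=True)

# Lowercase the text so word matching is case-insensitive
emails["email"] = emails["email"].str.lower()
```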
Feature Engineering:
Developed binary features indicating the presence of specific words (e.g., "drug," "bank").
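A minimal sketch of such binary word-presence features: the helper name `words_in_texts` and the toy texts are assumptions, but the idea matches the description above (entry (i, j) is 1 if word j appears in email i).

```python
import numpy as np
import pandas as pd

def words_in_texts(words, texts):
    """Binary feature matrix: entry (i, j) is 1 if words[j] occurs in texts[i]."""
    return np.array(
        [texts.str.contains(w, regex=False).astype(int) for w in words]
    ).T

texts = pd.Series(["cheap drugs from your bank", "lunch tomorrow?"])
X = words_in_texts(["drug", "bank"], texts)
```

Each row of `X` becomes one training example for the logistic regression model.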
Exploratory Data Analysis (EDA):
Visualized word frequencies in spam and ham emails to identify distinguishing patterns.
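One way to surface such patterns, sketched on toy data: compute the fraction of emails in each class that contain a candidate word, then plot the two fractions side by side (e.g. with a seaborn barplot). The data below is invented for illustration.

```python
import pandas as pd

# Invented examples: two spam and two ham emails
df = pd.DataFrame({
    "email": ["win free money now", "claim your free prize",
              "see you at noon", "notes from the meeting"],
    "spam": [1, 1, 0, 0],
})

# Fraction of emails in each class containing the word "free";
# a large gap between the classes marks a useful feature word
has_free = df["email"].str.contains("free").astype(int)
props = has_free.groupby(df["spam"]).mean()
```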
Model Development and Evaluation:
Trained a logistic regression model, achieving a training accuracy of ~76%.
Evaluated performance using metrics like precision, recall, and false positive rate.
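The training and evaluation loop can be sketched as follows. The features and labels here are synthetic stand-ins (random binary word indicators), so the numbers will not reproduce the ~76% figure above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic binary word-presence features (stand-in for the real matrix)
X = rng.integers(0, 2, size=(200, 5))
# Label depends on the first two features plus noise
y = ((X[:, 0] + X[:, 1] + rng.random(200)) > 1.5).astype(int)

model = LogisticRegression()
model.fit(X, y)
pred = model.predict(X)

train_acc = model.score(X, y)
precision = precision_score(y, pred)
recall = recall_score(y, pred)
# False positive rate: ham emails incorrectly flagged as spam
fpr = ((pred == 1) & (y == 0)).sum() / (y == 0).sum()
```

The false positive rate matters most in spam filtering, since losing a legitimate email is costlier than letting one spam message through.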
Technologies Used:
Languages & Libraries: Python (pandas, numpy, scikit-learn)
Visualization: Matplotlib, Seaborn
Part 2: Improve Spam/Ham Classification
Objective:
Enhanced the spam classifier by incorporating advanced features and refining the logistic regression model.
Key Activities and Results:
Feature Engineering:
Added novel features, including word counts, email length, punctuation frequency, and the number of links.
Selected key words based on their ability to distinguish between spam and ham emails.
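The new features listed above can be derived with pandas string methods. This is a minimal sketch on invented emails; the exact feature definitions in the project may differ.

```python
import pandas as pd

emails = pd.Series([
    "Click http://spam.example to WIN!!! http://x.example",
    "see you at lunch",
])

# One row per email: length, word count, punctuation, and link counts
features = pd.DataFrame({
    "length": emails.str.len(),            # email length in characters
    "num_words": emails.str.split().str.len(),
    "num_exclaim": emails.str.count("!"),  # punctuation frequency
    "num_links": emails.str.count("http"), # crude link count
})
```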
Model Optimization:
Used cross-validation and hyperparameter tuning to improve model accuracy.
Achieved a training accuracy of ~89% and test accuracy exceeding 85%.
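The tuning setup can be sketched with scikit-learn's GridSearchCV over the regularization strength C, using the L1 penalty mentioned in the tools list below. The data here is synthetic, so the accuracies will not match the reported figures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic features with a linear decision boundary
X = rng.random((300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.75).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validated search over C; liblinear supports the l1 penalty
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_tr, y_tr)
test_acc = grid.score(X_te, y_te)
```

The L1 penalty drives weak feature weights to zero, which keeps the tuned model interpretable.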
Model Evaluation:
Generated and analyzed ROC curves to understand the trade-off between false positives and false negatives.
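A minimal sketch of generating those curves with scikit-learn, again on synthetic data: `roc_curve` sweeps the decision threshold over the predicted spam probabilities and returns the false/true positive rates to plot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

model = LogisticRegression().fit(X, y)
# Predicted probability of the positive (spam) class
scores = model.predict_proba(X)[:, 1]

# FPR/TPR pairs for every decision threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the ROC curve; moving the classification threshold trades false positives against false negatives along it.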
Challenges and Insights:
Discovered that simple features like email length significantly boosted model accuracy.
Analyzed ambiguous cases to understand the impact of subjective labels on performance.
Technologies Used:
Languages & Libraries: Python (scikit-learn, pandas)
Tools: GridSearchCV for cross-validated hyperparameter search, LogisticRegression with L1 regularization
Results:
Demonstrated a robust approach to feature engineering and model evaluation, achieving high accuracy while maintaining interpretability.