Project: Spam/Ham Email Classification using Logistic Regression

This project built and iteratively improved a binary classifier for email spam detection. Working with a real-world labeled email dataset, it focused on feature engineering, logistic regression classification, and cross-validation to distinguish spam emails from ham (non-spam) emails.

1/10/2025 · 1 min read

Part 1: Initial Spam/Ham Classification

Objective:
Build a basic spam detection model that classifies emails using logistic regression with word-based features.

Key Activities and Findings:

  1. Data Preprocessing:

    • Cleaned a dataset of 8,348 labeled emails and handled missing data.

    • Converted email content to lowercase for consistency.

  2. Feature Engineering:

    • Developed binary features indicating the presence of specific words (e.g., "drug," "bank").

  3. Exploratory Data Analysis (EDA):

    • Visualized word frequencies in spam and ham emails to identify distinguishing patterns.

  4. Model Development and Evaluation:

    • Trained a logistic regression model, achieving a training accuracy of ~76%.

    • Evaluated performance using metrics like precision, recall, and false positive rate.
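The Part 1 workflow above can be sketched end to end as follows. This is a minimal illustration on a toy four-email table, not the project's actual code: the `words_in_texts` helper, the word list, and the tiny dataset are all stand-ins (the real project used 8,348 labeled emails).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Toy stand-in for the labeled email dataset.
emails = pd.DataFrame({
    "text": [
        "Cheap DRUG offers from your bank!",
        "Meeting notes attached, see you Monday",
        "Claim your bank prize now",
        "Lunch tomorrow?",
    ],
    "spam": [1, 0, 1, 0],
})

# Preprocessing: fill missing text and lowercase for consistency.
emails["text"] = emails["text"].fillna("").str.lower()

def words_in_texts(words, texts):
    """Binary feature matrix: 1 if each text contains each word, else 0."""
    return np.array(
        [texts.str.contains(w, regex=False).astype(int) for w in words]
    ).T

X = words_in_texts(["drug", "bank", "prize"], emails["text"])
y = emails["spam"]

model = LogisticRegression()
model.fit(X, y)
pred = model.predict(X)

print("train accuracy:", model.score(X, y))
print("precision:", precision_score(y, pred))
print("recall:", recall_score(y, pred))

# False positive rate: fraction of ham emails misflagged as spam.
fp = ((pred == 1) & (y == 0)).sum()
tn = ((pred == 0) & (y == 0)).sum()
print("false positive rate:", fp / (fp + tn))
```

The binary word-presence encoding keeps the model interpretable: each learned coefficient says how strongly a word's presence pushes an email toward the spam class.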

Technologies Used:

  • Languages & Libraries: Python (pandas, numpy, scikit-learn)

  • Visualization: Matplotlib, Seaborn
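The EDA step above, comparing word frequencies across the two classes, can be sketched like this. The toy texts and the `word_freq` helper are illustrative assumptions; in the project, tables of this shape fed Matplotlib/Seaborn bar charts of per-class word proportions.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the (already lowercased) email dataset.
emails = pd.DataFrame({
    "text": [
        "cheap drug offers from your bank",
        "meeting notes attached see you monday",
        "claim your bank prize now",
        "lunch tomorrow",
    ],
    "spam": [1, 0, 1, 0],
})

def word_freq(texts):
    """Count word occurrences across a collection of texts."""
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    return counts

spam_freq = word_freq(emails.loc[emails["spam"] == 1, "text"])
ham_freq = word_freq(emails.loc[emails["spam"] == 0, "text"])

# Words far more frequent in spam than ham are good feature candidates.
candidates = {w: spam_freq[w] - ham_freq.get(w, 0) for w in spam_freq}
print(sorted(candidates.items(), key=lambda kv: -kv[1])[:5])
```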

Part 2: Improved Spam/Ham Classification

Objective:
Improve the spam classifier by incorporating richer engineered features and refining the logistic regression model.

Key Activities and Results:

  1. Feature Engineering:

    • Added novel features, including word counts, email length, punctuation frequency, and the number of links.

    • Selected key words based on their ability to distinguish between spam and ham emails.

  2. Model Optimization:

    • Used cross-validation and hyperparameter tuning to improve model accuracy.

    • Achieved a training accuracy of ~89% and test accuracy exceeding 85%.

  3. Model Evaluation:

    • Generated and analyzed ROC curves to understand the trade-off between the true positive rate and the false positive rate across classification thresholds.

  4. Challenges and Insights:

    • Discovered that simple features like email length significantly boosted model accuracy.

    • Analyzed ambiguous cases to understand the impact of subjective labels on performance.
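The Part 2 steps above can be sketched as follows. The features here (`text_len`, `exclam`, `links`) are generated synthetically rather than extracted from real email text, and the parameter grid is an assumption; the point is the shape of the pipeline: engineered numeric features, L1-regularized logistic regression tuned with `GridSearchCV`, and an ROC analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered feature table
# (the real project computed these from 8,348 emails).
n = 200
spam = rng.integers(0, 2, n)
emails = pd.DataFrame({
    "text_len": rng.normal(200 + 300 * spam, 80),  # email length
    "exclam": rng.poisson(1 + 4 * spam),           # punctuation frequency
    "links": rng.poisson(0.2 + 2 * spam),          # number of links
    "spam": spam,
})

X = emails[["text_len", "exclam", "links"]]
y = emails["spam"]

# L1-regularized logistic regression; tune C via 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", grid.best_score_)

# ROC curve: sweep the decision threshold and trace FPR vs. TPR.
scores = grid.predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, scores)
print("AUC:", auc(fpr, tpr))
```

The L1 penalty drives uninformative feature coefficients to exactly zero, which doubles as a built-in feature selector and keeps the final model interpretable.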

Technologies Used:

  • Languages & Libraries: Python (scikit-learn, pandas)

  • Tools: GridSearchCV for cross-validated hyperparameter tuning, LogisticRegression with L1 regularization

Results:
Demonstrated a robust approach to feature engineering and model evaluation, achieving high accuracy while maintaining interpretability.