Project: Spam/Ham Email Classification using Logistic Regression

This project built and iteratively improved a binary classifier for email spam detection. Working with a real-world labeled email dataset, it focused on feature engineering, logistic regression classification, and cross-validation to distinguish spam emails from ham (non-spam) emails.

1/10/2025 · 1 min read

Part 1: Initial Spam/Ham Classification

Objective:
Build a basic spam detection model that classifies emails using logistic regression with word-based features.

Key Activities and Findings:

  1. Data Preprocessing:

    • Cleaned a dataset of 8,348 labeled emails and handled missing data.

    • Converted email content to lowercase for consistency.

  2. Feature Engineering:

    • Developed binary features indicating the presence of specific words (e.g., "drug," "bank").

  3. Exploratory Data Analysis (EDA):

    • Visualized word frequencies in spam and ham emails to identify distinguishing patterns.

  4. Model Development and Evaluation:

    • Trained a logistic regression model, achieving a training accuracy of ~76%.

    • Evaluated performance using metrics like precision, recall, and false positive rate.
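The Part 1 workflow above can be sketched end to end as follows. This is a minimal illustration on a toy four-email table, not the project's actual code: the `words_in_texts` helper, the word list, and the tiny dataset are all stand-ins (the real project used 8,348 labeled emails).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Toy stand-in for the labeled email dataset.
emails = pd.DataFrame({
    "text": [
        "Cheap DRUG offers from your bank!",
        "Meeting notes attached, see you Monday",
        "Claim your bank prize now",
        "Lunch tomorrow?",
    ],
    "spam": [1, 0, 1, 0],
})

# Preprocessing: fill missing text and lowercase for consistency.
emails["text"] = emails["text"].fillna("").str.lower()

def words_in_texts(words, texts):
    """Binary feature matrix: 1 if each text contains each word, else 0."""
    return np.array(
        [texts.str.contains(w, regex=False).astype(int) for w in words]
    ).T

X = words_in_texts(["drug", "bank", "prize"], emails["text"])
y = emails["spam"]

model = LogisticRegression()
model.fit(X, y)
pred = model.predict(X)

print("train accuracy:", model.score(X, y))
print("precision:", precision_score(y, pred))
print("recall:", recall_score(y, pred))

# False positive rate: fraction of ham emails misflagged as spam.
fp = ((pred == 1) & (y == 0)).sum()
tn = ((pred == 0) & (y == 0)).sum()
print("false positive rate:", fp / (fp + tn))
```

The binary word-presence encoding keeps the model interpretable: each learned coefficient says how strongly a word's presence pushes an email toward the spam class.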

Technologies Used:

  • Languages & Libraries: Python (pandas, numpy, scikit-learn)

  • Visualization: Matplotlib, Seaborn
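The EDA step above, comparing word frequencies across the two classes, can be sketched like this. The toy texts and the `word_freq` helper are illustrative assumptions; in the project, tables of this shape fed Matplotlib/Seaborn bar charts of per-class word proportions.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the (already lowercased) email dataset.
emails = pd.DataFrame({
    "text": [
        "cheap drug offers from your bank",
        "meeting notes attached see you monday",
        "claim your bank prize now",
        "lunch tomorrow",
    ],
    "spam": [1, 0, 1, 0],
})

def word_freq(texts):
    """Count word occurrences across a collection of texts."""
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    return counts

spam_freq = word_freq(emails.loc[emails["spam"] == 1, "text"])
ham_freq = word_freq(emails.loc[emails["spam"] == 0, "text"])

# Words far more frequent in spam than ham are good feature candidates.
candidates = {w: spam_freq[w] - ham_freq.get(w, 0) for w in spam_freq}
print(sorted(candidates.items(), key=lambda kv: -kv[1])[:5])
```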

Part 2: Improved Spam/Ham Classification

Objective:
Improve the spam classifier by incorporating richer engineered features and refining the logistic regression model.

Key Activities and Results:

  1. Feature Engineering:

    • Added novel features, including word counts, email length, punctuation frequency, and the number of links.

    • Selected key words based on their ability to distinguish between spam and ham emails.

  2. Model Optimization:

    • Used cross-validation and hyperparameter tuning to improve model accuracy.

    • Achieved a training accuracy of ~89% and test accuracy exceeding 85%.

  3. Model Evaluation:

    • Generated and analyzed ROC curves to understand the trade-off between the true positive rate and the false positive rate across classification thresholds.

  4. Challenges and Insights:

    • Discovered that simple features like email length significantly boosted model accuracy.

    • Analyzed ambiguous cases to understand the impact of subjective labels on performance.
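The Part 2 steps above can be sketched as follows. The features here (`text_len`, `exclam`, `links`) are generated synthetically rather than extracted from real email text, and the parameter grid is an assumption; the point is the shape of the pipeline: engineered numeric features, L1-regularized logistic regression tuned with `GridSearchCV`, and an ROC analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered feature table
# (the real project computed these from 8,348 emails).
n = 200
spam = rng.integers(0, 2, n)
emails = pd.DataFrame({
    "text_len": rng.normal(200 + 300 * spam, 80),  # email length
    "exclam": rng.poisson(1 + 4 * spam),           # punctuation frequency
    "links": rng.poisson(0.2 + 2 * spam),          # number of links
    "spam": spam,
})

X = emails[["text_len", "exclam", "links"]]
y = emails["spam"]

# L1-regularized logistic regression; tune C via 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", grid.best_score_)

# ROC curve: sweep the decision threshold and trace FPR vs. TPR.
scores = grid.predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, scores)
print("AUC:", auc(fpr, tpr))
```

The L1 penalty drives uninformative feature coefficients to exactly zero, which doubles as a built-in feature selector and keeps the final model interpretable.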

Technologies Used:

  • Languages & Libraries: Python (scikit-learn, pandas)

  • Tools: GridSearchCV for cross-validated hyperparameter tuning, LogisticRegression with L1 regularization

Results:
Demonstrated a robust approach to feature engineering and model evaluation, achieving high accuracy while maintaining interpretability.