Project: Movie Genre Classification using k-NN

This project aimed to build and evaluate a k-nearest neighbors (k-NN) classifier to predict whether a movie belongs to the comedy or thriller genre, based on the frequency of specific words in its screenplay.

1/10/2025 (1 min read)

Objective:
Classify movies as either comedy or thriller using a k-NN algorithm based on word frequencies in their screenplays.

Key Activities and Findings:

  1. Dataset Exploration and Preprocessing:

    • Analyzed a dataset containing 333 movies and the frequency of 5,000 common words per movie.

    • Preprocessed the data by converting words to lowercase and transforming word counts into proportions.

    • Represented each screenplay numerically with a bag-of-words model: a vector of word frequencies that ignores word order.
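
The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `word_proportions`, the whitespace tokenization, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def word_proportions(text, vocabulary):
    """Return, for each vocabulary word, its share of all tokens in `text`.

    Hypothetical helper: lowercases the text, splits on whitespace,
    and divides each word count by the total number of tokens.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # An empty screenplay has no tokens, so every proportion is zero.
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


# Made-up sample text; a real screenplay would be far longer.
example = word_proportions("Run run RUN he said", ["run", "said", "laugh"])
print(example)
```

Using proportions rather than raw counts keeps long and short screenplays comparable, since a long script would otherwise have larger counts for almost every word.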

  2. Feature Engineering and Selection:

    • Selected a small set of words as features, favoring words whose frequencies differ most between comedies and thrillers.

    • Used word stemming to reduce variants of a word to a common root form for consistency.

  3. Building and Evaluating the Classifier:

    • Trained a k-NN classifier using Euclidean distance to measure the similarity between movies.

    • Iteratively optimized feature sets and the number of neighbors (k) to improve classification accuracy.
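
As a rough sketch of how such a classifier works, the snippet below implements k-NN with Euclidean distance in plain numpy and compares a few values of k. It is an illustration under stated assumptions, not the project's implementation: the function names (`knn_classify`, `accuracy`) and the tiny two-feature dataset are invented for the example.

```python
from collections import Counter

import numpy as np


def euclidean_distances(train_features, point):
    """Distance from `point` to every row of `train_features`."""
    return np.sqrt(np.sum((train_features - point) ** 2, axis=1))


def knn_classify(train_features, train_labels, point, k):
    """Predict the majority label among the k nearest training rows."""
    distances = euclidean_distances(train_features, point)
    nearest = np.argsort(distances)[:k]
    votes = Counter(str(train_labels[i]) for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_features, train_labels, test_features, test_labels, k):
    """Fraction of test rows whose predicted label matches the true label."""
    predictions = [
        knn_classify(train_features, train_labels, p, k) for p in test_features
    ]
    return float(np.mean(np.array(predictions) == np.array(test_labels)))


# Made-up data: each row holds the proportions of two hypothetical feature words.
train_features = np.array(
    [[0.02, 0.00], [0.03, 0.01], [0.00, 0.04], [0.01, 0.05]]
)
train_labels = np.array(["comedy", "comedy", "thriller", "thriller"])
test_features = np.array([[0.022, 0.004], [0.004, 0.045]])
test_labels = np.array(["comedy", "thriller"])

# Odd values of k avoid tied votes when there are only two classes.
for k in (1, 3):
    score = accuracy(train_features, train_labels, test_features, test_labels, k)
    print(f"k={k}: accuracy={score:.2f}")
```

In practice, the comparison across values of k should use a held-out validation set rather than the final test set, so that the reported test accuracy is not inflated by the tuning itself.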

  4. Performance and Insights:

    • Achieved an accuracy of 76% on the test dataset using a carefully chosen feature set of 10 words.

    • Identified common patterns in misclassifications, such as blended genres (e.g., comedy with thriller elements).

    • Built a leaner classifier using a feature set of only 5 words, which reached 66% accuracy at a lower computational cost.

  5. Explorations:

    • Extended the classifier to predict genres for a new dataset of movies curated by peers.

    • Highlighted the trade-offs between accuracy and computational efficiency in classifier design.

Technologies Used:

  • Languages & Libraries: Python (datascience, numpy, matplotlib)

  • Algorithm: k-nearest neighbors (k-NN)

  • Tools: Data visualization (Matplotlib), feature engineering