Project: Movie Genre Classification using k-NN
This project aimed to build and evaluate a k-nearest neighbors (k-NN) classifier to predict whether a movie belongs to the comedy or thriller genre, based on the frequency of specific words in its screenplay.
1/10/2025 · 1 min read
File (Google Drive):
PDF version:
Objective:
Classify movies as either comedy or thriller using a k-NN algorithm based on word frequencies in their screenplays.
Key Activities and Findings:
Dataset Exploration and Preprocessing:
Analyzed a dataset containing 333 movies and the frequency of 5,000 common words per movie.
Preprocessed the data by converting words to lowercase and transforming word counts into proportions, so that screenplays of different lengths are comparable.
Used the bag-of-words model to represent each screenplay numerically.
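The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's own code: the function name `word_proportions`, the regular-expression tokenizer, and the sample sentence are all assumptions, and the sketch divides by the total number of tokens in the text, which may differ from how the original dataset was normalized.

```python
import re
from collections import Counter

def word_proportions(text, vocabulary):
    """Return, for each vocabulary word, its share of all tokens in `text`."""
    # Lowercase first so "Laugh" and "laugh" count as the same word.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # Dividing by the total token count turns raw counts into proportions.
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}

# Hypothetical one-line "screenplay" used only to exercise the function.
sample = "Laugh, laugh again. The gun is on the table."
features = word_proportions(sample, ["laugh", "gun", "the"])
```

Word order is discarded here, which is what makes this a bag-of-words representation: only how often each word occurs is kept.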
Feature Engineering and Selection:
Selected as features the words that best discriminate between the two genres.
Used word stemming to reduce word variants to their root forms, so that related forms count as a single feature.
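One possible way to rank words by discriminative power is to compare their average proportions in each genre. The write-up does not say which criterion was used, so the gap-in-means score below, the function name `rank_words_by_genre_gap`, and the toy numbers are assumptions made purely for illustration.

```python
from statistics import mean

def rank_words_by_genre_gap(rows, labels, words):
    """Sort words by the gap between their mean proportions in two genres."""
    genres = sorted(set(labels))
    if len(genres) != 2:
        raise ValueError("this sketch expects exactly two genres")
    gaps = {}
    for w in words:
        # Average proportion of word w within each genre.
        per_genre = [
            mean(row[w] for row, lab in zip(rows, labels) if lab == g)
            for g in genres
        ]
        gaps[w] = abs(per_genre[0] - per_genre[1])
    # Largest gap first: these words separate the genres most clearly.
    return sorted(words, key=lambda w: gaps[w], reverse=True)

# Made-up word proportions for four hypothetical movies.
rows = [
    {"laugh": 0.030, "gun": 0.001, "the": 0.05},
    {"laugh": 0.020, "gun": 0.002, "the": 0.05},
    {"laugh": 0.002, "gun": 0.020, "the": 0.05},
    {"laugh": 0.001, "gun": 0.020, "the": 0.05},
]
labels = ["comedy", "comedy", "thriller", "thriller"]
ranked = rank_words_by_genre_gap(rows, labels, ["the", "gun", "laugh"])
```

A word such as "the" appears about equally often in both genres, so it lands at the bottom of the ranking and would be a poor feature.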
Building and Evaluating the Classifier:
Trained a k-NN classifier using Euclidean distance to measure the similarity between movies.
Iteratively optimized feature sets and the number of neighbors (k) to improve classification accuracy.
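The classification step can be sketched as below: compute the Euclidean distance from a new movie to every training movie, keep the k closest, and take a majority vote. The sketch uses plain Python for self-containment rather than the datascience and numpy libraries listed for the project, and the names `euclidean` and `knn_predict` as well as the two-feature toy data are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # Pair each training vector with its label and sort by distance to the query.
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up (laugh, gun) proportions for six hypothetical movies.
train_X = [
    (0.030, 0.001), (0.025, 0.002), (0.028, 0.003),
    (0.002, 0.020), (0.001, 0.025), (0.003, 0.022),
]
train_y = ["comedy"] * 3 + ["thriller"] * 3

funny = knn_predict(train_X, train_y, (0.027, 0.002), k=3)
tense = knn_predict(train_X, train_y, (0.002, 0.024), k=3)
```

With two classes, an odd k guarantees the vote cannot tie.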
Performance and Insights:
Achieved an accuracy of 76% on the test dataset using a carefully chosen feature set of 10 words.
Identified common patterns in misclassifications, such as blended genres (e.g., comedy with thriller elements).
Built a smaller classifier restricted to 5 words, which reached 66% accuracy at lower computational cost.
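Accuracy here means the fraction of test movies whose predicted genre matches the true genre. A rough sketch of measuring it for several values of k follows. The data are invented and do not reproduce the 76% or 66% figures; `knn_predict` and `accuracy` are illustrative names, and one test movie is deliberately a "blended" case that sits closer to the other genre.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; return the majority label."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test examples whose predicted label equals the true label."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)

# Made-up (laugh, gun) proportions for hypothetical movies.
train = [
    ((0.030, 0.001), "comedy"), ((0.025, 0.002), "comedy"),
    ((0.028, 0.003), "comedy"), ((0.002, 0.020), "thriller"),
    ((0.001, 0.025), "thriller"), ((0.003, 0.022), "thriller"),
]
test = [
    ((0.027, 0.002), "comedy"),
    ((0.002, 0.024), "thriller"),
    ((0.012, 0.014), "comedy"),  # comedy with thriller-like word usage
    ((0.004, 0.019), "thriller"),
]

# Compare a few odd values of k on the same train/test split.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
```

Each prediction computes one distance per training movie per feature, so cutting the feature set from 10 words to 5 roughly halves the distance work, which is the efficiency side of the trade-off.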
Explorations:
Extended the classifier to predict genres for a new dataset of movies curated by peers.
Highlighted the trade-offs between accuracy and computational efficiency in classifier design.
Technologies Used:
Languages & Libraries: Python (datascience, numpy, matplotlib)
Algorithm: k-nearest neighbors (k-NN)
Tools: Data visualization (Matplotlib), feature engineering