Project: Movie Genre Classification using k-NN

This project aimed to build and evaluate a k-nearest neighbors (k-NN) classifier to predict whether a movie belongs to the comedy or thriller genre, based on the frequency of specific words in its screenplay.

1/10/2025, 1 min read

Objective:
Classify movies as either comedy or thriller using a k-NN algorithm based on word frequencies in their screenplays.

Key Activities and Findings:

Dataset Exploration and Preprocessing:
- Analyzed a dataset containing 333 movies and the frequency of 5,000 common words per movie.
- Preprocessed the data by converting words to lowercase and transforming word counts into proportions.
- Used the bag-of-words model to represent each screenplay numerically, as a vector of word proportions that ignores word order.
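
The preprocessing steps above (lowercasing, counting words, converting counts to proportions) can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `word_proportions`, the tokenizing regular expression, and the sample sentence are assumptions made for illustration and are not taken from the project code or its dataset.

```python
import re
from collections import Counter

def word_proportions(text):
    """Map each lowercase word in text to its share of all words in text."""
    # Lowercase first so "Run" and "run" are counted as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Dividing by the total makes long and short screenplays comparable.
    return {word: count / total for word, count in counts.items()}

# Hypothetical snippet standing in for a full screenplay.
sample = "Run! Run, run away. The killer laughs."
props = word_proportions(sample)
print(props["run"])  # share of the 7 words that are "run"
```

Working with proportions rather than raw counts keeps a long screenplay from looking "far" from a short one simply because it contains more words.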
Feature Engineering and Selection:
- Selected key words as features, focusing on their discriminative power between genres.
- Utilized word stemming to reduce variations of words to their root forms for consistency.
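
One simple way to gauge a word's discriminative power is to compare its average proportion in the two genres. The sketch below ranks candidate words by that gap. The scoring rule, the function name `rank_features`, and the toy rows are assumptions chosen for illustration; the project may have selected its features differently.

```python
def rank_features(rows, labels, words):
    """Rank words by the absolute gap between their mean proportions per genre."""
    def mean_for(genre, word):
        values = [row[word] for row, label in zip(rows, labels) if label == genre]
        return sum(values) / len(values)

    gaps = {
        word: abs(mean_for("comedy", word) - mean_for("thriller", word))
        for word in words
    }
    # Largest gap first: these words separate the genres most clearly.
    return sorted(words, key=lambda word: gaps[word], reverse=True)

# Toy word proportions for four hypothetical movies.
rows = [
    {"laugh": 0.010, "kill": 0.001, "the": 0.050},
    {"laugh": 0.008, "kill": 0.002, "the": 0.048},
    {"laugh": 0.001, "kill": 0.009, "the": 0.051},
    {"laugh": 0.002, "kill": 0.011, "the": 0.049},
]
labels = ["comedy", "comedy", "thriller", "thriller"]
ranked = rank_features(rows, labels, ["the", "laugh", "kill"])
print(ranked)
```

In this toy data a very common word such as "the" appears at nearly the same rate in both genres, so it ranks last and would make a poor feature.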
Building and Evaluating the Classifier:
- Trained a k-NN classifier using Euclidean distance to measure the similarity between movies.
- Iteratively optimized feature sets and the number of neighbors (k) to improve classification accuracy.
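
The classification step can be sketched as follows: compute the Euclidean distance from a new movie to every training movie, keep the k closest, and take a majority vote. This is a minimal standard-library illustration, not the project's implementation; the function name `classify` and the two-feature toy data are assumptions.

```python
import math
from collections import Counter

def classify(train_points, train_labels, query, k):
    """Predict the label of query by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Each point is (proportion of "laugh", proportion of "kill") for a toy movie.
train_points = [
    (0.010, 0.001), (0.008, 0.002), (0.007, 0.003),
    (0.001, 0.009), (0.002, 0.011),
]
train_labels = ["comedy", "comedy", "comedy", "thriller", "thriller"]

print(classify(train_points, train_labels, (0.009, 0.002), k=3))
print(classify(train_points, train_labels, (0.001, 0.010), k=3))
```

With two genres, an odd k avoids tied votes, which is one reason to try values such as 1, 3, 5 when tuning the number of neighbors.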
Performance and Insights:
- Achieved an accuracy of 76% on the test dataset using a carefully chosen feature set of 10 words.
- Identified common patterns in misclassifications, such as blended genres (e.g., comedy with thriller elements).
- Built a lighter classifier restricted to 5 words, which reached 66% accuracy while reducing the cost of each distance computation.
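
Accuracy here means the proportion of test movies whose predicted genre matches the true genre. The sketch below shows how that figure could be computed for different feature sets, so that a larger and a smaller set can be compared. All names (`knn_predict`, `accuracy`) and the toy rows are assumptions for illustration, and the numbers it produces say nothing about the project's reported results.

```python
import math
from collections import Counter

def knn_predict(train, query, k, features):
    """Majority vote among the k training rows closest to query on the features."""
    def distance(row):
        return math.sqrt(sum((row[f] - query[f]) ** 2 for f in features))

    nearest = sorted(train, key=distance)[:k]
    return Counter(row["genre"] for row in nearest).most_common(1)[0][0]

def accuracy(train, test, k, features):
    """Proportion of test rows whose predicted genre matches the true genre."""
    correct = sum(
        knn_predict(train, row, k, features) == row["genre"] for row in test
    )
    return correct / len(test)

# Toy data: word proportions plus the known genre of each hypothetical movie.
train = [
    {"laugh": 0.010, "kill": 0.001, "genre": "comedy"},
    {"laugh": 0.008, "kill": 0.002, "genre": "comedy"},
    {"laugh": 0.009, "kill": 0.003, "genre": "comedy"},
    {"laugh": 0.001, "kill": 0.009, "genre": "thriller"},
    {"laugh": 0.002, "kill": 0.011, "genre": "thriller"},
    {"laugh": 0.003, "kill": 0.010, "genre": "thriller"},
]
test = [
    {"laugh": 0.009, "kill": 0.002, "genre": "comedy"},
    {"laugh": 0.002, "kill": 0.010, "genre": "thriller"},
]

full = accuracy(train, test, k=3, features=["laugh", "kill"])
small = accuracy(train, test, k=3, features=["laugh"])
print(full, small)
```

Each distance computation costs one subtraction and one multiplication per feature, so halving the feature set roughly halves the work per comparison. On real data that saving usually comes at some cost in accuracy.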
Explorations:
- Extended the classifier to predict genres for a new dataset of movies curated by peers.
- Highlighted the trade-offs between accuracy and computational efficiency in classifier design.

Technologies Used:

Languages & Libraries: Python (datascience, numpy, matplotlib)
Algorithm: k-nearest neighbors (k-NN)
Techniques: data visualization (Matplotlib), feature engineering
