This project performs sentiment analysis on the IMDb movie reviews dataset using Python and machine learning techniques. The main script, sentiment_analysis_imdb.py, loads, preprocesses, and classifies movie reviews as positive or negative using Logistic Regression.
- Loads IMDb dataset from local folders (positive and negative reviews)
- Text preprocessing: lowercasing, stopword removal, stemming
- Feature extraction using TF-IDF vectorization
- Label encoding for sentiment classes
- Model training and evaluation using Logistic Regression
- Plots and saves confusion matrix and precision/recall/F1-score bar charts
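
The steps listed above can be sketched end to end on a few toy sentences. This is a minimal illustration, not the contents of sentiment_analysis_imdb.py. The `preprocess` helper, the sample reviews, and `max_iter=1000` are assumptions made for the example. It also uses scikit-learn's built-in English stop-word list so that no NLTK corpus download is needed, whereas the script may use NLTK's list.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase the text, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)


# Toy stand-ins for IMDb reviews (made up for illustration, not dataset rows).
reviews = [
    "An absolutely wonderful film with brilliant acting",
    "A moving story and a wonderful cast",
    "Brilliant direction, I loved every minute",
    "A boring, terrible movie with awful dialogue",
    "Terrible pacing and a boring plot",
    "Awful acting, I hated every minute",
]
sentiments = ["positive", "positive", "positive", "negative", "negative", "negative"]

cleaned = [preprocess(r) for r in reviews]

# Label encoding: classes are sorted, so "negative" maps to 0 and "positive" to 1.
encoder = LabelEncoder()
y = encoder.fit_transform(sentiments)

# TF-IDF features from the cleaned text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# Logistic Regression classifier; max_iter is raised only to avoid convergence warnings.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Classify an unseen review with the same preprocessing and vectorizer.
unseen = vectorizer.transform([preprocess("A wonderful and moving film")])
predicted = encoder.inverse_transform(model.predict(unseen))[0]
print(predicted)
```

New text has to pass through the same `preprocess` function and the already fitted vectorizer. Refitting the vectorizer on new data would change the feature space the model was trained on.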

Requirements:
- Python 3.7+
- pandas
- numpy
- scikit-learn
- nltk
- matplotlib
- seaborn
Install dependencies with:
pip install pandas numpy scikit-learn nltk matplotlib seaborn

Download the IMDb Large Movie Review Dataset and extract it. Update the script if your dataset path differs from ./aclImdb/train and ./aclImdb/test.
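
As a rough sketch of what loading from those folders might look like, the helper below reads every review file under the pos and neg subfolders of one split, which is the layout the extracted dataset uses. The function name `load_reviews` and the returned label strings are hypothetical, and the script's own loader may differ.

```python
from pathlib import Path


def load_reviews(split_dir):
    """Read every .txt review under <split_dir>/pos and <split_dir>/neg."""
    texts, labels = [], []
    for label, folder in (("positive", "pos"), ("negative", "neg")):
        # sorted() keeps the order reproducible across runs and platforms.
        for path in sorted((Path(split_dir) / folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


if __name__ == "__main__":
    # Default path taken from the text above; adjust it if yours differs.
    train_dir = Path("./aclImdb/train")
    if train_dir.is_dir():
        train_texts, train_labels = load_reviews(train_dir)
        print(f"Loaded {len(train_texts)} training reviews")
    else:
        print(f"Dataset not found at {train_dir}; check the extraction path")
```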
- Ensure the dataset is extracted and the directory structure matches the script's expectations.
- Run the script:
python sentiment_analysis_imdb.py
- The script will output model metrics and save plots:
  - confusion_matrix_lr.png: Confusion matrix for Logistic Regression
  - precision_recall_f1_lr.png: Precision, Recall, and F1-Score bar chart
- Console output: Accuracy, F1 Score, classification report
- Images: Confusion matrix and metrics bar chart saved in the project directory
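
The outputs described above could be produced along the following lines. This is a sketch under assumptions rather than the script's actual code: the function name `report_and_plot` and its parameters are hypothetical, labels are assumed to be encoded as 0 (negative) and 1 (positive), and the plots are drawn with plain matplotlib, whereas the script may use seaborn. Only the two output file names are taken from the list above.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: write files without opening a window
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)


def report_and_plot(y_true, y_pred, class_names, out_dir="."):
    """Print the metrics, save both plots in out_dir, and return the accuracy."""
    out_dir = Path(out_dir)
    acc = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {acc:.4f}")
    print(f"F1 Score: {f1_score(y_true, y_pred):.4f}")  # F1 of the class encoded as 1
    print(classification_report(y_true, y_pred, target_names=class_names))

    # Confusion matrix: rows are actual classes, columns are predicted classes.
    cm = confusion_matrix(y_true, y_pred)
    ticks = list(range(len(class_names)))
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks(ticks)
    ax.set_xticklabels(class_names)
    ax.set_yticks(ticks)
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    fig.savefig(out_dir / "confusion_matrix_lr.png")
    plt.close(fig)

    # Per-class precision, recall, and F1 as grouped bars.
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred)
    width = 0.25
    fig, ax = plt.subplots()
    for k, (name, values) in enumerate(
        [("Precision", precision), ("Recall", recall), ("F1-Score", f1)]
    ):
        ax.bar([t + k * width for t in ticks], values, width, label=name)
    ax.set_xticks([t + width for t in ticks])
    ax.set_xticklabels(class_names)
    ax.set_ylim(0, 1)
    ax.legend()
    fig.savefig(out_dir / "precision_recall_f1_lr.png")
    plt.close(fig)
    return acc


if __name__ == "__main__":
    # Made-up labels purely to exercise the function.
    report_and_plot([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0], ["negative", "positive"])
```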