{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentiment Analysis Project\n",
"For this project, we'll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.\n",
"\n",
"The 2,000 record IMDb movie review database is accessible through NLTK directly with\n",
"
from nltk.corpus import movie_reviews
\n",
"\n",
    "However, since we already have the reviews in a tab-delimited file, we'll use that instead."
]
},
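  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, here is one way the same `label`/`review` DataFrame *could* be built straight from the NLTK corpus reader (this assumes the corpus has been fetched with `nltk.download('movie_reviews')`; the `nltk_df` name is just illustrative and isn't used below):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative: build an equivalent label/review DataFrame from the NLTK corpus reader.\n",
    "# Assumes the corpus is available locally, e.g. via nltk.download('movie_reviews').\n",
    "import pandas as pd\n",
    "from nltk.corpus import movie_reviews\n",
    "\n",
    "rows = [(cat, movie_reviews.raw(fid))\n",
    "        for cat in movie_reviews.categories()\n",
    "        for fid in movie_reviews.fileids(cat)]\n",
    "\n",
    "nltk_df = pd.DataFrame(rows, columns=['label', 'review'])\n",
    "nltk_df.head()"
   ]
  },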
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" label | \n",
" review | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" neg | \n",
" how do films like mouse hunt get into theatres... | \n",
"
\n",
" \n",
" | 1 | \n",
" neg | \n",
" some talented actresses are blessed with a dem... | \n",
"
\n",
" \n",
" | 2 | \n",
" pos | \n",
" this has been an extraordinary year for austra... | \n",
"
\n",
" \n",
" | 3 | \n",
" pos | \n",
" according to hollywood movies made in last few... | \n",
"
\n",
" \n",
" | 4 | \n",
" neg | \n",
" my first press screening of 1998 and already i... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" label review\n",
"0 neg how do films like mouse hunt get into theatres...\n",
"1 neg some talented actresses are blessed with a dem...\n",
"2 pos this has been an extraordinary year for austra...\n",
"3 pos according to hollywood movies made in last few...\n",
"4 neg my first press screening of 1998 and already i..."
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remove Blank Records (optional)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# REMOVE NaN VALUES AND EMPTY STRINGS:\n",
"df.dropna(inplace=True)\n",
"\n",
"blanks = [] # start with an empty list\n",
"\n",
"for i,lb,rv in df.itertuples(): # iterate over the DataFrame\n",
" if type(rv)==str: # avoid NaN values\n",
" if rv.isspace(): # test 'review' for whitespace\n",
" blanks.append(i) # add matching index numbers to the list\n",
"\n",
"df.drop(blanks, inplace=True)"
]
},
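  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note, the same filtering can be done in a single vectorized step. The cell below is just an equivalent alternative to the loop above, not an additional required step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Equivalent vectorized filter: keep only rows whose 'review' is non-empty after stripping whitespace.\n",
    "# Running this after the loop above is a no-op, since the blank rows are already gone.\n",
    "df = df[df['review'].str.strip().astype(bool)]"
   ]
  },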
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos 969\n",
"neg 969\n",
"Name: label, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Import `SentimentIntensityAnalyzer` and create an `sid` object\n",
    "This assumes that the VADER lexicon has been downloaded; a cell below shows one way to fetch it if it isn't."
]
},
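  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the lexicon is missing, it can be fetched with the standard NLTK downloader (safe to run more than once):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the VADER lexicon if it isn't already present (only needed once per environment).\n",
    "import nltk\n",
    "nltk.download('vader_lexicon')"
   ]
  },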
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"\n",
"sid = SentimentIntensityAnalyzer()"
]
},
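  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on a made-up sentence (not taken from the dataset), `polarity_scores()` returns a dict with `neg`, `neu`, `pos` and `compound` entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example sentence, just to show the structure of the scores dict.\n",
    "sid.polarity_scores('The plot dragged in places, but the acting was absolutely wonderful.')"
   ]
  },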
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Use `sid` to append a `comp_score` to the dataset"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" label | \n",
" review | \n",
" scores | \n",
" compound | \n",
" comp_score | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" neg | \n",
" how do films like mouse hunt get into theatres... | \n",
" {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... | \n",
" -0.9125 | \n",
" neg | \n",
"
\n",
" \n",
" | 1 | \n",
" neg | \n",
" some talented actresses are blessed with a dem... | \n",
" {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... | \n",
" -0.8618 | \n",
" neg | \n",
"
\n",
" \n",
" | 2 | \n",
" pos | \n",
" this has been an extraordinary year for austra... | \n",
" {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... | \n",
" 0.9953 | \n",
" pos | \n",
"
\n",
" \n",
" | 3 | \n",
" pos | \n",
" according to hollywood movies made in last few... | \n",
" {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... | \n",
" 0.9972 | \n",
" pos | \n",
"
\n",
" \n",
" | 4 | \n",
" neg | \n",
" my first press screening of 1998 and already i... | \n",
" {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... | \n",
" -0.7264 | \n",
" neg | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" label review \\\n",
"0 neg how do films like mouse hunt get into theatres... \n",
"1 neg some talented actresses are blessed with a dem... \n",
"2 pos this has been an extraordinary year for austra... \n",
"3 pos according to hollywood movies made in last few... \n",
"4 neg my first press screening of 1998 and already i... \n",
"\n",
" scores compound comp_score \n",
"0 {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... -0.9125 neg \n",
"1 {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... -0.8618 neg \n",
"2 {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... 0.9953 pos \n",
"3 {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... 0.9972 pos \n",
"4 {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... -0.7264 neg "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))\n",
"\n",
"df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])\n",
"\n",
"df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')\n",
"\n",
"df.head()"
]
},
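  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the cutoff: the lambda above maps any non-negative `compound` score to `pos`. The VADER authors suggest treating scores between -0.05 and 0.05 as neutral; since our labels are only `pos`/`neg`, the simple sign test keeps this a two-class comparison, but a stricter (hypothetical, unused) variant would look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical stricter variant (not used in the comparison below): compound scores in the\n",
    "# would-be neutral band below 0.05 are folded into 'neg' just to keep two classes.\n",
    "df['comp_score_strict'] = df['compound'].apply(lambda c: 'pos' if c >= 0.05 else 'neg')\n",
    "df['comp_score_strict'].value_counts()"
   ]
  },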
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a comparison analysis between the original `label` and `comp_score`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score,classification_report,confusion_matrix"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6367389060887513"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(df['label'],df['comp_score'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" neg 0.72 0.44 0.55 969\n",
" pos 0.60 0.83 0.70 969\n",
"\n",
" micro avg 0.64 0.64 0.64 1938\n",
" macro avg 0.66 0.64 0.62 1938\n",
"weighted avg 0.66 0.64 0.62 1938\n",
"\n"
]
}
],
"source": [
"print(classification_report(df['label'],df['comp_score']))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[427 542]\n",
" [162 807]]\n"
]
}
],
"source": [
"print(confusion_matrix(df['label'],df['comp_score']))"
]
},
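  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readability, the same matrix can be wrapped in a labeled DataFrame. With string labels, scikit-learn orders the classes alphabetically (`neg`, then `pos`); rows are the true labels and columns are VADER's predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same counts as above, labeled for readability: rows = true labels, columns = predicted labels.\n",
    "pd.DataFrame(confusion_matrix(df['label'], df['comp_score']),\n",
    "             index=['true neg', 'true pos'],\n",
    "             columns=['pred neg', 'pred pos'])"
   ]
  },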
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "So it looks like VADER couldn't judge the movie reviews very accurately - only about 64% of its labels matched the originals, and the confusion matrix shows that most of the errors were negative reviews scored as positive. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews say positive things about a movie and reserve the final judgement for the last sentence.\n",
"## Great Job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}