{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentiment Analysis Project\n",
"For this project, we'll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.\n",
"\n",
"The 2,000-record IMDb movie review dataset is accessible directly through NLTK with\n",
"<pre>from nltk.corpus import movie_reviews</pre>\n",
"\n",
"However, since we already have it in a tab-delimited file, we'll use that instead."
]
},
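{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, here is a minimal sketch of how the same two-column DataFrame could be built straight from the NLTK corpus instead of the .tsv file. This is only an illustration, not the path we take below; it assumes the corpus has already been fetched with <pre>nltk.download('movie_reviews')</pre>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: build a label/review DataFrame from nltk.corpus.movie_reviews\n",
"# (assumes nltk.download('movie_reviews') has been run)\n",
"import pandas as pd\n",
"from nltk.corpus import movie_reviews\n",
"\n",
"rows = [(cat, movie_reviews.raw(fid))          # (label, full review text)\n",
"        for cat in movie_reviews.categories()  # 'neg', 'pos'\n",
"        for fid in movie_reviews.fileids(cat)]\n",
"\n",
"df_nltk = pd.DataFrame(rows, columns=['label', 'review'])\n",
"df_nltk['label'].value_counts()"
]
},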
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>neg</td>\n",
"      <td>how do films like mouse hunt get into theatres...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>neg</td>\n",
"      <td>some talented actresses are blessed with a dem...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pos</td>\n",
"      <td>this has been an extraordinary year for austra...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>pos</td>\n",
"      <td>according to hollywood movies made in last few...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>neg</td>\n",
"      <td>my first press screening of 1998 and already i...</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  label                                              review\n",
"0   neg  how do films like mouse hunt get into theatres...\n",
"1   neg  some talented actresses are blessed with a dem...\n",
"2   pos  this has been an extraordinary year for austra...\n",
"3   pos  according to hollywood movies made in last few...\n",
"4   neg  my first press screening of 1998 and already i..."
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remove Blank Records (optional)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# REMOVE NaN VALUES AND EMPTY STRINGS:\n",
"df.dropna(inplace=True)\n",
"\n",
"blanks = []  # start with an empty list\n",
"\n",
"for i,lb,rv in df.itertuples():  # iterate over the DataFrame\n",
"    if type(rv)==str:            # avoid NaN values\n",
"        if rv.isspace():         # test 'review' for whitespace\n",
"            blanks.append(i)     # add matching index numbers to the list\n",
"\n",
"df.drop(blanks, inplace=True)"
]
},
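{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above collects the index of every review that is nothing but whitespace. A minimal vectorised sketch of the same cleanup, using pandas string methods, is shown below (illustrative only, and idempotent once the loop has already run)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: the same cleanup without an explicit loop\n",
"df = df.dropna()                         # drop NaN reviews\n",
"df = df[~df['review'].str.isspace()]     # drop reviews that are only whitespace\n",
"len(df)"
]
},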
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos    969\n",
"neg    969\n",
"Name: label, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import `SentimentIntensityAnalyzer` and create a `sid` object\n",
"This assumes that the VADER lexicon has been downloaded, e.g. with `nltk.download('vader_lexicon')`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"\n",
"sid = SentimentIntensityAnalyzer()"
]
},
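{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scoring the whole dataset, here is a minimal check of what `sid.polarity_scores()` returns for a single string: a dictionary with `neg`, `neu`, `pos` and `compound` keys. The sentence below is just an illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on one short string (illustrative only):\n",
"# polarity_scores() returns a dict with 'neg', 'neu', 'pos' and 'compound' keys\n",
"sid.polarity_scores('This movie was absolutely wonderful!')"
]
},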
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use `sid` to append a `comp_score` to the dataset"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"      <th>scores</th>\n",
"      <th>compound</th>\n",
"      <th>comp_score</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>neg</td>\n",
"      <td>how do films like mouse hunt get into theatres...</td>\n",
"      <td>{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...</td>\n",
"      <td>-0.9125</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>neg</td>\n",
"      <td>some talented actresses are blessed with a dem...</td>\n",
"      <td>{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...</td>\n",
"      <td>-0.8618</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pos</td>\n",
"      <td>this has been an extraordinary year for austra...</td>\n",
"      <td>{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...</td>\n",
"      <td>0.9953</td>\n",
"      <td>pos</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>pos</td>\n",
"      <td>according to hollywood movies made in last few...</td>\n",
"      <td>{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...</td>\n",
"      <td>0.9972</td>\n",
"      <td>pos</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>neg</td>\n",
"      <td>my first press screening of 1998 and already i...</td>\n",
"      <td>{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...</td>\n",
"      <td>-0.7264</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  label                                              review  \\\n",
"0   neg  how do films like mouse hunt get into theatres...   \n",
"1   neg  some talented actresses are blessed with a dem...   \n",
"2   pos  this has been an extraordinary year for austra...   \n",
"3   pos  according to hollywood movies made in last few...   \n",
"4   neg  my first press screening of 1998 and already i...   \n",
"\n",
"                                              scores  compound comp_score  \n",
"0  {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...   -0.9125        neg  \n",
"1  {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...   -0.8618        neg  \n",
"2  {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...    0.9953        pos  \n",
"3  {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...    0.9972        pos  \n",
"4  {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...   -0.7264        neg  "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))\n",
"\n",
"df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])\n",
"\n",
"df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')\n",
"\n",
"df.head()"
]
},
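{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the cell above maps every review with a compound score of zero or higher to 'pos'. A common variant (not used in the comparison below) reserves a small neutral band around zero; a minimal sketch assuming a +/-0.05 cutoff follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: three-way mapping with a neutral band around zero (+/-0.05 is a common choice)\n",
"def to_label(c, cutoff=0.05):\n",
"    if c >= cutoff:\n",
"        return 'pos'\n",
"    elif c <= -cutoff:\n",
"        return 'neg'\n",
"    return 'neu'\n",
"\n",
"df['compound'].apply(to_label).value_counts()"
]
},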
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a comparison analysis between the original `label` and `comp_score`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, classification_report, confusion_matrix"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6367389060887513"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(df['label'], df['comp_score'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"              precision    recall  f1-score   support\n",
"\n",
"         neg       0.72      0.44      0.55       969\n",
"         pos       0.60      0.83      0.70       969\n",
"\n",
"   micro avg       0.64      0.64      0.64      1938\n",
"   macro avg       0.66      0.64      0.62      1938\n",
"weighted avg       0.66      0.64      0.62      1938\n",
"\n"
]
}
],
"source": [
"print(classification_report(df['label'], df['comp_score']))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[427 542]\n",
" [162 807]]\n"
]
}
],
"source": [
"print(confusion_matrix(df['label'], df['comp_score']))"
]
},
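{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readability, the raw confusion matrix can be wrapped in a labelled DataFrame. This is just a presentation sketch; with the class labels sorted alphabetically, rows are the true labels and columns are the predicted `comp_score` values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: label the confusion matrix (rows = true label, columns = predicted comp_score)\n",
"pd.DataFrame(confusion_matrix(df['label'], df['comp_score']),\n",
"             index=['true neg', 'true pos'],\n",
"             columns=['pred neg', 'pred pos'])"
]
},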
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So it looks like VADER couldn't judge the movie reviews very accurately. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie but reserved their final judgement for the last sentence.\n",
"## Great Job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}