{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentiment Analysis Project\n",
"For this project, we'll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.\n",
"\n",
"The 2,000-record IMDb movie review dataset is accessible directly through NLTK with\n",
"<pre>from nltk.corpus import movie_reviews</pre>\n",
"\n",
"However, since we already have it in a tab-delimited file, we'll use that instead."
]
},
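{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, here is a minimal sketch of how the same two-column DataFrame could be built straight from the NLTK corpus instead of the .tsv file. This is only an illustration, not the path we take below; it assumes the corpus has already been fetched with <pre>nltk.download('movie_reviews')</pre>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: build a label/review DataFrame from nltk.corpus.movie_reviews\n",
"# (assumes nltk.download('movie_reviews') has been run)\n",
"import pandas as pd\n",
"from nltk.corpus import movie_reviews\n",
"\n",
"rows = [(cat, movie_reviews.raw(fid))          # (label, full review text)\n",
"        for cat in movie_reviews.categories()  # 'neg', 'pos'\n",
"        for fid in movie_reviews.fileids(cat)]\n",
"\n",
"df_nltk = pd.DataFrame(rows, columns=['label', 'review'])\n",
"df_nltk['label'].value_counts()"
]
},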
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>neg</td>\n",
"      <td>how do films like mouse hunt get into theatres...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>neg</td>\n",
"      <td>some talented actresses are blessed with a dem...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pos</td>\n",
"      <td>this has been an extraordinary year for austra...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>pos</td>\n",
"      <td>according to hollywood movies made in last few...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>neg</td>\n",
"      <td>my first press screening of 1998 and already i...</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  label                                              review\n",
"0   neg  how do films like mouse hunt get into theatres...\n",
"1   neg  some talented actresses are blessed with a dem...\n",
"2   pos  this has been an extraordinary year for austra...\n",
"3   pos  according to hollywood movies made in last few...\n",
"4   neg  my first press screening of 1998 and already i..."
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remove Blank Records (optional)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# REMOVE NaN VALUES AND EMPTY STRINGS:\n",
"df.dropna(inplace=True)\n",
"\n",
"blanks = []  # start with an empty list\n",
"\n",
"for i,lb,rv in df.itertuples():  # iterate over the DataFrame\n",
"    if type(rv)==str:            # avoid NaN values\n",
"        if rv.isspace():         # test 'review' for whitespace\n",
"            blanks.append(i)     # add matching index numbers to the list\n",
"\n",
"df.drop(blanks, inplace=True)"
]
},
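{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above collects the index of every review that is nothing but whitespace. A minimal vectorised sketch of the same cleanup, using pandas string methods, is shown below (illustrative only, and idempotent once the loop has already run)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: the same cleanup without an explicit loop\n",
"df = df.dropna()                         # drop NaN reviews\n",
"df = df[~df['review'].str.isspace()]     # drop reviews that are only whitespace\n",
"len(df)"
]
},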
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos    969\n",
"neg    969\n",
"Name: label, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import `SentimentIntensityAnalyzer` and create a `sid` object\n",
"This assumes that the VADER lexicon has been downloaded, e.g. with `nltk.download('vader_lexicon')`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"\n",
"sid = SentimentIntensityAnalyzer()"
]
},
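{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scoring the whole dataset, here is a minimal check of what `sid.polarity_scores()` returns for a single string: a dictionary with `neg`, `neu`, `pos` and `compound` keys. The sentence below is just an illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on one short string (illustrative only):\n",
"# polarity_scores() returns a dict with 'neg', 'neu', 'pos' and 'compound' keys\n",
"sid.polarity_scores('This movie was absolutely wonderful!')"
]
},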
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use `sid` to append a `comp_score` to the dataset"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"      <th>scores</th>\n",
"      <th>compound</th>\n",
"      <th>comp_score</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>neg</td>\n",
"      <td>how do films like mouse hunt get into theatres...</td>\n",
"      <td>{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...</td>\n",
"      <td>-0.9125</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>neg</td>\n",
"      <td>some talented actresses are blessed with a dem...</td>\n",
"      <td>{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...</td>\n",
"      <td>-0.8618</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pos</td>\n",
"      <td>this has been an extraordinary year for austra...</td>\n",
"      <td>{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...</td>\n",
"      <td>0.9953</td>\n",
"      <td>pos</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>pos</td>\n",
"      <td>according to hollywood movies made in last few...</td>\n",
"      <td>{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...</td>\n",
"      <td>0.9972</td>\n",
"      <td>pos</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>neg</td>\n",
"      <td>my first press screening of 1998 and already i...</td>\n",
"      <td>{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...</td>\n",
"      <td>-0.7264</td>\n",
"      <td>neg</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  label                                              review  \\\n",
"0   neg  how do films like mouse hunt get into theatres...   \n",
"1   neg  some talented actresses are blessed with a dem...   \n",
"2   pos  this has been an extraordinary year for austra...   \n",
"3   pos  according to hollywood movies made in last few...   \n",
"4   neg  my first press screening of 1998 and already i...   \n",
"\n",
"                                              scores  compound comp_score  \n",
"0  {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...   -0.9125        neg  \n",
"1  {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...   -0.8618        neg  \n",
"2  {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...    0.9953        pos  \n",
"3  {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...    0.9972        pos  \n",
"4  {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...   -0.7264        neg  "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))\n",
"\n",
"df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])\n",
"\n",
"df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')\n",
"\n",
"df.head()"
]
},
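{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the cell above maps every review with a compound score of zero or higher to 'pos'. A common variant (not used in the comparison below) reserves a small neutral band around zero; a minimal sketch assuming a +/-0.05 cutoff follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: three-way mapping with a neutral band around zero (+/-0.05 is a common choice)\n",
"def to_label(c, cutoff=0.05):\n",
"    if c >= cutoff:\n",
"        return 'pos'\n",
"    elif c <= -cutoff:\n",
"        return 'neg'\n",
"    return 'neu'\n",
"\n",
"df['compound'].apply(to_label).value_counts()"
]
},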
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a comparison analysis between the original `label` and `comp_score`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, classification_report, confusion_matrix"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6367389060887513"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(df['label'], df['comp_score'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"              precision    recall  f1-score   support\n",
"\n",
"         neg       0.72      0.44      0.55       969\n",
"         pos       0.60      0.83      0.70       969\n",
"\n",
"   micro avg       0.64      0.64      0.64      1938\n",
"   macro avg       0.66      0.64      0.62      1938\n",
"weighted avg       0.66      0.64      0.62      1938\n",
"\n"
]
}
],
"source": [
"print(classification_report(df['label'], df['comp_score']))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[427 542]\n",
" [162 807]]\n"
]
}
],
"source": [
"print(confusion_matrix(df['label'], df['comp_score']))"
]
},
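{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readability, the raw confusion matrix can be wrapped in a labelled DataFrame. This is just a presentation sketch; with the class labels sorted alphabetically, rows are the true labels and columns are the predicted `comp_score` values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPTIONAL sketch: label the confusion matrix (rows = true label, columns = predicted comp_score)\n",
"pd.DataFrame(confusion_matrix(df['label'], df['comp_score']),\n",
"             index=['true neg', 'true pos'],\n",
"             columns=['pred neg', 'pred pos'])"
]
},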
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So it looks like VADER couldn't judge the movie reviews very accurately. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie but reserved their final judgement for the last sentence.\n",
"## Great Job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}