materi-praktikum/Praktikum Python Code/03-Text-Classification/04-Text-Classification-Assessment-Solution.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification Assessment - Solution\n",
    "This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n",
    "\n",
    "The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. \n",
    "\n",
    "We've included 20 reviews that contain either `NaN` data, or have strings made up of whitespace.\n",
    "\n",
    "For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #1: Perform imports and load the dataset into a pandas DataFrame\n",
    "For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>pos</td>\n",
       "      <td>I loved this movie and will watch it again. Or...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>pos</td>\n",
       "      <td>A warm, touching movie that has a fantasy-like...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>pos</td>\n",
       "      <td>I was not expecting the powerful filmmaking ex...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>neg</td>\n",
       "      <td>This so-called \"documentary\" tries to tell tha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>pos</td>\n",
       "      <td>This show has been my escape from reality for ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  label                                             review\n",
       "0   pos  I loved this movie and will watch it again. Or...\n",
       "1   pos  A warm, touching movie that has a fantasy-like...\n",
       "2   pos  I was not expecting the powerful filmmaking ex...\n",
       "3   neg  This so-called \"documentary\" tries to tell tha...\n",
       "4   pos  This show has been my escape from reality for ..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\\t')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #2: Check for missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "label      0\n",
       "review    20\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for NaN values:\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for whitespace strings (it's OK if there aren't any!):\n",
    "blanks = []  # start with an empty list\n",
    "\n",
    "for i,lb,rv in df.itertuples():  # iterate over the DataFrame\n",
    "    if type(rv)==str:            # avoid NaN values\n",
    "        if rv.isspace():         # test 'review' for whitespace\n",
    "            blanks.append(i)     # add matching index numbers to the list\n",
    "        \n",
    "len(blanks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #3:  Remove NaN values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #4: Take a quick look at the `label` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos    2990\n",
       "neg    2990\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['label'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #5: Split the data into train & test sets:\n",
    "You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = df['review']\n",
    "y = df['label']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #6: Build a pipeline to vectorize the date, then train and fit a model\n",
    "You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,\n",
       "     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
       "     verbose=0))])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.svm import LinearSVC\n",
    "\n",
    "text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n",
    "                     ('clf', LinearSVC()),\n",
    "])\n",
    "\n",
    "# Feed the training data through the pipeline\n",
    "text_clf.fit(X_train, y_train)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #7: Run predictions and analyze the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Form a prediction set\n",
    "predictions = text_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[900  91]\n",
      " [ 63 920]]\n"
     ]
    }
   ],
   "source": [
    "# Report the confusion matrix\n",
    "from sklearn import metrics\n",
    "print(metrics.confusion_matrix(y_test,predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "         neg       0.93      0.91      0.92       991\n",
      "         pos       0.91      0.94      0.92       983\n",
      "\n",
      "   micro avg       0.92      0.92      0.92      1974\n",
      "   macro avg       0.92      0.92      0.92      1974\n",
      "weighted avg       0.92      0.92      0.92      1974\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print a classification report\n",
    "print(metrics.classification_report(y_test,predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.921985815603\n"
     ]
    }
   ],
   "source": [
    "# Print the overall accuracy\n",
    "print(metrics.accuracy_score(y_test,predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Great job!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}