{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification Assessment - Solution\n", "This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n", "\n", "The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. \n", "\n", "We've included 20 reviews that contain either `NaN` data, or have strings made up of whitespace.\n", "\n", "For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #1: Perform imports and load the dataset into a pandas DataFrame\n", "For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
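<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>pos</td>\n",
"      <td>I loved this movie and will watch it again. Or...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>pos</td>\n",
"      <td>A warm, touching movie that has a fantasy-like...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pos</td>\n",
"      <td>I was not expecting the powerful filmmaking ex...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>neg</td>\n",
"      <td>This so-called \"documentary\" tries to tell tha...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>pos</td>\n",
"      <td>This show has been my escape from reality for ...</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>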
" ], "text/plain": [ " label review\n", "0 pos I loved this movie and will watch it again. Or...\n", "1 pos A warm, touching movie that has a fantasy-like...\n", "2 pos I was not expecting the powerful filmmaking ex...\n", "3 neg This so-called \"documentary\" tries to tell tha...\n", "4 pos This show has been my escape from reality for ..." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #2: Check for missing values:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label 0\n", "review 20\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for NaN values:\n", "df.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for whitespace strings (it's OK if there aren't any!):\n", "blanks = [] # start with an empty list\n", "\n", "for i,lb,rv in df.itertuples(): # iterate over the DataFrame\n", " if type(rv)==str: # avoid NaN values\n", " if rv.isspace(): # test 'review' for whitespace\n", " blanks.append(i) # add matching index numbers to the list\n", " \n", "len(blanks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #3: Remove NaN values:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df.dropna(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #4: Take a quick look at the `label` column:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pos 2990\n", "neg 2990\n", 
"Name: label, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #5: Split the data into train & test sets:\n", "You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = df['review']\n", "y = df['label']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #6: Build a pipeline to vectorize the date, then train and fit a model\n", "You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,\n", " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", " verbose=0))])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.svm import LinearSVC\n", "\n", "text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n", " ('clf', LinearSVC()),\n", "])\n", "\n", "# Feed the training data through the pipeline\n", "text_clf.fit(X_train, y_train) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #7: Run predictions and analyze the 
results" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Form a prediction set\n", "predictions = text_clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[900 91]\n", " [ 63 920]]\n" ] } ], "source": [ "# Report the confusion matrix\n", "from sklearn import metrics\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.93 0.91 0.92 991\n", " pos 0.91 0.94 0.92 983\n", "\n", " micro avg 0.92 0.92 0.92 1974\n", " macro avg 0.92 0.92 0.92 1974\n", "weighted avg 0.92 0.92 0.92 1974\n", "\n" ] } ], "source": [ "# Print a classification report\n", "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.921985815603\n" ] } ], "source": [ "# Print the overall accuracy\n", "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Great job!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }