{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification Project\n", "Now we're at the point where we should be able to:\n", "* Read in a collection of documents - a *corpus*\n", "* Transform text into numerical vector data using a pipeline\n", "* Create a classifier\n", "* Fit/train the classifier\n", "* Test the classifier on new data\n", "* Evaluate performance\n", "\n", "For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/\n", "\n", "In this exercise we'll try to develop a classification model as we did for the SMSSpamCollection dataset - that is, we'll try to predict the Positive/Negative labels based on text content alone. In an upcoming section we'll apply *Sentiment Analysis* to train models that have a deeper understanding of each review." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform imports and load the dataset\n", "The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelreview
0neghow do films like mouse hunt get into theatres...
1negsome talented actresses are blessed with a dem...
2posthis has been an extraordinary year for austra...
3posaccording to hollywood movies made in last few...
4negmy first press screening of 1998 and already i...
\n", "
" ], "text/plain": [ " label review\n", "0 neg how do films like mouse hunt get into theatres...\n", "1 neg some talented actresses are blessed with a dem...\n", "2 pos this has been an extraordinary year for austra...\n", "3 pos according to hollywood movies made in last few...\n", "4 neg my first press screening of 1998 and already i..." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Take a look at a typical review. This one is labeled \"negative\":" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "> how do films like mouse hunt get into theatres ? \r\n", "isn't there a law or something ? \r\n", "this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . \r\n", "mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . \r\n", "writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . \r\n", "the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . \r\n", "deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . \r\n", "but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . \r\n", "the story alternates between unfunny scenes of the brothers bickering over what to do with their inheritance and endless action sequences as the two take on their increasingly determined furry foe . \r\n", "whatever promise the film starts with soon deteriorates into boring dialogue , terrible overacting , and increasingly uninspired slapstick that becomes all sound and fury , signifying nothing . \r\n", "the script becomes so unspeakably bad that the best line poor lee evens can utter after another run in with the rodent is : \" i hate that mouse \" . \r\n", "oh cringe ! \r\n", "this is home alone all over again , and ten times worse . \r\n", "one touching scene early on is worth mentioning . \r\n", "we follow the mouse through a maze of walls and pipes until he arrives at his makeshift abode somewhere in a wall . \r\n", "he jumps into a tiny bed , pulls up a makeshift sheet and snuggles up to sleep , seemingly happy and just wanting to be left alone . \r\n", "it's a magical little moment in an otherwise soulless film . \r\n", "a message to speilberg : if you want dreamworks to be associated with some kind of artistic credibility , then either give all concerned in mouse hunt a swift kick up the arse or hire yourself some decent writers and directors . \r\n", "this kind of rubbish will just not do at all . 
\r\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import Markdown, display\n", "display(Markdown('> '+df['review'][0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for missing values:\n", "We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.\n", "* NaN records are efficiently handled with [.isnull()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) and [.dropna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)\n", "* Strings that contain only whitespace can be handled with [.isspace()](https://docs.python.org/3/library/stdtypes.html#str.isspace), [.itertuples()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html), and [.drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)\n", "\n", "### Detect & remove NaN values:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label 0\n", "review 35\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for the existence of NaN values in a cell:\n", "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "35 records show **NaN** (this stands for \"not a number\" and is equivalent to *None*). These are easily removed using the `.dropna()` pandas function.\n", "
CAUTION: By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can't be undone. However, it does *not* affect the original source data. If we needed to, we could always load the original DataFrame from scratch.
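\n",
    "\n",
    "If you'd rather keep the original intact, here is a minimal sketch of the non-destructive alternative (`df_clean` is just an illustrative name):\n",
    "<pre>\n",
    "# dropna() returns a new DataFrame by default, leaving df itself unchanged:\n",
    "df_clean = df.dropna()\n",
    "</pre>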
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1965" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(inplace=True)\n", "\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detect & remove empty strings\n", "Technically, we're dealing with \"whitespace only\" strings. If the original .tsv file had contained empty strings, pandas **.read_csv()** would have assigned NaN values to those cells by default.\n", "\n", "In order to detect these strings we need to iterate over each row in the DataFrame. The **.itertuples()** pandas method is a good tool for this as it provides access to every field. For brevity we'll assign the names `i`, `lb` and `rv` to the `index`, `label` and `review` columns." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27 blanks: [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]\n" ] } ], "source": [ "blanks = [] # start with an empty list\n", "\n", "for i,lb,rv in df.itertuples(): # iterate over the DataFrame\n", " if type(rv)==str: # avoid NaN values\n", " if rv.isspace(): # test 'review' for whitespace\n", " blanks.append(i) # add matching index numbers to the list\n", " \n", "print(len(blanks), 'blanks: ', blanks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll pass our list of index numbers to the **.drop()** method, and set `inplace=True` to make the change permanent." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1938" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(blanks, inplace=True)\n", "\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We dropped 62 records from the original 2000. Let's continue with the analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Take a quick look at the `label` column:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label\n", "neg 969\n", "pos 969\n", "Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split the data into train & test sets:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = df['review']\n", "y = df['label']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build pipelines to vectorize the data, then train and fit a model\n", "Now that we have sets to train and test, we'll develop a selection of pipelines, each with a different model." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.svm import LinearSVC\n", "\n", "# Naïve Bayes:\n", "text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),\n", " ('clf', MultinomialNB()),\n", "])\n", "\n", "# Linear SVC:\n", "text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),\n", " ('clf', LinearSVC()),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feed the training data through the first pipeline\n", "We'll run naïve Bayes first" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])</pre>
" ], "text/plain": [ "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_clf_nb.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run predictions and analyze the results (naïve Bayes)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Form a prediction set\n", "predictions = text_clf_nb.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[287 21]\n", " [130 202]]\n" ] } ], "source": [ "# Report the confusion matrix\n", "from sklearn import metrics\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.69 0.93 0.79 308\n", " pos 0.91 0.61 0.73 332\n", "\n", " accuracy 0.76 640\n", " macro avg 0.80 0.77 0.76 640\n", "weighted avg 0.80 0.76 0.76 640\n", "\n" ] } ], "source": [ "# Print a classification report\n", "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7640625\n" ] } ], "source": [ "# Print the overall accuracy\n", "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naïve Bayes gave us better-than-average results at 76.4% for classifying reviews as positive or negative based on text alone. Let's see if we can do better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feed the training data through the second pipeline\n", "Next we'll run Linear SVC" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])</pre>
" ], "text/plain": [ "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_clf_lsvc.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run predictions and analyze the results (Linear SVC)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Form a prediction set\n", "predictions = text_clf_lsvc.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[259 49]\n", " [ 49 283]]\n" ] } ], "source": [ "# Report the confusion matrix\n", "from sklearn import metrics\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.84 0.84 0.84 308\n", " pos 0.85 0.85 0.85 332\n", "\n", " accuracy 0.85 640\n", " macro avg 0.85 0.85 0.85 640\n", "weighted avg 0.85 0.85 0.85 640\n", "\n" ] } ], "source": [ "# Print a classification report\n", "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.846875\n" ] } ], "source": [ "# Print the overall accuracy\n", "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad! Based on text alone we correctly classified reviews as positive or negative **84.7%** of the time. In an upcoming section we'll try to improve this score even further by performing *sentiment analysis* on the reviews." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Topic - Adding Stopwords to CountVectorizer\n", "By default, **CountVectorizer** and **TfidfVectorizer** do *not* filter stopwords. However, they offer some optional settings, including passing in your own stopword list.\n", "
CAUTION: There are some [known issues](http://aclweb.org/anthology/W18-2502) using Scikit-learn's built-in stopwords list. Some words that are filtered may in fact aid in classification. In this section we'll pass in our own stopword list, so that we know exactly what's being filtered.
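\n",
    "\n",
    "As a quick illustration, negation words often carry sentiment, yet all of the following sit on the built-in list (a sketch you can run yourself):\n",
    "<pre>\n",
    "from sklearn.feature_extraction import text\n",
    "\n",
    "# each of these meaning-flipping words is filtered by stop_words='english':\n",
    "print([w for w in ('no', 'not', 'nothing', 'never') if w in text.ENGLISH_STOP_WORDS])\n",
    "</pre>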
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class accepts the following arguments:\n", "> *CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, **stop_words=None**, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=)*\n", "\n", "[TfidVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) supports the same arguments and more. Under *stop_words* we have the following options:\n", "> stop_words : *string {'english'}, list, or None (default)*\n", "\n", "That is, we can run `TfidVectorizer(stop_words='english')` to accept scikit-learn's built-in list,
\n", "or `TfidVectorizer(stop_words=[a, and, the])` to filter these three words. In practice we would assign our list to a variable and pass that in instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn's built-in list contains 318 stopwords:\n", ">
from sklearn.feature_extraction import text\n",
    "> print(text.ENGLISH_STOP_WORDS)
\n", "['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']\n", "\n", "However, there are words in this list that may influence a classification of movie reviews. 
With this in mind, let's trim the list to just 60 words:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', \\\n", " 'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his', \\\n", " 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or', \\\n", " 'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this', \\\n", " 'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's repeat the process above and see if the removal of stopwords improves or impairs our score." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# YOU DO NOT NEED TO RUN THIS CELL UNLESS YOU HAVE\n", "# RECENTLY OPENED THIS NOTEBOOK OR RESTARTED THE KERNEL:\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n", "df.dropna(inplace=True)\n", "blanks = []\n", "for i,lb,rv in df.itertuples():\n", " if type(rv)==str:\n", " if rv.isspace():\n", " blanks.append(i)\n", "df.drop(blanks, inplace=True)\n", "from sklearn.model_selection import train_test_split\n", "X = df['review']\n", "y = df['label']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n", "\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.svm import LinearSVC\n", "from sklearn import metrics" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>Pipeline(steps=[('tfidf',\n",
       "                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',\n",
       "                                             'as', 'at', 'be', 'been', 'but',\n",
       "                                             'by', 'can', 'even', 'ever', 'for',\n",
       "                                             'from', 'get', 'had', 'has',\n",
       "                                             'have', 'he', 'her', 'hers', 'his',\n",
       "                                             'how', 'i', 'if', 'in', 'into',\n",
       "                                             'is', ...])),\n",
       "                ('clf', LinearSVC())])</pre>
" ], "text/plain": [ "Pipeline(steps=[('tfidf',\n", " TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',\n", " 'as', 'at', 'be', 'been', 'but',\n", " 'by', 'can', 'even', 'ever', 'for',\n", " 'from', 'get', 'had', 'has',\n", " 'have', 'he', 'her', 'hers', 'his',\n", " 'how', 'i', 'if', 'in', 'into',\n", " 'is', ...])),\n", " ('clf', LinearSVC())])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RUN THIS CELL TO ADD STOPWORDS TO THE LINEAR SVC PIPELINE:\n", "text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),\n", " ('clf', LinearSVC()),\n", "])\n", "text_clf_lsvc2.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[256 52]\n", " [ 48 284]]\n" ] } ], "source": [ "predictions = text_clf_lsvc2.predict(X_test)\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.84 0.83 0.84 308\n", " pos 0.85 0.86 0.85 332\n", "\n", " accuracy 0.84 640\n", " macro avg 0.84 0.84 0.84 640\n", "weighted avg 0.84 0.84 0.84 640\n", "\n" ] } ], "source": [ "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.84375\n" ] } ], "source": [ "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our score didn't change that much. We went from 84.7% without filtering stopwords to 84.4% after adding a stopword filter to our pipeline. Keep in mind that 2000 movie reviews is a relatively small dataset. The real gain from stripping stopwords is improved processing speed; depending on the size of the corpus, it might save hours." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feed new data into a trained model\n", "Once we've developed a fairly accurate model, it's time to feed new data through it. In this last section we'll write our own review, and see how accurately our model assigns a \"positive\" or \"negative\" label to it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First, train the model" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])</pre>
" ], "text/plain": [ "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# YOU DO NOT NEED TO RUN THIS CELL UNLESS YOU HAVE\n", "# RECENTLY OPENED THIS NOTEBOOK OR RESTARTED THE KERNEL:\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\\t')\n", "df.dropna(inplace=True)\n", "blanks = []\n", "for i,lb,rv in df.itertuples():\n", " if type(rv)==str:\n", " if rv.isspace():\n", " blanks.append(i)\n", "df.drop(blanks, inplace=True)\n", "from sklearn.model_selection import train_test_split\n", "X = df['review']\n", "y = df['label']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n", "\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.svm import LinearSVC\n", "from sklearn import metrics\n", "\n", "# Naïve Bayes Model:\n", "text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),\n", " ('clf', MultinomialNB()),\n", "])\n", "\n", "# Linear SVC Model:\n", "text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),\n", " ('clf', LinearSVC()),\n", "])\n", "\n", "# Train both models on the moviereviews.tsv training set:\n", "text_clf_nb.fit(X_train, y_train)\n", "text_clf_lsvc.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next, feed new data to the model's `predict()` method" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "myreview = \"A movie I really wanted to love was terrible. \\\n", "I'm sure the producers had the best intentions, but the execution was lacking.\"" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (425273183.py, line 2)", "output_type": "error", "traceback": [ "\u001b[0;36m Cell \u001b[0;32mIn[29], line 2\u001b[0;36m\u001b[0m\n\u001b[0;31m myreview =\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" ] } ], "source": [ "# Use this space to write your own review. Experiment with different lengths and writing styles.\n", "myreview = \n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(text_clf_nb.predict([myreview])) # be sure to put \"myreview\" inside square brackets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(text_clf_lsvc.predict([myreview]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! Now you should be able to build text classification pipelines in scikit-learn, apply a variety of algorithms like naïve Bayes and Linear SVC, handle stopwords, and test a fitted model on new data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Up next: Text Classification Assessment" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 4 }