{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification Assessment - Solution\n",
    "This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n",
    "\n",
    "The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. Of these, 3000 are positive and 3000 are negative, and the text has been preprocessed into a tab-delimited file. As before, labels are given as `pos` and `neg`.\n",
    "\n",
    "We've included 20 reviews that either contain `NaN` data or consist only of whitespace.\n",
    "\n",
    "For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #1: Perform imports and load the dataset into a pandas DataFrame\n",
    "For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>pos</td>\n",
       "      <td>I loved this movie and will watch it again. Or...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>pos</td>\n",
       "      <td>A warm, touching movie that has a fantasy-like...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>pos</td>\n",
       "      <td>I was not expecting the powerful filmmaking ex...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>neg</td>\n",
       "      <td>This so-called \"documentary\" tries to tell tha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>pos</td>\n",
       "      <td>This show has been my escape from reality for ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  label                                             review\n",
       "0   pos  I loved this movie and will watch it again. Or...\n",
       "1   pos  A warm, touching movie that has a fantasy-like...\n",
       "2   pos  I was not expecting the powerful filmmaking ex...\n",
       "3   neg  This so-called \"documentary\" tries to tell tha...\n",
       "4   pos  This show has been my escape from reality for ..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\\t')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #2: Check for missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "label      0\n",
       "review    20\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for NaN values:\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for whitespace strings (it's OK if there aren't any!):\n",
    "blanks = []  # start with an empty list\n",
    "\n",
    "for i, lb, rv in df.itertuples():  # iterate over the DataFrame\n",
    "    if isinstance(rv, str):        # skip NaN values, which are floats\n",
    "        if rv.isspace():           # test 'review' for whitespace\n",
    "            blanks.append(i)       # add matching index numbers to the list\n",
    "\n",
    "len(blanks)"
   ]
  },
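  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a cross-check on the loop above, the same count can be sketched with pandas string methods. This is a minimal alternative, not part of the original solution; it assumes `df` is the DataFrame loaded in Task #1 and calls `Series.str.isspace()` after dropping `NaN` rows, so that only real strings are tested."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Vectorized whitespace check (alternative sketch, not the original solution)\n",
    "whitespace_mask = df['review'].dropna().str.isspace()  # True where a review is only whitespace\n",
    "\n",
    "int(whitespace_mask.sum())  # should agree with len(blanks)"
   ]
  },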
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #3: Remove NaN values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #4: Take a quick look at the `label` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos    2990\n",
       "neg    2990\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['label'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #5: Split the data into train & test sets:\n",
    "You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = df['review']\n",
    "y = df['label']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
   ]
  },
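  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check on the split can help when comparing against the solution. The sketch below is an optional addition, not part of the original solution. It assumes the `X_train`, `X_test`, and `y_train` objects created in the previous cell, and reports the size of each set and the class balance of the training labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check on the split (illustrative addition)\n",
    "print(len(X_train), len(X_test))             # number of reviews in each set\n",
    "print(y_train.value_counts(normalize=True))  # share of pos/neg labels in the training set"
   ]
  },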
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #6: Build a pipeline to vectorize the data, then train and fit a model\n",
    "You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,\n",
       "     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
       "     verbose=0))])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.svm import LinearSVC\n",
    "\n",
    "text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n",
    "                     ('clf', LinearSVC()),\n",
    "])\n",
    "\n",
    "# Feed the training data through the pipeline\n",
    "text_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task #7: Run predictions and analyze the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Form a prediction set\n",
    "predictions = text_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[900  91]\n",
      " [ 63 920]]\n"
     ]
    }
   ],
   "source": [
    "# Report the confusion matrix\n",
    "from sklearn import metrics\n",
    "print(metrics.confusion_matrix(y_test,predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "         neg       0.93      0.91      0.92       991\n",
      "         pos       0.91      0.94      0.92       983\n",
      "\n",
      "   micro avg       0.92      0.92      0.92      1974\n",
      "   macro avg       0.92      0.92      0.92      1974\n",
      "weighted avg       0.92      0.92      0.92      1974\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print a classification report\n",
    "print(metrics.classification_report(y_test,predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.921985815603\n"
     ]
    }
   ],
   "source": [
    "# Print the overall accuracy\n",
    "print(metrics.accuracy_score(y_test,predictions))"
   ]
  },
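  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the pipeline is fitted, it can score raw text directly, because the `TfidfVectorizer` step transforms each string before `LinearSVC` sees it. The cell below is an illustrative addition rather than part of the original solution: the two reviews are made up, and no output is shown because the predicted labels depend on the fitted model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: these reviews are invented, not taken from the dataset\n",
    "new_reviews = [\n",
    "    'A wonderful film with a moving story and great performances.',\n",
    "    'Dull, predictable, and far too long.'\n",
    "]\n",
    "\n",
    "text_clf.predict(new_reviews)  # returns an array of 'pos'/'neg' labels"
   ]
  },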
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Great job!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}