{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Classification Assessment - Solution\n",
"This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n",
"\n",
"The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. \n",
"\n",
"We've included 20 reviews that contain either `NaN` data, or have strings made up of whitespace.\n",
"\n",
"For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #1: Perform imports and load the dataset into a pandas DataFrame\n",
"For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>pos</td>\n",
" <td>I loved this movie and will watch it again. Or...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>pos</td>\n",
" <td>A warm, touching movie that has a fantasy-like...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>pos</td>\n",
" <td>I was not expecting the powerful filmmaking ex...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>neg</td>\n",
" <td>This so-called \"documentary\" tries to tell tha...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>pos</td>\n",
" <td>This show has been my escape from reality for ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"0 pos I loved this movie and will watch it again. Or...\n",
"1 pos A warm, touching movie that has a fantasy-like...\n",
"2 pos I was not expecting the powerful filmmaking ex...\n",
"3 neg This so-called \"documentary\" tries to tell tha...\n",
"4 pos This show has been my escape from reality for ..."
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\\t')\n",
"df.head()"
]
},
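{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the graded task), the loaded DataFrame should contain all 6000 reviews described above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Expect 6000 rows, 20 of which hold NaN or whitespace-only reviews:\n",
"len(df)"
]
},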
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #2: Check for missing values:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"label 0\n",
"review 20\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check for NaN values:\n",
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check for whitespace strings (it's OK if there aren't any!):\n",
"blanks = [] # start with an empty list\n",
"\n",
"for i,lb,rv in df.itertuples(): # iterate over the DataFrame\n",
" if type(rv)==str: # avoid NaN values\n",
" if rv.isspace(): # test 'review' for whitespace\n",
" blanks.append(i) # add matching index numbers to the list\n",
" \n",
"len(blanks)"
]
},
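{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same check can be written without an explicit loop. The sketch below (an alternative, not part of the original solution) uses pandas string methods; `.str.isspace()` returns `NaN` for missing reviews, so `fillna(False)` keeps the mask boolean:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative sketch: vectorized whitespace check with pandas string methods.\n",
"whitespace_mask = df['review'].str.isspace().fillna(False)\n",
"whitespace_mask.sum()   # should agree with len(blanks) above"
]
},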
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #3: Remove NaN values:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"df.dropna(inplace=True)"
]
},
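{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `blanks` came back empty, dropping the `NaN` rows is all the cleaning this dataset needs. Had the whitespace check found any matching indices, a drop by index (sketched below for completeness; it is a no-op here) would remove those rows as well:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Only has an effect if the whitespace check above found any indices:\n",
"df.drop(blanks, inplace=True)"
]
},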
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #4: Take a quick look at the `label` column:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos 2990\n",
"neg 2990\n",
"Name: label, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #5: Split the data into train & test sets:\n",
"You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df['review']\n",
"y = df['label']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
]
},
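{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional shape check: with `test_size=0.33`, roughly one third of the 5980 remaining reviews should land in the test set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train and test sizes should sum to 5980:\n",
"X_train.shape, X_test.shape"
]
},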
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #6: Build a pipeline to vectorize the date, then train and fit a model\n",
"You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,\n",
" multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
" verbose=0))])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n",
" ('clf', LinearSVC()),\n",
"])\n",
"\n",
"# Feed the training data through the pipeline\n",
"text_clf.fit(X_train, y_train) "
]
},
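{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline first converts each raw review into a TF-IDF weighted term vector, then trains the `LinearSVC` on those vectors. One optional way to peek at what the vectorizer learned is the vocabulary size; the exact count depends on the training split:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The fitted TfidfVectorizer is reachable under the name given in the Pipeline:\n",
"len(text_clf.named_steps['tfidf'].vocabulary_)"
]
},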
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #7: Run predictions and analyze the results"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Form a prediction set\n",
"predictions = text_clf.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[900 91]\n",
" [ 63 920]]\n"
]
}
],
"source": [
"# Report the confusion matrix\n",
"from sklearn import metrics\n",
"print(metrics.confusion_matrix(y_test,predictions))"
]
},
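{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows are the true labels and columns the predictions, with classes sorted alphabetically (`neg`, `pos`). The optional sketch below wraps the same matrix in a labeled DataFrame for readability:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: label the confusion matrix for easier reading.\n",
"pd.DataFrame(metrics.confusion_matrix(y_test,predictions),\n",
"             index=['neg','pos'], columns=['neg','pos'])"
]
},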
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" neg 0.93 0.91 0.92 991\n",
" pos 0.91 0.94 0.92 983\n",
"\n",
" micro avg 0.92 0.92 0.92 1974\n",
" macro avg 0.92 0.92 0.92 1974\n",
"weighted avg 0.92 0.92 0.92 1974\n",
"\n"
]
}
],
"source": [
"# Print a classification report\n",
"print(metrics.classification_report(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.921985815603\n"
]
}
],
"source": [
"# Print the overall accuracy\n",
"print(metrics.accuracy_score(y_test,predictions))"
]
},
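{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final optional check, the fitted pipeline can score brand-new text directly, since the TF-IDF transformation is applied automatically inside `predict`. The review below is made up purely for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pass any iterable of raw strings; the pipeline vectorizes them internally:\n",
"text_clf.predict([\"This movie was a wonderful surprise, I enjoyed every minute.\"])"
]
},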
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Great job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}