materi-praktikum/Praktikum Python Code/03-Text-Classification/03-Text-Classification-Assessment.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Classification Assessment\n",
"This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n",
"\n",
"The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the reviews have been preprocessed and saved as a tab-delimited file. As before, labels are given as `pos` and `neg`.\n",
"\n",
"We've included 20 reviews that are either `NaN` or strings made up entirely of whitespace.\n",
"\n",
"For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #1: Perform imports and load the dataset into a pandas DataFrame\n",
"For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
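{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One possible sketch (not the graded solution), using the path given above:\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\\t')\n",
"df.head()\n"
]
},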
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #2: Check for missing values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for NaN values:\n"
]
},
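{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming the DataFrame is named df as in the Task #1 sketch:\n",
"df.isnull().sum()\n"
]
},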
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for whitespace strings (it's OK if there aren't any!):\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
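{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming df has exactly two columns ('label' and 'review') as in the earlier project:\n",
"# collect the index of any review that is a whitespace-only string\n",
"blanks = []\n",
"\n",
"for i, lb, rv in df.itertuples():\n",
"    if isinstance(rv, str) and rv.isspace():\n",
"        blanks.append(i)\n",
"\n",
"blanks\n"
]
},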
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #3: Remove NaN values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
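{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming df from the Task #1 sketch: drop the rows that contain NaN\n",
"df.dropna(inplace=True)\n",
"len(df)\n"
]
},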
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #4: Take a quick look at the `label` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
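{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming a 'label' column as in the earlier project:\n",
"df['label'].value_counts()\n"
]
},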
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #5: Split the data into train & test sets:\n",
"You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
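{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch using the suggested settings; assumes 'review' and 'label' columns in df:\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df['review']\n",
"y = df['label']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n"
]
},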
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #6: Build a pipeline to vectorize the data, then train and fit a model\n",
"You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
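{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch using the suggested LinearSVC model; the name text_clf is just a placeholder:\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n",
"                     ('clf', LinearSVC())])\n",
"\n",
"text_clf.fit(X_train, y_train)\n"
]
},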
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task #7: Run predictions and analyze the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Form a prediction set\n"
]
},
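{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming the fitted text_clf pipeline and X_test from the sketches above:\n",
"predictions = text_clf.predict(X_test)\n"
]
},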
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Report the confusion matrix\n",
"\n"
]
},
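{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming y_test and predictions from the sketches above:\n",
"from sklearn import metrics\n",
"\n",
"print(metrics.confusion_matrix(y_test, predictions))\n"
]
},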
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print a classification report\n"
]
},
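{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming y_test and predictions from the sketches above:\n",
"from sklearn import metrics\n",
"\n",
"print(metrics.classification_report(y_test, predictions))\n"
]
},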
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the overall accuracy\n"
]
},
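{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch, assuming y_test and predictions from the sketches above:\n",
"from sklearn import metrics\n",
"\n",
"print(metrics.accuracy_score(y_test, predictions))\n"
]
},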
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Great job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}