{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification Assessment\n", "This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.\n", "\n", "The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.\n", "\n", "We've included 20 reviews that either contain `NaN` data or consist entirely of whitespace strings.\n", "\n", "For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #1: Perform imports and load the dataset into a pandas DataFrame\n", "For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #2: Check for missing values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for NaN values:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for whitespace strings (it's OK if there aren't any!):\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #3: Remove NaN values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #4: Take a quick look at the `label` column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "### Task #5: Split the data into train & test sets:\n", "You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #6: Build a pipeline to vectorize the data, then train and fit a model\n", "You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task #7: Run predictions and analyze the results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Form a prediction set\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report the confusion matrix\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print a classification report\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the overall accuracy\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Great job!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }