{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-learn Primer\n", "\n", "**Scikit-learn** (http://scikit-learn.org/) is an open-source machine learning library for Python that offers a variety of regression, classification and clustering algorithms.\n", "\n", "In this section we'll perform a fairly simple classification exercise with scikit-learn. In the next section we'll leverage the machine learning strength of scikit-learn to perform natural language classifications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Installation and Setup\n", "\n", "### From the command line or terminal:\n", "> `conda install scikit-learn`\n", ">
*or*
\n", "> `pip install -U scikit-learn`\n", "\n", "Scikit-learn additionally requires that NumPy and SciPy be installed. For more info visit http://scikit-learn.org/stable/install.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Perform Imports and Load Data\n", "For this exercise we'll be using the **SMSSpamCollection** dataset from [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) that contains more than 5 thousand SMS phone messages.
You can check out the [**sms_readme**](../TextFiles/sms_readme.txt) file for more info.\n", "\n", "The file is a [tab-separated-values](https://en.wikipedia.org/wiki/Tab-separated_values) (tsv) file with four columns:\n", "> **label** - every message is labeled as either ***ham*** or ***spam***
\n", "> **message** - the message itself
\n", "> **length** - the number of characters in each message
\n", "> **punct** - the number of punctuation characters in each message" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelengthpunct
0hamGo until jurong point, crazy.. Available only ...1119
1hamOk lar... Joking wif u oni...296
2spamFree entry in 2 a wkly comp to win FA Cup fina...1556
3hamU dun say so early hor... U c already then say...496
4hamNah I don't think he goes to usf, he lives aro...612
\n", "
" ], "text/plain": [ " label message length punct\n", "0 ham Go until jurong point, crazy.. Available only ... 111 9\n", "1 ham Ok lar... Joking wif u oni... 29 6\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155 6\n", "3 ham U dun say so early hor... U c already then say... 49 6\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61 2" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('../TextFiles/smsspamcollection.tsv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5572" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for missing values:\n", "Machine learning models usually require complete data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label 0\n", "message 0\n", "length 0\n", "punct 0\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Take a quick look at the *ham* and *spam* `label` column:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array(['ham', 'spam'], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].unique()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ham 4825\n", "spam 747\n", "Name: label, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that 4825 out of 5572 messages, or 86.6%, are ham.
This means that any machine learning model we create has to perform **better than 86.6%** to beat random chance.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the data:\n", "Since we're not ready to do anything with the message text, let's see if we can predict ham/spam labels based on message length and punctuation counts. We'll look at message `length` first:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5572.000000\n", "mean 80.489950\n", "std 59.942907\n", "min 2.000000\n", "25% 36.000000\n", "50% 62.000000\n", "75% 122.000000\n", "max 910.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['length'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is extremely skewed. The mean value is 80.5 and yet the max length is 910. Let's plot this on a logarithmic x-axis." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFnBJREFUeJzt3X2QXXWd5/H3lxjJOCLR0LCxO9pxiLMBugjaJrBSNYgsBBwEFGZgR02UJUoBKzpqiGUVjC5VigtRxtlIMCxhi+VhgVnCw+giT0oVTx0GCCFjEaFXepMibXgYEGFJ+O4ffRKbeDv39n3o2336/arquvf8zu+c+21+3M89+fW550RmIkkqrz3aXYAkqbUMekkqOYNekkrOoJekkjPoJankDHpJKjmDXpJKzqCXpJIz6CWp5Ax6SSq5t7W7AIB99tknu7u7212GJE0oa9eu/W1mdlTrNy6Cvru7m76+vnaXIUkTSkT8n1r6OXUjSSVn0EtSydUc9BExJSL+OSJuLZZnR8SDEfFURFwXEW8v2vcsljcW67tbU7okqRajmaP/MrABeFex/D1geWZeGxE/Bk4HVhSPL2Tm/hFxatHvr5tYs6RJ6o033mBgYIDXXnut3aWMqWnTptHV1cXUqVPr2r6moI+ILuATwIXAVyMigCOB/1B0WQ1cwFDQn1A8B7gB+FFERHqHE0kNGhgYYK+99qK7u5uhGCq/zGTr1q0MDAwwe/bsuvZR69TND4BvAG8WyzOAFzNzW7E8AHQWzzuBZ4sCtwEvFf0lqSGvvfYaM2bMmDQhDxARzJgxo6F/xVQN+oj4S2BLZq4d3lyha9awbvh+l0REX0T0DQ4O1lSsJE2mkN+h0d+5liP6jwKfjIh+4FqGpmx+AEyPiB1TP13ApuL5ADCrKO5twN7A87vuNDNXZmZvZvZ2dFQ931+SxoX+/n4OOuigdpcxKlXn6DNzGbAMICKOAL6WmX8TEf8TOJmh8F8E3FxssqZYvr9Yf5fz8yq74//+vortt5xz+BhXMrmM9N+9XmUdr0bOo1/K0B9mNzI0B7+qaF8FzCjavwqc11iJkjS+bN++nTPOOIMDDzyQo48+mt///vdcfvnlfOQjH+Hggw/m05/+NK+++ioAixcv5swzz+RjH/sYH/jAB7j33nv5whe+wNy5c1m8ePGY1DuqoM/MezLzL4vnT2fm/MzcPzNPyczXi/bXiuX9i/VPt6JwSWqXp556irPOOov169czffp0brzxRj71qU/x8MMP89hjjzF37lxWrVq1s/8LL7zAXXfdxfLlyzn++OP5yle+wvr161m3bh2PPvpoy+sdF9e6kcrKKZ1ymj17NvPmzQPgwx/+MP39/TzxxBN861vf4sUXX+SVV17hmGOO2dn/+OOPJyLo6elhv/32o6enB4ADDzyQ/v7+nftqFS+BIEmjtOeee+58PmXKFLZt28bixYv50Y9+xLp16zj//PPfcjrkjv577LHHW7bdY4892LZtG61m0EtSE7z88svMnDmTN954g6uvvrrd5byFUzeS1ATf+c53WLBgAe9///vp6enh5ZdfbndJO8V4OPOxt7c3vR69JrLRnubnHH19NmzYwNy5c9tdRltU+t0jYm1m9lbb1qkbSSo5g16SSs6gl6SSM+glqeQMekkqOYNekkrOoJekkvMLU5Imrsv+orn7++K9zd3fOOERvSTV6He/+x2f+MQnOPjggznooIO47rrr6O7uZunSpcyfP5/58+ezceNGAG655RYWLFjAIYccwlFHHcVzzz0HwAUXXMCiRYs4+uij6e7u5qabbuIb3/gGPT09LFy4kDfeeKPpdXtEL41Cs290oYnlpz/9Ke9973u57bbbAHjppZdYunQp73rXu3jooYe46qqrOPfcc7n11ls5/PDDeeCBB4gIfvKTn3DRRRdx8cUXA/DrX/+au+++myeffJLDDjuMG2+8kYsuuoiTTjqJ2267jRNPPLGpdXtEL0k16unp4ec//zlLly7ll7/8JXvvvTcAp5122s7H+++/H4CBgQGOOeYYenp6+P73v8/69et37ufYY49l6tSp9PT0sH37dhYuXLhz//39/U2v26CXpBp98IMfZO3atfT09LBs2TK+/e1vA2+9efeO5+eccw5nn30269at47LLLhvxssVTp07duU2rLltcNegjYlpEPBQRj0XE+oj4u6L9yoh4JiIeLX7mFe0REZdGxMaIeDwiPtT0qiWpDTZt2sQ73vEOPvOZz/C1r32NRx55BIDrrrtu5+Nhhx0GDE3rdHZ2ArB69er2FFyoZY7+deDIzHwlIqYC90XEPxXrvp6ZN+zS/1hgTvGzAFhRPErShLZu3Tq+/vWv7zwSX7FiBSeffDKvv/46CxYs4M033+Saa64Bhv7oesopp9DZ2cmhhx7KM88807a6R3WZ4oh4B3AfcGbxc+uuQR8RlwH3ZOY1xfKvgCMyc/NI+/UyxZoomvXHWC9TXJ/xeJni7u5u+vr62GeffVr6Oi2/THFETImIR4EtwB2Z+WCx6sJiemZ5ROy4P1Yn8OywzQeKNklSG9QU9Jm5PTPnAV3A/Ig4CFgG/FvgI8B7gKVF96i0i10bImJJRPRFRN/g4GBdxUtSu/X397f8aL5RozrrJjNfBO4BFmbm5hzyOvDfgPlFtwFg1rDNuoBNFfa1MjN7M7O3o6OjruIlSdXVctZNR0RML57/CXAU8C8RMbNoC+BE4IlikzXA54qzbw4FXtrd/LwkjcZ4uP3pWGv0d67lrJuZwOqImMLQB8P1mXlrRNwVER0MTdU8Cnyp6H87cBywEXgV+HxDFUpSYdq0aWzdupUZM2a85dz1MstMtm7dyrRp0+reR9Wgz8zHgUMqtB85Qv8Ezqq7IkkaQVdXFwMDA0y2v+tNmzaNrq6uurf3WjeSJoypU6cye/bsdpcx4XgJBEkqOY/oNWmN9OUnv8yksvGIXpJKzqCXpJJz6kbahTcXUdl4RC9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxfmFLp+QUoTXYGvdQG9Xz4eLE11cupG0kquVruGTstIh6KiMciYn1E/F3RPjsiHoyIpyLiuoh4e9G+Z7G8sVjf3dpfQZK0O7Uc0b8OHJmZBwPzgIXFTb+/ByzPzDnAC8DpRf/TgRcyc39gedFPktQmVYM+h7xSLE4tfhI4ErihaF8NnFg8P6FYplj/8Zgsd/GVpHGopj/GRsQUYC2wP/APwK+BFzNzW9FlAOgsnncCzwJk5raIeAmYAfy2iXVLk453xFK9avpjbGZuz8x5QBcwH5hbqVvxWOnoPXdtiIglEdEXEX2T7Y7ukjSWRnXWTWa+CNwDHApMj4gd/yLoAjYVzweAWQDF+r2B5yvsa2Vm9mZmb0dHR33VS5KqquWsm46ImF48/xPgKGADcDdwctFtEXBz8XxNsUyx/q7M/KMjeknS2Khljn4msLqYp98DuD4zb42IJ4FrI+I/A/8MrCr6rwL+e0RsZOhI/tQW1C1JqlHVoM/Mx4FDKrQ/zdB8/a7trwGnNKU6SVLDvASCpPpd9heV279479jWod3yEgiSVHIGvSSVnEEvSSVn0EtSyRn0klRynnUjTXBeA0fVeEQvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcp5HL6mqEc/Vf/sYF6K6eEQvSSVn0EtSyVWduomIWcBVwL8B3gRWZuYPI+IC4AxgsOj6zcy8vdhmGXA6sB34T5n5sxbULmmMXPLilyuv2PedY1uI6lLLHP024G8z85GI2AtYGxF3FOuWZ+Z/Gd45Ig5g6D6xBwLvBX4eER/MzO3NLFySVJuqUzeZuTkzHymevwxsADp3s8kJwLWZ+XpmPgNspMK9ZSVJY2NUc/QR0c3QjcIfLJrOjojHI+KKiHh30dYJPDtsswF2/8EgSWqhmoM+It4J3Aicm5n/CqwA/gyYB2wGLt7RtcLmWWF/SyKiLyL6BgcHK2wiSWqGmoI+IqYyFPJXZ+ZNAJn5XGZuz8w3gcv5w/TMADBr2OZdwKZd95mZKzOzNzN7Ozo6GvkdJEm7UTXoIyKAVcCGzLxkWPvMYd1OAp4onq8BTo2IPSNiNjAHeKh5JUuSRqOWs24+CnwWWBcRjxZt3wROi4h5DE3L9ANfBMjM9RFxPfAkQ2fsnOUZN9LEMNI3YC+p2KqJomrQZ+Z9VJ53v30321wIXNhAXZKkJvGbsZJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klVwt17qRpIqe2vJKxfY5Y1yHds8jekkqOYNekkrOoJekkjPoJank/GOsVFIj3UTklnMOH+NK1G4GvTTJjPQBoPJy6kaSSq6Wm4PPioi7I2JDRKyPiC8X7e+JiDsi4qni8d1Fe0TEpRGxMSIej4gPtfqXkCSNrJYj+m3A32bmXOBQ4KyIOAA4D7gzM+cAdxbLAMcy9H2JOcASYEXTq5Yk1axq0Gfm5sx8pHj+MrAB6AROAFYX3VYDJxbPTwCuyiEPANMjYmbTK5ck1WRUc/QR0Q0cAjwI7JeZm2HowwDYt+jWCTw7bLOBom3XfS2JiL6I6BscHBx95ZKkmtQc9BHxTuBG4NzM/Nfdda3Qln/UkLkyM3szs7ejo6PWMiRJo1RT0EfEVIZC/urMvKlofm7HlEzxuKVoHwBmDdu8C9jUnHIlSaNVy1k3AawCNmTmJcNWrQEWFc8XATcPa/9ccfbNocBLO6Z4JEljr5YvTH0U+CywLiIeLdq+CXwXuD4iTgd+A5xSrLsdOA7YCLwKfL6pFUuSRqVq0GfmfVSedwf4eIX+CZzVYF2SpCbxm7GSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRytdwz9oqI2BIRTwxruyAi/m9EPFr8HDds3bKI2BgRv4qIY1pVuCSpNrUc0V8JLKzQvjwz5xU/twNExAHAqcCBxTb/NSKmNKtYSdLoVQ36zPwF8HyN+zsBuDYzX8/MZxi6Qfj8BuqTJDWokTn6syPi8WJq591FWyfw7LA+A0WbJKlN6g36FcCfAfOAzcDFRXtU6JuVdhARSyKiLyL6BgcH6yxDklRNXUGfmc9l5vbMfBO4nD9MzwwAs4Z17QI2jbCPlZnZm5m9HR0d9ZQhSapBXUEfETOHLZ4E7DgjZw1wakTsGRGzgTnAQ42VKElqxNuqdYiIa4AjgH0iYgA4HzgiIuYxNC3TD3wRIDPXR8T1wJPANuCszNzemtIlSbWoGvSZeVqF5lW76X8hcGEjRUmSmsdvxkpSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUslVDfqIuCIitkTEE8Pa3hMRd0TEU8Xju4v2iIhLI2JjRDweER9qZfGSpOpqOaK/Eli4S9t5wJ2ZOQe4s1gGOJahG4LPAZYAK5pTpiSpXlWDPjN/ATy/S/MJwOri+WrgxGHtV+WQB4DpETGzWcVKkkav3jn6/TJzM0DxuG/R3gk8O6zfQNEmSWqTZv8xNiq0ZcWOEUsioi8i+gYHB5tchiRph3qD/rkdUzLF45aifQCYNaxfF7Cp0g4yc2Vm9mZmb0dHR51lSJKqqTfo1wCLiueLgJuHtX+uOPvmUOClHVM8kqT2eFu1DhFxDXAEsE9EDADnA98Fro+I04HfAKcU3W8HjgM2Aq8Cn29BzZKkUaga9Jl52girPl6hbwJnNVqUJKl5/GasJJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUslVvdaNpMnjkhe/3O4S1AIe0UtSyRn0klRyBr0klZxBL0klZ9BLUsk1dNZNRPQDLwPbgW2Z2RsR7wGuA7qBfuCvMvOFxsqUJNWrGUf0H8vMeZnZWyyfB9yZmXOAO4tlSVKbtGLq5gRgdfF8NXBiC15DklSjRoM+gf8dEWsjYknRtl9mbgYoHvdt8DUkSQ1o9JuxH83MTRGxL3BHRPxLrRsWHwxLAN73vvc1WIYkaSQNBX1mbioet0TEPwLzgeciYmZmbo6ImcCWEbZdCawE6O3tzUbqkACO//v72l3ChDAWlzkYaSxuOefwlr+2/ljdUzcR8acRsdeO58DRwBPAGmBR0W0RcHOjRUqS6tfIEf1+wD9GxI79/I/M/GlEPAxcHxGnA78BTmm8TElSveoO+sx8Gji4QvtW4OONFCVJah6/GStJJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klVyjFzWTxpzXtJlcvG5O4wx6SWNmdx/SBnfrGPSSxgX/pdY6Br3ayje31HoGvTRBjHQd+a9O/+Go+mvyMeglTUj+kbZ2Br00wXnkrmoM+hIaj0c6zsVL7WPQq64PhvH4YTIejXZeXa01Wf+/bVnQR8RC4IfAFOAnmfndVr3WRDJZ/0eTxor/evxjLQn6iJgC/APw74EB4OGIWJOZT7bi9cqsnV8w8Q3THmWYc/dfMuNLq47o5wMbi/vKEhHXAicAEzLoy3IUPhbBXfYPhzKEsGpXlvd+q4K+E3h22PIAsKBFr6UJrJ4jv2adT97O888n6wfG7n7vdh7tj/YApZkHNGPxoRGZ2fydRpwCHJOZ/7FY/iwwPzPPGdZnCbCkWPxz4FcVdrU38FKVtn2A3zap9NGqVN9Y7afWbar12936kdbVMi4wOcfGcdk93zMjt9UzLu/PzI6qvTKz6T/AYcDPhi0vA5bVsZ+V1dqAvlb8DvXWN1b7qXWbav12t36kdbWMy2QdG8dlfI7LRBibVo5Lq65H/zAwJyJmR8TbgVOBNXXs55Ya29qlWbXUs59at6nWb3frR1o33scF2jc2jsvu+Z6p/XWapiVTNwARcRzwA4ZOr7wiMy9s0ev0ZWZvK/atxjg245PjMj61clxadh59Zt4O3N6q/Q+zcgxeQ/VxbMYnx2V8atm4tOyIXpI0PnjPWEkqOYNekkrOoJekkitd0EfEn0bE6oi4PCL+pt31aEhEfCAiVkXEDe2uRW8VEScW75ebI+LodtejIRExNyJ+HBE3RMSZjexrQgR9RFwREVsi4old2hdGxK8iYmNEnFc0fwq4ITPPAD455sVOIqMZl8x8OjNPb0+lk88ox+Z/Fe+XxcBft6HcSWOU47IhM78E/BXQ0GmXEyLogSuBhcMbhl0h81jgAOC0iDgA6OIP19nZPoY1TkZXUvu4aGxdyejH5lvFerXOlYxiXCLik8B9wJ2NvOiECPrM/AXw/C7NO6+QmZn/D9hxhcwBhsIeJsjvN1GNclw0hkYzNjHke8A/ZeYjY13rZDLa90xmrsnMfwc0NA09kYOw0hUyO4GbgE9HxArG39e/J4OK4xIRMyLix8AhEbGsPaVNeiO9Z84BjgJOjogvtaOwSW6k98wREXFpRFxGg18+nci3EowKbZmZvwM+P9bFaKeRxmUrYIi010hjcylw6VgXo51GGpd7gHua8QIT+Yh+AJg1bLkL2NSmWvQHjsv45diMTy0fl4kc9M26Qqaay3EZvxyb8anl4zIhgj4irgHuB/48IgYi4vTM3AacDfwM2ABcn5nr21nnZOO4jF+OzfjUrnHxomaSVHIT4oheklQ/g16SSs6gl6SSM+glqeQMekkqOYNekkrOoJekkjPoJankDHpJKrn/DyDmFU1Id+LIAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.xscale('log')\n", "bins = 1.15**(np.arange(0,50))\n", "plt.hist(df[df['label']=='ham']['length'],bins=bins,alpha=0.8)\n", "plt.hist(df[df['label']=='spam']['length'],bins=bins,alpha=0.8)\n", "plt.legend(('ham','spam'))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like there's a small range of values where a message is more likely to be spam than ham.\n", "\n", "Now let's look at the `punct` column:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5572.000000\n", "mean 4.177495\n", "std 4.623919\n", "min 0.000000\n", "25% 2.000000\n", "50% 3.000000\n", "75% 6.000000\n", "max 133.000000\n", "Name: punct, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['punct'].describe()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAETlJREFUeJzt3X+MldWdx/H3lx+Kdqu0QI2dsQ6mtEvlhtpOAbe4G6uhUJdiVLKauoWWyG4jtOq2IptNbNps0h+bUrUNK5ZuaUIsXTWLrK7dVlu3Jmod/NGBsg1UWb2Lq1OKLKvSAp79Yx5wxBnm3pn7Y+7h/UrIPD/Oc+73ejKfOZ557jORUkKSlK9RzS5AklRfBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpcwa9JGVuTLMLAJg4cWLq6OhodhmS1FI2b97825TSpMHajYig7+jooKurq9llSFJLiYj/qqSdSzeSlDmDXpIyZ9BLUuZGxBq9JFXiwIEDlMtl9u/f3+xSGmrcuHG0t7czduzYIV1v0EtqGeVymbe+9a10dHQQEc0upyFSSuzevZtyuczkyZOH1IdLN5Jaxv79+5kwYcJxE/IAEcGECROG9X8xBr2klnI8hfxhw33PBr0kVWjnzp1Mmzat2WVUzTX6Acy/5aG69Ltp+ey69Csdj2r9fZrr96czekmqwqFDh7jqqqs4++yzmTNnDq+++iq33XYbH/rQh5g+fTqXXnopr7zyCgCLFy/mM5/5DOeffz5nnXUWDz74IJ/+9KeZOnUqixcvbljNBr0kVWH79u1cffXVbN26lfHjx3PnnXdyySWX8Nhjj/HUU08xdepU1q5de6T9nj17eOCBB1i1ahXz58/n2muvZevWrXR3d/Pkk082pGaDXpKqMHnyZN7//vcD8MEPfpCdO3eyZcsWzjvvPEqlEuvXr2fr1q1H2s+fP5+IoFQqcdppp1EqlRg1ahRnn302O3fubEjNBr0kVeHEE088sj169GgOHjzI4sWL+da3vkV3dzc33njjG26FPNx+1KhRb7h21KhRHDx4sCE1G/SSNEz79u3j9NNP58CBA6xfv77Z5byJd91I0jB9+ctfZubMmZx55pmUSiX27dvX7JLeIFJKza6Bzs7ONNKeR+/tldLIs23bNqZOndrsMpqiv/ceEZtTSp2DXevSjSRlzqCXpMy1/Bp9vZZYJCkXzuglKXMGvSRlzqCXpMwZ9JKUuZb/Zayk49itf1bb/v7qwdr2N0I4o5ekCr388stcdNFFTJ8+nWnTprFhwwY6OjpYsWIFM2bMYMaMGezYsQOATZs2MXPmTM455xwuvPBCXnjhBQC++MUvsmjRIubMmUNHRwd33XUX119/PaVSiblz53LgwIGa123QS1KF7rvvPt75znfy1FNPsWXLFubOnQvAKaecwi9+8QuWLVvGNddcA8Ds2bN55JFHeOKJJ7j88sv52te+dqSf3/zmN9xzzz1s3LiRK6+8kvPPP5/u7m5OOukk7rnnnprXbdBLUoVKpRI/+clPWLFiBT//+c859dRTAbjiiiuOfH344YcBKJfLfPSjH6VUKvH1r3/9DY8unjdvHmPHjqVUKnHo0KEjPzBKpVJdHl1s0EtShd7znvewefNmSqUSK1eu5Etf+hLwxj/efXh7+fLlLFu2jO7ubm699dYBH108duzYI9fU69HFBr0kVWjXrl2cfPLJXHnllXz+85/n8ccfB2DDhg1Hvp577rkA7N27l7a2NgDWrVvXnIIL3nUjSRXq7u7mC1/4wpGZ+OrVq7nsssv4/e9/z8yZM3nttde4/fbbgd5fui5cuJC2tjZmzZrFM88807S6W/4xxa32rBsfUywN3Uh8THFHRwddXV1MnDixrq/jY4olSQNy6UaShqFRf+B7OCqa0UfEtRGxNSK2RMTtETEuIiZHxKMRsT0iNkTECUXbE4v9HcX5jnq+AUnSsQ0a9BHRBnwW6EwpTQNGA5cDXwVWpZSmAHuAJcUlS4A9KaV3A6uKdpJUEyPh94qNNtz3XOka/RjgpIgYA5wMPA98BLijOL8OuLjYXlDsU5y/IPreZCpJQzRu3Dh27959XIV9Sondu3czbty4Ifcx6Bp9Sum/I+IfgGeBV4F/BzYDL6WUDt/ZXwbaiu024Lni2oMRsReYAPx2yFVKEtDe3k65XKanp6fZpTTUuHHjaG9vH/L1gwZ9RLyN3ln6ZOAl4J+Bef00Pfwjtr/Z+5t+/EbEUmApwLve9a4Ky5V0PBs7diyTJ09udhktp5KlmwuBZ1JKPSmlA8BdwJ8A44ulHIB2YFexXQbOACjOnwr87uhOU0prUkqdKaXOSZMmDfNtSJIGUknQPwvMioiTi7X2C4BfAT8FLivaLAI2Ftt3F/sU5x9Ix9OCmiSNMIMGfUrpUXp/qfo40F1cswZYAVwXETvoXYNfW1yyFphQHL8OuKEOdUuSKlTRB6ZSSjcCNx51+GlgRj9t9wMLh1+aJKkWfASCJGXOoJekzPmsm0zU4ymePmlTyoMzeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypyfjG2wenyCVZKOxRm9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKXEVBHxHjI+KOiPjPiNgWEedGxNsj4scRsb34+raibUTEzRGxIyJ+GREfqO9bkCQdS6Uz+puA+1JKfwxMB7YBNwD3p5SmAPcX+wDzgCnFv6XA6ppWLEmqyqBBHxGnAH8KrAVIKf0hpfQSsABYVzRbB1xcbC8Avp96PQKMj4jTa165JKkilczozwJ6gH+KiCci4jsR8RbgtJTS8wDF13cU7duA5/pcXy6OSZKaoJKgHwN8AFidUjoHeJnXl2n6E/0cS29qFLE0Iroioqunp6eiYiVJ1ask6MtAOaX0aLF/B73B/8LhJZni64t92p/R5/p2YNfRnaaU1qSUOlNKnZMmTRpq/ZKkQQwa9Cml/wGei4j3FocuAH4F3A0sKo4tAjYW23cDnyzuvpkF7D28xCNJarwxFbZbDqyPiBOAp4FP0ftD4ocRsQR4FlhYtL0X+BiwA3ilaCtJapKKgj6l9CTQ2c+pC/ppm4Crh1mXJKlG/GSsJGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpc2OaXYBGrvm3PFSXfjctn12XfiX1zxm9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUuYqDPiJGR8QTEfGvxf7kiHg0IrZHxIaIOKE4fmKxv6M431Gf0iVJlahmRv85YFuf/a8Cq1JKU4A9wJLi+BJgT0rp3cCqop0kqUkqCvqIaAcuAr5T7AfwEeCOosk64OJie0GxT3H+gqK9JKkJKp3RfxO4Hnit2J8AvJRSOljsl4G2YrsNeA6gOL+3aC9JaoJBgz4i/hx4MaW0ue/hfpqmCs717XdpRHRFRFdPT09FxUqSqlfJjP7DwMcjYifwA3qXbL4JjI+Iw3+4pB3YVWyXgTMAivOnAr87utOU0pqUUmdKqXPSpEnDehOSpIENGvQppZUppfaUUgdwOfBASukTwE+By4pmi4CNxfbdxT7F+QdSSm+a0UuSGmM499GvAK6LiB30rsGvLY6vBSYUx68DbhheiZKk4ajqb8amlH4G/KzYfhqY0U+b/cDCGtQmSaoBPxkrSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMlfV7ZVSLcy/5aG69Ltp+ey69Cu1Omf0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZGzToI+KMiPhpRGyLiK0R8bni+Nsj4scRsb34+rbieETEzRGxIyJ+GREfqPebkCQNrJIZ/UHgb1JKU4FZwNUR8T7gBuD+lNIU4P5iH2AeMKX4txRYXfOqJUkVGzToU0rPp5QeL7b3AduANmABsK5otg64uNheAHw/9XoEGB8Rp9e8cklSRapao4+IDuAc4FHgtJTS89D7wwB4R9GsDXiuz2Xl4tjRfS2NiK6I6Orp6am+cklSRSoO+oj4I+BO4JqU0v8eq2k/x9KbDqS0JqXUmVLqnDRpUqVlSJKqVFHQR8RYekN+fUrpruLwC4eXZIqvLxbHy8AZfS5vB3bVplxJUrUquesmgLXAtpTSN/qcuhtYVGwvAjb2Of7J4u6bWcDew0s8kqTGG1NBmw8Dfwl0R8STxbG/Bb4C/DAilgDPAguLc/cCHwN2AK8An6ppxZKkqgwa9Cmlh+h/3R3ggn7aJ+DqYdYlSaqRSmb0UkuYf8tDNe9z0/LZNe9TajQfgSBJmTPoJSlzLt1oQN946XN16fe68TfVpV9J/TPo1XD+AJEay6UbScqcQS9JmXPpJhP1Wg6R1Pqc0UtS5gx6ScqcQS9JmTPoJSlzBr0kZc67bhrMu2MkNZozeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypyfjB2An2CVlAtn9JKUOYNekjJn0EtS5gx6Scqcv4yVjmH+LQ/Vpd9Ny2fXpV+pP87oJSlzzuiVjXrcEnvd+Jtq3qfUaM7oJSlzBr0kZa7ll278BKskHZszeknKXF1m9BExF7gJGA18J6X0lXq8jtSqvG1TjVTzGX1EjAa+DcwD3gdcERHvq/XrSJIqU4+lmxnAjpTS0ymlPwA/ABbU4XUkSRWox9JNG/Bcn/0yMLMOryPVXb1+2V+v+/NdElJ/6hH00c+x9KZGEUuBpcXu/0XEr4vtU4G9A/Td37mJwG+HUGc9Hes9NLPfaq+vpP1w2wx0bqDjmYz3eXXqd8jXHrN9fLbifocy1gOdy2Ss69rvmRW1SinV9B9wLvCjPvsrgZVVXL+mmnNAV63fQw3+Gwz4HprZb7XXV9J+uG0GOneM4453k8a6knZDGeuBzjnWtftXjzX6x4ApETE5Ik4ALgfuruL6TUM8N5LUq87h9lvt9ZW0H26bgc61yljDyBzveox1Je2G+v3bKuM9Esd6UFH8NKltpxEfA75J7+2V300p/X3NX+T11+pKKXXWq3+NLI738cOxrp263EefUroXuLceffdjTYNeRyOD4338cKxrpC4zeknSyOEjECQpcwa9JGXOoJekzGUX9BHxlohYFxG3RcQnml2P6icizoqItRFxR7NrUf1FxMXF9/XGiJjT7HpaSUsEfUR8NyJejIgtRx2fGxG/jogdEXFDcfgS4I6U0lXAxxterIalmrFOvc9TWtKcSlULVY73vxTf14uBv2hCuS2rJYIe+B4wt++BYzwls53Xn7VzqIE1qja+R+Vjrdb3Paof778rzqtCLRH0KaX/AH531OGBnpJZpjfsoUXen15X5VirxVUz3tHrq8C/pZQeb3StrayVg7C/p2S2AXcBl0bEalrnY9U6tn7HOiImRMQ/AudExMrmlKY6GOh7ezlwIXBZRPx1MwprVa38N2P7fUpmSull4FONLkZ1NdBY7wb8hs/PQON9M3Bzo4vJQSvP6MvAGX3224FdTapF9eVYH18c7xpr5aAf7lMy1Toc6+OL411jLRH0EXE78DDw3ogoR8SSlNJBYBnwI2Ab8MOU0tZm1qnhc6yPL453Y/hQM0nKXEvM6CVJQ2fQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjL3//lUoXk3DdZMAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.xscale('log')\n", "bins = 1.5**(np.arange(0,15))\n", "plt.hist(df[df['label']=='ham']['punct'],bins=bins,alpha=0.8)\n", "plt.hist(df[df['label']=='spam']['punct'],bins=bins,alpha=0.8)\n", "plt.legend(('ham','spam'))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks even worse - there seem to be no values where one would pick spam over ham. We'll still try to build a machine learning classification model, but we should expect poor results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Split the data into train & test sets:\n", "\n", "If we wanted to divide the DataFrame into two smaller sets, we could use\n", "> `train, test = train_test_split(df)`\n", "\n", "For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the `label` column in our data. For Features we'll use the `length` and `punct` columns. *By convention, **X** is capitalized and **y** is lowercase.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting features\n", "There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:\n", "> `X = df[['length','punct']]`\n", "\n", "If the number of features is large, then it may be easier to drop the Label and any other unwanted columns:\n", "> `X = df.drop(['label','message'], axis=1)`\n", "\n", "These operations make copies of **df**, but do not change the original DataFrame in place. All the original data is preserved." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Create Feature and Label sets\n", "X = df[['length','punct']] # note the double set of brackets\n", "y = df['label']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional train/test/split arguments:\n", "The default test size for `train_test_split` is 30%. Here we'll assign 33% of the data for testing.
\n", "Also, we can set a `random_state` seed value to ensure that everyone uses the same \"random\" training & testing sets." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training Data Shape: (3733, 2)\n", "Testing Data Shape: (1839, 2)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n", "\n", "print('Training Data Shape:', X_train.shape)\n", "print('Testing Data Shape: ', X_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can pass these sets into a series of different training & testing algorithms and compare their results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Train a Logistic Regression classifier\n", "One of the simplest multi-class classification tools is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Scikit-learn offers a variety of algorithmic solvers; we'll use [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS). " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='warn',\n", " n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',\n", " tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr_model = LogisticRegression(solver='lbfgs')\n", "\n", "lr_model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test the Accuracy of the Model" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1547 46]\n", " [ 241 5]]\n" ] } ], "source": [ "from sklearn import metrics\n", "\n", "# Create a prediction set:\n", "predictions = lr_model.predict(X_test)\n", "\n", "# Print a confusion matrix\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hamspam
ham154746
spam2415
\n", "
" ], "text/plain": [ " ham spam\n", "ham 1547 46\n", "spam 241 5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can make the confusion matrix less confusing by adding labels:\n", "df = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['ham','spam'], columns=['ham','spam'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results are terrible! More spam messages were confused as ham (241) than correctly identified as spam (5), although a relatively small number of ham messages (46) were confused as spam." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.87 0.97 0.92 1593\n", " spam 0.10 0.02 0.03 246\n", "\n", " micro avg 0.84 0.84 0.84 1839\n", " macro avg 0.48 0.50 0.47 1839\n", "weighted avg 0.76 0.84 0.80 1839\n", "\n" ] } ], "source": [ "# Print a classification report\n", "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.84393692224\n" ] } ], "source": [ "# Print the overall accuracy\n", "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model performed *worse* than a classifier that assigned all messages as \"ham\" would have!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Train a naïve Bayes classifier:\n", "One of the most common - and successful - classifiers is [naïve Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "nb_model = MultinomialNB()\n", "\n", "nb_model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run predictions and report on metrics" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1583 10]\n", " [ 246 0]]\n" ] } ], "source": [ "predictions = nb_model.predict(X_test)\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total number of confusions dropped from **287** to **256**. [241+46=287, 246+10=256]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.87 0.99 0.93 1593\n", " spam 0.00 0.00 0.00 246\n", "\n", " micro avg 0.86 0.86 0.86 1839\n", " macro avg 0.43 0.50 0.46 1839\n", "weighted avg 0.75 0.86 0.80 1839\n", "\n" ] } ], "source": [ "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.860793909734\n" ] } ], "source": [ "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better, but still less accurate than 86.6%" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Train a support vector machine (SVM) classifier\n", "Among the SVM options available, we'll use [C-Support Vector Classification (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.svm import SVC\n", "svc_model = SVC(gamma='auto')\n", "svc_model.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run predictions and report on metrics" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1515 78]\n", " [ 131 115]]\n" ] } ], "source": [ "predictions = svc_model.predict(X_test)\n", "print(metrics.confusion_matrix(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total number of confusions dropped even further to **209**." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.92 0.95 0.94 1593\n", " spam 0.60 0.47 0.52 246\n", "\n", " micro avg 0.89 0.89 0.89 1839\n", " macro avg 0.76 0.71 0.73 1839\n", "weighted avg 0.88 0.89 0.88 1839\n", "\n" ] } ], "source": [ "print(metrics.classification_report(y_test,predictions))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.886351277868\n" ] } ], "source": [ "print(metrics.accuracy_score(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we have a model that performs *slightly* better than random chance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! Now you should be able to load a dataset, divide it into training and testing sets, and perform simple analyses using scikit-learn.\n", "## Next up: Feature Extraction from Text" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }