{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scikit-learn Primer\n",
"\n",
"**Scikit-learn** (http://scikit-learn.org/) is an open-source machine learning library for Python that offers a variety of regression, classification and clustering algorithms.\n",
"\n",
"In this section we'll perform a fairly simple classification exercise with scikit-learn. In the next section we'll leverage the machine learning strength of scikit-learn to perform natural language classifications."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Installation and Setup\n",
"\n",
"### From the command line or terminal:\n",
"> `conda install scikit-learn`\n",
"> *or* \n",
"> `pip install -U scikit-learn`\n",
"\n",
"Scikit-learn additionally requires that NumPy and SciPy be installed. For more info visit http://scikit-learn.org/stable/install.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Perform Imports and Load Data\n",
"For this exercise we'll be using the **SMSSpamCollection** dataset from [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) that contains more than 5 thousand SMS phone messages. You can check out the [**sms_readme**](../TextFiles/sms_readme.txt) file for more info.\n",
"\n",
"The file is a [tab-separated-values](https://en.wikipedia.org/wiki/Tab-separated_values) (tsv) file with four columns:\n",
"> **label** - every message is labeled as either ***ham*** or ***spam*** \n",
"> **message** - the message itself \n",
"> **length** - the number of characters in each message \n",
"> **punct** - the number of punctuation characters in each message"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
label
\n",
"
message
\n",
"
length
\n",
"
punct
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
ham
\n",
"
Go until jurong point, crazy.. Available only ...
\n",
"
111
\n",
"
9
\n",
"
\n",
"
\n",
"
1
\n",
"
ham
\n",
"
Ok lar... Joking wif u oni...
\n",
"
29
\n",
"
6
\n",
"
\n",
"
\n",
"
2
\n",
"
spam
\n",
"
Free entry in 2 a wkly comp to win FA Cup fina...
\n",
"
155
\n",
"
6
\n",
"
\n",
"
\n",
"
3
\n",
"
ham
\n",
"
U dun say so early hor... U c already then say...
\n",
"
49
\n",
"
6
\n",
"
\n",
"
\n",
"
4
\n",
"
ham
\n",
"
Nah I don't think he goes to usf, he lives aro...
\n",
"
61
\n",
"
2
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" label message length punct\n",
"0 ham Go until jurong point, crazy.. Available only ... 111 9\n",
"1 ham Ok lar... Joking wif u oni... 29 6\n",
"2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155 6\n",
"3 ham U dun say so early hor... U c already then say... 49 6\n",
"4 ham Nah I don't think he goes to usf, he lives aro... 61 2"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv('../TextFiles/smsspamcollection.tsv', sep='\\t')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5572"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check for missing values:\n",
"Machine learning models usually require complete data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"label 0\n",
"message 0\n",
"length 0\n",
"punct 0\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Take a quick look at the *ham* and *spam* `label` column:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array(['ham', 'spam'], dtype=object)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ham 4825\n",
"spam 747\n",
"Name: label, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that 4825 out of 5572 messages, or 86.6%, are ham. This means that any machine learning model we create has to perform **better than 86.6%** to beat random chance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the data:\n",
"Since we're not ready to do anything with the message text, let's see if we can predict ham/spam labels based on message length and punctuation counts. We'll look at message `length` first:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 5572.000000\n",
"mean 80.489950\n",
"std 59.942907\n",
"min 2.000000\n",
"25% 36.000000\n",
"50% 62.000000\n",
"75% 122.000000\n",
"max 910.000000\n",
"Name: length, dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['length'].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is extremely skewed. The mean value is 80.5 and yet the max length is 910. Let's plot this on a logarithmic x-axis."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFnBJREFUeJzt3X2QXXWd5/H3lxjJOCLR0LCxO9pxiLMBugjaJrBSNYgsBBwEFGZgR02UJUoBKzpqiGUVjC5VigtRxtlIMCxhi+VhgVnCw+giT0oVTx0GCCFjEaFXepMibXgYEGFJ+O4ffRKbeDv39n3o2336/arquvf8zu+c+21+3M89+fW550RmIkkqrz3aXYAkqbUMekkqOYNekkrOoJekkjPoJankDHpJKjmDXpJKzqCXpJIz6CWp5Ax6SSq5t7W7AIB99tknu7u7212GJE0oa9eu/W1mdlTrNy6Cvru7m76+vnaXIUkTSkT8n1r6OXUjSSVn0EtSydUc9BExJSL+OSJuLZZnR8SDEfFURFwXEW8v2vcsljcW67tbU7okqRajmaP/MrABeFex/D1geWZeGxE/Bk4HVhSPL2Tm/hFxatHvr5tYs6RJ6o033mBgYIDXXnut3aWMqWnTptHV1cXUqVPr2r6moI+ILuATwIXAVyMigCOB/1B0WQ1cwFDQn1A8B7gB+FFERHqHE0kNGhgYYK+99qK7u5uhGCq/zGTr1q0MDAwwe/bsuvZR69TND4BvAG8WyzOAFzNzW7E8AHQWzzuBZ4sCtwEvFf0lqSGvvfYaM2bMmDQhDxARzJgxo6F/xVQN+oj4S2BLZq4d3lyha9awbvh+l0REX0T0DQ4O1lSsJE2mkN+h0d+5liP6jwKfjIh+4FqGpmx+AEyPiB1TP13ApuL5ADCrKO5twN7A87vuNDNXZmZvZvZ2dFQ931+SxoX+/n4OOuigdpcxKlXn6DNzGbAMICKOAL6WmX8TEf8TOJmh8F8E3FxssqZYvr9Yf5fz8yq74//+vortt5xz+BhXMrmM9N+9XmUdr0bOo1/K0B9mNzI0B7+qaF8FzCjavwqc11iJkjS+bN++nTPOOIMDDzyQo48+mt///vdcfvnlfOQjH+Hggw/m05/+NK+++ioAixcv5swzz+RjH/sYH/jAB7j33nv5whe+wNy5c1m8ePGY1DuqoM/MezLzL4vnT2fm/MzcPzNPyczXi/bXiuX9i/VPt6JwSWqXp556irPOOov169czffp0brzxRj71qU/x8MMP89hjjzF37lxWrVq1s/8LL7zAXXfdxfLlyzn++OP5yle+wvr161m3bh2PPvpoy+sdF9e6kcrKKZ1ymj17NvPmzQPgwx/+MP39/TzxxBN861vf4sUXX+SVV17hmGOO2dn/+OOPJyLo6elhv/32o6enB4ADDzyQ/v7+nftqFS+BIEmjtOeee+58PmXKFLZt28bixYv50Y9+xLp16zj//PPfcjrkjv577LHHW7bdY4892LZtG61m0EtSE7z88svMnDmTN954g6uvvrrd5byFUzeS1ATf+c53WLBgAe9///vp6enh5ZdfbndJO8V4OPOxt7c3vR69JrLRnubnHH19NmzYwNy5c9tdRltU+t0jYm1m9lbb1qkbSSo5g16SSs6gl6SSM+glqeQMekkqOYNekkrOoJekkvMLU5Imrsv+orn7++K9zd3fOOERvSTV6He/+x2f+MQnOPjggznooIO47rrr6O7uZunSpcyfP5/58+ezceNGAG655RYWLFjAIYccwlFHHcVzzz0HwAUXXMCiRYs4+uij6e7u5qabbuIb3/gGPT09LFy4kDfeeKPpdXtEL41Cs290oYnlpz/9Ke9973u57bbbAHjppZdYunQp73rXu3jooYe46qqrOPfcc7n11ls5/PDDeeCBB4gIfvKTn3DRRRdx8cUXA/DrX/+au+++myeffJLDDjuMG2+8kYsuuoiTTjqJ2267jRNPPLGpdXtEL0k16unp4ec//zlLly7ll7/8JXvvvTcAp5122s7H+++/H4CBgQGOOeYYenp6+P73v8/69et37ufYY49l6tSp9PT0sH37dhYuXLhz//39/U2v26CXpBp98IMfZO3atfT09LBs2TK+/e1vA2+9efeO5+eccw5nn30269at47LLLhvxssVTp07duU2rLltcNegjYlpEPBQRj0XE+oj4u6L9yoh4JiIeLX7mFe0REZdGxMaIeDwiPtT0qiWpDTZt2sQ73vEOPvOZz/C1r32NRx55BIDrrrtu5+Nhhx0GDE3rdHZ2ArB69er2FFyoZY7+deDIzHwlIqYC90XEPxXrvp6ZN+zS/1hgTvGzAFhRPErShLZu3Tq+/vWv7zwSX7FiBSeffDKvv/46CxYs4M033+Saa64Bhv7oesopp9DZ2cmhhx7KM88807a6R3WZ4oh4B3AfcGbxc+uuQR8RlwH3ZOY1xfKvgCMyc/NI+/UyxZoomvXHWC9TXJ/xeJni7u5u+vr62GeffVr6Oi2/THFETImIR4EtwB2Z+WCx6sJiemZ5ROy4P1Yn8OywzQeKNklSG9QU9Jm5PTPnAV3A/Ig4CFgG/FvgI8B7gKVF96i0i10bImJJRPRFRN/g4GBdxUtSu/X397f8aL5RozrrJjNfBO4BFmbm5hzyOvDfgPlFtwFg1rDNuoBNFfa1MjN7M7O3o6OjruIlSdXVctZNR0RML57/CXAU8C8RMbNoC+BE4IlikzXA54qzbw4FXtrd/LwkjcZ4uP3pWGv0d67lrJuZwOqImMLQB8P1mXlrRNwVER0MTdU8Cnyp6H87cBywEXgV+HxDFUpSYdq0aWzdupUZM2a85dz1MstMtm7dyrRp0+reR9Wgz8zHgUMqtB85Qv8Ezqq7IkkaQVdXFwMDA0y2v+tNmzaNrq6uurf3WjeSJoypU6cye/bsdpcx4XgJBEkqOY/oNWmN9OUnv8yksvGIXpJKzqCXpJJz6kbahTcXUdl4RC9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxfmFLp+QUoTXYGvdQG9Xz4eLE11cupG0kquVruGTstIh6KiMciYn1E/F3RPjsiHoyIpyLiuoh4e9G+Z7G8sVjf3dpfQZK0O7Uc0b8OHJmZBwPzgIXFTb+/ByzPzDnAC8DpRf/TgRcyc39gedFPktQmVYM+h7xSLE4tfhI4ErihaF8NnFg8P6FYplj/8Zgsd/GVpHGopj/GRsQUYC2wP/APwK+BFzNzW9FlAOgsnncCzwJk5raIeAmYAfy2iXVLk453xFK9avpjbGZuz8x5QBcwH5hbqVvxWOnoPXdtiIglEdEXEX2T7Y7ukjSWRnXWTWa+CNwDHApMj4gd/yLoAjYVzweAWQDF+r2B5yvsa2Vm9mZmb0dHR33VS5KqquWsm46ImF48/xPgKGADcDdwctFtEXBz8XxNsUyx/q7M/KMjeknS2Khljn4msLqYp98DuD4zb42IJ4FrI+I/A/8MrCr6rwL+e0RsZOhI/tQW1C1JqlHVoM/Mx4FDKrQ/zdB8/a7trwGnNKU6SVLDvASCpPpd9heV279479jWod3yEgiSVHIGvSSVnEEvSSVn0EtSyRn0klRynnUjTXBeA0fVeEQvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcp5HL6mqEc/Vf/sYF6K6eEQvSSVn0EtSyVWduomIWcBVwL8B3gRWZuYPI+IC4AxgsOj6zcy8vdhmGXA6sB34T5n5sxbULmmMXPLilyuv2PedY1uI6lLLHP024G8z85GI2AtYGxF3FOuWZ+Z/Gd45Ig5g6D6xBwLvBX4eER/MzO3NLFySVJuqUzeZuTkzHymevwxsADp3s8kJwLWZ+XpmPgNspMK9ZSVJY2NUc/QR0c3QjcIfLJrOjojHI+KKiHh30dYJPDtsswF2/8EgSWqhmoM+It4J3Aicm5n/CqwA/gyYB2wGLt7RtcLmWWF/SyKiLyL6BgcHK2wiSWqGmoI+IqYyFPJXZ+ZNAJn5XGZuz8w3gcv5w/TMADBr2OZdwKZd95mZKzOzNzN7Ozo6GvkdJEm7UTXoIyKAVcCGzLxkWPvMYd1OAp4onq8BTo2IPSNiNjAHeKh5JUuSRqOWs24+CnwWWBcRjxZt3wROi4h5DE3L9ANfBMjM9RFxPfAkQ2fsnOUZN9LEMNI3YC+p2KqJomrQZ+Z9VJ53v30321wIXNhAXZKkJvGbsZJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klVwt17qRpIqe2vJKxfY5Y1yHds8jekkqOYNekkrOoJekkjPoJank/GOsVFIj3UTklnMOH+NK1G4GvTTJjPQBoPJy6kaSSq6Wm4PPioi7I2JDRKyPiC8X7e+JiDsi4qni8d1Fe0TEpRGxMSIej4gPtfqXkCSNrJYj+m3A32bmXOBQ4KyIOAA4D7gzM+cAdxbLAMcy9H2JOcASYEXTq5Yk1axq0Gfm5sx8pHj+MrAB6AROAFYX3VYDJxbPTwCuyiEPANMjYmbTK5ck1WRUc/QR0Q0cAjwI7JeZm2HowwDYt+jWCTw7bLOBom3XfS2JiL6I6BscHBx95ZKkmtQc9BHxTuBG4NzM/Nfdda3Qln/UkLkyM3szs7ejo6PWMiRJo1RT0EfEVIZC/urMvKlofm7HlEzxuKVoHwBmDdu8C9jUnHIlSaNVy1k3AawCNmTmJcNWrQEWFc8XATcPa/9ccfbNocBLO6Z4JEljr5YvTH0U+CywLiIeLdq+CXwXuD4iTgd+A5xSrLsdOA7YCLwKfL6pFUuSRqVq0GfmfVSedwf4eIX+CZzVYF2SpCbxm7GSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRytdwz9oqI2BIRTwxruyAi/m9EPFr8HDds3bKI2BgRv4qIY1pVuCSpNrUc0V8JLKzQvjwz5xU/twNExAHAqcCBxTb/NSKmNKtYSdLoVQ36zPwF8HyN+zsBuDYzX8/MZxi6Qfj8BuqTJDWokTn6syPi8WJq591FWyfw7LA+A0WbJKlN6g36FcCfAfOAzcDFRXtU6JuVdhARSyKiLyL6BgcH6yxDklRNXUGfmc9l5vbMfBO4nD9MzwwAs4Z17QI2jbCPlZnZm5m9HR0d9ZQhSapBXUEfETOHLZ4E7DgjZw1wakTsGRGzgTnAQ42VKElqxNuqdYiIa4AjgH0iYgA4HzgiIuYxNC3TD3wRIDPXR8T1wJPANuCszNzemtIlSbWoGvSZeVqF5lW76X8hcGEjRUmSmsdvxkpSyRn0klRyBr0klZxBL0klZ9BLUskZ9JJUcga9JJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUslVDfqIuCIitkTEE8Pa3hMRd0TEU8Xju4v2iIhLI2JjRDweER9qZfGSpOpqOaK/Eli4S9t5wJ2ZOQe4s1gGOJahG4LPAZYAK5pTpiSpXlWDPjN/ATy/S/MJwOri+WrgxGHtV+WQB4DpETGzWcVKkkav3jn6/TJzM0DxuG/R3gk8O6zfQNEmSWqTZv8xNiq0ZcWOEUsioi8i+gYHB5tchiRph3qD/rkdUzLF45aifQCYNaxfF7Cp0g4yc2Vm9mZmb0dHR51lSJKqqTfo1wCLiueLgJuHtX+uOPvmUOClHVM8kqT2eFu1DhFxDXAEsE9EDADnA98Fro+I04HfAKcU3W8HjgM2Aq8Cn29BzZKkUaga9Jl52girPl6hbwJnNVqUJKl5/GasJJWcQS9JJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klZxBL0klZ9BLUslVvdaNpMnjkhe/3O4S1AIe0UtSyRn0klRyBr0klZxBL0klZ9BLUsk1dNZNRPQDLwPbgW2Z2RsR7wGuA7qBfuCvMvOFxsqUJNWrGUf0H8vMeZnZWyyfB9yZmXOAO4tlSVKbtGLq5gRgdfF8NXBiC15DklSjRoM+gf8dEWsjYknRtl9mbgYoHvdt8DUkSQ1o9JuxH83MTRGxL3BHRPxLrRsWHwxLAN73vvc1WIYkaSQNBX1mbioet0TEPwLzgeciYmZmbo6ImcCWEbZdCawE6O3tzUbqkACO//v72l3ChDAWlzkYaSxuOefwlr+2/ljdUzcR8acRsdeO58DRwBPAGmBR0W0RcHOjRUqS6tfIEf1+wD9GxI79/I/M/GlEPAxcHxGnA78BTmm8TElSveoO+sx8Gji4QvtW4OONFCVJah6/GStJJWfQS1LJGfSSVHIGvSSVnEEvSSVn0EtSyRn0klRyBr0klVyjFzWTxpzXtJlcvG5O4wx6SWNmdx/SBnfrGPSSxgX/pdY6Br3ayje31HoGvTRBjHQd+a9O/+Go+mvyMeglTUj+kbZ2Br00wXnkrmoM+hIaj0c6zsVL7WPQq64PhvH4YTIejXZeXa01Wf+/bVnQR8RC4IfAFOAnmfndVr3WRDJZ/0eTxor/evxjLQn6iJgC/APw74EB4OGIWJOZT7bi9cqsnV8w8Q3THmWYc/dfMuNLq47o5wMbi/vKEhHXAicAEzLoy3IUPhbBXfYPhzKEsGpXlvd+q4K+E3h22PIAsKBFr6UJrJ4jv2adT97O888n6wfG7n7vdh7tj/YApZkHNGPxoRGZ2fydRpwCHJOZ/7FY/iwwPzPPGdZnCbCkWPxz4FcVdrU38FKVtn2A3zap9NGqVN9Y7afWbar12936kdbVMi4wOcfGcdk93zMjt9UzLu/PzI6qvTKz6T/AYcDPhi0vA5bVsZ+V1dqAvlb8DvXWN1b7qXWbav12t36kdbWMy2QdG8dlfI7LRBibVo5Lq65H/zAwJyJmR8TbgVOBNXXs55Ya29qlWbXUs59at6nWb3frR1o33scF2jc2jsvu+Z6p/XWapiVTNwARcRzwA4ZOr7wiMy9s0ev0ZWZvK/atxjg245PjMj61clxadh59Zt4O3N6q/Q+zcgxeQ/VxbMYnx2V8atm4tOyIXpI0PnjPWEkqOYNekkrOoJekkitd0EfEn0bE6oi4PCL+pt31aEhEfCAiVkXEDe2uRW8VEScW75ebI+LodtejIRExNyJ+HBE3RMSZjexrQgR9RFwREVsi4old2hdGxK8iYmNEnFc0fwq4ITPPAD455sVOIqMZl8x8OjNPb0+lk88ox+Z/Fe+XxcBft6HcSWOU47IhM78E/BXQ0GmXEyLogSuBhcMbhl0h81jgAOC0iDgA6OIP19nZPoY1TkZXUvu4aGxdyejH5lvFerXOlYxiXCLik8B9wJ2NvOiECPrM/AXw/C7NO6+QmZn/D9hxhcwBhsIeJsjvN1GNclw0hkYzNjHke8A/ZeYjY13rZDLa90xmrsnMfwc0NA09kYOw0hUyO4GbgE9HxArG39e/J4OK4xIRMyLix8AhEbGsPaVNeiO9Z84BjgJOjogvtaOwSW6k98wREXFpRFxGg18+nci3EowKbZmZvwM+P9bFaKeRxmUrYIi010hjcylw6VgXo51GGpd7gHua8QIT+Yh+AJg1bLkL2NSmWvQHjsv45diMTy0fl4kc9M26Qqaay3EZvxyb8anl4zIhgj4irgHuB/48IgYi4vTM3AacDfwM2ABcn5nr21nnZOO4jF+OzfjUrnHxomaSVHIT4oheklQ/g16SSs6gl6SSM+glqeQMekkqOYNekkrOoJekkjPoJankDHpJKrn/DyDmFU1Id+LIAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"plt.xscale('log')\n",
"bins = 1.15**(np.arange(0,50))\n",
"plt.hist(df[df['label']=='ham']['length'],bins=bins,alpha=0.8)\n",
"plt.hist(df[df['label']=='spam']['length'],bins=bins,alpha=0.8)\n",
"plt.legend(('ham','spam'))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like there's a small range of values where a message is more likely to be spam than ham.\n",
"\n",
"Now let's look at the `punct` column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 5572.000000\n",
"mean 4.177495\n",
"std 4.623919\n",
"min 0.000000\n",
"25% 2.000000\n",
"50% 3.000000\n",
"75% 6.000000\n",
"max 133.000000\n",
"Name: punct, dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['punct'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAETlJREFUeJzt3X+MldWdx/H3lx+Kdqu0QI2dsQ6mtEvlhtpOAbe4G6uhUJdiVLKauoWWyG4jtOq2IptNbNps0h+bUrUNK5ZuaUIsXTWLrK7dVlu3Jmod/NGBsg1UWb2Lq1OKLKvSAp79Yx5wxBnm3pn7Y+7h/UrIPD/Oc+73ejKfOZ557jORUkKSlK9RzS5AklRfBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpcwa9JGVuTLMLAJg4cWLq6OhodhmS1FI2b97825TSpMHajYig7+jooKurq9llSFJLiYj/qqSdSzeSlDmDXpIyZ9BLUuZGxBq9JFXiwIEDlMtl9u/f3+xSGmrcuHG0t7czduzYIV1v0EtqGeVymbe+9a10dHQQEc0upyFSSuzevZtyuczkyZOH1IdLN5Jaxv79+5kwYcJxE/IAEcGECROG9X8xBr2klnI8hfxhw33PBr0kVWjnzp1Mmzat2WVUzTX6Acy/5aG69Ltp+ey69Csdj2r9fZrr96czekmqwqFDh7jqqqs4++yzmTNnDq+++iq33XYbH/rQh5g+fTqXXnopr7zyCgCLFy/mM5/5DOeffz5nnXUWDz74IJ/+9KeZOnUqixcvbljNBr0kVWH79u1cffXVbN26lfHjx3PnnXdyySWX8Nhjj/HUU08xdepU1q5de6T9nj17eOCBB1i1ahXz58/n2muvZevWrXR3d/Pkk082pGaDXpKqMHnyZN7//vcD8MEPfpCdO3eyZcsWzjvvPEqlEuvXr2fr1q1H2s+fP5+IoFQqcdppp1EqlRg1ahRnn302O3fubEjNBr0kVeHEE088sj169GgOHjzI4sWL+da3vkV3dzc33njjG26FPNx+1KhRb7h21KhRHDx4sCE1G/SSNEz79u3j9NNP58CBA6xfv77Z5byJd91I0jB9+ctfZubMmZx55pmUSiX27dvX7JLeIFJKza6Bzs7ONNKeR+/tldLIs23bNqZOndrsMpqiv/ceEZtTSp2DXevSjSRlzqCXpMy1/Bp9vZZYJCkXzuglKXMGvSRlzqCXpMwZ9JKUuZb/Zayk49itf1bb/v7qwdr2N0I4o5ekCr388stcdNFFTJ8+nWnTprFhwwY6OjpYsWIFM2bMYMaMGezYsQOATZs2MXPmTM455xwuvPBCXnjhBQC++MUvsmjRIubMmUNHRwd33XUX119/PaVSiblz53LgwIGa123QS1KF7rvvPt75znfy1FNPsWXLFubOnQvAKaecwi9+8QuWLVvGNddcA8Ds2bN55JFHeOKJJ7j88sv52te+dqSf3/zmN9xzzz1s3LiRK6+8kvPPP5/u7m5OOukk7rnnnprXbdBLUoVKpRI/+clPWLFiBT//+c859dRTAbjiiiuOfH344YcBKJfLfPSjH6VUKvH1r3/9DY8unjdvHmPHjqVUKnHo0KEjPzBKpVJdHl1s0EtShd7znvewefNmSqUSK1eu5Etf+hLwxj/efXh7+fLlLFu2jO7ubm699dYBH108duzYI9fU69HFBr0kVWjXrl2cfPLJXHnllXz+85/n8ccfB2DDhg1Hvp577rkA7N27l7a2NgDWrVvXnIIL3nUjSRXq7u7mC1/4wpGZ+OrVq7nsssv4/e9/z8yZM3nttde4/fbbgd5fui5cuJC2tjZmzZrFM88807S6W/4xxa32rBsfUywN3Uh8THFHRwddXV1MnDixrq/jY4olSQNy6UaShqFRf+B7OCqa0UfEtRGxNSK2RMTtETEuIiZHxKMRsT0iNkTECUXbE4v9HcX5jnq+AUnSsQ0a9BHRBnwW6EwpTQNGA5cDXwVWpZSmAHuAJcUlS4A9KaV3A6uKdpJUEyPh94qNNtz3XOka/RjgpIgYA5wMPA98BLijOL8OuLjYXlDsU5y/IPreZCpJQzRu3Dh27959XIV9Sondu3czbty4Ifcx6Bp9Sum/I+IfgGeBV4F/BzYDL6WUDt/ZXwbaiu024Lni2oMRsReYAPx2yFVKEtDe3k65XKanp6fZpTTUuHHjaG9vH/L1gwZ9RLyN3ln6ZOAl4J+Bef00Pfwjtr/Z+5t+/EbEUmApwLve9a4Ky5V0PBs7diyTJ09udhktp5KlmwuBZ1JKPSmlA8BdwJ8A44ulHIB2YFexXQbOACjOnwr87uhOU0prUkqdKaXOSZMmDfNtSJIGUknQPwvMioiTi7X2C4BfAT8FLivaLAI2Ftt3F/sU5x9Ix9OCmiSNMIMGfUrpUXp/qfo40F1cswZYAVwXETvoXYNfW1yyFphQHL8OuKEOdUuSKlTRB6ZSSjcCNx51+GlgRj9t9wMLh1+aJKkWfASCJGXOoJekzPmsm0zU4ymePmlTyoMzeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypyfjG2wenyCVZKOxRm9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKXEVBHxHjI+KOiPjPiNgWEedGxNsj4scRsb34+raibUTEzRGxIyJ+GREfqO9bkCQdS6Uz+puA+1JKfwxMB7YBNwD3p5SmAPcX+wDzgCnFv6XA6ppWLEmqyqBBHxGnAH8KrAVIKf0hpfQSsABYVzRbB1xcbC8Avp96PQKMj4jTa165JKkilczozwJ6gH+KiCci4jsR8RbgtJTS8wDF13cU7duA5/pcXy6OSZKaoJKgHwN8AFidUjoHeJnXl2n6E/0cS29qFLE0Iroioqunp6eiYiVJ1ask6MtAOaX0aLF/B73B/8LhJZni64t92p/R5/p2YNfRnaaU1qSUOlNKnZMmTRpq/ZKkQQwa9Cml/wGei4j3FocuAH4F3A0sKo4tAjYW23cDnyzuvpkF7D28xCNJarwxFbZbDqyPiBOAp4FP0ftD4ocRsQR4FlhYtL0X+BiwA3ilaCtJapKKgj6l9CTQ2c+pC/ppm4Crh1mXJKlG/GSsJGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpc2OaXYBGrvm3PFSXfjctn12XfiX1zxm9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUuYqDPiJGR8QTEfGvxf7kiHg0IrZHxIaIOKE4fmKxv6M431Gf0iVJlahmRv85YFuf/a8Cq1JKU4A9wJLi+BJgT0rp3cCqop0kqUkqCvqIaAcuAr5T7AfwEeCOosk64OJie0GxT3H+gqK9JKkJKp3RfxO4Hnit2J8AvJRSOljsl4G2YrsNeA6gOL+3aC9JaoJBgz4i/hx4MaW0ue/hfpqmCs717XdpRHRFRFdPT09FxUqSqlfJjP7DwMcjYifwA3qXbL4JjI+Iw3+4pB3YVWyXgTMAivOnAr87utOU0pqUUmdKqXPSpEnDehOSpIENGvQppZUppfaUUgdwOfBASukTwE+By4pmi4CNxfbdxT7F+QdSSm+a0UuSGmM499GvAK6LiB30rsGvLY6vBSYUx68DbhheiZKk4ajqb8amlH4G/KzYfhqY0U+b/cDCGtQmSaoBPxkrSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMlfV7ZVSLcy/5aG69Ltp+ey69Cu1Omf0kpQ5g16SMmfQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjJn0EtS5gx6ScqcQS9JmTPoJSlzBr0kZc6gl6TMGfSSlDmDXpIyZ9BLUuYMeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypxBL0mZGzToI+KMiPhpRGyLiK0R8bni+Nsj4scRsb34+rbieETEzRGxIyJ+GREfqPebkCQNrJIZ/UHgb1JKU4FZwNUR8T7gBuD+lNIU4P5iH2AeMKX4txRYXfOqJUkVGzToU0rPp5QeL7b3AduANmABsK5otg64uNheAHw/9XoEGB8Rp9e8cklSRapao4+IDuAc4FHgtJTS89D7wwB4R9GsDXiuz2Xl4tjRfS2NiK6I6Orp6am+cklSRSoO+oj4I+BO4JqU0v8eq2k/x9KbDqS0JqXUmVLqnDRpUqVlSJKqVFHQR8RYekN+fUrpruLwC4eXZIqvLxbHy8AZfS5vB3bVplxJUrUquesmgLXAtpTSN/qcuhtYVGwvAjb2Of7J4u6bWcDew0s8kqTGG1NBmw8Dfwl0R8STxbG/Bb4C/DAilgDPAguLc/cCHwN2AK8An6ppxZKkqgwa9Cmlh+h/3R3ggn7aJ+DqYdYlSaqRSmb0UkuYf8tDNe9z0/LZNe9TajQfgSBJmTPoJSlzLt1oQN946XN16fe68TfVpV9J/TPo1XD+AJEay6UbScqcQS9JmXPpJhP1Wg6R1Pqc0UtS5gx6ScqcQS9JmTPoJSlzBr0kZc67bhrMu2MkNZozeknKnEEvSZkz6CUpcwa9JGXOoJekzBn0kpQ5g16SMmfQS1LmDHpJypyfjB2An2CVlAtn9JKUOYNekjJn0EtS5gx6Scqcv4yVjmH+LQ/Vpd9Ny2fXpV+pP87oJSlzzuiVjXrcEnvd+Jtq3qfUaM7oJSlzBr0kZa7ll278BKskHZszeknKXF1m9BExF7gJGA18J6X0lXq8jtSqvG1TjVTzGX1EjAa+DcwD3gdcERHvq/XrSJIqU4+lmxnAjpTS0ymlPwA/ABbU4XUkSRWox9JNG/Bcn/0yMLMOryPVXb1+2V+v+/NdElJ/6hH00c+x9KZGEUuBpcXu/0XEr4vtU4G9A/Td37mJwG+HUGc9Hes9NLPfaq+vpP1w2wx0bqDjmYz3eXXqd8jXHrN9fLbifocy1gOdy2Ss69rvmRW1SinV9B9wLvCjPvsrgZVVXL+mmnNAV63fQw3+Gwz4HprZb7XXV9J+uG0GOneM4453k8a6knZDGeuBzjnWtftXjzX6x4ApETE5Ik4ALgfuruL6TUM8N5LUq87h9lvt9ZW0H26bgc61yljDyBzveox1Je2G+v3bKuM9Esd6UFH8NKltpxEfA75J7+2V300p/X3NX+T11+pKKXXWq3+NLI738cOxrp263EefUroXuLceffdjTYNeRyOD4338cKxrpC4zeknSyOEjECQpcwa9JGXOoJekzGUX9BHxlohYFxG3RcQnml2P6icizoqItRFxR7NrUf1FxMXF9/XGiJjT7HpaSUsEfUR8NyJejIgtRx2fGxG/jogdEXFDcfgS4I6U0lXAxxterIalmrFOvc9TWtKcSlULVY73vxTf14uBv2hCuS2rJYIe+B4wt++BYzwls53Xn7VzqIE1qja+R+Vjrdb3Paof778rzqtCLRH0KaX/AH531OGBnpJZpjfsoUXen15X5VirxVUz3tHrq8C/pZQeb3StrayVg7C/p2S2AXcBl0bEalrnY9U6tn7HOiImRMQ/AudExMrmlKY6GOh7ezlwIXBZRPx1MwprVa38N2P7fUpmSull4FONLkZ1NdBY7wb8hs/PQON9M3Bzo4vJQSvP6MvAGX3224FdTapF9eVYH18c7xpr5aAf7lMy1Toc6+OL411jLRH0EXE78DDw3ogoR8SSlNJBYBnwI2Ab8MOU0tZm1qnhc6yPL453Y/hQM0nKXEvM6CVJQ2fQS1LmDHpJypxBL0mZM+glKXMGvSRlzqCXpMwZ9JKUOYNekjL3//lUoXk3DdZMAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.xscale('log')\n",
"bins = 1.5**(np.arange(0,15))\n",
"plt.hist(df[df['label']=='ham']['punct'],bins=bins,alpha=0.8)\n",
"plt.hist(df[df['label']=='spam']['punct'],bins=bins,alpha=0.8)\n",
"plt.legend(('ham','spam'))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks even worse - there seem to be no values where one would pick spam over ham. We'll still try to build a machine learning classification model, but we should expect poor results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Split the data into train & test sets:\n",
"\n",
"If we wanted to divide the DataFrame into two smaller sets, we could use\n",
"> `train, test = train_test_split(df)`\n",
"\n",
"For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the `label` column in our data. For Features we'll use the `length` and `punct` columns. *By convention, **X** is capitalized and **y** is lowercase.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting features\n",
"There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:\n",
"> `X = df[['length','punct']]`\n",
"\n",
"If the number of features is large, then it may be easier to drop the Label and any other unwanted columns:\n",
"> `X = df.drop(['label','message'], axis=1)`\n",
"\n",
"These operations make copies of **df**, but do not change the original DataFrame in place. All the original data is preserved."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Create Feature and Label sets\n",
"X = df[['length','punct']] # note the double set of brackets\n",
"y = df['label']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional train/test/split arguments:\n",
"The default test size for `train_test_split` is 30%. Here we'll assign 33% of the data for testing. \n",
"Also, we can set a `random_state` seed value to ensure that everyone uses the same \"random\" training & testing sets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training Data Shape: (3733, 2)\n",
"Testing Data Shape: (1839, 2)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n",
"\n",
"print('Training Data Shape:', X_train.shape)\n",
"print('Testing Data Shape: ', X_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can pass these sets into a series of different training & testing algorithms and compare their results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Train a Logistic Regression classifier\n",
"One of the simplest multi-class classification tools is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Scikit-learn offers a variety of algorithmic solvers; we'll use [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS). "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"lr_model = LogisticRegression(solver='lbfgs')\n",
"\n",
"lr_model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the Accuracy of the Model"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1547 46]\n",
" [ 241 5]]\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"\n",
"# Create a prediction set:\n",
"predictions = lr_model.predict(X_test)\n",
"\n",
"# Print a confusion matrix\n",
"print(metrics.confusion_matrix(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
ham
\n",
"
spam
\n",
"
\n",
" \n",
" \n",
"
\n",
"
ham
\n",
"
1547
\n",
"
46
\n",
"
\n",
"
\n",
"
spam
\n",
"
241
\n",
"
5
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ham spam\n",
"ham 1547 46\n",
"spam 241 5"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can make the confusion matrix less confusing by adding labels:\n",
"df = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['ham','spam'], columns=['ham','spam'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results are terrible! More spam messages were confused as ham (241) than correctly identified as spam (5), although a relatively small number of ham messages (46) were confused as spam."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" ham 0.87 0.97 0.92 1593\n",
" spam 0.10 0.02 0.03 246\n",
"\n",
" micro avg 0.84 0.84 0.84 1839\n",
" macro avg 0.48 0.50 0.47 1839\n",
"weighted avg 0.76 0.84 0.80 1839\n",
"\n"
]
}
],
"source": [
"# Print a classification report\n",
"print(metrics.classification_report(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.84393692224\n"
]
}
],
"source": [
"# Print the overall accuracy\n",
"print(metrics.accuracy_score(y_test,predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model performed *worse* than a classifier that assigned all messages as \"ham\" would have!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Train a naïve Bayes classifier:\n",
"One of the most common - and successful - classifiers is [naïve Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"nb_model = MultinomialNB()\n",
"\n",
"nb_model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run predictions and report on metrics"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1583 10]\n",
" [ 246 0]]\n"
]
}
],
"source": [
"predictions = nb_model.predict(X_test)\n",
"print(metrics.confusion_matrix(y_test,predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The total number of confusions dropped from **287** to **256**. [241+46=287, 246+10=256]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" ham 0.87 0.99 0.93 1593\n",
" spam 0.00 0.00 0.00 246\n",
"\n",
" micro avg 0.86 0.86 0.86 1839\n",
" macro avg 0.43 0.50 0.46 1839\n",
"weighted avg 0.75 0.86 0.80 1839\n",
"\n"
]
}
],
"source": [
"print(metrics.classification_report(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.860793909734\n"
]
}
],
"source": [
"print(metrics.accuracy_score(y_test,predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Better, but still less accurate than 86.6%"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Train a support vector machine (SVM) classifier\n",
"Among the SVM options available, we'll use [C-Support Vector Classification (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import SVC\n",
"svc_model = SVC(gamma='auto')\n",
"svc_model.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run predictions and report on metrics"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1515 78]\n",
" [ 131 115]]\n"
]
}
],
"source": [
"predictions = svc_model.predict(X_test)\n",
"print(metrics.confusion_matrix(y_test,predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The total number of confusions dropped even further to **209**."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" ham 0.92 0.95 0.94 1593\n",
" spam 0.60 0.47 0.52 246\n",
"\n",
" micro avg 0.89 0.89 0.89 1839\n",
" macro avg 0.76 0.71 0.73 1839\n",
"weighted avg 0.88 0.89 0.88 1839\n",
"\n"
]
}
],
"source": [
"print(metrics.classification_report(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.886351277868\n"
]
}
],
"source": [
"print(metrics.accuracy_score(y_test,predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally we have a model that performs *slightly* better than random chance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! Now you should be able to load a dataset, divide it into training and testing sets, and perform simple analyses using scikit-learn.\n",
"## Next up: Feature Extraction from Text"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}