{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This unit is divided into two sections:\n",
"* First, we'll find out what is necessary to build an NLP system that can turn a body of text into a numerical array of *features*.\n",
"* Next we'll show how to perform these steps using real tools."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building a Natural Language Processor From Scratch\n",
"In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.\n",
"\n",
"|   | label | message | length | punct |\n",
"|---|---|---|---|---|\n",
"| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |\n",
"| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |\n",
"| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |\n",
"| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |\n",
"| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |"
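The from-scratch steps described above can be sketched in plain Python. This is a minimal illustration, with two hypothetical in-memory documents standing in for the section's two small text files:

```python
# A minimal Bag of Words sketch, assuming two short in-memory documents
# (hypothetical stand-ins for the section's two small text files).
doc1 = 'This is a story about dogs'
doc2 = 'This story is about surfing'

# Step 1: build a vocabulary of every unique lowercased word in the corpus
vocab = sorted(set(doc1.lower().split()) | set(doc2.lower().split()))

# Step 2: represent each document as word counts over the shared vocabulary
def bag_of_words(text):
    words = text.lower().split()
    return [words.count(word) for word in vocab]

print(vocab)                 # ['a', 'about', 'dogs', 'is', 'story', 'surfing', 'this']
print(bag_of_words(doc1))    # [1, 1, 1, 1, 1, 0, 1]
print(bag_of_words(doc2))    # [0, 1, 0, 1, 1, 1, 1]
```

Each document becomes a fixed-length numerical vector, one position per vocabulary word, which is exactly the feature array the tools in the next section produce at scale.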