{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stemming\n",
"Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for \"boat\" might also return \"boats\" and \"boating\". Here, \"boat\" would be the **stem** for [boat, boater, boating, boats].\n",
"\n",
"Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision [here](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.\n",
"\n",
"Instead, we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk visit https://www.nltk.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Porter Stemmer\n",
"\n",
"One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined, such as:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![stemming1.png](../stemming1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From a given set of stemming rules only one rule is applied, based on the longest suffix S1. Thus, `caresses` reduces to `caress` but not `cares`.\n",
"\n",
"More sophisticated phases consider the length/complexity of the word before applying a rule. For example:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![stemming1.png](../stemming2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here `m>0` describes the \"measure\" of the stem, such that the rule is applied to all but the most basic stems."
]
},
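{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the phase-one rules in action, here is a quick sketch using nltk's `PorterStemmer` (formally introduced in the next cell). The sample words come straight from the suffix rules above; note how the longest matching suffix wins in each case:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A quick check of the phase-one suffix rules shown above\n",
"from nltk.stem.porter import PorterStemmer\n",
"\n",
"demo_stemmer = PorterStemmer()\n",
"\n",
"for word in ['caresses','ponies','caress','cats']:\n",
"    print(word+' --> '+demo_stemmer.stem(word))\n",
"\n",
"# caresses --> caress  (SSES -> SS; the longest suffix wins, so not 'cares')\n",
"# ponies   --> poni    (IES  -> I)\n",
"# caress   --> caress  (SS   -> SS)\n",
"# cats     --> cat     (S    -> removed)"
]
},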
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import the toolkit and the full Porter Stemmer library\n",
"import nltk\n",
"\n",
"from nltk.stem.porter import *"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"p_stemmer = PorterStemmer()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"words = ['run','runner','running','ran','runs','easily','fairly']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"run --> run\n",
"runner --> runner\n",
"running --> run\n",
"ran --> ran\n",
"runs --> run\n",
"easily --> easili\n",
"fairly --> fairli\n"
]
}
],
"source": [
"for word in words:\n",
" print(word+' --> '+p_stemmer.stem(word))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Note how the stemmer recognizes \"runner\" as a noun, not a verb form or participle. Also, the adverbs \"easily\" and \"fairly\" are stemmed to the unusual root \"easili\" and \"fairli\"</font>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Snowball Stemmer\n",
"This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more acurately called the \"English Stemmer\" or \"Porter2 Stemmer\". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since **nltk** uses the name SnowballStemmer, we'll use it here."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem.snowball import SnowballStemmer\n",
"\n",
"# The Snowball Stemmer requires that you pass a language parameter\n",
"s_stemmer = SnowballStemmer(language='english')"
]
},
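{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, the `SnowballStemmer` class exposes the languages it supports as a tuple, so you can check what's available before instantiating one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Languages supported by the Snowball stemmer (a class attribute)\n",
"print(SnowballStemmer.languages)"
]
},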
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"words = ['run','runner','running','ran','runs','easily','fairly']\n",
"# words = ['generous','generation','generously','generate']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"run --> run\n",
"runner --> runner\n",
"running --> run\n",
"ran --> ran\n",
"runs --> run\n",
"easily --> easili\n",
"fairly --> fair\n"
]
}
],
"source": [
"for word in words:\n",
" print(word+' --> '+s_stemmer.stem(word))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<font color=green>In this case the stemmer performed the same as the Porter Stemmer, with the exception that it handled the stem of \"fairly\" more appropriately with \"fair\"</font>\n",
"___"
]
},
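{
"cell_type": "markdown",
"metadata": {},
"source": [
"The commented-out word list above is a good place to watch the two algorithms diverge. Here is a side-by-side sketch using the `p_stemmer` and `s_stemmer` objects already defined; you should see Porter collapse all four words to \"gener\", while Snowball keeps \"generous\" intact:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the two stemmers side by side on words where they diverge\n",
"words = ['generous','generation','generously','generate']\n",
"\n",
"for word in words:\n",
"    print(word+' --> Porter: '+p_stemmer.stem(word)+' | Snowball: '+s_stemmer.stem(word))"
]
},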
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Try it yourself!\n",
"#### Pass in some of your own words and test each stemmer on them. Remember to pass them as strings!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"words = ['consolingly']"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Porter Stemmer:\n",
"consolingly --> consolingli\n"
]
}
],
"source": [
"print('Porter Stemmer:')\n",
"for word in words:\n",
" print(word+' --> '+p_stemmer.stem(word))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Porter2 Stemmer:\n",
"consolingly --> consol\n"
]
}
],
"source": [
"print('Porter2 Stemmer:')\n",
"for word in words:\n",
" print(word+' --> '+s_stemmer.stem(word))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"Stemming has its drawbacks. If given the token `saw`, stemming might always return `saw`, whereas lemmatization would likely return either `see` or `saw` depending on whether the use of the token was as a verb or a noun. As an example, consider the following:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I --> I\n",
"am --> am\n",
"meeting --> meet\n",
"him --> him\n",
"tomorrow --> tomorrow\n",
"at --> at\n",
"the --> the\n",
"meeting --> meet\n"
]
}
],
"source": [
"phrase = 'I am meeting him tomorrow at the meeting'\n",
"for word in phrase.split():\n",
" print(word+' --> '+p_stemmer.stem(word))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the word \"meeting\" appears twice - once as a verb, and once as a noun, and yet the stemmer treats both equally."
]
},
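{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of the next section, a lemmatizer uses the part of speech to separate the two uses. Here is a minimal sketch, assuming spaCy and its `en_core_web_sm` model are installed (neither is loaded elsewhere in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: lemmatization can distinguish the verb from the noun\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"doc = nlp(u'I am meeting him tomorrow at the meeting')\n",
"for token in doc:\n",
"    print(token.text+' --> '+token.lemma_)\n",
"\n",
"# The first 'meeting' (a verb) should lemmatize to 'meet',\n",
"# while the second (a noun) should remain 'meeting'."
]
},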
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next up: Lemmatization"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}