{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lemmatization\n",
"In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence."
]
},
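{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the contrast with stemming concrete, here is a minimal sketch (assuming `nltk` is installed) showing that a rule-based stemmer only chops suffixes by pattern, so it cannot map 'was' to 'be' or 'mice' to 'mouse' the way a lemmatizer can:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A quick stemming contrast (assumes nltk is installed):\n",
"from nltk.stem.porter import PorterStemmer\n",
"\n",
"p_stemmer = PorterStemmer()\n",
"\n",
"# The stemmer strips suffixes by rule; it has no vocabulary,\n",
"# so 'was' and 'mice' are not mapped to 'be' and 'mouse':\n",
"for word in ['was', 'mice', 'running']:\n",
"    print(word, '-->', p_stemmer.stem(word))"
]
},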
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Perform standard imports:\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"am \t VERB \t 10382539506755952630 \t be\n",
"a \t DET \t 11901859001352538922 \t a\n",
"runner \t NOUN \t 12640964157389618806 \t runner\n",
"running \t VERB \t 12767647472892411841 \t run\n",
"in \t ADP \t 3002984154512732771 \t in\n",
"a \t DET \t 11901859001352538922 \t a\n",
"race \t NOUN \t 8048469955494714898 \t race\n",
"because \t ADP \t 16950148841647037698 \t because\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"love \t VERB \t 3702023516439754181 \t love\n",
"to \t PART \t 3791531372978436496 \t to\n",
"run \t VERB \t 12767647472892411841 \t run\n",
"since \t ADP \t 10066841407251338481 \t since\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"ran \t VERB \t 12767647472892411841 \t run\n",
"today \t NOUN \t 11042482332948150395 \t today\n"
]
}
],
"source": [
"doc1 = nlp(u\"I am a runner running in a race because I love to run since I ran today\")\n",
"\n",
"for token in doc1:\n",
" print(token.text, '\\t', token.pos_, '\\t', token.lemma, '\\t', token.lemma_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication."
]
},
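{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (a minimal sketch reusing `doc1` from above), we can collect the distinct lemma strings in the sentence; `running`, `run` and `ran` contribute only the single entry `run`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build a set of lemma strings - repeated forms of the same word\n",
"# ('running', 'run', 'ran') collapse into one entry, 'run':\n",
"unique_lemmas = {token.lemma_ for token in doc1}\n",
"print(unique_lemmas)"
]
},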
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Function to display lemmas\n",
"Since the display above is staggared and hard to read, let's write a function that displays the information we want more neatly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def show_lemmas(text):\n",
" for token in text:\n",
" print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we're using an **f-string** to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value."
]
},
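{
"cell_type": "markdown",
"metadata": {},
"source": [
"If this syntax is unfamiliar, here is a small standalone sketch of the same formatting codes (the names `word` and `hash_val` are just for illustration): a number in braces sets a minimum field width, and `<` left-aligns the value within that field:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ':{12}' pads the text to a minimum width of 12 characters;\n",
"# ':<{22}' left-aligns the number within a 22-character field:\n",
"word, hash_val = 'run', 12767647472892411841\n",
"print(f'{word:{12}} {hash_val:<{22}} end')"
]
},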
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I PRON 561228191312463089 -PRON-\n",
"saw VERB 11925638236994514241 see\n",
"eighteen NUM 9609336664675087640 eighteen\n",
"mice NOUN 1384165645700560590 mouse\n",
"today NOUN 11042482332948150395 today\n",
"! PUNCT 17494803046312582752 !\n"
]
}
],
"source": [
"doc2 = nlp(u\"I saw eighteen mice today!\")\n",
"\n",
"show_lemmas(doc2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the lemma of `saw` is `see`, `mice` is the plural form of `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I PRON 561228191312463089 -PRON-\n",
"am VERB 10382539506755952630 be\n",
"meeting VERB 6880656908171229526 meet\n",
"him PRON 561228191312463089 -PRON-\n",
"tomorrow NOUN 3573583789758258062 tomorrow\n",
"at ADP 11667289587015813222 at\n",
"the DET 7425985699627899538 the\n",
"meeting NOUN 14798207169164081740 meeting\n",
". PUNCT 12646065887601541794 .\n"
]
}
],
"source": [
"doc3 = nlp(u\"I am meeting him tomorrow at the meeting.\")\n",
"\n",
"show_lemmas(doc3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the lemma of `meeting` is determined by its Part of Speech tag."
]
},
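{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this directly (a small sketch reusing `doc3` from above), we can pull out just the two `meeting` tokens and compare their tags and lemmas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The same surface form receives different lemmas\n",
"# depending on its part-of-speech tag:\n",
"for token in doc3:\n",
"    if token.text == 'meeting':\n",
"        print(f'{token.text:{10}} {token.pos_:{6}} {token.lemma_}')"
]
},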
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"That DET 4380130941430378203 that\n",
"'s VERB 10382539506755952630 be\n",
"an DET 15099054000809333061 an\n",
"enormous ADJ 17917224542039855524 enormous\n",
"automobile NOUN 7211811266693931283 automobile\n"
]
}
],
"source": [
"doc4 = nlp(u\"That's an enormous automobile\")\n",
"\n",
"show_lemmas(doc4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.\n",
"\n",
"## Next up: Stop Words"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}