{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lemmatization\n",
"In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Perform standard imports:\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"am \t VERB \t 10382539506755952630 \t be\n",
"a \t DET \t 11901859001352538922 \t a\n",
"runner \t NOUN \t 12640964157389618806 \t runner\n",
"running \t VERB \t 12767647472892411841 \t run\n",
"in \t ADP \t 3002984154512732771 \t in\n",
"a \t DET \t 11901859001352538922 \t a\n",
"race \t NOUN \t 8048469955494714898 \t race\n",
"because \t ADP \t 16950148841647037698 \t because\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"love \t VERB \t 3702023516439754181 \t love\n",
"to \t PART \t 3791531372978436496 \t to\n",
"run \t VERB \t 12767647472892411841 \t run\n",
"since \t ADP \t 10066841407251338481 \t since\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"ran \t VERB \t 12767647472892411841 \t run\n",
"today \t NOUN \t 11042482332948150395 \t today\n"
]
}
],
"source": [
"doc1 = nlp(u\"I am a runner running in a race because I love to run since I ran today\")\n",
"\n",
"for token in doc1:\n",
"    print(token.text, '\\t', token.pos_, '\\t', token.lemma, '\\t', token.lemma_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication.</font>"
]
},
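{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of our own (the variable name `unique_lemmas` is not part of the original lesson), the shared lemma makes it easy to collect the distinct base forms in a `Doc`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc1 from the cell above:\n",
"# `running`, `run` and `ran` all contribute a single 'run' entry.\n",
"unique_lemmas = {token.lemma_ for token in doc1}\n",
"print(unique_lemmas)"
]
},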
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Function to display lemmas\n",
"Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def show_lemmas(text):\n",
"    for token in text:\n",
"        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we're using an **f-string** to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I            PRON   561228191312463089     -PRON-\n",
"saw          VERB   11925638236994514241   see\n",
"eighteen     NUM    9609336664675087640    eighteen\n",
"mice         NOUN   1384165645700560590    mouse\n",
"today        NOUN   11042482332948150395   today\n",
"!            PUNCT  17494803046312582752   !\n"
]
}
],
"source": [
"doc2 = nlp(u\"I saw eighteen mice today!\")\n",
"\n",
"show_lemmas(doc2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Notice that the lemma of `saw` is `see`, `mice` is the plural form of `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`.</font>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I            PRON   561228191312463089     -PRON-\n",
"am           VERB   10382539506755952630   be\n",
"meeting      VERB   6880656908171229526    meet\n",
"him          PRON   561228191312463089     -PRON-\n",
"tomorrow     NOUN   3573583789758258062    tomorrow\n",
"at           ADP    11667289587015813222   at\n",
"the          DET    7425985699627899538    the\n",
"meeting      NOUN   14798207169164081740   meeting\n",
".            PUNCT  12646065887601541794   .\n"
]
}
],
"source": [
"doc3 = nlp(u\"I am meeting him tomorrow at the meeting.\")\n",
"\n",
"show_lemmas(doc3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Here the lemma of `meeting` is determined by its part-of-speech tag.</font>"
]
},
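{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (a sketch of our own, not part of the original lesson; `verb_lemmas` is our own name), combining `pos_` with `lemma_` lets you pull out just the verb lemmas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc3 from the cell above:\n",
"# given the tags shown above, this yields ['be', 'meet']\n",
"verb_lemmas = [token.lemma_ for token in doc3 if token.pos_ == 'VERB']\n",
"print(verb_lemmas)"
]
},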
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"That         DET    4380130941430378203    that\n",
"'s           VERB   10382539506755952630   be\n",
"an           DET    15099054000809333061   an\n",
"enormous     ADJ    17917224542039855524   enormous\n",
"automobile   NOUN   7211811266693931283    automobile\n"
]
}
],
"source": [
"doc4 = nlp(u\"That's an enormous automobile\")\n",
"\n",
"show_lemmas(doc4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>"
]
},
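{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common use of lemmas (a sketch of our own, not part of the original lesson) is to rebuild a text in normalized form by joining each token's `lemma_`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc4 from the cell above:\n",
"print(' '.join(token.lemma_ for token in doc4))"
]
},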
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.\n",
"\n",
"## Next up: Stop Words"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}