{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lemmatization\n",
"In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Perform standard imports:\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"am \t VERB \t 10382539506755952630 \t be\n",
"a \t DET \t 11901859001352538922 \t a\n",
"runner \t NOUN \t 12640964157389618806 \t runner\n",
"running \t VERB \t 12767647472892411841 \t run\n",
"in \t ADP \t 3002984154512732771 \t in\n",
"a \t DET \t 11901859001352538922 \t a\n",
"race \t NOUN \t 8048469955494714898 \t race\n",
"because \t ADP \t 16950148841647037698 \t because\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"love \t VERB \t 3702023516439754181 \t love\n",
"to \t PART \t 3791531372978436496 \t to\n",
"run \t VERB \t 12767647472892411841 \t run\n",
"since \t ADP \t 10066841407251338481 \t since\n",
"I \t PRON \t 561228191312463089 \t -PRON-\n",
"ran \t VERB \t 12767647472892411841 \t run\n",
"today \t NOUN \t 11042482332948150395 \t today\n"
]
}
],
"source": [
"doc1 = nlp(u\"I am a runner running in a race because I love to run since I ran today\")\n",
"\n",
"for token in doc1:\n",
"    print(token.text, '\\t', token.pos_, '\\t', token.lemma, '\\t', token.lemma_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication.</font>"
]
},
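{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of our own (the variable name `unique_lemmas` is not part of the original lesson), the shared lemma makes it easy to collect the distinct base forms in a `Doc`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc1 from the cell above:\n",
"# `running`, `run` and `ran` all contribute a single 'run' entry.\n",
"unique_lemmas = {token.lemma_ for token in doc1}\n",
"print(unique_lemmas)"
]
},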
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Function to display lemmas\n",
"Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def show_lemmas(text):\n",
"    for token in text:\n",
"        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we're using an **f-string** to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I            PRON   561228191312463089     -PRON-\n",
"saw          VERB   11925638236994514241   see\n",
"eighteen     NUM    9609336664675087640    eighteen\n",
"mice         NOUN   1384165645700560590    mouse\n",
"today        NOUN   11042482332948150395   today\n",
"!            PUNCT  17494803046312582752   !\n"
]
}
],
"source": [
"doc2 = nlp(u\"I saw eighteen mice today!\")\n",
"\n",
"show_lemmas(doc2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Notice that the lemma of `saw` is `see`, `mice` is the plural form of `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`.</font>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I            PRON   561228191312463089     -PRON-\n",
"am           VERB   10382539506755952630   be\n",
"meeting      VERB   6880656908171229526    meet\n",
"him          PRON   561228191312463089     -PRON-\n",
"tomorrow     NOUN   3573583789758258062    tomorrow\n",
"at           ADP    11667289587015813222   at\n",
"the          DET    7425985699627899538    the\n",
"meeting      NOUN   14798207169164081740   meeting\n",
".            PUNCT  12646065887601541794   .\n"
]
}
],
"source": [
"doc3 = nlp(u\"I am meeting him tomorrow at the meeting.\")\n",
"\n",
"show_lemmas(doc3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Here the lemma of `meeting` is determined by its part-of-speech tag.</font>"
]
},
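{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (a sketch of our own, not part of the original lesson; `verb_lemmas` is our own name), combining `pos_` with `lemma_` lets you pull out just the verb lemmas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc3 from the cell above:\n",
"# given the tags shown above, this yields ['be', 'meet']\n",
"verb_lemmas = [token.lemma_ for token in doc3 if token.pos_ == 'VERB']\n",
"print(verb_lemmas)"
]
},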
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"That         DET    4380130941430378203    that\n",
"'s           VERB   10382539506755952630   be\n",
"an           DET    15099054000809333061   an\n",
"enormous     ADJ    17917224542039855524   enormous\n",
"automobile   NOUN   7211811266693931283    automobile\n"
]
}
],
"source": [
"doc4 = nlp(u\"That's an enormous automobile\")\n",
"\n",
"show_lemmas(doc4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>"
]
},
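{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common use of lemmas (a sketch of our own, not part of the original lesson) is to rebuild a text in normalized form by joining each token's `lemma_`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming doc4 from the cell above:\n",
"print(' '.join(token.lemma_ for token in doc4))"
]
},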
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.\n",
"\n",
"## Next up: Stop Words"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}