{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vocabulary and Matching\n",
    "So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.\n",
    "\n",
    "In this section we will identify and label specific phrases that match patterns we can define ourselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rule-based Matching\n",
    "spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token, including text and annotations, and you can add multiple patterns to the same matcher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform standard imports\n",
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the Matcher class\n",
    "from spacy.matcher import Matcher\n",
    "matcher = Matcher(nlp.vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>Here `matcher` is an object paired with the current `Vocab` object. We can add named match patterns to `matcher` and remove them as needed.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating patterns\n",
    "In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]\n",
    "pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]\n",
    "\n",
    "matcher.add('SolarPower', None, pattern1, pattern2, pattern3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's break this down:\n",
    "* `pattern1` looks for a single token whose lowercase text reads 'solarpower'\n",
    "* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order\n",
    "* `pattern3` looks for three adjacent tokens, with a middle token that must be some kind of punctuation.<font color=green>*</font>\n",
    "\n",
    "<font color=green>\\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>\n",
    "<br>Once we define our patterns, we pass them into `matcher` with the name 'SolarPower', and set the *callback* to `None` (more on callbacks later)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Applying the matcher to a Doc object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = nlp(u'The Solar Power industry continues to grow as demand \\\n",
    "for solarpower increases. Solar-power cars are gaining popularity.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`matcher` returns a list of tuples. Each tuple contains an ID for the match, along with start and end token positions that map to the span `doc[start:end]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8656102463236116519 SolarPower 1 3 Solar Power\n",
      "8656102463236116519 SolarPower 10 11 solarpower\n",
      "8656102463236116519 SolarPower 13 16 Solar-power\n"
     ]
    }
   ],
   "source": [
    "for match_id, start, end in found_matches:\n",
    "    string_id = nlp.vocab.strings[match_id]  # get string representation\n",
    "    span = doc[start:end]                    # get the matched span\n",
    "    print(match_id, string_id, start, end, span.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `match_id` is simply the hash value of the `string_id` 'SolarPower'."
   ]
  },
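  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch (assuming the 'SolarPower' matcher defined above), `nlp.vocab.strings` resolves this hash in both directions:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The StringStore maps both ways: string -> hash and hash -> string\n",
    "print(nlp.vocab.strings['SolarPower'])         # 8656102463236116519\n",
    "print(nlp.vocab.strings[8656102463236116519])  # 'SolarPower'"
   ]
  },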
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting pattern options and quantifiers\n",
    "You can make a token rule optional by passing an `'OP':'*'` argument, which allows the token to match zero or more times. This lets us streamline our patterns list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Redefine the patterns:\n",
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This found both two-word patterns, with and without the hyphen!\n",
    "\n",
    "The following quantifiers can be passed to the `'OP'` key:\n",
    "<table><tr><th>OP</th><th>Description</th></tr>\n",
    "\n",
    "<tr ><td><span >\\!</span></td><td>Negate the pattern, requiring it to match exactly 0 times</td></tr>\n",
    "<tr ><td><span >?</span></td><td>Make the pattern optional, allowing it to match 0 or 1 times</td></tr>\n",
    "<tr ><td><span >\\+</span></td><td>Require the pattern to match 1 or more times</td></tr>\n",
    "<tr ><td><span >\\*</span></td><td>Allow the pattern to match zero or more times</td></tr>\n",
    "</table>\n"
   ]
  },
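  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch of the `'+'` quantifier (the matcher name and sample sentence here are our own), requiring one or more punctuation tokens matches the hyphenated form but not the plain two-word form:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: '+' requires at least one punctuation token between the words\n",
    "plus_pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'power'}]\n",
    "\n",
    "plus_matcher = Matcher(nlp.vocab)\n",
    "plus_matcher.add('SolarPowerPlus', None, plus_pattern)\n",
    "\n",
    "doc_plus = nlp(u'Solar-power is matched, but solar power alone is not.')\n",
    "print(plus_matcher(doc_plus))"
   ]
  },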
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Be careful with lemmas!\n",
    "If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # CHANGE THIS PATTERN\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 0, 3)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc2)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>The matcher found the first occurrence because the lemmatizer treated that 'powered' as a verb with the lemma 'power'. It missed the second occurrence because it tagged that 'powered' as an adjective, whose lemma remains 'powered'.<br>For this case it may be better to set explicit token patterns.</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]\n",
    "pattern3 = [{'LOWER': 'solarpowered'}]\n",
    "pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc2)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other token attributes\n",
    "Besides lemmas, there are a variety of token attributes we can use to determine matching rules:\n",
    "<table><tr><th>Attribute</th><th>Description</th></tr>\n",
    "\n",
    "<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>\n",
    "<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>\n",
    "<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>\n",
    "<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>\n",
    "<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>\n",
    "<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>\n",
    "<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>\n",
    "<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>\n",
    "<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>\n",
    "\n",
    "</table>"
   ]
  },
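  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch using one of these attributes, `LIKE_NUM` (the matcher name and sample sentence here are our own), we can match number-like tokens followed by the word 'percent':</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: LIKE_NUM catches both '3' and 'five'\n",
    "num_matcher = Matcher(nlp.vocab)\n",
    "num_matcher.add('Quantity', None, [{'LIKE_NUM': True}, {'LOWER': 'percent'}])\n",
    "\n",
    "doc_nums = nlp(u'Output rose five percent in 2017 and 3 percent in 2018.')\n",
    "for match_id, start, end in num_matcher(doc_nums):\n",
    "    print(doc_nums[start:end].text)"
   ]
  },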
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Token wildcard\n",
    "You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:\n",
    ">`[{'ORTH': '#'}, {}]`"
   ]
  },
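  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>A minimal sketch of the hashtag pattern above (the matcher name and sample text here are our own):</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: '#' followed by any single token\n",
    "hashtag_matcher = Matcher(nlp.vocab)\n",
    "hashtag_matcher.add('HashTag', None, [{'ORTH': '#'}, {}])\n",
    "\n",
    "doc_tags = nlp(u'Rule-based matching is fun! #spacy #nlp')\n",
    "for match_id, start, end in hashtag_matcher(doc_tags):\n",
    "    print(doc_tags[start:end].text)"
   ]
  },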
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "## PhraseMatcher\n",
    "In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we use `PhraseMatcher` to create Doc objects from a list of phrases, and pass those into `matcher` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform standard imports, reset nlp\n",
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the PhraseMatcher class\n",
    "from spacy.matcher import PhraseMatcher\n",
    "matcher = PhraseMatcher(nlp.vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this exercise we're going to import a Wikipedia article on *Reaganomics*.<br>\n",
    "Source: https://en.wikipedia.org/wiki/Reaganomics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('../TextFiles/reaganomics.txt', encoding='utf8') as f:\n",
    "    doc3 = nlp(f.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First, create a list of match phrases:\n",
    "phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']\n",
    "\n",
    "# Next, convert each phrase to a Doc object:\n",
    "phrase_patterns = [nlp(text) for text in phrase_list]\n",
    "\n",
    "# Pass each Doc object into matcher (note the use of the asterisk!):\n",
    "matcher.add('VoodooEconomics', None, *phrase_patterns)\n",
    "\n",
    "# Build a list of matches:\n",
    "matches = matcher(doc3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3473369816841043438, 41, 45),\n",
       " (3473369816841043438, 49, 53),\n",
       " (3473369816841043438, 54, 56),\n",
       " (3473369816841043438, 61, 65),\n",
       " (3473369816841043438, 673, 677),\n",
       " (3473369816841043438, 2985, 2989)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# (match_id, start, end)\n",
    "matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>The first four matches are where these terms are used in the definition of Reaganomics:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "REAGANOMICS\n",
       "https://en.wikipedia.org/wiki/Reaganomics\n",
       "\n",
       "Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.\n"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[:70]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Viewing Matches\n",
    "There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[665:685]  # Note that the fifth match starts at doc3[673]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "against institutions.[66] His policies became widely known as \"trickle-down economics\", due to the significant"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[2975:2995]  # The sixth match starts at doc3[2985]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way is to use the Doc object's sentence segmentation, and iterate through the sentences up to the match point:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 35\n"
     ]
    }
   ],
   "source": [
    "# Build a list of sentences\n",
    "sents = [sent for sent in doc3.sents]\n",
    "\n",
    "# In the next section we'll see that sentences contain start and end token values:\n",
    "print(sents[0].start, sents[0].end)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.\n"
     ]
    }
   ],
   "source": [
    "# Iterate over the sentence list until the sentence end value exceeds a match start value:\n",
    "for sent in sents:\n",
    "    if matches[4][1] < sent.end:  # this is the fifth match, which starts at doc3[673]\n",
    "        print(sent)\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching\n",
    "## Next Up: NLP Basics Assessment"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}