{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vocabulary and Matching\n",
    "So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.\n",
    "\n",
    "In this section we will identify and label specific phrases that match patterns we can define ourselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rule-based Matching\n",
    "spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token, including text and annotations, and you can add multiple patterns to the same matcher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform standard imports\n",
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the Matcher class\n",
    "from spacy.matcher import Matcher\n",
    "matcher = Matcher(nlp.vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>Here `matcher` is an object paired with the current `Vocab` object. We can add named match patterns to `matcher` and remove them as needed.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating patterns\n",
    "In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]\n",
    "pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]\n",
    "\n",
    "matcher.add('SolarPower', None, pattern1, pattern2, pattern3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's break this down:\n",
    "* `pattern1` looks for a single token whose lowercase text reads 'solarpower'\n",
    "* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order\n",
    "* `pattern3` looks for three adjacent tokens, with a middle token that must be some kind of punctuation.<font color=green>*</font>\n",
    "\n",
    "<font color=green>\\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>\n",
    "<br>Once we define our patterns, we pass them into `matcher` with the name 'SolarPower', and set the *callback* to `None` (more on callbacks later)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Applying the matcher to a Doc object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = nlp(u'The Solar Power industry continues to grow as demand \\\n",
    "for solarpower increases. Solar-power cars are gaining popularity.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`matcher` returns a list of tuples. Each tuple contains an ID for the match, along with start and end token positions that map to the span `doc[start:end]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8656102463236116519 SolarPower 1 3 Solar Power\n",
      "8656102463236116519 SolarPower 10 11 solarpower\n",
      "8656102463236116519 SolarPower 13 16 Solar-power\n"
     ]
    }
   ],
   "source": [
    "for match_id, start, end in found_matches:\n",
    "    string_id = nlp.vocab.strings[match_id]  # get string representation\n",
    "    span = doc[start:end]                    # get the matched span\n",
    "    print(match_id, string_id, start, end, span.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `match_id` is simply the hash value of the `string_id` 'SolarPower'."
   ]
  },
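  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch (assuming the 'SolarPower' matcher defined above), `nlp.vocab.strings` resolves this hash in both directions:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The StringStore maps both ways: string -> hash and hash -> string\n",
    "print(nlp.vocab.strings['SolarPower'])         # 8656102463236116519\n",
    "print(nlp.vocab.strings[8656102463236116519])  # 'SolarPower'"
   ]
  },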
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting pattern options and quantifiers\n",
    "You can make a token rule optional by passing an `'OP':'*'` argument, which allows the token to match zero or more times. This lets us streamline our patterns list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Redefine the patterns:\n",
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This found both two-word patterns, with and without the hyphen!\n",
    "\n",
    "The following quantifiers can be passed to the `'OP'` key:\n",
    "<table><tr><th>OP</th><th>Description</th></tr>\n",
    "\n",
    "<tr ><td><span >\\!</span></td><td>Negate the pattern, requiring it to match exactly 0 times</td></tr>\n",
    "<tr ><td><span >?</span></td><td>Make the pattern optional, allowing it to match 0 or 1 times</td></tr>\n",
    "<tr ><td><span >\\+</span></td><td>Require the pattern to match 1 or more times</td></tr>\n",
    "<tr ><td><span >\\*</span></td><td>Allow the pattern to match zero or more times</td></tr>\n",
    "</table>\n"
   ]
  },
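  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch of the `'+'` quantifier (the matcher name and sample sentence here are our own), requiring one or more punctuation tokens matches the hyphenated form but not the plain two-word form:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: '+' requires at least one punctuation token between the words\n",
    "plus_pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'power'}]\n",
    "\n",
    "plus_matcher = Matcher(nlp.vocab)\n",
    "plus_matcher.add('SolarPowerPlus', None, plus_pattern)\n",
    "\n",
    "doc_plus = nlp(u'Solar-power is matched, but solar power alone is not.')\n",
    "print(plus_matcher(doc_plus))"
   ]
  },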
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Be careful with lemmas!\n",
    "If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # CHANGE THIS PATTERN\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 0, 3)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc2)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>The matcher found the first occurrence because the lemmatizer treated that 'powered' as a verb with the lemma 'power'. It missed the second occurrence because it tagged that 'powered' as an adjective, whose lemma remains 'powered'.<br>For this case it may be better to set explicit token patterns.</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern1 = [{'LOWER': 'solarpower'}]\n",
    "pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]\n",
    "pattern3 = [{'LOWER': 'solarpowered'}]\n",
    "pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]\n",
    "\n",
    "# Remove the old patterns to avoid duplication:\n",
    "matcher.remove('SolarPower')\n",
    "\n",
    "# Add the new set of patterns to the 'SolarPower' matcher:\n",
    "matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]\n"
     ]
    }
   ],
   "source": [
    "found_matches = matcher(doc2)\n",
    "print(found_matches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other token attributes\n",
    "Besides lemmas, there are a variety of token attributes we can use to determine matching rules:\n",
    "<table><tr><th>Attribute</th><th>Description</th></tr>\n",
    "\n",
    "<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>\n",
    "<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>\n",
    "<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>\n",
    "<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>\n",
    "<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>\n",
    "<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>\n",
    "<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>\n",
    "<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>\n",
    "<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>\n",
    "\n",
    "</table>"
   ]
  },
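  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>As a minimal sketch using one of these attributes, `LIKE_NUM` (the matcher name and sample sentence here are our own), we can match number-like tokens followed by the word 'percent':</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: LIKE_NUM catches both '3' and 'five'\n",
    "num_matcher = Matcher(nlp.vocab)\n",
    "num_matcher.add('Quantity', None, [{'LIKE_NUM': True}, {'LOWER': 'percent'}])\n",
    "\n",
    "doc_nums = nlp(u'Output rose five percent in 2017 and 3 percent in 2018.')\n",
    "for match_id, start, end in num_matcher(doc_nums):\n",
    "    print(doc_nums[start:end].text)"
   ]
  },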
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Token wildcard\n",
    "You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:\n",
    ">`[{'ORTH': '#'}, {}]`"
   ]
  },
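  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>A minimal sketch of the hashtag pattern above (the matcher name and sample text here are our own):</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example: '#' followed by any single token\n",
    "hashtag_matcher = Matcher(nlp.vocab)\n",
    "hashtag_matcher.add('HashTag', None, [{'ORTH': '#'}, {}])\n",
    "\n",
    "doc_tags = nlp(u'Rule-based matching is fun! #spacy #nlp')\n",
    "for match_id, start, end in hashtag_matcher(doc_tags):\n",
    "    print(doc_tags[start:end].text)"
   ]
  },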
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "## PhraseMatcher\n",
    "In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we use `PhraseMatcher` to create Doc objects from a list of phrases, and pass those into `matcher` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform standard imports, reset nlp\n",
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the PhraseMatcher class\n",
    "from spacy.matcher import PhraseMatcher\n",
    "matcher = PhraseMatcher(nlp.vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this exercise we're going to import a Wikipedia article on *Reaganomics*.<br>\n",
    "Source: https://en.wikipedia.org/wiki/Reaganomics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('../TextFiles/reaganomics.txt', encoding='utf8') as f:\n",
    "    doc3 = nlp(f.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First, create a list of match phrases:\n",
    "phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']\n",
    "\n",
    "# Next, convert each phrase to a Doc object:\n",
    "phrase_patterns = [nlp(text) for text in phrase_list]\n",
    "\n",
    "# Pass each Doc object into matcher (note the use of the asterisk!):\n",
    "matcher.add('VoodooEconomics', None, *phrase_patterns)\n",
    "\n",
    "# Build a list of matches:\n",
    "matches = matcher(doc3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3473369816841043438, 41, 45),\n",
       " (3473369816841043438, 49, 53),\n",
       " (3473369816841043438, 54, 56),\n",
       " (3473369816841043438, 61, 65),\n",
       " (3473369816841043438, 673, 677),\n",
       " (3473369816841043438, 2985, 2989)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# (match_id, start, end)\n",
    "matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=green>The first four matches are where these terms are used in the definition of Reaganomics:</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "REAGANOMICS\n",
       "https://en.wikipedia.org/wiki/Reaganomics\n",
       "\n",
       "Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.\n"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[:70]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Viewing Matches\n",
    "There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[665:685]  # Note that the fifth match starts at doc3[673]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "against institutions.[66] His policies became widely known as \"trickle-down economics\", due to the significant"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc3[2975:2995]  # The sixth match starts at doc3[2985]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way is to use the Doc object's sentence segmentation, and iterate through the sentences up to the match point:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 35\n"
     ]
    }
   ],
   "source": [
    "# Build a list of sentences\n",
    "sents = [sent for sent in doc3.sents]\n",
    "\n",
    "# In the next section we'll see that sentences contain start and end token values:\n",
    "print(sents[0].start, sents[0].end)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.\n"
     ]
    }
   ],
   "source": [
    "# Iterate over the sentence list until the sentence end value exceeds a match start value:\n",
    "for sent in sents:\n",
    "    if matches[4][1] < sent.end:  # this is the fifth match, which starts at doc3[673]\n",
    "        print(sent)\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching\n",
    "## Next Up: NLP Basics Assessment"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}