889 lines
32 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization\n",
"The first step in creating a `Doc` object is to break down the incoming text into component pieces or \"tokens\"."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import spaCy and load the language library\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"We're moving to L.A.!\"\n"
]
}
],
"source": [
"# Create a string that includes opening and closing quotation marks\n",
"mystring = '\"We\\'re moving to L.A.!\"'\n",
"print(mystring)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\" | We | 're | moving | to | L.A. | ! | \" | "
]
}
],
"source": [
"# Create a Doc object and explore tokens\n",
"doc = nlp(mystring)\n",
"\n",
"for token in doc:\n",
" print(token.text, end=' | ')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"../tokenization.png\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Prefix**:\tCharacter(s) at the beginning &#9656; `$ ( “ ¿`\n",
"- **Suffix**:\tCharacter(s) at the end &#9656; `km ) , . ! ”`\n",
"- **Infix**:\tCharacter(s) in between &#9656; `- -- / ...`\n",
"- **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prefixes, Suffixes and Infixes\n",
"spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We\n",
"'re\n",
"here\n",
"to\n",
"help\n",
"!\n",
"Send\n",
"snail\n",
"-\n",
"mail\n",
",\n",
"email\n",
"support@oursite.com\n",
"or\n",
"visit\n",
"us\n",
"at\n",
"http://www.oursite.com\n",
"!\n"
]
}
],
"source": [
"doc2 = nlp(u\"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!\")\n",
"\n",
"for t in doc2:\n",
" print(t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A\n",
"5\n",
"km\n",
"NYC\n",
"cab\n",
"ride\n",
"costs\n",
"$\n",
"10.30\n"
]
}
],
"source": [
"doc3 = nlp(u'A 5km NYC cab ride costs $10.30')\n",
"\n",
"for t in doc3:\n",
" print(t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exceptions\n",
"Punctuation that exists as part of a known abbreviation will be kept as part of the token."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Let\n",
"'s\n",
"visit\n",
"St.\n",
"Louis\n",
"in\n",
"the\n",
"U.S.\n",
"next\n",
"year\n",
".\n"
]
}
],
"source": [
"doc4 = nlp(u\"Let's visit St. Louis in the U.S. next year.\")\n",
"\n",
"for t in doc4:\n",
" print(t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Here the abbreviations for \"Saint\" and \"United States\" are both preserved.</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Counting Tokens\n",
"`Doc` objects have a set number of tokens:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Counting Vocab Entries\n",
"`Vocab` objects contain a full library of items!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"57852"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(doc.vocab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the `vocab` when the `Doc` was created.</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokens can be retrieved by index position and slice\n",
"`Doc` objects can be thought of as lists of `token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"better"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc5 = nlp(u'It is better to give than to receive.')\n",
"\n",
"# Retrieve the third token:\n",
"doc5[2]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"better to give"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Retrieve three tokens from the middle:\n",
"doc5[2:5]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"than to receive."
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Retrieve the last four tokens:\n",
"doc5[-4:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokens cannot be reassigned\n",
"Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"doc6 = nlp(u'My dinner was horrible.')\n",
"doc7 = nlp(u'Your dinner was delicious.')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'spacy.tokens.doc.Doc' object does not support item assignment",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-13-d4fb8c39c40b>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# Try to change \"My dinner was horrible\" to \"My dinner was delicious\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mdoc6\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdoc7\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: 'spacy.tokens.doc.Doc' object does not support item assignment"
]
}
],
"source": [
"# Try to change \"My dinner was horrible\" to \"My dinner was delicious\"\n",
"doc6[3] = doc7[3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"# Named Entities\n",
"Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | \n",
"----\n",
"Apple - ORG - Companies, agencies, institutions, etc.\n",
"Hong Kong - GPE - Countries, cities, states\n",
"$6 million - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')\n",
"\n",
"for token in doc8:\n",
" print(token.text, end=' | ')\n",
"\n",
"print('\\n----')\n",
"\n",
"for ent in doc8.ents:\n",
" print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=green>Note how two tokens combine to form the entity `Hong Kong`, and three tokens combine to form the monetary entity: `$6 million`</font>"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(doc8.ents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Named Entity Recognition (NER) is an important machine learning tool applied to Natural Language Processing.<br>We'll do a lot more with it in an upcoming section. For more info on **named entities** visit https://spacy.io/usage/linguistic-features#named-entities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Noun Chunks\n",
"Similar to `Doc.ents`, `Doc.noun_chunks` are another object property. *Noun chunks* are \"base noun phrases\" flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *\"one-eyed, one-horned, flying, purple people-eater\"* would be one long noun chunk."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Autonomous cars\n",
"insurance liability\n",
"manufacturers\n"
]
}
],
"source": [
"doc9 = nlp(u\"Autonomous cars shift insurance liability toward manufacturers.\")\n",
"\n",
"for chunk in doc9.noun_chunks:\n",
" print(chunk.text)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Red cars\n",
"higher insurance rates\n"
]
}
],
"source": [
"doc10 = nlp(u\"Red cars do not carry higher insurance rates.\")\n",
"\n",
"for chunk in doc10.noun_chunks:\n",
" print(chunk.text)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"He\n",
"a one-eyed, one-horned, flying, purple people-eater\n"
]
}
],
"source": [
"doc11 = nlp(u\"He was a one-eyed, one-horned, flying, purple people-eater.\")\n",
"\n",
"for chunk in doc11.noun_chunks:\n",
" print(chunk.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll look at additional noun_chunks components besides `.text` in an upcoming section.<br>For more info on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Built-in Visualizers\n",
"\n",
"spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.\n",
"\n",
"For more info visit https://spacy.io/usage/visualizers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing the dependency parse\n",
"Run the cell below to import displacy and display the dependency graphic"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" id=\"0\" class=\"displacy\" width=\"1370\" height=\"357.0\" style=\"max-width: none; height: 357.0px; color: #000000; background: #ffffff; font-family: Arial\">\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Apple</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"160\">is</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"160\">VERB</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"270\">going</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"270\">VERB</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"380\">to</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"380\">PART</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"490\">build</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"490\">VERB</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"600\">a</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"600\">DET</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"710\">U.K.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"710\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"820\">factory</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"820\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"930\">for</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"930\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1040\">$</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1040\">SYM</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1150\">6</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1150\">NUM</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"267.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1260\">million.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1260\">NUM</tspan>\n",
"</text>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-0\" stroke-width=\"2px\" d=\"M70,222.0 C70,112.0 260.0,112.0 260.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-0\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M70,224.0 L62,212.0 78,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-1\" stroke-width=\"2px\" d=\"M180,222.0 C180,167.0 255.0,167.0 255.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-1\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">aux</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M180,224.0 L172,212.0 188,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-2\" stroke-width=\"2px\" d=\"M400,222.0 C400,167.0 475.0,167.0 475.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-2\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">aux</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M400,224.0 L392,212.0 408,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-3\" stroke-width=\"2px\" d=\"M290,222.0 C290,112.0 480.0,112.0 480.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-3\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">xcomp</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M480.0,224.0 L488.0,212.0 472.0,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-4\" stroke-width=\"2px\" d=\"M620,222.0 C620,112.0 810.0,112.0 810.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-4\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M620,224.0 L612,212.0 628,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-5\" stroke-width=\"2px\" d=\"M730,222.0 C730,167.0 805.0,167.0 805.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-5\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M730,224.0 L722,212.0 738,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-6\" stroke-width=\"2px\" d=\"M510,222.0 C510,57.0 815.0,57.0 815.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-6\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M815.0,224.0 L823.0,212.0 807.0,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-7\" stroke-width=\"2px\" d=\"M510,222.0 C510,2.0 930.0,2.0 930.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-7\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M930.0,224.0 L938.0,212.0 922.0,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-8\" stroke-width=\"2px\" d=\"M1060,222.0 C1060,112.0 1250.0,112.0 1250.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-8\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">quantmod</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1060,224.0 L1052,212.0 1068,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-9\" stroke-width=\"2px\" d=\"M1170,222.0 C1170,167.0 1245.0,167.0 1245.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-9\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1170,224.0 L1162,212.0 1178,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-0-10\" stroke-width=\"2px\" d=\"M950,222.0 C950,57.0 1255.0,57.0 1255.0,222.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-0-10\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1255.0,224.0 L1263.0,212.0 1247.0,212.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"</svg>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from spacy import displacy\n",
"\n",
"doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')\n",
"displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional `'distance'` argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing the entity recognizer"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"entities\" style=\"line-height: 2.5\">Over \n",
"<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" the last quarter\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
"</mark>\n",
" \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" Apple\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" sold \n",
"<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" nearly 20 thousand\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
"</mark>\n",
" \n",
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" iPods\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
"</mark>\n",
" for a profit of \n",
"<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" $6 million\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
"</mark>\n",
".</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')\n",
"displacy.render(doc, style='ent', jupyter=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Creating Visualizations Outside of Jupyter\n",
"If you're using another Python IDE or writing a script, you can choose to have spaCy serve up html separately:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Serving on port 5000...\n",
" Using the 'dep' visualizer\n",
"\n",
"\n",
" Shutting down server on port 5000.\n",
"\n"
]
}
],
"source": [
"doc = nlp(u'This is a sentence.')\n",
"displacy.serve(doc, style='dep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=blue>**After running the cell above, click the link below to view the dependency parse**:</font>\n",
"\n",
"http://127.0.0.1:5000\n",
"<br><br>\n",
"<font color=red>**To shut down the server and return to jupyter**, interrupt the kernel either through the **Kernel** menu above, by hitting the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! Now you should have an understanding of how tokenization divides text up into individual elements, how named entities provide context, and how certain tools help to visualize grammar rules and entity labels.\n",
"## Next up: Stemming"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}