{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# spaCy Basics\n",
"\n",
"**spaCy** (https://spacy.io/) is an open-source Python library that parses and \"understands\" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).\n",
"\n",
"In this section we'll install and setup spaCy to work with Python, and then introduce some concepts related to Natural Language Processing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Installation and Setup\n",
"\n",
"Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/\n",
"\n",
"### 1. From the command line or terminal:\n",
"> `conda install -c conda-forge spacy`\n",
"> <br>*or*<br>\n",
"> `pip install -U spacy`\n",
"\n",
"> ### Alternatively you can create a virtual environment:\n",
"> `conda create -n spacyenv python=3 spacy=2`\n",
"\n",
"### 2. Next, also from the command line (you must run this as admin or use sudo):\n",
"\n",
"> `python -m spacy download en`\n",
"\n",
"> ### If successful, you should see a message like:\n",
"\n",
"> **`Linking successful`**<br>\n",
"> ` C:\\Anaconda3\\envs\\spacyenv\\lib\\site-packages\\en_core_web_sm -->`<br>\n",
"> ` C:\\Anaconda3\\envs\\spacyenv\\lib\\site-packages\\spacy\\data\\en`<br>\n",
"> ` `<br>\n",
"> ` You can now load the model via spacy.load('en')`\n"
]
},
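{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're working in Jupyter, you can also run the download step from a notebook cell. A minimal sketch - it assumes the notebook kernel runs in the same environment where you installed spaCy:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download the small English model using the kernel's own Python,\n",
"# so it lands in the same environment the notebook is running in\n",
"import sys\n",
"!{sys.executable} -m spacy download en_core_web_sm"
]
},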
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with spaCy in Python\n",
"\n",
"This is a typical set of instructions for importing and working with spaCy. Don't be surprised if this takes awhile - spaCy has a fairly large library to load:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"ename": "OSError",
"evalue": "[E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mOSError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[3], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Import spaCy and load the language library\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mspacy\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m nlp \u001b[38;5;241m=\u001b[39m spacy\u001b[38;5;241m.\u001b[39mload(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124men_core_web_sm\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 5\u001b[0m \u001b[38;5;66;03m# Create a Doc object\u001b[39;00m\n\u001b[1;32m 6\u001b[0m doc \u001b[38;5;241m=\u001b[39m nlp(\u001b[38;5;124mu\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mTesla is looking at buying U.S. startup for $6 million\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
"File \u001b[0;32m/opt/conda/lib/python3.12/site-packages/spacy/__init__.py:52\u001b[0m, in \u001b[0;36mload\u001b[0;34m(name, vocab, disable, enable, exclude, config)\u001b[0m\n\u001b[1;32m 28\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mload\u001b[39m(\n\u001b[1;32m 29\u001b[0m name: Union[\u001b[38;5;28mstr\u001b[39m, Path],\n\u001b[1;32m 30\u001b[0m \u001b[38;5;241m*\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 35\u001b[0m config: Union[Dict[\u001b[38;5;28mstr\u001b[39m, Any], Config] \u001b[38;5;241m=\u001b[39m util\u001b[38;5;241m.\u001b[39mSimpleFrozenDict(),\n\u001b[1;32m 36\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Language:\n\u001b[1;32m 37\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Load a spaCy model from an installed package or a local path.\u001b[39;00m\n\u001b[1;32m 38\u001b[0m \n\u001b[1;32m 39\u001b[0m \u001b[38;5;124;03m name (str): Package name or model path.\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 50\u001b[0m \u001b[38;5;124;03m RETURNS (Language): The loaded nlp object.\u001b[39;00m\n\u001b[1;32m 51\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m---> 52\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m util\u001b[38;5;241m.\u001b[39mload_model(\n\u001b[1;32m 53\u001b[0m name,\n\u001b[1;32m 54\u001b[0m vocab\u001b[38;5;241m=\u001b[39mvocab,\n\u001b[1;32m 55\u001b[0m disable\u001b[38;5;241m=\u001b[39mdisable,\n\u001b[1;32m 56\u001b[0m enable\u001b[38;5;241m=\u001b[39menable,\n\u001b[1;32m 57\u001b[0m exclude\u001b[38;5;241m=\u001b[39mexclude,\n\u001b[1;32m 58\u001b[0m config\u001b[38;5;241m=\u001b[39mconfig,\n\u001b[1;32m 59\u001b[0m )\n",
"File \u001b[0;32m/opt/conda/lib/python3.12/site-packages/spacy/util.py:531\u001b[0m, in \u001b[0;36mload_model\u001b[0;34m(name, vocab, disable, enable, exclude, config)\u001b[0m\n\u001b[1;32m 529\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m name \u001b[38;5;129;01min\u001b[39;00m OLD_MODEL_SHORTCUTS:\n\u001b[1;32m 530\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIOError\u001b[39;00m(Errors\u001b[38;5;241m.\u001b[39mE941\u001b[38;5;241m.\u001b[39mformat(name\u001b[38;5;241m=\u001b[39mname, full\u001b[38;5;241m=\u001b[39mOLD_MODEL_SHORTCUTS[name])) \u001b[38;5;66;03m# type: ignore[index]\u001b[39;00m\n\u001b[0;32m--> 531\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIOError\u001b[39;00m(Errors\u001b[38;5;241m.\u001b[39mE050\u001b[38;5;241m.\u001b[39mformat(name\u001b[38;5;241m=\u001b[39mname))\n",
"\u001b[0;31mOSError\u001b[0m: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory."
]
}
],
"source": [
"# Import spaCy and load the language library\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"# Create a Doc object\n",
"doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')\n",
"\n",
"# Print each token separately\n",
"for token in doc:\n",
" print(token.text, token.pos_, token.dep_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look very user-friendly, but right away we see some interesting things happen:\n",
"1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence\n",
"2. U.S. is kept together as one entity (we call this a 'token')\n",
"\n",
"As we dive deeper into spaCy we'll see what each of these abbreviations mean and how they're derived. We'll also see how spaCy can interpret the last three tokens combined `$6 million` as referring to ***money***."
]
},
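{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hedged preview of that: processed Doc objects expose named entities through `doc.ents`, and each entity carries a label we can pass to `spacy.explain()`. Named Entity Recognition gets its own lecture later; this sketch just reuses the `doc` object from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview: named entities spaCy found in our original doc\n",
"for ent in doc.ents:\n",
"    print(ent.text, ent.label_, spacy.explain(ent.label_))"
]
},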
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# spaCy Objects\n",
"\n",
"After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"# Pipeline\n",
"When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data. Image source: https://spacy.io/usage/spacy-101#pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"../pipeline1.png\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.pipe_names"
]
},
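{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of that (details come later), `spacy.load()` accepts a `disable` argument that skips components you don't need, which can make processing noticeably faster. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the model without the parser and named entity recognizer;\n",
"# useful when you only need tokenization and POS tags\n",
"nlp_light = spacy.load('en_core_web_sm', disable=['parser', 'ner'])\n",
"nlp_light.pipe_names"
]
},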
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Tokenization\n",
"The first step in processing text is to split up all the component parts (words & punctuation) into \"tokens\". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc2 = nlp(u\"Tesla isn't looking into startups anymore.\")\n",
"\n",
"for token in doc2:\n",
" print(token.text, token.pos_, token.dep_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.\n",
"\n",
"It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc2[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(doc2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Part-of-Speech Tagging (POS)\n",
"The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow \"the\" are typically nouns.\n",
"\n",
"For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc2[0].pos_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Dependencies\n",
"We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.\n",
"\n",
"For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing\n",
"<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc2[0].dep_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the full name of a tag use `spacy.explain(tag)`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spacy.explain('PROPN')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spacy.explain('nsubj')"
]
},
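{
"cell_type": "markdown",
"metadata": {},
"source": [
"spaCy also bundles a visualizer, **displaCy**, that can draw the dependency parse inline in Jupyter. Visualization gets its own lecture later; here's a small taste using `doc2`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Render doc2's dependency parse inline (Jupyter only)\n",
"from spacy import displacy\n",
"\n",
"displacy.render(doc2, style='dep', jupyter=True, options={'distance': 90})"
]
},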
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Additional Token Attributes\n",
"We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Tag|Description|doc2[0].tag|\n",
"|:------|:------:|:------|\n",
"|`.text`|The original word text<!-- .element: style=\"text-align:left;\" -->|`Tesla`|\n",
"|`.lemma_`|The base form of the word|`tesla`|\n",
"|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|\n",
"|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|\n",
"|`.shape_`|The word shape capitalization, punctuation, digits|`Xxxxx`|\n",
"|`.is_alpha`|Is the token an alpha character?|`True`|\n",
"|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Lemmas (the base form of the word):\n",
"print(doc2[4].text)\n",
"print(doc2[4].lemma_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple Parts-of-Speech & Detailed Tags:\n",
"print(doc2[4].pos_)\n",
"print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Word Shapes:\n",
"print(doc2[0].text+': '+doc2[0].shape_)\n",
"print(doc[5].text+' : '+doc[5].shape_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Boolean Values:\n",
"print(doc2[0].is_alpha)\n",
"print(doc2[0].is_stop)"
]
},
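{
"cell_type": "markdown",
"metadata": {},
"source": [
"To tie the table together, here's one way to print several of these attributes side by side for every token in `doc2` (just a formatting sketch using f-strings):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print a selection of token attributes in aligned columns\n",
"for token in doc2:\n",
"    print(f'{token.text:<12} {token.pos_:<8} {token.tag_:<6} {token.lemma_:<12} {token.shape_:<8} {token.is_alpha!s:<6} {token.is_stop}')"
]
},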
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Spans\n",
"Large Doc objects can be hard to work with at times. A **span** is a slice of Doc object in the form `Doc[start:stop]`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc3 = nlp(u'Although commmonly attributed to John Lennon from his song \"Beautiful Boy\", \\\n",
"the phrase \"Life is what happens to us while we are making other plans\" was written by \\\n",
"cartoonist Allen Saunders and published in Reader\\'s Digest in 1957, when Lennon was 17.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"life_quote = doc3[16:30]\n",
"print(life_quote)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(life_quote)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In upcoming lectures we'll see how to create Span objects using `Span()`. This will allow us to assign additional information to the Span."
]
},
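{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged preview, the `Span` constructor takes the parent Doc plus start and end token indices (end-exclusive, just like slicing):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview: build the same span explicitly with the Span constructor\n",
"from spacy.tokens import Span\n",
"\n",
"manual_span = Span(doc3, 16, 30)\n",
"print(manual_span)"
]
},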
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"## Sentences\n",
"Certain tokens inside a Doc object may also receive a \"start of sentence\" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for sent in doc4.sents:\n",
" print(sent)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc4[6].is_sent_start"
]
},
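{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the segments yielded by `Doc.sents` are themselves Span objects, so each one knows its start and end token indices. A small sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each sentence is a Span; print its token boundaries alongside the text\n",
"for sent in doc4.sents:\n",
"    print(sent.start, sent.end, sent.text)"
]
},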
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next up: Tokenization"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}