{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP Basics Assessment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# RUN THIS CELL to perform standard imports:\n",
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. Create a Doc object from the file `owlcreek.txt`**<br>\n",
    "> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Enter your code here:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AN OCCURRENCE AT OWL CREEK BRIDGE\n",
       "\n",
       "by Ambrose Bierce\n",
       "\n",
       "I\n",
       "\n",
       "A man stood upon a railroad bridge in northern Alabama, looking down\n",
       "into the swift water twenty feet below. "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run this cell to verify it worked:\n",
    "\n",
    "doc[:36]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. How many tokens are contained in the file?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4833"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "211"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A man stood upon a railroad bridge in northern Alabama, looking down\n",
      "into the swift water twenty feet below. \n"
     ]
    }
   ],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>\n",
    "CHALLENGE: Have the values line up in columns in the print output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A DET det a\n",
      "man NOUN nsubj man\n",
      "stood VERB ROOT stand\n",
      "upon ADP prep upon\n",
      "a DET det a\n",
      "railroad NOUN compound railroad\n",
      "bridge NOUN pobj bridge\n",
      "in ADP prep in\n",
      "northern ADJ amod northern\n",
      "Alabama PROPN pobj alabama\n",
      ", PUNCT punct ,\n",
      "looking VERB advcl look\n",
      "down PART prt down\n",
      "\n",
      " SPACE \n",
      "\n",
      "into ADP prep into\n",
      "the DET det the\n",
      "swift ADJ amod swift\n",
      "water NOUN pobj water\n",
      "twenty NUM nummod twenty\n",
      "feet NOUN npadvmod foot\n",
      "below ADV advmod below\n",
      ". PUNCT punct .\n",
      " SPACE \n"
     ]
    }
   ],
   "source": [
    "# NORMAL SOLUTION:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A DET det a \n",
      "man NOUN nsubj man \n",
      "stood VERB ROOT stand \n",
      "upon ADP prep upon \n",
      "a DET det a \n",
      "railroad NOUN compound railroad \n",
      "bridge NOUN pobj bridge \n",
      "in ADP prep in \n",
      "northern ADJ amod northern \n",
      "Alabama PROPN pobj alabama \n",
      ", PUNCT punct , \n",
      "looking VERB advcl look \n",
      "down PART prt down \n",
      "\n",
      " SPACE \n",
      " \n",
      "into ADP prep into \n",
      "the DET det the \n",
      "swift ADJ amod swift \n",
      "water NOUN pobj water \n",
      "twenty NUM nummod twenty \n",
      "feet NOUN npadvmod foot \n",
      "below ADV advmod below \n",
      ". PUNCT punct . \n",
      " SPACE \n"
     ]
    }
   ],
   "source": [
    "# CHALLENGE SOLUTION:\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase \"swimming vigorously\" in the text**<br>\n",
    "HINT: In both occurrences a line break separates the two words, so include an `'IS_SPACE': True` pattern between them!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the Matcher library:\n",
    "\n",
    "from spacy.matcher import Matcher\n",
    "matcher = Matcher(nlp.vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a pattern and add it to matcher:\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]\n"
     ]
    }
   ],
   "source": [
    "# Create a list of matches called \"found_matches\" and print the list:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7. Print the text surrounding each found match**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "By diving I could evade the bullets and, swimming\n",
      "vigorously, reach the bank, take to the woods and get away home\n"
     ]
    }
   ],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "over his shoulder; he was now swimming\n",
      "vigorously with the current. \n"
     ]
    }
   ],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "By diving I could evade the bullets and, swimming\n",
      "vigorously, reach the bank, take to the woods and get away home. \n"
     ]
    }
   ],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The hunted man saw all this over his shoulder; he was now swimming\n",
      "vigorously with the current. \n"
     ]
    }
   ],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Great Job!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}