{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP Basics Assessment - Solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890).
The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# RUN THIS CELL to perform standard imports:\n", "import spacy\n", "nlp = spacy.load('en_core_web_sm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. Create a Doc object from the file `owlcreek.txt`**
\n", "> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Enter your code here:\n", "\n", "with open('../TextFiles/owlcreek.txt') as f:\n", " doc = nlp(f.read())" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AN OCCURRENCE AT OWL CREEK BRIDGE\n", "\n", "by Ambrose Bierce\n", "\n", "I\n", "\n", "A man stood upon a railroad bridge in northern Alabama, looking down\n", "into the swift water twenty feet below. " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Run this cell to verify it worked:\n", "\n", "doc[:36]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. How many tokens are contained in the file?**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4833" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. How many sentences are contained in the file?**
HINT: You'll want to build a list first!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "211" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# doc.sents is a generator, so materialize it as a list before calling len():\n", "sents = list(doc.sents)\n", "len(sents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Print the second sentence in the document**
HINT: Indexing starts at zero, and the title counts as the first sentence." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A man stood upon a railroad bridge in northern Alabama, looking down\n", "into the swift water twenty feet below. \n" ] } ], "source": [ "print(sents[1].text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`
\n", "CHALLENGE: Have values line up in columns in the print output.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A DET det a\n", "man NOUN nsubj man\n", "stood VERB ROOT stand\n", "upon ADP prep upon\n", "a DET det a\n", "railroad NOUN compound railroad\n", "bridge NOUN pobj bridge\n", "in ADP prep in\n", "northern ADJ amod northern\n", "Alabama PROPN pobj alabama\n", ", PUNCT punct ,\n", "looking VERB advcl look\n", "down PART prt down\n", "\n", " SPACE \n", "\n", "into ADP prep into\n", "the DET det the\n", "swift ADJ amod swift\n", "water NOUN pobj water\n", "twenty NUM nummod twenty\n", "feet NOUN npadvmod foot\n", "below ADV advmod below\n", ". PUNCT punct .\n", " SPACE \n" ] } ], "source": [ "# NORMAL SOLUTION:\n", "for token in sents[1]:\n", " print(token.text, token.pos_, token.dep_, token.lemma_)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A DET det a \n", "man NOUN nsubj man \n", "stood VERB ROOT stand \n", "upon ADP prep upon \n", "a DET det a \n", "railroad NOUN compound railroad \n", "bridge NOUN pobj bridge \n", "in ADP prep in \n", "northern ADJ amod northern \n", "Alabama PROPN pobj alabama \n", ", PUNCT punct , \n", "looking VERB advcl look \n", "down PART prt down \n", "\n", " SPACE \n", " \n", "into ADP prep into \n", "the DET det the \n", "swift ADJ amod swift \n", "water NOUN pobj water \n", "twenty NUM nummod twenty \n", "feet NOUN npadvmod foot \n", "below ADV advmod below \n", ". PUNCT punct . \n", " SPACE \n" ] } ], "source": [ "# CHALLENGE SOLUTION:\n", "for token in sents[1]:\n", " print(f'{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}} {token.lemma_:{15}}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase \"swimming vigorously\" in the text**
\n", "HINT: You should include an `'IS_SPACE': True` pattern between the two words!" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Import the Matcher class and create a matcher on the pipeline's vocab:\n", "\n", "from spacy.matcher import Matcher\n", "matcher = Matcher(nlp.vocab)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Create a pattern and add it to matcher:\n", "\n", "pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]\n", "\n", "# spaCy v2 signature shown here; in spaCy v3 the call is matcher.add('Swimming', [pattern])\n", "matcher.add('Swimming', None, pattern)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]\n" ] } ], "source": [ "# Create a list of matches called \"found_matches\" and print the list:\n", "\n", "found_matches = matcher(doc)\n", "print(found_matches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7. Print the text surrounding each found match**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "By diving I could evade the bullets and, swimming\n", "vigorously, reach the bank, take to the woods and get away home\n" ] } ], "source": [ "print(doc[1265:1290])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "over his shoulder; he was now swimming\n", "vigorously with the current. \n" ] } ], "source": [ "print(doc[3600:3615])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXTRA CREDIT:
Print the *sentence* that contains each found match**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "By diving I could evade the bullets and, swimming\n", "vigorously, reach the bank, take to the woods and get away home. \n" ] } ], "source": [ "for sent in sents:\n", " if found_matches[0][1] < sent.end:\n", " print(sent)\n", " break" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The hunted man saw all this over his shoulder; he was now swimming\n", "vigorously with the current. \n" ] } ], "source": [ "for sent in sents:\n", " if found_matches[1][1] < sent.end:\n", " print(sent)\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Great Job!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }