{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Semantics and Word Vectors\n", "Sometimes called \"opinion mining\", [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis) defines ***sentiment analysis*** as\n", "
\"the use of natural language processing ... to systematically identify, extract, quantify, and study affective states and subjective information.
\n", "Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event.\"
\n", "\n", "Up to now we've used the occurrence of specific words and word patterns to perform test classifications. In this section we'll take machine learning even further, and try to extract intended meanings from complex phrases. Some simple examples include:\n", "* Python is relatively easy to learn.\n", "* That was the worst movie I've ever seen.\n", "\n", "However, things get harder with phrases like:\n", "* I do not dislike green eggs and ham. (requires negation handling)\n", "\n", "The way this is done is through complex machine learning algorithms like [word2vec](https://en.wikipedia.org/wiki/Word2vec). The idea is to create numerical arrays, or *word embeddings* for every word in a large corpus. Each word is assigned its own vector in such a way that words that frequently appear together in the same context are given vectors that are close together. The result is a model that may not know that a \"lion\" is an animal, but does know that \"lion\" is closer in context to \"cat\" than \"dandelion\".\n", "\n", "It is important to note that *building* useful models takes a long time - hours or days to train a large corpus - and that for our purposes it is best to import an existing model rather than take the time to train our own.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Installing Larger spaCy Models\n", "Up to now we've been using spaCy's smallest English language model, [**en_core_web_sm**](https://spacy.io/models/en#en_core_web_sm) (35MB), which provides vocabulary, syntax, and entities, but not vectors. To take advantage of built-in word vectors we'll need a larger library. We have a few options:\n", "> [**en_core_web_md**](https://spacy.io/models/en#en_core_web_md) (116MB) Vectors: 685k keys, 20k unique vectors (300 dimensions)\n", ">
or
\n", "> [**en_core_web_lg**](https://spacy.io/models/en#en_core_web_lg) (812MB) Vectors: 685k keys, 685k unique vectors (300 dimensions)\n", "\n", "If you plan to rely heavily on word vectors, consider using spaCy's largest vector library containing over one million unique vectors:\n", "> [**en_vectors_web_lg**](https://spacy.io/models/en#en_vectors_web_lg) (631MB) Vectors: 1.1m keys, 1.1m unique vectors (300 dimensions)\n", "\n", "For our purposes **en_core_web_md** should suffice.\n", "\n", "### From the command line (you must run this as admin or use sudo):\n", "\n", "> `activate spacyenv` *if using a virtual environment* \n", "> \n", "> `python -m spacy download en_core_web_md` \n", "> `python -m spacy download en_core_web_lg`   *optional library* \n", "> `python -m spacy download en_vectors_web_lg` *optional library* \n", "\n", "> ### If successful, you should see a message like: \n", ">
\n", "> **Linking successful**
\n", "> C:\\Anaconda3\\envs\\spacyenv\\lib\\site-packages\\en_core_web_md -->
\n", "> C:\\Anaconda3\\envs\\spacyenv\\lib\\site-packages\\spacy\\data\\en_core_web_md
\n", ">
\n", "> You can now load the model via spacy.load('en_core_web_md')
\n", "\n", "Of course, we have a third option, and that is to train our own vectors from a large corpus of documents. Unfortunately this would take a prohibitively large amount of time and processing power. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "# Word Vectors\n", "Word vectors - also called *word embeddings* - are mathematical descriptions of individual words such that words that appear frequently together in the language will have similar values. In this way we can mathematically derive *context*. As mentioned above, the word vector for \"lion\" will be closer in value to \"cat\" than to \"dandelion\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vector values\n", "So what does a word vector look like? Since spaCy employs 300 dimensions, word vectors are stored as 300-item arrays.\n", "\n", "Note that we would see the same set of values with **en_core_web_md** and **en_core_web_lg**, as both were trained using the [word2vec](https://en.wikipedia.org/wiki/Word2vec) family of algorithms." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import spaCy and load the language library\n", "import spacy\n", "nlp = spacy.load('en_core_web_md') # make sure to use a larger model!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([ 1.89630002e-01, -4.03090000e-01, 3.53500009e-01,\n", " -4.79070008e-01, -4.33109999e-01, 2.38570005e-01,\n", " 2.69620001e-01, 6.43320009e-02, 3.07669997e-01,\n", " 1.37119997e+00, -3.75820011e-01, -2.27129996e-01,\n", " -3.56570005e-01, -2.53549993e-01, 1.75429992e-02,\n", " 3.39619994e-01, 7.47229978e-02, 5.12260020e-01,\n", " -3.97590011e-01, 5.13330009e-03, -3.09289992e-01,\n", " 4.89110015e-02, -1.86100006e-01, -4.17019993e-01,\n", " -8.16389978e-01, -1.69080004e-01, -2.62459993e-01,\n", " -1.59830004e-02, 1.24789998e-01, -3.72759998e-02,\n", " -5.71250021e-01, -1.62959993e-01, 1.23760000e-01,\n", " -5.54639995e-02, 1.32440001e-01, 2.75190007e-02,\n", " 1.25919998e-01, -3.27219993e-01, -4.91649985e-01,\n", " -3.55589986e-01, -3.06300014e-01, 6.11849986e-02,\n", " -1.69320002e-01, -6.24050014e-02, 6.57630026e-01,\n", " -2.79249996e-01, -3.04499990e-03, -2.23999992e-02,\n", " -2.80149996e-01, -2.19750002e-01, -4.31879997e-01,\n", " 3.98639999e-02, -2.21019998e-01, -4.26930003e-02,\n", " 5.27479984e-02, 2.87259996e-01, 1.23149998e-01,\n", " -2.86619999e-02, 7.82940015e-02, 4.67539996e-01,\n", " -2.45890006e-01, -1.10639997e-01, 7.22500011e-02,\n", " -9.49800014e-02, -2.75480002e-01, -5.40970027e-01,\n", " 1.28230006e-01, -8.24080035e-02, 3.10350001e-01,\n", " -6.33940026e-02, -7.37550020e-01, -5.49920022e-01,\n", " 9.99990031e-02, -2.07580000e-01, -3.96739990e-02,\n", " 2.06640005e-01, -9.75570008e-02, -3.70920002e-01,\n", " 2.79009998e-01, -6.22179985e-01, -1.02799997e-01,\n", " 2.32710004e-01, 4.38380003e-01, 3.24449986e-02,\n", " -2.98660010e-01, -7.36109987e-02, 7.15939999e-01,\n", " 1.42409995e-01, 2.77700007e-01, -3.98920000e-01,\n", " 3.66559997e-02, 1.57590002e-01, 8.20140019e-02,\n", " -5.73430002e-01, 3.54570001e-01, 2.24910006e-01,\n", " -6.26990020e-01, -8.81059989e-02, 2.43609995e-01,\n", " 3.85329992e-01, -1.40829995e-01, 1.76909998e-01,\n", " 7.08969980e-02, 1.79509997e-01, -4.59069997e-01,\n", " -8.21200013e-01, -2.66309995e-02, 6.25490025e-02,\n", " 4.24149990e-01, -8.96300003e-02, -2.46539995e-01,\n", " 1.41560003e-01, 4.01870012e-01, -4.12319988e-01,\n", 
" 8.45159963e-02, -1.06260002e-01, 7.31450021e-01,\n", " 1.92169994e-01, 1.42399997e-01, 2.85109997e-01,\n", " -2.94539988e-01, -2.19479993e-01, 9.04600024e-01,\n", " -1.90980002e-01, -1.03400004e+00, -1.57539994e-01,\n", " -1.19640000e-01, 4.98879999e-01, -1.06239998e+00,\n", " -3.28200012e-01, -1.12319998e-02, -7.94820011e-01,\n", " 3.72750014e-01, -6.87099993e-03, -2.57719994e-01,\n", " -4.70050007e-01, -4.13870007e-01, -6.40890002e-02,\n", " -2.80330002e-01, -4.07779999e-02, -2.48659992e+00,\n", " 6.24939986e-03, -1.02100000e-02, 1.27519995e-01,\n", " 3.49649996e-01, -1.25709996e-01, 3.15699995e-01,\n", " 4.19259995e-01, 2.00560004e-01, -5.59840024e-01,\n", " -2.28009999e-01, 1.20119996e-01, -2.05180002e-03,\n", " -8.97639990e-02, -8.03729966e-02, 1.19690001e-02,\n", " -2.69780010e-01, 3.48289996e-01, 7.36640021e-03,\n", " -1.11369997e-01, 6.34100020e-01, 3.84490013e-01,\n", " -6.22479975e-01, 4.11450006e-02, 2.59220004e-01,\n", " 6.58110023e-01, -4.95480001e-01, -1.30300000e-01,\n", " -3.82789999e-01, 1.11560002e-01, -4.30849999e-01,\n", " 3.44729990e-01, 2.71090008e-02, -2.51080006e-01,\n", " -2.80110002e-01, 2.16619998e-01, 3.26599985e-01,\n", " 5.58950007e-02, 7.60769993e-02, -5.24800010e-02,\n", " 4.59280014e-02, -2.52660006e-01, 5.28450012e-01,\n", " -1.31449997e-01, -1.24530002e-01, 4.05559987e-01,\n", " 3.18769991e-01, 2.44149994e-02, -2.26199999e-01,\n", " -6.19599998e-01, -4.08859998e-01, -3.55339982e-02,\n", " -5.51229995e-03, 2.34380007e-01, 8.78539979e-01,\n", " -2.51610011e-01, 4.05999988e-01, -4.42840010e-01,\n", " 3.49339992e-01, -5.64289987e-01, -2.36760005e-01,\n", " 6.21990025e-01, -2.81749994e-01, 4.20240015e-01,\n", " 1.00429997e-01, -1.47200003e-01, 4.95929986e-01,\n", " -3.58500004e-01, -1.39980003e-01, -2.74940014e-01,\n", " 2.38270000e-01, 5.72679996e-01, 7.90250003e-02,\n", " 1.78720001e-02, -2.18290001e-01, 5.50500005e-02,\n", " -5.41999996e-01, 1.67879999e-01, 3.90650004e-01,\n", " 3.02089989e-01, 2.30399996e-01, -3.93510014e-02,\n", " -2.10779995e-01, -2.72240013e-01, 1.69070005e-01,\n", " 5.48189998e-01, 9.48880017e-02, 7.97980011e-01,\n", " -6.61579967e-02, 1.98440000e-01, 2.03070000e-01,\n", " 4.48080003e-02, -1.02399997e-01, -6.99089989e-02,\n", " -3.67560014e-02, 9.51590016e-02, -2.78299987e-01,\n", " -1.05970003e-01, -1.62760004e-01, -1.82109997e-01,\n", " -3.18969995e-01, -2.16330007e-01, 1.49939999e-01,\n", " -7.20570013e-02, 2.22639993e-01, -4.55509990e-01,\n", " 3.03409994e-01, 1.84310004e-01, 2.16810003e-01,\n", " -3.19400012e-01, 2.64259994e-01, 5.81059992e-01,\n", " 5.46349995e-02, 6.32380009e-01, 4.31690007e-01,\n", " 9.03429985e-02, 1.94940001e-01, 3.54829997e-01,\n", " -2.07059998e-02, -7.31169999e-01, 1.29409999e-01,\n", " 1.74180001e-01, -1.50649995e-01, 5.33550009e-02,\n", " 4.47940007e-02, -1.65999994e-01, 2.20070004e-01,\n", " -5.39699972e-01, -2.49679998e-01, -2.64640003e-01,\n", " -5.55149972e-01, 5.82419991e-01, 2.22949997e-01,\n", " 2.44330004e-01, 4.52749997e-01, 3.46929997e-01,\n", " 1.22550003e-01, -3.90589982e-02, -3.27490002e-01,\n", " -2.78910011e-01, 1.37659997e-01, 3.83920014e-01,\n", " 1.05430000e-03, -1.02420002e-02, 4.92049992e-01,\n", " -1.79220006e-01, 4.12149988e-02, 1.35470003e-01,\n", " -2.05980003e-01, -2.31940001e-01, -7.77010024e-01,\n", " -3.82369995e-01, -7.63830006e-01, 1.94179997e-01,\n", " -1.54410005e-01, 8.97400022e-01, 3.06259990e-01,\n", " 4.03759986e-01, 2.17380002e-01, -3.80499989e-01], dtype=float32)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"nlp(u'lion').vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's interesting is that Doc and Span objects themselves have vectors, derived from the averages of individual token vectors.
This makes it possible to compare similarities between whole documents." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -1.96635887e-01, -2.32740352e-03, -5.36607020e-02,\n", " -6.10564947e-02, -4.08843048e-02, 1.45266443e-01,\n", " -1.08268000e-01, -6.27789786e-03, 1.48455709e-01,\n", " 1.90697408e+00, -2.57692993e-01, -1.95818534e-03,\n", " -1.16141019e-02, -1.62858292e-01, -1.62938282e-01,\n", " 1.18210977e-02, 5.12646027e-02, 1.00078702e+00,\n", " -2.01447997e-02, -2.54611671e-01, -1.28316596e-01,\n", " -1.97198763e-02, -2.89733019e-02, -1.94347113e-01,\n", " 1.26644447e-01, -8.69869068e-02, -2.20812604e-01,\n", " -1.58452198e-01, 9.86308008e-02, -1.79210991e-01,\n", " -1.55290633e-01, 1.95643142e-01, 2.66436003e-02,\n", " -1.64984968e-02, 1.18824698e-01, -1.17830629e-03,\n", " 4.99809943e-02, -4.23077159e-02, -3.86111848e-02,\n", " -7.47400150e-03, 1.23448208e-01, 9.60620027e-03,\n", " -3.32463719e-02, -1.77848607e-01, 1.19390726e-01,\n", " 1.87545009e-02, -1.84173390e-01, 6.91781715e-02,\n", " 1.28520593e-01, 1.48827005e-02, -1.78013414e-01,\n", " 1.10003807e-01, -3.35464999e-02, -1.52476998e-02,\n", " -9.41195935e-02, 1.58633105e-02, -1.29811959e-02,\n", " 1.40140295e-01, -1.47720069e-01, -3.81718054e-02,\n", " 4.66808230e-02, 3.31423879e-02, 7.97965974e-02,\n", " 1.60014004e-01, 8.90410226e-03, -1.01237908e-01,\n", " 7.39663988e-02, 2.47380026e-02, 4.26153988e-02,\n", " 9.66729969e-02, 2.87616011e-02, 7.22841993e-02,\n", " 1.76565602e-01, 7.55538046e-02, 1.10501610e-01,\n", " -1.02358103e-01, -5.43345436e-02, -4.12176028e-02,\n", " 3.98623049e-02, -2.98339734e-03, -5.32988012e-02,\n", " 1.90624595e-01, -6.42587021e-02, -1.76225007e-02,\n", " 3.94165330e-02, -1.14773512e-01, 4.25241649e-01,\n", " 2.07243040e-01, 2.60730416e-01, 1.31226778e-01,\n", " -8.00508037e-02, 6.88939020e-02, 7.05293044e-02,\n", " -1.10744104e-01, 4.14580032e-02, 5.13269613e-03,\n", " -1.29179001e-01, -5.84542975e-02, 9.13560018e-02,\n", " -1.75975591e-01, 9.52741057e-02, 1.37699964e-02,\n", " -1.30865201e-01, -4.76420000e-02, 1.61670998e-01,\n", " -6.76959991e-01, 2.68619388e-01, -7.94106945e-02,\n", " 8.56394917e-02, -5.94138019e-02, 7.44821057e-02,\n", " -1.67490095e-01, 1.97447598e-01, -2.71580786e-01,\n", " 1.51915969e-02, 1.12019002e-01, -4.32585999e-02,\n", " -1.03554968e-02, 6.33272156e-02, 5.20200143e-03,\n", " 4.94491048e-02, -1.07016601e-01, -6.45387918e-02,\n", " -1.76269561e-01, -1.98135704e-01, 4.17800918e-02,\n", " 1.23686995e-02, -1.13280594e-01, -4.03523073e-02,\n", " -4.21132054e-03, -9.65667963e-02, -7.12300017e-02,\n", " -2.19088510e-01, 6.41715974e-02, 1.11634992e-01,\n", " -7.12868944e-02, -8.27060193e-02, 1.53889004e-02,\n", " 6.84699565e-02, -5.50561920e-02, -1.84788990e+00,\n", " -4.75010052e-02, 1.31487206e-01, 1.03359401e-01,\n", " 1.80857688e-01, -8.03041980e-02, 2.27739997e-02,\n", " 5.56868985e-02, 9.20986086e-02, 6.22248054e-02,\n", " 4.86670025e-02, -4.06427011e-02, 3.83703932e-02,\n", " -4.05869968e-02, -2.26339817e-01, 3.69174965e-02,\n", " -1.30066186e-01, 1.27621710e-01, 2.76701003e-02,\n", " -1.39992401e-01, -3.75526994e-02, -8.11104029e-02,\n", " -1.78196102e-01, -1.21652998e-01, -5.88919744e-02,\n", " -1.06128812e-01, -4.72390745e-03, -1.14130601e-01,\n", " -7.60087445e-02, -9.48704034e-02, 1.68780401e-01,\n", " 3.82669941e-02, -1.68303996e-01, -1.30991384e-01,\n", " -2.46409744e-01, 1.42855030e-02, 1.23633012e-01,\n", " 7.95699935e-03, -3.22283022e-02, 3.75844017e-02,\n", " -4.48104031e-02, 
-2.00578898e-01, -2.86081016e-01,\n", " -1.83181003e-01, -5.46899159e-04, 6.52990937e-02,\n", " 2.34263036e-02, -7.60660022e-02, 1.13897599e-01,\n", " -7.05380812e-02, 1.30277812e-01, 2.83973999e-02,\n", " 1.73887815e-02, -1.71358977e-02, 1.78455990e-02,\n", " 1.86773703e-01, 1.83613986e-01, -4.05438878e-02,\n", " 1.28929759e-03, -3.71900201e-03, -1.97373003e-01,\n", " 4.78463694e-02, -2.21408010e-01, 2.68826094e-02,\n", " 2.40951017e-01, 7.42616802e-02, 7.53984973e-02,\n", " -7.67349079e-02, -5.37766796e-03, -8.06540065e-03,\n", " 1.88790001e-02, 8.31135064e-02, -5.20760007e-02,\n", " 1.29393607e-01, 4.09864075e-02, 7.31946975e-02,\n", " -1.64099425e-01, 1.17529690e-01, -6.96440935e-02,\n", " 1.91028208e-01, 1.01721905e-01, 6.34808987e-02,\n", " -8.29815865e-02, -6.95784390e-03, -1.69757873e-01,\n", " -2.02478573e-01, 3.65395918e-02, 1.32345587e-01,\n", " 3.53013016e-02, 2.27603033e-01, -1.52753398e-01,\n", " 7.80210178e-03, 2.06879750e-02, -8.63540452e-03,\n", " 9.85722095e-02, -2.91380938e-02, -1.42988954e-02,\n", " -9.39018354e-02, 1.43968105e-01, 7.82396942e-02,\n", " -1.93540990e-01, -9.36544985e-02, -8.23533013e-02,\n", " 4.40272018e-02, -2.22195080e-03, -1.29856914e-01,\n", " -1.53841600e-01, -1.55329984e-02, -2.55266696e-01,\n", " 1.14425398e-01, -1.03161987e-02, -4.66439016e-02,\n", " -5.69390282e-02, 7.72780031e-02, 1.28908500e-01,\n", " 1.61679000e-01, 1.50837511e-01, 6.18334934e-02,\n", " -9.06937942e-02, -3.52137014e-02, 1.35956988e-01,\n", " 7.52059072e-02, 5.73905036e-02, -1.65402606e-01,\n", " 1.68419987e-01, -1.83722824e-01, 5.91069926e-03,\n", " -1.25354990e-01, 3.95771042e-02, 5.67352995e-02,\n", " -5.63519308e-03, 1.53597593e-01, -6.84822723e-02,\n", " -1.40976995e-01, -3.62732522e-02, -2.61475928e-02,\n", " 2.50091963e-02, 1.18994810e-01, -2.66857035e-02,\n", " 7.50442073e-02, 2.04583794e-01, 4.37736101e-02,\n", " -8.17096978e-02, 6.80228025e-02, 5.50465994e-02,\n", " -2.39979066e-02, 7.68290013e-02, -5.76773956e-02,\n", " 8.30340981e-02, 3.63199934e-02, -1.65820405e-01,\n", " 2.55408939e-02, -5.30679002e-02, -1.35961995e-01,\n", " -1.03501797e-01, 1.36406809e-01, 9.66293067e-02,\n", " 7.33902007e-02, -1.83055893e-01, -2.73141060e-02], dtype=float32)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc = nlp(u'The quick brown fox jumped over the lazy dogs.')\n", "\n", "doc.vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identifying similar vectors\n", "The best way to expose vector relationships is through the `.similarity()` method of Doc tokens." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lion lion 1.0\n", "lion cat 0.526544\n", "lion pet 0.399238\n", "cat lion 0.526544\n", "cat cat 1.0\n", "cat pet 0.750546\n", "pet lion 0.399238\n", "pet cat 0.750546\n", "pet pet 1.0\n" ] } ], "source": [ "# Create a three-token Doc object:\n", "tokens = nlp(u'lion cat pet')\n", "\n", "# Iterate through token combinations:\n", "for token1 in tokens:\n", " for token2 in tokens:\n", " print(token1.text, token2.text, token1.similarity(token2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that order doesn't matter. 
`token1.similarity(token2)` has the same value as `token2.similarity(token1)`.\n", "#### To view this as a table:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "<table><tr><th></th><th>lion</th><th>cat</th><th>pet</th></tr><tr><td>**lion**</td><td>1.0</td><td>0.5265</td><td>0.3992</td></tr><tr><td>**cat**</td><td>0.5265</td><td>1.0</td><td>0.7505</td></tr><tr><td>**pet**</td><td>0.3992</td><td>0.7505</td><td>1.0</td></tr></table>" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# For brevity, assign each token a name\n", "a,b,c = tokens\n", "\n", "# Display as a Markdown table (this only works in Jupyter!)\n", "from IPython.display import Markdown, display\n",
"display(Markdown(f'<table><tr><th></th><th>{a.text}</th><th>{b.text}</th><th>{c.text}</th></tr>\\\n", "<tr><td>**{a.text}**</td><td>{a.similarity(a):{.4}}</td><td>{b.similarity(a):{.4}}</td><td>{c.similarity(a):{.4}}</td></tr>\\\n", "<tr><td>**{b.text}**</td><td>{a.similarity(b):{.4}}</td><td>{b.similarity(b):{.4}}</td><td>{c.similarity(b):{.4}}</td></tr>\\\n", "<tr><td>**{c.text}**</td><td>{a.similarity(c):{.4}}</td><td>{b.similarity(c):{.4}}</td><td>{c.similarity(c):{.4}}</td></tr></table>'))"
\\\n", "\\\n", "\\\n", "'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, we see the strongest similarity between \"cat\" and \"pet\", the weakest between \"lion\" and \"pet\", and some similarity between \"lion\" and \"cat\". A word will have a perfect (1.0) similarity with itself.\n", "\n", "If you're curious, the similarity between \"lion\" and \"dandelion\" is very small:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.18064451829601527" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp(u'lion').similarity(nlp(u'dandelion'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Opposites are not necessarily different\n", "Words that have opposite meaning, but that often appear in the same *context* may have similar vectors." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "like like 1.0\n", "like love 0.657904\n", "like hate 0.657465\n", "love like 0.657904\n", "love love 1.0\n", "love hate 0.63931\n", "hate like 0.657465\n", "hate love 0.63931\n", "hate hate 1.0\n" ] } ], "source": [ "# Create a three-token Doc object:\n", "tokens = nlp(u'like love hate')\n", "\n", "# Iterate through token combinations:\n", "for token1 in tokens:\n", " for token2 in tokens:\n", " print(token1.text, token2.text, token1.similarity(token2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vector norms\n", "It's sometimes helpful to aggregate 300 dimensions into a [Euclidian (L2) norm](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm), computed as the square root of the sum-of-squared-vectors. This is accessible as the `.vector_norm` token attribute. Other helpful attributes include `.has_vector` and `.is_oov` or *out of vocabulary*.\n", "\n", "For example, our 685k vector library may not have the word \"[nargle](https://en.wikibooks.org/wiki/Muggles%27_Guide_to_Harry_Potter/Magic/Nargle)\". To test this:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dog True 7.03367 False\n", "cat True 6.68082 False\n", "nargle False 0.0 True\n" ] } ], "source": [ "tokens = nlp(u'dog cat nargle')\n", "\n", "for token in tokens:\n", " print(token.text, token.has_vector, token.vector_norm, token.is_oov)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed we see that \"nargle\" does not have a vector, so the vector_norm value is zero, and it identifies as *out of vocabulary*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vector arithmetic\n", "Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests\n", "
\"king\" - \"man\" + \"woman\" = \"queen\"
\n", "Let's try it out!" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['king', 'queen', 'commoner', 'highness', 'prince', 'sultan', 'maharajas', 'princes', 'kumbia', 'kings']\n" ] } ], "source": [ "from scipy import spatial\n", "\n", "cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)\n", "\n", "king = nlp.vocab['king'].vector\n", "man = nlp.vocab['man'].vector\n", "woman = nlp.vocab['woman'].vector\n", "\n", "# Now we find the closest vector in the vocabulary to the result of \"man\" - \"woman\" + \"queen\"\n", "new_vector = king - man + woman\n", "computed_similarities = []\n", "\n", "for word in nlp.vocab:\n", " # Ignore words without vectors and mixed-case words:\n", " if word.has_vector:\n", " if word.is_lower:\n", " if word.is_alpha:\n", " similarity = cosine_similarity(new_vector, word.vector)\n", " computed_similarities.append((word, similarity))\n", "\n", "computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])\n", "\n", "print([w[0].text for w in computed_similarities[:10]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in this case, \"king\" was still closer than \"queen\" to our calculated vector, although \"queen\" did show up!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next up: Sentiment Analysis" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }