{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Latent Dirichlet Allocation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"We will be using articles from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"npr = pd.read_csv('npr.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Article</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>In the Washington of 2016, even when the polic...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Donald Trump has used Twitter — his prefe...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Donald Trump is unabashedly praising Russian...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Updated at 2:50 p. m. ET, Russian President Vl...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>From photography, illustration and video, to d...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Article\n",
"0 In the Washington of 2016, even when the polic...\n",
"1 Donald Trump has used Twitter — his prefe...\n",
"2 Donald Trump is unabashedly praising Russian...\n",
"3 Updated at 2:50 p. m. ET, Russian President Vl...\n",
"4 From photography, illustration and video, to d..."
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"npr.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the articles come without topic labels. Let's use LDA to try to discover clusters of related articles."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>\n",
"When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute document count. This parameter is ignored if `vocabulary` is not None.\n",
"\n",
"**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>\n",
"When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute document count. This parameter is ignored if `vocabulary` is not None."
]
},
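{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of these two thresholds (not part of the original notebook), the sketch below fits a `CountVectorizer` on a made-up four-document corpus. The names `toy_docs` and `toy_cv` and the threshold values are assumptions chosen only for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Hypothetical toy corpus: 'apple' is in all 4 documents, 'banana' in 3, every other word in 1\n",
"toy_docs = [\n",
"    'apple banana cherry',\n",
"    'apple banana date',\n",
"    'apple banana elderberry',\n",
"    'apple fig',\n",
"]\n",
"\n",
"# max_df=0.9 drops terms found in more than 90% of documents ('apple');\n",
"# min_df=2 drops terms found in fewer than 2 documents (the one-off words)\n",
"toy_cv = CountVectorizer(max_df=0.9, min_df=2)\n",
"toy_cv.fit(toy_docs)\n",
"\n",
"# vocabulary_ maps each surviving term to its column index\n",
"print(sorted(toy_cv.vocabulary_))  # ['banana']"
]
},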
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dtm = cv.fit_transform(npr['Article'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11992x54777 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 3033388 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LDA"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.decomposition import LatentDirichletAllocation"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"LDA = LatentDirichletAllocation(n_components=7,random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n",
" evaluate_every=-1, learning_decay=0.7,\n",
" learning_method='batch', learning_offset=10.0,\n",
" max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,\n",
" n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,\n",
" random_state=42, topic_word_prior=None,\n",
" total_samples=1000000.0, verbose=0)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This can take a while; we're dealing with a large number of documents!\n",
"LDA.fit(dtm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Showing Stored Words"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"54777"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
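"# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() there\n",
"len(cv.get_feature_names())"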
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cred\n",
"fairly\n",
"occupational\n",
"temer\n",
"tamil\n",
"closest\n",
"condone\n",
"breathes\n",
"tendrils\n",
"pivot\n"
]
}
],
"source": [
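"# Fetch the vocabulary once and derive the index bound from it instead of hard-coding 54776\n",
"feature_names = cv.get_feature_names()\n",
"for i in range(10):\n",
"    random_word_id = random.randint(0, len(feature_names) - 1)\n",
"    print(feature_names[random_word_id])"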
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"foremothers\n",
"mocoa\n",
"ellroy\n",
"liron\n",
"ally\n",
"discouraged\n",
"utterance\n",
"provo\n",
"videgaray\n",
"archivist\n"
]
}
],
"source": [
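"# Fetch the vocabulary once and derive the index bound from it instead of hard-coding 54776\n",
"feature_names = cv.get_feature_names()\n",
"for i in range(10):\n",
"    random_word_id = random.randint(0, len(feature_names) - 1)\n",
"    print(feature_names[random_word_id])"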
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Showing Top Words Per Topic"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(LDA.components_)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,\n",
" 1.43006821e-01, 1.42902042e-01, 1.42861626e-01],\n",
" [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,\n",
" 1.42861973e-01, 1.42857147e-01, 1.42906875e-01],\n",
" [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,\n",
" 6.14236247e+00, 2.14061364e+00, 1.42923753e-01],\n",
" ...,\n",
" [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,\n",
" 1.42859912e-01, 1.42857146e-01, 1.42866614e-01],\n",
" [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,\n",
" 1.43107628e-01, 1.43902481e-01, 2.14271779e+00],\n",
" [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,\n",
" 1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LDA.components_"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"54777"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(LDA.components_[0])"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"single_topic = LDA.components_[0]"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Returns the indices that would sort this array.\n",
"single_topic.argsort()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.14285714309286987"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A word with near-minimal weight for this topic (second entry in the ascending argsort order)\n",
"single_topic[18302]"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6247.245510521082"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Word most representative of this topic\n",
"single_topic[42993]"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([33390, 36310, 21228, 10425, 31464, 8149, 36283, 22673, 42561,\n",
" 42993], dtype=int64)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Top 10 words for this topic:\n",
"single_topic.argsort()[-10:]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"top_word_indices = single_topic.argsort()[-10:]"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"new\n",
"percent\n",
"government\n",
"company\n",
"million\n",
"care\n",
"people\n",
"health\n",
"said\n",
"says\n"
]
}
],
"source": [
"feature_names = cv.get_feature_names()  # build the vocabulary list once, not on every iteration\n",
"for index in top_word_indices:\n",
"    print(feature_names[index])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These look like business articles, perhaps. Let's confirm by using `.transform()` on our vectorized articles to attach a topic number to each one. But first, let's view all 7 topics found."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"THE TOP 15 WORDS FOR TOPIC #0\n",
"['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #1\n",
"['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #2\n",
"['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #3\n",
"['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #4\n",
"['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #5\n",
"['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #6\n",
"['student', 'years', 'data', 'science', 'university', 'people', 'time', 'schools', 'just', 'education', 'new', 'like', 'students', 'school', 'says']\n",
"\n",
"\n"
]
}
],
"source": [
"feature_names = cv.get_feature_names()  # build the vocabulary list once, not once per topic\n",
"for index, topic in enumerate(LDA.components_):\n",
"    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')\n",
"    print([feature_names[i] for i in topic.argsort()[-15:]])\n",
"    print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attaching Discovered Topic Labels to Original Articles"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11992x54777 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 3033388 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(11992, 54777)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm.shape"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11992"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(npr)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"topic_results = LDA.transform(dtm)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(11992, 7)"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_results.shape"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,\n",
" 2.99652737e-01, 2.25479379e-04, 2.25497980e-04])"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_results[0]"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.02, 0.68, 0. , 0. , 0.3 , 0. , 0. ])"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_results[0].round(2)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_results[0].argmax()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means the model gives the first article its highest probability (about 0.68) for topic #1, so we label it as topic #1."
]
},
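{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `argmax()` label keeps only the single strongest topic. As a small sketch (not part of the original notebook), the cell below uses the rounded values of `topic_results[0]` shown above to list the two strongest topics. The names `first_article` and `top_two` are assumptions made for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Rounded topic distribution of the first article, copied from the output above\n",
"first_article = np.array([0.02, 0.68, 0.0, 0.0, 0.30, 0.0, 0.0])\n",
"\n",
"# argsort() is ascending, so reverse it and keep the two largest entries\n",
"top_two = first_article.argsort()[::-1][:2]\n",
"for topic_id in top_two:\n",
"    print(f'topic #{topic_id}: {first_article[topic_id]:.2f}')"
]
},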
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining with Original Data"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Article</th>\n",
" <th>Topic</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>In the Washington of 2016, even when the polic...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Donald Trump has used Twitter — his prefe...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Donald Trump is unabashedly praising Russian...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Updated at 2:50 p. m. ET, Russian President Vl...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>From photography, illustration and video, to d...</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Article Topic\n",
"0 In the Washington of 2016, even when the polic... 1\n",
"1 Donald Trump has used Twitter — his prefe... 1\n",
"2 Donald Trump is unabashedly praising Russian... 1\n",
"3 Updated at 2:50 p. m. ET, Russian President Vl... 1\n",
"4 From photography, illustration and video, to d... 6"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"npr.head()"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, ..., 3, 4, 0], dtype=int64)"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_results.argmax(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"npr['Topic'] = topic_results.argmax(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Article</th>\n",
" <th>Topic</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>In the Washington of 2016, even when the polic...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Donald Trump has used Twitter — his prefe...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Donald Trump is unabashedly praising Russian...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Updated at 2:50 p. m. ET, Russian President Vl...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>From photography, illustration and video, to d...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>I did not want to join yoga class. I hated tho...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>With a who has publicly supported the debunk...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>I was standing by the airport exit, debating w...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>If movies were trying to be more realistic, pe...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Eighteen years ago, on New Years Eve, David F...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Article Topic\n",
"0 In the Washington of 2016, even when the polic... 1\n",
"1 Donald Trump has used Twitter — his prefe... 1\n",
"2 Donald Trump is unabashedly praising Russian... 1\n",
"3 Updated at 2:50 p. m. ET, Russian President Vl... 1\n",
"4 From photography, illustration and video, to d... 2\n",
"5 I did not want to join yoga class. I hated tho... 3\n",
"6 With a who has publicly supported the debunk... 3\n",
"7 I was standing by the airport exit, debating w... 2\n",
"8 If movies were trying to be more realistic, pe... 3\n",
"9 Eighteen years ago, on New Years Eve, David F... 2"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"npr.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Great work!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}