materi-praktikum/Praktikum Python Code/05-Topic-Modeling/03-LDA-NMF-Assessment-Project-Solutions.ipynb


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Topic Modeling Assessment Project"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task: Import pandas and read in the quora_questions.csv file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"quora = pd.read_csv('quora_questions.csv')"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Question</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>What is the step by step guide to invest in sh...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How can I increase the speed of my internet co...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Why am I mentally very lonely? How can I solve...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Which one dissolve in water quikly sugar, salt...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Question\n",
"0 What is the step by step guide to invest in sh...\n",
"1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
"2 How can I increase the speed of my internet co...\n",
"3 Why am I mentally very lonely? How can I solve...\n",
"4 Which one dissolve in water quikly sugar, salt..."
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quora.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessing\n",
"\n",
"#### Task: Use TF-IDF vectorization to create a document-term matrix. You may want to explore the max_df and min_df parameters."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dtm = tfidf.fit_transform(quora['Question'])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<404289x38669 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 2002912 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Non-negative Matrix Factorization\n",
"\n",
"#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.decomposition import NMF"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nmf_model = NMF(n_components=20,random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,\n",
" n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,\n",
" verbose=0)"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nmf_model.fit(dtm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TASK: Print out the top 15 most common words for each of the 20 topics."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"THE TOP 15 WORDS FOR TOPIC #0\n",
"['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #1\n",
"['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #2\n",
"['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #3\n",
"['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #4\n",
"['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #5\n",
"['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics', 'available', 'job', 'spotify', 'war', 'pakistan', 'india']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #6\n",
"['beginners', 'online', 'english', 'book', 'did', 'hacking', 'want', 'python', 'languages', 'java', 'learning', 'start', 'language', 'programming', 'learn']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #7\n",
"['happen', 'presidency', 'think', 'presidential', '2016', 'vote', 'better', 'election', 'did', 'win', 'hillary', 'president', 'clinton', 'donald', 'trump']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #8\n",
"['russia', 'business', 'win', 'coming', 'countries', 'place', 'pakistan', 'happen', 'end', 'country', 'iii', 'start', 'did', 'war', 'world']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #9\n",
"['indian', 'companies', 'don', 'guy', 'men', 'culture', 'women', 'work', 'girls', 'live', 'girl', 'look', 'sex', 'feel', 'like']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #10\n",
"['ca', 'departments', 'positions', 'movies', 'songs', 'business', 'read', 'start', 'job', 'work', 'engineering', 'ways', 'bad', 'books', 'good']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #11\n",
"['money', 'modi', 'currency', 'economy', 'think', 'government', 'ban', 'banning', 'black', 'indian', 'rupee', 'rs', '1000', 'notes', '500']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #12\n",
"['blowing', 'resolutions', 'resolution', 'mind', 'likes', 'girl', '2017', 'year', 'don', 'employees', 'going', 'day', 'things', 'new', 'know']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #13\n",
"['aspects', 'fluent', 'skill', 'spoken', 'ways', 'language', 'fluently', 'speak', 'communication', 'pronunciation', 'speaking', 'writing', 'skills', 'improve', 'english']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #14\n",
"['diet', 'help', 'healthy', 'exercise', 'month', 'pounds', 'reduce', 'quickly', 'loss', 'fast', 'fat', 'ways', 'gain', 'lose', 'weight']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #15\n",
"['having', 'feel', 'long', 'spend', 'did', 'person', 'machine', 'movies', 'favorite', 'job', 'home', 'sex', 'possible', 'travel', 'time']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #16\n",
"['marriage', 'make', 'did', 'girlfriend', 'feel', 'tell', 'forget', 'really', 'friend', 'true', 'know', 'person', 'girl', 'fall', 'love']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #17\n",
"['easy', 'hack', 'prepare', 'quickest', 'facebook', 'increase', 'painless', 'instagram', 'account', 'best', 'commit', 'fastest', 'suicide', 'easiest', 'way']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #18\n",
"['web', 'java', 'scripting', 'phone', 'mechanical', 'better', 'job', 'use', 'account', 'data', 'software', 'science', 'computer', 'engineering', 'difference']\n",
"\n",
"\n",
"THE TOP 15 WORDS FOR TOPIC #19\n",
"['earth', 'blowing', 'stop', 'use', 'easily', 'mind', 'google', 'flat', 'questions', 'hate', 'believe', 'ask', 'don', 'think', 'people']\n",
"\n",
"\n"
]
}
],
"source": [
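"# Look up the vocabulary once instead of on every loop iteration.\n",
"# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.\n",
"feature_names = tfidf.get_feature_names()\n",
"for index, topic in enumerate(nmf_model.components_):\n",
"    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')\n",
"    # argsort is ascending, so the last 15 indices are the highest-weighted words\n",
"    print([feature_names[i] for i in topic.argsort()[-15:]])\n",
"    print('\\n')"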
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Question</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>What is the step by step guide to invest in sh...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How can I increase the speed of my internet co...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Why am I mentally very lonely? How can I solve...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Which one dissolve in water quikly sugar, salt...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Question\n",
"0 What is the step by step guide to invest in sh...\n",
"1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
"2 How can I increase the speed of my internet co...\n",
"3 Why am I mentally very lonely? How can I solve...\n",
"4 Which one dissolve in water quikly sugar, salt..."
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quora.head()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"topic_results = nmf_model.transform(dtm)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Question</th>\n",
" <th>Topic</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>What is the step by step guide to invest in sh...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How can I increase the speed of my internet co...</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Why am I mentally very lonely? How can I solve...</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Which one dissolve in water quikly sugar, salt...</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Astrology: I am a Capricorn Sun Cap moon and c...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Should I buy tiago?</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>How can I be a good geologist?</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>When do you use シ instead of し?</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Motorola (company): Can I hack my Charter Moto...</td>\n",
" <td>17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Question Topic\n",
"0 What is the step by step guide to invest in sh... 5\n",
"1 What is the story of Kohinoor (Koh-i-Noor) Dia... 16\n",
"2 How can I increase the speed of my internet co... 17\n",
"3 Why am I mentally very lonely? How can I solve... 11\n",
"4 Which one dissolve in water quikly sugar, salt... 14\n",
"5 Astrology: I am a Capricorn Sun Cap moon and c... 1\n",
"6 Should I buy tiago? 0\n",
"7 How can I be a good geologist? 10\n",
"8 When do you use シ instead of し? 19\n",
"9 Motorola (company): Can I hack my Charter Moto... 17"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
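"# Each row of topic_results holds one question's weights for the 20 topics;\n",
"# argmax(axis=1) picks the index of the highest-weighted topic per question.\n",
"quora['Topic'] = topic_results.argmax(axis=1)\n",
"\n",
"quora.head(10)"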
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Great job!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}