{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "\n", " \n", "___" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Topic Modeling Assessment Project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task: Import pandas and read in the quora_questions.csv file." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "quora = pd.read_csv('quora_questions.csv')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
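<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Question</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>What is the step by step guide to invest in sh...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>How can I increase the speed of my internet co...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Why am I mentally very lonely? How can I solve...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>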
" ], "text/plain": [ " Question\n", "0 What is the step by step guide to invest in sh...\n", "1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n", "2 How can I increase the speed of my internet co...\n", "3 Why am I mentally very lonely? How can I solve...\n", "4 Which one dissolve in water quikly sugar, salt..." ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quora.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing\n", "\n", "#### Task: Use TF-IDF vectorization to create a document-term matrix. You may want to explore the max_df and min_df parameters." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dtm = tfidf.fit_transform(quora['Question'])" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<404289x38669 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 2002912 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Non-negative Matrix Factorization\n", "\n", "#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)" 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.decomposition import NMF" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nmf_model = NMF(n_components=20, random_state=42)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,\n", " n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,\n", " verbose=0)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nmf_model.fit(dtm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TASK: Print out the top 15 most common words for each of the 20 topics." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "THE TOP 15 WORDS FOR TOPIC #0\n", "['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #1\n", "['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #2\n", "['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #3\n", "['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #4\n", "['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #5\n", 
"['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics', 'available', 'job', 'spotify', 'war', 'pakistan', 'india']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #6\n", "['beginners', 'online', 'english', 'book', 'did', 'hacking', 'want', 'python', 'languages', 'java', 'learning', 'start', 'language', 'programming', 'learn']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #7\n", "['happen', 'presidency', 'think', 'presidential', '2016', 'vote', 'better', 'election', 'did', 'win', 'hillary', 'president', 'clinton', 'donald', 'trump']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #8\n", "['russia', 'business', 'win', 'coming', 'countries', 'place', 'pakistan', 'happen', 'end', 'country', 'iii', 'start', 'did', 'war', 'world']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #9\n", "['indian', 'companies', 'don', 'guy', 'men', 'culture', 'women', 'work', 'girls', 'live', 'girl', 'look', 'sex', 'feel', 'like']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #10\n", "['ca', 'departments', 'positions', 'movies', 'songs', 'business', 'read', 'start', 'job', 'work', 'engineering', 'ways', 'bad', 'books', 'good']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #11\n", "['money', 'modi', 'currency', 'economy', 'think', 'government', 'ban', 'banning', 'black', 'indian', 'rupee', 'rs', '1000', 'notes', '500']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #12\n", "['blowing', 'resolutions', 'resolution', 'mind', 'likes', 'girl', '2017', 'year', 'don', 'employees', 'going', 'day', 'things', 'new', 'know']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #13\n", "['aspects', 'fluent', 'skill', 'spoken', 'ways', 'language', 'fluently', 'speak', 'communication', 'pronunciation', 'speaking', 'writing', 'skills', 'improve', 'english']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #14\n", "['diet', 'help', 'healthy', 'exercise', 'month', 'pounds', 'reduce', 'quickly', 'loss', 'fast', 'fat', 'ways', 'gain', 'lose', 'weight']\n", "\n", "\n", "THE TOP 15 WORDS 
FOR TOPIC #15\n", "['having', 'feel', 'long', 'spend', 'did', 'person', 'machine', 'movies', 'favorite', 'job', 'home', 'sex', 'possible', 'travel', 'time']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #16\n", "['marriage', 'make', 'did', 'girlfriend', 'feel', 'tell', 'forget', 'really', 'friend', 'true', 'know', 'person', 'girl', 'fall', 'love']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #17\n", "['easy', 'hack', 'prepare', 'quickest', 'facebook', 'increase', 'painless', 'instagram', 'account', 'best', 'commit', 'fastest', 'suicide', 'easiest', 'way']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #18\n", "['web', 'java', 'scripting', 'phone', 'mechanical', 'better', 'job', 'use', 'account', 'data', 'software', 'science', 'computer', 'engineering', 'difference']\n", "\n", "\n", "THE TOP 15 WORDS FOR TOPIC #19\n", "['earth', 'blowing', 'stop', 'use', 'easily', 'mind', 'google', 'flat', 'questions', 'hate', 'believe', 'ask', 'don', 'think', 'people']\n", "\n", "\n" ] } ], "source": [ "# get_feature_names() was removed in scikit-learn 1.2; prefer get_feature_names_out() when it exists\n", "get_names = getattr(tfidf, 'get_feature_names_out', None) or tfidf.get_feature_names\n", "feature_names = get_names()  # look up the vocabulary once, not once per word\n", "for index, topic in enumerate(nmf_model.components_):\n", "    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')\n", "    # argsort is ascending, so the last 15 indices are the highest-weighted terms\n", "    print([str(feature_names[i]) for i in topic.argsort()[-15:]])\n", "    print('\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TASK: Add a new column to the original quora dataframe that assigns each question to one of the 20 topic categories." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
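<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Question</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>What is the step by step guide to invest in sh...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>How can I increase the speed of my internet co...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Why am I mentally very lonely? How can I solve...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>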
" ], "text/plain": [ " Question\n", "0 What is the step by step guide to invest in sh...\n", "1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n", "2 How can I increase the speed of my internet co...\n", "3 Why am I mentally very lonely? How can I solve...\n", "4 Which one dissolve in water quikly sugar, salt..." ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quora.head()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topic_results = nmf_model.transform(dtm)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
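<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Question</th>\n",
"      <th>Topic</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>What is the step by step guide to invest in sh...</td>\n",
"      <td>5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
"      <td>16</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>How can I increase the speed of my internet co...</td>\n",
"      <td>17</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Why am I mentally very lonely? How can I solve...</td>\n",
"      <td>11</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
"      <td>14</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
"      <td>Astrology: I am a Capricorn Sun Cap moon and c...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>6</th>\n",
"      <td>Should I buy tiago?</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>7</th>\n",
"      <td>How can I be a good geologist?</td>\n",
"      <td>10</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8</th>\n",
"      <td>When do you use シ instead of し?</td>\n",
"      <td>19</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>9</th>\n",
"      <td>Motorola (company): Can I hack my Charter Moto...</td>\n",
"      <td>17</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>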
" ], "text/plain": [ " Question Topic\n", "0 What is the step by step guide to invest in sh... 5\n", "1 What is the story of Kohinoor (Koh-i-Noor) Dia... 16\n", "2 How can I increase the speed of my internet co... 17\n", "3 Why am I mentally very lonely? How can I solve... 11\n", "4 Which one dissolve in water quikly sugar, salt... 14\n", "5 Astrology: I am a Capricorn Sun Cap moon and c... 1\n", "6 Should I buy tiago? 0\n", "7 How can I be a good geologist? 10\n", "8 When do you use シ instead of し? 19\n", "9 Motorola (company): Can I hack my Charter Moto... 17" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Label each question with the index of its highest-weighted topic (argmax across the 20 topic columns)\n", "quora['Topic'] = topic_results.argmax(axis=1)\n", "\n", "quora.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Great job!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }