{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a corpus from individual files\n", "Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files, each containing the text of one review. The labels are determined by the subdirectory that holds the file; that is, positive reviews are stored in a `\\pos\\` directory while negative reviews live under `\\neg\\`. Refer to [moviereviewsREADME.txt](../moviereviews/moviereviewsREADME.txt) for more information about the files.\n", "\n", "We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:\n", "* using Python's **os module** to build a pandas DataFrame\n", "* using an **nltk** tool called `CategorizedPlaintextCorpusReader`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Python's `os` module to build a DataFrame" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Perform imports:\n", "import numpy as np\n", "import pandas as pd\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's look at what os.walk() does:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('../moviereviews - Copy', ['neg', 'pos'], ['poldata.README.2.0'])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = os.walk('../moviereviews')\n", "next(gen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`os.walk()` returns a generator that yields one tuple of three items for each folder it visits:\n", "1. the name of the current folder\n", "2. a list of names of any subfolders\n", "3. 
a list of names of any files in the current folder" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('../moviereviews - Copy\\\\neg',\n", " [],\n", " ['cv000_29416.txt',\n", " 'cv001_19502.txt',\n", " 'cv002_17424.txt',\n", " 'cv003_12683.txt',\n", " 'cv004_12641.txt',\n", " 'cv005_29357.txt',\n", " 'cv006_17022.txt',\n", " 'cv007_4992.txt',\n", " 'cv008_29326.txt',\n", " 'cv009_29417.txt',\n", " 'cv010_29063.txt',\n", " 'cv011_13044.txt',\n", " 'cv012_29411.txt',\n", " 'cv013_10494.txt',\n", " 'cv014_15600.txt',\n", " 'cv015_29356.txt',\n", " 'cv016_4348.txt',\n", " 'cv017_23487.txt',\n", " 'cv018_21672.txt',\n", " 'cv019_16117.txt',\n", " 'cv020_9234.txt',\n", " 'cv021_17313.txt',\n", " 'cv022_14227.txt',\n", " 'cv023_13847.txt',\n", " 'cv024_7033.txt',\n", " 'cv025_29825.txt',\n", " 'cv026_29229.txt',\n", " 'cv027_26270.txt',\n", " 'cv028_26964.txt',\n", " 'cv029_19943.txt',\n", " 'cv030_22893.txt',\n", " 'cv031_19540.txt',\n", " 'cv032_23718.txt',\n", " 'cv033_25680.txt',\n", " 'cv034_29446.txt',\n", " 'cv035_3343.txt',\n", " 'cv036_18385.txt',\n", " 'cv037_19798.txt',\n", " 'cv038_9781.txt',\n", " 'cv039_5963.txt',\n", " 'cv040_8829.txt',\n", " 'cv041_22364.txt',\n", " 'cv042_11927.txt',\n", " 'cv043_16808.txt',\n", " 'cv044_18429.txt',\n", " 'cv045_25077.txt',\n", " 'cv046_10613.txt',\n", " 'cv047_18725.txt',\n", " 'cv048_18380.txt',\n", " 'cv049_21917.txt',\n", " 'cv050_12128.txt',\n", " 'cv051_10751.txt',\n", " 'cv052_29318.txt',\n", " 'cv053_23117.txt',\n", " 'cv054_4101.txt',\n", " 'cv055_8926.txt',\n", " 'cv056_14663.txt',\n", " 'cv057_7962.txt',\n", " 'cv058_8469.txt',\n", " 'cv059_28723.txt',\n", " 'cv060_11754.txt',\n", " 'cv061_9321.txt',\n", " 'cv062_24556.txt',\n", " 'cv063_28852.txt',\n", " 'cv064_25842.txt',\n", " 'cv065_16909.txt',\n", " 'cv066_11668.txt',\n", " 'cv067_21192.txt',\n", " 'cv068_14810.txt',\n", " 'cv069_11613.txt',\n", " 'cv070_13249.txt',\n", " 'cv071_12969.txt',\n", " 
'cv072_5928.txt',\n", " 'cv073_23039.txt',\n", " 'cv074_7188.txt',\n", " 'cv075_6250.txt',\n", " 'cv076_26009.txt',\n", " 'cv077_23172.txt',\n", " 'cv078_16506.txt',\n", " 'cv079_12766.txt',\n", " 'cv080_14899.txt',\n", " 'cv081_18241.txt',\n", " 'cv082_11979.txt',\n", " 'cv083_25491.txt',\n", " 'cv084_15183.txt',\n", " 'cv085_15286.txt',\n", " 'cv086_19488.txt',\n", " 'cv087_2145.txt',\n", " 'cv088_25274.txt',\n", " 'cv089_12222.txt',\n", " 'cv090_0049.txt',\n", " 'cv091_7899.txt',\n", " 'cv092_27987.txt',\n", " 'cv093_15606.txt',\n", " 'cv094_27868.txt',\n", " 'cv095_28730.txt',\n", " 'cv096_12262.txt',\n", " 'cv097_26081.txt',\n", " 'cv098_17021.txt',\n", " 'cv099_11189.txt',\n", " 'cv100_12406.txt',\n", " 'cv101_10537.txt',\n", " 'cv102_8306.txt',\n", " 'cv103_11943.txt',\n", " 'cv104_19176.txt',\n", " 'cv105_19135.txt',\n", " 'cv106_18379.txt',\n", " 'cv107_25639.txt',\n", " 'cv108_17064.txt',\n", " 'cv109_22599.txt',\n", " 'cv110_27832.txt',\n", " 'cv111_12253.txt',\n", " 'cv112_12178.txt',\n", " 'cv113_24354.txt',\n", " 'cv114_19501.txt',\n", " 'cv115_26443.txt',\n", " 'cv116_28734.txt',\n", " 'cv117_25625.txt',\n", " 'cv118_28837.txt',\n", " 'cv119_9909.txt',\n", " 'cv120_3793.txt',\n", " 'cv121_18621.txt',\n", " 'cv122_7891.txt',\n", " 'cv123_12165.txt',\n", " 'cv124_3903.txt',\n", " 'cv125_9636.txt',\n", " 'cv126_28821.txt',\n", " 'cv127_16451.txt',\n", " 'cv128_29444.txt',\n", " 'cv129_18373.txt',\n", " 'cv130_18521.txt',\n", " 'cv131_11568.txt',\n", " 'cv132_5423.txt',\n", " 'cv133_18065.txt',\n", " 'cv134_23300.txt',\n", " 'cv135_12506.txt',\n", " 'cv136_12384.txt',\n", " 'cv137_17020.txt',\n", " 'cv138_13903.txt',\n", " 'cv139_14236.txt',\n", " 'cv140_7963.txt',\n", " 'cv141_17179.txt',\n", " 'cv142_23657.txt',\n", " 'cv143_21158.txt',\n", " 'cv144_5010.txt',\n", " 'cv145_12239.txt',\n", " 'cv146_19587.txt',\n", " 'cv147_22625.txt',\n", " 'cv148_18084.txt',\n", " 'cv149_17084.txt',\n", " 'cv150_14279.txt',\n", " 'cv151_17231.txt',\n", " 
'cv152_9052.txt',\n", " 'cv153_11607.txt',\n", " 'cv154_9562.txt',\n", " 'cv155_7845.txt',\n", " 'cv156_11119.txt',\n", " 'cv157_29302.txt',\n", " 'cv158_10914.txt',\n", " 'cv159_29374.txt',\n", " 'cv160_10848.txt',\n", " 'cv161_12224.txt',\n", " 'cv162_10977.txt',\n", " 'cv163_10110.txt',\n", " 'cv164_23451.txt',\n", " 'cv165_2389.txt',\n", " 'cv166_11959.txt',\n", " 'cv167_18094.txt',\n", " 'cv168_7435.txt',\n", " 'cv169_24973.txt',\n", " 'cv170_29808.txt',\n", " 'cv171_15164.txt',\n", " 'cv172_12037.txt',\n", " 'cv173_4295.txt',\n", " 'cv174_9735.txt',\n", " 'cv175_7375.txt',\n", " 'cv176_14196.txt',\n", " 'cv177_10904.txt',\n", " 'cv178_14380.txt',\n", " 'cv179_9533.txt',\n", " 'cv180_17823.txt',\n", " 'cv181_16083.txt',\n", " 'cv182_7791.txt',\n", " 'cv183_19826.txt',\n", " 'cv184_26935.txt',\n", " 'cv185_28372.txt',\n", " 'cv186_2396.txt',\n", " 'cv187_14112.txt',\n", " 'cv188_20687.txt',\n", " 'cv189_24248.txt',\n", " 'cv190_27176.txt',\n", " 'cv191_29539.txt',\n", " 'cv192_16079.txt',\n", " 'cv193_5393.txt',\n", " 'cv194_12855.txt',\n", " 'cv195_16146.txt',\n", " 'cv196_28898.txt',\n", " 'cv197_29271.txt',\n", " 'cv198_19313.txt',\n", " 'cv199_9721.txt',\n", " 'cv200_29006.txt',\n", " 'cv201_7421.txt',\n", " 'cv202_11382.txt',\n", " 'cv203_19052.txt',\n", " 'cv204_8930.txt',\n", " 'cv205_9676.txt',\n", " 'cv206_15893.txt',\n", " 'cv207_29141.txt',\n", " 'cv208_9475.txt',\n", " 'cv209_28973.txt',\n", " 'cv210_9557.txt',\n", " 'cv211_9955.txt',\n", " 'cv212_10054.txt',\n", " 'cv213_20300.txt',\n", " 'cv214_13285.txt',\n", " 'cv215_23246.txt',\n", " 'cv216_20165.txt',\n", " 'cv217_28707.txt',\n", " 'cv218_25651.txt',\n", " 'cv219_19874.txt',\n", " 'cv220_28906.txt',\n", " 'cv221_27081.txt',\n", " 'cv222_18720.txt',\n", " 'cv223_28923.txt',\n", " 'cv224_18875.txt',\n", " 'cv225_29083.txt',\n", " 'cv226_26692.txt',\n", " 'cv227_25406.txt',\n", " 'cv228_5644.txt',\n", " 'cv229_15200.txt',\n", " 'cv230_7913.txt',\n", " 'cv231_11028.txt',\n", " 
'cv232_16768.txt',\n", " 'cv233_17614.txt',\n", " 'cv234_22123.txt',\n", " 'cv235_10704.txt',\n", " 'cv236_12427.txt',\n", " 'cv237_20635.txt',\n", " 'cv238_14285.txt',\n", " 'cv239_29828.txt',\n", " 'cv240_15948.txt',\n", " 'cv241_24602.txt',\n", " 'cv242_11354.txt',\n", " 'cv243_22164.txt',\n", " 'cv244_22935.txt',\n", " 'cv245_8938.txt',\n", " 'cv246_28668.txt',\n", " 'cv247_14668.txt',\n", " 'cv248_15672.txt',\n", " 'cv249_12674.txt',\n", " 'cv250_26462.txt',\n", " 'cv251_23901.txt',\n", " 'cv252_24974.txt',\n", " 'cv253_10190.txt',\n", " 'cv254_5870.txt',\n", " 'cv255_15267.txt',\n", " 'cv256_16529.txt',\n", " 'cv257_11856.txt',\n", " 'cv258_5627.txt',\n", " 'cv259_11827.txt',\n", " 'cv260_15652.txt',\n", " 'cv261_11855.txt',\n", " 'cv262_13812.txt',\n", " 'cv263_20693.txt',\n", " 'cv264_14108.txt',\n", " 'cv265_11625.txt',\n", " 'cv266_26644.txt',\n", " 'cv267_16618.txt',\n", " 'cv268_20288.txt',\n", " 'cv269_23018.txt',\n", " 'cv270_5873.txt',\n", " 'cv271_15364.txt',\n", " 'cv272_20313.txt',\n", " 'cv273_28961.txt',\n", " 'cv274_26379.txt',\n", " 'cv275_28725.txt',\n", " 'cv276_17126.txt',\n", " 'cv277_20467.txt',\n", " 'cv278_14533.txt',\n", " 'cv279_19452.txt',\n", " 'cv280_8651.txt',\n", " 'cv281_24711.txt',\n", " 'cv282_6833.txt',\n", " 'cv283_11963.txt',\n", " 'cv284_20530.txt',\n", " 'cv285_18186.txt',\n", " 'cv286_26156.txt',\n", " 'cv287_17410.txt',\n", " 'cv288_20212.txt',\n", " 'cv289_6239.txt',\n", " 'cv290_11981.txt',\n", " 'cv291_26844.txt',\n", " 'cv292_7804.txt',\n", " 'cv293_29731.txt',\n", " 'cv294_12695.txt',\n", " 'cv295_17060.txt',\n", " 'cv296_13146.txt',\n", " 'cv297_10104.txt',\n", " 'cv298_24487.txt',\n", " 'cv299_17950.txt',\n", " 'cv300_23302.txt',\n", " 'cv301_13010.txt',\n", " 'cv302_26481.txt',\n", " 'cv303_27366.txt',\n", " 'cv304_28489.txt',\n", " 'cv305_9937.txt',\n", " 'cv306_10859.txt',\n", " 'cv307_26382.txt',\n", " 'cv308_5079.txt',\n", " 'cv309_23737.txt',\n", " 'cv310_14568.txt',\n", " 'cv311_17708.txt',\n", " 
'cv312_29308.txt',\n", " 'cv313_19337.txt',\n", " 'cv314_16095.txt',\n", " 'cv315_12638.txt',\n", " 'cv316_5972.txt',\n", " 'cv317_25111.txt',\n", " 'cv318_11146.txt',\n", " 'cv319_16459.txt',\n", " 'cv320_9693.txt',\n", " 'cv321_14191.txt',\n", " 'cv322_21820.txt',\n", " 'cv323_29633.txt',\n", " 'cv324_7502.txt',\n", " 'cv325_18330.txt',\n", " 'cv326_14777.txt',\n", " 'cv327_21743.txt',\n", " 'cv328_10908.txt',\n", " 'cv329_29293.txt',\n", " 'cv330_29675.txt',\n", " 'cv331_8656.txt',\n", " 'cv332_17997.txt',\n", " 'cv333_9443.txt',\n", " 'cv334_0074.txt',\n", " 'cv335_16299.txt',\n", " 'cv336_10363.txt',\n", " 'cv337_29061.txt',\n", " 'cv338_9183.txt',\n", " 'cv339_22452.txt',\n", " 'cv340_14776.txt',\n", " 'cv341_25667.txt',\n", " 'cv342_20917.txt',\n", " 'cv343_10906.txt',\n", " 'cv344_5376.txt',\n", " 'cv345_9966.txt',\n", " 'cv346_19198.txt',\n", " 'cv347_14722.txt',\n", " 'cv348_19207.txt',\n", " 'cv349_15032.txt',\n", " 'cv350_22139.txt',\n", " 'cv351_17029.txt',\n", " 'cv352_5414.txt',\n", " 'cv353_19197.txt',\n", " 'cv354_8573.txt',\n", " 'cv355_18174.txt',\n", " 'cv356_26170.txt',\n", " 'cv357_14710.txt',\n", " 'cv358_11557.txt',\n", " 'cv359_6751.txt',\n", " 'cv360_8927.txt',\n", " 'cv361_28738.txt',\n", " 'cv362_16985.txt',\n", " 'cv363_29273.txt',\n", " 'cv364_14254.txt',\n", " 'cv365_12442.txt',\n", " 'cv366_10709.txt',\n", " 'cv367_24065.txt',\n", " 'cv368_11090.txt',\n", " 'cv369_14245.txt',\n", " 'cv370_5338.txt',\n", " 'cv371_8197.txt',\n", " 'cv372_6654.txt',\n", " 'cv373_21872.txt',\n", " 'cv374_26455.txt',\n", " 'cv375_9932.txt',\n", " 'cv376_20883.txt',\n", " 'cv377_8440.txt',\n", " 'cv378_21982.txt',\n", " 'cv379_23167.txt',\n", " 'cv380_8164.txt',\n", " 'cv381_21673.txt',\n", " 'cv382_8393.txt',\n", " 'cv383_14662.txt',\n", " 'cv384_18536.txt',\n", " 'cv385_29621.txt',\n", " 'cv386_10229.txt',\n", " 'cv387_12391.txt',\n", " 'cv388_12810.txt',\n", " 'cv389_9611.txt',\n", " 'cv390_12187.txt',\n", " 'cv391_11615.txt',\n", " 
'cv392_12238.txt',\n", " 'cv393_29234.txt',\n", " 'cv394_5311.txt',\n", " 'cv395_11761.txt',\n", " 'cv396_19127.txt',\n", " 'cv397_28890.txt',\n", " 'cv398_17047.txt',\n", " 'cv399_28593.txt',\n", " 'cv400_20631.txt',\n", " 'cv401_13758.txt',\n", " 'cv402_16097.txt',\n", " 'cv403_6721.txt',\n", " 'cv404_21805.txt',\n", " 'cv405_21868.txt',\n", " 'cv406_22199.txt',\n", " 'cv407_23928.txt',\n", " 'cv408_5367.txt',\n", " 'cv409_29625.txt',\n", " 'cv410_25624.txt',\n", " 'cv411_16799.txt',\n", " 'cv412_25254.txt',\n", " 'cv413_7893.txt',\n", " 'cv414_11161.txt',\n", " 'cv415_23674.txt',\n", " 'cv416_12048.txt',\n", " 'cv417_14653.txt',\n", " 'cv418_16562.txt',\n", " 'cv419_14799.txt',\n", " 'cv420_28631.txt',\n", " 'cv421_9752.txt',\n", " 'cv422_9632.txt',\n", " 'cv423_12089.txt',\n", " 'cv424_9268.txt',\n", " 'cv425_8603.txt',\n", " 'cv426_10976.txt',\n", " 'cv427_11693.txt',\n", " 'cv428_12202.txt',\n", " 'cv429_7937.txt',\n", " 'cv430_18662.txt',\n", " 'cv431_7538.txt',\n", " 'cv432_15873.txt',\n", " 'cv433_10443.txt',\n", " 'cv434_5641.txt',\n", " 'cv435_24355.txt',\n", " 'cv436_20564.txt',\n", " 'cv437_24070.txt',\n", " 'cv438_8500.txt',\n", " 'cv439_17633.txt',\n", " 'cv440_16891.txt',\n", " 'cv441_15276.txt',\n", " 'cv442_15499.txt',\n", " 'cv443_22367.txt',\n", " 'cv444_9975.txt',\n", " 'cv445_26683.txt',\n", " 'cv446_12209.txt',\n", " 'cv447_27334.txt',\n", " 'cv448_16409.txt',\n", " 'cv449_9126.txt',\n", " 'cv450_8319.txt',\n", " 'cv451_11502.txt',\n", " 'cv452_5179.txt',\n", " 'cv453_10911.txt',\n", " 'cv454_21961.txt',\n", " 'cv455_28866.txt',\n", " 'cv456_20370.txt',\n", " 'cv457_19546.txt',\n", " 'cv458_9000.txt',\n", " 'cv459_21834.txt',\n", " 'cv460_11723.txt',\n", " 'cv461_21124.txt',\n", " 'cv462_20788.txt',\n", " 'cv463_10846.txt',\n", " 'cv464_17076.txt',\n", " 'cv465_23401.txt',\n", " 'cv466_20092.txt',\n", " 'cv467_26610.txt',\n", " 'cv468_16844.txt',\n", " 'cv469_21998.txt',\n", " 'cv470_17444.txt',\n", " 'cv471_18405.txt',\n", " 
'cv472_29140.txt',\n", " 'cv473_7869.txt',\n", " 'cv474_10682.txt',\n", " 'cv475_22978.txt',\n", " 'cv476_18402.txt',\n", " 'cv477_23530.txt',\n", " 'cv478_15921.txt',\n", " 'cv479_5450.txt',\n", " 'cv480_21195.txt',\n", " 'cv481_7930.txt',\n", " 'cv482_11233.txt',\n", " 'cv483_18103.txt',\n", " 'cv484_26169.txt',\n", " 'cv485_26879.txt',\n", " 'cv486_9788.txt',\n", " 'cv487_11058.txt',\n", " 'cv488_21453.txt',\n", " 'cv489_19046.txt',\n", " 'cv490_18986.txt',\n", " 'cv491_12992.txt',\n", " 'cv492_19370.txt',\n", " 'cv493_14135.txt',\n", " 'cv494_18689.txt',\n", " 'cv495_16121.txt',\n", " 'cv496_11185.txt',\n", " 'cv497_27086.txt',\n", " 'cv498_9288.txt',\n", " 'cv499_11407.txt',\n", " 'cv500_10722.txt',\n", " 'cv501_12675.txt',\n", " 'cv502_10970.txt',\n", " 'cv503_11196.txt',\n", " 'cv504_29120.txt',\n", " 'cv505_12926.txt',\n", " 'cv506_17521.txt',\n", " 'cv507_9509.txt',\n", " 'cv508_17742.txt',\n", " 'cv509_17354.txt',\n", " 'cv510_24758.txt',\n", " 'cv511_10360.txt',\n", " 'cv512_17618.txt',\n", " 'cv513_7236.txt',\n", " 'cv514_12173.txt',\n", " 'cv515_18484.txt',\n", " 'cv516_12117.txt',\n", " 'cv517_20616.txt',\n", " 'cv518_14798.txt',\n", " 'cv519_16239.txt',\n", " 'cv520_13297.txt',\n", " 'cv521_1730.txt',\n", " 'cv522_5418.txt',\n", " 'cv523_18285.txt',\n", " 'cv524_24885.txt',\n", " 'cv525_17930.txt',\n", " 'cv526_12868.txt',\n", " 'cv527_10338.txt',\n", " 'cv528_11669.txt',\n", " 'cv529_10972.txt',\n", " 'cv530_17949.txt',\n", " 'cv531_26838.txt',\n", " 'cv532_6495.txt',\n", " 'cv533_9843.txt',\n", " 'cv534_15683.txt',\n", " 'cv535_21183.txt',\n", " 'cv536_27221.txt',\n", " 'cv537_13516.txt',\n", " 'cv538_28485.txt',\n", " 'cv539_21865.txt',\n", " 'cv540_3092.txt',\n", " 'cv541_28683.txt',\n", " 'cv542_20359.txt',\n", " 'cv543_5107.txt',\n", " 'cv544_5301.txt',\n", " 'cv545_12848.txt',\n", " 'cv546_12723.txt',\n", " 'cv547_18043.txt',\n", " 'cv548_18944.txt',\n", " 'cv549_22771.txt',\n", " 'cv550_23226.txt',\n", " 'cv551_11214.txt',\n", " 
'cv552_0150.txt',\n", " 'cv553_26965.txt',\n", " 'cv554_14678.txt',\n", " 'cv555_25047.txt',\n", " 'cv556_16563.txt',\n", " 'cv557_12237.txt',\n", " 'cv558_29376.txt',\n", " 'cv559_0057.txt',\n", " 'cv560_18608.txt',\n", " 'cv561_9484.txt',\n", " 'cv562_10847.txt',\n", " 'cv563_18610.txt',\n", " 'cv564_12011.txt',\n", " 'cv565_29403.txt',\n", " 'cv566_8967.txt',\n", " 'cv567_29420.txt',\n", " 'cv568_17065.txt',\n", " 'cv569_26750.txt',\n", " 'cv570_28960.txt',\n", " 'cv571_29292.txt',\n", " 'cv572_20053.txt',\n", " 'cv573_29384.txt',\n", " 'cv574_23191.txt',\n", " 'cv575_22598.txt',\n", " 'cv576_15688.txt',\n", " 'cv577_28220.txt',\n", " 'cv578_16825.txt',\n", " 'cv579_12542.txt',\n", " 'cv580_15681.txt',\n", " 'cv581_20790.txt',\n", " 'cv582_6678.txt',\n", " 'cv583_29465.txt',\n", " 'cv584_29549.txt',\n", " 'cv585_23576.txt',\n", " 'cv586_8048.txt',\n", " 'cv587_20532.txt',\n", " 'cv588_14467.txt',\n", " 'cv589_12853.txt',\n", " 'cv590_20712.txt',\n", " 'cv591_24887.txt',\n", " 'cv592_23391.txt',\n", " 'cv593_11931.txt',\n", " 'cv594_11945.txt',\n", " 'cv595_26420.txt',\n", " 'cv596_4367.txt',\n", " 'cv597_26744.txt',\n", " 'cv598_18184.txt',\n", " 'cv599_22197.txt',\n", " 'cv600_25043.txt',\n", " 'cv601_24759.txt',\n", " 'cv602_8830.txt',\n", " 'cv603_18885.txt',\n", " 'cv604_23339.txt',\n", " 'cv605_12730.txt',\n", " 'cv606_17672.txt',\n", " 'cv607_8235.txt',\n", " 'cv608_24647.txt',\n", " 'cv609_25038.txt',\n", " 'cv610_24153.txt',\n", " 'cv611_2253.txt',\n", " 'cv612_5396.txt',\n", " 'cv613_23104.txt',\n", " 'cv614_11320.txt',\n", " 'cv615_15734.txt',\n", " 'cv616_29187.txt',\n", " 'cv617_9561.txt',\n", " 'cv618_9469.txt',\n", " 'cv619_13677.txt',\n", " 'cv620_2556.txt',\n", " 'cv621_15984.txt',\n", " 'cv622_8583.txt',\n", " 'cv623_16988.txt',\n", " 'cv624_11601.txt',\n", " 'cv625_13518.txt',\n", " 'cv626_7907.txt',\n", " 'cv627_12603.txt',\n", " 'cv628_20758.txt',\n", " 'cv629_16604.txt',\n", " 'cv630_10152.txt',\n", " 'cv631_4782.txt',\n", " 
'cv632_9704.txt',\n", " 'cv633_29730.txt',\n", " 'cv634_11989.txt',\n", " 'cv635_0984.txt',\n", " 'cv636_16954.txt',\n", " 'cv637_13682.txt',\n", " 'cv638_29394.txt',\n", " 'cv639_10797.txt',\n", " 'cv640_5380.txt',\n", " 'cv641_13412.txt',\n", " 'cv642_29788.txt',\n", " 'cv643_29282.txt',\n", " 'cv644_18551.txt',\n", " 'cv645_17078.txt',\n", " 'cv646_16817.txt',\n", " 'cv647_15275.txt',\n", " 'cv648_17277.txt',\n", " 'cv649_13947.txt',\n", " 'cv650_15974.txt',\n", " 'cv651_11120.txt',\n", " 'cv652_15653.txt',\n", " 'cv653_2107.txt',\n", " 'cv654_19345.txt',\n", " 'cv655_12055.txt',\n", " 'cv656_25395.txt',\n", " 'cv657_25835.txt',\n", " 'cv658_11186.txt',\n", " 'cv659_21483.txt',\n", " 'cv660_23140.txt',\n", " 'cv661_25780.txt',\n", " 'cv662_14791.txt',\n", " 'cv663_14484.txt',\n", " 'cv664_4264.txt',\n", " 'cv665_29386.txt',\n", " 'cv666_20301.txt',\n", " 'cv667_19672.txt',\n", " 'cv668_18848.txt',\n", " 'cv669_24318.txt',\n", " 'cv670_2666.txt',\n", " 'cv671_5164.txt',\n", " 'cv672_27988.txt',\n", " 'cv673_25874.txt',\n", " 'cv674_11593.txt',\n", " 'cv675_22871.txt',\n", " 'cv676_22202.txt',\n", " 'cv677_18938.txt',\n", " 'cv678_14887.txt',\n", " 'cv679_28221.txt',\n", " 'cv680_10533.txt',\n", " 'cv681_9744.txt',\n", " 'cv682_17947.txt',\n", " 'cv683_13047.txt',\n", " 'cv684_12727.txt',\n", " 'cv685_5710.txt',\n", " 'cv686_15553.txt',\n", " 'cv687_22207.txt',\n", " 'cv688_7884.txt',\n", " 'cv689_13701.txt',\n", " 'cv690_5425.txt',\n", " 'cv691_5090.txt',\n", " 'cv692_17026.txt',\n", " 'cv693_19147.txt',\n", " 'cv694_4526.txt',\n", " 'cv695_22268.txt',\n", " 'cv696_29619.txt',\n", " 'cv697_12106.txt',\n", " 'cv698_16930.txt',\n", " 'cv699_7773.txt',\n", " 'cv700_23163.txt',\n", " 'cv701_15880.txt',\n", " 'cv702_12371.txt',\n", " 'cv703_17948.txt',\n", " 'cv704_17622.txt',\n", " 'cv705_11973.txt',\n", " 'cv706_25883.txt',\n", " 'cv707_11421.txt',\n", " 'cv708_28539.txt',\n", " 'cv709_11173.txt',\n", " 'cv710_23745.txt',\n", " 'cv711_12687.txt',\n", " 
'cv712_24217.txt',\n", " 'cv713_29002.txt',\n", " 'cv714_19704.txt',\n", " 'cv715_19246.txt',\n", " 'cv716_11153.txt',\n", " 'cv717_17472.txt',\n", " 'cv718_12227.txt',\n", " 'cv719_5581.txt',\n", " 'cv720_5383.txt',\n", " 'cv721_28993.txt',\n", " 'cv722_7571.txt',\n", " 'cv723_9002.txt',\n", " 'cv724_15265.txt',\n", " 'cv725_10266.txt',\n", " 'cv726_4365.txt',\n", " 'cv727_5006.txt',\n", " 'cv728_17931.txt',\n", " 'cv729_10475.txt',\n", " 'cv730_10729.txt',\n", " 'cv731_3968.txt',\n", " 'cv732_13092.txt',\n", " 'cv733_9891.txt',\n", " 'cv734_22821.txt',\n", " 'cv735_20218.txt',\n", " 'cv736_24947.txt',\n", " 'cv737_28733.txt',\n", " 'cv738_10287.txt',\n", " 'cv739_12179.txt',\n", " 'cv740_13643.txt',\n", " 'cv741_12765.txt',\n", " 'cv742_8279.txt',\n", " 'cv743_17023.txt',\n", " 'cv744_10091.txt',\n", " 'cv745_14009.txt',\n", " 'cv746_10471.txt',\n", " 'cv747_18189.txt',\n", " 'cv748_14044.txt',\n", " 'cv749_18960.txt',\n", " 'cv750_10606.txt',\n", " 'cv751_17208.txt',\n", " 'cv752_25330.txt',\n", " 'cv753_11812.txt',\n", " 'cv754_7709.txt',\n", " 'cv755_24881.txt',\n", " 'cv756_23676.txt',\n", " 'cv757_10668.txt',\n", " 'cv758_9740.txt',\n", " 'cv759_15091.txt',\n", " 'cv760_8977.txt',\n", " 'cv761_13769.txt',\n", " 'cv762_15604.txt',\n", " 'cv763_16486.txt',\n", " 'cv764_12701.txt',\n", " 'cv765_20429.txt',\n", " 'cv766_7983.txt',\n", " 'cv767_15673.txt',\n", " 'cv768_12709.txt',\n", " 'cv769_8565.txt',\n", " 'cv770_11061.txt',\n", " 'cv771_28466.txt',\n", " 'cv772_12971.txt',\n", " 'cv773_20264.txt',\n", " 'cv774_15488.txt',\n", " 'cv775_17966.txt',\n", " 'cv776_21934.txt',\n", " 'cv777_10247.txt',\n", " 'cv778_18629.txt',\n", " 'cv779_18989.txt',\n", " 'cv780_8467.txt',\n", " 'cv781_5358.txt',\n", " 'cv782_21078.txt',\n", " 'cv783_14724.txt',\n", " 'cv784_16077.txt',\n", " 'cv785_23748.txt',\n", " 'cv786_23608.txt',\n", " 'cv787_15277.txt',\n", " 'cv788_26409.txt',\n", " 'cv789_12991.txt',\n", " 'cv790_16202.txt',\n", " 'cv791_17995.txt',\n", " 
'cv792_3257.txt',\n", " 'cv793_15235.txt',\n", " 'cv794_17353.txt',\n", " 'cv795_10291.txt',\n", " 'cv796_17243.txt',\n", " 'cv797_7245.txt',\n", " 'cv798_24779.txt',\n", " 'cv799_19812.txt',\n", " 'cv800_13494.txt',\n", " 'cv801_26335.txt',\n", " 'cv802_28381.txt',\n", " 'cv803_8584.txt',\n", " 'cv804_11763.txt',\n", " 'cv805_21128.txt',\n", " 'cv806_9405.txt',\n", " 'cv807_23024.txt',\n", " 'cv808_13773.txt',\n", " 'cv809_5012.txt',\n", " 'cv810_13660.txt',\n", " 'cv811_22646.txt',\n", " 'cv812_19051.txt',\n", " 'cv813_6649.txt',\n", " 'cv814_20316.txt',\n", " 'cv815_23466.txt',\n", " 'cv816_15257.txt',\n", " 'cv817_3675.txt',\n", " 'cv818_10698.txt',\n", " 'cv819_9567.txt',\n", " 'cv820_24157.txt',\n", " 'cv821_29283.txt',\n", " 'cv822_21545.txt',\n", " 'cv823_17055.txt',\n", " 'cv824_9335.txt',\n", " 'cv825_5168.txt',\n", " 'cv826_12761.txt',\n", " 'cv827_19479.txt',\n", " 'cv828_21392.txt',\n", " 'cv829_21725.txt',\n", " 'cv830_5778.txt',\n", " 'cv831_16325.txt',\n", " 'cv832_24713.txt',\n", " 'cv833_11961.txt',\n", " 'cv834_23192.txt',\n", " 'cv835_20531.txt',\n", " 'cv836_14311.txt',\n", " 'cv837_27232.txt',\n", " 'cv838_25886.txt',\n", " 'cv839_22807.txt',\n", " 'cv840_18033.txt',\n", " 'cv841_3367.txt',\n", " 'cv842_5702.txt',\n", " 'cv843_17054.txt',\n", " 'cv844_13890.txt',\n", " 'cv845_15886.txt',\n", " 'cv846_29359.txt',\n", " 'cv847_20855.txt',\n", " 'cv848_10061.txt',\n", " 'cv849_17215.txt',\n", " 'cv850_18185.txt',\n", " 'cv851_21895.txt',\n", " 'cv852_27512.txt',\n", " 'cv853_29119.txt',\n", " 'cv854_18955.txt',\n", " 'cv855_22134.txt',\n", " 'cv856_28882.txt',\n", " 'cv857_17527.txt',\n", " 'cv858_20266.txt',\n", " 'cv859_15689.txt',\n", " 'cv860_15520.txt',\n", " 'cv861_12809.txt',\n", " 'cv862_15924.txt',\n", " 'cv863_7912.txt',\n", " 'cv864_3087.txt',\n", " 'cv865_28796.txt',\n", " 'cv866_29447.txt',\n", " 'cv867_18362.txt',\n", " 'cv868_12799.txt',\n", " 'cv869_24782.txt',\n", " 'cv870_18090.txt',\n", " 'cv871_25971.txt',\n", " 
'cv872_13710.txt',\n", " 'cv873_19937.txt',\n", " 'cv874_12182.txt',\n", " 'cv875_5622.txt',\n", " 'cv876_9633.txt',\n", " 'cv877_29132.txt',\n", " 'cv878_17204.txt',\n", " 'cv879_16585.txt',\n", " 'cv880_29629.txt',\n", " 'cv881_14767.txt',\n", " 'cv882_10042.txt',\n", " 'cv883_27621.txt',\n", " 'cv884_15230.txt',\n", " 'cv885_13390.txt',\n", " 'cv886_19210.txt',\n", " 'cv887_5306.txt',\n", " 'cv888_25678.txt',\n", " 'cv889_22670.txt',\n", " 'cv890_3515.txt',\n", " 'cv891_6035.txt',\n", " 'cv892_18788.txt',\n", " 'cv893_26731.txt',\n", " 'cv894_22140.txt',\n", " 'cv895_22200.txt',\n", " 'cv896_17819.txt',\n", " 'cv897_11703.txt',\n", " 'cv898_1576.txt',\n", " 'cv899_17812.txt',\n", " 'cv900_10800.txt',\n", " 'cv901_11934.txt',\n", " 'cv902_13217.txt',\n", " 'cv903_18981.txt',\n", " 'cv904_25663.txt',\n", " 'cv905_28965.txt',\n", " 'cv906_12332.txt',\n", " 'cv907_3193.txt',\n", " 'cv908_17779.txt',\n", " 'cv909_9973.txt',\n", " 'cv910_21930.txt',\n", " 'cv911_21695.txt',\n", " 'cv912_5562.txt',\n", " 'cv913_29127.txt',\n", " 'cv914_2856.txt',\n", " 'cv915_9342.txt',\n", " 'cv916_17034.txt',\n", " 'cv917_29484.txt',\n", " 'cv918_27080.txt',\n", " 'cv919_18155.txt',\n", " 'cv920_29423.txt',\n", " 'cv921_13988.txt',\n", " 'cv922_10185.txt',\n", " 'cv923_11951.txt',\n", " 'cv924_29397.txt',\n", " 'cv925_9459.txt',\n", " 'cv926_18471.txt',\n", " 'cv927_11471.txt',\n", " 'cv928_9478.txt',\n", " 'cv929_1841.txt',\n", " 'cv930_14949.txt',\n", " 'cv931_18783.txt',\n", " 'cv932_14854.txt',\n", " 'cv933_24953.txt',\n", " 'cv934_20426.txt',\n", " 'cv935_24977.txt',\n", " 'cv936_17473.txt',\n", " 'cv937_9816.txt',\n", " 'cv938_10706.txt',\n", " 'cv939_11247.txt',\n", " 'cv940_18935.txt',\n", " 'cv941_10718.txt',\n", " 'cv942_18509.txt',\n", " 'cv943_23547.txt',\n", " 'cv944_15042.txt',\n", " 'cv945_13012.txt',\n", " 'cv946_20084.txt',\n", " 'cv947_11316.txt',\n", " 'cv948_25870.txt',\n", " 'cv949_21565.txt',\n", " 'cv950_13478.txt',\n", " 'cv951_11816.txt',\n", " 
'cv952_26375.txt',\n", " 'cv953_7078.txt',\n", " 'cv954_19932.txt',\n", " 'cv955_26154.txt',\n", " 'cv956_12547.txt',\n", " 'cv957_9059.txt',\n", " 'cv958_13020.txt',\n", " 'cv959_16218.txt',\n", " 'cv960_28877.txt',\n", " 'cv961_5578.txt',\n", " 'cv962_9813.txt',\n", " 'cv963_7208.txt',\n", " 'cv964_5794.txt',\n", " 'cv965_26688.txt',\n", " 'cv966_28671.txt',\n", " 'cv967_5626.txt',\n", " 'cv968_25413.txt',\n", " 'cv969_14760.txt',\n", " 'cv970_19532.txt',\n", " 'cv971_11790.txt',\n", " 'cv972_26837.txt',\n", " 'cv973_10171.txt',\n", " 'cv974_24303.txt',\n", " 'cv975_11920.txt',\n", " 'cv976_10724.txt',\n", " 'cv977_4776.txt',\n", " 'cv978_22192.txt',\n", " 'cv979_2029.txt',\n", " 'cv980_11851.txt',\n", " 'cv981_16679.txt',\n", " 'cv982_22209.txt',\n", " 'cv983_24219.txt',\n", " 'cv984_14006.txt',\n", " 'cv985_5964.txt',\n", " 'cv986_15092.txt',\n", " 'cv987_7394.txt',\n", " 'cv988_20168.txt',\n", " 'cv989_17297.txt',\n", " 'cv990_12443.txt',\n", " 'cv991_19973.txt',\n", " 'cv992_12806.txt',\n", " 'cv993_29565.txt',\n", " 'cv994_13229.txt',\n", " 'cv995_23113.txt',\n", " 'cv996_12447.txt',\n", " 'cv997_5152.txt',\n", " 'cv998_15691.txt',\n", " 'cv999_14636.txt'])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(gen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The subfolder `../moviereviews/neg` contains 1000 text files. 
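To confirm a count like this without scrolling through the whole listing, a minimal sketch can tally the files that `os.walk()` reports for each folder. The helper name `count_files_per_folder` is hypothetical and is not part of this notebook.

```python
import os

def count_files_per_folder(root):
    # Hypothetical helper: map each folder under root to the number of files it holds.
    counts = {}
    for folder, subfolders, filenames in os.walk(root):
        counts[folder] = len(filenames)
    return counts

# os.walk() yields nothing for a missing path, so this loop simply prints nothing
# if '../moviereviews' does not exist on your machine.
for folder, n_files in count_files_per_folder('../moviereviews').items():
    print(folder, n_files)
```

For the corpus described here, the `neg` and `pos` entries would each be expected to show 1000 files.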
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# this walks the /pos/ subfolder\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(gen) # this walks the /pos/ subfolder\n", "next(gen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`os.walk()` stopped once it had walked all subfolders." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use os.walk() to build a DataFrame\n", "\n", "The most efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then cast the list as a DataFrame all at once.
We'll take the following steps to build our list:\n", "1. Start with a list of subdirectory names ('neg' and 'pos')\n", "2. Walk each subdirectory\n", "3. Create a dictionary object for every file in a subdirectory, where `label` is either 'neg' or 'pos' and `review` is the text of the file\n", "4. Handle files that have no text (perhaps a reviewer ranked a movie without commenting on it) by leaving out the `review` key, so that those records are given NaN values" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "row_list = []\n", "\n", "for subdir in ['neg', 'pos']:\n", "    for folder, subfolders, filenames in os.walk(os.path.join('../moviereviews', subdir)):\n", "        for file in filenames:\n", "            d = {'label': subdir}  # assign the name of the subdirectory to the label field\n", "            # build the path from the folder being walked so it always matches os.walk()\n", "            with open(os.path.join(folder, file)) as f:\n", "                text = f.read()\n", "            if text:  # empty files get no review key, which becomes NaN in the DataFrame\n", "                d['review'] = text  # assign the contents of the file to the review field\n", "            row_list.append(d)\n", "        break  # only read the files directly inside each subdirectory" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame(row_list)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
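<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>label</th>\n", "      <th>review</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>neg</td>\n", "      <td>NaN</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>neg</td>\n", "      <td>the happy bastard's quick movie review \\ndamn ...</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>neg</td>\n", "      <td>it is movies like these that make a jaded movi...</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>neg</td>\n", "      <td>\" quest for camelot \" is warner bros . ' firs...</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>neg</td>\n", "      <td>synopsis : a mentally unstable man undergoing ...</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>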
" ], "text/plain": [ " label review\n", "0 neg NaN\n", "1 neg the happy bastard's quick movie review \\ndamn ...\n", "2 neg it is movies like these that make a jaded movi...\n", "3 neg \" quest for camelot \" is warner bros . ' firs...\n", "4 neg synopsis : a mentally unstable man undergoing ..." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }