{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"___\n",
|
|
"\n",
|
|
"<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>\n",
|
|
"___"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Building a corpus from individual files\n",
|
|
"Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files where each file contains the text of a review. The labels are determined by the subdirectory that holds the file; that is, positive reviews are stored in a `\\pos\\` directory while negative reviews live under `\\neg\\`. Refer to [moviereviesREADME.txt](../moviereviews/moviereviewsREADME.txt) for more information about the files.\n",
|
|
"\n",
|
|
"We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:\n",
|
|
"* using Python's **os module** to build a pandas DataFrame\n",
|
|
"* using an **nltk** tool called `CategorizedPlaintextCorpusReader` "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Using Python's `os` module to build a DataFrame"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Perform imports:\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"import os"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Let's look at what os.walk() does:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"('../moviereviews - Copy', ['neg', 'pos'], ['poldata.README.2.0'])"
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"gen = os.walk('../moviereviews')\n",
|
|
"next(gen)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"`os.walk()` is a generator that returns a tuple with three items:\n",
|
|
"1. the name of the current folder\n",
|
|
"2. a list of names of any subfolders\n",
|
|
"3. a list of names of any files in the current folder"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"('../moviereviews - Copy\\\\neg',\n",
|
|
" [],\n",
|
|
" ['cv000_29416.txt',\n",
|
|
" 'cv001_19502.txt',\n",
|
|
" 'cv002_17424.txt',\n",
|
|
" 'cv003_12683.txt',\n",
|
|
" 'cv004_12641.txt',\n",
|
|
" 'cv005_29357.txt',\n",
|
|
" 'cv006_17022.txt',\n",
|
|
" 'cv007_4992.txt',\n",
|
|
" 'cv008_29326.txt',\n",
|
|
" 'cv009_29417.txt',\n",
|
|
" 'cv010_29063.txt',\n",
|
|
" 'cv011_13044.txt',\n",
|
|
" 'cv012_29411.txt',\n",
|
|
" 'cv013_10494.txt',\n",
|
|
" 'cv014_15600.txt',\n",
|
|
" 'cv015_29356.txt',\n",
|
|
" 'cv016_4348.txt',\n",
|
|
" 'cv017_23487.txt',\n",
|
|
" 'cv018_21672.txt',\n",
|
|
" 'cv019_16117.txt',\n",
|
|
" 'cv020_9234.txt',\n",
|
|
" 'cv021_17313.txt',\n",
|
|
" 'cv022_14227.txt',\n",
|
|
" 'cv023_13847.txt',\n",
|
|
" 'cv024_7033.txt',\n",
|
|
" 'cv025_29825.txt',\n",
|
|
" 'cv026_29229.txt',\n",
|
|
" 'cv027_26270.txt',\n",
|
|
" 'cv028_26964.txt',\n",
|
|
" 'cv029_19943.txt',\n",
|
|
" 'cv030_22893.txt',\n",
|
|
" 'cv031_19540.txt',\n",
|
|
" 'cv032_23718.txt',\n",
|
|
" 'cv033_25680.txt',\n",
|
|
" 'cv034_29446.txt',\n",
|
|
" 'cv035_3343.txt',\n",
|
|
" 'cv036_18385.txt',\n",
|
|
" 'cv037_19798.txt',\n",
|
|
" 'cv038_9781.txt',\n",
|
|
" 'cv039_5963.txt',\n",
|
|
" 'cv040_8829.txt',\n",
|
|
" 'cv041_22364.txt',\n",
|
|
" 'cv042_11927.txt',\n",
|
|
" 'cv043_16808.txt',\n",
|
|
" 'cv044_18429.txt',\n",
|
|
" 'cv045_25077.txt',\n",
|
|
" 'cv046_10613.txt',\n",
|
|
" 'cv047_18725.txt',\n",
|
|
" 'cv048_18380.txt',\n",
|
|
" 'cv049_21917.txt',\n",
|
|
" 'cv050_12128.txt',\n",
|
|
" 'cv051_10751.txt',\n",
|
|
" 'cv052_29318.txt',\n",
|
|
" 'cv053_23117.txt',\n",
|
|
" 'cv054_4101.txt',\n",
|
|
" 'cv055_8926.txt',\n",
|
|
" 'cv056_14663.txt',\n",
|
|
" 'cv057_7962.txt',\n",
|
|
" 'cv058_8469.txt',\n",
|
|
" 'cv059_28723.txt',\n",
|
|
" 'cv060_11754.txt',\n",
|
|
" 'cv061_9321.txt',\n",
|
|
" 'cv062_24556.txt',\n",
|
|
" 'cv063_28852.txt',\n",
|
|
" 'cv064_25842.txt',\n",
|
|
" 'cv065_16909.txt',\n",
|
|
" 'cv066_11668.txt',\n",
|
|
" 'cv067_21192.txt',\n",
|
|
" 'cv068_14810.txt',\n",
|
|
" 'cv069_11613.txt',\n",
|
|
" 'cv070_13249.txt',\n",
|
|
" 'cv071_12969.txt',\n",
|
|
" 'cv072_5928.txt',\n",
|
|
" 'cv073_23039.txt',\n",
|
|
" 'cv074_7188.txt',\n",
|
|
" 'cv075_6250.txt',\n",
|
|
" 'cv076_26009.txt',\n",
|
|
" 'cv077_23172.txt',\n",
|
|
" 'cv078_16506.txt',\n",
|
|
" 'cv079_12766.txt',\n",
|
|
" 'cv080_14899.txt',\n",
|
|
" 'cv081_18241.txt',\n",
|
|
" 'cv082_11979.txt',\n",
|
|
" 'cv083_25491.txt',\n",
|
|
" 'cv084_15183.txt',\n",
|
|
" 'cv085_15286.txt',\n",
|
|
" 'cv086_19488.txt',\n",
|
|
" 'cv087_2145.txt',\n",
|
|
" 'cv088_25274.txt',\n",
|
|
" 'cv089_12222.txt',\n",
|
|
" 'cv090_0049.txt',\n",
|
|
" 'cv091_7899.txt',\n",
|
|
" 'cv092_27987.txt',\n",
|
|
" 'cv093_15606.txt',\n",
|
|
" 'cv094_27868.txt',\n",
|
|
" 'cv095_28730.txt',\n",
|
|
" 'cv096_12262.txt',\n",
|
|
" 'cv097_26081.txt',\n",
|
|
" 'cv098_17021.txt',\n",
|
|
" 'cv099_11189.txt',\n",
|
|
" 'cv100_12406.txt',\n",
|
|
" 'cv101_10537.txt',\n",
|
|
" 'cv102_8306.txt',\n",
|
|
" 'cv103_11943.txt',\n",
|
|
" 'cv104_19176.txt',\n",
|
|
" 'cv105_19135.txt',\n",
|
|
" 'cv106_18379.txt',\n",
|
|
" 'cv107_25639.txt',\n",
|
|
" 'cv108_17064.txt',\n",
|
|
" 'cv109_22599.txt',\n",
|
|
" 'cv110_27832.txt',\n",
|
|
" 'cv111_12253.txt',\n",
|
|
" 'cv112_12178.txt',\n",
|
|
" 'cv113_24354.txt',\n",
|
|
" 'cv114_19501.txt',\n",
|
|
" 'cv115_26443.txt',\n",
|
|
" 'cv116_28734.txt',\n",
|
|
" 'cv117_25625.txt',\n",
|
|
" 'cv118_28837.txt',\n",
|
|
" 'cv119_9909.txt',\n",
|
|
" 'cv120_3793.txt',\n",
|
|
" 'cv121_18621.txt',\n",
|
|
" 'cv122_7891.txt',\n",
|
|
" 'cv123_12165.txt',\n",
|
|
" 'cv124_3903.txt',\n",
|
|
" 'cv125_9636.txt',\n",
|
|
" 'cv126_28821.txt',\n",
|
|
" 'cv127_16451.txt',\n",
|
|
" 'cv128_29444.txt',\n",
|
|
" 'cv129_18373.txt',\n",
|
|
" 'cv130_18521.txt',\n",
|
|
" 'cv131_11568.txt',\n",
|
|
" 'cv132_5423.txt',\n",
|
|
" 'cv133_18065.txt',\n",
|
|
" 'cv134_23300.txt',\n",
|
|
" 'cv135_12506.txt',\n",
|
|
" 'cv136_12384.txt',\n",
|
|
" 'cv137_17020.txt',\n",
|
|
" 'cv138_13903.txt',\n",
|
|
" 'cv139_14236.txt',\n",
|
|
" 'cv140_7963.txt',\n",
|
|
" 'cv141_17179.txt',\n",
|
|
" 'cv142_23657.txt',\n",
|
|
" 'cv143_21158.txt',\n",
|
|
" 'cv144_5010.txt',\n",
|
|
" 'cv145_12239.txt',\n",
|
|
" 'cv146_19587.txt',\n",
|
|
" 'cv147_22625.txt',\n",
|
|
" 'cv148_18084.txt',\n",
|
|
" 'cv149_17084.txt',\n",
|
|
" 'cv150_14279.txt',\n",
|
|
" 'cv151_17231.txt',\n",
|
|
" 'cv152_9052.txt',\n",
|
|
" 'cv153_11607.txt',\n",
|
|
" 'cv154_9562.txt',\n",
|
|
" 'cv155_7845.txt',\n",
|
|
" 'cv156_11119.txt',\n",
|
|
" 'cv157_29302.txt',\n",
|
|
" 'cv158_10914.txt',\n",
|
|
" 'cv159_29374.txt',\n",
|
|
" 'cv160_10848.txt',\n",
|
|
" 'cv161_12224.txt',\n",
|
|
" 'cv162_10977.txt',\n",
|
|
" 'cv163_10110.txt',\n",
|
|
" 'cv164_23451.txt',\n",
|
|
" 'cv165_2389.txt',\n",
|
|
" 'cv166_11959.txt',\n",
|
|
" 'cv167_18094.txt',\n",
|
|
" 'cv168_7435.txt',\n",
|
|
" 'cv169_24973.txt',\n",
|
|
" 'cv170_29808.txt',\n",
|
|
" 'cv171_15164.txt',\n",
|
|
" 'cv172_12037.txt',\n",
|
|
" 'cv173_4295.txt',\n",
|
|
" 'cv174_9735.txt',\n",
|
|
" 'cv175_7375.txt',\n",
|
|
" 'cv176_14196.txt',\n",
|
|
" 'cv177_10904.txt',\n",
|
|
" 'cv178_14380.txt',\n",
|
|
" 'cv179_9533.txt',\n",
|
|
" 'cv180_17823.txt',\n",
|
|
" 'cv181_16083.txt',\n",
|
|
" 'cv182_7791.txt',\n",
|
|
" 'cv183_19826.txt',\n",
|
|
" 'cv184_26935.txt',\n",
|
|
" 'cv185_28372.txt',\n",
|
|
" 'cv186_2396.txt',\n",
|
|
" 'cv187_14112.txt',\n",
|
|
" 'cv188_20687.txt',\n",
|
|
" 'cv189_24248.txt',\n",
|
|
" 'cv190_27176.txt',\n",
|
|
" 'cv191_29539.txt',\n",
|
|
" 'cv192_16079.txt',\n",
|
|
" 'cv193_5393.txt',\n",
|
|
" 'cv194_12855.txt',\n",
|
|
" 'cv195_16146.txt',\n",
|
|
" 'cv196_28898.txt',\n",
|
|
" 'cv197_29271.txt',\n",
|
|
" 'cv198_19313.txt',\n",
|
|
" 'cv199_9721.txt',\n",
|
|
" 'cv200_29006.txt',\n",
|
|
" 'cv201_7421.txt',\n",
|
|
" 'cv202_11382.txt',\n",
|
|
" 'cv203_19052.txt',\n",
|
|
" 'cv204_8930.txt',\n",
|
|
" 'cv205_9676.txt',\n",
|
|
" 'cv206_15893.txt',\n",
|
|
" 'cv207_29141.txt',\n",
|
|
" 'cv208_9475.txt',\n",
|
|
" 'cv209_28973.txt',\n",
|
|
" 'cv210_9557.txt',\n",
|
|
" 'cv211_9955.txt',\n",
|
|
" 'cv212_10054.txt',\n",
|
|
" 'cv213_20300.txt',\n",
|
|
" 'cv214_13285.txt',\n",
|
|
" 'cv215_23246.txt',\n",
|
|
" 'cv216_20165.txt',\n",
|
|
" 'cv217_28707.txt',\n",
|
|
" 'cv218_25651.txt',\n",
|
|
" 'cv219_19874.txt',\n",
|
|
" 'cv220_28906.txt',\n",
|
|
" 'cv221_27081.txt',\n",
|
|
" 'cv222_18720.txt',\n",
|
|
" 'cv223_28923.txt',\n",
|
|
" 'cv224_18875.txt',\n",
|
|
" 'cv225_29083.txt',\n",
|
|
" 'cv226_26692.txt',\n",
|
|
" 'cv227_25406.txt',\n",
|
|
" 'cv228_5644.txt',\n",
|
|
" 'cv229_15200.txt',\n",
|
|
" 'cv230_7913.txt',\n",
|
|
" 'cv231_11028.txt',\n",
|
|
" 'cv232_16768.txt',\n",
|
|
" 'cv233_17614.txt',\n",
|
|
" 'cv234_22123.txt',\n",
|
|
" 'cv235_10704.txt',\n",
|
|
" 'cv236_12427.txt',\n",
|
|
" 'cv237_20635.txt',\n",
|
|
" 'cv238_14285.txt',\n",
|
|
" 'cv239_29828.txt',\n",
|
|
" 'cv240_15948.txt',\n",
|
|
" 'cv241_24602.txt',\n",
|
|
" 'cv242_11354.txt',\n",
|
|
" 'cv243_22164.txt',\n",
|
|
" 'cv244_22935.txt',\n",
|
|
" 'cv245_8938.txt',\n",
|
|
" 'cv246_28668.txt',\n",
|
|
" 'cv247_14668.txt',\n",
|
|
" 'cv248_15672.txt',\n",
|
|
" 'cv249_12674.txt',\n",
|
|
" 'cv250_26462.txt',\n",
|
|
" 'cv251_23901.txt',\n",
|
|
" 'cv252_24974.txt',\n",
|
|
" 'cv253_10190.txt',\n",
|
|
" 'cv254_5870.txt',\n",
|
|
" 'cv255_15267.txt',\n",
|
|
" 'cv256_16529.txt',\n",
|
|
" 'cv257_11856.txt',\n",
|
|
" 'cv258_5627.txt',\n",
|
|
" 'cv259_11827.txt',\n",
|
|
" 'cv260_15652.txt',\n",
|
|
" 'cv261_11855.txt',\n",
|
|
" 'cv262_13812.txt',\n",
|
|
" 'cv263_20693.txt',\n",
|
|
" 'cv264_14108.txt',\n",
|
|
" 'cv265_11625.txt',\n",
|
|
" 'cv266_26644.txt',\n",
|
|
" 'cv267_16618.txt',\n",
|
|
" 'cv268_20288.txt',\n",
|
|
" 'cv269_23018.txt',\n",
|
|
" 'cv270_5873.txt',\n",
|
|
" 'cv271_15364.txt',\n",
|
|
" 'cv272_20313.txt',\n",
|
|
" 'cv273_28961.txt',\n",
|
|
" 'cv274_26379.txt',\n",
|
|
" 'cv275_28725.txt',\n",
|
|
" 'cv276_17126.txt',\n",
|
|
" 'cv277_20467.txt',\n",
|
|
" 'cv278_14533.txt',\n",
|
|
" 'cv279_19452.txt',\n",
|
|
" 'cv280_8651.txt',\n",
|
|
" 'cv281_24711.txt',\n",
|
|
" 'cv282_6833.txt',\n",
|
|
" 'cv283_11963.txt',\n",
|
|
" 'cv284_20530.txt',\n",
|
|
" 'cv285_18186.txt',\n",
|
|
" 'cv286_26156.txt',\n",
|
|
" 'cv287_17410.txt',\n",
|
|
" 'cv288_20212.txt',\n",
|
|
" 'cv289_6239.txt',\n",
|
|
" 'cv290_11981.txt',\n",
|
|
" 'cv291_26844.txt',\n",
|
|
" 'cv292_7804.txt',\n",
|
|
" 'cv293_29731.txt',\n",
|
|
" 'cv294_12695.txt',\n",
|
|
" 'cv295_17060.txt',\n",
|
|
" 'cv296_13146.txt',\n",
|
|
" 'cv297_10104.txt',\n",
|
|
" 'cv298_24487.txt',\n",
|
|
" 'cv299_17950.txt',\n",
|
|
" 'cv300_23302.txt',\n",
|
|
" 'cv301_13010.txt',\n",
|
|
" 'cv302_26481.txt',\n",
|
|
" 'cv303_27366.txt',\n",
|
|
" 'cv304_28489.txt',\n",
|
|
" 'cv305_9937.txt',\n",
|
|
" 'cv306_10859.txt',\n",
|
|
" 'cv307_26382.txt',\n",
|
|
" 'cv308_5079.txt',\n",
|
|
" 'cv309_23737.txt',\n",
|
|
" 'cv310_14568.txt',\n",
|
|
" 'cv311_17708.txt',\n",
|
|
" 'cv312_29308.txt',\n",
|
|
" 'cv313_19337.txt',\n",
|
|
" 'cv314_16095.txt',\n",
|
|
" 'cv315_12638.txt',\n",
|
|
" 'cv316_5972.txt',\n",
|
|
" 'cv317_25111.txt',\n",
|
|
" 'cv318_11146.txt',\n",
|
|
" 'cv319_16459.txt',\n",
|
|
" 'cv320_9693.txt',\n",
|
|
" 'cv321_14191.txt',\n",
|
|
" 'cv322_21820.txt',\n",
|
|
" 'cv323_29633.txt',\n",
|
|
" 'cv324_7502.txt',\n",
|
|
" 'cv325_18330.txt',\n",
|
|
" 'cv326_14777.txt',\n",
|
|
" 'cv327_21743.txt',\n",
|
|
" 'cv328_10908.txt',\n",
|
|
" 'cv329_29293.txt',\n",
|
|
" 'cv330_29675.txt',\n",
|
|
" 'cv331_8656.txt',\n",
|
|
" 'cv332_17997.txt',\n",
|
|
" 'cv333_9443.txt',\n",
|
|
" 'cv334_0074.txt',\n",
|
|
" 'cv335_16299.txt',\n",
|
|
" 'cv336_10363.txt',\n",
|
|
" 'cv337_29061.txt',\n",
|
|
" 'cv338_9183.txt',\n",
|
|
" 'cv339_22452.txt',\n",
|
|
" 'cv340_14776.txt',\n",
|
|
" 'cv341_25667.txt',\n",
|
|
" 'cv342_20917.txt',\n",
|
|
" 'cv343_10906.txt',\n",
|
|
" 'cv344_5376.txt',\n",
|
|
" 'cv345_9966.txt',\n",
|
|
" 'cv346_19198.txt',\n",
|
|
" 'cv347_14722.txt',\n",
|
|
" 'cv348_19207.txt',\n",
|
|
" 'cv349_15032.txt',\n",
|
|
" 'cv350_22139.txt',\n",
|
|
" 'cv351_17029.txt',\n",
|
|
" 'cv352_5414.txt',\n",
|
|
" 'cv353_19197.txt',\n",
|
|
" 'cv354_8573.txt',\n",
|
|
" 'cv355_18174.txt',\n",
|
|
" 'cv356_26170.txt',\n",
|
|
" 'cv357_14710.txt',\n",
|
|
" 'cv358_11557.txt',\n",
|
|
" 'cv359_6751.txt',\n",
|
|
" 'cv360_8927.txt',\n",
|
|
" 'cv361_28738.txt',\n",
|
|
" 'cv362_16985.txt',\n",
|
|
" 'cv363_29273.txt',\n",
|
|
" 'cv364_14254.txt',\n",
|
|
" 'cv365_12442.txt',\n",
|
|
" 'cv366_10709.txt',\n",
|
|
" 'cv367_24065.txt',\n",
|
|
" 'cv368_11090.txt',\n",
|
|
" 'cv369_14245.txt',\n",
|
|
" 'cv370_5338.txt',\n",
|
|
" 'cv371_8197.txt',\n",
|
|
" 'cv372_6654.txt',\n",
|
|
" 'cv373_21872.txt',\n",
|
|
" 'cv374_26455.txt',\n",
|
|
" 'cv375_9932.txt',\n",
|
|
" 'cv376_20883.txt',\n",
|
|
" 'cv377_8440.txt',\n",
|
|
" 'cv378_21982.txt',\n",
|
|
" 'cv379_23167.txt',\n",
|
|
" 'cv380_8164.txt',\n",
|
|
" 'cv381_21673.txt',\n",
|
|
" 'cv382_8393.txt',\n",
|
|
" 'cv383_14662.txt',\n",
|
|
" 'cv384_18536.txt',\n",
|
|
" 'cv385_29621.txt',\n",
|
|
" 'cv386_10229.txt',\n",
|
|
" 'cv387_12391.txt',\n",
|
|
" 'cv388_12810.txt',\n",
|
|
" 'cv389_9611.txt',\n",
|
|
" 'cv390_12187.txt',\n",
|
|
" 'cv391_11615.txt',\n",
|
|
" 'cv392_12238.txt',\n",
|
|
" 'cv393_29234.txt',\n",
|
|
" 'cv394_5311.txt',\n",
|
|
" 'cv395_11761.txt',\n",
|
|
" 'cv396_19127.txt',\n",
|
|
" 'cv397_28890.txt',\n",
|
|
" 'cv398_17047.txt',\n",
|
|
" 'cv399_28593.txt',\n",
|
|
" 'cv400_20631.txt',\n",
|
|
" 'cv401_13758.txt',\n",
|
|
" 'cv402_16097.txt',\n",
|
|
" 'cv403_6721.txt',\n",
|
|
" 'cv404_21805.txt',\n",
|
|
" 'cv405_21868.txt',\n",
|
|
" 'cv406_22199.txt',\n",
|
|
" 'cv407_23928.txt',\n",
|
|
" 'cv408_5367.txt',\n",
|
|
" 'cv409_29625.txt',\n",
|
|
" 'cv410_25624.txt',\n",
|
|
" 'cv411_16799.txt',\n",
|
|
" 'cv412_25254.txt',\n",
|
|
" 'cv413_7893.txt',\n",
|
|
" 'cv414_11161.txt',\n",
|
|
" 'cv415_23674.txt',\n",
|
|
" 'cv416_12048.txt',\n",
|
|
" 'cv417_14653.txt',\n",
|
|
" 'cv418_16562.txt',\n",
|
|
" 'cv419_14799.txt',\n",
|
|
" 'cv420_28631.txt',\n",
|
|
" 'cv421_9752.txt',\n",
|
|
" 'cv422_9632.txt',\n",
|
|
" 'cv423_12089.txt',\n",
|
|
" 'cv424_9268.txt',\n",
|
|
" 'cv425_8603.txt',\n",
|
|
" 'cv426_10976.txt',\n",
|
|
" 'cv427_11693.txt',\n",
|
|
" 'cv428_12202.txt',\n",
|
|
" 'cv429_7937.txt',\n",
|
|
" 'cv430_18662.txt',\n",
|
|
" 'cv431_7538.txt',\n",
|
|
" 'cv432_15873.txt',\n",
|
|
" 'cv433_10443.txt',\n",
|
|
" 'cv434_5641.txt',\n",
|
|
" 'cv435_24355.txt',\n",
|
|
" 'cv436_20564.txt',\n",
|
|
" 'cv437_24070.txt',\n",
|
|
" 'cv438_8500.txt',\n",
|
|
" 'cv439_17633.txt',\n",
|
|
" 'cv440_16891.txt',\n",
|
|
" 'cv441_15276.txt',\n",
|
|
" 'cv442_15499.txt',\n",
|
|
" 'cv443_22367.txt',\n",
|
|
" 'cv444_9975.txt',\n",
|
|
" 'cv445_26683.txt',\n",
|
|
" 'cv446_12209.txt',\n",
|
|
" 'cv447_27334.txt',\n",
|
|
" 'cv448_16409.txt',\n",
|
|
" 'cv449_9126.txt',\n",
|
|
" 'cv450_8319.txt',\n",
|
|
" 'cv451_11502.txt',\n",
|
|
" 'cv452_5179.txt',\n",
|
|
" 'cv453_10911.txt',\n",
|
|
" 'cv454_21961.txt',\n",
|
|
" 'cv455_28866.txt',\n",
|
|
" 'cv456_20370.txt',\n",
|
|
" 'cv457_19546.txt',\n",
|
|
" 'cv458_9000.txt',\n",
|
|
" 'cv459_21834.txt',\n",
|
|
" 'cv460_11723.txt',\n",
|
|
" 'cv461_21124.txt',\n",
|
|
" 'cv462_20788.txt',\n",
|
|
" 'cv463_10846.txt',\n",
|
|
" 'cv464_17076.txt',\n",
|
|
" 'cv465_23401.txt',\n",
|
|
" 'cv466_20092.txt',\n",
|
|
" 'cv467_26610.txt',\n",
|
|
" 'cv468_16844.txt',\n",
|
|
" 'cv469_21998.txt',\n",
|
|
" 'cv470_17444.txt',\n",
|
|
" 'cv471_18405.txt',\n",
|
|
" 'cv472_29140.txt',\n",
|
|
" 'cv473_7869.txt',\n",
|
|
" 'cv474_10682.txt',\n",
|
|
" 'cv475_22978.txt',\n",
|
|
" 'cv476_18402.txt',\n",
|
|
" 'cv477_23530.txt',\n",
|
|
" 'cv478_15921.txt',\n",
|
|
" 'cv479_5450.txt',\n",
|
|
" 'cv480_21195.txt',\n",
|
|
" 'cv481_7930.txt',\n",
|
|
" 'cv482_11233.txt',\n",
|
|
" 'cv483_18103.txt',\n",
|
|
" 'cv484_26169.txt',\n",
|
|
" 'cv485_26879.txt',\n",
|
|
" 'cv486_9788.txt',\n",
|
|
" 'cv487_11058.txt',\n",
|
|
" 'cv488_21453.txt',\n",
|
|
" 'cv489_19046.txt',\n",
|
|
" 'cv490_18986.txt',\n",
|
|
" 'cv491_12992.txt',\n",
|
|
" 'cv492_19370.txt',\n",
|
|
" 'cv493_14135.txt',\n",
|
|
" 'cv494_18689.txt',\n",
|
|
" 'cv495_16121.txt',\n",
|
|
" 'cv496_11185.txt',\n",
|
|
" 'cv497_27086.txt',\n",
|
|
" 'cv498_9288.txt',\n",
|
|
" 'cv499_11407.txt',\n",
|
|
" 'cv500_10722.txt',\n",
|
|
" 'cv501_12675.txt',\n",
|
|
" 'cv502_10970.txt',\n",
|
|
" 'cv503_11196.txt',\n",
|
|
" 'cv504_29120.txt',\n",
|
|
" 'cv505_12926.txt',\n",
|
|
" 'cv506_17521.txt',\n",
|
|
" 'cv507_9509.txt',\n",
|
|
" 'cv508_17742.txt',\n",
|
|
" 'cv509_17354.txt',\n",
|
|
" 'cv510_24758.txt',\n",
|
|
" 'cv511_10360.txt',\n",
|
|
" 'cv512_17618.txt',\n",
|
|
" 'cv513_7236.txt',\n",
|
|
" 'cv514_12173.txt',\n",
|
|
" 'cv515_18484.txt',\n",
|
|
" 'cv516_12117.txt',\n",
|
|
" 'cv517_20616.txt',\n",
|
|
" 'cv518_14798.txt',\n",
|
|
" 'cv519_16239.txt',\n",
|
|
" 'cv520_13297.txt',\n",
|
|
" 'cv521_1730.txt',\n",
|
|
" 'cv522_5418.txt',\n",
|
|
" 'cv523_18285.txt',\n",
|
|
" 'cv524_24885.txt',\n",
|
|
" 'cv525_17930.txt',\n",
|
|
" 'cv526_12868.txt',\n",
|
|
" 'cv527_10338.txt',\n",
|
|
" 'cv528_11669.txt',\n",
|
|
" 'cv529_10972.txt',\n",
|
|
" 'cv530_17949.txt',\n",
|
|
" 'cv531_26838.txt',\n",
|
|
" 'cv532_6495.txt',\n",
|
|
" 'cv533_9843.txt',\n",
|
|
" 'cv534_15683.txt',\n",
|
|
" 'cv535_21183.txt',\n",
|
|
" 'cv536_27221.txt',\n",
|
|
" 'cv537_13516.txt',\n",
|
|
" 'cv538_28485.txt',\n",
|
|
" 'cv539_21865.txt',\n",
|
|
" 'cv540_3092.txt',\n",
|
|
" 'cv541_28683.txt',\n",
|
|
" 'cv542_20359.txt',\n",
|
|
" 'cv543_5107.txt',\n",
|
|
" 'cv544_5301.txt',\n",
|
|
" 'cv545_12848.txt',\n",
|
|
" 'cv546_12723.txt',\n",
|
|
" 'cv547_18043.txt',\n",
|
|
" 'cv548_18944.txt',\n",
|
|
" 'cv549_22771.txt',\n",
|
|
" 'cv550_23226.txt',\n",
|
|
" 'cv551_11214.txt',\n",
|
|
" 'cv552_0150.txt',\n",
|
|
" 'cv553_26965.txt',\n",
|
|
" 'cv554_14678.txt',\n",
|
|
" 'cv555_25047.txt',\n",
|
|
" 'cv556_16563.txt',\n",
|
|
" 'cv557_12237.txt',\n",
|
|
" 'cv558_29376.txt',\n",
|
|
" 'cv559_0057.txt',\n",
|
|
" 'cv560_18608.txt',\n",
|
|
" 'cv561_9484.txt',\n",
|
|
" 'cv562_10847.txt',\n",
|
|
" 'cv563_18610.txt',\n",
|
|
" 'cv564_12011.txt',\n",
|
|
" 'cv565_29403.txt',\n",
|
|
" 'cv566_8967.txt',\n",
|
|
" 'cv567_29420.txt',\n",
|
|
" 'cv568_17065.txt',\n",
|
|
" 'cv569_26750.txt',\n",
|
|
" 'cv570_28960.txt',\n",
|
|
" 'cv571_29292.txt',\n",
|
|
" 'cv572_20053.txt',\n",
|
|
" 'cv573_29384.txt',\n",
|
|
" 'cv574_23191.txt',\n",
|
|
" 'cv575_22598.txt',\n",
|
|
" 'cv576_15688.txt',\n",
|
|
" 'cv577_28220.txt',\n",
|
|
" 'cv578_16825.txt',\n",
|
|
" 'cv579_12542.txt',\n",
|
|
" 'cv580_15681.txt',\n",
|
|
" 'cv581_20790.txt',\n",
|
|
" 'cv582_6678.txt',\n",
|
|
" 'cv583_29465.txt',\n",
|
|
" 'cv584_29549.txt',\n",
|
|
" 'cv585_23576.txt',\n",
|
|
" 'cv586_8048.txt',\n",
|
|
" 'cv587_20532.txt',\n",
|
|
" 'cv588_14467.txt',\n",
|
|
" 'cv589_12853.txt',\n",
|
|
" 'cv590_20712.txt',\n",
|
|
" 'cv591_24887.txt',\n",
|
|
" 'cv592_23391.txt',\n",
|
|
" 'cv593_11931.txt',\n",
|
|
" 'cv594_11945.txt',\n",
|
|
" 'cv595_26420.txt',\n",
|
|
" 'cv596_4367.txt',\n",
|
|
" 'cv597_26744.txt',\n",
|
|
" 'cv598_18184.txt',\n",
|
|
" 'cv599_22197.txt',\n",
|
|
" 'cv600_25043.txt',\n",
|
|
" 'cv601_24759.txt',\n",
|
|
" 'cv602_8830.txt',\n",
|
|
" 'cv603_18885.txt',\n",
|
|
" 'cv604_23339.txt',\n",
|
|
" 'cv605_12730.txt',\n",
|
|
" 'cv606_17672.txt',\n",
|
|
" 'cv607_8235.txt',\n",
|
|
" 'cv608_24647.txt',\n",
|
|
" 'cv609_25038.txt',\n",
|
|
" 'cv610_24153.txt',\n",
|
|
" 'cv611_2253.txt',\n",
|
|
" 'cv612_5396.txt',\n",
|
|
" 'cv613_23104.txt',\n",
|
|
" 'cv614_11320.txt',\n",
|
|
" 'cv615_15734.txt',\n",
|
|
" 'cv616_29187.txt',\n",
|
|
" 'cv617_9561.txt',\n",
|
|
" 'cv618_9469.txt',\n",
|
|
" 'cv619_13677.txt',\n",
|
|
" 'cv620_2556.txt',\n",
|
|
" 'cv621_15984.txt',\n",
|
|
" 'cv622_8583.txt',\n",
|
|
" 'cv623_16988.txt',\n",
|
|
" 'cv624_11601.txt',\n",
|
|
" 'cv625_13518.txt',\n",
|
|
" 'cv626_7907.txt',\n",
|
|
" 'cv627_12603.txt',\n",
|
|
" 'cv628_20758.txt',\n",
|
|
" 'cv629_16604.txt',\n",
|
|
" 'cv630_10152.txt',\n",
|
|
" 'cv631_4782.txt',\n",
|
|
" 'cv632_9704.txt',\n",
|
|
" 'cv633_29730.txt',\n",
|
|
" 'cv634_11989.txt',\n",
|
|
" 'cv635_0984.txt',\n",
|
|
" 'cv636_16954.txt',\n",
|
|
" 'cv637_13682.txt',\n",
|
|
" 'cv638_29394.txt',\n",
|
|
" 'cv639_10797.txt',\n",
|
|
" 'cv640_5380.txt',\n",
|
|
" 'cv641_13412.txt',\n",
|
|
" 'cv642_29788.txt',\n",
|
|
" 'cv643_29282.txt',\n",
|
|
" 'cv644_18551.txt',\n",
|
|
" 'cv645_17078.txt',\n",
|
|
" 'cv646_16817.txt',\n",
|
|
" 'cv647_15275.txt',\n",
|
|
" 'cv648_17277.txt',\n",
|
|
" 'cv649_13947.txt',\n",
|
|
" 'cv650_15974.txt',\n",
|
|
" 'cv651_11120.txt',\n",
|
|
" 'cv652_15653.txt',\n",
|
|
" 'cv653_2107.txt',\n",
|
|
" 'cv654_19345.txt',\n",
|
|
" 'cv655_12055.txt',\n",
|
|
" 'cv656_25395.txt',\n",
|
|
" 'cv657_25835.txt',\n",
|
|
" 'cv658_11186.txt',\n",
|
|
" 'cv659_21483.txt',\n",
|
|
" 'cv660_23140.txt',\n",
|
|
" 'cv661_25780.txt',\n",
|
|
" 'cv662_14791.txt',\n",
|
|
" 'cv663_14484.txt',\n",
|
|
" 'cv664_4264.txt',\n",
|
|
" 'cv665_29386.txt',\n",
|
|
" 'cv666_20301.txt',\n",
|
|
" 'cv667_19672.txt',\n",
|
|
" 'cv668_18848.txt',\n",
|
|
" 'cv669_24318.txt',\n",
|
|
" 'cv670_2666.txt',\n",
|
|
" 'cv671_5164.txt',\n",
|
|
" 'cv672_27988.txt',\n",
|
|
" 'cv673_25874.txt',\n",
|
|
" 'cv674_11593.txt',\n",
|
|
" 'cv675_22871.txt',\n",
|
|
" 'cv676_22202.txt',\n",
|
|
" 'cv677_18938.txt',\n",
|
|
" 'cv678_14887.txt',\n",
|
|
" 'cv679_28221.txt',\n",
|
|
" 'cv680_10533.txt',\n",
|
|
" 'cv681_9744.txt',\n",
|
|
" 'cv682_17947.txt',\n",
|
|
" 'cv683_13047.txt',\n",
|
|
" 'cv684_12727.txt',\n",
|
|
" 'cv685_5710.txt',\n",
|
|
" 'cv686_15553.txt',\n",
|
|
" 'cv687_22207.txt',\n",
|
|
" 'cv688_7884.txt',\n",
|
|
" 'cv689_13701.txt',\n",
|
|
" 'cv690_5425.txt',\n",
|
|
" 'cv691_5090.txt',\n",
|
|
" 'cv692_17026.txt',\n",
|
|
" 'cv693_19147.txt',\n",
|
|
" 'cv694_4526.txt',\n",
|
|
" 'cv695_22268.txt',\n",
|
|
" 'cv696_29619.txt',\n",
|
|
" 'cv697_12106.txt',\n",
|
|
" 'cv698_16930.txt',\n",
|
|
" 'cv699_7773.txt',\n",
|
|
" 'cv700_23163.txt',\n",
|
|
" 'cv701_15880.txt',\n",
|
|
" 'cv702_12371.txt',\n",
|
|
" 'cv703_17948.txt',\n",
|
|
" 'cv704_17622.txt',\n",
|
|
" 'cv705_11973.txt',\n",
|
|
" 'cv706_25883.txt',\n",
|
|
" 'cv707_11421.txt',\n",
|
|
" 'cv708_28539.txt',\n",
|
|
" 'cv709_11173.txt',\n",
|
|
" 'cv710_23745.txt',\n",
|
|
" 'cv711_12687.txt',\n",
|
|
" 'cv712_24217.txt',\n",
|
|
" 'cv713_29002.txt',\n",
|
|
" 'cv714_19704.txt',\n",
|
|
" 'cv715_19246.txt',\n",
|
|
" 'cv716_11153.txt',\n",
|
|
" 'cv717_17472.txt',\n",
|
|
" 'cv718_12227.txt',\n",
|
|
" 'cv719_5581.txt',\n",
|
|
" 'cv720_5383.txt',\n",
|
|
" 'cv721_28993.txt',\n",
|
|
" 'cv722_7571.txt',\n",
|
|
" 'cv723_9002.txt',\n",
|
|
" 'cv724_15265.txt',\n",
|
|
" 'cv725_10266.txt',\n",
|
|
" 'cv726_4365.txt',\n",
|
|
" 'cv727_5006.txt',\n",
|
|
" 'cv728_17931.txt',\n",
|
|
" 'cv729_10475.txt',\n",
|
|
" 'cv730_10729.txt',\n",
|
|
" 'cv731_3968.txt',\n",
|
|
" 'cv732_13092.txt',\n",
|
|
" 'cv733_9891.txt',\n",
|
|
" 'cv734_22821.txt',\n",
|
|
" 'cv735_20218.txt',\n",
|
|
" 'cv736_24947.txt',\n",
|
|
" 'cv737_28733.txt',\n",
|
|
" 'cv738_10287.txt',\n",
|
|
" 'cv739_12179.txt',\n",
|
|
" 'cv740_13643.txt',\n",
|
|
" 'cv741_12765.txt',\n",
|
|
" 'cv742_8279.txt',\n",
|
|
" 'cv743_17023.txt',\n",
|
|
" 'cv744_10091.txt',\n",
|
|
" 'cv745_14009.txt',\n",
|
|
" 'cv746_10471.txt',\n",
|
|
" 'cv747_18189.txt',\n",
|
|
" 'cv748_14044.txt',\n",
|
|
" 'cv749_18960.txt',\n",
|
|
" 'cv750_10606.txt',\n",
|
|
" 'cv751_17208.txt',\n",
|
|
" 'cv752_25330.txt',\n",
|
|
" 'cv753_11812.txt',\n",
|
|
" 'cv754_7709.txt',\n",
|
|
" 'cv755_24881.txt',\n",
|
|
" 'cv756_23676.txt',\n",
|
|
" 'cv757_10668.txt',\n",
|
|
" 'cv758_9740.txt',\n",
|
|
" 'cv759_15091.txt',\n",
|
|
" 'cv760_8977.txt',\n",
|
|
" 'cv761_13769.txt',\n",
|
|
" 'cv762_15604.txt',\n",
|
|
" 'cv763_16486.txt',\n",
|
|
" 'cv764_12701.txt',\n",
|
|
" 'cv765_20429.txt',\n",
|
|
" 'cv766_7983.txt',\n",
|
|
" 'cv767_15673.txt',\n",
|
|
" 'cv768_12709.txt',\n",
|
|
" 'cv769_8565.txt',\n",
|
|
" 'cv770_11061.txt',\n",
|
|
" 'cv771_28466.txt',\n",
|
|
" 'cv772_12971.txt',\n",
|
|
" 'cv773_20264.txt',\n",
|
|
" 'cv774_15488.txt',\n",
|
|
" 'cv775_17966.txt',\n",
|
|
" 'cv776_21934.txt',\n",
|
|
" 'cv777_10247.txt',\n",
|
|
" 'cv778_18629.txt',\n",
|
|
" 'cv779_18989.txt',\n",
|
|
" 'cv780_8467.txt',\n",
|
|
" 'cv781_5358.txt',\n",
|
|
" 'cv782_21078.txt',\n",
|
|
" 'cv783_14724.txt',\n",
|
|
" 'cv784_16077.txt',\n",
|
|
" 'cv785_23748.txt',\n",
|
|
" 'cv786_23608.txt',\n",
|
|
" 'cv787_15277.txt',\n",
|
|
" 'cv788_26409.txt',\n",
|
|
" 'cv789_12991.txt',\n",
|
|
" 'cv790_16202.txt',\n",
|
|
" 'cv791_17995.txt',\n",
|
|
" 'cv792_3257.txt',\n",
|
|
" 'cv793_15235.txt',\n",
|
|
" 'cv794_17353.txt',\n",
|
|
" 'cv795_10291.txt',\n",
|
|
" 'cv796_17243.txt',\n",
|
|
" 'cv797_7245.txt',\n",
|
|
" 'cv798_24779.txt',\n",
|
|
" 'cv799_19812.txt',\n",
|
|
" 'cv800_13494.txt',\n",
|
|
" 'cv801_26335.txt',\n",
|
|
" 'cv802_28381.txt',\n",
|
|
" 'cv803_8584.txt',\n",
|
|
" 'cv804_11763.txt',\n",
|
|
" 'cv805_21128.txt',\n",
|
|
" 'cv806_9405.txt',\n",
|
|
" 'cv807_23024.txt',\n",
|
|
" 'cv808_13773.txt',\n",
|
|
" 'cv809_5012.txt',\n",
|
|
" 'cv810_13660.txt',\n",
|
|
" 'cv811_22646.txt',\n",
|
|
" 'cv812_19051.txt',\n",
|
|
" 'cv813_6649.txt',\n",
|
|
" 'cv814_20316.txt',\n",
|
|
" 'cv815_23466.txt',\n",
|
|
" 'cv816_15257.txt',\n",
|
|
" 'cv817_3675.txt',\n",
|
|
" 'cv818_10698.txt',\n",
|
|
" 'cv819_9567.txt',\n",
|
|
" 'cv820_24157.txt',\n",
|
|
" 'cv821_29283.txt',\n",
|
|
" 'cv822_21545.txt',\n",
|
|
" 'cv823_17055.txt',\n",
|
|
" 'cv824_9335.txt',\n",
|
|
" 'cv825_5168.txt',\n",
|
|
" 'cv826_12761.txt',\n",
|
|
" 'cv827_19479.txt',\n",
|
|
" 'cv828_21392.txt',\n",
|
|
" 'cv829_21725.txt',\n",
|
|
" 'cv830_5778.txt',\n",
|
|
" 'cv831_16325.txt',\n",
|
|
" 'cv832_24713.txt',\n",
|
|
" 'cv833_11961.txt',\n",
|
|
" 'cv834_23192.txt',\n",
|
|
" 'cv835_20531.txt',\n",
|
|
" 'cv836_14311.txt',\n",
|
|
" 'cv837_27232.txt',\n",
|
|
" 'cv838_25886.txt',\n",
|
|
" 'cv839_22807.txt',\n",
|
|
" 'cv840_18033.txt',\n",
|
|
" 'cv841_3367.txt',\n",
|
|
" 'cv842_5702.txt',\n",
|
|
" 'cv843_17054.txt',\n",
|
|
" 'cv844_13890.txt',\n",
|
|
" 'cv845_15886.txt',\n",
|
|
" 'cv846_29359.txt',\n",
|
|
" 'cv847_20855.txt',\n",
|
|
" 'cv848_10061.txt',\n",
|
|
" 'cv849_17215.txt',\n",
|
|
" 'cv850_18185.txt',\n",
|
|
" 'cv851_21895.txt',\n",
|
|
" 'cv852_27512.txt',\n",
|
|
" 'cv853_29119.txt',\n",
|
|
" 'cv854_18955.txt',\n",
|
|
" 'cv855_22134.txt',\n",
|
|
" 'cv856_28882.txt',\n",
|
|
" 'cv857_17527.txt',\n",
|
|
" 'cv858_20266.txt',\n",
|
|
" 'cv859_15689.txt',\n",
|
|
" 'cv860_15520.txt',\n",
|
|
" 'cv861_12809.txt',\n",
|
|
" 'cv862_15924.txt',\n",
|
|
" 'cv863_7912.txt',\n",
|
|
" 'cv864_3087.txt',\n",
|
|
" 'cv865_28796.txt',\n",
|
|
" 'cv866_29447.txt',\n",
|
|
" 'cv867_18362.txt',\n",
|
|
" 'cv868_12799.txt',\n",
|
|
" 'cv869_24782.txt',\n",
|
|
" 'cv870_18090.txt',\n",
|
|
" 'cv871_25971.txt',\n",
|
|
" 'cv872_13710.txt',\n",
|
|
" 'cv873_19937.txt',\n",
|
|
" 'cv874_12182.txt',\n",
|
|
" 'cv875_5622.txt',\n",
|
|
" 'cv876_9633.txt',\n",
|
|
" 'cv877_29132.txt',\n",
|
|
" 'cv878_17204.txt',\n",
|
|
" 'cv879_16585.txt',\n",
|
|
" 'cv880_29629.txt',\n",
|
|
" 'cv881_14767.txt',\n",
|
|
" 'cv882_10042.txt',\n",
|
|
" 'cv883_27621.txt',\n",
|
|
" 'cv884_15230.txt',\n",
|
|
" 'cv885_13390.txt',\n",
|
|
" 'cv886_19210.txt',\n",
|
|
" 'cv887_5306.txt',\n",
|
|
" 'cv888_25678.txt',\n",
|
|
" 'cv889_22670.txt',\n",
|
|
" 'cv890_3515.txt',\n",
|
|
" 'cv891_6035.txt',\n",
|
|
" 'cv892_18788.txt',\n",
|
|
" 'cv893_26731.txt',\n",
|
|
" 'cv894_22140.txt',\n",
|
|
" 'cv895_22200.txt',\n",
|
|
" 'cv896_17819.txt',\n",
|
|
" 'cv897_11703.txt',\n",
|
|
" 'cv898_1576.txt',\n",
|
|
" 'cv899_17812.txt',\n",
|
|
" 'cv900_10800.txt',\n",
|
|
" 'cv901_11934.txt',\n",
|
|
" 'cv902_13217.txt',\n",
|
|
" 'cv903_18981.txt',\n",
|
|
" 'cv904_25663.txt',\n",
|
|
" 'cv905_28965.txt',\n",
|
|
" 'cv906_12332.txt',\n",
|
|
" 'cv907_3193.txt',\n",
|
|
" 'cv908_17779.txt',\n",
|
|
" 'cv909_9973.txt',\n",
|
|
" 'cv910_21930.txt',\n",
|
|
" 'cv911_21695.txt',\n",
|
|
" 'cv912_5562.txt',\n",
|
|
" 'cv913_29127.txt',\n",
|
|
" 'cv914_2856.txt',\n",
|
|
" 'cv915_9342.txt',\n",
|
|
" 'cv916_17034.txt',\n",
|
|
" 'cv917_29484.txt',\n",
|
|
" 'cv918_27080.txt',\n",
|
|
" 'cv919_18155.txt',\n",
|
|
" 'cv920_29423.txt',\n",
|
|
" 'cv921_13988.txt',\n",
|
|
" 'cv922_10185.txt',\n",
|
|
" 'cv923_11951.txt',\n",
|
|
" 'cv924_29397.txt',\n",
|
|
" 'cv925_9459.txt',\n",
|
|
" 'cv926_18471.txt',\n",
|
|
" 'cv927_11471.txt',\n",
|
|
" 'cv928_9478.txt',\n",
|
|
" 'cv929_1841.txt',\n",
|
|
" 'cv930_14949.txt',\n",
|
|
" 'cv931_18783.txt',\n",
|
|
" 'cv932_14854.txt',\n",
|
|
" 'cv933_24953.txt',\n",
|
|
" 'cv934_20426.txt',\n",
|
|
" 'cv935_24977.txt',\n",
|
|
" 'cv936_17473.txt',\n",
|
|
" 'cv937_9816.txt',\n",
|
|
" 'cv938_10706.txt',\n",
|
|
" 'cv939_11247.txt',\n",
|
|
" 'cv940_18935.txt',\n",
|
|
" 'cv941_10718.txt',\n",
|
|
" 'cv942_18509.txt',\n",
|
|
" 'cv943_23547.txt',\n",
|
|
" 'cv944_15042.txt',\n",
|
|
" 'cv945_13012.txt',\n",
|
|
" 'cv946_20084.txt',\n",
|
|
" 'cv947_11316.txt',\n",
|
|
" 'cv948_25870.txt',\n",
|
|
" 'cv949_21565.txt',\n",
|
|
" 'cv950_13478.txt',\n",
|
|
" 'cv951_11816.txt',\n",
|
|
" 'cv952_26375.txt',\n",
|
|
" 'cv953_7078.txt',\n",
|
|
" 'cv954_19932.txt',\n",
|
|
" 'cv955_26154.txt',\n",
|
|
" 'cv956_12547.txt',\n",
|
|
" 'cv957_9059.txt',\n",
|
|
" 'cv958_13020.txt',\n",
|
|
" 'cv959_16218.txt',\n",
|
|
" 'cv960_28877.txt',\n",
|
|
" 'cv961_5578.txt',\n",
|
|
" 'cv962_9813.txt',\n",
|
|
" 'cv963_7208.txt',\n",
|
|
" 'cv964_5794.txt',\n",
|
|
" 'cv965_26688.txt',\n",
|
|
" 'cv966_28671.txt',\n",
|
|
" 'cv967_5626.txt',\n",
|
|
" 'cv968_25413.txt',\n",
|
|
" 'cv969_14760.txt',\n",
|
|
" 'cv970_19532.txt',\n",
|
|
" 'cv971_11790.txt',\n",
|
|
" 'cv972_26837.txt',\n",
|
|
" 'cv973_10171.txt',\n",
|
|
" 'cv974_24303.txt',\n",
|
|
" 'cv975_11920.txt',\n",
|
|
" 'cv976_10724.txt',\n",
|
|
" 'cv977_4776.txt',\n",
|
|
" 'cv978_22192.txt',\n",
|
|
" 'cv979_2029.txt',\n",
|
|
" 'cv980_11851.txt',\n",
|
|
" 'cv981_16679.txt',\n",
|
|
" 'cv982_22209.txt',\n",
|
|
" 'cv983_24219.txt',\n",
|
|
" 'cv984_14006.txt',\n",
|
|
" 'cv985_5964.txt',\n",
|
|
" 'cv986_15092.txt',\n",
|
|
" 'cv987_7394.txt',\n",
|
|
" 'cv988_20168.txt',\n",
|
|
" 'cv989_17297.txt',\n",
|
|
" 'cv990_12443.txt',\n",
|
|
" 'cv991_19973.txt',\n",
|
|
" 'cv992_12806.txt',\n",
|
|
" 'cv993_29565.txt',\n",
|
|
" 'cv994_13229.txt',\n",
|
|
" 'cv995_23113.txt',\n",
|
|
" 'cv996_12447.txt',\n",
|
|
" 'cv997_5152.txt',\n",
|
|
" 'cv998_15691.txt',\n",
|
|
" 'cv999_14636.txt'])"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"next(gen)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The subfolder `../moviereviews/neg` contains 1000 text files. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"ename": "StopIteration",
|
|
"evalue": "",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)",
|
|
"\u001b[1;32m<ipython-input-16-e2a758a6db89>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# this walks the /pos/ subfolder\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
|
|
"\u001b[1;31mStopIteration\u001b[0m: "
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"next(gen) # this walks the /pos/ subfolder\n",
|
|
"next(gen)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"`os.walk()` stopped once it had walked all subfolders."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Use os.walk() to build a DataFrame\n",
|
|
"\n",
|
|
"The most efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then cast the list as a DataFrame all at once.<br>We'll take the following steps to build our list:\n",
|
|
"1. Start with a list of subdirectory names ('neg' and 'pos')\n",
|
|
"2. Walk each subdirectory\n",
|
|
"3. Create a dictionary object for every file in a subdirectory where `label` is either 'neg' or 'pos', and `review` is the text of the file.\n",
|
|
"4. We need to handle cases where files have no text - perhaps a reviewer ranked a movie without commenting on it - so that records are given NaN values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"row_list = []\n",
|
|
"\n",
|
|
"for subdir in ['neg','pos']:\n",
|
|
" for folder, subfolders, filenames in os.walk('../moviereviews/'+subdir):\n",
|
|
" for file in filenames:\n",
|
|
" d = {'label':subdir} # assign the name of the subdirectory to the label field\n",
|
|
" with open('moviereviews/'+subdir+'/'+file) as f:\n",
|
|
" if f.read(): # handles the case of empty files, which become NaN on import\n",
|
|
" f.seek(0)\n",
|
|
" d['review'] = f.read() # assign the contents of the file to the review field\n",
|
|
" row_list.append(d)\n",
|
|
" break"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.DataFrame(row_list)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style>\n",
|
|
" .dataframe thead tr:only-child th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: left;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>label</th>\n",
|
|
" <th>review</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>neg</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>neg</td>\n",
|
|
" <td>the happy bastard's quick movie review \\ndamn ...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>neg</td>\n",
|
|
" <td>it is movies like these that make a jaded movi...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>neg</td>\n",
|
|
" <td>\" quest for camelot \" is warner bros . ' firs...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>neg</td>\n",
|
|
" <td>synopsis : a mentally unstable man undergoing ...</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" label review\n",
|
|
"0 neg NaN\n",
|
|
"1 neg the happy bastard's quick movie review \\ndamn ...\n",
|
|
"2 neg it is movies like these that make a jaded movi...\n",
|
|
"3 neg \" quest for camelot \" is warner bros . ' firs...\n",
|
|
"4 neg synopsis : a mentally unstable man undergoing ..."
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|