___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Building a corpus from individual files
Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files, each containing the text of one review. The label is determined by the subdirectory that holds the file: positive reviews are stored in a `\pos\` directory, while negative reviews live under `\neg\`. Refer to [moviereviewsREADME.txt](../moviereviews/moviereviewsREADME.txt) for more information about the files.

We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:
* using Python's **os module** to build a pandas DataFrame
* using an **nltk** tool called `CategorizedPlaintextCorpusReader` 

## Using Python's `os` module to build a DataFrame

In [1]:
# Perform imports:
import numpy as np
import pandas as pd
import os

### Let's look at what os.walk() does:

In [14]:
gen = os.walk('../moviereviews')
next(gen)

('../moviereviews - Copy', ['neg', 'pos'], ['poldata.README.2.0'])

`os.walk()` returns a generator. Each call to `next()` yields a tuple with three items:
1. the name of the current folder
2. a list of names of any subfolders
3. a list of names of any files in the current folder
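
As a small illustration of that three-item tuple, the sketch below builds a throwaway folder tree with `tempfile` and unpacks each tuple as it is yielded. The folder and file names are made up for this example and are not part of the movie reviews dataset.

```python
import os
import tempfile

# Build a throwaway tree: root/ holds one file plus 'neg' and 'pos' subfolders
# (all names here are hypothetical, chosen to mirror the layout above)
root = tempfile.mkdtemp()
open(os.path.join(root, 'README.txt'), 'w').close()
for sub in ['neg', 'pos']:
    os.mkdir(os.path.join(root, sub))
    open(os.path.join(root, sub, 'sample.txt'), 'w').close()

# Each step of the walk yields (current folder, subfolder names, file names)
walked = []
for folder, subfolders, filenames in os.walk(root):
    subfolders.sort()  # sorting in place fixes the order in which subfolders are visited
    walked.append((os.path.relpath(folder, root), list(subfolders), sorted(filenames)))

for entry in walked:
    print(entry)
```

The first tuple describes the root folder itself; the following tuples describe each subfolder in turn, with an empty subfolder list because they contain only files.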

In [15]:
next(gen)

('../moviereviews - Copy\\neg',
 [],
 ['cv000_29416.txt',
  'cv001_19502.txt',
  'cv002_17424.txt',
  'cv003_12683.txt',
  'cv004_12641.txt',
  'cv005_29357.txt',
  'cv006_17022.txt',
  'cv007_4992.txt',
  'cv008_29326.txt',
  'cv009_29417.txt',
  'cv010_29063.txt',
  'cv011_13044.txt',
  'cv012_29411.txt',
  'cv013_10494.txt',
  'cv014_15600.txt',
  'cv015_29356.txt',
  'cv016_4348.txt',
  'cv017_23487.txt',
  'cv018_21672.txt',
  'cv019_16117.txt',
  'cv020_9234.txt',
  'cv021_17313.txt',
  'cv022_14227.txt',
  'cv023_13847.txt',
  'cv024_7033.txt',
  'cv025_29825.txt',
  'cv026_29229.txt',
  'cv027_26270.txt',
  'cv028_26964.txt',
  'cv029_19943.txt',
  'cv030_22893.txt',
  'cv031_19540.txt',
  'cv032_23718.txt',
  'cv033_25680.txt',
  'cv034_29446.txt',
  'cv035_3343.txt',
  'cv036_18385.txt',
  'cv037_19798.txt',
  'cv038_9781.txt',
  'cv039_5963.txt',
  'cv040_8829.txt',
  'cv041_22364.txt',
  'cv042_11927.txt',
  'cv043_16808.txt',
  'cv044_18429.txt',
  'cv045_25077.txt',
  'cv

The subfolder `../moviereviews/neg` contains 1000 text files. 

In [16]:
next(gen) # this walks the /pos/ subfolder
next(gen) # nothing is left to walk, so this raises StopIteration

StopIteration: 

`os.walk()` stopped once it had walked all subfolders.
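
Calling `next()` by hand is handy for exploring, but a `for` loop consumes the generator and ends cleanly once it is exhausted. `next()` also accepts a default value to return in place of raising `StopIteration`. A minimal sketch, again using a temporary folder with hypothetical names:

```python
import os
import tempfile

# Throwaway tree with two empty subfolders (names are illustrative only)
root = tempfile.mkdtemp()
for sub in ['neg', 'pos']:
    os.mkdir(os.path.join(root, sub))

gen = os.walk(root)

# A for loop handles the end of the generator for us
steps = 0
for folder, subfolders, filenames in gen:
    steps += 1   # one step for the root, plus one per subfolder
print(steps)

# The generator is now exhausted; a default avoids the StopIteration error
leftover = next(gen, None)
print(leftover)
```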

### Use os.walk() to build a DataFrame

An efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then convert the list to a DataFrame all at once, rather than growing the DataFrame one row at a time.<br>We'll take the following steps to build our list:
1. Start with a list of subdirectory names ('neg' and 'pos')
2. Walk each subdirectory
3. Create a dictionary object for every file in a subdirectory where `label` is either 'neg' or 'pos', and `review` is the text of the file.
4. Handle files that have no text (perhaps a reviewer rated a movie without commenting on it) so that those records are given NaN values.

In [20]:
row_list = []

for subdir in ['neg','pos']:
    for folder, subfolders, filenames in os.walk('../moviereviews/'+subdir):
        for file in filenames:
            d = {'label':subdir}  # assign the name of the subdirectory to the label field
            with open(os.path.join(folder, file)) as f:  # build the path from the folder being walked
                text = f.read()
                if text:          # empty files get no 'review' key, which becomes NaN in the DataFrame
                    d['review'] = text  # assign the contents of the file to the review field
            row_list.append(d)
        break  # only process the top level of each subdirectory

In [21]:
df = pd.DataFrame(row_list)
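
To see why a dictionary without a `review` key ends up as NaN, here is a minimal sketch with made-up rows (the review text is invented for illustration). pandas uses the union of the dictionary keys as the columns and fills any gaps with NaN.

```python
import pandas as pd

# Hypothetical rows: the second dictionary has no 'review' key,
# just like an empty file in the loop above
rows = [
    {'label': 'neg', 'review': 'a dull and predictable film'},
    {'label': 'pos'},
]

demo = pd.DataFrame(rows)

print(demo)
print(demo['review'].isnull().sum())  # number of missing reviews
```

Rows with a missing review can then be counted as shown, or dropped with `demo.dropna()`, depending on what the analysis needs.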

In [22]:
df.head()

| | label | review |
|---|---|---|
| 0 | neg | |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |
