Teaching myself: NLTK, web scraping, other stuff

September 2015. Working through the NLTK book. Scraping academic papers, analyzing them. No hypothesis yet.

~~git.~~
~~Make def for scraping text, converting PDF to TXT.~~
Basic text data cleaning: eliminate \n, etc.
~~FreqDist common words.~~
Language diversity?
Data visualizations of everything.
Add more sources/corpora. (Compare to non-academic? Compare over time?)
Refactor code into: scrape.py, clean.py, analysis.py, visualize.py, or something.
Most common word endings.
Sentence length.
FreqDist: instead of absolute counts, convert to % of total word count.
Return a random sentence.
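
Rough sketch of a few of the open items above (cleaning, % frequencies, word endings, sentence length, random sentence), not the code in this repo. It uses only the standard library, with `collections.Counter` standing in for NLTK's `FreqDist`. All function names and the sample text are made up for illustration, the tokenizer and sentence splitter are deliberately naive, and "language diversity" is assumed here to mean lexical diversity (unique words / total words).

```python
import random
import re
from collections import Counter


def clean(text):
    """Collapse newlines and other whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def relative_freq(words):
    """Map each word to its share of the total word count, in percent."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: 100.0 * c / total for w, c in counts.items()}


def common_endings(words, n=2, top=5):
    """Most common n-letter endings, skipping words of n letters or fewer."""
    return Counter(w[-n:] for w in words if len(w) > n).most_common(top)


def mean_sentence_length(sentences):
    """Average number of word tokens per sentence (0.0 if no sentences)."""
    lengths = [len(tokenize(s)) for s in sentences]
    return sum(lengths) / len(lengths) if lengths else 0.0


def lexical_diversity(words):
    """Unique words divided by total words (assumed meaning of 'diversity')."""
    return len(set(words)) / len(words) if words else 0.0


def random_sentence(sentences, rng=random):
    """Pick one sentence at random; pass a seeded Random for repeatability."""
    return rng.choice(sentences)


# Made-up sample text, standing in for a scraped paper.
sample = "Publish or perish.\nWe publish papers.\n\nPapers perish slowly!"
text = clean(sample)
words = tokenize(text)
sentences = split_sentences(text)

print(relative_freq(words)["publish"])
print(common_endings(words))
print(mean_sentence_length(sentences))
print(lexical_diversity(words))
print(random_sentence(sentences, random.Random(0)))
```

Swapping in `nltk.word_tokenize`, `nltk.sent_tokenize`, and `nltk.FreqDist` later should only mean replacing `tokenize`, `split_sentences`, and the `Counter` calls.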
