September 2015. Working through the NLTK book. Scraping academic papers, analyzing them. Hypothesisless.
- git.
- Make def for scraping text, converting PDF to TXT.
- Basic text data cleaning: eliminate \\n, etc.
- FreqDist common words.
- Language diversity?
- Data visualizations of everything.
- Add more sources/corpora. (Compare to non-academic? Compare over time?)
- Refactor code into: scrape.py, clean.py, analysis.py, visualize.py, or something.
- Most common word endings.
- Sentence length.
- FreqDist: instead of abs values, convert to % of total wordcount.
- Return a random sentence.
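
A rough sketch of how a few of the items above (cleaning, percent-based frequencies, diversity, word endings, sentence length, random sentence) might look. Everything here is an assumption for illustration: the function names are made up, the tokenizer and sentence splitter are crude regex stand-ins for NLTK's, "language diversity" is read as a type/token ratio, and `collections.Counter` stands in for `FreqDist` so the snippet runs without NLTK installed. `nltk.FreqDist` subclasses `Counter`, so the same counting calls should carry over, and its `freq()` method already returns a relative frequency.

```python
import random
import re
from collections import Counter


def clean(raw):
    # Replace literal backslash-n sequences, then collapse all whitespace
    # (including real newlines) into single spaces.
    text = raw.replace("\\n", " ")
    return re.sub(r"\s+", " ", text).strip()


def words(text):
    # Lowercased alphabetic tokens; a crude stand-in for a real tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def sentences(text):
    # Naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def percent_freq(tokens):
    # Each token's share of the total token count, in percent.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: 100.0 * n / total for tok, n in counts.items()}


def lexical_diversity(tokens):
    # Distinct tokens divided by total tokens (type/token ratio).
    return len(set(tokens)) / len(tokens)


def common_endings(tokens, size=3, top=5):
    # Most common final `size` letters among tokens longer than `size`.
    return Counter(t[-size:] for t in tokens if len(t) > size).most_common(top)


def mean_sentence_length(sents):
    # Average number of word tokens per sentence.
    return sum(len(words(s)) for s in sents) / len(sents)


def random_sentence(sents, seed=None):
    # Pick one sentence; pass a seed for a repeatable choice.
    return random.Random(seed).choice(sents)


# Toy input with one literal "\n" and one real newline, as scraped text might have.
raw = "Parsing is hard.\\nTagging is hard too. Is testing\nhard? Nothing is easy."
text = clean(raw)
toks = words(text)
sents = sentences(text)

print(percent_freq(toks)["is"])
print(lexical_diversity(toks))
print(common_endings(toks))
print(mean_sentence_length(sents))
print(random_sentence(sents))
```

The helpers are split roughly along the clean.py / analysis.py lines from the refactor item, so they could be moved into those files more or less as-is.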