parlamentare2016.bec.ro scraper

#### Initial setup:

Go to the repository folder
Create a virtual environment: $ virtualenv venv
Activate the virtual environment: $ source venv/bin/activate
Install the requirements: $ pip install -r requirements.txt

### Scrape it like you know it:

Run the main script: python bec_scraper.py
text_extractor.py writes the UTF-8 and ASCII texts to two separate folders
create_csv.py generates a partial CSV from the ASCII texts
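
As a rough illustration of the UTF-8 to ASCII step, here is a minimal sketch that strips diacritics with the standard library. The function name `to_ascii` is hypothetical, and this is only an assumption about one way such a conversion could be done; text_extractor.py may work differently.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of `text` (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # e.g. "ș" becomes "s" followed by a combining comma below.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks and any
    # other character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("București"))  # Bucuresti
```

Characters without a decomposition (for example "ł" or "€") are dropped entirely with this approach, so a lookup table may be needed if they appear in the source PDFs.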

TODO

Use java -jar pdfbox-app-2.0.3.jar ExtractText pdfs/some.pdf output.txt
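
The command above could be called from Python roughly as follows. This is a sketch, not the contents of pdfbox_wrapper.py: the function names are hypothetical, and it assumes `java` is on the PATH and that the jar sits in the working directory.

```python
import subprocess


def build_extract_command(pdf_path, txt_path, jar_path="pdfbox-app-2.0.3.jar"):
    """Build the PDFBox ExtractText command line as an argument list."""
    # Mirrors: java -jar pdfbox-app-2.0.3.jar ExtractText pdfs/some.pdf output.txt
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]


def extract_text(pdf_path, txt_path, jar_path="pdfbox-app-2.0.3.jar"):
    """Run PDFBox on one PDF; raises CalledProcessError if java exits non-zero."""
    subprocess.check_call(build_extract_command(pdf_path, txt_path, jar_path))


if __name__ == "__main__":
    extract_text("pdfs/some.pdf", "output.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.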
