Rosetta

Tools for data science with a focus on text processing.

Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
Integrates with the existing scientific Python stack as well as select outside tools.

Examples

See the examples/ directory.
The docs contain plots of example output.

Packages

`cmdutils`

Unix-like command-line utilities: filters that read from stdin and write to stdout.
Focus on stream processing and CSV files.
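As an illustration of the filter style described above, here is a minimal sketch of a stdin-to-stdout CSV filter written with the standard library only. The name `cut_columns` and the command line shown in the comment are hypothetical and are not taken from `cmdutils` itself.

```python
import csv
import sys


def cut_columns(in_stream, out_stream, keep):
    """Copy only the columns named in `keep` from in_stream to out_stream.

    Rows are handled one at a time, so memory use does not grow with
    the size of the input.
    """
    reader = csv.DictReader(in_stream)
    writer = csv.DictWriter(
        out_stream,
        fieldnames=keep,
        extrasaction="ignore",  # silently drop columns not listed in `keep`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python cut_columns.py name,city < in.csv > out.csv
    cut_columns(sys.stdin, sys.stdout, sys.argv[1].split(","))
```

Because the function only needs file-like objects, it can be chained with other filters in a shell pipeline or called directly from Python.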

`parallel`

Wrappers for Python multiprocessing that add ease of use
Memory-friendly multiprocessing
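One way to keep multiprocessing memory-friendly is to pull work from an iterator in small batches, so that only a bounded number of items is in flight at once. The sketch below shows that idea using the standard library. The names `chunked` and `lazy_parallel_map` are hypothetical, and this is not the implementation used in `parallel`.

```python
import itertools
import multiprocessing


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def _apply_to_chunk(args):
    """Worker helper: apply func to every item of one chunk."""
    func, chunk = args
    return [func(item) for item in chunk]


def lazy_parallel_map(func, iterable, n_jobs=1, chunksize=100):
    """Yield func(item) for each item of `iterable`, in order.

    With n_jobs == 1 everything runs serially in this process, which is
    convenient for debugging. Otherwise at most n_jobs chunks are read
    and processed at a time, so roughly n_jobs * chunksize items are
    held in memory. `func` must be picklable (e.g. a module-level function).
    """
    chunks = chunked(iterable, chunksize)
    if n_jobs == 1:
        for chunk in chunks:
            for item in chunk:
                yield func(item)
        return
    with multiprocessing.Pool(n_jobs) as pool:
        while True:
            batch = list(itertools.islice(chunks, n_jobs))
            if not batch:
                break
            jobs = [(func, chunk) for chunk in batch]
            for results in pool.map(_apply_to_chunk, jobs):
                yield from results
```

Processing one batch at a time trades some throughput for a hard bound on memory; a larger `chunksize` reduces inter-process overhead at the cost of holding more items at once.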

`text`

Stream text from disk to formats used in common ML processes
Write processed text to sparse formats
Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
Other general utilities
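To make the streaming idea concrete, the following sketch reads documents one file at a time, counts tokens, and formats each document as a single sparse `token:count` line. All names here (`tokenize`, `stream_token_counts`, `to_sparse_line`) and the line layout are assumptions chosen for illustration; they are not the `text` package's API, and the layout is not the exact input format of any particular tool.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase `text` and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def stream_token_counts(paths):
    """Yield (path, Counter of tokens), holding one file in memory at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            yield path, Counter(tokenize(handle.read()))


def to_sparse_line(doc_id, counts):
    """Format one document as 'doc_id token:count ...' with tokens sorted."""
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return f"{doc_id} {features}"
```

Since `stream_token_counts` is a generator, the lines can be written to an output file as they are produced, without ever building the full corpus in memory.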

`workflow`

High-level wrappers that have helped with our workflow and provide additional examples of code use

`modeling`

General ML modeling utilities

Install

Check out the master branch from the rosetta repo. Then (so long as you have pip):

cd rosetta
make
make test

If you update the source, you can do

make reinstall
make test

The above make targets use pip, so you can of course do pip uninstall at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

pip install rosetta-X.X.X.tar.gz

Development

Code

You can get the latest sources with

git clone git://github.com/columbia-applied-data-science/rosetta

Contributing

Feel free to contribute a bug report or a request by opening an issue.

The preferred method to contribute is to fork and send a pull request. Before doing this, read CONTRIBUTING.md.

Dependencies

Major dependencies on pandas and NumPy.
Minor dependencies on Gensim and statsmodels.
Some examples need scikit-learn.
Minor dependency on docx.
Minor dependencies on the Unix utilities pdftotext and catdoc.
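If you want to see which of these dependencies are present before installing, a small check along the following lines may help. This is only a sketch: the import names (`sklearn` for scikit-learn, `docx` for the docx package) and the grouping are assumptions mapped from the list above, and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Import names are assumptions derived from the dependency list above.
DEPENDENCIES = {
    "major": ["pandas", "numpy"],
    "minor": ["gensim", "statsmodels", "docx"],
    "examples": ["sklearn"],
}
UNIX_TOOLS = ["pdftotext", "catdoc"]


def missing_modules(groups=DEPENDENCIES):
    """Return {group: [module names that cannot be found]}."""
    return {
        group: [name for name in names if importlib.util.find_spec(name) is None]
        for group, names in groups.items()
    }


def missing_executables(names=UNIX_TOOLS):
    """Return the executables from `names` that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules())
    print("Missing Unix tools:", missing_executables())
```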

Testing

From the base repo directory, rosetta/, you can run all tests with

make test

Documentation

Documentation for releases is hosted at PyPI. This does NOT auto-update.

History

Rosetta refers to the Rosetta Stone, the ancient Egyptian tablet discovered just over 200 years ago. The tablet contained fragmented text in three different scripts, and the uncovering of its meaning is considered an essential key to our understanding of Ancient Egyptian civilization. We would like this project to provide individuals the necessary tools to process and unearth insight in the ever-growing volumes of textual data of today.
