Currently, only bag-of-words vectorization is implemented. It would be good to extend the code to support word n-grams as well.
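
As a starting point, here is a minimal sketch of what word n-gram vectorization could look like. It does not reflect the existing code: the function names `word_ngrams` and `vectorize`, the `n_min`/`n_max` parameters, and the whitespace tokenization are all assumptions made for illustration. Setting `n_min = n_max = 1` reduces it to plain bag-of-words counts, so the current behavior would remain a special case.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def word_ngrams(tokens: Sequence[str], n_min: int = 1, n_max: int = 2) -> List[NGram]:
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams: List[NGram] = []
    for n in range(n_min, n_max + 1):
        # A sequence of k tokens contains k - n + 1 n-grams (none if k < n).
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def vectorize(
    docs: Sequence[str], n_min: int = 1, n_max: int = 2
) -> Tuple[List[NGram], List[List[int]]]:
    """Build a count matrix over word n-grams (hypothetical helper).

    Tokenization here is a plain lowercase whitespace split, which is an
    assumption; swap in the project's own tokenizer if it has one.
    """
    counts: List[Counter] = [
        Counter(word_ngrams(doc.lower().split(), n_min, n_max)) for doc in docs
    ]
    # Sorted vocabulary gives a deterministic column order.
    vocab: List[NGram] = sorted(set().union(*counts))
    matrix = [[c.get(gram, 0) for gram in vocab] for c in counts]
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = vectorize(["the cat sat", "the cat ran"], n_min=1, n_max=2)
    for gram, column in zip(vocab, zip(*matrix)):
        print(" ".join(gram), list(column))
```

Representing each n-gram as a tuple of words keeps unigrams and longer n-grams in the same vocabulary, so existing code that expects one count column per feature should need few changes.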