@jmaroeder
Contributor

This adds ~30 languages and draws stopwords from stopwords-iso (https://github.com/stopwords-iso/stopwords-iso), an easily updatable source, rather than requiring separately maintained stopwords files. I switched every language that's in stopwords-iso over to it, so some results may differ, but overall performance should be similar.

This PR keeps the .txt loading functionality, since Macedonian (mk) wasn't in stopwords-iso. As a side effect, users can still add languages by placing .txt files in the resources/text directory.
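To illustrate the retained .txt path, here is a minimal sketch of loading a per-language stopwords file. It is not the code from this PR: the function name `load_stopwords_txt`, the `stopwords-<language>.txt` naming, and the one-word-per-line format are assumptions made for the example.

```python
from pathlib import Path


def load_stopwords_txt(language, resources_dir="resources/text"):
    """Return stopwords read from stopwords-<language>.txt (hypothetical layout).

    Assumes one stopword per line, UTF-8 encoded. Returns an empty set
    when no file exists for the requested language.
    """
    path = Path(resources_dir) / f"stopwords-{language}.txt"
    if not path.is_file():
        return set()
    with path.open(encoding="utf-8") as handle:
        # Skip blank lines and strip surrounding whitespace.
        return {line.strip() for line in handle if line.strip()}
```

Under these assumptions, dropping a `stopwords-mk.txt` file into the directory would be enough for `load_stopwords_txt("mk")` to pick it up.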

@bact
Contributor

bact commented Jan 22, 2019

FYI, in case you want to offload some of the stopwords loading code,
stopwords-iso is now available as a pip package (a fork of the original project).

You can pip install stopwordsiso and then call stopwordsiso.stopwords("en") to get a set of English stopwords.
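A rough sketch of how that lookup might be wrapped, assuming the stopwordsiso package is installed. The helper name `get_stopwords` is hypothetical, and the import is guarded so the snippet still runs (returning an empty set) when the package is absent.

```python
# Assumption: `pip install stopwordsiso` has been run. If not, the helper
# below falls back to returning an empty set instead of raising.
try:
    import stopwordsiso
except ImportError:
    stopwordsiso = None


def get_stopwords(language):
    """Return a set of stopwords for `language`, or an empty set if unavailable."""
    if stopwordsiso is None or not stopwordsiso.has_lang(language):
        return set()
    # stopwordsiso.stopwords() accepts a language code such as "en".
    return set(stopwordsiso.stopwords(language))


if __name__ == "__main__":
    english = get_stopwords("en")
    print(f"Loaded {len(english)} English stopwords")
```

Checking `has_lang` first keeps the behavior explicit for codes the package does not cover, which is where a local .txt file could take over.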

https://pypi.org/project/stopwordsiso/

This way, we may not have to maintain some of the stopwords-xx.txt files inside newspaper and could rely on the stopwords-iso JSON file instead. (I'm not sure whether that's a good idea in terms of control over newspaper's behavior.)
