Included 'span' and 'a' #635
base: master
Author names are sometimes included inside 'span' and 'a' elements, so I modified the code to handle them. 'div' can also be included if necessary.
Often the author name is in the text of a 'span' or 'a' element, so the author name is now searched for there too.
Great find @agnelvishal, looking!! 💯👍
codelucas left a comment
See comments. This is a great change, but I have some suggestions.
newspaper/extractors.py
Outdated
    if match.tag == 'meta' or match.tag == 'span' or match.tag == 'a':
        mm = match.xpath('@content')
        if not mm:
            mm=str(match.text_content()).split()
I'm confident that this will increase recall, but will precision take a hit?
I am going to approve this, but can you also add some URLs where you found these changes helpful?
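To make the recall/precision question concrete, here is a minimal sketch of the fallback being reviewed. It is not the code in extractors.py: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, `"".join(el.itertext())` plays the role of `text_content()`, and the snippet, the `candidate_text` helper, and the class names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; a real page would be parsed with lxml.html.
SNIPPET = """
<div>
  <meta name="author" content="Jane Doe" />
  <span class="author">By <a href="/staff/jdoe">Jane Doe</a></span>
  <span class="author">Sign up for our newsletter and follow the author</span>
</div>
""".strip()


def candidate_text(el):
    """Prefer the `content` attribute; otherwise fall back to the element's text."""
    content = el.get("content")
    if content:
        return content
    # Joining itertext() approximates lxml's text_content(); whitespace is collapsed.
    return " ".join("".join(el.itertext()).split())


root = ET.fromstring(SNIPPET)
candidates = [candidate_text(el) for el in root if el.tag in ("meta", "span", "a")]
print(candidates)
```

The first 'span' yields a byline that 'meta' alone would miss on pages without the tag (better recall), while the second 'span' yields text that is not an author name at all (worse precision).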
newspaper/extractors.py
Outdated
    if match.tag == 'meta' or match.tag == 'span' or match.tag == 'a':
        mm = match.xpath('@content')
        if not mm:
            mm=str(match.text_content()).split()
nit: Python style, please add spaces around the = on line 154
newspaper/newspaper/extractors.py Line 159 in b75b9c7
    if len(content) > 0 and len(content) < 30:
to get better precision.
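A small sketch of how a length bound like this filters candidates, assuming `content` is a plain string. The `plausible_byline` helper and the sample strings are hypothetical; only the `0 < len(content) < 30` condition comes from the line quoted above.

```python
def plausible_byline(content, max_len=30):
    """Keep only non-empty strings shorter than max_len characters."""
    return 0 < len(content) < max_len


# Hypothetical candidates, e.g. text pulled from 'meta', 'span' and 'a' elements.
samples = [
    "Jane Doe",
    "",
    "Sign up for our newsletter and follow the author",
]
kept = [s for s in samples if plausible_byline(s)]
print(kept)
```

The bound is a heuristic: it drops empty matches and long non-name text, but it would also drop an unusually long multi-author byline.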