-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add JSON-LD support by using extruct #385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
9c9e732 to
31415b2
Compare
requirements.txt
Outdated
| feedfinder2>=0.0.4 | ||
| jieba3k>=0.35.1 | ||
| python-dateutil>=2.5.3 | ||
| extruct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't just list the package name, list >=SPECIFIC_VERSION to finely define what we want. i.e. find out the latest version of extruct that you're using right now and set this to be >= that version
newspaper/extractors.py
Outdated
| from tldextract import tldextract | ||
| from urllib.parse import urljoin, urlparse, urlunparse | ||
|
|
||
| from extruct.jsonld import JsonLdExtractor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get away with this by just using the libraries newspaper already imports (e.g. lxml, BeautifulSoup) I'm not against adding extruct but it would be better to keep this library with fewer dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like dependencies too. But I found the implementation quite difficult.
newspaper/extractors.py
Outdated
| 'datePublished' | ||
| ] | ||
| jsonlde = JsonLdExtractor() | ||
| jsonlddata = jsonlde.extract_items(doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call expensive? I worry since we are running this over the entire html doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect yes.
But I think the json-ld is widely used nowadays. If we decide to integrate json-ld into newspaper, we will probably need to restructure the code and move some priority of these parsers up.
codelucas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow @torbenbrodt this is great work - thanks for doing this.
Please see my inline comments + add some preliminary tests for this functionality. (If you aren't familiar with how to run tests online feel free to use our travis CI build system)
|
Thx for the fast feedback. What do you think? Tests are green? Let's include extruct? |
|
Is this enhancement still being considered? I am finding many major publishers using ld+json for publish dates and much more, which comes up missing with current implementation. Thank you. |
|
@mccarran @torbenbrodt This is still being considered, but I'm considering various implementations, will accept or make comments later |
|
Hey @codelucas , any update on this? |
|
@torbenbrodt Could you have a look at #655 |
|
Would be great if this could be completed. Many major media sites such as the bbc use json_ld and it would enable newspaper to extract publish_date. |
Kerl1310
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
You will realize that publish_date for a some major German brands is empty
The publisher is using JSON-LD, so I wondered how newspaper could support this. Actually there is a library called https://github.com/scrapinghub/extruct which has similar dependencies but comes with a jsonld extractor.
The MR is a proof of concept. Tests are missing, etc.
Happy to get your thoughts.