Skip to content

Conversation

@torbenbrodt
Copy link
Contributor

You will realize that publish_date for a some major German brands is empty

url = "www.spiegel.de/politik/deutschland/terror-und-innere-sicherheit-die-neue-deutsche-haerte-a-1152113.html"
article = Article(url)
article.download()
article.parse()

article.title
>>> Steinmeiers Bundespräsidialamt: Personalrat tritt geschlossen zurück

article.publish_date
>>> None

The publisher is using JSON-LD, so I wondered how newspaper could support this. Actually there is a library called https://github.com/scrapinghub/extruct which has similar dependencies but comes with a jsonld extractor.

The MR is a proof of concept. Tests are missing, etc.
Happy to get your thoughts.

requirements.txt Outdated
feedfinder2>=0.0.4
jieba3k>=0.35.1
python-dateutil>=2.5.3
extruct
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't just list the package name, list >=SPECIFIC_VERSION to finely define what we want. i.e. find out the latest version of extruct that you're using right now and set this to be >= that version

from tldextract import tldextract
from urllib.parse import urljoin, urlparse, urlunparse

from extruct.jsonld import JsonLdExtractor
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get away with this by just using the libraries newspaper already imports (e.g. lxml, BeautifulSoup) I'm not against adding extruct but it would be better to keep this library with fewer dependencies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like dependencies too. But I found the implementation quite difficult.

'datePublished'
]
jsonlde = JsonLdExtractor()
jsonlddata = jsonlde.extract_items(doc)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this call expensive? I worry since we are running this over the entire html doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect yes.

But I think the json-ld is widely used nowadays. If we decide to integrate json-ld into newspaper, we will probably need to restructure the code and move some priority of these parsers up.

Copy link
Owner

@codelucas codelucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow @torbenbrodt this is great work - thanks for doing this.

Please see my inline comments + add some preliminary tests for this functionality. (If you aren't familiar with how to run tests online feel free to use our travis CI build system)

@codelucas codelucas requested a review from yprez June 15, 2017 23:00
@torbenbrodt
Copy link
Contributor Author

Thx for the fast feedback. What do you think? Tests are green? Let's include extruct?

@mccarran
Copy link

mccarran commented Oct 2, 2017

Is this enhancement still being considered? I am finding many major publishers using ld+json for publish dates and much more, which comes up missing with current implementation. Thank you.

@codelucas
Copy link
Owner

@mccarran @torbenbrodt This is still being considered, but I'm considering various implementations, will accept or make comments later

@torbenbrodt
Copy link
Contributor Author

Hey @codelucas , any update on this?
Actually extruct did make good progress recently and now includes further formats so I really suggest including it

@agnelvishal agnelvishal mentioned this pull request Nov 21, 2018
@agnelvishal
Copy link
Contributor

@torbenbrodt Could you have a look at #655

@simonm3
Copy link

simonm3 commented Apr 9, 2019

Would be great if this could be completed. Many major media sites such as the bbc use json_ld and it would enable newspaper to extract publish_date.

Copy link
Contributor

@Kerl1310 Kerl1310 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants