Add JSON-LD support by using extruct #385

torbenbrodt · 2017-06-15T15:31:18Z

You will notice that publish_date is empty for some major German brands:

from newspaper import Article

url = "http://www.spiegel.de/politik/deutschland/terror-und-innere-sicherheit-die-neue-deutsche-haerte-a-1152113.html"
article = Article(url)
article.download()
article.parse()

article.title
>>> Steinmeiers Bundespräsidialamt: Personalrat tritt geschlossen zurück

article.publish_date
>>> None

The publisher uses JSON-LD, so I wondered how newspaper could support it. There is a library, https://github.com/scrapinghub/extruct, which has similar dependencies and comes with a JSON-LD extractor.
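
For context, JSON-LD metadata is a JSON object embedded in the page, and the publication date usually sits under the schema.org `datePublished` key. Below is a minimal sketch of reading it with the standard library only. The sample payload and the helper name `find_date_published` are made up for illustration and are not part of newspaper or extruct.

```python
import json
from datetime import datetime

# Hypothetical JSON-LD payload, shaped like schema.org NewsArticle markup.
SAMPLE_JSONLD = """
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "Example headline",
  "datePublished": "2017-06-15T15:31:18+02:00"
}
"""


def find_date_published(node):
    """Return the first datePublished value found in a decoded JSON-LD tree."""
    if isinstance(node, dict):
        if "datePublished" in node:
            return node["datePublished"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    # Recurse into nested objects and arrays (e.g. an "@graph" list).
    for child in children:
        found = find_date_published(child)
        if found is not None:
            return found
    return None


raw_date = find_date_published(json.loads(SAMPLE_JSONLD))
# fromisoformat accepts ISO 8601 strings with a numeric UTC offset (Python 3.7+).
publish_date = datetime.fromisoformat(raw_date) if raw_date else None
print(publish_date)
```

The recursive search is a deliberate simplification: it takes the first `datePublished` it meets, regardless of which object type carries it.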

This MR is a proof of concept; tests, among other things, are still missing.
Happy to get your thoughts.

codelucas · 2017-06-15T22:58:19Z

requirements.txt

 feedfinder2>=0.0.4
 jieba3k>=0.35.1
 python-dateutil>=2.5.3
+extruct


Don't just list the package name; list >=SPECIFIC_VERSION to define precisely what we want. That is, find out which version of extruct you're using right now and set the requirement to >= that version.
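
One way to find the version to pin is to ask the local environment what is installed. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper name `pinned_requirement` is hypothetical.

```python
from importlib import metadata


def pinned_requirement(package, installed_version=None):
    """Build a 'name>=version' requirements.txt line for a package.

    If installed_version is not given, look it up in the current
    environment; fall back to the bare name when the package is missing.
    """
    if installed_version is None:
        try:
            installed_version = metadata.version(package)
        except metadata.PackageNotFoundError:
            return package
    return f"{package}>={installed_version}"


# Prints "extruct>=<installed version>", or just "extruct" if it is not installed.
print(pinned_requirement("extruct"))
```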

codelucas · 2017-06-15T22:59:18Z

newspaper/extractors.py

 from tldextract import tldextract
 from urllib.parse import urljoin, urlparse, urlunparse

+from extruct.jsonld import JsonLdExtractor


Can we get away with just using the libraries newspaper already imports (e.g. lxml, BeautifulSoup)? I'm not against adding extruct, but it would be better to keep this library's dependencies to a minimum.

torbenbrodt · 2017-06-19

I don't like dependencies either, but I found the implementation quite difficult.
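
To give a sense of what a dependency-free version might involve, here is a rough sketch that collects `<script type="application/ld+json">` blocks with Python's built-in `html.parser` (standing in for lxml, which newspaper already uses) and decodes them with `json`. The class name `JsonLdScriptCollector` and the sample HTML are invented for illustration. Real pages can add complications, such as malformed JSON or several blocks per page, that this sketch handles only partly.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect decoded JSON-LD payloads from script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except ValueError:
                pass  # skip blocks that are not valid JSON


# Hypothetical page with one JSON-LD block and one unrelated script.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "datePublished": "2017-06-15T15:31:18+02:00"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body><p>Article text</p></body></html>
"""

collector = JsonLdScriptCollector()
collector.feed(SAMPLE_HTML)
collector.close()
dates = [item.get("datePublished") for item in collector.items if isinstance(item, dict)]
print(dates)
```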

codelucas · 2017-06-15T22:59:53Z

newspaper/extractors.py

+            'datePublished'
+        ]
+        jsonlde = JsonLdExtractor()
+        jsonlddata = jsonlde.extract_items(doc)


Is this call expensive? I worry because we are running it over the entire HTML doc.

torbenbrodt · 2017-06-19

I expect so, yes.

But I think JSON-LD is widely used nowadays. If we decide to integrate JSON-LD into newspaper, we will probably need to restructure the code and give some of these parsers higher priority.
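
The restructuring mentioned here could look roughly like an ordered list of date strategies, with JSON-LD tried before the older heuristics. This is only a sketch: the strategy functions, the name `pick_publish_date`, and the `page` dictionary are hypothetical and do not reflect newspaper's actual extractor code.

```python
from datetime import datetime


def date_from_jsonld(page):
    """Hypothetical strategy: read datePublished from pre-parsed JSON-LD items."""
    for item in page.get("jsonld", []):
        value = item.get("datePublished")
        if value:
            return datetime.fromisoformat(value)
    return None


def date_from_meta_tags(page):
    """Hypothetical strategy: read a date from pre-parsed meta tags."""
    value = page.get("meta", {}).get("article:published_time")
    return datetime.fromisoformat(value) if value else None


# Strategies are tried in order; moving an entry up raises its priority.
DATE_STRATEGIES = [date_from_jsonld, date_from_meta_tags]


def pick_publish_date(page):
    """Return the first date any strategy produces, or None."""
    for strategy in DATE_STRATEGIES:
        try:
            result = strategy(page)
        except ValueError:
            result = None  # unparseable date string: try the next strategy
        if result is not None:
            return result
    return None


# Made-up input in which both sources carry a date; JSON-LD wins.
page = {
    "jsonld": [{"@type": "NewsArticle", "datePublished": "2017-06-15T15:31:18+02:00"}],
    "meta": {"article:published_time": "2017-06-14T08:00:00+02:00"},
}
print(pick_publish_date(page))
```

Keeping the order in a single list means a change of priority is a one-line edit rather than a rewrite of the getter.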

codelucas

Wow @torbenbrodt this is great work - thanks for doing this.

Please see my inline comments and add some preliminary tests for this functionality. (If you aren't familiar with how to run tests online, feel free to use our Travis CI build system.)

torbenbrodt · 2017-06-19T17:32:41Z

Thanks for the fast feedback. What do you think? The tests are green, so shall we include extruct?

mccarran · 2017-10-02T14:30:12Z

Is this enhancement still being considered? I am finding that many major publishers use ld+json for publish dates and much more, all of which is missed by the current implementation. Thank you.

codelucas · 2017-10-04T07:47:00Z

@mccarran @torbenbrodt This is still being considered, but I'm weighing various implementations. I will accept it or leave comments later.

torbenbrodt · 2018-05-18T13:13:39Z

Hey @codelucas, any update on this?
extruct has made good progress recently and now includes further formats, so I really suggest including it.

agnelvishal · 2018-11-21T19:02:16Z

@torbenbrodt Could you have a look at #655?

simonm3 · 2019-04-09T09:29:41Z

It would be great if this could be completed. Many major media sites, such as the BBC, use JSON-LD, and supporting it would enable newspaper to extract publish_date.

Kerl1310

LGTM

Add JSON-LD support by using extruct

31415b2

torbenbrodt force-pushed the jsonld-extruct branch from 9c9e732 to 31415b2 · June 15, 2017 15:36

codelucas reviewed Jun 15, 2017

codelucas added the enhancement label Jun 15, 2017

codelucas requested a review from yprez June 15, 2017 23:00

Add version requirement

d59bf6b

Move extraction of jsonld from getter to parse() method

793c80d

This was referenced Jul 13, 2018

Use <meta name="parsely-pub-date"> for article date #591

Open

get_publish_date() Strategy 3 not implemented #521

Open

Merge branch 'master' into jsonld-extruct

ecf7d20

agnelvishal mentioned this pull request Nov 21, 2018

Patch 1 #655

Open

Kerl1310 approved these changes Oct 31, 2019

AndyTheFactory mentioned this pull request Oct 24, 2023

Add JSON-LD support by using extruct AndyTheFactory/newspaper4k#99

Open

6 participants