implementing 3rd method of publish date extraction (issue # 521) #549

zachorban · 2018-04-11T16:33:54Z

I tested the original version and this modified version on 120 news articles across multiple sites. The modified version successfully parsed the publish date for 119 of them, compared to 84 for the original version. The average runtime increase is marginal relative to the gain in accuracy.

Final statistics from testing:

urls that failed both versions:

https://www.bbcgoodfood.com/howto/guide/top-10-retro-british-desserts

success rates of original/modified versions:

original version successes: 84 / 120
updated version successes: 119 / 120
accuracy improvement percentage: 0.41666666666666674

average absolute/relative runtime differences:

dataset:  | absolute difference (s): | relative difference:
original: | 0.004801957380203973     | 0.008619953698073361
new:      | 0.01785543305533273      | 0.1452353506990094
fails:    | 0.0388028621673584       | 0.19184348967941411
total:    | 0.00889256199200948      | 0.04999297395652421

The one URL that failed in both versions was linked from a link-aggregation news site. Inspecting the page source turned up no evidence of a publish date.

Statistic explanation:

Datasets:
- original: urls that worked in the original version
- new: urls that failed the original, but passed in the new version
- fails: urls that failed both versions
- total: all urls used
Absolute Difference:
- runtime differences from executing article.parse()
- calculated using Python's time.time() function
- difference = new runtime - original runtime
Relative Difference:
- difference = absolute difference / original runtime
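
The measurement described above could be sketched roughly as follows. This is not the script used for the numbers in this PR; `timed`, `runtime_differences`, `mean`, and the two stand-in parse functions are hypothetical names for illustration, and the sleeps only simulate work in place of real `article.parse()` calls.

```python
import time


def timed(fn):
    """Wall-clock seconds for a single call to fn(), measured with time.time()."""
    start = time.time()
    fn()
    return time.time() - start


def runtime_differences(original_runtime, new_runtime):
    """Return (absolute, relative) runtime difference as defined above."""
    absolute = new_runtime - original_runtime
    # Assumes original_runtime > 0; a zero runtime would need special handling.
    relative = absolute / original_runtime
    return absolute, relative


def mean(values):
    """Arithmetic mean, used to average the differences over a dataset."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical stand-ins for the original and modified article.parse().
    def original_parse():
        time.sleep(0.02)

    def modified_parse():
        time.sleep(0.03)

    results = [
        runtime_differences(timed(original_parse), timed(modified_parse))
        for _ in range(3)
    ]
    print("average absolute difference:", mean(a for a, _ in results))
    print("average relative difference:", mean(r for _, r in results))
```

For short intervals, `time.perf_counter()` is generally a better clock than `time.time()`; the sketch keeps `time.time()` only to match the description above.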

Summary:
The modified version found publish dates for ~42% more articles than the original version (119 vs. 84 of 120) and, on average, increased the runtime of article.parse() by only ~5%.
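
For reference, the ~42% figure is a relative improvement over the original success count, not a difference in success rates. A minimal sketch of the arithmetic, using only the counts reported above (the function names are illustrative, not part of newspaper):

```python
def relative_improvement(original_successes, updated_successes):
    """Extra successes as a fraction of the original success count."""
    return (updated_successes - original_successes) / original_successes


def success_rate(successes, total):
    """Fraction of articles whose publish date was found."""
    return successes / total


if __name__ == "__main__":
    total = 120
    original, updated = 84, 119

    # Relative improvement: 35 extra successes on top of the original 84.
    print(round(relative_improvement(original, updated), 4))

    # Difference in success rate across all 120 articles, for comparison.
    print(round(success_rate(updated, total) - success_rate(original, total), 4))
```

The first value corresponds to the "accuracy improvement percentage" line in the statistics (expressed as a fraction); the second is the gain in percentage points of overall success rate, which is smaller because it is measured against all 120 articles.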

implementing 3rd method of publish date extraction (issue # 521)

e2289bd

AndyTheFactory mentioned this pull request Oct 24, 2023

implementing 3rd method of publish date extraction (issue # 521) AndyTheFactory/newspaper4k#195

Open
