implementing 3rd method of publish date extraction (issue # 521) #549
I tested the original version and this modified version on 120 news articles from multiple sites. The modified version parsed the publish date for 119 of them, compared to 84 for the original version. The average runtime increase is marginal relative to the gain in accuracy.
Final statistics from testing:
urls that failed both versions:
https://www.bbcgoodfood.com/howto/guide/top-10-retro-british-desserts
success rates of original/modified versions:
original version successes: 84 / 120
updated version successes: 119 / 120
accuracy improvement relative to the original (as a fraction, not a percentage): 0.41666666666666674
average absolute/relative runtime differences (absolute in seconds, relative as a fraction of the original runtime):
dataset  | absolute difference  | relative difference
original | 0.004801957380203973 | 0.008619953698073361
new      | 0.01785543305533273  | 0.1452353506990094
fails    | 0.0388028621673584   | 0.19184348967941411
total    | 0.00889256199200948  | 0.04999297395652421
The one URL that failed in both versions was linked from a link-aggregation news site. Inspecting its page source showed no evidence of a publish date.
Explanation of statistics:
Datasets:
- original: urls that worked in the original version
- new: urls that failed the original, but passed in the new version
- fails: urls that failed both versions
- total: all urls used
Absolute Difference:
- runtime differences from executing article.parse()
- calculated using Python's time.time() function
- difference = new runtime - original runtime
Relative Difference:
- difference = absolute difference / original runtime
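The definitions above can be sketched roughly as follows. This is not the script used for the measurements; the helper names `timed`, `runtime_differences`, and `classify` are hypothetical, and the stand-in callable would be replaced by a call to article.parse() for each version.

```python
import time


def timed(fn):
    """Return the wall-clock seconds taken by one call to fn(), using time.time()."""
    start = time.time()
    fn()
    return time.time() - start


def runtime_differences(original_runtime, new_runtime):
    """Return (absolute, relative) differences as defined in the explanation above."""
    absolute = new_runtime - original_runtime   # new runtime - original runtime
    relative = absolute / original_runtime      # absolute difference / original runtime
    return absolute, relative


def classify(original_ok, new_ok):
    """Assign a URL to the 'original', 'new', or 'fails' dataset."""
    if original_ok:
        return "original"                       # worked in the original version
    return "new" if new_ok else "fails"         # fixed by the new version, or failed both


# Hypothetical usage: a no-op stands in for article.parse().
elapsed = timed(lambda: None)
absolute, relative = runtime_differences(2.0, 2.5)
```

Every URL also counts toward the "total" dataset, so the per-dataset averages are taken over the URLs in each group and the "total" row over all 120.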
Summary:
The modified version found publish dates for ~42% more articles than the original (119 vs. 84 out of 120) and, on average, increased the runtime of article.parse() by only ~5%.
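As a quick check of the arithmetic, the ~42% figure can be recomputed from the success counts reported above. The variable names are illustrative only, and this is a sketch rather than the code that produced the statistics.

```python
# Success counts reported in the final statistics.
original_successes = 84
updated_successes = 119
total_urls = 120

# Success rate of each version over all tested URLs.
original_rate = original_successes / total_urls
updated_rate = updated_successes / total_urls

# Improvement relative to the original version's success count.
relative_improvement = (updated_successes - original_successes) / original_successes

print(f"original: {original_rate:.1%}, updated: {updated_rate:.1%}")
print(f"relative improvement: {relative_improvement:.1%}")
```

Note that the improvement is relative to the original's 84 successes; measured in absolute success rate, the change is from 84/120 to 119/120.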