@zachorban

I tested the original version and this modified version on 120 news articles from multiple sites. The modified version parsed the publish date for 119 of them, compared with 84 for the original version. The average runtime increase is marginal relative to the accuracy gain.

Final statistics from testing:


urls that failed both versions:

https://www.bbcgoodfood.com/howto/guide/top-10-retro-british-desserts

success rates of original/modified versions:

original version successes: 84 / 120
updated version successes: 119 / 120
accuracy improvement percentage: 0.41666666666666674

average absolute/relative runtime differences:

dataset:  | absolute difference:  | relative difference:
original: | 0.004801957380203973  | 0.008619953698073361
new:      | 0.01785543305533273   | 0.1452353506990094
fails:    | 0.0388028621673584    | 0.19184348967941411
total:    | 0.00889256199200948   | 0.04999297395652421


The one URL that failed in both versions was linked from a link-aggregation news site. Inspecting its page source turned up no evidence of a publish date.

Statistic explanation:

Datasets:
- original: URLs that the original version parsed successfully
- new: URLs that failed in the original version but passed in the new version
- fails: URLs that failed in both versions
- total: all URLs used
Absolute Difference:
- difference in the runtime of article.parse() between the two versions
- measured with Python's time.time() function
- difference = new runtime - original runtime
Relative Difference:
- difference = absolute difference / original runtime
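
The two definitions above can be sketched in Python as follows. This is a minimal illustration, not the script used for the test: `time_call`, `runtime_differences`, and `mean` are hypothetical helper names, and the commented usage assumes one article object per version, which is not shown here. Averaging the per-URL values within each dataset is also an assumption about how the averages were produced.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return the wall-clock seconds one call to fn takes, using time.time()."""
    start = time.time()
    fn(*args, **kwargs)
    return time.time() - start


def runtime_differences(original_runtime, new_runtime):
    """Return (absolute, relative) runtime differences as defined above."""
    absolute = new_runtime - original_runtime
    relative = absolute / original_runtime
    return absolute, relative


def mean(values):
    """Average a non-empty list of per-URL differences for one dataset."""
    return sum(values) / len(values)


# Hypothetical usage, one pair of timings per URL:
#   original_runtime = time_call(original_article.parse)
#   new_runtime = time_call(modified_article.parse)
#   absolute, relative = runtime_differences(original_runtime, new_runtime)
```

A positive absolute difference means the modified version was slower for that URL. The relative difference expresses the same slowdown as a fraction of the original runtime.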

Summary:
The modified version found publish dates for ~42% more articles and, on average, increased the runtime of article.parse() by only ~5%.
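
For reference, the ~42% figure is a relative improvement over the original success count, not a change in percentage points. The short check below reproduces it from the counts reported above. The function and variable names are illustrative only.

```python
original_successes = 84
updated_successes = 119
total_urls = 120


def relative_improvement(before, after):
    """Fractional increase in successes relative to the original count."""
    return (after - before) / before


improvement = relative_improvement(original_successes, updated_successes)
original_rate = original_successes / total_urls  # share of URLs parsed before
updated_rate = updated_successes / total_urls    # share of URLs parsed after

print(f"relative improvement: {improvement:.1%}")
print(f"success rate: {original_rate:.1%} -> {updated_rate:.1%}")
```

In terms of success rate over all 120 URLs, this corresponds to a rise from 70% to roughly 99%.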
