Research:Newbie reverts and article length
This sprint examines the relationship between page length (as a proxy for the completeness/quality of an article) and the probability that a new editor's edit is reverted.
- The sample that I'll gather will be divided by the year of an editor's first edit (2001-2010)
- I'll be removing probable vandals from the dataset using the V_LOOSE+V_STRICT function (see Priedhorsky et al. from GROUP'07).
- I intend to plot the results and examine the effects using a logistic regression (dependent variable is True if an editor's sampled edit was reverted and false otherwise).
If page_length is a positive predictor of the probability of being reverted, that should support my hypothesis that new editors' work is more likely to be rejected when they edit more complete articles. If year is a positive predictor, that will refute the gold rush hypothesis (or at least suggest that the gold rush alone is not sufficient to explain the increased proportion of newbie reverts).
Methods
Sample of editors
To gather a sample of editors, I ran a query against a database constructed from the Jan 30th, 2010 dump that I had access to in the GroupLens research lab. First, I grouped editors by the year in which they made their first edit to the English Wikipedia. Then I randomly selected up to 20,000 editors from each first-edit-year group.
SELECT * FROM enwp_dump_20100130.editor
WHERE EXTRACT(year from TO_TIMESTAMP(first_edit)) = %(year)s
AND user_id IS NOT NULL
ORDER BY RANDOM()
LIMIT 20000;
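The grouping and sampling step can also be sketched in plain Python, which may help when checking the query's logic on a small test set. This is a minimal sketch, not the code used for the study: the function name `sample_editors_by_year`, the dictionary layout of each editor, and the use of UTC to assign a year are all assumptions. It also assumes `first_edit` is a Unix timestamp in seconds, as the `TO_TIMESTAMP` call in the query suggests.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone


def sample_editors_by_year(editors, per_year=20000, seed=None):
    """Group editors by first-edit year and sample up to `per_year` from each group.

    `editors` is a hypothetical list of dicts with 'user_id' and 'first_edit'
    (assumed to be a Unix timestamp in seconds).
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for editor in editors:
        # Mirrors "user_id IS NOT NULL" in the query.
        if editor["user_id"] is None:
            continue
        # Assumption: years are assigned in UTC.
        year = datetime.fromtimestamp(editor["first_edit"], tz=timezone.utc).year
        by_year[year].append(editor)
    # Sampling without replacement, capped at the size of each group.
    return {
        year: rng.sample(group, min(per_year, len(group)))
        for year, group in by_year.items()
    }
```

Sampling without replacement with a cap plays the same role as `ORDER BY RANDOM() LIMIT 20000`: groups smaller than the cap are returned whole.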
For each of these randomly selected editors, I examined the first 10 edits that they made to main namespace articles.
SELECT
r.revision_id,
r.text_length,
rvtd.revision_id IS NOT NULL as reverted,
rvtd.is_vandalism
FROM enwp_dump_20100130.revision_by_user r
LEFT JOIN enwp_dump_20100130.reverted rvtd USING (revision_id)
WHERE r.username = %(username)s
AND r.namespace = 0
ORDER BY r.timestamp
LIMIT 10
Sample of editors' work as newbies
From this list of each editor's first 10 edits, I computed the proportion of edits that were marked as vandalism (see above) and randomly selected one edit to represent that editor's newbie experience. The selected edit records the length of the article at the time of the edit, whether the edit was reverted, and whether that revert was for vandalism.
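import random

# Shuffle so that revs[0] is a uniformly random pick from the editor's first 10 edits.
random.shuffle(revs)
row = {
    'user_id': editor['user_id'],
    'edits': len(revs),
    # Count of sampled edits flagged as vandalism (None from the LEFT JOIN counts as not flagged).
    'v_edits': sum(1 for r in revs if r['is_vandalism']),
    'revision_id': revs[0]['revision_id'],
    'text_length': revs[0]['text_length'],
    'reverted': revs[0]['reverted'],
    'vandalism': revs[0]['is_vandalism']
}
writer.write(row)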
This data was used for the plot and regression below. Note that editors with more than 20% of their first 10 edits marked as vandalism were discarded in an attempt to weed out damaging editors from the analysis.
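The 20% cutoff can be sketched as a filter over the rows written above. This is an illustrative sketch only: the function name `keep_probable_good_faith` is hypothetical, and skipping editors with no sampled edits is an assumption (the source does not say how they were handled). The `<= 0.2` comparison matches the `vandal_prop <= 0.2` subset used in the regression call.

```python
def keep_probable_good_faith(rows, max_vandal_prop=0.2):
    """Keep rows whose share of vandalism-flagged edits is at most `max_vandal_prop`.

    Each row is expected to carry 'edits' and 'v_edits', as in the row
    dictionary built above.
    """
    kept = []
    for row in rows:
        # Assumption: editors with no sampled main-namespace edits are skipped.
        if row["edits"] == 0:
            continue
        vandal_prop = row["v_edits"] / row["edits"]
        if vandal_prop <= max_vandal_prop:
            kept.append(row)
    return kept
```

Under this rule, an editor with 2 of 10 edits flagged is kept, while an editor with 3 of 10 flagged is discarded.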
Results
Newbie edits by page length and year
The plot on the right was generated by randomly sampling one of the first 10 edits to articles (main namespace) by editors who started editing in a given year. It should thus be representative of the types of edits that newbies perform shortly after registering an account. The distribution of page lengths (number of characters) grows roughly exponentially from year to year (although it appears to be linear due to the log scaling of the x axis). This suggests that the average size of the articles that newbies edit in their first 10 edits has quadrupled since 2003.
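To illustrate why exponential growth looks linear on a log axis, and what a quadrupling implies per year, here is a small sketch. The numbers are assumptions chosen for illustration, not values read from the plot: a six-year span (2003 to 2009) and a hypothetical starting length of 1,000 characters.

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years)


def log10_steps(start, factor, years):
    """Year-to-year differences in log10(length) under constant multiplicative growth."""
    lengths = [start * factor ** t for t in range(years + 1)]
    logs = [math.log10(x) for x in lengths]
    return [b - a for a, b in zip(logs, logs[1:])]


# Assumed span of 6 years and a hypothetical 1,000-character starting length.
factor = annual_growth_factor(4, 6)
steps = log10_steps(1000.0, factor, 6)
```

Under these assumptions the yearly multiplier is the cube root of 2 (about 1.26), and every entry in `steps` is the same, which is the equal spacing that shows up as a straight line on a log-scaled axis.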
Regression
Call:
glm(formula = reverted ~ sc(text_length) * sc(first_edit_year),
    family = binomial(link = "logit"),
    data = newbie_initial_edit[newbie_initial_edit$vandal_prop <= 0.2, ])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.6194  -0.4146  -0.3430  -0.2756   2.8578

Coefficients:
                                     Estimate Std. Error  z value Pr(>|z|)
(Intercept)                          -2.72615    0.01207 -225.801  < 2e-16 ***
sc(text_length)                       0.23049    0.01076   21.429  < 2e-16 ***
sc(first_edit_year)                   0.40717    0.01188   34.272  < 2e-16 ***
sc(text_length):sc(first_edit_year)  -0.02878    0.01026   -2.805  0.00504 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64886  on 131225  degrees of freedom
Residual deviance: 62782  on 131222  degrees of freedom
AIC: 62790

Number of Fisher Scoring iterations: 6
This regression shows that both the size of the article being edited (text_length) and the year in which an editor started editing (first_edit_year) are significant, positive predictors of the probability of being reverted (i.e., the longer the article, the more likely a newbie's edit will be reverted, and the more recently they started editing, the more likely a newbie's edit will be reverted).
However, the interaction of these two variables (text_length:first_edit_year) is a significant, negative predictor of the probability of being reverted, meaning that the larger either variable becomes, the smaller the effect of the other. This could mean that article length penalizes recent newbies less than it penalized earlier ones, even though recent newbies are reverted more often overall. This could also be interpreted as a change in the notion of what makes an article "long".
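To make the interaction concrete, the fitted coefficients can be plugged into the logistic function. The sketch below copies the estimates from the regression output above; the function names are hypothetical, and it assumes that `sc()` standardizes its argument (mean 0, standard deviation 1), which the source does not define.

```python
import math

# Coefficient estimates copied from the regression output above.
INTERCEPT = -2.72615
B_LENGTH = 0.23049
B_YEAR = 0.40717
B_INTERACTION = -0.02878


def revert_probability(z_length, z_year):
    """Predicted revert probability for scaled text_length and first_edit_year.

    Assumption: inputs are on the same scale that sc() produced in the model.
    """
    log_odds = (INTERCEPT
                + B_LENGTH * z_length
                + B_YEAR * z_year
                + B_INTERACTION * z_length * z_year)
    return 1.0 / (1.0 + math.exp(-log_odds))


def length_slope(z_year):
    """Change in log-odds per unit of scaled text_length at a given scaled year."""
    return B_LENGTH + B_INTERACTION * z_year
```

Because the interaction coefficient is negative, `length_slope` shrinks as `z_year` increases: the effect of article length on the log-odds of a revert is weaker for later cohorts, although it stays positive across any plausible range of scaled years.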
Summary
Newbies are editing longer pages in their first 10 edits than they used to. The plot above shows that the average length of pages edited by a newbie in their first 10 edits has quadrupled since 2003. Newbies are also getting reverted more in recent years than they used to be. The regression suggests that these two effects are largely distinct, since each remains a significant predictor when the other is included in the model (i.e., newbies were reverted for editing long pages when Wikipedia was young, and newbies are more likely to be reverted for editing short pages now than before).
Future Work
- What is it about long articles that makes them hard for newbies to successfully edit?
- Is it just harder for a newbie to contribute productively to an already-complete article?
- Is there a confounding factor (like the number of other editors active on an article) that results in newbies being reverted?
- Can we predict how long a newbie will stick with Wikipedia from the length of the pages they start editing?
- What types of edits do newbies make to long/short articles and does the difference in edit type matter more than where it is done?