
[A/B Test] Run an A/B test to evaluate impact of mobile DiscussionTools
Closed, Resolved (Public)

Description

This task represents the work with running an A/B test to evaluate the impact of
disabling the MobileFrontend talk page overlay and introducing the suite of mobile DiscussionTools:

  1. Reply Tool
  2. New Topic Tool
  3. Topic Subscriptions
  4. Usability Improvements

Research Questions

In running this A/B test, we are seeking to learn whether introducing the set of DiscussionTools listed above causes the following to happen:

  1. Junior Contributors are more successful publishing new talk page comments and discussion topics
  2. Junior Contributors intuitively recognize talk pages as spaces to communicate with other volunteers
  3. Senior Contributors can assess the level of activity on a talk page with less effort

Decision to be made

This A/B test will help us make the following decision: Is the set of mobile Talk Pages Project features fit to be made available to everyone, at all wikis, by default?

Decision Matrix

We do not think a single metric / KPI will be sufficient for evaluating the cumulative impact of the set of DiscussionTools we are introducing in this test.

The reason: we do not think there is a single metric that is likely to A) move in response to these changes *and* B) move in a direction that indicates a clear improvement or degradation in people's user experience.

In line with the above, we will take a "guardrail" approach to this analysis. Meaning, we will base the Decision to be made on the presence or absence of two unambiguously negative outcomes.

  1. Scenario: People are more likely to make destructive edits
     • Indicator/Metric: Proportion of published edits that are reverted within 48 hours of being made increases by >10% over a sustained period of time
     • Plan of action: 1) Pause plans for wider deployment, 2) to contextualize the change in revert rate, investigate changes in the number of published edits (maybe a higher revert rate is a "price" we're willing to "pay" for the increase in good edits), 3) investigate the type of edits being reverted to understand how the new tools – namely the Reply and New Topic Tools – could be contributing to the uptick
  2. Scenario: People are less likely to publish the edits they start
     • Indicator/Metric: Edit completion rate decreases by >10% over a sustained period of time
     • Plan of action: 1) Pause plans for wider deployment and 2) investigate what patterns exist among the people whose edit completion rate has gone down
  3. Scenario: People do NOT encounter more difficulty publishing edits and there are no regressions in edit revert and edit completion rates
     • Indicator/Metric: A) Edit completion rate increases by any percentage or decreases by <10% over a sustained period of time and B) edit revert rate decreases by any percentage or increases by <10% over a sustained period of time
     • Plan of action: Move forward with opt-out deployment at all Wikimedia wikis
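The guardrail logic above can be read as a simple decision function. The sketch below is illustrative only: the 10% thresholds come from the Decision Matrix, but the function name, its inputs, and the scenario numbering as return values are hypothetical, and the "sustained period" judgment is left to the analyst.

```python
# Illustrative reading of the Decision Matrix guardrails (not production code).
# Inputs are sustained *relative* changes, e.g. +0.12 for a +12% change.
def guardrail_scenario(revert_rate_change: float, completion_rate_change: float) -> int:
    """Map sustained relative metric changes to a Decision Matrix scenario."""
    if revert_rate_change > 0.10:
        return 1  # revert rate up >10%: pause wider deployment and investigate
    if completion_rate_change < -0.10:
        return 2  # completion rate down >10%: pause and look at affected users
    return 3      # no regression beyond thresholds: proceed with opt-out deployment

print(guardrail_scenario(revert_rate_change=0.05, completion_rate_change=0.20))
```

Note that whether a change is "sustained" is a judgment call the sketch deliberately leaves out; a one-day spike would not trigger scenarios 1 or 2.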

Curiosities

While the scenarios listed in the Decision Matrix section above will guide the decision to be made, we will also review the following metrics out of curiosity:

  1. Impact/Outcome: Junior Contributors intuitively recognize talk pages as places to communicate
     • Metric: Percentage of unique Junior Contributors who visit a talk page and engage with it in some way. Read: expanding a discussion section, initiating the workflow for starting a new discussion, initiating the workflow for replying to a comment someone else has made, etc.
  2. Impact/Outcome: Senior Contributors can assess the level of activity on a talk page with less effort
     • Metric: Average duration from when a contributor views a talk page to when they first engage with the page in some way
  3. Impact/Outcome: People across experience levels are more successful publishing new talk page comments and discussion topics
     • Metric: A) Average number of new talk page topics or comments people publish during the course of the test and B) percentage of people that edit a talk page, grouped by the number of new topics or comments (e.g. 1-5, 6-10, 11-15, etc.) they publish during the course of the test

Wikis

This section will contain the list of wikis participating in the A/B test. See T314950.

NOTE: In aggregate, there should be at least 2,000 people using mobile web talk pages at the wikis included in the A/B test and a minimum of 15 distinct wikis included in the test. Also, to draw conclusions about any individual wiki in the test, there will need to be ~200 unique people using talk pages while the test is running. More in T298271#7641238.
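As a back-of-the-envelope illustration (not the team's actual power analysis) of why these user minimums depend on effect size, the standard normal-approximation sample-size formula for comparing two proportions can be sketched. The baseline rates below are hypothetical; the z-values correspond to alpha = 0.05 (two-sided) and 80% power.

```python
# Rough sample-size sketch for a two-proportion comparison; illustrative only.
import math

def n_per_group(p1: float, p2: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Normal-approximation sample size per group to distinguish rates p1 and p2."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# A small 10% relative shift on a 15% base rate needs thousands of users per
# group, while an effect as large as the ones this test later observed
# (roughly 15% vs. 48%) is detectable with only a few dozen:
print(n_per_group(0.15, 0.165))
print(n_per_group(0.15, 0.48))
```

This is consistent with the note above: the ~200-user floor per wiki only supports conclusions about fairly large effects at the individual-wiki level.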

Open Questions

  • 1. Per the question @dchan raised in Editing Scratch, how long do we anticipate the A/B test needing to run, given the number of people using mobile talk pages and the frequency with which they are using them? See T295180 for more details on mobile talk page usage.
  • 2. Should the A/B test be limited to wikis that have NOT had access to any mobile talk page improvements via T298221 or T298222 to-date? See T297448#7575858 for more context.
    • Yes. The wikis involved in this A/B test will be limited to those who have NOT had access to any mobile DT features prior to the test beginning. See Selection Criteria within T314950's description for more details.

Done

  • A report is published that evaluates the Research Questions above


Event Timeline

ppelberg updated the task description.
MNeisler triaged this task as Medium priority.
MNeisler edited projects, added Product-Analytics; removed Product-Analytics (Kanban).
MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.

@ppelberg The metrics identified in the Decision Matrix and Curiosities sections in the Task Description look good to me. I made some small text changes to clarify what we would be measuring and to more closely match with the text outlined in the measurement plan.

Note: We do not currently list a metric to measure the following research question outlined in the measurement plan "People, across experience levels, to receive more timely responses to the talk page comments they post and discussion topics they start."

This metric would look specifically at the impact of the Topic Subscriptions feature being introduced but I think the metrics already identified would provide a better overall assessment of the impact of the suite of features being introduced. I'm fine leaving this metric off unless we identify a specific reason to look at it.

Documenting the outcomes of the conversation @MNeisler and I had offline today, in-line below...

> @ppelberg The metrics identified in the Decision Matrix and Curiosities sections in the Task Description look good to me. I made some small text changes to clarify what we would be measuring and to more closely match with the text outlined in the measurement plan.

Sounds great.

> Note: We do not currently list a metric to measure the following research question outlined in the measurement plan "People, across experience levels, to receive more timely responses to the talk page comments they post and discussion topics they start."
>
> This metric would look specifically at the impact of the Topic Subscriptions feature being introduced but I think the metrics already identified would provide a better overall assessment of the impact of the suite of features being introduced. I'm fine leaving this metric off unless we identify a specific reason to look at it.

Per what Megan and I talked about offline, we will not be adding response time as a key metric that we will base deployment decisions on. However, we will consider response rate as a potential "Curiosity" to review.

I've updated the task description to reflect the above.

ANALYSIS UPDATE

Per what @MNeisler and I talked about offline last week (25 Jan 2023), we'll need to exclude the behavior of people who are logged out from this analysis because of the issue @Ryasmeen detected and @DLynch documented in T321961#8521374.

The issue: people who are logged out would be bucketed; however, the version of talk pages they saw (read: DiscussionTools enabled or not) would be reset each time they visited a talk page.
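The bucketing problem can be illustrated with a sketch (this is not MediaWiki's actual bucketing code). A/B assignment is typically a deterministic hash of a user identifier, so it is only stable if that identifier persists across pageviews; a logged-out token that resets on each talk page visit effectively re-rolls the bucket.

```python
# Illustrative sketch of deterministic A/B bucketing; the function and the
# token format are hypothetical, not the MediaWiki implementation.
import hashlib

def bucket(token: str) -> str:
    """Deterministically assign a token to 'control' or 'test'."""
    return "test" if hashlib.sha256(token.encode()).digest()[0] % 2 == 0 else "control"

# A stable identifier (e.g. a logged-in user id) lands in the same bucket on
# every pageview; a token regenerated per visit would not.
print(bucket("user:12345") == bucket("user:12345"))  # True
```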

I've completed an initial analysis of the DiscussionTools on Mobile AB Test. Please see the report here.

I plan to update this ticket and report with a high-level summary of the results as well but wanted to provide the initial results for review.

cc @ppelberg

A quick high-level summary of some key results:

Edit Completion Rate (percent of edits started that are successfully published)

  • We observed a higher edit completion rate for users shown the suite of mobile DiscussionTools (test group) than for users shown either the existing MobileFrontend overlay or the Read as Wiki view (control group), both overall and on each participating Wikipedia.
    • There was a 56% increase (22 percentage points) in edit completion rate for users shown the DT-enhanced view compared to the MobileFrontend overlay, and a 61% increase (15.7 percentage points) compared to the existing Read as Wiki view.
    • All three editing workflows on the DT-enhanced view had a higher edit completion rate than edits made by users shown the Read as Wiki view, including a 2x increase in edit completion rate with the Reply Tool compared to edits on the Read as Wiki view.

[Screenshot: Screen Shot 2023-02-02 at 12.40.52 PM.png]
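Since the summary mixes relative changes ("56% increase") and absolute ones ("22 percentage points"), a small sketch shows how the two relate. The baseline rates below are back-solved from the quoted figures and are approximate, not the exact values in the report.

```python
# Relating absolute (percentage-point) and relative (percent) changes.
def describe_change(before: float, after: float) -> str:
    pp = (after - before) * 100            # absolute change, in percentage points
    rel = (after - before) / before * 100  # relative change, in percent
    return f"{pp:.0f} pp absolute, {rel:.0f}% relative"

# 22 pp / 56% implies a baseline near 39%; 15.7 pp / 61% implies one near 26%.
print(describe_change(0.393, 0.613))  # -> "22 pp absolute, 56% relative"
print(describe_change(0.257, 0.414))
```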

  • We also observed an increase in completion rates when comparing usage of the Reply Tool and New Topic Tool to the existing replying and new-topic workflows on MobileFrontend.

[Screenshot: Screen Shot 2023-02-02 at 1.18.08 PM.png]

  • There was a significantly higher proportion of distinct editors that successfully saved at least 1 mobile talk page edit when shown the mobile suite of DiscussionTools.
    Proportion of users that published at least 1 mobile talk page edit, by experiment group:
    • control: 15.42%
    • test: 48.3%
  • The revert rate for mobile talk page edits completed by users in the test group was 4.4 percentage points higher than the revert rate for users in the control group. The initial spike in the revert rate observed on the first day of the test was not sustained. (Note: I checked these numbers against the data in the recently available mediawiki_history snapshot and the results were the same.)
    Saved edits, reverts, and percent of edits reverted within 48 hours, by experiment group:
    • control: 2,301 saved edits, 87 reverts, 3.8% reverted
    • test: 3,701 saved edits, 302 reverts, 8.2% reverted
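As a sanity check, the revert rates above can be recomputed directly from the reported counts:

```python
# Recomputing the revert rates from the raw counts reported in this comment.
control_saves, control_reverts = 2301, 87
test_saves, test_reverts = 3701, 302

control_rate = control_reverts / control_saves  # share of edits reverted within 48h
test_rate = test_reverts / test_saves

print(f"control: {control_rate:.1%}  test: {test_rate:.1%}  "
      f"diff: {(test_rate - control_rate) * 100:.1f} pp")
# -> control: 3.8%  test: 8.2%  diff: 4.4 pp
```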

[Screenshot: Screen Shot 2023-02-02 at 12.33.29 PM.png]

@ppelberg - Reassigning to you for review. Please let me know if you have any questions.

Next steps

Based on the results @MNeisler shared in T298062#8579981 and T298062#8583003, the Editing Team considers us to be in scenario 3 of the task description's Decision Matrix: "People do NOT encounter more difficulty publishing edits and there are no regressions in edit revert and edit completion rates."

As such, we are proceeding with plans to offer the suite of mobile DiscussionTools as a default-on feature to everyone (logged in and out) at all Wikimedia wikis in the coming weeks.

We will be tracking these deployments in T298060.


Now, as it relates to revert rate: the rate at which people who were shown the DiscussionTools version of talk pages had their published edits reverted was 4.4 percentage points higher (8.16% vs. 3.78%) than the rate for people who were shown the existing MobileFrontend experience.

We are comfortable with this increase for the following reasons:

  1. With people publishing more edits, as a consequence of it being easier to do so, we anticipated an increase in revert rate. See Scenario 1.
  2. The absolute number of edits people are publishing using mobile talk pages, combined with the absence of feedback from experienced volunteers about mobile talk page disruption, leads us to think the increase in revert rate we see as part of this test is not [yet] negatively impacting wikis and the people who moderate them.
  3. We think the revert rate of the pre-DiscussionTools state of mobile talk pages might have been suppressed by how difficult people have found it to publish edits on talk pages, as evidenced by the 56% increase in the rate at which people who were shown the DiscussionTools version of talk pages by default published the edits they started.