**Overview**
We want to categorize the types of, and reasons for, deletions of uploaded media files, based on an analysis of a sample of filed deletion requests. Once we understand the main reasons and the rough proportion of each deletion type, we can identify the most problematic ones and prioritize improvements that minimize their inflow.
This is part of [[ https://docs.google.com/document/d/1qBWhT5O-47P8xzRxoLpjZ9UaoUzSAt5ZAOnrZwX8nqs/edit | Design research on Commons ]]. We would first do a programmatic analysis and then ask design research for a qualitative analysis on top.
Useful information about the baselines for uploads and some deletion request ratios can be found in the comments here: https://phabricator.wikimedia.org/T337466
**Requirements**
Step 1: Preliminary analysis
- Which data can we get about a deletion request? Before proceeding to the sampling and analyses, send an example with all the data we can get to Sneha and Alexandra for review and discussion about which data to include in the analysis.
Step 2: Analysis of a sample
Retrieve a random sample of 1000 deletion requests from the last year and try to categorise them by the following parameters:
- Type of deletion request (speedy or regular)
- Time to resolve (speedy is immediate; for regular requests, use classes such as: up to 2 weeks, 2 weeks to 1 month, 1 month+, not yet resolved)
- Reasons - see reasons in this [[ https://docs.google.com/document/d/1jxyyui4onla8cO0ub0Zrw9fKApHmulu_f0Tf8xENqlY/edit | write-up ]]. Implementation note: reasons for deletion requests should have tags, so we can probably use those.
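The time-to-resolve classes could be assigned roughly as in the sketch below. It is only an illustration: the function name `resolution_class`, the `"speedy"`/`"regular"` labels, and the choice of 30 days for "1 month" are assumptions, not something fixed by this task.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolution_class(request_type: str,
                     opened: datetime,
                     closed: Optional[datetime]) -> str:
    """Map one deletion request to a time-to-resolve class.

    Assumptions (hypothetical, to be agreed on): speedy requests always
    count as "immediate", and "1 month" is treated as 30 days.
    """
    if request_type == "speedy":
        return "immediate"
    if closed is None:
        # Regular request that is still open.
        return "unresolved"
    elapsed = closed - opened
    if elapsed <= timedelta(days=14):
        return "up to 2 weeks"
    if elapsed <= timedelta(days=30):
        return "2 weeks to 1 month"
    return "1 month +"
```

The exact boundaries (for example whether day 14 belongs to the first or second class) should be confirmed before running the real analysis.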
Questions we want to answer:
- Share (%) of each deletion class
- What are the most commonly reported reasons within each class?
- Is there any correlation between, e.g., time to close and specific reasons?
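A minimal sketch of how these three questions could be computed, assuming each sampled request has already been reduced to a dict with the hypothetical keys `type`, `reason`, and `time_class`. Because reasons and time classes are both categorical, the "correlation" question is approximated here with a chi-square statistic over their contingency table; that choice is an assumption, and other association measures could be used instead.

```python
from collections import Counter, defaultdict


def class_shares(records):
    """Percentage of requests per deletion class (e.g. speedy vs. regular)."""
    counts = Counter(r["type"] for r in records)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def top_reasons_by_class(records, n=3):
    """The n most frequent reasons within each deletion class."""
    by_class = defaultdict(Counter)
    for r in records:
        by_class[r["type"]][r["reason"]] += 1
    return {cls: c.most_common(n) for cls, c in by_class.items()}


def chi_square(records, row_key="reason", col_key="time_class"):
    """Chi-square statistic for the row_key x col_key contingency table.

    0 means the observed counts match what independence would predict;
    larger values suggest an association between the two fields.
    """
    observed = Counter((r[row_key], r[col_key]) for r in records)
    row_totals = Counter(r[row_key] for r in records)
    col_totals = Counter(r[col_key] for r in records)
    total = len(records)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / total
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat
```

With 1000 requests spread over many reasons, some cells of the table will be sparse, so the statistic should be read as a rough signal for the qualitative follow-up rather than a formal test result.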
Step 3: We want to ensure that the analysis is representative and not biased toward the latest 1000 deletion requests, so we would like to run the same analysis on several historical samples.
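One way to draw the historical samples is sketched below: a seeded random sample from each time window, so the runs are reproducible and comparable. The function name `sample_per_window`, the `opened` key, and the window layout are hypothetical; how the requests are actually retrieved is out of scope for this sketch.

```python
import random
from datetime import datetime


def sample_per_window(requests, windows, size=1000, seed=0):
    """Draw one random sample of up to `size` requests per time window.

    `requests` is a list of dicts with an "opened" datetime (assumed shape);
    `windows` maps a label to a (start, end) pair, with end exclusive.
    """
    rng = random.Random(seed)  # fixed seed keeps the samples reproducible
    samples = {}
    for label, (start, end) in windows.items():
        pool = [r for r in requests if start <= r["opened"] < end]
        # If a window holds fewer than `size` requests, take all of them.
        samples[label] = rng.sample(pool, min(size, len(pool)))
    return samples
```

Running the Step 2 analysis on each returned sample and comparing the resulting shares would show whether the latest year is typical.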