
[SPIKE] How will we handle storage of PII data?
Closed, ResolvedPublic

Description

Questions we'd like to answer as part of this investigation:

  • How will we store participants' PII data:
  • We know it can only be kept for 90 days then needs to be aggregated (i.e., decoupled from usernames) - how will this work?
    • What about for instances where the registration is enabled for events that are greater than 90 days in the future?
  • Are we expecting a weird data type that will change the way we approach storing the data?
  • Will we limit the time before an event that an organizer may collect PII?
    • Option 1: you can collect PII anytime, but it is not visible in UI until 60 days ahead of event start date
    • Option 2: registration can only be enabled 60 days ahead of start date, along with PII collection

PII MVP - retention options.png (1×2 px, 192 KB)

More detail/consultation with Legal/Trust & Safety in this doc

Decisions:

  • We will aggregate the data only after the event has ended, for the MVP
  • Aggregated data will be shown to the organizer only:
    • After the event has ended
    • If there are at least 10 participants
  • The MVP will have the tables below:
ce_question_answers
{
    ceqa_id: bigint | autoincrement,
    ceqa_user_id: integer,
    ceqa_event_id: integer,
    ceqa_question_id: integer,
    ceqa_answer_option: integer | nullable,
    ceqa_answer_text: string | nullable
}

ce_question_aggregation
{
    ceqag_id: bigint | autoincrement,
    ceqag_question_id: integer,
    ceqag_event_id: integer,
    ceqag_qustion_option_id: integer,
    ceqag_answers_amount: integer
}

ce_event_questions
{
    ceeq_id: bigint | autoincrement,
    ceeq_event_id: integer,
    ceeq_question_id: integer
}
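
The two display conditions listed under Decisions (the event has ended, and there are at least 10 participants) could be checked roughly as follows. This is a minimal sketch, not the extension's code: the function name, its parameters, and the `MIN_PARTICIPANTS` constant are hypothetical, and treating the exact end time as "ended" is an assumption.

```python
from datetime import datetime, timezone

# Threshold taken from the decision above; the constant name is hypothetical.
MIN_PARTICIPANTS = 10


def can_show_aggregates(event_end: datetime, participant_count: int, now: datetime) -> bool:
    """Return True only if the event has ended and enough people took part."""
    return now >= event_end and participant_count >= MIN_PARTICIPANTS


end = datetime(2023, 6, 1, tzinfo=timezone.utc)
after = datetime(2023, 6, 2, tzinfo=timezone.utc)
before = datetime(2023, 5, 31, tzinfo=timezone.utc)

print(can_show_aggregates(end, 12, after))   # event over, 12 participants
print(can_show_aggregates(end, 9, after))    # event over, but only 9 participants
print(can_show_aggregates(end, 12, before))  # event still running
```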

To be decided:

  • When should we delete the PII data?
    • After the event is over?
    • After x days counting from when we start collecting the PII data?

Event Timeline

ldelench_wmf created this task.

We can discuss this synchronously, but I was thinking that maybe we might reach out to DBAs in the planning phase. This could help us understand and evaluate all options properly.

Hi there, we are about to start planning the DB structure for PII.
We actually had our first conversation about it at our Engineering meeting last week (14/03/2023).
I brought a first version of what the tables could look like. We can start from there and make adjustments until we get closer to something that fits our needs; then we can ask for a call with the DBAs to show it to them and get some advice.

Here is an example of what the tables could look like:
Tables: ce_questions, ce_question_options, ce_question_answers, ce_question_aggregation

#ce = Campaign Events
ce_questions

ceq_id | ceq_answer_type (1 - free text, 2 - one choice, 3 - multiple choice, etc...)
------------------------
1      | 2               ( E.g: What is your age? )

#ceq = ce_question
ce_question_options ( this table would save the options that the users can choose )

ceqo_id  | ceq_id
--------------------
1        | 1        ( E.g: I am 14 to 18 )
--------------------
2        | 1        ( E.g: I am 19 to 32 )

#ceqa = ce_question_answers
#Here we save the PII (this will be anonymized/deleted after 90 days)
ce_question_answers

ceqa_id | ceqa_user_id | ceqa_event_id | ceqa_question_id | ceqa_answer_option | ceqa_answer_text
1       | 1            | 1             | 1                | 2                  | null

#In the example below, the "ce_question_aggregation" table records that 9 users chose option ID 1 and 8 users chose option ID 2 when answering question ID 1, which in this example is "What is your age?". So 9 users said "14 to 18" and 8 users said "19 to 32".

ce_question_aggregation

ceqag_id | ceqag_question_id | ceqag_event_id | ceqag_qustion_option_id | ceqag_id_answers_amount | ceqag_answer
1        | 1                 | 1              | 1                       | 9                       | null
1        | 1                 | 1              | 2                       | 8                       | null

// this is an option of how to aggregate freetext answers
1        | 3                 | 1              | null                    | 7                       | aaaaaaaa
1        | 3                 | 1              | null                    | 5                       | bbbbbbbb
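
As a rough illustration of the example above, the per-option counts in ce_question_aggregation can be derived from individual ce_question_answers rows by grouping on event, question, and option, which drops the user ID. This is only a sketch: the `Answer` tuple and the `aggregate_option_answers` helper are assumptions made for illustration, and free-text answers are skipped.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Answer(NamedTuple):
    """One row of ce_question_answers (hypothetical in-memory form)."""
    user_id: int
    event_id: int
    question_id: int
    answer_option: Optional[int]
    answer_text: Optional[str]


def aggregate_option_answers(answers: Iterable[Answer]) -> Counter:
    """Count answers per (event, question, option), dropping the user ID."""
    counts: Counter = Counter()
    for a in answers:
        if a.answer_option is not None:  # skip free-text answers in this sketch
            counts[(a.event_id, a.question_id, a.answer_option)] += 1
    return counts


# 9 users pick option 1 and 8 users pick option 2 for question 1 of event 1.
rows = [Answer(u, 1, 1, 1, None) for u in range(1, 10)]
rows += [Answer(u, 1, 1, 2, None) for u in range(10, 18)]

for (event_id, question_id, option_id), amount in sorted(aggregate_option_answers(rows).items()):
    print(event_id, question_id, option_id, amount)
```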

@Daimona , @MHorsey-WMF , @ldelench_wmf, @Iflorez , @VPuffetMichel , @MMoss_WMF

Discussed with @cmelo @VPuffetMichel : let's review this at our next Engineering/Design Sync (also to make sure this structure serves the initial wireframes in T322751); if there are no concerns within Campaigns we can go ahead and request a review/consult with DBA.

Tables: ce_questions, ce_question_options, ce_question_answers, ce_question_aggregation

#ce = Campaign Events
ce_questions

ceq_id | ceq_answer_type (1 - free text, 2 - one choice, 3 - multiple choice, etc...)
------------------------
1      | 2               ( E.g: What is your age? )

#ceq = ce_question
ce_question_options ( this table would save the options that the users can choose )

ceqo_id  | ceq_id
--------------------
1        | 1        ( E.g: I am 14 to 18 )
--------------------
2        | 1        ( E.g: I am 19 to 32 )

We decided that we don't need the 2 tables above, so the only tables will be:
ce_question_answers, ce_question_aggregation

Hi @cmelo @VPuffetMichel @gonyeahialam @Iflorez , we decided that one of our sprint goals would be deciding on boundaries for PII retention before/after an event takes place. I think that fits into the scope of this spike, so I've added to the task description two options that were discussed this week, along with a visual if it's helpful. I'm eager to hear what you all think of these, if you have a preference, or if you have other approaches in mind. Thanks!

PS: if you're wondering where I came up with the 60-day-before-event-start-date boundary: for events created in January 2023, the maximum time between enabling registration and the event start date was 56 days.
This is open to debate, too!

As far as time boundaries prior to event-start:
@JStephenson noted that many grantee organizers register event participants in the month prior to the event (to ensure time to work out any account registration or IP issues).

My sense is that the time boundary seen in the data is in large part a product of what we've allowed up to now.

Have ambassadors shared their feedback? @Astinson might have feedback from his experience on when organizers begin registering for events.

@lauren thank you for the visual.

As far as data retention time periods overall, Option 2 is more conservative and thus my preference of the two options.

However, I'm not seeing why we need to hold data for 60 days prior to the event starting AND also 90 days after the event starting. Why not 60 days prior and 30 days post? Or 30 days prior and 60 days post? From a data retention perspective, keeping to the current time period max (90) seems best for keeping personal data "for the shortest possible time that is consistent with the maintenance, understanding, and improvement of the Wikimedia Sites, and our obligations under applicable U.S. law" - 90 day retention policy page

@cmelo Can you say more about the ce_question_aggregation buckets? Ideally, the buckets that we use coincide with the answer options. Also, what about thresholds? If a bucket has fewer than x answers, can we report it as <x?

Change 904513 had a related patch set uploaded (by Damilare Adedoyin; author: Damilare Adedoyin):

[wikimedia/fundraising/SmashPig@master] Handle REJECT and other failure responses with http response status code 200 in ApprovePayment and CancelPayment methods for Dlocal.

https://gerrit.wikimedia.org/r/904513

As for starting registration for events: 56 days is super generous. I can't think of a situation where folks are registering more than 30 days out (you have to remember most of the "actions" that organizers want people to do are easy to forget about if you don't do it in a timely manner).

@cmelo Can you say more about the ce_question_aggregation buckets? Ideally, the buckets that we use coincide with the answer options. Also, what about thresholds? If a bucket has fewer than x answers, can we report it as <x?

Hi @Iflorez, sure. We are still checking how we will aggregate the data, and we will probably store it on X1 as well. We will aggregate the data on an ongoing basis: that is, we will update the aggregated data whenever users answer or edit their responses, and then after 90 days we will anonymize or delete the PII.
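
Keeping the aggregates up to date as users answer or edit their responses, as described in this comment, amounts to decrementing the count for the old option and incrementing the count for the new one. A minimal sketch, assuming single-choice questions and in-memory storage; the `RunningAggregates` class and its method names are hypothetical and not taken from the extension.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


class RunningAggregates:
    """Keeps per-option counts in sync with individual answers (sketch only)."""

    def __init__(self) -> None:
        # (user_id, event_id, question_id) -> currently selected option
        self.current: Dict[Tuple[int, int, int], int] = {}
        # (event_id, question_id, option_id) -> number of answers
        self.counts: Counter = Counter()

    def set_answer(self, user_id: int, event_id: int, question_id: int,
                   option_id: Optional[int]) -> None:
        """Record a new or edited answer; option_id=None removes the answer."""
        key = (user_id, event_id, question_id)
        old = self.current.pop(key, None)
        if old is not None:
            self.counts[(event_id, question_id, old)] -= 1
        if option_id is not None:
            self.current[key] = option_id
            self.counts[(event_id, question_id, option_id)] += 1


agg = RunningAggregates()
agg.set_answer(user_id=1, event_id=1, question_id=1, option_id=1)
agg.set_answer(user_id=2, event_id=1, question_id=1, option_id=1)
agg.set_answer(user_id=2, event_id=1, question_id=1, option_id=2)  # user 2 edits
print(agg.counts[(1, 1, 1)], agg.counts[(1, 1, 2)])
```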

We had a meeting with the DBA, and he told us that we should check with privacy engineering about anonymizing data (e.g., adding noise to, or randomizing, responses for aggregation). I will add more context about this as soon as I know more.

Thank you!

Summary of team discussions and decisions on this

First of all, our requirements are:

  • The data should be aggregated no later than when the event ends. This is necessary because the aggregated data must then be shown to the organizer. Note: deciding how the information will be presented to the organizer, and to what extent, is not within the scope of this task. Same for the "when", which we already agreed will be "when the event ends".
  • Once a participant's answers have been deleted and aggregated, the participant won’t be able to answer again. The reason for this is that otherwise, when the new answers are also aggregated, the user’s answers would be duplicated in the aggregated data.
    • This will also be enforced if the participant unregisters and then registers again.
    • We will explain this in the interface.

Note that deletion and aggregation need not happen at the same time. What's really important is that all the data has been aggregated when the event ends, because the organizer would then be able to see the aggregates. Data may as well be deleted once it's been aggregated, since we couldn't allow further changes anyway.

We considered the following options:

  1. Event-based
    1. Delete all the data 90 days after event creation, allow organizers to enable registration at most 30 days before event.
    2. Delete all the data 90 (or 60?) days after event start, allow organizers to enable registration at most 30 days before event.
    3. Enable PII data collection 90 days before event ends, regardless of event start and creation; delete all the data when event ends.
  2. Answer-based
    1. For each participant, delete the data 90 days after the first answer was provided
    2. For each answer, delete it 90 days after it was provided

We believe that options 1A, 1B, and 1C are not ideal because they impose unnecessary restrictions on organizers (1A and 1B with the 30 days constraint), or make it more difficult for participants to provide their answers (1C, which would require a two-step interaction from participants). Additionally, there would be many edge cases stemming from organizers changing the start and end dates just to affect the PII availability, either by mistake or maliciously.

Of the remaining two options, that do not have the cons of the first three, we especially liked 2A because the behaviour would be easier to explain and understand for both organizers and participants. Therefore, we decided that we will proceed with option 2A. In practice, this means:

  • Organizers can enable registration whenever they want
  • Participants can provide answers whenever they want
  • As soon as a participant answers the first question, a 90 day timer starts. When the timer reaches 0, we will aggregate and delete the answers of that participant.
  • If there are remaining unaggregated answers by the time the event ends, those answers will also be aggregated and deleted even if the 90 days haven't passed yet.
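
Under option 2A, the moment a participant's answers are aggregated and deleted is the earlier of "90 days after the first answer" and "event end". The sketch below shows that rule under the assumption that both timestamps are available as timezone-aware datetimes; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention period from the decision above


def aggregation_deadline(first_answer_at: datetime, event_end: datetime) -> datetime:
    """Return when a participant's answers are aggregated and deleted.

    This is 90 days after the first answer or the end of the event,
    whichever comes first.
    """
    return min(first_answer_at + RETENTION, event_end)


first_answer = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Short event: it ends before the 90 days are up, so the event end wins.
print(aggregation_deadline(first_answer, datetime(2023, 2, 1, tzinfo=timezone.utc)))

# Long event: the 90-day timer expires while the event is still running.
print(aggregation_deadline(first_answer, datetime(2023, 12, 1, tzinfo=timezone.utc)))
```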

This implies that there's a new requirement: organizers cannot change the end date of the event if the event is in the past. Without this restriction, organizers would be able to move the end date back and forth to view the aggregated data, and potentially infer individualized data from it. For instance, if there are 20 participants, you could change the end date to be in the past, take note of the aggregated data, and set the end date in the future again. When a new person registers, you could repeat the same trick and infer their answers from the differences in aggregated data. This restriction will be explained to the organizers.
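
The inference described above works by differencing: subtracting two snapshots of the aggregates isolates whatever changed between them. A toy illustration with made-up option labels and counts (none of this data comes from a real event):

```python
from collections import Counter

# Aggregated answers to "What is your age?" seen by the organizer the first
# time the end date is moved into the past (20 participants, made-up data).
snapshot_before = Counter({"14 to 18": 9, "19 to 32": 11})

# One new person registers, and the organizer repeats the trick.
snapshot_after = Counter({"14 to 18": 9, "19 to 32": 12})

# Counter subtraction keeps only the positive differences, which here is
# exactly the new participant's answer.
leaked = snapshot_after - snapshot_before
print(dict(leaked))
```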

Additionally, we decided that if a participant registers for an event, answers questions, and then cancels their registration before the answers are aggregated, we will also delete their answers (and pretend that they never answered in the first place, in case they choose to register again later). This is based on the assumption that we won't keep data without a valid reason to do it.

We're still left with an open question: should we delete the “time of first answer” if the participant deletes all their answers? Meaning we would restart counting from the start if they choose to later answer the question again.

With these decisions being made, we have the following next steps:

  • We should explain to participants when their answers will be deleted:
    • There could be a preemptive explanation that data will be deleted 90 days after the first answer OR when the event ends, whichever is first.
    • There should be some text that is only shown after the participant answered for the first time, saying something like “Your answers will be deleted on $DATE”.
  • We should explain to organizers that answers are deleted 90 days after first answer OR when event ends. This could be in the participants list, for instance, and it would be a generic explainer, like the one for participants mentioned above.
    • This could also mention the fact that the organizer will be able to view the aggregated data when the event ends.
  • We need to make sure that organizers are aware of the consequences of changing the end date to be in the past (data aggregated immediately, no longer possible to “reopen” the event), even if it happens by mistake. This might be explained in the EditEventRegistration form.
  • We will add an error message to use when the organizer tries to change the end date of an event that has already ended. This will be shown upon form submission.
  • We acknowledge the fact that if someone registers, their answers are aggregated, and then they unregister, their answers will remain in the aggregated data even if the person didn’t participate. We still don't know if this will be stated explicitly somewhere in the interface.
  • We will make sure that participants are not allowed to answer if their answers have already been aggregated. This should work whether the participant has unregistered and re-registered or not.
    • We need an implementation task
    • We need “error” messages
    • We will consider adding a preemptive explanation
  • The proposed schema (T335526) should be updated:
    • We should store the time of first answer, probably in the ce_participants table
    • We may have to store whether your answers have already been aggregated (e.g., to prevent you from answering again), if this can't be inferred from the other data.
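
The two schema additions above could look roughly like this. It is a sketch only: `first_answer_at` and `aggregated_at` are hypothetical field names (the actual columns are to be defined in T335526), and storing an explicit aggregation timestamp is just one way to record that answers were already aggregated.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Participant:
    """Subset of a ce_participants row; both timestamp fields are hypothetical."""
    user_id: int
    event_id: int
    first_answer_at: Optional[datetime] = None  # starts the 90-day timer
    aggregated_at: Optional[datetime] = None    # set once answers are aggregated


def can_answer(p: Participant) -> bool:
    """Answers are locked once they have been aggregated."""
    return p.aggregated_at is None


def record_answer(p: Participant, now: datetime) -> None:
    """Store the time of the first answer; later edits do not reset it."""
    if not can_answer(p):
        raise PermissionError("Answers for this event were already aggregated")
    if p.first_answer_at is None:
        p.first_answer_at = now


p = Participant(user_id=1, event_id=1)
record_answer(p, datetime(2023, 5, 1, tzinfo=timezone.utc))
record_answer(p, datetime(2023, 5, 3, tzinfo=timezone.utc))  # an edit
print(p.first_answer_at)

p.aggregated_at = datetime(2023, 7, 30, tzinfo=timezone.utc)
print(can_answer(p))
```

Because the participant record is kept even after someone unregisters, the same check would also cover the unregister and re-register case.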

Once the above steps have been reviewed, we will update the relevant tasks and create new tasks where needed.

We're still left with an open question: should we delete the “time of first answer” if the participant deletes all their answers? Meaning we would restart counting from the start if they choose to later answer the question again.

That's a good point, I think it makes sense to delete it.
what do you think @VPuffetMichel , @MHorsey-WMF , @Iflorez ?

@Daimona this looks good. I would start working on the design and reach out if I encounter any confusion.

Some of the next steps above that involve design work may end up being designed differently from how they are described, but I will reach out to you in such cases. An example is:

  • We will add an error message to use when the organizer tries to change the end date of an event that has already ended. This will be shown upon form submission.

Since we don't want this action to happen in the first place, as described below, we don't need to wait until the user does it and submits the form before informing them that it is not allowed. That would be stressful. We can tell them beforehand, either with an inline error message as soon as they change the date, or by disabling editing of the date after the event has ended, with a note on why it is disabled. I will explore this further as I begin the design.

This implies that there's a new requirement: organizers cannot change the end date of the event if the event is in the past. Without this restriction, organizers would be able to move the end date back and forth to view the aggregated data, and potentially infer individualized data from it. For instance, if there are 20 participants, you could change the end date to be in the past, take note of the aggregated data, and set the end date in the future again. When a new person registers, you could repeat the same trick and infer their answers from the differences in aggregated data. This restriction will be explained to the organizers.

We're still left with an open question: should we delete the “time of first answer” if the participant deletes all their answers? Meaning we would restart counting from the start if they choose to later answer the question again.

That's a good point, I think it makes sense to delete it.
what do you think @VPuffetMichel , @MHorsey-WMF , @Iflorez ?

I don't think we need it for anything or do we?

@Daimona this looks good. I would start working on the design and reach out if I encounter any confusion.

Some of the next steps above that involve design work may end up being designed differently from how they are described, but I will reach out to you in such cases.

Yup, absolutely! I included some potential design ideas just to clarify what I meant, but those are in no way finalized or required. I will leave that up to you.

An example is:

  • We will add an error message to use when the organizer tries to change the end date of an event that has already ended. This will be shown upon form submission.

Since we don't want this action to happen in the first place, as described below, we don't need to wait until the user does it and submits the form before informing them that it is not allowed. That would be stressful. We can tell them beforehand, either with an inline error message as soon as they change the date, or by disabling editing of the date after the event has ended, with a note on why it is disabled. I will explore this further as I begin the design.

Makes sense! I think disabling the field is a good thing to do regardless of additional error messages. However, do note that we will still need an error message for API users.

We're still left with an open question: should we delete the “time of first answer” if the participant deletes all their answers? Meaning we would restart counting from the start if they choose to later answer the question again.

That's a good point, I think it makes sense to delete it.
what do you think @VPuffetMichel , @MHorsey-WMF , @Iflorez ?

I don't think we need it for anything or do we?

Indeed. I also think it should just be deleted.

@Daimona I have begun designing and I have some additional questions

Question

Note that deletion and aggregation need not happen at the same time. What's really important is that all the data has been aggregated when the event ends, because the organizer would then be able to see the aggregates. Data may as well be deleted once it's been aggregated since we couldn't allow further changes anyway.

  • Would deletion always happen 90 days after the first answer regardless of when the aggregation happens?

• As soon as a participant answers the first question, a 90-day timer starts. When the timer reaches 0, we will aggregate and delete the answers of that participant.

  • Does the timer start counting when the participant answers their first question (I understand the first question to be whichever question they answer first, e.g. What is your age?) or when the participant submits the registration form with at least one question answered?
  • Also, since deletion and aggregation may not happen at the same time, couldn't the timers for aggregation and deletion be different?

We will make sure that participants are not allowed to answer if their answers have already been aggregated. This should work whether the participant has unregistered and re-registered or not.

Please confirm my interpretation of the consequences of this.
For events that end before the 90-day limit (aggregation happens at event end), the consequence would be that after aggregation (which is also the event end), the participant:

  • Can’t edit their registration form responses,
  • Can cancel their registration but cannot register again;

If the event is longer than 90 days, aggregation would happen while the event is still ongoing. The consequence would be that after aggregation, the participant:

  • Can’t edit registration form responses,
  • Can cancel the registration,
  • Can register again (if the event hasn't ended) but cannot fill in the registration form. (But how would we know that a participant who previously canceled their registration is trying to register again? Are we still going to store the fact that they once registered, even after cancellation?)

Note that deletion and aggregation need not happen at the same time. What's really important is that all the data has been aggregated when the event ends, because the organizer would then be able to see the aggregates. Data may as well be deleted once it's been aggregated since we couldn't allow further changes anyway.

  • Would deletion always happen 90 days after the first answer regardless of when the aggregation happens?

I guess this is more of an implementation detail, but I think we would always do aggregation and deletion at the same time.

• As soon as a participant answers the first question, a 90-day timer starts. When the timer reaches 0, we will aggregate and delete the answers of that participant.

  • Does the timer start counting when the participant answers their first question (I understand the first question to be whichever question they answer first, e.g. What is your age?) or when the participant submits the registration form with at least one question answered?

I'm not sure if I understand the difference between these. Participants would do both things at the same time, right?

  • Also, since deletion and aggregation may not happen at the same time, couldn't the timers for aggregation and deletion be different?

In theory, they could. But I think for simplicity we could just aggregate and delete the data at the same time, and with the same timer. This doesn't have to be the case, but it's the simplest option I believe.

We will make sure that participants are not allowed to answer if their answers have already been aggregated. This should work whether the participant has unregistered and re-registered or not.

Please confirm my interpretation of the consequences of this.
For events that end before the 90-day limit (aggregation happens at event end), the consequence would be that after aggregation (which is also the event end), the participant:

  • Can’t edit their registration form responses,

Correct.

  • Can cancel their registration but cannot register again;

Also correct, and this is already how it works.

If the event is longer than 90 days, aggregation would happen while the event is still ongoing. The consequence would be that after aggregation, the participant:

  • Can’t edit registration form responses,

Aye.

  • Can cancel the registration,

Yeah, same as current behaviour.

  • Can register again (if the event hasn't ended) but cannot fill in the registration form

Yea.

(But how would we know that a participant who previously canceled their registration is trying to register again? Are we still going to store the fact that they once registered, even after cancellation?)

Yes, and this is already the case, although it's not exposed anywhere in the interface.

I've created a bunch of tasks for the next steps outlined in T328032#8936008. You can see them in the mentions right above this comment. I am therefore closing this task as resolved, and conversation about a specific next step can continue in the relevant task.

Thank you @Daimona

Quick answers here and I'll follow up on the appropriate follow-up tickets:

We're still left with an open question: should we delete the “time of first answer” if the participant deletes all their answers? Meaning we would restart counting from the start if they choose to later answer the question again.

That's a good point, I think it makes sense to delete it.
what do you think @VPuffetMichel , @MHorsey-WMF , @Iflorez ?

  1. I don't have a reason for keeping this data
  2. Shall we aggregate and delete the data at the same time, and with the same timer? Maybe not... More on implementation of aggregation coming soon.