
SYNTHETIC REALITIES AND ARTIFICIAL INTELLIGENCE-GENERATED CONTENTS

Deepfake Detection in Super-Recognizers and Police Officers

Meike Ramon | University of Lausanne and AIR – Association for Independent Research
Matthew Vowels | University of Lausanne, Lausanne University Hospital and University of Lausanne,
and The Sense Innovation and Research Center
Matthew Groh | Northwestern University

We examined human deepfake detection performance (DDP) in relation to face identity processing ability among Berlin Police officers, including Super-Recognizers (SRs). While we find no relationship, further research into human DDP using state-of-the-art static deepfakes is needed to establish the potential value of SR deployment.

The present study is the first empirical investigation of the relationship between human deepfake detection performance (DDP) and individuals' face identity processing (FIP) ability. Using videos from the Deepfake Detection Challenge, we investigated DDP in two unique observer groups: Super-Recognizers (SRs) and "normal" officers from within the 18,000 members of the Berlin Police. SRs were identified either via previously proposed lab-based procedures or via the only existing tool for SR identification involving increasingly challenging authentic forensic material: the Berlin Test For Super-Recognizer Identification (beSure). Participants judged either pairs of videos or single videos in a two-alternative forced-choice (2AFC) decision setting (that is, which of the pair, or whether a single video, was a deepfake or not). We explored speed–accuracy tradeoffs and compared DDP between lab-identified SRs and non-SRs, and across police officers as a function of their independently measured FIP ability. Interestingly, we found no relationship between DDP and FIP ability. Further work using static deepfakes created with current state-of-the-art generative models is needed to determine the value of SR deployment for deepfake detection in law enforcement.

Digital Object Identifier 10.1109/MSEC.2024.3371030
Date of publication 26 March 2024; date of current version 10 May 2024.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
68 May/June 2024 Copublished by the IEEE Computer and Reliability Societies

Introduction

Perception Versus Reality
Our perception of the world around us is highly subjective; presented with the same information, we interpret it in vastly different ways. Our perception is influenced by several factors, most of which operate without our knowledge or control.1 Perceptual illusions provide compelling examples of the highly subjective nature of human perception and of how it is influenced by both external stimuli and internal cognitive processes. Simply put, the world as we perceive it reflects a unique interaction between incoming information and the way it is processed depending on our abilities, prior experiences, and expectations.
Adding to this human complexity, technological advances provide means to alter or even create entirely novel information (for a review, see Farid2). Whether consciously or not, all of us are likely to have already experienced some form of synthetic media—or deepfake.
Deepfakes have been used in the realm of art, for example, to (re)create characters or scenarios, including interactive installations to create immersive experiences.3 In wider society, the word "deepfake" is typically associated with misinformation, that is, the intentional manipulation of audio content or facial information. Facial deepfakes can take various forms, for example, swapping the entire face or individual features or manipulating them—either in static 2D images or dynamic video sequences. Such manipulations can include masking or enhancing information or changing characteristics that are stable (for example, gender and ethnicity) or those that vary across different time scales (for example, age and expressions of emotion). These manipulations aside, facial deepfakes also encompass the creation of entirely new artificially generated synthetic facial identities.

Deepfake Detection in Real Life and Laboratory Conditions
In 2022, several European mayors, including those from Berlin, Madrid, and Vienna, were deceived into holding video calls with a deepfake impersonating Vitali Klitschko, the mayor of Kyiv.4 In the case of Berlin, Mayor Franziska Giffey became suspicious ~15 min into the call when the fake Klitschko started discussing controversial topics regarding Ukrainian refugees. The deception, which was confirmed through diplomatic channels, emphasizes the usage and impact of misinformation in the political realm.
Notwithstanding their importance, the known number of instances of deepfake deployment to influence politics is modest compared to the much more frequent targeting of celebrities, public figures, and everyday people.5 Deepfakes are considered to pose crucial "risks to our democracy and to national security," as well as to "individuals and businesses fac[ing] novel forms of exploitation, intimidation, and personal sabotage."6 Given the challenges associated with facial deepfakes, the increasing number of studies emerging in this domain is not surprising. However, these studies typically report algorithmic approaches to tackle deepfake detection. Importantly, the speed of deep learning-based deepfake detection theoretically makes it suitable for large-scale implementation. However, human and machine-based processing operates based on different features, which remain to be clearly defined (Wichmann and Geirhos7). Differences between machines and humans aside, our understanding of humans' ability for deepfake detection actually remains largely unexplored. Considering that humans will always be required to make final decisions, it is critical to understand the limits of our ability to detect deepfakes.
A few studies have investigated humans' perception of deepfakes. For example, Groh et al.8 reported that "ordinary humans perform in the range of the leading machine learning model on a large set of minimal context videos" (p. 1). Although the highest DDP could be achieved by combining human and model predictions, humans often incorrectly updated their responses when exposed to inaccurate model predictions (that is, machine-based responses that were not correct). Thus, integrating human with model predictions can result in an increase or decrease in DDP. Moreover, the authors reported that manipulations that are known to disrupt human FIP, notably stimulus inversion, were associated with decreased human—but not model—performance. These findings were interpreted as supporting "a role for specialized cognitive capacities in explaining human deepfake detection performance" (Groh et al.8).

Knowledge Gap
An important question that remains unanswered is whether—and to what degree—deepfake detection performance varies across observers. For instance, highly motivated, trained law enforcement professionals might outperform neurotypical observers, who are not professionally tasked with face or deepfake processing. On the other hand, it is possible that stable individual differences in FIP ability, notably superior abilities (for example, Ramon9 and Ramon and Vowels10), may be a better predictor of DDP. Conceivably, compared to neurotypical observers, individuals with substantially inferior or exceptionally superior FIP ability may exhibit markedly different sensitivity to information manipulation.
Over the past decade, there has been a surging interest in so-called Super-Recognizers, individuals with an apparently innate superiority in face identity processing (Ramon,9 Ramon and Vowels,10 and Mayer and Ramon11). These individuals are of interest not only to cognitive (neuro)scientists but also to law enforcement and policing (Ramon9 and Ramon et al.12). The most consistent strategy for identifying these unique individuals has been proposed by Ramon,9 whose diagnostic framework for lab-based SR identification comprises challenging behavioral tests assessing perception and recognition memory for facial identities. The only existing tool designed to identify law enforcement professionals using authentic police images was proposed by Ramon and Rjosk.13 While empirical evidence into the mechanisms underlying SRs' ability is mounting (for example, Nador et al.,14 Nador et al.,15 and Linka et al.16), to date, no study has investigated DDP in SRs or law enforcement professionals.


Understanding factors that influence our perception and detection of deepfakes is critical considering their potentially wide-ranging societal implications. Such knowledge is particularly pertinent for organizations that are expected to monitor and mitigate threats by deepfakes: law enforcement professionals. Therefore, in this study, we investigated the impact of human factors—professional occupation and individual differences in face identity processing ability—on deepfake detection performance. We did so by testing two unique cohorts of observers: previously reported SRs (Ramon9) and law enforcement professionals from within the 18,000 officers of the Berlin Police (Ramon and Vowels10). Their performance was measured using identical stimulus material and experimental settings, and compared against neurotypical control observers' data as reported previously (Groh et al.8).

Methods
This research complies with all relevant ethical regulations, and the Massachusetts Institute of Technology's Committee on the Use of Humans as Experimental Subjects approved the deepfake detection portion of this study as Exempt Category 3 – Benign Behavioral Intervention. This study's exemption identification number is E-3354. All procedures and protocols were approved by the University of Fribourg's Ethics Committee (approval number 473) and conducted in accordance with both their guidelines as well as those set forth in the Declaration of Helsinki. All participants were healthy volunteers, provided informed written consent, and were not financially compensated for their participation.

Experiments
Participants were invited to participate in two deepfake detection tests reported previously by Groh et al.8 and exemplified in Figure 1(a). The first experiment involves presenting two stimuli in a 2AFC design; the second presents a single stimulus. Observers are required to decide which of the two stimuli in the 2AFC setting represents a deepfake and to report their confidence in the single-video stimulus being a deepfake. Participants could complete as many trials as they wished. The full 2AFC and single-video experiments comprised a total of 56 and 56 trials, respectively (for full details, see Groh et al.8).

Participants
The data reported in this study originated from different sources. First, data published previously by Groh et al.8 included nonrecruited observers (who arrived at the website via organic links on the Internet) and observers recruited from Prolific.17 These data were considered as representing neurotypical controls (as no independent measure of their FIP ability was available).

[Figure 1 appears here: (a) example video stills from the 2AFC and single-video experiments; (b) accuracy (y-axis) for non-SRs versus SRs in each experiment; (c) z-scored performance (y-axis) across beSure, the 2AFC experiment, and the single-video experiment.]
Figure 1. Stimuli and results for DDP. (a) Example stimuli presented in the 2AFC (left) and single-video (right) experiments.
(Source: Adapted from Groh et al.8) (b) DDP for each of the two experiments for SRs and control observers (dark and light
grey). (c) Relationship between different performance measures along the x-axis: performance across beSure (left) and
both deepfake experiments (middle and right). Colors indicate beSure performance rank to visualize the (in)dependence
between FIP ability measured by beSure, and observers’ performance for the deepfake experiments.

70 IEEE Security & Privacy May/June 2024


Second, data from lab-identified SRs reported previously (Ramon9) and thereafter using the same lab criteria were invited to participate. Finally, Berlin Police officers who had previously participated in beSure (Ramon and Vowels10 and Ramon and Rjosk13), the only existing tool for SR identification with authentic police material (for details, see Ramon and Vowels10 and Ramon and Rjosk13), were invited to participate in the deepfake detection experiments.
In total, 193 individuals contributed data to the first experiment (decision: which video in a pair was a deepfake), and 132 contributed to the second experiment (decision: whether individually presented videos were real or fake). Of these, 106 and 68, respectively, met the SR lab criteria (Ramon9). Note that the majority of SRs were thus not Berlin Police officers but came from the ~90 individuals tested in the AFC Lab. Eighteen lab-identified SRs were from the sample of participating police officers. Note: Demographic information can be provided only for participating Berlin Police officers (Ramon and Vowels10) and is summarized in Table 1. [No information is available for the nonrecruited/recruited observers reported originally by Groh et al.8 or for the SRs identified previously (Ramon9) and, after this publication, using the same lab criteria, as participation did not require the provision of personal information.]

Table 1. Demographic information for participating Berlin Police officers.

Experiment     n (Sample Size)   Mean Age   Standard Deviation   Handedness (Right/Left/Ambidextrous)   Gender (Female/Male/Diverse)
2AFC           89                42         9                    75/12/2                                27/62/0
Single video   65                42         9                    56/7/2                                 21/41/0

Analyses
To investigate the relationship between FIP ability and DDP, we considered performance on lab- and police-based procedures (beSure; Ramon and Rjosk13) and deepfake stimuli across both the single-video and 2AFC experiments. All such analyses were performed using the software/language R (R Core Team18), using a zero-and-one inflated beta regression model (BEINF), implemented via the gamlss package (Rigby and Stasinopoulos19), for the single-video experiment (each trial of which has a fractional [0,1] outcome) and a multilevel logistic model for the 2AFC experiment (each trial of which has a binary correct/incorrect outcome). The BEINF model is generally employed in statistical analyses when the outcome variable of interest is continuous but bounded within a specific interval and when the data also exhibit a nonstandard distribution within that interval, such as pronounced skewness or the presence of peaks at the boundaries, which are common in proportion or percentage data. For this model, we create an average of the individual trial results and regress it onto the group variable.
The multilevel binomial logistic regression model, implemented using the lme4 package (Bates et al.20), was employed to examine the effect of group membership on a binary outcome (that is, correct versus incorrect responses) while accounting for the nonindependence of repeated measures within individuals. Specifically, the model incorporates a fixed effect for the group variable to assess its influence on the likelihood of a correct response and a random intercept for participants to model the variability in baseline log odds of success across participants, thereby accommodating the repeated measures design.
A key characteristic of the BEINF model is its ability to accommodate data with excess zeros or ones, a phenomenon often referred to as inflation at the boundaries. Traditional models, such as the beta regression, are well suited for continuous outcomes constrained within (0, 1); however, they struggle with boundary inflation because they assume that the distribution of the outcome variable is smooth across the entire interval. The BEINF model extends the beta regression by incorporating parameters that explicitly model the probability of observing these boundary values, thus providing a more nuanced understanding of the data distribution.
A full set of comparisons can be found in the supplementary materials. Given that we undertook multiple comparisons, we used an adjusted alpha level of 0.001. Reaction time (RT) data (measured in seconds) were winsorized (between the fifth and 95th percentiles, to deal with outliers where participants left the response survey software open without participating) and then z-scored to improve model convergence. Altogether, our analyses aimed at answering three distinct questions.
First, we investigated whether there is a relationship between DDP and processing duration, that is, a correlation between accuracy and RTs in the newly acquired data (lab-identified SRs and Berlin Police officers). This served to determine the potential presence of speed–accuracy tradeoffs that could account for the obtained findings.
Second, labeling observers categorically, we asked whether individuals identified as lab-identified SRs (Ramon9) excel at deepfake detection relative to neurotypical observers reported previously (Groh et al.8 and Ramon and Vowels10). We combine all available data and consider as SRs those observers who met the proposed lab-based criteria for SR identification (Ramon9), including those among Berlin Police officers.
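The RT preprocessing described above (winsorizing to the 5th–95th percentile range, then z-scoring) can be sketched as follows. This is an illustrative Python translation with hypothetical variable names; the reported analyses themselves were run in R.

```python
# Sketch of the RT preprocessing described above: winsorize to the
# 5th-95th percentile range, then z-score. Illustrative only; the
# study's actual analyses were performed in R.
import numpy as np

def preprocess_rts(rts):
    rts = np.asarray(rts, dtype=float)
    lo, hi = np.percentile(rts, [5, 95])
    clipped = np.clip(rts, lo, hi)   # winsorize: cap extreme values at the bounds
    return (clipped - clipped.mean()) / clipped.std()  # z-score

rts = [1.2, 0.9, 1.5, 2.0, 1.1, 240.0]  # last value: response window left open
z = preprocess_rts(rts)
print(z.mean())  # approximately 0 after standardization
```

Capping at the percentile bounds keeps extreme "walked away" RTs from dominating the scale, and the subsequent z-scoring puts the predictor on a unit scale, which aids model convergence.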


Note that additional analyses performed for all subgroup-wise comparisons are provided in the supplementary materials in the accompanying OSF project.
Third, and finally, we sought to determine the potential relationship between FIP ability and DDP by considering police officers' FIP ability in a continuous manner, through their previously measured performance across all five subtests of the bespoke police tool beSure (Ramon and Vowels10 and Ramon and Rjosk13). To this end, we first performed linear regressions for performance in the deepfake experiments and beSure performance. Additionally, given the possibility that the relationships may be nonlinear, we also explored whether a data-driven approach would indicate predictive potential. To this end, we undertook the same regressions for the single-video and 2AFC experiments—but this time with a random forest (Breiman21). Random forests are a type of data-adaptive, nonparametric, tree-based machine learning algorithm that learns a function mapping from the predictors to the dependent variable. The forest element refers to the fact that multiple trees are used, each of which is trained on a bootstrapped subsample of the input data and input variables. This bootstrapping process helps to prevent overfitting, a phenomenon whereby data-adaptive approaches tend to learn ungeneralizable functions that exhibit good performance only on the data on which they are trained.
For the random forest, we use the sklearn implementation (Pedregosa et al.22) with its default values, which have been shown to yield consistently good performance across a range of tasks without needing hyperparameter tuning (Probst et al.23). Specifically, the core hyperparameters were as follows: number of estimators, 100; maximum features, all; maximum depth, unlimited; minimum sample split, two; and criterion, squared error. No experiments were undertaken to evaluate whether better hyperparameters could be identified (we assume that the algorithm is already substantially more flexible than the alternative linear regressors under comparison). We follow a leave-one-out cross-validation process to evaluate the out-of-sample mean-squared-error performance of the random forest and compare it to a "dummy" regressor, which simply predicts the average value of the outcome.

Results

Relationship Between Performance and RTs
First, we explored the extent to which RTs in both experiments would be predictive of DDP. Specifically, we aimed to determine whether higher performance accuracy could be accounted for by prolonged RTs, that is, a speed–accuracy tradeoff.
To this end, for the 2AFC experiment, we fit the multilevel logistic model to the data to assess the relationship between DDP (correct/incorrect) and a standardized RT while accounting for random intercepts associated with individual users. In terms of the fixed effects, the (standardized) RT was negatively associated with the log odds of correct deepfake detection (B = −0.373, SE = 0.029, z = −12.97, p < 0.001). Here, "B" represents the fixed effect regression coefficient for the standardized RT, indicating its effect on the log odds of correctly detecting a deepfake. "SE" is the standard error of the estimate for "B," quantifying the uncertainty/variability. The "z" value serves as the test statistic for assessing the significance of the effect, and the associated "p" value indicates the probability of observing such an effect (or a stronger one) under the assumption that there is, in fact, no association.
Taking the exponent of the fixed effect "B," we get an odds ratio of approximately 0.69. In other words, for every one-standard-deviation increase in the RT, the odds of correctly detecting a deepfake are decreased by about 31% relative to the odds of someone reacting in an average amount of time. It is important to note that this association between an increased RT and decreased detection accuracy does not imply causality. The observed relationship might suggest that longer RTs are linked to greater uncertainty in distinguishing deepfakes, potentially because more challenging decisions require longer deliberation. However, this interpretation is speculative, and further research would be necessary to explore the underlying mechanisms.
On the other hand, for the single-video tasks, which have a fractional performance measure [0,1], we use a zero-and-one inflated beta generalized additive regression model (Stasinopoulos et al.24), which we fit, again, to assess the association between the standardized RT and performance. The main coefficient to be evaluated is μ (estimate = −0.013, SE = 0.048, t = −0.272, p = 0.786). An interpretation of these results follows in a similar manner to those for the multilevel model. Here, "t" is the test statistic rather than "z." These results indicate that there is no significant relationship between the RT and the expected score—the threshold for significance is taken to be α = 0.05, and the value of "p" is above this.
Taken together, analyses for both the single-video and the 2AFC experiments have ruled out speed–accuracy tradeoffs. If anything, we observed the opposite pattern—lower performance associated with prolonged RTs. Therefore, only performance accuracy was considered in further analyses.
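The odds-ratio arithmetic reported above can be reproduced directly. This is a worked check of the reported coefficient, not the study's code.

```python
# Worked check of the odds-ratio interpretation of the reported
# fixed-effect coefficient (not the study's analysis code).
import math

B = -0.373                   # reported fixed effect for the standardized RT
odds_ratio = math.exp(B)     # multiplicative change in odds per +1 SD of RT
percent_change = (1 - odds_ratio) * 100

print(round(odds_ratio, 2))      # 0.69
print(round(percent_change, 1))  # ~31.1% decrease in the odds
```

Because the logistic model is linear in log odds, exponentiating a coefficient gives the multiplicative change in the odds for a one-unit (here, one-standard-deviation) change in the predictor.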



Group Differences: SRs Versus Controls
The relationship between independently measured FIP ability and DDP was first investigated by categorizing observers according to their SR status. Recall that observers originated from different groups: 1) previously reported SRs (Ramon9), and Berlin Police officers who met the lab criteria and those who did not, and 2) recruited and nonrecruited observers reported previously (Groh et al.8). We combined all non-SR and SR data, respectively, to investigate potential group differences in DDP. A bar and scatter plot for the comparison can be seen in Figure 1.
In addition, for the 2AFC experiment, we fit a generalized linear mixed model (with a binomial/logistic link function) to the data to assess the relationship between DDP (correct/incorrect) and SR status while accounting for random effects associated with individual users. The likelihood of a correct response for SRs was not significantly different from that of the reference group (B = 0.082, SE = 0.082, z = 1.001, p = 0.317).
Similarly, for the single-video tasks, which have a fractional performance measure [0,1], we use a zero-and-one inflated beta generalized additive regression model (Stasinopoulos et al.24). For the μ component, which represents the mean of the average score, and recalling our adjusted alpha level of 0.001 in light of the complete set of group comparisons undertaken and presented in the supplementary materials, the likelihood of a correct response for SRs was not significantly different from that of the reference group (B = 0.156, SE = 0.077, t = 2.019, p = 0.046).

Individual Differences: Continuous Measure of FIP via beSure
Finally, focusing on Berlin Police officers, we investigated the relationship between individual differences in DDP and their FIP ability as measured by the five subtests of beSure (Ramon and Vowels10).
First, we investigated potential linear relationships via Spearman correlations between the z-standardized averages across the beSure subtest rank performances and observers' rankings in the single-video and 2AFC experiments. For this analysis, results from the single-video and 2AFC trials were averaged to generate a singular summary score for each participant's performance, which then served as the dependent variable in the respective regression. For the single-video experiment, the rank-order correlation with the beSure performance ranking was −0.03 (p = 0.84). For the 2AFC experiment, the rank-order correlation with the beSure performance ranking was 0.12 (p = 0.27). As such, no significant relationship was identified. Comprehensive multiple regression outcomes for both experiments are presented in Tables 2 and 3, respectively. The sole significant predictor was beSure Subtest 4 accuracy for the single-video experiment, B = −0.032, t(5) = −2.790, p = 0.007—however, with an effect in the opposite direction to what one might expect. Nevertheless, due to the modest R2 values for both the single-video and 2AFC experiments (0.123 and 0.053, respectively), we abstain from interpreting this specific finding.

Table 2. Linear regression results for single-video experiment performance as the dependent variable with beSure subtest performance as predictors.

Predictor            Coefficient   SE      t-Value   p-Value
Constant             0.755         0.01    77.09     <0.001
Subtest 1 accuracy   0.004         0.013   0.265     0.792
Subtest 2 accuracy   0.01          0.015   0.712     0.48
Subtest 3 accuracy   0.004         0.013   0.298     0.767
Subtest 4 accuracy   −0.032        0.012   −2.790    0.007
Subtest 5 accuracy   0.006         0.012   0.516     0.608
R2                   0.123
Adjusted R2          0.049

Table 3. Linear regression results for 2AFC experiment performance as the dependent variable with beSure subtest performance as predictors.

Predictor            Coefficient   SE      t-Value   p-Value
Constant             0.804         0.011   74.61     <0.001
Subtest 1 accuracy   0.004         0.015   0.25      0.803
Subtest 2 accuracy   0.012         0.015   0.824     0.412
Subtest 3 accuracy   0.002         0.014   0.169     0.866
Subtest 4 accuracy   −0.013        0.013   −0.969    0.335
Subtest 5 accuracy   0.017         0.013   1.301     0.197
R2                   0.053
Adjusted R2          −0.004

Second, we addressed potential nonlinear relationships via regressions performed with a random forest (Breiman21) for both experiments. For the single-video experiment regression, we find that the random forest has an out-of-sample mean-squared error of 0.0072, while the mean-squared error for the dummy regressor is 0.0067. Similarly, the random forest out-of-sample mean-squared error for the 2AFC experiment was 0.0121, while that of the dummy regressor was 0.0104. As such, in both cases, the dummy regressor performed better than the random forest (lower mean-squared error). These results suggest that even a relatively powerful data-adaptive algorithm is able to predict neither single-video nor 2AFC experiment performance via an independent and continuous measure of FIP ability derived from all five beSure subtests (Ramon and Vowels10 and Ramon and Rjosk13).
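The leave-one-out comparison between a default random forest and a mean-predicting dummy regressor can be sketched as follows. The data here are synthetic stand-ins (five noise predictors, an unrelated outcome), not the study's beSure scores.

```python
# Sketch of the LOOCV comparison described above: a default sklearn
# random forest versus a dummy regressor predicting the training mean.
# Synthetic stand-in data; not the study's beSure subtest scores.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                   # 5 predictors (cf. five beSure subtests)
y = rng.normal(loc=0.75, scale=0.1, size=40)   # outcome unrelated to the predictors

loo = LeaveOneOut()
mse = {}
for name, model in [("forest", RandomForestRegressor(random_state=0)),
                    ("dummy", DummyRegressor(strategy="mean"))]:
    # Each LOO fold scores one held-out point; negating and averaging
    # the scores gives the out-of-sample mean-squared error.
    scores = cross_val_score(model, X, y, cv=loo,
                             scoring="neg_mean_squared_error")
    mse[name] = -scores.mean()

# Out-of-sample MSEs (values depend on the synthetic data); with
# pure-noise predictors, the dummy typically does at least as well.
print(round(mse["dummy"], 4), round(mse["forest"], 4))
```

When the predictors carry no signal, the flexible model cannot beat the unconditional mean out of sample, which mirrors the pattern reported above for the beSure subtests.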


Discussion
Society is confronted with increasing amounts of digital misinformation and a lack of solutions for its detection. Compared to the number of studies reporting automatic solutions developed toward this end, empirical studies of human ability for deepfake detection remain severely limited. Moreover, existing studies have not considered two potential determinants of deepfake detection performance: stable individual differences in face identity processing ability and professional occupation. To address this knowledge gap, we leveraged access to two unique groups of human observers: previously reported SRs and motivated officers from within the entire group of ~18,000 employed by the Berlin Police (Ramon,9 Ramon and Vowels,10 and Mayer and Ramon11). The latter had participated in beSure (Ramon and Vowels10 and Ramon and Rjosk13)—the only existing police FIP assessment tool using authentic police material. In this manner, we could relate DDP to two independent, challenging, and complementary means of FIP assessment. In light of the challenges that synthetic misinformation represents, we sought to expand our understanding of the human limits for facial deepfake detection.

No Evidence of Speed–Accuracy Tradeoffs
Independently of FIP ability, we sought to determine whether DDP is characterized by speed–accuracy tradeoffs. It is conceivable that high performance could be attributed to the depth with which individuals opt to process information. In this case, high performance would come at the expense of prolonged RTs. On the other hand, an absence of such speed–accuracy tradeoffs would suggest that other factors may be more meaningful determinants of observers' deepfake detection. Overall, across diverse cohorts, we did not find a speed–accuracy tradeoff, that is, improved performance associated with prolonged processing (that is, response) time. If anything, for the 2AFC experiment, performance deteriorated with processing time, while no relationship was found for the single-video experiment.

SRs Versus Controls
Next, we sought to determine whether stable differences in FIP ability might affect DDP. To this end, we examined whether individuals categorized as SRs according to previously proposed lab-based diagnostic procedures (Ramon9) would outperform those who were not. Indeed, recent evidence has demonstrated that SRs excel at forensic perpetrator identification (Mayer and Ramon11). Moreover, they outperform non-SRs in challenging identity-matching scenarios measured via beSure, the only FIP assessment tool that involves authentic police material (Ramon and Vowels10 and Ramon and Rjosk13). It is thus conceivable that SRs' superiority extends to the detection of synthetic disinformation.
We analyzed an extensive dataset of single-trial responses from the 2AFC and single-video experiments. Observers belonged to two groups: 1) civilians or Berlin Police officers identified as SRs via lab tests (Ramon9 and Ramon and Vowels10), who represent the core of a deep-data neuroscientific research agenda pursued in the Applied Face Cognition Lab (https://afclab.org/), and 2) non-SRs, who were previously reported neurotypical observers (Groh et al.8) and officers of the Berlin Police who did not meet the SR criteria (Ramon9). The results indicate that DDP was not related to group membership.
These findings may be accounted for by the stimulus material presented and used. SRs outperform controls when the processing of static images of faces is required (Ramon,9 Ramon and Vowels,10 and Ramon and Rjosk13). Here, however, observers judged dynamic stimuli. The availability of motion information may have leveled the field across observers.

Individual Differences in FIP in Police Officers
To complement the categorical approach comparing SRs to non-SRs, our final analysis concentrated on police officers, who had undergone testing of FIP ability via a novel bespoke police tool: beSure (Ramon and Vowels10 and Ramon and Rjosk13). This was done to address whether a potential association between FIP ability and DDP would require a more sensitive individual differences approach. This continuous analytical approach again provided a null finding; officers' DDP was unrelated to their FIP ability rank determined via the challenging five subtests of beSure (Ramon and Vowels10 and Ramon and Rjosk13).

Limitations and Future Outlook
Collectively, our results suggest that neither increased processing time, which can be considered a proxy for motivation, nor FIP ability measured via two independent approaches is associated with DDP. These findings emerge within a large, diverse, and unique group of observers, which we believe represents society at large as well as motivated law enforcement professionals.
An important consideration concerns the different numbers of trials completed across participant subgroups. For the first two analyses, we combined the previously reported dataset (Groh et al.8) with our newly acquired one. According to Groh et al.,8 "[r]ecruited participants [were] asked to view 20 videos while nonrecruited participants [could] view up to 45 videos." Provided uninterrupted participation, observers of the present cohort were exposed to the complete set of deepfake stimuli. As such, we cannot rule out a greater learning effect for these observers. However, these considerations do not hold for the third analysis, which was performed exclusively on Berlin Police
responses solicited in a single-video and a 2AFC officers’ data. Here, we also did not find any significant

74 IEEE Security & Privacy May/June 2024


association between ability and DDP. [However, since we did not know a priori whether any differences would emerge, further work testing for null effects (for example, via equivalence tests; Lakens25) is required.] One obvious caveat is that our findings are linked to the stimulus material used, which represents a subsample of instances submitted to the Deepfake Detection Challenge (https://www.kaggle.com/c/deepfake-detection-challenge) (see Groh et al.8). Since this challenge, the number of solutions available for deepfake creation has increased substantially. This means that today’s deepfakes will vary much more in terms of their quality and likely detection difficulty. Indeed, it is possible that facial deepfake stimuli created using state-of-the-art approaches might be processed more proficiently by individuals with high(er) FIP ability. However, the stimuli used here are arguably the most extensively studied among humans (Groh et al.8).

Another open question concerns within-observer reliability—or consistency in judging the same deepfake stimulus. Given repeated exposure to the same stimuli, it is possible that the consistency of observers’ judgments is related to their FIP ability (Ramon9). Relatedly, previous work has shown that severely impaired individuals especially benefit from instructions that reveal task-relevant diagnostic information (for example, Ramon and Rossion26). Thus, future work should address the extent to which prior information on, or familiarity with, deepfakes could affect observers’ performance—and potentially interact with individual differences in FIP ability. Finally, facial deepfakes may be combined with audio content, which in isolation can facilitate or hamper DDP (Groh et al.27). Potentially, the detection of deepfakes involving both audio and visual information could relate to stable individual differences in multisensory integration.

Acknowledgement
Meike Ramon thanks Simon Rjosk and the Berlin Police, in particular, the Center for Innovation and Science Management, for the fruitful long-standing collaboration and shared dedication to scientific quality and transparency, and all participating police officers for their support and service. Meike Ramon is supported by a Swiss National Science Foundation PRIMA (Promoting Women in Academia) Grant (PR00P1 179872). M. R. and M. G. conceived the experiments; M. R. conducted face identity processing assessments; M. G. conducted the deepfake experiments; M. R. and M. V. analyzed face identity processing assessment data; M. V. and M. G. analyzed the deepfake experimental data; M. R. wrote the initial manuscript, and all authors edited the final version. The anonymized research data reported (that is, behavioral responses acquired from previously identified SRs and Berlin Police officers) and the analysis code can be found on the accompanying OSF project (https://osf.io/zw7vm/). For previously published data used here, please refer to the original report (Groh et al.8). This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Massachusetts Institute of Technology’s Committee on the Use of Humans as Experimental Subjects (performed in line with exemption identification number E-3354) and the University of Fribourg’s Ethics Committee (approval number 473).

References
1. D. J. Kersten, P. Mamassian, and A. L. Yuille, “Object perception as Bayesian inference,” Annu. Rev. Psychol., vol. 55, no. 1, pp. 271–304, 2004. [Online]. Available: https://api.semanticscholar.org/CorpusID:2230247
2. H. Farid, “Creating, using, misusing, and detecting deep fakes,” J. Online Trust Saf., vol. 1, no. 4, pp. 1–33, 2022, doi: 10.54501/jots.v1i4.56.
3. P. Pataranutaporn et al., “AI-generated characters for supporting personalized learning and well-being,” Nature Mach. Intell., vol. 3, no. 12, pp. 1013–1022, 2021, doi: 10.1038/s42256-021-00417-9.
4. “European leaders targeted by deepfake video calls imitating mayor of Kyiv,” The Guardian, Jun. 2022. [Online]. Available: https://www.theguardian.com/world/2022/jun/25/european-leaders-deepfake-video-calls-mayor-of-kyiv-vitali-klitschko
5. S. Dunn, “Women, not politicians, are targeted most often by deepfake videos,” Centre for International Governance Innovation, Waterloo, ON, Canada, 2021. [Online]. Available: https://www.cigionline.org/articles/women-not-politicians-are-targeted-most-often-deepfake-videos/
6. R. M. Chesney and D. K. Citron, “Deep fakes: A looming challenge for privacy, democracy, and national security,” California Law Rev., vol. 107, p. 1753, Jul. 2018. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3213954
7. F. Wichmann and R. Geirhos, “Are deep neural networks adequate behavioural models of human visual perception?” Annu. Rev. Vis. Sci., vol. 9, no. 1, pp. 501–524, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:257880120
8. M. Groh, Z. Epstein, C. Firestone, and R. W. Picard, “Deepfake detection by human crowds, machines, and machine-informed crowds,” Proc. Nat. Acad. Sci. USA, vol. 119, no. 1, 2021, Art. no. e2110013119, doi: 10.1073/pnas.2110013119.
9. M. Ramon, “Super-recognizers – A novel diagnostic framework, 70 cases, and guidelines for future work,”


Neuropsychologia, vol. 158, Jul. 2021, Art. no. 107809, doi: 10.1016/j.neuropsychologia.2021.107809.
10. M. Ramon and M. J. Vowels, “Large-scale super-recognizer identification in the Berlin Police,” OSF, 2023, doi: 10.31234/osf.io/x6ryw.
11. M. Mayer and M. Ramon, “Improving forensic perpetrator identification with super-recognizers,” Proc. Nat. Acad. Sci. USA, vol. 120, no. 20, 2023, Art. no. e2220580120, doi: 10.1073/pnas.2220580120.
12. M. Ramon, A. Bobak, and D. White, “Super-recognizers: From the lab to the world and back again,” Br. J. Psychol., vol. 110, no. 3, pp. 461–479, 2019, doi: 10.1111/bjop.12368.
13. M. Ramon and S. Rjosk, beSure—Berlin Test for Super-Recognizer Identification: Part I: Development. Frankfurt am Main, Germany: Verlag für Polizeiwissenschaft, 2022. [Online]. Available: https://www.polizeiwissenschaft.de/suche?query=978-3-86676-762-1
14. J. Nador, T. Alsheimer, A. Gay, and M. Ramon, “Image or identity? Only super-recognizers’ (memor)ability is consistently viewpoint-invariant,” Swiss Psychol. Open, vol. 1, no. 1, pp. 1–15, 2021, doi: 10.5334/spo.28.
15. J. Nador, M. Zoia, M. Pachai, and M. Ramon, “Psychophysical profiles in super-recognizers,” Sci. Rep., vol. 11, no. 1, 2021, Art. no. 13184, doi: 10.1038/s41598-021-92549-6.
16. M. Linka, M. D. Broda, T. A. Alsheimer, B. de Haas, and M. Ramon, “Characteristic fixation biases in super-recognizers,” J. Vis., vol. 22, no. 8, p. 17, 2022, doi: 10.1167/jov.22.8.17.
17. Prolific, 2021. [Online]. Available: https://www.prolific.com/
18. R Core Team, “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, Version 4.1.0, 2021. [Online]. Available: https://www.R-project.org/
19. R. A. Rigby and D. M. Stasinopoulos, “Generalized additive models for location, scale and shape,” J. Roy. Statistical Soc. C (Applied Statistics), vol. 54, no. 3, pp. 507–554, 2005, doi: 10.1111/j.1467-9876.2005.00510.x.
20. D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” J. Statistical Softw., vol. 67, no. 1, pp. 1–48, 2015, doi: 10.18637/jss.v067.i01.
21. L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
22. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, no. 85, pp. 2825–2830, 2011.
23. P. Probst, M. Wright, and A. Boulesteix, “Hyperparameters and tuning strategies for random forest,” WIREs Data Mining Knowl. Discovery, vol. 9, no. 3, 2018, Art. no. e1301, doi: 10.1002/widm.1301.
24. M. D. Stasinopoulos, R. A. Rigby, G. Z. Heller, V. Voudouris, and F. D. Bastiani, Flexible Regression and Smoothing: Using GAMLSS in R. Boca Raton, FL, USA: CRC Press, 2017.
25. D. Lakens, “Equivalence tests: A practical primer for t tests, correlations, and meta-analyses,” Social Psychol. Personality Sci., vol. 8, no. 4, pp. 355–362, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:39946329
26. M. Ramon and B. Rossion, “Impaired processing of relative distances between features and of the eye region in acquired prosopagnosia—Two sides of the same holistic coin?” Cortex, vol. 46, no. 3, pp. 374–389, 2010, doi: 10.1016/j.cortex.2009.06.001.
27. M. Groh, A. Sankaranarayanan, N. Singh, D. Y. Kim, A. Lippman, and R. Picard, “Human detection of political speech deepfakes across transcripts, audio, and video,” Papers With Code. [Online]. Available: https://paperswithcode.com/paper/human-detection-of-political-deepfakes-across

Meike Ramon is a Swiss National Science Foundation Promoting Women in Academia group leader and an assistant professor. She leads the Applied Face Cognition Lab and directs the Cognitive and Affective Regulation Laboratory at the University of Lausanne, 1015 Lausanne, Switzerland. Her research interests include face processing and recognition, cognitive neuroscience, and its applications in government and industry. Ramon received a Ph.D. focused on personally familiar face processing in the healthy and damaged brain from UCLouvain. She is a board member of the Association for Independent Research. Contact her at meike.ramon@gmail.com.

Matthew Vowels is a junior lecturer at the Institute of Psychology at the University of Lausanne, 1015 Lausanne, Switzerland; a visiting research fellow at the Centre for Vision, Speech and Signal Processing, University of Surrey; and a senior researcher at The Sense Innovation and Research Center in the department of radiology of the Lausanne University Hospital. His research interests include machine learning, computer vision, causality, and statistics. Contact him at matthew.vowels@unil.ch.

Matthew Groh is a Donald P. Jacobs Scholar and assistant professor in the Department of Management and Organizations at Kellogg School of Management and, by courtesy, the Department of Computer Science at McCormick School of Engineering, Northwestern University, Evanston, IL 60208 USA. His research interests include human-AI collaboration, computational social science, affective computing, deepfakes, and generative AI. Groh received a Ph.D. in media arts and sciences from MIT. Contact him at matthew.groh@kellogg.northwestern.edu.

