Deepfake Detection by Police Experts
Meike Ramon | University of Lausanne and AIR – Association for Independent Research
Matthew Vowels | University of Lausanne, Lausanne University Hospital and University of Lausanne,
and The Sense Innovation and Research Center
Matthew Groh | Northwestern University
We examined human deepfake detection performance (DDP) in relation to face identity processing
ability among Berlin Police officers, including Super-Recognizers (SRs). While we find no relationship,
further research into human DDP using state-of-the-art static deepfakes is needed to establish the
potential value of SR deployment.
www.computer.org/security 69
SYNTHETIC REALITIES AND ARTIFICIAL INTELLIGENCE-GENERATED CONTENTS
potentially wide-ranging societal implications. Such knowledge is particularly pertinent for organizations that are expected to monitor and mitigate threats by deepfakes: law enforcement professionals. Therefore, in this study, we investigated the impact of human factors—professional occupation and individual differences in face identity processing ability—on deepfake detection performance. We did so by testing two unique cohorts of observers: previously reported SRs (Ramon9) and law enforcement professionals from within the 18,000 officers of the Berlin Police (Ramon and Vowels10). Their performance was measured using identical stimulus material, experimental settings, and neurotypical control observers' data as reported previously (Groh et al.8).

Methods
This research complies with all relevant ethical regulations, and the Massachusetts Institute of Technology's Committee on the Use of Humans as Experimental Subjects approved the deepfake detection portion of this study as Exempt Category 3 – Benign Behavioral Intervention. This study's exemption identification number is E-3354. All procedures and protocols were approved by the University of Fribourg's Ethics Committee (approval number 473) and conducted in accordance with both their guidelines as well as those set forth in the Declaration of Helsinki. All participants were healthy volunteers, provided informed written consent, and were not financially compensated for their participation.

Experiments
Participants were invited to participate in two deepfake detection tests reported previously by Groh et al.8 and exemplified in Figure 1(a). The first experiment involves presenting two stimuli in a 2AFC design; the second presents a single stimulus. Observers are required to decide which of the two stimuli in the 2AFC setting represents a deepfake and to report their confidence in the single-video stimulus being a deepfake. Participants could complete as many trials as they wished. The full 2AFC and single-video experiments comprised a total of 56 and 56 trials, respectively (for full details, see Groh et al.8).

Participants
The data reported in this study originated from different sources. First, data published previously by Groh et al.8 included nonrecruited observers (who arrived at the website via organic links on the Internet) and observers recruited from Prolific.17 These data were considered as representing neurotypical controls (as no independent measure of their FIP ability was available). Second, data from lab-identified SRs reported previously (Ramon9) and thereafter using
[Figure 1 panels: (a) example stimuli; (b) accuracy for SRs versus non-SRs (dark and light grey) in the 2AFC and single-video experiments; (c) z-scored performance across beSure, 2AFC, and single-video.]
Figure 1. Stimuli and results for DDP. (a) Example stimuli presented in the 2AFC (left) and single-video (right) experiments.
(Source: Adapted from Groh et al.8) (b) DDP for each of the two experiments for SRs and control observers (dark and light
grey). (c) Relationship between different performance measures along the x-axis: performance across beSure (left) and
both deepfake experiments (middle and right). Colors indicate beSure performance rank to visualize the (in)dependence
between FIP ability measured by beSure and observers' performance for the deepfake experiments.
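The z-scoring underlying Figure 1(c) can be sketched as follows: each observer's raw score on a task is standardized against that task's mean and standard deviation, so performance becomes comparable across beSure, 2AFC, and single-video. The accuracy values below are hypothetical, not the study's data.

```python
# Minimal sketch of per-task z-scoring; the accuracies are HYPOTHETICAL.
import numpy as np

acc_2afc = np.array([0.52, 0.61, 0.58, 0.70, 0.66])  # five hypothetical observers

# Standardize against the task's own mean and (population) SD.
z = (acc_2afc - acc_2afc.mean()) / acc_2afc.std()

print(np.round(z, 2))  # centered on 0; extremes mark best/worst observers
```

Because each task is standardized separately, an observer's rank on one task can be compared directly with their rank on another, which is what makes the (in)dependence in Figure 1(c) visible.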
comparisons are provided in the supplementary materials in the accompanying OSF project.

Third, and finally, we sought to determine the potential relationship between FIP ability and DDP by considering police officers' FIP ability in a continuous manner through their previously measured performance across all five subtests of the bespoke police tool beSure (Ramon and Vowels10 and Ramon and Rjosk13). To this end, we first performed linear regressions for performance in deepfake experiments and beSure performance. Additionally, given the possibility that the relationships may be nonlinear, we also explored whether a data-driven approach would indicate predictive potential. To this end, we undertook the same regressions for the single-video and 2AFC experiments—but this time with a random forest (Breiman21). Random forests are a type of data-adaptive, nonparametric, tree-based machine learning algorithm that learns a function mapping from the predictors to the dependent variable. The forest element refers to the fact that multiple trees are used, each of which is trained on a bootstrapped subsample of the input data and input variables. This bootstrapping process helps to prevent overfitting, a phenomenon whereby data-adaptive approaches tend to learn ungeneralizable functions that perform well only on the data on which they are trained.

For the random forest, we use the sklearn implementation (Pedregosa et al.22) with its default values, which have been shown to yield consistently good performance across a range of tasks without needing hyperparameter tuning (Probst et al.23). Specifically, the core hyperparameters were as follows: number of estimators, 100; maximum features, all; maximum depth, unlimited; minimum sample split, two; and criterion, squared error. No experiments were undertaken to evaluate whether better hyperparameters could be identified (we assume that the algorithm is already substantially more flexible than the alternative linear regressors under comparison). We follow a leave-one-out cross-validation process to evaluate the out-of-sample mean-squared-error performance of the random forest and compare it to a "dummy" regressor, which simply predicts the average value of the outcome.

Results

Relationship Between Performance and RTs
First, we explored the extent to which RTs in both experiments would be predictive of DDP. Specifically, we aimed to determine whether higher performance accuracy could be accounted for by prolonged RTs, that is, a speed–accuracy tradeoff.

To this end, for the 2AFC experiment, we fit a multilevel logistic model to the data to assess the relationship between DDP (correct/incorrect) and a standardized RT while accounting for random intercepts associated with individual users. In terms of the fixed effects, the (standardized) RT was negatively associated with the log odds of correct deepfake detection (B = −0.373, SE = 0.029, z = −12.97, p < 0.001). Here, "B" represents the fixed-effect regression coefficient for the standardized RT, indicating its effect on the log odds of correctly detecting a deepfake. "SE" is the standard error of the estimate for "B," quantifying its uncertainty/variability. The "z" value serves as the test statistic for assessing the significance of the effect, and the associated "p" value indicates the probability of observing such an effect (or a stronger one) under the assumption that there is, in fact, no association.

Taking the exponent of the fixed effect "B," we get an odds ratio of approximately 0.69. In other words, for every one-standard-deviation increase in the RT, the odds of correctly detecting a deepfake are decreased by about 31% relative to the odds of someone reacting in an average amount of time. It is important to note that this association between an increased RT and decreased detection accuracy does not imply causality. The observed relationship might suggest that longer RTs are linked to greater uncertainty in distinguishing deepfakes, potentially because more challenging decisions require longer deliberation. However, this interpretation is speculative, and further research would be necessary to explore the underlying mechanisms.

On the other hand, for the single-video tasks, which have a fractional performance measure [0,1], we use a zero-and-one-inflated beta generalized additive regression model (Stasinopoulos et al.24), which we fit to, again, assess the association between the standardized RT and performance. The main coefficient to be evaluated is μ (estimate = −0.013, SE = 0.048, t = −0.272, p = 0.786). An interpretation of these results follows in a similar manner to those for the multilevel model. Here, "t" is the test statistic rather than "z." These results indicate that there is no significant relationship between the RT and the expected score—the threshold for significance is taken to be α = 0.05, and the value of "p" is above this.

Taken together, analyses for both the single-video and the 2AFC experiment have ruled out speed–accuracy tradeoffs. If anything, we observed the opposite pattern—lower performance associated with prolonged RTs. Therefore, only performance accuracy was considered in further analyses.

Group Differences: SRs Versus Controls
The relationship between independently measured FIP ability and DDP was first investigated by categorizing observers according to their SR status. Recall that observers originated from different groups: 1) previously reported SRs (Ramon9) and Berlin Police officers who met the lab criteria and those who did not, and 2) recruited and nonrecruited observers reported
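The data-driven check described in the Methods—a default-parameter scikit-learn random forest evaluated via leave-one-out cross-validation against a mean-predicting "dummy" regressor—can be sketched as follows. The data here are hypothetical stand-ins for beSure scores and DDP, not the study's data.

```python
# Sketch: default random forest vs. a mean-predicting dummy regressor,
# both scored by leave-one-out cross-validated mean squared error.
# X and y are HYPOTHETICAL stand-ins for beSure performance and DDP.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))  # hypothetical predictor (e.g., FIP ability)
y = rng.normal(size=40)       # hypothetical outcome, unrelated to X

loo = LeaveOneOut()
forest = RandomForestRegressor(n_estimators=100, random_state=0)  # sklearn defaults
dummy = DummyRegressor(strategy="mean")

# Out-of-sample MSE under leave-one-out cross-validation.
mse_forest = -cross_val_score(forest, X, y, cv=loo,
                              scoring="neg_mean_squared_error").mean()
mse_dummy = -cross_val_score(dummy, X, y, cv=loo,
                             scoring="neg_mean_squared_error").mean()

# If the flexible forest cannot beat the dummy, the predictor carries no
# detectable (even nonlinear) signal about the outcome.
print(f"LOO MSE, random forest: {mse_forest:.3f}")
print(f"LOO MSE, dummy (mean):  {mse_dummy:.3f}")
```

The comparison against the dummy regressor is the key design choice: it turns "the random forest found nothing" into a concrete baseline test rather than an absolute error number.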
detection. Compared to the number of studies reporting automatic solutions developed toward this end, empirical studies of human ability for deepfake detection remain severely limited. Moreover, existing studies have not considered two potential determinants of deepfake detection performance: stable individual differences in face identity processing ability and professional occupation. To address this knowledge gap, we leveraged access to two unique groups of human observers: previously reported SRs and motivated officers from within the entire group of ~18,000 employed by the Berlin Police (Ramon,9 Ramon and Vowels,10 and Mayer and Ramon11). The latter had participated in beSure (Ramon and Vowels10 and Ramon and Rjosk13)—the only existing police FIP assessment tool using authentic police material. In this manner, we could relate DDP to two independent, challenging, and complementary means of FIP assessment. In light of the challenges that synthetic misinformation represents, we sought to expand our understanding of the human limits for facial deepfake detection.

No Evidence of Speed–Accuracy Tradeoffs
Independently of FIP ability, we sought to determine whether DDP is characterized by speed–accuracy tradeoffs. It is conceivable that high performance could be attributed to the depth with which individuals opt to process information. In this case, high performance would come at the expense of prolonged RTs. On the other hand, an absence of such speed–accuracy tradeoffs would suggest that other factors may be more meaningful determinants of observers' deepfake detection. Overall, across diverse cohorts, we did not find a speed–accuracy tradeoff, that is, improved performance associated with prolonged processing (that is, response) time. If anything, for the 2AFC experiment, performance deteriorated with processing time, while no relationship was found for the single-video experiment.

SRs Versus Controls
Next, we sought to determine whether stable differences in FIP ability might affect DDP. To this end, we examined if individuals categorized as SRs according to previously proposed lab-based diagnostic procedures (Ramon9) would outperform those who were not. Indeed, recent evidence has demonstrated that SRs excel at forensic perpetrator identification (Mayer and Ramon11). Moreover, they outperform non-SRs in challenging identity-matching scenarios measured via beSure, the only FIP assessment tool that involves authentic police material (Ramon and Vowels10 and Ramon and Rjosk13). It is thus conceivable that SRs' superiority extends to the detection of synthetic disinformation.

We analyzed an extensive dataset of single-trial responses solicited in a single-video and a 2AFC experiment, respectively. Observers belonged to two groups: 1) civilians or Berlin Police officers identified as SRs via lab tests (Ramon9 and Ramon and Vowels10), who represent the core of a deep-data neuroscientific research agenda pursued in the Applied Face Cognition Lab (https://afclab.org/), and 2) non-SRs, who were previously reported neurotypical observers (Groh et al.8) and officers of the Berlin Police who did not meet the SR criteria (Ramon9). The results indicate that DDP was not related to group membership.

These findings may be accounted for by the stimulus material used. SRs outperform controls when the processing of static images of faces is required (Ramon,9 Ramon and Vowels,10 and Ramon and Rjosk13). Here, however, observers judged dynamic stimuli. The availability of motion information may have leveled the field across observers.

Individual Differences in FIP in Police Officers
To complement the categorical approach comparing SRs to non-SRs, our final analysis concentrated on police officers, who had undergone testing of FIP ability via a novel bespoke police tool: beSure (Ramon and Vowels10 and Ramon and Rjosk13). This was done to address whether a potential association between FIP ability and DDP would require a more sensitive individual-differences approach. This continuous analytical approach again provided a null finding; officers' DDP was unrelated to their FIP ability rank determined via the challenging five subtests of beSure (Ramon and Vowels10 and Ramon and Rjosk13).

Limitations and Future Outlook
Collectively, our results suggest that neither increased processing time, which can be considered a proxy for motivation, nor FIP ability measured via two independent approaches are associated with DDP. These findings emerge within a large, diverse, and unique group of observers, which we believe represents society at large as well as motivated law enforcement professionals.

An important consideration concerns the different number of trials completed across participants' subgroups. For the first two analyses, we combined the previously reported dataset (Groh et al.8) with our newly acquired one. According to Groh et al.,8 "[r]ecruited participants [were] asked to view 20 videos while nonrecruited participants [could] view up to 45 videos." Provided uninterrupted participation, observers of the present cohort were exposed to the complete set of deepfake stimuli. As such, we cannot rule out a greater learning effect for these observers. However, these considerations do not hold for the third analysis, which was performed exclusively on Berlin Police officers' data. Here, we also did not find any significant
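The odds-ratio interpretation of the speed–accuracy result reported above can be reproduced directly from the published coefficient: exponentiating the fixed effect B = −0.373 gives the multiplicative change in the odds of a correct detection per one-standard-deviation increase in RT.

```python
# Converting the reported logistic fixed effect into an odds ratio.
# B is the coefficient reported in the article's Results.
import math

B = -0.373
odds_ratio = math.exp(B)              # multiplicative change in odds per +1 SD of RT
pct_decrease = (1 - odds_ratio) * 100

print(f"odds ratio: {odds_ratio:.2f}")
print(f"odds decrease per SD of RT: {pct_decrease:.0f}%")
```

This recovers the article's figures: an odds ratio of about 0.69, that is, roughly a 31% decrease in the odds of a correct detection per standard deviation of additional response time.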
Neuropsychologia, vol. 158, Jul. 2021, Art. no. 107809, doi: 10.1016/j.neuropsychologia.2021.107809.
10. M. Ramon and M. J. Vowels, "Large-scale super-recognizer identification in the Berlin Police," OSF, 2023, doi: 10.31234/osf.io/x6ryw.
11. M. Mayer and M. Ramon, "Improving forensic perpetrator identification with super-recognizers," Proc. Nat. Acad. Sci. USA, vol. 120, no. 20, 2023, Art. no. e2220580120, doi: 10.1073/pnas.2220580120.
12. M. Ramon, A. Bobak, and D. White, "Super-recognizers: From the lab to the world and back again," Br. J. Psychol., vol. 110, no. 3, pp. 461–479, 2019, doi: 10.1111/bjop.12368.
13. M. Ramon and S. Rjosk, beSure—Berlin Test for Super-Recognizer Identification: Part I: Development. Frankfurt am Main, Germany: Verlag für Polizeiwissenschaft, 2022. [Online]. Available: https://www.polizeiwissenschaft.de/suche?query=978-3-86676-762-1
14. J. Nador, T. Alsheimer, A. Gay, and M. Ramon, "Image or identity? Only super-recognizers' (memor)ability is consistently viewpoint-invariant," Swiss Psychol. Open, vol. 1, no. 1, pp. 1–15, 2021, doi: 10.5334/spo.28.
15. J. Nador, M. Zoia, M. Pachai, and M. Ramon, "Psychophysical profiles in super-recognizers," Sci. Rep., vol. 11, no. 1, 2021, Art. no. 13184, doi: 10.1038/s41598-021-92549-6.
16. M. Linka, M. D. Broda, T. A. Alsheimer, B. de Haas, and M. Ramon, "Characteristic fixation biases in super-recognizers," J. Vis., vol. 22, no. 8, p. 17, 2022, doi: 10.1167/jov.22.8.17.
17. Prolific, 2021. [Online]. Available: https://www.prolific.com/
18. R Core Team, "R: A language and environment for statistical computing," R Foundation for Statistical Computing, Vienna, Austria, Version 4.1.0, 2021. [Online]. Available: https://www.R-project.org/
19. R. A. Rigby and D. M. Stasinopoulos, "Generalized additive models for location, scale and shape," J. Roy. Statistical Soc. C (Applied Statistics), vol. 54, no. 3, pp. 507–554, 2005, doi: 10.1111/j.1467-9876.2005.00510.x.
20. D. Bates, M. Mächler, B. Bolker, and S. Walker, "Fitting linear mixed-effects models using lme4," J. Statistical Softw., vol. 67, no. 1, pp. 1–48, 2015, doi: 10.18637/jss.v067.i01.
21. L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
22. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, no. 85, pp. 2825–2830, 2011.
23. P. Probst, M. Wright, and A. Boulesteix, "Hyperparameters and tuning strategies for random forest," WIREs Data Mining Knowl. Discovery, vol. 9, no. 3, 2018, Art. no. e1301, doi: 10.1002/widm.1301.
24. M. D. Stasinopoulos, R. A. Rigby, G. Z. Heller, V. Voudouris, and F. D. Bastiani, Flexible Regression and Smoothing: Using GAMLSS in R. Boca Raton, FL, USA: CRC Press, 2017.
25. D. Lakens, "Equivalence tests: A practical primer for t tests, correlations, and meta-analyses," Social Psychol. Personality Sci., vol. 8, no. 4, pp. 355–362, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:39946329
26. M. Ramon and B. Rossion, "Impaired processing of relative distances between features and of the eye region in acquired prosopagnosia—Two sides of the same holistic coin?" Cortex, vol. 46, no. 3, pp. 374–389, 2010, doi: 10.1016/j.cortex.2009.06.001.
27. M. Groh, A. Sankaranarayanan, N. Singh, D. Y. Kim, A. Lippman, and R. Picard, "Human detection of political speech deepfakes across transcripts, audio, and video," Papers With Code. [Online]. Available: https://paperswithcode.com/paper/human-detection-of-political-deepfakes-across

Meike Ramon is a Swiss National Science Foundation Promoting Women in Academia group leader and an assistant professor. She leads the Applied Face Cognition Lab and directs the Cognitive and Affective Regulation Laboratory at the University of Lausanne, 1015 Lausanne, Switzerland. Her research interests include face processing and recognition, cognitive neuroscience, and its applications in government and industry. Ramon received a Ph.D. focused on personally familiar face processing in the healthy and damaged brain from UCLouvain. She is a board member of the Association for Independent Research. Contact her at meike.ramon@gmail.com.

Matthew Vowels is a junior lecturer at the Institute of Psychology at the University of Lausanne, 1015 Lausanne, Switzerland; a visiting research fellow at the Centre for Vision, Speech and Signal Processing, University of Surrey; and a senior researcher for The Sense Innovation and Research Center in the department of radiology of Lausanne University Hospital. His research interests include machine learning, computer vision, causality, and statistics. Contact him at matthew.vowels@unil.ch.

Matthew Groh is a Donald P. Jacobs Scholar and assistant professor in the Department of Management and Organizations at the Kellogg School of Management and, by courtesy, the Department of Computer Science at the McCormick School of Engineering, Northwestern University, Evanston, IL 60208 USA. His research interests include human-AI collaboration, computational social science, affective computing, deepfakes, and generative AI. Groh received a Ph.D. in media arts and sciences from MIT. Contact him at matthew.groh@kellogg.northwestern.edu.