

Automated Online Exam Proctoring


Yousef Atoum, Liping Chen, Alex X. Liu, Stephen D. H. Hsu, and Xiaoming Liu

Yousef Atoum is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI. Liping Chen, Alex X. Liu, and Xiaoming Liu are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI. Stephen D. H. Hsu is with the Department of Physics and Astronomy, Michigan State University, East Lansing, MI. Corresponding author: Xiaoming Liu, liuxm@cse.msu.edu

Abstract—Massive open online courses (MOOCs) and other forms of remote education continue to increase in popularity and reach. The ability to efficiently proctor remote online examinations is an important limiting factor to the scalability of this next stage in education. Presently, human proctoring is the most common approach of evaluation, by either requiring the test taker to visit an examination center, or by monitoring them visually and acoustically during exams via a webcam. However, such methods are labor-intensive and costly. In this paper, we present a multimedia analytics system that performs automatic online exam proctoring. The system hardware includes one webcam, one wearcam, and a microphone, for the purpose of monitoring the visual and acoustic environment of the testing location. The system includes six basic components that continuously estimate the key behavior cues: user verification, text detection, voice detection, active window detection, gaze estimation and phone detection. By combining the continuous estimation components, and applying a temporal sliding window, we design higher-level features to classify whether the test taker is cheating at any moment during the exam. To evaluate our proposed system, we collect multimedia (audio and visual) data from 24 subjects performing various types of cheating while taking online exams. Extensive experimental results demonstrate the accuracy, robustness, and efficiency of our online exam proctoring system.

Index Terms—Online exam proctoring (OEP), user verification, gaze estimation, phone detection, text detection, speech detection, covariance feature.

Fig. 1: Based on the audio-visual streams captured by a wearcam and a webcam with an integrated microphone, our OEP system automatically and continuously detects cheating behaviors during online exams.

I. INTRODUCTION

MASSIVE open online courses (MOOCs) offer the potential to significantly expand the reach of today's educational institutions, both by providing a wider range of educational resources to enrolled students and by making educational resources available to people who cannot access a campus due to location or schedule constraints. Instead of taking courses in a typical classroom on campus, students can now take courses anywhere in the world using a computer, where educators deliver knowledge via various types of multimedia content. According to a recent survey [1], more than 7.1 million students in America were taking at least one online course in 2013. It also states that 70% of higher education institutions believe that online education is a critical component of their long-term strategy.

Exams are a critical component of any educational program, and online educational programs are no exception. In any exam, there is a possibility of cheating, and therefore, its detection and prevention are important. Educational credentials must reflect actual learning in order to retain their value to society. The authors in [15] state that the percentage of students committing academic cheating activity is on the rise. Nearly 74% of students in 2013 indicated that it would be somewhat easy to cheat in online exams. They also found that in 2013, about 29% of the students admitted to cheating in online exams. When exams are administered in a conventional and proctored classroom environment, the students are monitored by a human proctor throughout the exam. In contrast, there is no convenient way to provide human proctors in online exams. As a consequence, there is no reliable way to ensure against cheating. Without the ability to proctor online exams in a convenient, inexpensive, and reliable manner, it is difficult for MOOC providers to offer reasonable assurance that the student has learned the material, which is one of the key outcomes of any educational program, including online education.

A typical testing procedure for online learners is the following: students come to an on-campus or university-certified testing center and take an exam under human proctoring. New emerging technologies, such as Kryterion and ProctorU, now allow students to take tests anywhere as long as they have an Internet connection. However, they still rely on a person "watching" the exam-taking. For example, Kryterion employs a human proctor watching a test taker through a webcam from a remote location. The proctors are trained to watch and listen for any unusual behaviors of the test taker, such as unusual eye movements, or removing oneself from the field of view. They can alert the test taker or even stop the test.

In this paper, we introduce a multimedia analytics system to perform automatic and continuous online exam proctoring (OEP). The overall goal of this system is to maintain academic integrity of exams, by providing real-time proctoring for detecting the majority of cheating behaviors of the test taker. To achieve such goals, audio-visual observations about the test takers are required to be able to detect any cheat behavior. Many existing multimedia systems [23], [35] have been utilizing features extracted from audio-visual data to study human behavior, which has motivated our technical approach.


Our system monitors such cues in the room where the test taker resides, using two cameras and a microphone. As shown in Fig. 1, the first camera is located above or integrated with the monitor, facing the test taker. The other camera can be worn or attached to eyeglasses, capturing the field of view of the test taker. In this paper, these two cameras are referred to as the "webcam" and "wearcam", respectively. The webcam also has a built-in microphone to capture any sound in the room. Using such sensors, we propose to detect the following cheat behaviors: (a) cheating from text books/notes/papers, (b) using a phone to call a friend, (c) using the Internet from the computer or smartphone, (d) asking a friend in the test room, and (e) having another person take the exam other than the test taker.

We propose a hybrid two-stage algorithm for our OEP system. The first stage focuses on extracting middle-level features from audio-visual streams that are indicative of cheating. These mainly consist of six basic components: user verification, text detection, speech detection, active window detection, gaze estimation, and phone detection. Each component produces either a binary or probabilistic estimation of observing certain behavior cues. In the second stage, a joint decision across all components is carried out by extracting high-level temporal features from the OEP components at the first stage. These new features are utilized to train and test a classifier to provide real-time continuous detection of cheating behavior. To evaluate the OEP system, we collect multimedia (audio and visual) data from 24 subjects performing various types of cheating while taking a multiple-choice and fill-in-the-blank math exam. Extensive experimental results demonstrate the accuracy, robustness, and efficiency of our online exam proctoring system in detecting cheating behavior.

This paper makes the following contributions:
• Proposes a fully automated online exam proctoring system with visual and audio sensors for the purpose of maintaining academic integrity.
• Designs a hybrid two-stage multimedia analytics approach where an ensemble of classifiers extracts middle-level features from the raw data, and transforming them into high-level features leads to the detection of cheating.
• Collects a multimedia dataset composed of two videos and one audio stream for each subject, along with label information of all cheating behaviors. This database is publicly available for future research.¹

II. RELATED WORK

Over the years, the demand for online learning has increased significantly. Researchers have proposed various methods to proctor online exams in the most efficient and convenient way possible, yet still preserve academic integrity. These methods fall into three categories: (a) no proctoring [7], [34], (b) online human monitoring [8], [13], and (c) semi-automated machine proctoring [17], [24]. No proctoring does not mean that test takers have the freedom to cheat. Instead, cheating is minimized in various ways. In [7], the authors believe they can promote academic honesty by proposing eight control procedures that enable faculty to increase the difficulty and thus reduce the likelihood of cheating. In [34], the authors offer a secure web-based exam system along with a network design which is expected to prevent cheating.

Online human monitoring is one common approach for proctoring online exams. The main downside is that it is very costly in terms of requiring many employees to monitor the test takers. Researchers have also proposed different strategies in full monitoring, such as in [13], where they use snapshots to reduce the bandwidth cost of transmitting large video files. Authors in [24] attempt to do semi-automated machine proctoring, by building a desktop robot that contains a 360° camera and motion sensors. This robot transmits videos to a monitoring center if any suspicious motion or video is captured. The main problem is that a single camera cannot see what the subject sees, and as a result even humans may have a hard time detecting many cheating strategies. For example, a partner who is outside the camera view, but who can see the test questions (e.g., on a second monitor), could supply answers to the test taker using silent signals, or by writing on a piece of paper which is visible to the test taker.

Among all prior work, the most relevant work to ours is the Massive Open Online Proctoring framework [17], which combines both automatic and collaborative approaches to detect cheating behaviors in online exams. Their hardware includes four components: two webcams, a gaze tracker, and an EEG sensor. One camera is mounted above the monitor capturing the face, and the other is placed on the right-hand side of the subject capturing the profile of the subject. Motion is used for classification by extracting dense trajectory features. However, this work is limited to only one type of cheating (i.e., reading answers from a paper), with evaluation on a small set of 9 subjects with 84 cheat instances. Since many types of cheating do not contain high-level motion, it is not clear how this method can be extended to handle them. To the best of our knowledge, there is no prior work on a fully automated online proctoring system that detects a wide variety of cheating behaviors.

Beyond educational applications, in the multimedia community, there is prior work on audio-visual-based behavior recognition. Authors in [35] study audio-visual recordings of head motion in human interaction, to analyze socio-communicative and affective behavioral characteristics of interacting partners. [21] automatically predicts hireability in real job interviews, using applicant and interviewer nonverbal cues extracted from audio-visual data. In [10], they automatically estimate high and low levels of group cohesion using audio-video cues. In [16], the authors use audio-visual data to detect a wide variety of threats and aggression, such as unwanted behaviors in public areas. Their two-stage methodology decomposes low-level sensor features into high-level concepts to produce threat and aggression detection. While there is similarity between their methodology and ours, our unique two-camera imaging allows us to leverage the correlation between the two distinct visual signals. The addition of audio to video has also proven to complement many visual analysis problems, such as object tracking [14], event detection and retrieval in field sports [27], and vision-based HCI systems [23].

¹ http://cvlab.cse.msu.edu/project-OEP.html


Fig. 2: The architecture of the Online Exam Proctoring (OEP) system.
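The architecture in Fig. 2 can be read as a per-frame loop: each basic component emits one value per frame (or per audio segment), the values are appended to the middle-level vectors, and a sliding window over those vectors feeds the cheat classifier. The Python sketch below only illustrates that data flow under assumed interfaces; the component functions are hypothetical stubs standing in for the detectors described later, not the authors' implementation.

import numpy as np

# Hypothetical per-frame component stubs; each returns one middle-level value.
def face_psr(frame):        return 12.0   # vp: Peak-to-Sidelobe Ratio of the MACE correlation
def num_faces(frame):       return 1      # vn: number of detected faces
def text_prob(frame):       return 0.1    # vt: probability of printed text in the wearcam view
def speech_prob(segment):   return 0.0    # vv: probability of speech in the audio segment
def gaze_lr(frame):         return 0.0    # vg1: normalized yaw in [-1, 1]
def gaze_ud(frame):         return 0.0    # vg2: normalized pitch in [-1, 1]
def phone_prob(frame):      return 0.0    # vph: probability of a lit phone screen
def active_windows():       return 1      # vw: number of active windows from the OS

class MiddleLevelBuffers:
    """Accumulates the per-frame outputs of the six basic components plus vw and vn."""
    def __init__(self):
        self.vectors = {k: [] for k in
                        ["vp", "vt", "vv", "vg1", "vg2", "vph", "vw", "vn"]}

    def update(self, webcam_frame, wearcam_frame, audio_segment):
        v = self.vectors
        v["vp"].append(face_psr(webcam_frame))
        v["vn"].append(num_faces(webcam_frame))
        v["vt"].append(text_prob(wearcam_frame))
        v["vv"].append(speech_prob(audio_segment))
        v["vg1"].append(gaze_lr(wearcam_frame))
        v["vg2"].append(gaze_ud(wearcam_frame))
        v["vph"].append(phone_prob(wearcam_frame))
        v["vw"].append(active_windows())

    def window(self, s=5, fs=25):
        """Return the last s-second window (s * fs frames) of the six classifier inputs."""
        keys = ["vp", "vt", "vv", "vg1", "vg2", "vph"]
        return np.array([self.vectors[k][-s * fs:] for k in keys]).T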

One of our novel ideas is to use a second wearcam for capturing the full field of view of the subject. This is similar to research in first-person vision, where visual analysis is performed on the wearcam. For example, [30] temporally segments human motion into actions and performs activity classification in the context of cooking. [31] uses a wearcam to detect the iris and estimate the visual field in front of the subject, which helps to identify where exactly the subject is looking. In contrast to the single wearcam in first-person vision, our OEP system utilizes two cameras to capture both what the subject sees and his/her own behavior, which enables comprehensive behavior profiling.

III. PROPOSED METHOD

In this work, we aim to develop a multimedia analysis system to detect a wide variety of cheating behaviors during an online exam session. Our proposed online exam process includes two phases, the preparation phase and the exam phase. In the preparation phase, the test taker has to authenticate himself before beginning the exam, by using a password and face authentication. This phase also includes calibration steps to ensure that all sensors are connected and functioning properly. Further, the test taker learns and verbally acknowledges the rules of using the OEP system, such as: no second person is allowed in the same room, the test taker should not leave the room during the exam phase, etc.

In the exam phase, the test taker takes the exam under the continuous "monitoring" of our OEP system for real-time cheating behavior detection. As shown in Fig. 1, we use three sensors (i.e., webcam, wearcam and microphone) to capture audio-visual cues of the exam environment and the test taker. The sensed data is first processed using six components to extract middle-level features, as seen in Fig. 2. These components are: user verification, text detection, speech detection, active window detection, gaze estimation, and phone detection. After that, the middle-level features within a temporal window are fused to generate high-level features, which are then used for training and testing a cheat classifier. The high-level features include the component-dependent features, such as the mean and standard deviation within a window, and features based on the correlation among the components, such as the covariance features [32]. It is crucial to use a diverse and rich set of features to improve the overall detection performance of the OEP system, since the detection of some cheating behaviors relies on the joint triggering of multiple behavior cues.

The remainder of this section describes the following topics: (A) the hardware components of the OEP system, (B) through (G) the six basic components of the system, and (H) the high-level features and classification of the cheating behavior.

A. Hardware Components

During an exam, the test taker may cheat by hearing or viewing forbidden information. Therefore, the OEP system hardware should be designed in a way to hear what the test taker hears and see what the test taker sees. This leads to our design of three hardware components: a webcam, a wearcam, and a microphone. The webcam is mounted on top of the monitor facing the test taker and serves multiple purposes, e.g., knowing who the test taker is, what he is doing, and where he is looking. The wearcam is a wearable camera intended to be attached to the test taker's head, such that the camera points in the same pose direction as the face. Since the wearcam essentially captures the field of view of the test taker, analyzing its video content enables us to detect the "viewing-based" cheating behaviors, such as reading from books, notes, papers, and smartphones. The wearcam also contributes significantly to estimating the head gaze, which is an important behavior cue. Note that employing the wearcam is a distinct novelty of our system design, as well as an advantage over prior exam proctoring systems.


This design is not only motivated by the need to see what the test taker sees, but also by the growing popularity and decreasing cost of wearable cameras. Finally, as an integrated device of the webcam, the microphone captures what the test taker hears; based on our rules, any detected human voice is considered as potential cheating.

During the system design, we experimented to find a suitable prototype for the wearcam. We initially tested the system with a Sony action cam by utilizing a headband. However, the relatively heavy weight and the need to synchronize the webcam and wearcam made this option undesirable. We finally decided to attach a regular wired webcam to a pair of eyeglasses, considering the fact that webcams are becoming smaller in size, lighter in weight, and cheaper over the years, and have real-time wireless capabilities. Similar ideas have also been adopted in the research community to understand human behavior [30], [31]. Note that our OEP system does not depend on a specific choice of cameras; if a more suitable wearcam is available in the future, we can easily adopt it in our system.

Both cameras capture video at a resolution of 640 × 480 and a frame rate of fs = 25 fps. Since the OEP system starts to grab video streams from the two cameras at the same time, the two video and audio streams are automatically synchronized during the test session.

B. User Verification

One of the major concerns in online exams is that the test taker solicits assistance from another person on all or part of an exam. An OEP system should be able to continuously verify whether the test taker is who he claims to be throughout the entire exam session. The test taker is also expected to take the exam alone, without the aid of another person in the room. While there are various options for continuous user authentication, such as keystroke dynamics, we decide to use face verification due to its robustness.

There are a number of challenges for user verification in OEP. First, face detection under various lighting and poses is difficult. Second, due to the partial occlusion caused by the eyeglasses with the attached wearcam (Fig. 2), the performance of face detection and verification can be more fragile. Finally, although face detection has improved substantially over the years, occasional miss detections and false alarms are inevitable, and how to handle them is another challenge.

We propose to overcome these challenges by using an approach integrating both face and body cues. We use the Minimum Average Correlation Energy (MACE) filter to perform face verification [28]. During the preparation phase, initial face authentication is conducted by matching the webcam-captured faces with a mugshot of the test taker. In the meantime, a set of frontal-view images of the test taker is captured, where we detect the faces via the Viola-Jones face detector [33] and train a MACE filter hf. As shown in Fig. 3, from the body region of the images, we extract a 160-dim HSV color histogram of the clothing hb. The body region has a width equal to twice the width of the detected face, and a height equal to half the height of the face. During the exam phase, when a new frame It is captured by the webcam, we first perform face detection. Depending on the number of detected faces vn(t), we handle it correspondingly, as described in Algorithm 1.

Fig. 3: The extracted region of the face (green) and the body (red).

Algorithm 1: User verification algorithm.
  Data: a new frame It, hf
  Result: vp, vn
  Initialization: v = 0, c0 = c1 = 0;
  Viola-Jones face detector → vn(t);
  switch vn(t) do
    case 0:
      if c0 > τ0 then
        pt = vp(t) = c0 = 0;   % warning is sent
      else
        c0++; vp(t) = vp(t − 1); pt = F(vp(t), v, t̄, p̄);
    case 1:
      c0 = c1 = 0;
      if v = 1 then
        compute ht, pb = htᵀ hb;
        if pb > τv and pt−1 > τv then
          pt = F(vp(t), v, t̄, p̄);
        else
          v = 0;
      if v = 0 then
        ct = xt ⊗ hf; vp(t) = PSR(ct); pt = F(vp(t), v, t̄, p̄);
        if pt > τv then
          v = 1; t̄ = t; p̄ = pt;
    case > 1:
      if c1 > τ0 then
        pt = vp(t) = c1 = 0;   % warning is sent
      else
        c1++; vp(t) = vp(t − 1); pt = F(vp(t), v, t̄, p̄);

If only one face is detected in the new frame (vn(t) = 1), this is the most likely case, since the test taker is required to take the exam alone. Let xt be the appearance feature of the detected face, pt be the probability of user authenticity, and v be an indicator flag on whether the test taker is verified. If the user is not verified (i.e., v = 0) in the previous frame, we verify the user by performing the cross-correlation ct = xt ⊗ hf, where ct is the correlation output at time t. For computational efficiency, the correlation is computed in the Fourier domain using the Fast Fourier Transform (FFT), and then transformed back to the spatial domain via the inverse FFT. Savvides et al. showed that c is sharply peaked for authentic subjects, and does not exhibit a strong peak for impostors [28]. The Peak-to-Sidelobe Ratio (PSR) is defined to measure the strength of the correlation peak, where a PSR value greater than five is considered as an authenticated user.
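The correlation-and-PSR step described above can be sketched as follows. The MACE filter design itself is not shown; mace_filter_freq is assumed to be a precomputed frequency-domain filter, and the 11 × 11 exclusion window around the peak used for the sidelobe statistics is an assumed detail that the paper does not specify.

import numpy as np

def correlate_fft(face_patch, mace_filter_freq):
    """Cross-correlate a face patch with a MACE filter in the Fourier domain (Section III-B)."""
    X = np.fft.fft2(face_patch)
    c = np.real(np.fft.ifft2(X * np.conj(mace_filter_freq)))  # back to the spatial domain
    return np.fft.fftshift(c)

def peak_to_sidelobe_ratio(corr, exclude=5):
    """PSR = (peak - mean(sidelobe)) / std(sidelobe), with a small window around the peak excluded."""
    r, c = np.unravel_index(np.argmax(corr), corr.shape)
    peak = corr[r, c]
    mask = np.ones_like(corr, dtype=bool)
    mask[max(0, r - exclude):r + exclude + 1, max(0, c - exclude):c + exclude + 1] = False
    sidelobe = corr[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)

# A PSR above 5 is treated as an authenticated user, as in the text.
face = np.random.rand(64, 64)
h_f = np.fft.fft2(np.random.rand(64, 64))   # placeholder filter, not a trained MACE filter
psr = peak_to_sidelobe_ratio(correlate_fft(face, h_f))
verified = psr > 5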


We denote the PSR value computed at time t as vp(t), which is further converted into the probability measure of user authenticity pt by using the function F,

pt = F(vp(t), v, t̄, p̄),   (1)

as explained in Algorithm 2. If pt is larger than a predefined threshold τv, the face is verified, and we denote by t̄ the last verified time and by p̄ the last verified probability. Otherwise, the face continues to be verified in the next frame.

Algorithm 2: Face verification probability F.
  Data: vp(t), v, t̄, p̄
  Result: pt
  if v = 0 then
    if vp(t) < 5 then
      pt = 0;
    else if vp(t) > 10 then
      pt = 1;
    else
      pt = (1/5)(vp(t) − 5);
  else
    pt = p̄ e^(−k(t−t̄));

When the next frame arrives and only one face is detected, if the user was verified before (v = 1), we rely on body tracking, due to its robustness to head poses, instead of face verification. Specifically, we compute the histogram of the clothing ht and compare it to hb. If their similarity pb is larger than a threshold τv, pt is calculated as pt = p̄ e^(−k(t−t̄)), where k is the decay speed of the exponential function. After ∆t = t − t̄ seconds from the last verification time, the face needs to be verified again, even if pb > τv the entire time. This is reasonable because an impostor could wear the same clothes as the test taker.

There are cases where no face is detected in the current frame (vn(t) = 0). When the number of consecutive frames without detected faces, c0, is bigger than a threshold τ0, the system determines that the user has left the exam, and a warning is sent with an assigned high probability of cheating. Face verification is required to continue the exam when the user appears again. If c0 ≤ τ0, we do not make any decision and wait for the next frame. This tolerance is necessary because the face might not be detected in certain scenarios, e.g., large pose, illumination changes, or occlusion.

If more than one face is detected (vn(t) > 1), we also consider some tolerance, similar to the case of vn(t) = 0. When the number of consecutive frames with multiple detected faces, c1, is less than τ0, we do not make any decision and wait for the next frame. Otherwise, there is indeed more than one person in front of the computer. A warning is sent, with a high probability of cheating.

The user verification component provides continuous estimation per frame regarding the number of faces and PSR values, which are stored in two vectors, "numFaces" and "facePSR", respectively. The numFaces vector, vn, is a direct indication of cheating when vn(t) ≠ 1. However, the facePSR alone, vp, may only implicitly represent cheating. This output will be converted to high-level features to serve in the process of detecting other cheat behaviors, as seen in Fig. 2.

C. Text Detection

In a closed-book exam, reading from text is a major form of cheating, where the text can be from a book, printout, notes, etc. It is obvious that the webcam alone cannot effectively detect this cheat type, since the webcam might not "see" the book or printout. On the other hand, the wearcam captures everything in the field of view of the test taker. Hence, any text seen by the test taker can very likely be seen, and detected, through the wearcam.

While text detection is a well-studied topic, detecting text in online exams could be challenging, since the test taker may attempt to cheat from text with a small font, or place the text far away from the camera. Further, we need to differentiate text on printed papers vs. text on the computer screen or the keyboard, since detection of the latter is not considered as cheating, as shown in Fig. 4. Note that for this work, we focus on printed text only, rather than handwriting. In the case of handwriting, the aid of other capabilities might be needed, such as estimating the eye gaze of the user, since cheating from text requires the test taker to look at it for some time. Moreover, motion blur could also be introduced due to fast head movements. In such cases, a motion blur detector would be employed, and we can then skip text detection on these frames with blurred motion.

Fig. 4: Positive (left) and negative (right) samples for text detection.

We develop a learning-based approach for text detection. First, we collect a set of 186 positive training images that contain text in a typical office environment, and 193 negative training images (Fig. 4). Then a learning algorithm based on the GIST features [22] is applied to the training images. We perform cross-validation to estimate the algorithm's parameters, and finally, the algorithm can predict the probability of text in a testing video frame.

The GIST feature is well known in the vision community. For example, [22] introduces how to compute GIST features, based on a low-dimensional representation of a given image, termed the "Spatial Envelope". A set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of an image is used. Since the GIST features of images are 512-dimensional vectors, we apply PCA to reduce them to a lower dimension, and the reduced features are then used for training a binary SVM classifier. Given a testing video frame, the output of the SVM classifier is stored as one element of the "textProb" vector, denoted by vt, representing the probability of detecting text in a frame.
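The text-detection training pipeline (GIST descriptors → PCA → binary SVM) can be sketched with scikit-learn as below. The 512-dimensional GIST descriptors are assumed to be precomputed for the 186 positive and 193 negative images (the GIST extraction itself is not shown), and the PCA dimensionality and SVM kernel are placeholders to be chosen by cross-validation as the paper describes.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumed inputs: precomputed 512-dim GIST descriptors and binary labels (1 = contains printed text).
X_train = np.random.rand(186 + 193, 512)
y_train = np.array([1] * 186 + [0] * 193)

text_detector = make_pipeline(
    PCA(n_components=100),                 # reduce the 512-dim GIST feature
    SVC(kernel="rbf", probability=True),   # binary text vs. non-text classifier
)
text_detector.fit(X_train, y_train)

def text_prob(gist_descriptor):
    """Probability of printed text in a wearcam frame; stored as one element of textProb (vt)."""
    return float(text_detector.predict_proba(gist_descriptor.reshape(1, -1))[0, 1])

v_t = text_prob(np.random.rand(512))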


D. Speech Detection

One of the most likely cheating behaviors in online exams is to seek verbal assistance from another person in the same room, or remotely via a phone call. In fact, in the audio-visual dataset collected in our work, this is the most frequent cheating behavior. By requiring the test taker to take the exam in a quiet room with no one around, any human speech being detected could be considered a potential cheating instance. Therefore, we design algorithms in this component to detect speech from acoustic signals.

There are unique challenges for speech detection in OEP. Test takers who attempt cheating tend to use a low voice while speaking to others. Therefore, one challenge is to be able to detect speech at any level of amplitude. Second, speech can be confused with many environmental sounds in the test room, such as noises generated from moving objects (e.g., chair, door, or keyboard), while others might be caused by the test taker, e.g., coughs or breathing. This can be especially challenging when speech is overlaid with other sounds.

Following a learning-based speech detection scheme, we first collect a wide variety of typical sounds in an office environment, such as breath, burp, chair moving, cough, steps, etc. Table I shows the number of files and the length of the audio for each sound category. These sounds are either found online [4] or recorded by ourselves. We only consider speech as the positive samples, while the remaining categories of sound are negative. In total, the lengths of the positive and negative samples are 154 and 211 seconds, respectively.

TABLE I: Collected sound samples, with the total number of files and duration (in seconds) per sound type.

Sound type        # files   Length     Sound type        # files   Length
speech                4       154      keyboard typing       7        18
burp                  3         2      key jingle            5        21
chair moving          1         8      paper moving          5        14
cough                 5         7      phone ring            7        16
door knocking         6         8      runny nose            3         5
open/close door       4         4      sigh                  8        10
drink                 7        15      silence               2         9
fart                  4         6      breath                5        33
gasp                  4         4      spit                  3         4
hiccup                5         2      steps                 6        25

Unlike text detection, where the unit of classification is an image, the unit of speech detection is an acoustic segment. A segment is defined either when the amplitudes of all its samples are larger than a threshold, or with a fixed duration Ls. Due to its simplicity and robustness, we decide to adopt the latter approach. It is a trade-off to determine the length of Ls, as a longer duration leads to a higher detection rate for long speech, but a lower rate for shorter speech.

The acoustic segment is represented by the short-time Fourier transform (STFT) using Hamming windows [9]. We divide the frequencies from 200 Hz to 4 kHz into 16 different channels. Then we extract a 138-dimensional feature, which encodes the mean and standard deviation of the power percentile in each frequency channel and of the total power, the bandwidth, the most powerful frequency channel, the number of peaks in power over time, the regularity of power peaks, the range of the total power over time, and time-localized frequency percentiles over various frequency ranges.

With the collection of features from the training samples in Table I, we use a binary SVM classifier for speech detection. During testing, the output of the SVM classifier is stored as one element of the "voiceProb" vector, denoted by vv, representing the probability of detecting speech within a sound segment.

E. Active Window Detection

The Internet and computers are an open gateway to valuable information for answering exam questions. The authors in [15] indicate that cheating from the Internet is the most frequent among e-learners. In [7], they use Blackboard's Respondus LockDown Browser (RLB) to access the online exam. RLB is a special browser where the test taker is locked into the exam and has no way to exit/return, cut/paste, or electronically manipulate the system. However, some exams might require Internet access to some specific websites, or perhaps the use of e-mail or chat functions. Moreover, some test takers might have saved files and documents on the computer containing answers to the exam. Therefore, it is critical to keep track of how many windows the test taker has open.

In our OEP system, we give the user full Internet and computer access during the exam. We periodically estimate the number of active windows running in the system, denoted by vw, obtained from the operating system API. Most of the time, there should be only one active window, which is the online exam itself. If vw(t) > 1 at a specific time t during the exam, we assume the test taker is cheating, and a warning will be displayed on the monitor requesting an immediate shutdown of the opened window. The probability of cheating increases as the test taker keeps the unexpected window open longer. Since this component relies on the operating system API, the accuracy of active window detection is 100%.

F. Gaze Estimation

In traditional classroom-based proctoring, an abnormal head gaze direction and its dynamics over time can be a strong indicator of potential cheating. For example, an abnormal gaze is when the test taker's eyes are off the screen for an extended period of time, or when the head quickly gazes around a few times. Although abnormal gaze does not directly constitute a cheating behavior, it is an important cue to suggest potential subsequent cheating actions.

As a classic computer vision problem [19], head gaze estimation is particularly challenging in our application due to the spontaneous head motion of the test taker as well as the partial occlusion by the eyeglasses and wearcam. To address this issue, we take advantage of both visual sensors to enhance head gaze estimation. From the wearcam, gaze can be inferred based on the relative 2D location of the monitor screen. From the webcam, we may estimate the gaze from the face in the video frame. By combining the information from both cameras, we accurately estimate the head gaze of the test taker in a wide range of yaw and pitch angles.
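A simplified version of the acoustic segment feature of Section III-D is sketched below: an STFT with a Hamming window, 16 frequency channels between 200 Hz and 4 kHz, and per-channel power statistics. Only the per-channel and total-power mean/standard-deviation terms are shown; the peak counts, bandwidth, and percentile terms that complete the 138-dimensional feature are omitted, and the sampling rate and STFT parameters are assumptions.

import numpy as np
from scipy.signal import stft

def acoustic_features(segment, fs=16000, n_channels=16):
    """Per-channel and total power statistics of a fixed-length audio segment."""
    freqs, _, Z = stft(segment, fs=fs, window="hamming", nperseg=512)
    power = np.abs(Z) ** 2                               # (frequency bins, time frames)
    edges = np.linspace(200, 4000, n_channels + 1)       # 16 channels between 200 Hz and 4 kHz
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = power[(freqs >= lo) & (freqs < hi)].sum(axis=0)   # band power over time
        feats += [band.mean(), band.std()]
    total = power[(freqs >= 200) & (freqs < 4000)].sum(axis=0)
    feats += [total.mean(), total.std()]
    return np.array(feats)                                # 2*16 + 2 = 34 of the 138 dimensions

x = acoustic_features(np.random.randn(16000))             # e.g., a 1-second segment at 16 kHz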


Fig. 5: Screen detection process: (a) input frame It, (b) grayscale image, (c) histogram of (b), (d) converted binary image based on the threshold, (e) the largest region after connected component analysis, and (f) estimated screen using the convex hull of the largest region.

We now describe the gaze estimation from the wearcam, where the core routine is to extract the position of the screen automatically. We achieve this based on a simple observation that the pixels of the screen are brighter than other pixels. Specifically, as seen from Fig. 5(a-d), we first convert the image to grayscale, and then to binary by using a proper threshold, which is set to the mean intensity of the grayscale image. Using connected component analysis and keeping only the largest region, we obtain a candidate region of the screen. Finally, the screen is extracted by computing the convex hull of the largest region.

In the preparation phase, the user is required to be in frontal view of the webcam while performing initial authentication. As a result, it is reasonable to assume that the screen is near the center of the video frame from the wearcam. We indeed verify this before completing the preparation phase. In order to use the screen position to estimate the head gaze in the exam phase, we calibrate the screen position during the preparation phase. That is, we estimate the screen position, and denote its center as cs, width as ws, and height as hs. Note that calibrating the screen is also very important for other components, such as the text and phone detection. We also learn an HSV model of the screen consisting of two thresholds, an upper and a lower bound of possible screen intensity across the color channels. The bounds are defined by the mean and standard deviation of each channel in the preparation phase. Using this model, in the exam phase, an HSV pixel is converted to foreground (i.e., 1 in the binary image) if and only if all the H, S and V intensities fall within the learned bounds.

During the exam phase, given a new frame, we use the HSV model to convert the frame to a binary image and then estimate the screen position ĉs. We assume the distance between the test taker and the screen is set to a fixed distance d. Knowing d, cs and ĉs, the head pose is calculated by

vg = arctan(‖cs − ĉs‖ / d).   (2)

It is obvious that we may only estimate vg using the screen region when the screen is visible in the wearcam video. That is, when the head gaze is larger than θg, the screen is out of view of the wearcam video frame. In this case, we use the second approach of head gaze estimation via the face image captured by the webcam.

The basic idea of this second approach is similar to the approach in [3]. At the initial step, we detect a set of strong corner points on the face [29], and then convert them to 3D model points by using a sinusoidal model. This model attempts to map the 2D corner points onto a 3D sinusoidal surface, which is an approximation of the true 3D face surface. Secondly, we track these points by using the Lucas-Kanade method, and estimate a rotation matrix based on the changes of the tracked points. We observe that at small gaze angles, the screen-based approach is superior to the face-based approach. Therefore, the face-based approach is only utilized when vg > θg.

For each frame, we store the results of the gaze estimation into elements of two vectors, "gazeLR" and "gazeUD", which are denoted as vg1 and vg2, respectively. The first represents the yaw estimation, and the second is the pitch estimation. Since the estimated gaze is an angular value in the range of [−π/2, π/2], we normalize them such that vg1 ∈ [−1, 1], where −1 means the user is looking far left at an angle of −π/2 and 1 is towards the far right at π/2. The same applies to vg2.

G. Phone Detection

Our online exam rules prohibit the use of any type of mobile phone. Therefore, the presence of a mobile phone in the testing room can be an indication of potential cheating. With advancements in mobile phone technology, there are many ways to cheat from them, such as reading saved notes, text messaging friends, browsing the Internet, and taking a snapshot of the exam to share with other test takers.

Phone detection is challenging due to the various sizes, models and shapes of phones (a tablet could also be considered a type of phone). Some test takers might have large touch screens, while others might use a button-based flip phone. Moreover, cheating from a phone is usually accompanied by various occlusions, such as holding the phone under the desk, or covering part of the phone with a hand.

To enable this capability, we utilize the video captured from the wearcam, since it sees what the test taker is seeing. We perform phone detection based on an approach similar to the screen-based gaze estimation, i.e., searching for pixels that are brighter than the background pixels. The motivation for using the screen's brightness over detecting the phone object is that we do not want to claim there is a phone-based cheating behavior unless the phone is switched on. By using additional constraints on the area of potential local regions to exclude large (i.e., the monitor) and small (i.e., random noise) objects, whose thresholds are denoted as τl and τs respectively, we can estimate a candidate local region for the phone's screen. We chose to represent the estimated phone screen by the area of the local region.

Given a video frame from the wearcam, the output of the phone detection module is stored as one element of the "phoneProb" vector, denoted by vph. Since the phone detection module detects phones with an area in the range of [τs, τl], we normalize them such that vph ∈ [0, 1], representing the probability of detecting a phone in the frame. Since the vector vph could be noisy, we apply a median filter of a fixed size sm to eliminate the random noise.
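The screen-based gaze estimate of Eq. (2) can be sketched with OpenCV as below. It follows the grayscale-threshold-largest-component recipe of Fig. 5 but skips the convex-hull refinement and the HSV screen model; the calibrated screen center c_s (in pixels) and the pixel-to-meter conversion factor are assumed to come from the preparation phase.

import cv2
import numpy as np

def detect_screen_center(frame_bgr):
    """Estimate the screen center (in pixels) as the centroid of the largest bright region."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, float(gray.mean()), 255, cv2.THRESH_BINARY)  # threshold at mean intensity
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary.astype(np.uint8))
    if n < 2:
        return None                                   # screen not visible in the wearcam view
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return centroids[largest]                         # (x, y) of the candidate screen region

def head_gaze(frame_bgr, c_s, meters_per_pixel, d=0.6):
    """Eq. (2): v_g = arctan(||c_s - c_hat_s|| / d), with the pixel offset converted to meters."""
    c_hat = detect_screen_center(frame_bgr)
    if c_hat is None:
        return None                                   # fall back to the face-based estimate
    offset = np.linalg.norm(np.asarray(c_s) - c_hat) * meters_per_pixel
    return np.arctan2(offset, d)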


H. Cheating Behavior Detection

At this stage, we have the continuous output of the OEP basic components (i.e., vp, vt, vv, vg1, vg2, vph, vw, vn), where all vectors have the same sampling rate, i.e., one element per frame. We now present how to further analyze these vectors to detect cheat behaviors. Note that, as shown by the blue dashed arrows in Fig. 2, the latter two vectors vw and vn (i.e., the numbers of active windows and faces) are used directly to provide a cheat decision. On the other hand, the remaining six vectors are utilized for extracting high-level features, which are then used for learning an SVM classifier to make continuous decisions on cheat behaviors.

In our algorithm design, we highlight the correlation among the multiple components, which is extremely valuable in detecting many cheat behaviors. For instance, it is shown that when test takers cheat by talking to a person in the test room, there is a high correlation between the gaze and speech estimation, which means that the test taker tends to look at the person during this process. Another example is between the gaze and text detection, where the subjects tend to turn left or right to search for a book or some notes. We now explain how we design these high-level features, and the cheat classifier used in the OEP system, in the following two subsections.

1) Feature extraction: Since cheating behaviors occur over a time duration, features need to be defined based on a temporal window, which is commonly adopted in other behavior recognition work [2]. We define a temporal window w with a fixed length of s seconds for the purpose of feature extraction. By shifting the window throughout the middle-level feature vectors with a fixed overlap of l, we generate multiple segments, which are the units for both training and testing. Given that we manually label the ground truth (cheat vs. non-cheat) for all collected videos at each second, we can convert this labeling to the ground truth label of each segment. That is, the binary ground truth label of a segment is determined by the majority of the per-second ground truth labels within the segment. The window length s is preferred to be an exact integer number of seconds, as well as an odd number of seconds, to remove potential ties. The temporal segmentation and labeling process are illustrated in Fig. 6.

Fig. 6: Segment-based labeling process for a subject. In this example, the test taker cheats two times during the exam. Note how the window w shifts at exact increments with an 80% overlap. At each shift, a segment is formed and assigned a label based on the majority vote of the ground truth labels that fall within w.

At time t, the high-level features are extracted from all six vectors within the temporal window wt, and used to represent the segment. The high-level features of each segment are composed of the mean µ and standard deviation σ of each component vector, and the covariance features C.

The covariance feature is an effective visual feature used in many vision systems, including pedestrian detection [32]. Let vi be the ith component vector obtained from one segment. We compute an sfs × 3 matrix Ai = [vi, |v′i|, |v″i|], where |v′i| and |v″i| are the absolute values of the first and second order derivatives, respectively. Due to the sparsity of vph (i.e., most elements of the vector are zeros, as seen in Fig. 2), we exclude it from extracting covariance features. Therefore, combining Ai of all the remaining five vectors yields an sfs × 15 matrix, A = [A1 A2 ... A5]. To compute the covariance feature, we apply the following equation:

C = (1/(s − 1)) (A − mean(A))ᵀ (A − mean(A)),   (3)

where mean() computes the mean across all rows. Since C is a 15 × 15 symmetric matrix, by keeping the upper triangular part, the covariance feature of a segment is a 120-dimensional vector. Finally, each extracted segment has a 132-dimensional feature, including µ and σ (6 dimensions each, obtained from the 6 basic components) and the covariance feature (obtained from 5 basic components, excluding phone detection), to be used for cheat classification.

2) SVM cheat classifier: As with the OEP components, we use SVM for classifier learning [5]. For all training videos in the OEP dataset, the segments with no cheating are considered as samples of the negative class, and the rest of the segments are of the positive class. We divide the positive cheating samples into three main categories. (a) Any text-related cheating from books, papers and notes is assigned to class 1. (b) Any cheating involving speech, such as asking a person in the room, calling a friend on a phone, or any other speech detected in the room, is assigned to class 2. (c) Cheating from a phone or laptop device is assigned to class 3. Class 0 is reserved for the no-cheating segments (the negative class). It is observed that a multi-class SVM, consisting of a set of three pair-wise binary classifiers (class 0 vs. 1, 0 vs. 2, etc.), performs better than a single binary classifier (class 0 vs. classes 1, 2, 3). During testing, we feed the feature of each segment to the three classifiers, and use the average of the three classification scores as the final measure of the cheating likelihood.
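The 132-dimensional segment feature of Section III-H can be computed directly from a window of the middle-level vectors, as sketched below: per-component mean and standard deviation, plus the upper triangle of the covariance of Eq. (3) over [v, |v′|, |v″|] for the five non-sparse components. Approximating the derivatives with np.gradient is an implementation choice, not a detail given in the paper.

import numpy as np

def segment_feature(window):
    """window: (s*fs, 6) array with columns [vp, vt, vv, vg1, vg2, vph] for one temporal segment."""
    mu = window.mean(axis=0)                         # 6-dim means
    sigma = window.std(axis=0)                       # 6-dim standard deviations
    cols = []
    for i in range(5):                               # vph (column 5) is excluded from the covariance part
        v = window[:, i]
        cols += [v, np.abs(np.gradient(v)), np.abs(np.gradient(np.gradient(v)))]
    A = np.stack(cols, axis=1)                       # (s*fs) x 15 matrix A = [A1 ... A5]
    Ac = A - A.mean(axis=0)
    C = Ac.T @ Ac / (len(A) - 1)                     # Eq. (3)
    upper = C[np.triu_indices(15)]                   # 15*16/2 = 120 covariance features
    return np.concatenate([mu, sigma, upper])        # 6 + 6 + 120 = 132 dimensions

feat = segment_feature(np.random.rand(125, 6))       # e.g., a 5-second window at 25 fps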


Fig. 8: OEP dataset examples illustrating various cheat types. The examples are grouped in pairs showing both webcam and wearcam at a
specific time of the exam. The subjects are cheating from books, notes, papers, smartphones, the Internet, or asking someone in the room.

Fig. 7: Two questions of the mathematics exam that are given to test takers during data collection.

Fig. 9: Statistics of cheating behavior in the OEP dataset (cheat type frequency, left, and total cheat duration per type, right). Cheat types: (1) cheat from book/note/paper, (2) talk in room with a person, (3) use the Internet, (4) ask a friend over the phone, and (5) use a phone or other devices.

IV. OEP DATABASE COLLECTION

Since there is no publicly available database for online exams, we carefully designed a protocol for data collection and labeling. The data collection took place in a room with regular office furniture. We prepared a mathematics online exam consisting of several multiple-choice and fill-in-the-blank questions, as shown in Fig. 7. During the preparation phase of the exam, we inform the test taker of a set of rules they need to obey: (a) No books, notes or any sort of text are allowed in the room. (b) Phones and laptops are prohibited. (c) The student has to solve the problems without the help of any other person. (d) Using the Internet is prohibited.

A total of 24 subjects, all of whom are students at Michigan State University, participated in the data collection. The first 15 subjects were actors who pretended to be taking the exam. They were asked to perform cheating behaviors during the session, without any instructions on what cheating behavior to perform or how to perform it. One issue with these subjects is that potentially artificial behaviors are observed during the acting. Therefore, to capture real-world exam scenarios, we asked nine students to take the real exam, where their scores were recorded. Knowing that they are not likely to cheat in the data capturing room, the proctor invokes the cheating behaviors by talking, walking up to the student, or handing them a book, etc. The combination of these two types of subjects enriches the database with various cheat techniques, as well as the sense of engagement in real exams.

For each of the 24 sessions, we collect the audio and two videos from both cameras, as seen in Fig. 2. Each session varied in length, with an average time of 17 minutes. Human annotation and labeling are performed offline after collecting the data, by viewing the two videos and the audio simultaneously. The labeling of one cheat instance consists of three pieces of information: the start time, end time and type of cheating. We label five different types of cheating behaviors: (1) cheating from a book, notes or any text found on papers, (2) talking to a person in the room, (3) using the Internet, (4) asking a friend a question over the phone, and (5) using a phone. The labeling process for every session is done carefully and required nearly 30∼35 minutes per session. Fig. 8 illustrates examples of different types of cheating from various subjects.

Nearly 20% of the total video length has various cheating activities, while the remaining 80% contains normal exam-taking behaviors with no cheating. Even though these percentages may not depict real-life exam scenarios (e.g., 1% cheat vs. 99% normal), it is necessary for the OEP system to include as many cheating instances as possible to learn and evaluate a cheat classifier. Fig. 9 shows a full description of the cheat behaviors in our OEP dataset. The total duration of all types of cheating is 7,235 seconds. The most frequent cheat behaviors are type 2 and then type 1, summing up to a total of 84% of all cheat activities. The total number of cheat behaviors performed by all subjects is equal to 569 instances, varying in the type and duration of cheating.

The five cheat types defined in our system cover all kinds of cheating behaviors we could manually identify in the collected OEP dataset. It is reasonable to assume that they are also the most common cheating techniques in the real world. Note that the techniques used within a specific type can vary from one subject to another, increasing the level of difficulty in detecting some of the instances. For example, some students may open a book in front of them to cheat from, while others hide the book behind the computer screen or below the desk, introducing partial occlusion. Moreover, some students talk in a room with another person asking for help where both are visible in the webcam, while others might speak with another person who is not visible in either of the two cameras. Some speak with a low voice (i.e., whispering) while others speak normally. Many other variations are also present in this dataset, since we did not constrain the subjects in how to cheat.

Note that the SVM cheat classifier combines cheat types 2 and 4 into one class (i.e., Class 2), since both types involve speech. Moreover, cheat type 3 is not detected by the SVM cheat classifier; instead, we detect it by the active window detection module, which delivers an immediate cheat decision, as seen in Fig. 2.
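The conversion from the annotated cheat instances (start time, end time, type) to per-segment binary labels follows the majority-vote rule of Section III-H and Fig. 6, and can be sketched as below. The window length and overlap values are placeholders; the paper only requires s to be an odd integer number of seconds so that ties cannot occur.

import numpy as np

def per_second_labels(duration_s, instances):
    """instances: list of (start_s, end_s, cheat_type); returns one binary label per second."""
    labels = np.zeros(int(duration_s), dtype=int)
    for start, end, _ in instances:
        labels[int(start):int(end)] = 1
    return labels

def segment_labels(second_labels, s=5, overlap=0.8):
    """Slide a window of s seconds with the given overlap; label each segment by majority vote."""
    step = max(1, int(round(s * (1 - overlap))))
    out = []
    for start in range(0, len(second_labels) - s + 1, step):
        win = second_labels[start:start + s]
        out.append((start, int(win.sum() * 2 > s)))   # majority vote; odd s avoids ties
    return out

labels = per_second_labels(1020, [(100, 140, 2), (600, 630, 1)])  # e.g., a 17-minute session
segments = segment_labels(labels)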


Note that the SVM cheat classifier combines cheat types 2 and 4 into one class (i.e., Class 2), since both types involve speech. Moreover, cheat type 3 is not detected by the SVM cheat classifier; instead, we detect it by the active window detection module, which delivers an immediate cheat decision as seen in Fig. 2.

V. EXPERIMENTAL RESULTS

In this section, we design experiments to answer the following questions: 1) How well can the system detect cheating? 2) How do different feature sets affect the performance? 3) What is the detectability of each cheat type? 4) Is there any correlation between the six components of the OEP system? 5) What is the system efficiency at a component and system level? We now discuss different aspects of our experiments. We start by explaining the evaluation procedure. Then we analyze the individual performance of a couple of basic components of our OEP system. After that, we test the performance of the entire OEP system. Finally, we describe the OEP system efficiency.

A. Performance Evaluation

We define two metrics to evaluate the OEP system, a segment-based metric and an instance-based metric, with an example in Fig. 10. The segment-based metric evaluates the estimated classifier decisions at the segment level, which is the most straightforward measurement of the classification accuracy. A cheating instance is defined for the entire duration of one continuous cheating behavior, regardless of how long it is. The instance-based metric evaluates the detection accuracy based on the unit of cheating instance. Therefore, it is the "perceived" system accuracy of the user, and can answer questions such as "if a test taker cheats 10 times, how many times can OEP detect?" Both segment- and instance-based metrics are represented by True Detection Rate (TDR) and False Alarm Rate (FAR), but computed in different ways.

Fig. 10: An example of segment and instance-based metrics.

a) Segment-based metric: For the segment-based metric, TDR is calculated by:

$$\mathrm{TDR} = \frac{\sum_i \#\,\text{detected cheating segments of subject } i}{\sum_i \#\,\text{groundtruth cheating segments of subject } i}, \qquad (4)$$

where $i$ denotes the test subject ID. Since it is also important to not claim that a test taker is cheating when he/she is not, we compute FAR by:

$$\mathrm{FAR} = \frac{\sum_i \#\,\text{false cheat segments of subject } i}{\sum_i \#\,\text{cheat-free segments of subject } i}. \qquad (5)$$

b) Instance-based metric: As illustrated in Fig. 10, to compute the instance-based metric, we filter the segment-based classification results in the following way. If more than 50% of the segments, regardless of their relative locations, within a cheating instance are correctly classified as cheating, this is a correctly detected instance. Otherwise, it is a miss detection at the instance level. The TDR in the instance-based metric is defined as:

$$\mathrm{TDR} = \frac{\sum_i \#\,\text{detected cheating instances of subject } i}{\sum_i \#\,\text{cheating instances of subject } i}. \qquad (6)$$

To evaluate false alarms in the instance-based metric, as long as the number of consecutively detected false cheat segments is over $s_f$, we define this as a falsely detected instance, regardless of its length. Since the instances within the cheat-free portion of the session are not well defined, we compute FAR w.r.t. the total length (in minutes) of cheat-free videos. Finally, the FAR in the instance-based metric is defined as:

$$\mathrm{FAR} = \frac{\sum_i \#\,\text{false cheat instances of subject } i}{\sum_i \#\,\text{cheat-free minutes of subject } i}. \qquad (7)$$
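The sketch below shows how Eqs. (4)-(7) could be computed from per-segment predictions. It is an illustrative Python rendering (the paper's pipeline is implemented in C++/Matlab, as noted in the System Efficiency section), and the handling of a false-alarm run "over $s_f$" reflects our reading of the text rather than the authors' exact code.

```python
import numpy as np

def segment_metrics(preds, gts):
    """Segment-based TDR (Eq. 4) and FAR (Eq. 5), pooled over subjects.
    preds, gts: one binary numpy array per subject (1 = cheating segment)."""
    det = sum(int(np.sum(p[g == 1])) for p, g in zip(preds, gts))
    pos = sum(int(np.sum(g == 1)) for g in gts)
    fa = sum(int(np.sum(p[g == 0])) for p, g in zip(preds, gts))
    neg = sum(int(np.sum(g == 0)) for g in gts)
    return det / pos, fa / neg

def instance_tdr(preds, instance_segments):
    """Instance-based TDR (Eq. 6): an instance counts as detected when more
    than 50% of its segments are classified as cheating.
    instance_segments: per subject, a list of index arrays, one per instance."""
    detected = total = 0
    for pred, instances in zip(preds, instance_segments):
        for idx in instances:
            total += 1
            detected += int(np.mean(pred[idx]) > 0.5)
    return detected / total

def instance_far(preds, gts, cheat_free_minutes, sf=3):
    """Instance-based FAR (Eq. 7): a run of consecutive false-alarm segments
    longer than sf counts as one falsely detected instance (our reading)."""
    false_instances = 0
    for pred, gt in zip(preds, gts):
        run = 0
        for p, g in zip(pred, gt):
            run = run + 1 if (p == 1 and g == 0) else 0
            if run == sf + 1:      # count each run once, when it first exceeds sf
                false_instances += 1
    return false_instances / sum(cheat_free_minutes)
```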
B. Basic Component Analysis

In this section we demonstrate the accuracy of the two individual components, text and speech detection, which are the most important ones among all six components. The other components are evaluated along with the entire OEP system in the remaining sections. First of all, we set the parameters used in the six basic components as the following: τ_0 = 3, τ_v = 0.9, k = 1, d = 0.6 meters, θ_g = π/4, τ_l = 15,000, τ_s = 5,000 and s_m = 50. All experimental results reported in this section are evaluated with a 5-fold cross-validation on the positive and negative training samples as seen in Fig. 4 for text, and Table I for speech.

Kernel                 | Dim=50  | Dim=100 | Dim=200
linear                 | 86.85%  | 85.63%  | 88.09%
quadratic polynomial   | 73.55%  | 73.61%  | 74.53%
cubic polynomial       | 78.61%  | 81.91%  | 74.50%
radial basis function  | 93.38%  | 93.43%  | 94.25%
sigmoid                | 81.51%  | 83.94%  | 82.74%
TABLE II: Accuracy of classifying the validation data using SVM with different kernel functions and PCA dimensions.

γ    | Dim=50  | Dim=100 | Dim=200
0.1  | 83.94%  | 85.21%  | 84.73%
1    | 92.63%  | 93.82%  | 93.02%
5    | 94.23%  | 92.18%  | 93.38%
10   | 88.90%  | 90.12%  | 88.88%
TABLE III: Accuracy of classifying the validation data using RBF kernel with different γ and PCA dimensions.

1) Text detection analysis: In text detection, the key parameters are the PCA dimensionality and the type of SVM kernel. Different choices of the parameters will affect the text detection performance. Using a two-class SVM [5], Table II illustrates the detection performance on the validation dataset, with different PCA dimensions and types of SVM kernel.


Note that reducing the dimensionality does not significantly reduce the detection performance. From this table, we see that the radial basis function (RBF) performs better than other kernels. Since the RBF kernel relies on a good choice of γ, we tested the detection performance using the RBF kernel with different γ values as seen in Table III. It appears that using the SVM with RBF kernel (γ = 5) performs best on the validation dataset, where the feature dimension has been reduced to 50. We use these specific parameters in our final OEP system.
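As a rough illustration of this validation sweep (PCA projection followed by an SVM with different kernels and γ values, cf. Tables II and III), the sketch below uses scikit-learn; the paper's own implementation is C++ with LIBSVM, and the text-region feature extraction is assumed to be given. For simplicity, PCA is fit on all samples here rather than inside each fold.

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def validate_text_classifier(X, y, dim=50, kernel="rbf", gamma=5.0):
    """Project text-region features to `dim` dimensions with PCA, then report
    the mean 5-fold cross-validation accuracy of an SVM.
    X: (n_samples, n_features) feature matrix; y: 0/1 text vs. non-text."""
    Xp = PCA(n_components=dim).fit_transform(X)
    clf = SVC(kernel=kernel, gamma=gamma)
    return cross_val_score(clf, Xp, y, cv=5).mean()

# Illustrative sweep over kernels and PCA dimensions; the numbers reported in
# Tables II and III come from the authors' own validation split.
# for dim in (50, 100, 200):
#     for kernel in ("linear", "poly", "rbf", "sigmoid"):
#         print(dim, kernel, validate_text_classifier(X_train, y_train, dim, kernel))
```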
the second method, we compute the 3 × 3 covariance matrix
radial basis function (RBF) performs better than other kernels.
from each OEP component independently, and extract a 6-
Since the RBF kernel relies on a good choice of γ, we tested
dimensional covariance feature due to symmetry. Concatenat-
the detection performance using RBF kernel with different γ
ing that of all five components (i.e., excluding vph ) results
values as seen in Table III. It appears that using the SVM with
with a 30-dimensional covariance feature C̄. The difference
RBF kernel (γ = 5) performs best on the validation dataset,
between C and C̄, is that C has the ability to highlight, if
where the feature dimension has been reduced to 50. We use
any, the correlation across the five OEP components, whereas
these specific parameters in our final OEP system.
C̄ only finds the correlation within the statistics of each
2) Speech detection analysis: We first analyze the speech
component. When comparing two types of covariance features
detection performance with different acoustic segment lengths
in cheat classification, we observe that the first one, C,
Ls . The testing results in Table IV illustrates that the larger the
achieves higher classification accuracy, which indicates that
segment size, the higher accuracy can be achieved. The reason
incorporating cross-component correlation in the high-level
is that the longer audio segment carries more information
feature benefits cheat classification.
about speech. However, in a real-world situation the longer
The covariance feature C has a total of 120 dimensions.
the segments are, the more likely the short speech instances
Within this large feature pool, which individual features are
will miss detection. To balance between these two cases, we
most relevant (or important) to the cheat classification task? To
choose the fixed duration Ls as 500ms with a 100ms shift.
answer this question, we apply an AdaBoost feature selection
In order to choose the best kernel, we train the SVM
technique [33] to select the most discriminative features among
classifiers using different kernels, and Table V gives the testing
all elements of C. Given the training data in each of three
accuracy. From this table, we can see that the cubic polynomial
trials, Adaboost selects the top 40 features from the 120
function performs best over other kernels. Moreover, we test
features of C. By repeating this for all three trials, we count
the performance of SVM using cubic polynomial kernel with
how many times a feature has been selected, and normalize the
different γ values, and it appears γ = 0.0072 generates the
counts by subtracting the minimum count and dividing with
highest accuracy on the testing sound samples.
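The framing choice above (500 ms segments with a 100 ms shift) and the selected kernel can be sketched as follows. This is an illustrative Python rendering only; the per-segment acoustic feature extraction, described earlier in the paper, is left abstract here, and the variable names are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def frame_audio(signal, sr, frame_len=0.5, shift=0.1):
    """Split a mono audio signal (numpy array sampled at `sr` Hz) into
    overlapping segments of length L_s = 500 ms with a 100 ms shift."""
    n, step = int(frame_len * sr), int(shift * sr)
    return np.stack([signal[i:i + n]
                     for i in range(0, len(signal) - n + 1, step)])

# Cubic-polynomial SVM on per-segment acoustic features (features omitted here).
speech_clf = SVC(kernel="poly", degree=3, gamma=0.0072)
# speech_clf.fit(segment_features, segment_is_speech_labels)
```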
C. OEP System Analysis

a) Experimental setup: All experiments are based on partitioning the dataset into two equal folds in the subject space for training and testing, while keeping the numbers of real and acting test takers equal between the two folds. This partition is repeated in three trials while maintaining the distribution of real vs. acting subjects. All reported results in the remaining section are based on the average of three trials. We set the window size s to 5 seconds, and the window shifts by 1 second, which corresponds to an 80% overlap between consecutive segments. We set s_f to 3 for computing the FAR in the instance-based metric. The multi-class SVM of the cheat classifier uses a linear kernel with the cost set to 10 [5], and all other parameters are set to the default values by LIBSVM.
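For concreteness, an equivalent configuration of this multi-class cheat classifier might look like the following. scikit-learn's SVC wraps LIBSVM, but the variable names and training interface here are illustrative rather than the authors' code.

```python
from sklearn.svm import SVC

# Four classes: normal behavior plus text-, speech-, and phone-related cheating
# (cheat types 2 and 4 are merged, and type 3 is handled separately by the
# active window detection module, as described earlier).
cheat_clf = SVC(kernel="linear", C=10)
# cheat_clf.fit(window_features, window_class_labels)
```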
b) Feature analysis: We start by analyzing the characteristics of the covariance features in the context of cheat classification performance. First of all, we attempt to compare two different methods for computing the covariance features. The first method is to compute C as in Section III-H1. In the second method, we compute the 3 × 3 covariance matrix from each OEP component independently, and extract a 6-dimensional covariance feature due to symmetry. Concatenating that of all five components (i.e., excluding v_ph) results in a 30-dimensional covariance feature C̄. The difference between C and C̄ is that C has the ability to highlight, if any, the correlation across the five OEP components, whereas C̄ only finds the correlation within the statistics of each component. When comparing the two types of covariance features in cheat classification, we observe that the first one, C, achieves higher classification accuracy, which indicates that incorporating cross-component correlation in the high-level feature benefits cheat classification.

The covariance feature C has a total of 120 dimensions. Within this large feature pool, which individual features are most relevant (or important) to the cheat classification task? To answer this question, we apply an AdaBoost feature selection technique [33] to select the most discriminative features among all elements of C. Given the training data in each of the three trials, AdaBoost selects the top 40 features from the 120 features of C. By repeating this for all three trials, we count how many times a feature has been selected, and normalize the counts by subtracting the minimum count and dividing by the difference between the maximum and minimum counts. This leads to the importance of correlation map in Fig. 11.

Fig. 11: Comparing the importance of different correlations among the five OEP components (facePSR, textProb, voiceProb, gazeLR, gazeUD).

Some important observations can be made: (1) The voice detection component has a significant role in detecting cheat behaviors when combined with the gaze estimation. This means that when a test taker cheats by asking a friend in the room, or by talking on the phone, he/she tends to change his head gaze direction. The same applies to the text detection and gaze components. (2) The inner-correlation of the components is observed on the diagonal of Fig. 11, where the text, speech, and PSR vectors have high importance in the OEP system. The importance of the face PSR component is also relatively high, which is understandable, because for a test taker to cheat, this normally requires him to stop looking directly at the screen (i.e., webcam), and hence the PSR value changes accordingly. (3) A large number of correlations across components tend to have no importance, and therefore do not provide useful information in detecting cheat behaviors, such as facePSR & voiceProb and voiceProb & textProb.
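The two covariance features compared above can be sketched as follows. The exact construction of C is given in Section III-H1 (not reproduced here); the sketch reflects one consistent reading of the dimensions quoted above, namely that each temporal window yields a (T, 15) matrix of per-frame statistics (five components contributing three statistics each), so that the symmetric 15 × 15 covariance gives 120 unique values and the per-component 3 × 3 covariances give 6 values each.

```python
import numpy as np

def full_covariance_feature(W):
    """C: covariance over the concatenated per-frame statistics of the five
    components. W is assumed to be a (T, 15) array for one temporal window;
    the upper triangle of the 15 x 15 covariance yields 120 values."""
    cov = np.cov(W, rowvar=False)                  # 15 x 15
    return cov[np.triu_indices(cov.shape[0])]      # 120 values

def per_component_covariance_feature(W):
    """C-bar: a 3 x 3 covariance per component (6 values each by symmetry),
    concatenated over the five components into 30 dimensions."""
    feats = []
    for c in range(5):
        block = W[:, 3 * c:3 * (c + 1)]            # (T, 3) slice per component
        cov = np.cov(block, rowvar=False)          # 3 x 3
        feats.append(cov[np.triu_indices(3)])      # 6 values
    return np.concatenate(feats)                   # 30 values
```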


Fig. 12: Segment-based performance comparison (TDR vs. FAR) of the binary SVM vs. the multi-class SVM.

Fig. 13: Performance comparison (TDR vs. FAR) when using various window sizes (s = 3, 5, 7 sec.).

Fig. 14: Instance-based performance evaluation (instance-based TDR vs. instance-based FAR in cheats per minute).

Cheating type     | Detection rate | FAR distribution
Text detection    | 85.8%          | 50.8%
Speech detection  | 89.3%          | 43.9%
Phone detection   | 100.0%         | 5.3%
TABLE VI: Error analysis of the OEP system at 2% FAR.

c) OEP system results: In our cheat detection classifier, we observe that using a multi-class SVM achieves a higher performance compared to the two-class SVM, as shown in Fig. 12. This is partly because the positive class (i.e., all cheating behaviors) contains extremely diverse types of cheating, varying from reading text to verbally asking through speech events, which implies huge variations in the feature space. Hence, it is challenging to find a single hyperplane to best discriminate the negative class from the positive class. In contrast, using a four-class SVM defines multiple hyperplanes to better separate the four classes locally, which results in better overall cheating vs. non-cheating classification.

We further explore the temporal segments by changing the window size s as shown in Fig. 13. Here s is assigned to be 3, 5, or 7 seconds. We avoid selecting s > 7 because the majority of cheat behaviors tend to be short in duration. Note that the best performance is achieved when s = 5 seconds, with a TDR of 0.87 ± 0.03 at an FAR of 0.02.

Using the experimental setup based on the best parameters, we evaluate our system using the instance-based metric. The result is illustrated in Fig. 14. We see that our OEP system is able to detect cheating at an instance-based TDR of 0.80 ± 0.04 and an FAR of 0.2 cheats per minute. This means that on average only one false alarm occurs per five minutes of a normal, cheat-free exam.

Given the best results in the segment-based metric of Fig. 13, we are interested in what types of cheating behavior constitute the missed detection error and the false alarm error. Based on the ground truth labels of the segments, we can categorize each wrongly classified segment (either a missed detection or a false alarm) into one of the three cheat classes, and illustrate the results in Table VI. We realize that speech detection performs better than text detection, at a TDR equal to 89.3% with the FAR set to 2%. This is expected: detecting text from the wearcam is very challenging due to resolution, lighting and perspective distortion. It is also found that the number of false alarms related to text is higher than for speech. The phone detection has been shown to work accurately for detecting the cheat instances when test takers use the phone. Part of the reason is the limited phone-based cheat samples in the database: only 4% of cheat instances (23 cases for training and testing), as seen from Fig. 9. On the other hand, introducing the phone detection module to the system is accompanied with false alarms equal to 5.3% of all FAR. We show the entire classification results of two subjects in Fig. 15 and some of the system failure cases in Fig. 16.

D. Performance of Human Proctoring

Human proctoring is the most common approach of validating online exams nowadays, by monitoring the test taker visually and acoustically via a webcam. In order to assess its performance and contrast it with our OEP system, we conduct an experiment imitating a human proctoring system, similar to the services offered by ProctorU. All testing videos used in our system were provided to three different people with experience in teaching, along with a graphical user interface (GUI) designed to manually record the cheating instances, as shown in Fig. 17. The GUI contained only one button, which toggles between Cheat and Stopped cheating when clicked. The proctor had the ability to run one or two videos at the same time to imitate a real proctoring environment where one proctor usually "watches" multiple tests simultaneously. The proctors were not given any instructions other than to click the cheat button at the beginning of a cheat behavior, and to click again at the end of that same behavior.

Fig. 17: The GUI for human proctoring used by three proctors.

After collecting the results of the three proctors, we compare them with the results of the OEP system individually, as well as jointly in two different schemes: (a) the majority of the proctors' decisions (i.e., two out of three need to agree), and (b) the intersection of the proctors' decisions (i.e., all three need to agree).
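The two fusion schemes can be made concrete with a small sketch: if each proctor's annotations are rendered as a per-second 0/1 timeline, "majority" keeps the seconds flagged by at least two of the three proctors, and "intersection" keeps those flagged by all three. This is an illustrative rendering, not the authors' evaluation code.

```python
import numpy as np

def fuse_proctors(timelines, scheme="majority"):
    """timelines: three equal-length 0/1 numpy arrays, one per proctor,
    sampled once per second (1 = labeled as cheating)."""
    votes = np.vstack(timelines).sum(axis=0)
    need = 2 if scheme == "majority" else len(timelines)
    return (votes >= need).astype(int)

# Total cheat time (in seconds) under each scheme:
# majority_time = fuse_proctors([p1, p2, p3], "majority").sum()
# intersection_time = fuse_proctors([p1, p2, p3], "intersection").sum()
```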


Fig. 15: Results of Subject 10 (top) and Subject 16 (bottom) based on the chosen threshold that produces a segment-based FAR of 0.2. Each panel plots the estimated cheat probability, the groundtruth cheat label, and the threshold over time in seconds. Subject 10 cheats 15 times during the exam, while 2 of them are not detected. Subject 16 cheats 10 times, where 3 of them are not detected along with 1 false alarm. Best viewed in color.

Fig. 16: Failure examples of the OEP system, showing frames from both the webcam and wearcam along with the estimated cheat probability over a specific duration of the test, illustrated by the x-axis. (a, b, c) represent cases where the OEP struggles to recognize cheat activities of types 1, 2, and 4, respectively. (d, e, f) are false alarms where the system claims the subjects are cheating, but the ground truth reflects otherwise.

Table VII shows the total cheat time labeled by the proctors, the segment TDR and FAR, and the instance TDR and FAR. Based on the ground truth labeling, the testing videos contain cheating behaviors for a total time of 3,199 seconds. It is clear that the human proctors reported cheating durations much larger than the actual cheat time, which is reflected negatively in the FAR measurements. Part of the reason is the slow reaction of humans towards switching on/off the cheat duration. Typically, ∼2 seconds are needed before confirming that the student has started/ended the cheating behavior. Furthermore, human proctors lose their attention span in some parts of the proctoring session, which leads to lower TDR. Note that the OEP results are chosen at an operating point where the TDR is the most similar to the human performance of "Majority", which appears to be the best among all human performances. In general, when achieving the same TDR, the OEP system can maintain a lower FAR than the human proctors. We recognize that, in this comparison, the precise onset and offset locations of a cheating duration matter, which may not be the case in a real-world scenario, and that would change the comparison accordingly.

Results of    | Cheat time (s) | Segment TDR | Segment FAR | Instance TDR | Instance FAR
Proctor 1     | 4,567          | 0.86        | 0.13        | 0.88         | 0.90
Proctor 2     | 4,728          | 0.85        | 0.14        | 0.83         | 0.69
Proctor 3     | 5,504          | 0.71        | 0.22        | 0.72         | 0.77
Majority      | 4,650          | 0.87        | 0.13        | 0.85         | 0.77
Intersection  | 2,758          | 0.60        | 0.06        | 0.58         | 0.33
OEP           | 2,958          | 0.87        | 0.02        | 0.85         | 0.42
TABLE VII: Comparison of human proctoring and the OEP system.

E. System Efficiency

The six OEP basic components are all implemented in C++. The high-level feature extraction and cheat classification are implemented in Matlab. Table VIII shows the system efficiency breakdown in frames per second (FPS), while the system runs on a personal desktop computer with Windows 8 (Intel i5 CPU at 3.0 GHz with 8 GB RAM). It can be observed that the computation cost of Stage 2 cheat detection is negligible compared to that of the basic components. Among the six basic components, text detection is the slowest one, which requires


238 ms per frame. Based on these costs, if a test taker takes an exam for 1 minute, our OEP system would require a total of ∼6 minutes to finish processing the videos from the two cameras along with the audio. Note that this 6X slower-than-real-time speed is based on the assumption that all six basic components process every frame of 25 FPS videos. In reality, it is very likely that we may process at a lower frame rate, yet still maintain similar detection performance, since, for example, the test taker would need a few seconds in text-based cheating.

Stage 1 - Basic component | FPS
User verification         | 10
Text detection            | 4
Speech detection          | 25
Window detection          | 1,000
Gaze estimation           | 175
Phone detection           | 37

Stage 2 - Cheat detection | FPS per seg.
Features extraction       | 1,816
Cheat classification      | 932

TABLE VIII: Efficiency of basic components, feature extraction and classification of the OEP system.
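A quick back-of-the-envelope check of the ∼6X figure above (our own arithmetic, not taken from the paper): the dominant cost is text detection, and treating the cheaper components as comparatively negligible or as running concurrently is our simplifying assumption.

```python
# One minute of 25 FPS video yields 1,500 frames. Text detection alone,
# at roughly 238 ms per frame (about 4 FPS in Table VIII), already accounts
# for about 6 minutes of processing.
video_fps, exam_seconds = 25, 60
frames = video_fps * exam_seconds               # 1,500 frames
text_cost_per_frame = 0.238                     # seconds per frame
print(frames * text_cost_per_frame / 60.0)      # ~5.95 minutes
```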
VI. DISCUSSION

The main contribution of this work is to present a comprehensive framework for online exam proctoring. While we have achieved good performance in our evaluation, our framework can certainly be improved in a number of ways. For the basic components, we can apply more advanced algorithms for each component, such as deep learning-based feature representation, typing-based continuous authentication [25], [26], face alignment-based pose estimation [12], [18], [19], upper body alignment [20], and model personalization [6]. We may also expand the array of basic components, to include additional components such as pen detection. For cheat classification, we can explore temporal-spatial dynamic features, similar to the work in video-based activity recognition [36]. Moreover, the system efficiency can also be improved while maintaining a high accuracy in recognizing cheat events, as suggested in [11], by selecting more suitable features and classifiers, as well as selecting a smaller number of frames instead of utilizing all frames.

We recognize that there always exists a possibility that concealed cheating activities might happen outside the fields of view of both cameras. To remedy this, our system plans to generate random commands, such as asking the test taker to look around or under the desk to check the surrounding environment of the exam. To detect whether the test taker has tampered with the sensors, once in a while our system can display a simple icon on the computer screen to validate that the wearcam can "see" it, or play a quick sound clip to validate that the microphone can "hear" it. The randomness of such commands and interventions will likely make our system more robust against deliberate cheating behavior.

Note that the definition of cheating behavior depends on the context of the exam, such as an oral exam, an open-book exam, etc. Our proposed hybrid two-stage algorithm enables the user to take such exam context into consideration. The six basic components extracted in the first stage can be considered as system building blocks, which are reconfigurable based on the context of the exam and the test taker's preference. For example, if the exam is an open-book exam, the OEP system should exclude the text detection component. Some other types of exam might require the test taker to talk, such as oral exams, and hence removing the speech detection component is necessary.
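As a purely illustrative sketch of this reconfigurability (the component names and exam profiles below are placeholders, not the authors' API), the first-stage components could be toggled per exam context:

```python
# Hypothetical exam profiles mapping an exam context to the set of enabled
# first-stage components; open-book exams drop text detection and oral exams
# drop speech detection, as discussed above.
ALL_COMPONENTS = {"user_verification", "text_detection", "speech_detection",
                  "window_detection", "gaze_estimation", "phone_detection"}

EXAM_PROFILES = {
    "closed_book": set(ALL_COMPONENTS),
    "open_book":   ALL_COMPONENTS - {"text_detection"},
    "oral":        ALL_COMPONENTS - {"speech_detection"},
}

def active_components(exam_type):
    """Return the enabled components for a given exam context."""
    return EXAM_PROFILES.get(exam_type, set(ALL_COMPONENTS))
```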
Even with all the aforementioned system enhancements, it is possible that the automatic OEP system might not achieve perfect performance (i.e., detecting all cheating behaviors with no false alarms). We note that even in traditional classroom proctoring, it is likely that the proctor will fail to detect some cheating behaviors, due to either the attention span of the proctor or a highly concealed action. Therefore, as long as OEP can capture the majority of cheating behaviors with a reasonably small false alarm rate, it will be a useful contribution to online education. Furthermore, we may also allow humans to manually inspect the instances with a high probability of cheating from our system. For example, setting a proper threshold in Fig. 15 detects all such instances. This manual inspection helps to verify the true detections, as well as suppress the false alarms. Hence, the combination of using OEP to detect likely cheating instances within the entire session, and the manual inspection of a very small subset of data, can achieve an excellent trade-off between system accuracy and cost. Finally, as visual analysis technology progresses, it is obvious that the workload of manual inspection will become less and less.

VII. CONCLUSIONS

This paper presents a multimedia analytics system for online exam proctoring, which aims to maintain academic integrity in e-learning. The system is affordable and convenient to use from the test taker's perspective, since it only requires having two inexpensive cameras and a microphone. With the captured videos and audio, we extract low-level features from six basic components: user verification, text detection, speech detection, active window detection, gaze estimation, and phone detection. These features are then processed in a temporal window to acquire high-level features, which are then used for cheat detection. Finally, with the collected database of 24 test takers representing real-world behaviors in online exams, we demonstrate the capabilities of the system, with nearly an 87% segment-based detection rate across all types of cheating behaviors at a fixed FAR of 2%. These promising results warrant further research on this important behavior recognition problem and its educational application.

VIII. ACKNOWLEDGEMENTS

This project was sponsored in part by the Michigan State University Targeted Support Grants for Technology Development (TSGTD) program. The authors thank the anonymous volunteers for participating in our OEP database collection. The authors also thank Baraa Abu Dalu, Muath Nairat, and Mohammad Alrwashdeh for contributing to the study of human proctoring.


REFERENCES

[1] I. E. Allen and J. Seaman. Grade change: Tracking online education in the United States, 2013. Babson Survey Research Group and Quahog Research Group, LLC. Retrieved on, 3(5), 2014.
[2] Y. Atoum, S. Srivastava, and X. Liu. Automatic feeding control for dense aquaculture fish tanks. IEEE Signal Processing Letters, 22(8):1089–1093, 2015.
[3] D. L. Baggio. Enhanced human computer interface through webcam image processing library. Natural User Interface Group Summer of Code Application, pages 1–10, 2008.
[4] S. V. Bailey and S. V. Rice. A web search engine for sound effects. In Audio Eng. Society Convention 119. Audio Eng. Society, 2005.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM T-TIST, 2(3):27, 2011.
[6] J. Chen and X. Liu. Transfer learning with one-class data. Pattern Recognition Letters, 37:32–40, 2014.
[7] G. Cluskey Jr, C. R. Ehlen, and M. H. Raiborn. Thwarting online exam cheating without proctor supervision. Journal of Academic and Business Ethics, 4:1–7, 2011.
[8] P. Guo, H. Feng Yu, and Q. Yao. The research and application of online examination and monitoring system. In IT in Medicine and Education, 2008. IEEE Int. Sym. on, pages 497–502, 2008.
[9] D. Hoiem, Y. Ke, and R. Sukthankar. SOLAR: Sound object localization and retrieval in complex audio environments. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), volume 5, pages 429–432, 2005.
[10] H. Hung and D. Gatica-Perez. Estimating cohesion in small groups using audio-visual nonverbal behavior. IEEE Trans. Multimedia, 12(6):563–575, 2010.
[11] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Trans. Multimedia, 17(8):1174–1186, 2015.
[12] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. In Proc. Int. Conf. Computer Vision (ICCV), pages 3694–3702, 2015.
[13] I. Jung and H. Yeom. Enhanced security for online exams using group cryptography. IEEE Trans. Education, 52(3):340–349, 2009.
[14] V. Kilic, M. Barnard, W. Wang, and J. Kittler. Audio assisted robust visual tracking with adaptive particle filtering. IEEE Trans. Multimedia, 17(2):186–200, 2015.
[15] D. L. King and C. J. Case. E-cheating: Incidence and trends among college students. Issues in Information Systems, 15(1), 2014.
[16] I. Lefter, L. J. Rothkrantz, and G. J. Burghouts. A comparative study on automatic audio-visual fusion for aggression detection using meta-information. Pattern Recognition, 34(15):1953–1963, 2013.
[17] X. Li, K.-M. Chang, Y. Yuan, and A. Hauptmann. Massive open online proctor: Protecting the credibility of MOOCs certificates. In ACM CSCW, pages 1129–1137. ACM, 2015.
[18] X. Liu. Video-based face model fitting using adaptive active appearance model. Image and Vision Computing, 28(7):1162–1172, July 2010.
[19] X. Liu, N. Krahnstoever, T. Yu, and P. Tu. What are customers looking at? In Proc. IEEE Conf. Advanced Video and Signal Based Surveillance (AVSS), pages 405–410, London, UK, Sept. 2007.
[20] X. Liu, T. Yu, T. Sebastian, and P. Tu. Boosted deformable model for human body alignment. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–8, Anchorage, Alaska, June 2008.
[21] L. Nguyen, D. Frauendorfer, M. Mast, and D. Gatica-Perez. Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior. IEEE Trans. Multimedia, 16(4):1018–1031, 2014.
[22] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001.
[23] M. Reale, S. Canavan, L. Yin, K. Hu, and T. Hung. A multi-gesture interaction system using a 3-D iris disk model for gaze estimation and an active appearance model for 3-D hand pointing. IEEE Trans. Multimedia, 13(3):474–486, 2011.
[24] W. Rosen and M. Carr. An autonomous articulating desktop robot for proctoring remote online examinations. In Frontiers in Education Conf., 2013 IEEE, pages 1935–1939, 2013.
[25] J. Roth, X. Liu, and D. Metaxas. On continuous user authentication via typing behavior. IEEE Trans. Image Process., 10:4611–4624, Oct. 2014.
[26] J. Roth, X. Liu, A. Ross, and D. Metaxas. Investigating the discriminative power of keystroke sound. IEEE Trans. Inf. Forens. Security, 10(2):333–345, 2015.
[27] D. Sadlier and N. O'Connor. Event detection in field sports video using audio-visual features and a support vector machine. IEEE Trans. Circuits Sys. Video Technol., 15(10):1225–1233, 2005.
[28] M. Savvides, B. V. Kumar, and P. Khosla. Face verification using correlation filters. 3rd IEEE Automatic Identification Advanced Technologies, pages 56–61, 2002.
[29] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 593–600, 1994.
[30] E. Spriggs, F. De la Torre, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pages 17–24, 2009.
[31] A. Tsukada, M. Shino, M. Devyver, and T. Kanade. Illumination-free gaze estimation method for first-person vision wearable device. In Proc. Int. Conf. Computer Vision (ICCV) Workshops, pages 2084–2091, 2011.
[32] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 30(10):1713–1727, 2008.
[33] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 511–518, 2001.
[34] A. Wahid, Y. Sengoku, and M. Mambo. Toward constructing a secure online examination system. In Proc. of the 9th Int. Conf. on Ubiquitous Information Management and Communication, page 95. ACM, 2015.
[35] B. Xiao, P. Georgiou, B. Baucom, and S. Narayanan. Head motion modeling for human behavior analysis in dyadic interaction. IEEE Trans. Multimedia, 17(7):1107–1119, 2015.
[36] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, and T. Chen. Spatio-temporal phrases for activity recognition. In Proc. European Conf. Computer Vision (ECCV), pages 707–721, Florence, Italy, Oct. 2012.

Yousef Atoum is a Ph.D. student in the Department of Electrical and Computer Engineering at Michigan State University. He received a B.S. degree in Computer Engineering from Yarmouk University, Jordan, in 2009, and an M.S. degree in Electrical and Computer Engineering from Western Michigan University, in 2012. His research interests include object tracking, computer vision and pattern recognition.

Liping Chen is a Ph.D. student in the Department of Mathematics and an M.S. student in the Department of Computer Science and Engineering at Michigan State University. He received his B.S. degree in Information and Computational Mathematics from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2006. His research interests include numerical analysis, tensor eigenvalues, and computer vision.

Alex X. Liu received his Ph.D. degree in Computer Science from The University of Texas at Austin in 2006. He received the IEEE & IFIP William C. Carter Award in 2004, a National Science Foundation CAREER award in 2009, and the Michigan State University Withrow Distinguished Scholar Award in 2011. He is an Associate Editor of IEEE/ACM Transactions on Networking, an Associate Editor of IEEE Transactions on Dependable and Secure Computing, and an Area Editor of Computer Communications. He received Best Paper Awards from ICNP-2012, SRDS-2012, and LISA-2010. His research interests focus on networking and security.


Stephen D. H. Hsu is Vice President for Research and Graduate Studies and Professor of Theoretical Physics at Michigan State University. Hsu graduated from the California Institute of Technology and received his doctorate from the University of California, Berkeley. A Junior Fellow of the Harvard Society of Fellows, he was on the faculty at Yale University before moving to the University of Oregon, where he served as Director of the Institute for Theoretical Science. Hsu's primary work has been in applications of quantum field theory, particularly to problems in cosmology and fundamental physics. He has also made contributions in genomics and bioinformatics, finance, and in encryption and information security. He founded two Silicon Valley companies, SafeWeb and Robot Genius Inc., and is a scientific advisor to BGI Shenzhen.

Xiaoming Liu is an Assistant Professor in the Department of Computer Science and Engineering of Michigan State University. He received the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in 2004. Before joining MSU in Fall 2012, he was a research scientist at General Electric (GE) Global Research. His research interests include computer vision, pattern recognition, biometrics and machine learning. As a co-author, he is a recipient of the Best Industry Related Paper Award runner-up at ICPR 2014, the Best Student Paper Award at WACV 2012 and 2014, and the Best Poster Award at BMVC 2015. He has been an Area Chair for numerous conferences, including FG, ICPR, WACV, ICIP, and CVPR. He is an Associate Editor of the Neurocomputing journal. He has authored more than 100 scientific publications, and has filed 22 U.S. patents.
