
Characterizing Pixel Tracking Through The Lens of Disposable Email Services

This document presents a study that analyzes the use of disposable email services and measures email tracking. The study collected a dataset of over 2.3 million emails from 7 popular disposable email services over 3 months. The goals were to understand how disposable email services are used in practice and potential risks, and to measure the prevalence of email tracking. Key findings include that accounts registered with disposable emails can be easily hijacked, emails are sometimes not disposed of as claimed, and email tracking is highly prevalent, with over 24% of emails containing tracking pixels. Various tracking techniques were observed trying to hide tracking behavior.


Characterizing Pixel Tracking through the Lens of Disposable Email Services


Hang Hu, Peng Peng, Gang Wang
Department of Computer Science, Virginia Tech
{hanghu, pengp17, gangwang}@vt.edu

Abstract—Disposable email services provide temporary email addresses, which allow people to register online accounts without exposing their real email addresses. In this paper, we perform the first measurement study on disposable email services with two main goals. First, we aim to understand what disposable email services are used for, and what risks (if any) are involved in the common use cases. Second, we use the disposable email services as a public gateway to collect a large-scale email dataset for measuring email tracking. Over three months, we collected a dataset from 7 popular disposable email services, which contains 2.3 million emails sent by 210K domains. We show that online accounts registered through disposable email addresses can be easily hijacked, leading to potential information leakage and financial loss. By empirically analyzing email tracking, we find that third-party tracking is highly prevalent, especially in the emails sent by popular services. We observe that trackers are using various methods to hide their tracking behavior such as falsely claiming the size of tracking images or hiding real trackers behind redirections. A few top trackers stand out in the tracking ecosystem but are not yet dominating the market.

I. INTRODUCTION

An email address is one of the most important components of personally identifiable information (PII) on the Internet. Today’s online services typically require an email for account registration and password recovery. Unfortunately, email addresses are often unprotected. For example, email addresses used to register online social networks might be collected by malicious third-parties [45], thus exposing users to spam and spear phishing attacks [40]. Massive data breaches, especially those at sensitive services (e.g., Ashley Madison [22]), can expose user footprints online, leading to real-world scandals. In addition, email addresses are often leaked together with passwords [51], [56], allowing malicious parties to link user identities across different services and compromise user accounts via targeted password guessing [57].

As a result, disposable email services have become a popular alternative which allows users to use online services without giving away their real email addresses. From disposable email services, a user can obtain a temporary email address without registration. After a short period of time, the emails will be disposed of by the service providers. Users can use this disposable email address for certain tasks (e.g., registering an account on a dating website) without linking their online footprints to their real email addresses (e.g., work or personal email). In this way, potential attacks (e.g., spam, phishing, privacy leakage) will be drawn to the disposable addresses instead of the users’ real email accounts. Disposable email services are highly popular. For example, Guerrilla Mail, one of the earliest services, has processed 8 billion emails in the past decade [3].

While disposable email services allow users to hide their real identities, the email communication itself is not necessarily private. More specifically, most disposable email services maintain a public inbox, allowing any user to access any disposable email address at any time [6], [5]. Essentially, disposable email services are acting as a public email gateway to receive emails. The “public” nature not only raises interesting questions about the security of the disposable email service itself, but also presents a rare opportunity to empirically collect email data and study email tracking, a problem that is not well-understood.

In this paper, we have two goals. First, we want to understand what disposable email services are used for in practice, and whether there are potential security or privacy risks involved with using a disposable email address. Second, we use disposable email services as a public “honeypot” to collect emails sent by various online services and analyze email tracking in the wild. Unlike the extensively-studied web tracking [29], [34], [43], [48], [9], [10], [18], email tracking is not well-understood primarily due to a lack of large-scale email datasets. The largest study so far [17] has analyzed emails from 902 “Shopping” and “News” websites. In this paper, we aim to significantly increase the measurement scale and uncover new tracking techniques.

Understanding Disposable Email Services. In this paper, we collect data from 7 popular disposable email services over three months, from October 16, 2017 to January 16, 2018. By monitoring 56,589 temporary email addresses under popular usernames, we collect in total 2,332,544 incoming email messages sent from 210,373 online services and organizations. We are well aware of the sensitivity of email data. In addition to working with IRB, we also take active steps to ensure research ethics such as detecting and removing PII from the email content and removing personal emails. Our analysis reveals key findings about the usage of disposable email services. First, there is often a delay to dispose of the incoming emails. Certain services would hold the emails for as long as 30 days, in spite of the claimed 25 minutes expiration time. Second, we find that users are using disposable email addresses to register accounts in a variety of online services. While the vast majority of emails are spam and notifications,
we did find a large number of emails (89,329) that are used for account registration, sending authentication code, and even password reset. Third, accounts registered via disposable emails are easily hijackable. We find risky usage of disposable email addresses such as registering sensitive accounts at financial services (e.g., PayPal), purchasing bitcoins, receiving scanned documents, and applying for healthcare programs.

Fig. 1: Two types of disposable email addresses. (a) User-specified Address: the user sends “Req: username=”david”” to the disposable email service x.com and obtains david@x.com. (b) Randomly-assigned Address: the user sends “Req: email address” to x.com and obtains tt1hfd5m@x.com.

Measuring Email Tracking. Email tracking involves embedding a small image (i.e., tracking pixel) into the email body to tell a remote server when and where the email is opened by which user. When the email is opened, the email client fetches the pixel and this notifies the trackers. To measure email tracking in the wild, we build a new tool to detect both first-party tracking (where the email sender and the tracker are the same) and third-party tracking (where the email sender and the tracker are different) from the collected email dataset.

We have three key observations. First, email tracking is highly prevalent, especially with popular online services. Out of the 2.3 million emails, 24.6% of them contain at least one tracking link. In terms of sender domains, there are 2,052 sender domains (out of 210K domains in our dataset) ranked within the Alexa top 10K. About 50% of these high-ranked domains perform tracking in their emails. Second, we find that stealthy tracking techniques are universally preferred, either by falsely claiming the size of tracking images in HTML or hiding the real trackers through redirection. Popular online services are significantly more likely to use “stealthy” tracking techniques. Third, although a small number of trackers stand out in the tracking ecosystem, these trackers are not yet dominating the market. The top 10 email trackers are used by 31.8% of the online domains, generating 12% of the tracking emails. This is different from web tracking where one dominating tracker (i.e., Google) can track user visits of 80% of the online services [31].

Contributions. Our work makes three key contributions.
• First, we perform the first measurement study on disposable email services by collecting a large-scale dataset (2.3 million emails) from 7 popular services over 3 months.
• Second, our analysis provides new insights into the common use cases of disposable email services and uncovers the potential risks of certain types of usage.
• Third, we use the large-scale email dataset to empirically measure email tracking in the wild. We show the stealthy tracking methods used by third-party trackers collect data on user identifiers and user actions.

II. BACKGROUND

A. Disposable Email Services

Disposable email services are online web services where users can obtain a temporary email address to receive (or send) emails. After a short usage, the email address and its messages will be disposed of by the service provider. Disposable email services allow users to register an online account without giving away their real email addresses. This helps to disconnect the user’s online activities from her real identity, and avoid attracting spam emails to the real email accounts.

There are two types of disposable email services, based on how temporary email addresses are assigned (Figure 1).
• User-specified Addresses (UA). Most services allow users to specify the username they want to use. For example, a user can obtain a temporary address “david@x.com” by specifying a username “david”. The user-specified address is more memorable for users.
• Randomly-assigned Addresses (RA). Some services create temporary email addresses for users by randomly generating usernames. For example, a user may be assigned to a random address that looks like “tt1hfd5m@x.com”. Users may refresh the web page to receive a different random address each time.

While disposable email services allow users to temporarily use an email address, this email address and the received messages are not necessarily “private”. More specifically, most disposable email services are considered to be public email gateways, which means any user can see other users’ temporary inboxes. For example, if a user A is using david@x.com at this moment, then another user B may also access the inbox of david@x.com at the same time. Very few disposable email services have implemented sandbox mechanisms to isolate each temporary inbox. The only service we find that maintains a private inbox is inboxbear.com, which distinguishes each inbox based on the browser cookie. Therefore, many disposable email services have made it clear on their websites (or Terms of Service) that the email inbox is public and users should not expect privacy [6], [5].

B. Email Tracking

Email tracking is a method that allows the sender to know whether an email is opened by the receiver. A common method is to embed a small image (e.g., a 1×1 pixel) in the message body. When the receiver reads the email, the image will be automatically loaded by sending an HTTP or HTTPS request to a remote server. The remote server can be either the original email sender or a third-party service. In this way, the remote server will know when (based on timestamp) and where (based on IP) the email is read by which person (based on email address) using what device (based on “User-Agent”).

Email tracking is part of the broader category of web tracking. Web tracking, typically through third-party cookies and browser fingerprints, has been extensively studied [15], [29], [34], [43], [12], [46], [48], [28], [19], [9], [10], [18], [38]. However, very few studies have systematically examined email tracking because real-world email datasets are rarely available to researchers. The largest measurement study so
far [17] collected data by signing up for “Shopping” and “News” websites to receive their emails. The resulting dataset contains 902 email senders. The limited number and category of online services severely limit researchers’ ability to draw generalizable conclusions.

We believe that the disposable email services provide a unique opportunity to study email tracking at a much larger scale and uncover new tracking techniques in the wild. First, disposable email services are public, which allows us to collect emails sent to disposable email addresses. Second, users of disposable email services have broadly exposed the email addresses to the Internet (by registering various online accounts), which helps to attract emails from a wide range of online services (and spammers). The resulting data, even though it still has biases, is likely to be much more diversified.

III. DATA COLLECTION

To understand how disposable email services are used, we collect emails that are sent to disposable addresses. First, we describe our data collection process. We then present a preliminary analysis of the dataset. Finally, we discuss the active steps we take to ensure research ethics.

A. Data Crawling Methodology

Since disposable email addresses are public gateways, our method is to set up a list of disposable email addresses and monitor the incoming emails. In this paper, we primarily focus on user-specified addresses for data collection efficiency. To increase our chance of receiving incoming emails, we select a list of popular (“high frequency”) usernames. Disposable email addresses under such usernames are often used by multiple users at the same time. In comparison, monitoring randomly-assigned (RA) addresses did not return many incoming emails. For example, in a pilot test, we monitored 5 RA email services (eyepaste.com, getnada.com, mailto.space, mytemp.email, and tempmailaddress.com) for 5 days. We only succeeded in collecting data from getnada.com, and all inboxes in the other RA services were empty. In total, we scanned 194,054 RA addresses, and collected 1,431 messages from 1,430 inboxes (a hit rate of 0.74%). The reason for the low hit rate is that randomly-assigned addresses come from a much larger address space than user-specified ones. Accordingly, in this paper, we focus on user-specified addresses for data collection.

Selecting Disposable Email Services. We spent a few days searching online for “disposable email” and “temporary email” to find popular services. This process mimics how normal users would discover disposable email services. By examining the top 100 entries of the search results, we find 31 disposable email services (19 UA and 12 RA services¹). UA services are typically more popular than RA services. For example, the top 5 sites have 4 UA services and 1 RA service. As discussed above, we focus on the services that offer user-specified addresses (UA), and select the top 7 disposable email services as shown in Table II. These services are very popular. For example, guerrillamail.com self-reported that they have processed nearly 8 billion emails in the past decade. mailnesia.com self-reported that they received 146k emails per day. While most of these services only provide the functionality of receiving emails, a few (e.g., guerrillamail.com) also provide the functionality of sending emails. In this work, we only focus on the incoming emails received by the disposable email addresses (to analyze email tracking).

¹ Two of the RA services have adopted CAPTCHAs for their sites.

Selecting Popular Usernames. We construct a list of popular usernames to set up disposable email addresses. To do so, we analyze 10 large leaked databases (that contain email addresses) from LinkedIn, Myspace, Zoosk, Last.fm, Mate1.com, Neopets.com, Twitter, 000webhost.com, Gmail, and Xsplit. These databases are publicly available and have been widely used for password research [56], [16], [30], [52], [55], [57], [51]. By combining the 10 databases, we obtain 430,145,229 unique email addresses and 349,553,965 unique usernames. We select the top 10,000 most popular usernames for our data collection. The top 5 usernames are info, john, admin, mail, and david, where “info” and “david” have been used 800,000 and 86,000 times, respectively.

To confirm that popular usernames are more likely to receive emails, we perform a quick pilot test. We scan all 7 disposable email services, and examine how many addresses under the 10,000 most popular usernames contain incoming emails. From a one-time scan, we find that 8.74% of the popular usernames contain emails at the moment we checked the inbox. As a comparison, we scan a list of 10,000 random usernames and find that only about 1% of addresses contain emails, which confirms our intuition.

Time Interval for Crawling. For each disposable email service, we build a crawler to periodically check the email addresses under the top 10,000 usernames. To minimize the impact on the target service, we carefully control the crawling speed and force the crawler to pause for 1 second between two consecutive requests. In addition, we keep a single crawling thread for each service. Under this setting, it would take more than 6 hours to scan all 10K addresses. Considering that certain disposable email services would frequently dispose of incoming emails, our strategy is to have an early timeout. Suppose a service keeps an email for t hours; we design our crawler to stop the current scan once we hit the t-hour mark, and immediately start again from the top of the username list. This strategy is to make sure we don’t miss incoming emails to the most popular addresses. Since emails are more likely to hit the top addresses, this strategy allows us to collect more emails with the limited crawling speed.

To set up the early-timeout, we need to measure the email deletion time for each service. We perform a simple experiment: for each service, we first generate 25 random MD5 hash strings as usernames. This is to make sure these addresses are not accidentally accessed by other users during the experiment.
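
The early-timeout scan described above can be sketched as a single pass over the ranked username list that stops once the service's retention window has elapsed. This is only an illustrative sketch under stated assumptions, not the crawler used in the study: `fetch_inbox`, `retention_seconds`, and the injectable `clock` and `sleep` hooks are hypothetical names introduced here, and the 1-second default pause mirrors the delay stated in the text.

```python
import time
from typing import Callable, Iterable, List, Tuple


def scan_with_early_timeout(
    usernames: Iterable[str],
    fetch_inbox: Callable[[str], list],  # hypothetical: returns the messages for one username
    retention_seconds: float,            # how long the service keeps an email (t hours, in seconds)
    pause_seconds: float = 1.0,          # pause between two consecutive requests
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, list]]:
    """Scan usernames in rank order and stop once the retention window has elapsed."""
    start = clock()
    results: List[Tuple[str, list]] = []
    for name in usernames:
        # Early timeout: abandon lower-ranked usernames when the t-hour mark is hit.
        if clock() - start >= retention_seconds:
            break
        results.append((name, fetch_inbox(name)))
        sleep(pause_seconds)
    return results
```

Calling this function repeatedly restarts each pass from the top of the list, so the highest-ranked usernames are always revisited before their emails can expire, at the cost of never reaching the lowest-ranked ones on services with short retention.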
TABLE I: The expiration time of disposable emails. We show the expiration time claimed on the website and the actual expiration time obtained through measurements.

Website             Claimed Time     Actual Time (Min., Avg., Max.)
guerrillamail.com   “1 hour”         1, 1, 1 (hour)
mailinator.com      “a few hours”    10.5, 12.6, 16.5 (hours)
temp-mail.org       “25 mins”        3, 3, 3 (hours)
maildrop.cc         “Dynamic”        1, 1, 1 (day)
mailnesia.com       “Dynamic”        12.6, 12.8, 13.1 (days)
mailfall.com        “25 mins”        30, 30, 30 (days)
mailsac.com         “Dynamic”        19.9, 20.3, 20.7 (days)

TABLE II: Statistics of the collected datasets.

Website         # Emails     Dispos. Address   Uniq. Sender Address (Domain)   Msgs w/ Sender Address
guerrillamail   1,098,875    1,138             410,457 (190,585)               1,091,230 (99%)
mailinator      657,634      10,000            27,740 (16,342)                 55,611 (8%)
temp-mail       198,041      5,758             1,748 (1,425)                   13,846 (7%)
maildrop        150,641      9,992             786 (613)                       3,950 (3%)
mailnesia       106,850      9,983             1,738 (686)                     4,957 (5%)
mailfall        75,179       9,731             3,130 (288)                     75,164 (100%)
mailsac         45,324       9,987             11,469 (8,019)                  45,315 (100%)
Total           2,332,544    56,589            452,220 (210,373)               1,290,073 (55%)

Then, we send 25 emails in 5 batches (12 hours apart). In the meantime, we have a script that constantly monitors each inbox to record the message deletion time. In this way, we obtain 25 measurements for each disposable email service.

As shown in Table I, disposable email services often don’t delete emails as quickly as promised. For example, mailfall.com claimed to delete emails in 25 minutes but in actuality held all the emails for 30 days. Similarly, temp-mail.org claimed to delete emails in 25 minutes but kept the emails for 3 hours. This could be an implementation error of the developers or a false advertisement by the service. Many of the services claim that the expiration time is not fixed (which depends on their available storage and email volume). Based on Table I, we only need to apply the early-timeout for guerrillamail and temp-mail to discard lower-ranked usernames, using a timeout of 1 hour and 3 hours respectively.

B. Disposable Email Dataset

We applied the crawler to 7 disposable email services for three months, from October 16, 2017 to January 16, 2018. In total, we collected 2,332,544 email messages sent to monitored email addresses. Our crawler is implemented using Selenium [7] to control a headless browser to retrieve email content. The detailed statistics are summarized in Table II. For 5 of the disposable email services, we can cover all 10K addresses and almost all of them have received at least one email. For the other 2 email services with a very short expiration time (temp-mail and guerrillamail), we focus on an abbreviated version of the popular usernames list. The number of emails per account has a highly skewed distribution. About 48% of disposable email addresses received only one email, and 5% of popular addresses received more than 100 emails each.

Each email message is characterized by an email title, email body, receiver address (disposable email address), and sender address. As shown in Table II, not all emails contain all the fields. 4 of the 7 disposable email services do not always keep the sender email addresses. Sometimes the disposable email services would intentionally or accidentally drop sender addresses. In addition, spam messages often omit the sender address in the first place. In total, there are 1,290,073 emails (55%) containing a sender address (with a total of 452,220 unique sender addresses). These sender addresses correspond to 210,373 unique sender domain names. From the email body, we extracted 13,396,757 URLs (1,031,580 unique URLs after removing URL parameters).

Biases of the Dataset. This dataset provides a rare opportunity to study disposable email services and email tracking. However, given the data collection method, the dataset inevitably suffers from biases. We want to clarify these biases upfront to provide a more accurate interpretation of the analysis results later. First, our dataset only covers the user-specified addresses but not the randomly-assigned addresses. Second, our data collection is complete with respect to the popular email addresses we monitored, but is incomplete with respect to all the available addresses. As such, any “volume” metrics can only serve as a lower bound. Third, we don’t claim the email dataset is a representative sample of a “personal inbox”. Intuitively, users (in theory) would use disposable email addresses differently relative to their personal email addresses. Instead, we argue the unique value of this dataset is that it covers a wide range of online services that act as the email senders. The data allows us to empirically study email tracking from the perspective of online services (instead of the perspective of email users). It has been extremely difficult (both technically and ethically) for researchers to access and analyze the email messages in users’ personal inboxes. Our dataset, obtained from public email gateways, allows us to take a first step measuring the email tracking ecosystem.

C. Ethical Considerations and IRB

We are aware of the sensitivity of the dataset and have taken active steps to ensure research ethics: (1) We worked closely with IRB to design the study. Our study was reviewed by IRB and received an exemption. (2) Our data collection methodology is designed following a prior research study on disposable SMS services [41]. Like previous researchers, we have carefully controlled the crawling rate to minimize the impact on the respective services. For example, we enforce a 1-second break between queries and explicitly use a single-thread crawler for each service. (3) All the messages sent to the gateways are publicly available to any Internet users. Users are typically informed that other users can also view the emails sent to these addresses. (4) We have spent extensive efforts on detecting and removing PII and personal emails from our dataset (details in §IV-A). (5) After data collection, we made extra efforts to reach out to users and offer users the opportunity to opt out. More specifically, we sent out an email to each of the disposable email addresses in our dataset, to inform users of our research activity. We explained the purpose of our research and offered the opportunity for users to withdraw their data. So far, we have not received any data withdrawal request. (6) Throughout our analysis, we
did not attempt to analyze or access any individual accounts registered under the disposable email addresses. We also did not attempt to click on any URLs in the email body (except the automatically loaded tracking pixels). (7) The dataset is stored on a local server with strict access control. We keep the dataset strictly to ourselves.

Overall, we believe the analysis results will benefit the community with a deeper understanding of disposable email services and email tracking, and inform better security practices. We hope the results can also raise the awareness of the risks of sending sensitive information over public channels.

IV. ANALYZING DISPOSABLE EMAILS

In this section, we analyze the collected data to understand how disposable email services are used in practice. Before our analysis, we first detect and remove PII and the potential personal emails from the dataset. Then we classify emails into different types and infer their use cases. More specifically, we want to understand with what types of online services users would register. Further, we seek to understand how likely it is for disposable email services to be used in sensitive tasks such as password resets.

A. Removing PII and Personal Emails

Removing PII. Since email messages sent to these gateways are public, we suspect careless users may accidentally reveal their PII. Thus, we apply well-established methods to detect and remove the sensitive PII from the email content [49]. Removing PII upfront allows us to analyze the dataset (including manual examination) without worrying about accidentally browsing sensitive user information. Here, we briefly introduce the high-level methodology and refer interested readers to [49] for details. The idea is to build a list of regular expressions for different PII. We first compile a ground-truth dataset to derive regular expressions and rules. Like [49], we also use the public Enron Email Dataset [8] which contains 500K emails. We focused on the most sensitive PIIs and labeled a small ground-truth set for credit card numbers, social security numbers (SSN), employer identification numbers (EIN), phone numbers, and vehicle identification numbers (VIN) as shown in Table III. Then we build regular expressions for each PII type. For credit card numbers, we check the prefix for popular credit card issuers such as VISA, Mastercard, Discover and American Express, and we also use the Luhn algorithm [32] to check the validity of a credit card number. As shown in Table III, the regular expressions have good precision and recall.

We applied the regular expressions to our dataset and detected a large number of PIIs including 1,399 credit card numbers, 926 SSNs, 701 EINs, 40K VINs, and 700K phone numbers. All the detected PII are automatically blacked out by the scripts. Note that the 700K phone numbers are not necessarily users’ personal phone numbers, but can be phone numbers of the email sending services. We take a conservative approach to black out all the potential PII. The results indicate that people indeed use the disposable email services to communicate sensitive information.

TABLE III: PII detection accuracy based on ground-truth, and the number of detected PII instances in our dataset.

            Ground-truth Evaluation                        # Detected in
PII Type    #Email   #Inst.   F1     Precis.   Recall     Our Data
Credit      16       25       1.00   1.00      1.00       1,399
SSN         13       15       1.00   1.00      1.00       926
EIN         16       29       1.00   1.00      1.00       701
Phone       20       50       0.99   0.98      1.00       726,138
VIN         15       19       0.97   1.00      0.95       43,438

Removing Personal Emails. We further remove potentially personal emails including replied emails and forwarded emails. We filter these emails based on “Re: ” and “Fwd: ” in the email titles. Although this step may not be complete, it helps to delete email conversations initiated by the users. In total, we filter out 30,955 such emails (1.33%). This again shows use of disposable email addresses for personal communications.

B. Categorizing Disposable Emails

Next, using the remaining data, we infer the common use cases of disposable email services by classifying email messages. First, we manually analyze a sample of emails to extract the high-level categories of emails (ground-truth dataset). Second, we build a machine learning classifier and use it to classify the unlabeled emails. Third, we analyze the classification results to examine common usage cases.

Manual Analysis and Email Clustering. To assist the manual analysis, we first cluster similar email messages together. For efficiency considerations, we only consider the subject (or title) of the email message for the clustering. Since we don’t know the number of clusters in the dataset, we exclude clustering methods that require pre-defining the number of clusters (e.g., K-means). Instead, we use the ISODATA algorithm [13], which groups data points based on a cut-off threshold of the similarity metric. We use the Jaccard index to measure the keyword similarity of two email subjects. Given two email subjects, we extract all their keywords into two sets $w_i$ and $w_j$. Then we calculate their similarity as $sim(i, j) = \frac{|w_i \cap w_j|}{|w_i \cup w_j|}$.

We set the cut-off threshold as 0.2 to loosely group similar email titles together. In total, we obtain 91,306 clusters, most of which are small with less than 100 emails (98%). The cluster size distribution is highly skewed. The top 500 clusters cover 56.7% of the total email messages. A few large clusters (with over 1000 emails) typically represent spam campaigns. To make sure 0.2 is a reasonable threshold, we have tried even smaller thresholds to merge some of the clusters. For example, if we set the threshold to 0.1 and 0.01, we get 26,967 and 19,617 clusters respectively. However, manual examination shows that the emails in the same cluster no longer represent a meaningful group. We stick to 0.2 as the threshold. By manually examining 500+ clusters (prioritizing larger ones), we summarize 4 major types of emails.
• Account Registration: emails to confirm account registration in online services.
• Password Reset: emails that instruct the user to reset passwords for an online account.
• Authentication: emails that contain a one-time authentication code for login.
• Spam: all other unsolicited emails including newsletters, advertisements, notifications from online services, and phishing emails.

Fig. 2: Email classification results. (Spam: 1,612,361 (94.75%); Registration: 61,812 (3.63%); Password Reset: 14,715 (0.86%); Authentication: 12,802 (0.75%).)
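
The Jaccard similarity used above to group email subjects can be illustrated with a short sketch. The tokenizer, the stopword list, and the greedy single-pass grouping are assumptions made for illustration only; they stand in for the keyword extraction and the ISODATA clustering used in the study and are not the authors' implementation.

```python
import re

# Hypothetical stopword list; the study does not specify its subject keyword extractor.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "your", "and", "in", "on", "is"}


def keywords(subject):
    """Lowercase the subject, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", subject.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(w_i, w_j):
    """sim(i, j) = |w_i ∩ w_j| / |w_i ∪ w_j|; returns 0.0 when both sets are empty."""
    union = w_i | w_j
    if not union:
        return 0.0
    return len(w_i & w_j) / len(union)


def group_subjects(subjects, threshold=0.2):
    """Greedy stand-in for clustering: a subject joins the first group whose
    seed keywords reach the similarity threshold, else it starts a new group."""
    clusters = []  # list of (seed_keywords, member_subjects)
    for subject in subjects:
        kw = keywords(subject)
        for seed, members in clusters:
            if jaccard(seed, kw) >= threshold:
                members.append(subject)
                break
        else:
            clusters.append((kw, [subject]))
    return [members for _, members in clusters]
```

With the 0.2 cut-off, "Reset your password" and "Password reset request" share two of three keywords and fall into one group, while an unrelated subject starts its own group. Lowering the threshold merges more groups, which matches the coarser clusterings reported for 0.1 and 0.01.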

Email Classification. We need to further develop an email classifier because the clusters do not map well to each of the email categories. For example, a cluster may contain both spam emails and emails that are used to confirm account registration. Below, we build a machine learning classifier to classify emails into the four categories.

For classifier training, we manually labeled a ground-truth dataset of 5,362 emails which contains 346 account registration emails, 303 password reset emails, 349 authentication emails and 4,364 spam emails. Note that we have labeled more spam emails than other categories because our manual examination suggests that there are significantly more spam emails in the dataset. For each email, we combine the text in the email title and the email body, and apply RAKE (Rapid Automatic Keyword Extraction) [44] to extract a list of keywords. RAKE is a domain independent keyword extraction algorithm based on the frequency of word appearance and its co-occurrence with other words. In this way, less distinguishing words such as stopwords are automatically ignored. We use extracted keywords as features to build a multi-class SVM classifier. We have tested other algorithms such as Decision Tree and Random Forests. However, the SVM performed the best. We also tested word2vector [35] to build the feature vector, and its results are not as good as RAKE (omitted for brevity).

Through 5-fold cross-validation, we obtain a precision of 97.23% and a recall of 95.46%. This is already highly accurate for a multi-class classifier; as a baseline, a random classification over 4 classes would return an accuracy of 25%. We manually checked some of the classification errors, and found that a few account registration and authentication emails are labeled as spam due to “spammy” keywords (e.g., “purchase”).

Note that two types of emails are not applicable here. First, 58,291 (2.50%) of the emails do not have any text content. Second, 535,792 (22.97%) emails are not written in English. Since our classifier cannot analyze the text of these emails, they are not part of the classification results in Figure 2 (we still consider these emails in the later analysis of email tracking). To make sure our classification results are trustworthy, we randomly sampled 120 emails (30 per category) to examine manually. We only find 5 misclassified emails (4% error rate), which shows that the ground-truth accuracy transfers well onto the whole dataset.

C. Inferring Usage Cases
Next, we examine disposable email service usage. Recall that our dataset contains emails received by the disposable email addresses. Intuitively, after the users obtain the disposable email addresses, they will use the email addresses for certain online tasks (e.g., registering accounts), which will expose the addresses and attract incoming emails. By analyzing these incoming emails, we can infer at which services the user registered the accounts, and what the accounts are used for.

Types of Emails. As shown in Figure 2, while spam emails take the majority, there is a non-trivial number of emails that are related to account management in various online services. In total, there are 89,329 emails involved with account registration, password resets or sending authentication codes. These emails are sent from 168,848 unique web domains. We refer to these 3 types of emails as account management emails. Account management emails are indicators of previous interactions between the user and the email sending domain. They are explicit evidence that users have used the disposable email addresses to register accounts in the web services.

Breakdown of Spam Emails. The spam emails take a large portion of our dataset (1,612,361 emails, 94%), which deserve a more detailed break-down. Some of the spam messages also indicate previous interactions between a user and the email sender. For example, if a user has registered an account or RSS at an online service (e.g. Facebook), this service may periodically send “social media updates”, “promotions”, or notifications to the disposable email address. We call them notification spam. Such notification messages almost always include an unsubscribe link at the bottom of the email to allow users to opt out. As such, we use this feature to scan the spam messages and find 749,602 notification messages (accounting for 46.5% of the spam messages).

The rest of unsolicited spam messages may come from malicious parties, representing malware or phishing campaigns. To identify the malicious ones, we extract all the clickable URLs from the email content, and run them against the VirusTotal blacklists (which contain over 60 blacklists maintained by different security vendors [41], [11]), and the eCrimeX blacklist (a phishing blacklist maintained by the Anti Phishing Work Group). In total, we identify 84,574 malicious spam emails (5.2%) that contain at least one blacklisted URL.

Finally, we apply the same ISODATA clustering algorithm to the rest of the spam emails (which account for 48.3%) to identify spam campaigns. We find 19,314 clusters and the top 500 clusters account for 75.6% of the spam emails. Manual examination shows that the top clusters indeed represent
TABLE IV: Top 5 sender domains of registration emails, password reset emails and authentication emails.
Rk. Registration Emails Password Reset Emails Authentication Emails
sender domain # msg category sender domain # msg category sender domain # msg category
1 facebookmail.com 2,076 Social Net facebookmail.com 931 Social Net frys.com 987 Shopping
2 gmail.com 1,015 Webmail twitter.com 508 Social Net paypal.com 622 Business
3 aol.com 928 Search miniclip.com 415 Games ssl.com 418 IT
4 avendata.com 733 Business retailio.in 223 Business id.com 163 Business
5 axway.com 720 Education gmail.com 145 Webmail facebookmail.com 161 Social Net

large spam campaigns, most of which are pornography and pharmaceutical spam.

Categories of Email Senders. To understand what types of online services users interact with, we further examine the “categories” of email sender domains. The “categories” are provided by VirusTotal. Table V shows the top 10 categories for spam emails and account management emails. We have two main observations.

TABLE V: Top 10 categories of the email sender domains for spam and account management emails.
Rk. | Account Management Email: Category | # Msg (domain) | Spam Email: Category | # Msg (domain)
1 | Business | 12,699 (2,079) | Business | 251,822 (31,433)
2 | IT | 6,759 (1,228) | Marketing | 145,538 (1,855)
3 | Software | 5,481 (571) | IT | 108,933 (6,091)
4 | Social Net | 5,362 (149) | Shopping | 104,361 (5,361)
5 | Marketing | 5,320 (430) | Social Net | 102,342 (1,223)
6 | Shopping | 3,307 (370) | Education | 73,038 (6,218)
7 | Education | 2,946 (673) | Software | 44,560 (3,217)
8 | Search | 2,154 (74) | Travel | 39,211 (3,444)
9 | Finance | 2,017 (302) | News | 38,567 (1,533)
10 | Webmail | 1,575 (46) | Adult | 30,777 (1,344)

First, the emails are sent from a very broad range of domain categories. This suggests that users have used the disposable email addresses to register accounts in all different types of websites. There are in total 121 different categories, and the top-10 categories only cover 51.01% of account management emails and 58.25% of spam emails, which confirms the high diversity of usage. Second, we observe that disposable email addresses are often used to register potentially sensitive accounts. Categories such as “online social networks”, “finance”, “shopping” have made the top-10 for account management emails. This could introduce risks if a user accidentally left PII or credit card information in the registered account.

Accounts registered under disposable email addresses are easily hijackable. Any other users can take over the registered accounts by sending a password-reset link to the disposable email address, which will be publicly accessible. Given the 14,000+ password-reset emails in our dataset, it is possible that malicious parties are already performing hijacking.

Case Studies: Common Usage. Next, we use specific examples to illustrate the common usage cases. Table IV lists the top 5 email sending domains for registration, password reset and authentication emails. We show users use disposable email addresses to register accounts in gaming and social network services in order to enjoy the online services without giving away real email addresses. For example, facebookmail.com appears in the top-5 of all three types of emails. twitter and miniclip (for gaming) also fall into the same category. It is possible that some accounts are fake accounts registered by spammers [58]. Since we decided not to back-track (or log into) any individual user’s account for ethical considerations, we cannot systematically differentiate them. Previous research on anonymous communities (e.g., 4chan, Reddit) shows that users prefer anonymized identifiers when posting sensitive or controversial content [54], [33]. We suspect normal users may use the disposable email address to create such social media accounts for similar purposes. PayPal accounts have additional risks. If a user accidentally binds a real credit card to the account, it means any other users may take over the PayPal account by resetting the password.

Another common use case is to obtain free goods. For example, users often need to register an email address to obtain demos or documents from software solutions and educational services, e.g., axway.com, avendata.com, retailio.in, and ssl.com. Users can also obtain a discount code from shopping services (e.g., frys.com). Another common case (not in the top-5) is to use the disposable email address to register for free WiFi in airports and hotels. Finally, we observe cases (not in the top 5) where users try to preserve anonymity: for example, people used disposable email addresses to file anonymous complaints to the United States Senate (86 emails).

Note that gmail.com is special: it turns out that many small businesses cannot afford their own email domains and directly use Gmail (e.g., pizza@gmail.com). Thus, the domain gmail.com does not represent Gmail, but is a collection of small businesses. aol.com has a similar situation.

Case Studies: Risky Usage. We observe other cases that may involve risks. These cases may not be as common as those shown in Table IV, but if their accounts are hijacked (through the public disposable email addresses), the real-world consequences are more serious. For example, there are 4,000+ emails from healthcare.gov, the website of the Affordable Care Act. It is likely that people have used disposable email addresses to register their healthcare accounts where each account carries sensitive information about the user.

Similarly, there are emails from mypersmail.af.mil (Air Force Service Center), suggesting that people have used disposable email address to register Air Force personnel accounts. The registration is open to civilian employees who must use their SSN and date of birth for the registration [1]. A password reset option is also available on the website.

In addition, more than 32,990 emails are used to receive scanned documents from PDF scanning apps (e.g., Tiny Scan-
ner). It is possible for an attacker to obtain all the scanned documents by hijacking these disposable email addresses.

Finally, there are over 1000 emails from digital currency or digital wallet services such as buyabitcoin.com.au and thebillioncoin.info. While most emails are related to account registrations, some are related to bitcoin purchase confirmations (e.g., receipts). If these accounts hold bitcoins, anyone has a chance to steal them.

D. Summary
We show that disposable email services are primarily used to register online accounts. While most of the incoming emails are spam and notifications (94%), we did find a large number of emails (89,000+) that are related to account registration, password reset, and login authentication. There is strong evidence that users use disposable email services for sensitive tasks. We find 1000+ credit card numbers and 926 SSNs accidentally revealed in the emails and 30K replied and forwarded emails that indicate a personal usage. More importantly, accounts registered with disposable email addresses can be easily hijacked through a password reset.

V. EMAIL TRACKING MEASUREMENTS
Next, we use the large-scale email dataset to analyze email tracking in the wild. We seek to answer three key questions. First, what types of tracking techniques do trackers use in practice, and what is the nature of the data leaked through tracking? Second, how prevalent is third-party tracking among different types of online services? Third, who are the top trackers in the tracking ecosystem and how dominant are they? In the following, we first describe the threat model and our method to detect third-party tracking, and then present the measurement results.

A. Threat Model
By embedding a small image in the email body, the email sender or third-parties can know whether the email has been opened by the receiver. When an email is opened, the tracking pixel will be automatically loaded from a remote server via HTTP/HTTPS (which does not require any user actions). Based on the request, the remote server will know who (based on the email address or other identifiers) opened the email at what location (based on IP) and what time (timestamp) using what device (“User-Agent”). The privacy leakage is more serious when the remote server is a third-party.

Email tracking works only if the user’s email client accepts HTML-based email content, which is true for most modern email clients. However, careful users may use ad-blockers to block tracking pixels [17]. In this paper, we make no assumption about a user’s email client, and only focus on the tracking content in the email body. Note that JavaScript is not relevant to email tracking since JavaScript will not be automatically executed [4]. Alternatively, email tracking can be done through querying font files. We did not find any font-based tracking in our dataset and omit it from the threat model.

B. Tracking Detection Method
Given an email, we design a method to determine if the email contains tracking pixels. First, we survey popular email tracking services (selected through Google searching) to examine how they implement the tracking pixels. After analyzing Yesware, Contact Monkey, Mailtrack, Bananatag, Streak, MailTracker, The Top Inbox, and Hub Spot, we observe two common characteristics. First, all 8 services embed small or transparent HTML image tags that are not visible to users (to remain stealthy). Second, the image URLs often contain some form of user identifiers (either the receiver’s email address or IDs created by the tracking services). This is because the tracker wants to know “who” opened the email. Next, we design a detection method based on these observations.

Steps to Detect Pixel Tracking. Given an email, we first extract all the HTML image tags and corresponding URLs. Here, we focus on tracking URLs that notify the tracker about the user identity. We filter out links that do not contain any parameters². Then for each image URL, we follow the four steps below to detect email tracking.
• Step 1: Plaintext Tracking Pixel: if the link’s parameters contain the receiver’s email address in plaintext, then the image is a tracking pixel.
• Step 2: Obfuscated Tracking Pixel: if the link’s parameters contain the “obfuscated version” of the receiver’s email address, then the image is a tracking pixel. We apply 31 hash/encoding functions on the receiver email address to look for a match (see Appendix). We also test two-layer obfuscations by exhaustively applying two-function combinations, e.g., MD5(SHA1()). In total, we examine 992 obfuscated strings for each address (31 single functions plus 31 × 31 two-function combinations). We didn’t consider salted obfuscation here due to the extremely high testing complexity.
• Step 3: Invisible HTML Pixel: we check if the image is trying to hide based on the HTML height and width attributes. We consider the image as a tracking pixel if both the height and width are below a threshold t or the HTML tag is set to be “hidden” or “invisible”.
• Step 4: Invisible Remote Pixel: trackers may intentionally set a large height or width in HTML to avoid detection. If the HTML height or width is above t, we use a web crawler to fetch the actual image from the remote server. If the actual image size is below t, regardless of the HTML attributes, we regard it as a tracking pixel.

Step-1 and step-2 are adapted from the method described in [17]. We explicitly look for parameters in the image URL that leak the receiver’s email address. However, it is still possible that trackers use an obfuscation method that is not listed in Table XI (e.g., keyed-hash). More importantly, the tracker can use a random string as the identifier and keep the mapping in the back-end. As such, we introduce step 3 and step 4 as a complementary way to capture the tracking behavior that cannot be detected by [17].

² Image URLs without parameters will still reveal the user’s IP but are not necessarily for tracking.
TABLE VI: Email tracking detection results. *Tracking party is based on 1.29 million emails that have a sender address.
Attributes | Total | Tracking Stats | Tracking Party*: 1st-party | Tracking Party*: 3rd-party | Tracking Method: Plaintext | Tracking Method: Obfuscat. | Tracking Method: Invis. HTML | Tracking Method: Invis. remote
# Image URLs | 3,887,658 | 1,222,961 (31.5%) | 509,419 | 179,223 | 200,682 | 200,247 | 548,166 | 537,266
# Email Messages | 2,332,544 | 573,244 (24.6%) | 264,501 | 149,303 | 35,702 | 29,445 | 473,723 | 124,900
# Sender Domains | 210,373 | 11,688 (5.5%) | 5,403 | 7,398 | 1,478 | 597 | 9,149 | 1,802
# Tracker Domains | N/A | 13,563 | 5,381 | 2,302 | 2,403 | 984 | 9,935 | 2,282

Fig. 3: Distribution of the HTML image size. (x-axis: image size, the max. of height and width, from 0 to 20; y-axis: image count.)
Fig. 4: The HTML image size of invisible remote pixels. (x-axis: HTML size of invisible remote pixels; y-axis: CDF of tracking pixels.)
Fig. 5: # of tracking URLs under different tracking methods. (Invisible HTML: 548,166; Invisible Remote: 537,266; Plaintext: 200,682; Obfuscated: 200,247. Overlaps: plaintext and invisible HTML 17,328; obfuscated and invisible HTML 46,638; plaintext and invisible remote 113,933; obfuscated and invisible remote 85,501.)

TABLE VII: Obfuscation methods used in the tracking URLs.
1-layer Obf. | Track URLs | 2-layer Obf. | Track URLs
MD5 | 183,527 (91.7%) | Base64 (Urlencode) | 765 (0.4%)
Base64 | 9,876 (4.9%) | Urlencode (Base64) | 134 (0.1%)
SHA1 | 2,754 (1.4%) | Base64 (Base64) | 49 (0.0%)
Urlencode | 2,094 (1.0%) | MD5 (MD5) | 29 (0.0%)
Crc32 | 704 (0.4%) | Urlencode (Urlencode) | 9 (0.0%)
SHA256 | 268 (0.1%) | |
Base16 | 38 (0.0%) | |

To set the threshold t for tracking pixels, we plot Figure 3 to show the image size distribution in our dataset. Image size is defined as the larger value between the height and width. As shown in Figure 3, there is a clear peak where the image size is 1 (1.1 million images). There are also 60K images of a “zero” size. To be conservative, we set the threshold t = 1. Our method is still not perfect, since we might miss trackers that use bigger tracking images. The detection result is only a lower-bound of all possible tracking.

Alternative Tracking Methods. In addition to the methods above, we have tested other alternative methods, which did not return positive results in our pilot test. For completeness, we briefly discuss them too. First, other than URL parameters, trackers may use subdomain names to carry the user identifiers. For example, a tracker (e.g., tracker.com) may register many subdomains, and use each subdomain to represent a user (e.g., u1.tracker.com, u2.tracker.com). To look for such trackers, we sort the domain names of image URLs based on their number of subdomains. We only find 3 domain names (list-manage.com, sendgrid.com and emltrk.com) that have more than 1000 subdomains. However, we find that they are not using subdomain names as user identifiers. Instead, each subdomain is assigned to represent a “customer” that adopted their tracking services. For example, a tracking URL office-artist.us12.list-manage.com is used by online service office-artist.com to track their users. We have examined all the tracking domains with over 50 subdomains and did not find any subdomain-based tracking.

A limitation of step-1 and step-2 is that they cannot capture trackers that use a random string as the identifier. An alternative approach is to cluster image URLs that follow the same templates. Then the differences in the URLs are likely to be the unique user identifiers. However, our pilot test shows that the majority of the differences in image URLs are indeed personalized content, but the personalized content is not for tracking. For example, online services often send product recommendations using the same template but use different “ProductIDs” in the image URLs. This approach easily introduces false positives.

Third-party Tracking. To differentiate first-party and third-party tracking, we match the domain name of the email sender and that of the image URL. Since we use domain name to perform the matching, all the “subdomains” belong to the same party. For example, mail.A.com and image.A.com match with each other since they share the same domain name. If the email sender’s domain name is different from that of the image tracking URL, we then check their WHOIS record to make sure the two domains are not owned by the same organization. We regard the tracking as a third-party tracking if the two domain names belong to different organizations.

VI. MEASUREMENT RESULTS
We apply our detection method to the 2.3 million emails, and the results are summarized in Table VI. In total, we extracted 3.9 million unique image URLs and 1.2 million of them (31.5%) are identified as tracking links. These tracking links are embedded in 573K emails (24.6%). Out of the 210K email sender domains, we find that 11.6K of them (5.5%) have embedded the tracking pixels in their emails. In total, we identify 13,563 unique tracker domains. In the following, we first characterize different email tracking techniques and the “hidden trackers”. Then we focus on third-party tracking and identify the top trackers. Finally, we analyze how different online services perform tracking.
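
The first-party versus third-party split reported in these results relies on the domain matching described under "Third-party Tracking". A minimal sketch of that matching follows. It makes two loud assumptions: the registered domain is approximated by the last two labels of the host name (a real implementation would consult the Public Suffix List), and the WHOIS ownership check is replaced by a caller-supplied `owners` mapping, which is hypothetical and stands in for a real WHOIS lookup.

```python
from urllib.parse import urlsplit


def registered_domain(host):
    """Naive approximation: keep the last two labels of the host name.
    This is wrong for suffixes such as co.uk; it is only for illustration."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def tracking_party(sender_address, image_url, owners=None):
    """Classify a tracking image as 'first-party' or 'third-party'.

    owners: optional dict mapping a registered domain to an organization
    name; a hypothetical stand-in for the WHOIS comparison in the text.
    """
    sender_domain = registered_domain(sender_address.rsplit("@", 1)[-1])
    image_domain = registered_domain(urlsplit(image_url).hostname or "")
    if sender_domain == image_domain:
        return "first-party"  # subdomains of one domain count as the same party
    owners = owners or {}
    sender_owner = owners.get(sender_domain)
    if sender_owner is not None and sender_owner == owners.get(image_domain):
        return "first-party"  # different domains, same organization
    return "third-party"
```

For instance, a sender at mail.A.com and an image hosted on image.A.com reduce to the same registered domain and count as first-party, whereas an image served from a list-manage.com subdomain for an unrelated sender counts as third-party unless the ownership mapping links the two domains.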
Fig. 6: Different tracking methods of first-party and third-party trackers. (x-axis: first-party, third-party; y-axis: percentage of trackers; series: plaintext, obfuscated, other tracking.)
Fig. 7: # of third-party trackers per sender. (x-axis: # of third-party trackers; y-axis: CDF of sender domains.)
Fig. 8: # of sender domains associated to each tracker. (x-axis: # of sender domains; y-axis: CDF of trackers.)

A. Email Tracking Techniques
As shown in Table VI, there is almost an equal number of tracking URLs that send plaintext user identifiers (200,682) and those that send obfuscated identifiers (200,247). For the obfuscated tracking, we find 12 obfuscated methods are used by trackers (out of 992 obfuscations tested). As shown in Table VII, MD5 is applied in the vast majority of these tracking URLs (91.7%) followed by Base64 (4.9%). We did find cases where the obfuscation functions are applied more than once but these cases are rare (<0.5%). This observation is consistent with the previous smaller-scale study [17].

There are even more tracking links that use invisible pixels. We find 548,166 invisible HTML pixels where the HTML size attributes are 1×1 or smaller or the image tags are set to be “hidden”. Meanwhile, we find 537,266 additional invisible remote pixels which falsely claim their HTML size attributes even though the actual image is 1×1. By analyzing the HTML attributes of the invisible remote pixels, we find that 20% of them did not specify the size attributes. For the remaining images that specified the size, Figure 4 shows the size distribution. These pixels declare much larger image sizes in HTML (possibly to avoid detection) while the actual image is only 1×1 (invisible to users).

Figure 5 shows the overlaps of the tracking URLs detected by different methods. We find 17K (8.6%) of the plaintext tracking URLs are also using invisible HTML pixels; 114K (56.8%) plaintext tracking URLs are using invisible remote pixels. This suggests that trackers prefer stealthier methods when sending plaintext identifiers. For obfuscated tracking URLs, although the “remote” invisible pixels are still preferred (86K, 42.7%), the ratio is more balanced compared to the usage of HTML pixels (47K, 23.3%). When the parameters are obfuscated, the trackers are likely to put in less effort towards hiding their tracking pixels.

Hidden Trackers. Through our analysis, we find hidden trackers when we try to fetch the tracking pixels from the remote servers. More specifically, when we request the images, the request will be first sent to the “direct tracker” (based on the image URL) and then redirected to the “hidden trackers”. The hidden trackers are not directly visible in the email body and can only be reached through HTTP/HTTPS redirections. In this way, user identifiers are not only leaked to the direct tracker but also to the hidden trackers in real time. Intuitively, hidden trackers are less likely to be blacklisted (by adblockers) since they do not directly appear in the HTML. To capture hidden trackers, we crawled all of the 1,222,961 tracking URLs. We find that a large number of the tracking URLs have redirections (616,535, 50.4%). In total, we obtain 2,825 unique hidden tracker domains. Table VIII shows the top 10 hidden trackers (ranked by the number of the direct trackers that redirect traffic to them).

TABLE VIII: Top 10 hidden trackers, ranked by the # of trackers that redirect traffic to them.
Rank | Hidden Tracker | # Direct Trackers | # Emails
1 | liadm.com | 252 | 29,643
2 | scorecardresearch.com | 227 | 27,301
3 | eloqua.com | 192 | 3,639
4 | doubleclick.net | 164 | 96,430
5 | rlcdn.com | 132 | 42,745
6 | adsrvr.org | 130 | 48,858
7 | pippio.com | 59 | 41,140
8 | hubspot.com | 47 | 3,995
9 | serving-sys.com | 41 | 18,116
10 | dotomi.com | 40 | 23,526

Hidden trackers may also act as direct trackers in certain emails. We find that 2,607 hidden trackers have once appeared to be direct trackers in our dataset. In total, hidden trackers are associated with 112,068 emails and 2,260 sender domains (19.3% of sender domains that adopted tracking). Interestingly, many first-party tracking links also share the user information with hidden trackers in real-time. More specifically, there are 9,553 emails (220 sender domains) that share user identifiers to a hidden tracker while performing first-party tracking.

B. Third-party Tracking
Next, we focus on third-party tracking and identify the top trackers. This analysis is only applicable to emails that contain a sender address (1.2 million emails).

Overall Statistics. Third-party tracking is highly prevalent. As shown in Table VI, there are 149k emails with third-party tracking. Interestingly, there are more sender domains with third-party tracking (7,398) than those with first-party tracking (5,403). In total, we identify 2,302 third-party trackers.

Figure 6 breaks down the tracking methods used by first- and third-party trackers. To make sure different tracking methods don’t overlap, we present plaintext tracking and obfuscated tracking, and regard the rest of the invisible pixel tracking as
TABLE IX: Top third-party trackers for each type of tracking method.
Rk. Top Trackers (# Sender Domains / # Email Messages)
plaintext (total: 513 / 4,783) obfuscated (total: 200 / 5,737) invis. HTML (total: 6,106 / 126,286) invis. remote (total: 1,180 / 21,906)
1 mczany.com (66 / 290) alcmpn.com (36 / 2,173) list-manage.com (1,367 / 19,564) hubspot.com (168 / 743)
2 emltrk.com (61 / 956) pippio.com (29 / 2,104) sendgrid.net (849 / 10,416) google-analytics.com (164 / 3,671)
3 socursos.net (28 / 93) rlcdn.com (11 / 246) returnpath.net (333 / 12,628) rs6.net (98 / 629)
4 vishalpublicschool.com (27 / 65) dotomi.com (11 / 218) rs6.net (217 / 2645) doubleclick.net (56 / 2,678)
5 52slots.com (26 / 48) bluekai.com (8 / 201) emltrk.com (197 / 2,362) tradedoubler.com (29 / 98)
6 joyfm.vn (18 / 26) emailstudio.co.in (6 / 17) klaviyomail.com (112 / 2,188) mixpanel.com (29 / 144)
7 jiepop.com (17 / 52) acxiom-online.com (5 / 517) exct.net (103 / 491) salesforce.com (27 / 64)
8 karacaserigrafi.com (16 / 120) lijit.com (5 / 118) exacttarget.com (88 / 2,203) publicidees.com (15 / 84)
9 dfimage.com (15 / 53) sparkpostmail.com (5 / 9) dripemail2.com (86 / 919) gstatic.com (14 / 191)
10 doseofme.com (15 / 32) mmtro.com (4 / 85) adform.net (76 / 550) mfytracker.com (12 / 16)

“other tracking”. Figure 6 shows that third-party trackers are less likely to collect the user email address as the identifier.

Figure 7 shows the number of third-party trackers used by each sender domain (corresponding to an online service). We find that the vast majority (83%) of online services use a single third-party tracker. About 17% of online services have multiple third-party trackers, sharing user information with multiple parties at the same time. The extreme case is amazonses.com which uses 61 third-party trackers.

Top Trackers. From the third-party tracker’s perspective, Figure 8 shows that only a small number of trackers are used broadly by different online services. To analyze the top trackers, we present Table IX to list top third-party trackers for each tracking method. We rank the trackers based on the number of online services that use them. A popular tracker should be used by many online services. For reference, we also show the number of emails associated with each tracker.

We observe that top trackers under different tracking methods rarely overlap with each other. This indicates that a tracker usually sticks to a specific tracking method. The most dominating trackers per category are mczany.com (plaintext tracking), alcmpn.com (obfuscated tracking), list-manage.com (invisible HTML), and hubspot.com (invisible remote). Noticeably, under the “stealthy” remote tracking, we also find that google-analytics.com and doubleclick.net make the top 10, which are Google’s trackers that have dominated web tracking [48], [9], [29].

Table X shows the top trackers across the full dataset, including all the hidden trackers. We show that the top 10 trackers collectively cover 33.5% of online services, and are responsible for 12% of the tracking emails. Although top trackers are taking a big share of the market, they are not as dominating as the top tracker (i.e. Google) in web tracking. For example, previous measurements show that Google can track users across nearly 80% of the top 1 million sites [31]. Clearly, in the email tracking market, Google is not yet as dominating as it is in web tracking.

TABLE X: Top third-party trackers across the full dataset. “ ” means the tracker is also a hidden tracker. “ ” means the tracker is not a hidden tracker.
Rk. | Top Trackers | Type | # Senders | # Emails
1 | list-manage.com | | 1,367 | 19,564
2 | sendgrid.net | | 849 | 10,416
3 | returnpath.net | | 345 | 12,784
4 | rs6.net | | 292 | 3,274
5 | emltrk.com | | 226 | 3,328
6 | google-analytics.com | | 225 | 5,174
7 | doubleclick.net | | 208 | 12,968
8 | hubspot.com | | 192 | 874
9 | eloqua.com | | 150 | 1,981
10 | rlcdn.com | | 133 | 7,117
Subtotal | | | 3,715 (31.8%) | 68,914 (12.0%)

C. Tracking by Online Services
Finally, we analyze different online services and seek to understand whether the popularity of online services and the service type would correlate to different tracking behaviors.

Popular vs. non-Popular Online Services. We first examine how tracking correlates with the popularity of online services. We reference Alexa’s top 1 million domains for the ranking [2]. Note that Alexa’s ranking is primarily applied to the web domain instead of the email domain. Accordingly, we check the MX record of Alexa top 1 million domains to perform the match. We find that out of the 210,373 sender domains, 18,461 domains are within Alexa top 1 million, and 2,052 are within the Alexa top 10K. For our analysis, we treat the Alexa top 10K as the popular domains, and the rest as non-popular domains. In total, the small portion of popular domains (0.98%) contributed 15.9% of the total emails.

Figure 9 shows that tracking is much more prevalent among popular domains. About 50% of popular domains adopted tracking in their emails. As a comparison, less than 10% of non-popular domains have adopted email tracking. Regarding different tracking methods, plaintext tracking and obfuscated tracking are not as prevalent as invisible pixel tracking, which is true for both popular and non-popular domains. Figure 10 shows that popular domains are slightly more likely to have first-party tracking than third-party tracking. Figure 11 shows that popular domains are more likely to use tracking methods that are harder to detect. More specifically, we focus on two types of stealthy tracking: invisible remote pixels (where the HTML tags falsely claim the image size) and hidden trackers (trackers hide behind redirection). We observe a big difference: about 12% to 16% of popular domains have used stealthy tracking and only 1% of non-popular domains use such tracking methods.

Type of Online Services. In Figure 12, we focus on the top 10 categories of sender domains and analyze the ratio of them that adopted email tracking. Not too surprisingly, “marketing”
Fig. 9: Tracking methods used by popular (Alexa top 10K) and non-popular sender domains. (Bar chart; y-axis: percentage of sender domains; legend: Plaintext, Obfuscated, Other Tracking, No Tracking.)
Fig. 10: Tracking methods used by popular (Alexa top 10K) and non-popular sender domains. (Bar chart; y-axis: percentage of sender domains; legend: First-Party Tracker, Third-Party Tracker, No Tracking.)
Fig. 11: Evasion methods used by popular (Alexa top 10K) and non-popular sender domains. (Bar chart; y-axis: percentage of sender domains; legend: Invisible Remote Pixel, Hidden Tracker.)

Fig. 12: Type of tracking used by different sender domains. (Bar chart; y-axis: percentage of sender domains; legend: First-Party, Third-Party, No Tracking; x-axis: the top 10 sender domain categories.)

services have the highest ratio of tracking. In fact, many marketing services themselves are email tracking services (first-party tracking). Popular tracking domains also include shopping websites and information technology websites.

VII. DISCUSSION
Risk Mitigation for Disposable Email Addresses. Our study reveals risky use cases of disposable email services. The root source of risk is the public nature of the disposable email inboxes. Randomly-assigned addresses cannot fully mitigate this problem since multiple users can still access the same address at the same time (see §III-A). One possible countermeasure is to implement a sandbox using cookies. For example, if a current user is using the inbox, then other users who do not possess the same cookie cannot access the same inbox. The inbox will become available again once the current user closes her session. If the disposable email service does not implement a sandbox, we believe it is necessary for the service to clearly inform users about the public nature of the inbox. In addition, it is also important for the service to clearly communicate the email expiration time to users. Our results show that two disposable email services host the emails much longer than what they promised (e.g., 30 days of delay).
Users of disposable email services should proactively delete their emails whenever possible. More importantly, users should avoid revealing their PII in both the temporary inbox and in the accounts they registered through the disposable email address. Due to the public nature of the disposable email services, accounts registered with disposable email addresses can be easily hijacked through a password reset. A future direction is to understand user perceptions towards the benefits and risks of using disposable email services and identify the potential misunderstandings with respect to their security.
Email Tracking and Countermeasures. The most straightforward way to prevent email tracking is to stop rendering emails in HTML (i.e., plaintext email) or block all the outgoing requests that are not initiated by user clicks. The drawback, however, is a degradation of user experience since the images in the email (if they are not embedded) cannot be displayed. To address this problem, Gmail has a special design where the Gmail server fetches all the images on behalf of the users. In this way, the tracker cannot collect users’ IP addresses. However, the tracker can still obtain the following information: (1) the user indeed opens the email; (2) the time of email opening; and (3) the user’s identifier (if the identifier is a parameter of the tracking URL).
A more promising way is to perform targeted HTML filtering [17] to remove tracking-related image tags. Since most tracking pixels are invisible, removing them would not hurt the user experience. This is very similar to ad-blocking, where ad-blockers construct filtering rules to detect and remove ads on websites. In addition to static HTML analysis, we believe dynamic analysis is necessary since (1) trackers may falsely claim the HTML size attributes, and (2) the real trackers may hide behind the redirection.
Email Tracking Notification. For the sake of transparency, it is necessary to inform users when tracking is detected. Today, many websites are required (e.g., by EU Privacy Directive) to display a notice to inform users when cookies are used for web tracking. More recently, EU’s new GDPR policy forbids online services from tracking users with emails without unambiguous consent. However, there is no such privacy policy in the U.S. While legislation may take a long time, a more immediate solution is to rely on email services or email clients to notify users.
A Comparison with Previous Research. The most related work to ours is a recent study that analyzed email tracking on 902 websites (12,618 emails) [17]. In this work, we collect a dataset that is larger by orders of magnitude. Some of our results confirm the observations of the small-scale study. For example, we show that obfuscation is widely used to encode user identifiers for tracking and MD5 is the
most commonly used method, both of which are consistent with [17]. Interestingly, some of our results are different, in particular, the top third-party trackers (Table IX). For example, doubleclick.net, which was ranked 1st by [17], is only ranked 7th based on unique sender domains (ranked 2nd based on email volume) in our dataset. list-manage.com was ranked 10th by [17] but came to the top in our analysis. There are a couple of reasons that may contribute to the differences. First, the previous work collected a small email dataset from 902 sender domains, while we collected emails from 210,000+ sender domains. Second, the previous study collected data from “Shopping” and “News” categories, while our dataset covers more than 100 website categories. Third, previous work only considered tracking URLs that contain an explicit user identifier (i.e., email address), while we cover more tracking methods (e.g., invisible or remote pixels).

VIII. LIMITATIONS
The first limitation is that our analysis only covers disposable email services with user-specified addresses (UA). This is mainly due to the difficulty of obtaining data from randomly-assigned addresses (RA). Here, we use the small dataset collected from RA services (§III-A) to provide some context. Recall that the dataset contains 1,431 messages from 5 RA services. After removing personal and non-English emails, we apply our classifier to the remaining 1,142 emails. We find that randomly-assigned addresses also contain account management emails, including 134 registration emails (11.7%), 44 password reset emails (3.9%), and 32 authentication emails (2.8%). We also notice that the spam email ratio is lower in RA services (81.6%) than in UA services (94%). Intuitively, spammers often blindly send spam emails to addresses with popular usernames.
The second limitation is that our dataset is not representative with respect to a normal user inbox. Our measurement results cannot be used to assess email tracking at a per-user level. Instead, the main advantage of the dataset is that it contains emails sent by a large number of online services (including the top-ranked websites). This allows us to analyze email tracking from the perspective of online services (200K domains across 121 categories). For future work, we can evaluate the user-level tracking through user studies.
Third, for ethical considerations, we decided not to manually analyze the PII or back-track the accounts registered with the disposable addresses. This has limited our ability to answer some of the questions. For example, in §IV-A, we did not manually confirm the validity of detected PII, assuming the training accuracy transfers well to the testing. In §IV-C, it is possible that spammers would use the email addresses to register fake accounts in online services, but we cannot confirm. Similarly, for the password reset emails, it is possible that the emails were triggered by malicious parties who were trying to log in to other people’s accounts, or by the real owners of the accounts who forgot the password.
Fourth, our email tracking detection is still incomplete. Theoretically, it is possible for a tracker to use subdomain names (instead of URL parameters) to identify individual users, or use font links (instead of image links). However, we did not find such cases in our dataset. In addition, our current method cannot detect tracking URLs that use both large tracking images and random strings as user identifiers.

IX. RELATED WORK
Web Tracking and Email Tracking. Web tracking has been extensively studied by researchers in the past decade [15]. Researchers have analyzed third-party web tracking across different websites [29] and countries [23]. Consistently, different studies have shown that Google is the top tracker on the web [34], [43], where 80% of Alexa top 1 million websites have Google-owned trackers [31]. Web tracking has turned into a cat-and-mouse game. Researchers have studied various tracking techniques such as flash cookies [46], [12], canvas fingerprinting, evercookies, and cookie syncing [9], [18]. While adblockers help to reduce tracking, anti-adblockers are also increasingly sophisticated [59], [24], [36], [39].
Disposable Accounts and Phone Verified Accounts. Previous work has studied disposable SMS services where public phone numbers are offered to users for temporary use [41]. Researchers also studied the security risks of man-in-the-middle attacks [20], and used the collected messages to investigate SMS spam [25], [37]. A recent work shows that “retired” addresses from popular email services can be re-registered to hijack existing accounts [21]. Other researchers looked into how disposable SMS services are used to create phone-verified fake accounts in online services [50].
PII Leakage and Email Hijacking. Previous works have examined PII leakage under various channels [26], [27] such as mobile network traffic [42], [53], website contact forms [47], and cross-device tracking [14]. Our work differs from previous work in its focus on PII leakage during email tracking.

X. CONCLUSION
In this paper, we perform a first measurement study on disposable email services. We collect a large dataset from 7 popular disposable email services (2.3 million emails sent by 210K domains), and provide new understandings of what disposable email services are used for and the potential risks of usage. In addition, we use the collected email dataset to empirically analyze email tracking activities. Our results provide new insights into the prevalence of tracking at different online services and the evasive tracking methods used by trackers. The results are valuable for developing more effective anti-tracking tools for email systems.

ACKNOWLEDGMENT
We would like to thank our shepherd Manos Antonakakis and the anonymous reviewers for their helpful feedback. This project was supported in part by NSF grants CNS-1750101 and CNS-1717028, and Google Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any funding agencies.
REFERENCES
[1] Air force mypers. http://www.afpc.af.mil/Support/myPers/.
[2] Alexa top 1 million websites. https://www.alexa.com/topsites.
[3] Guerrillamail. https://www.guerrillamail.com/.
[4] Mailchimp. https://mailchimp.com/help/limitations-of-html-email/.
[5] Maildrop.cc privacy policy. https://maildrop.cc/privacy.
[6] Mailinator privacy policy. https://www.mailinator.com/faq.jsp.
[7] Selenium. http://www.seleniumhq.org/.
[8] Enron email dataset. https://www.cs.cmu.edu/~enron/, May 2015.
[9] Acar, G., Eubank, C., Englehardt, S., Juarez, M., Narayanan, A., and Diaz, C. The web never forgets: Persistent tracking mechanisms in the wild. In Proc. of CCS’14 (2014).
[10] Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., and Preneel, B. Fpdetective: dusting the web for fingerprinters. In Proc. of CCS’13 (2013).
[11] Arshad, Sajjad, K. A. R. W. Include me out: In-browser detection of malicious third-party content inclusions. In Proc. of Financial Cryptography and Data Security’17 (2017).
[12] Ayenson, M. D., Wambach, D. J., Soltani, A., Good, N., and Hoofnagle, C. J. Flash cookies and privacy ii: Now with html5 and etag respawning. In SSRN (2011). http://dx.doi.org/10.2139/ssrn.1898390.
[13] Ball, G. H., and Hall, D. J. Isodata, a novel method of data analysis and pattern classification. Tech. rep., Stanford research inst Menlo Park CA, 1965.
[14] Brookman, J., Rouge, P., Alva, A., and Yeung, C. Cross-device tracking: Measurement and disclosures. Proc. of PETs’17 (2017).
[15] Budak, C., Goel, S., Rao, J., and Zervas, G. Understanding emerging threats to online advertising. In Proc. of EC’16 (2016).
[16] Das, A., Bonneau, J., Caesar, M., Borisov, N., and Wang, X. The tangled web of password reuse. In Proc. of NDSS’14 (2014).
[17] Englehardt, S., Han, J., and Narayanan, A. I never signed up for this! privacy implications of email tracking. In Proc. of PETS’18 (2018).
[18] Englehardt, S., and Narayanan, A. Online tracking: A 1-million-site measurement and analysis. In Proc. of CCS’16 (2016).
[19] Fifield, D., and Egelman, S. Fingerprinting web users through font metrics. In Proc. of Financial Cryptography and Data Security’15 (2015).
[20] Gelernter, N., Kalma, S., Magnezi, B., and Porcilan, H. The password reset mitm attack. In Proc. of IEEE S&P’17 (2017).
[21] Gruss, D., Schwarz, M., Wübbeling, M., Guggi, S., Malderle, T., More, S., and Lipp, M. Use-after-freemail: Generalizing the use-after-free problem and applying it to email services. In Proc. of Asia CCS’18 (2018).
[22] Hosie, R. Ashley madison hacking: What happened when married man was exposed? Independent, 2017.
[23] Iordanou, C., Smaragdakis, G., Poese, I., and Laoutaris, N. Tracing cross border web tracking. In Proc. of IMC’18 (2018).
[24] Iqbal, U., Shafiq, Z., and Qian, Z. The ad wars: retrospective measurement and analysis of anti-adblock filter lists. In Proc. of the IMC’17 (2017).
[25] Jiang, N., Jin, Y., Skudlark, A., and Zhang, Z.-L. Greystar: Fast and accurate detection of sms spam numbers in large cellular networks using gray phone space. In Proc. of USENIX Security’13 (2013).
[26] Krishnamurthy, B., Naryshkin, K., and Wills, C. Privacy leakage vs. protection measures: the growing disconnect. In Proc. of the Web’11 (2011).
[27] Krishnamurthy, B., and Wills, C. E. On the leakage of personally identifiable information via online social networks. In Proc. of the ACM workshop on Online social networks’09 (2009).
[28] Laperdrix, P., Rudametkin, W., and Baudry, B. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In Proc. of IEEE S&P’16 (2016).
[29] Lerner, A., Simpson, A. K., Kohno, T., and Roesner, F. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In Proc. of USENIX Security’16 (2016).
[30] Li, Y., Wang, H., and Sun, K. A study of personal information in human-chosen passwords and its security implications. In Proc. of INFOCOM’16 (2016).
[31] Libert, T. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication (2015).
[32] Luhn, H. Computer for verifying numbers, 1960. Patent No. 2,950,048.
[33] Ma, X., Hancock, J., and Naaman, M. Anonymity, intimacy and self-disclosure in social media. In Proc. of CHI’16 (2016).
[34] Mayer, J. R., and Mitchell, J. C. Third-party web tracking: Policy and technology. In Proc. of IEEE S&P’12 (2012).
[35] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS’13 (2013).
[36] Mughees, M. H., Qian, Z., and Shafiq, Z. Detecting anti ad-blockers in the wild. In Proc. of PETs’17 (2017).
[37] Murynets, I., and Piqueras Jover, R. Crime scene investigation: Sms spam data analysis. In Proc. of IMC’12 (2012).
[38] Nikiforakis, N., Kapravelos, A., Joosen, W., Kruegel, C., Piessens, F., and Vigna, G. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Proc. of IEEE S&P’13 (2013).
[39] Nithyan, R., Khattak, S., Javed, M., Vallina-Rodriguez, N., Falahrastegar, M., Powles, J. E., Cristofaro, E., Haddadi, H., and Murdoch, S. J. Adblocking and counter blocking: A slice of the arms race. In CoRR (2016), USENIX.
[40] Piscitello, D. The new face of phishing. APWG, 2018.
[41] Reaves, B., Scaife, N., Tian, D., Blue, L., Traynor, P., and Butler, K. R. B. Sending out an sms: Characterizing the security of the sms ecosystem with public gateways. In Proc. of IEEE S&P’16 (2016).
[42] Ren, J., Rao, A., Lindorfer, M., Legout, A., and Choffnes, D. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proc. of the MobiSys’16 (2016).
[43] Roesner, F., Kohno, T., and Wetherall, D. Detecting and defending against third-party tracking on the web. In Proc. of NSDI’12 (2012).
[44] Rose, S., Engel, D., Cramer, N., and Cowley, W. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory. 2010, pp. 1–20.
[45] Seetharaman, D., and Bindley, K. Facebook controversy: What to know about cambridge analytica and your data. The Wall Street Journal (2018).
[46] Soltani, A., Canty, S., Mayo, Q., Thomas, L., and Hoofnagle, C. J. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management (2010).
[47] Starov, O., Gill, P., and Nikiforakis, N. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proc. of PETs’16 (2016).
[48] Starov, O., and Nikiforakis, N. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proc. of WWW’17 (2017).
[49] Szurdi, J., and Christin, N. Email typosquatting. In Proc. of IMC’17 (2017).
[50] Thomas, K., Iatskiv, D., Bursztein, E., Pietraszek, T., Grier, C., and McCoy, D. Dialing back abuse on phone verified accounts. In Proc. of the CCS’14 (2014).
[51] Thomas, K., Li, F., Zand, A., Barrett, J., Ranieri, J., Invernizzi, L., Markov, Y., Comanescu, O., Eranti, V., Moscicki, A., Margolis, D., Paxson, V., and Bursztein, E. Data breaches, phishing, or malware?: Understanding the risks of stolen credentials. In Proc. of CCS’17 (2017).
[52] Ur, B., Segreti, S. M., Bauer, L., Christin, N., Cranor, L. F., Komanduri, S., Kurilova, D., Mazurek, M. L., Melicher, W., and Shay, R. Measuring real-world accuracies and biases in modeling password guessability. In Proc. of USENIX Security’15 (2015).
[53] Vallina-Rodriguez, N., Kreibich, C., Allman, M., and Paxson, V. Lumen: Fine-grained visibility and control of mobile traffic in user-space.
[54] Van der Nagel, E., and Frith, J. Anonymity, pseudonymity, and the agency of online identity: Examining the social practices of r/gonewild. First Monday 20, 3 (2015).
[55] Veras, R., Collins, C., and Thorpe, J. On semantic patterns of passwords and their security impact. In Proc. of NDSS’14 (2014).
[56] Wang, C., Jan, S. T., Hu, H., Bossart, D., and Wang, G. The next domino to fall: Empirical analysis of user passwords across online services. In Proc. of CODASPY’18 (2018).
[57] Wang, D., Zhang, Z., Wang, P., Yan, J., and Huang, X. Targeted online password guessing: An underestimated threat. In Proc. of CCS’16 (2016).
[58] Wang, G., Konolige, T., Wilson, C., Wang, X., Zheng, H., and Zhao, B. Y. You are how you click: Clickstream analysis for sybil detection. In Proc. of USENIX Security’13 (2013).
[59] Zhu, S., Hu, X., Qian, Z., Shafiq, Z., and Yin, H. Measuring and disrupting anti-adblockers using differential execution analysis. In Proc. of NDSS’18 (2018).

APPENDIX – OBFUSCATED USER IDENTIFIER


To detect obfuscated user identifiers (i.e., email addresses) in
the tracking URLs, we have tested 31 different hash/encoding
functions. If the link’s parameters contain the “obfuscated
version” of the receiver’s email address, then the image is
considered a tracking pixel. As shown in Table XI, we
apply 31 hash/encoding functions on the receiver email address
to look for a match. We also test two-layer obfuscations
by exhaustively applying two-function combinations, e.g.,
MD5(SHA1()). In total, we examine 992 obfuscated strings
for each address (31 single-function outputs plus 31 × 31 = 961
two-function combinations).
TABLE XI: Functions to obfuscate user identifiers.
Hash or encoding functions (31 in total)
MD2, MD4, MD5, RIPEMD, SHA1, SHA224, SHA256, SHA384,
SHA512, SHA3 224, SHA3 256, SHA3 384, SHA3 512, blake2b,
blake2s, crc32, adler32, murmurhash 3 32 bit, murmurhash 3 64 bit,
murmurhash 3 128 bit, whirlpool, b16 encoding, b32 encoding,
b64 encoding, b85 encoding, url encoding, gzip, zlib, bz2, yenc, entity
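
The matching procedure above can be illustrated with a minimal sketch. It is not the implementation used in the study. It assumes a subset of six Table XI functions that ship with the Python standard library, it assumes that two-layer obfuscation applies the outer function to the text output (e.g., the hex digest) of the inner function, and it assumes an exact match against URL query parameter values. The names FUNCTIONS, candidate_count, build_candidates, and find_obfuscated_identifier are hypothetical.

```python
import base64
import hashlib
import urllib.parse
import zlib


def _hex_digest(name):
    """Return a str -> str function giving the hex digest of a hashlib algorithm."""
    def apply(text):
        return getattr(hashlib, name)(text.encode("utf-8")).hexdigest()
    return apply


# Illustrative subset of Table XI (6 of the 31 functions), standard library only.
FUNCTIONS = {
    "md5": _hex_digest("md5"),
    "sha1": _hex_digest("sha1"),
    "sha256": _hex_digest("sha256"),
    "crc32": lambda text: format(zlib.crc32(text.encode("utf-8")), "08x"),
    "b16": lambda text: base64.b16encode(text.encode("utf-8")).decode("ascii"),
    "b64": lambda text: base64.b64encode(text.encode("utf-8")).decode("ascii"),
}


def candidate_count(n):
    """n single-layer strings plus n * n two-layer combinations."""
    return n + n * n


def build_candidates(email):
    """Return (label, obfuscated string) pairs for one- and two-layer obfuscation."""
    candidates = [(name, func(email)) for name, func in FUNCTIONS.items()]
    for outer_name, outer in FUNCTIONS.items():
        for inner_name, inner in FUNCTIONS.items():
            # Assumption: the outer function is applied to the inner function's text output.
            candidates.append((f"{outer_name}({inner_name}())", outer(inner(email))))
    return candidates


def find_obfuscated_identifier(url, email):
    """Return the label of the first candidate equal to a query parameter value, else None."""
    query = urllib.parse.urlsplit(url).query
    # Keep everything after the first "=" so that base64 padding is preserved.
    values = {urllib.parse.unquote(pair.partition("=")[2]) for pair in query.split("&")}
    for label, obfuscated in build_candidates(email):
        if obfuscated in values:
            return label
    return None
```

With all 31 functions, candidate_count(31) gives the 992 strings per address reported above. A real detector would likely also need to handle case differences in hex digests and identifiers embedded in longer parameter values, which this sketch deliberately ignores.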
