Understanding Echo Chambers in E-commerce Recommender Systems

ABSTRACT
Personalized recommendation benefits users in accessing contents of interest effectively. Current research on recommender systems mostly focuses on matching users with proper items based on user interests. However, significant efforts are missing to understand how recommendations influence user preferences and behaviors, e.g., if and how recommendations result in echo chambers. Extensive efforts have been made to examine the phenomenon in online media and social network systems. Meanwhile, there are growing concerns that recommender systems might lead to the self-reinforcement of users' interests due to narrowed exposure of items, which may be a potential cause of echo chambers. In this paper, we aim to analyze the echo chamber phenomenon in Alibaba Taobao — one of the largest e-commerce platforms in the world. Echo chamber means the effect of user interests being reinforced through repeated exposure to similar contents. Based on this definition, we examine the presence of echo chambers in two steps. First, we explore whether user interests have been reinforced. Second, we check whether the reinforcement results from the exposure of similar contents. Our evaluations are enhanced with robust metrics, including cluster validity and statistical significance. Experiments are performed on extensive collections of real-world data consisting of user clicks, purchases, and browse logs from Alibaba Taobao. Evidence suggests a tendency of echo chamber in user click behaviors, while it is relatively mitigated in user purchase behaviors. Insights from the results guide the refinement of recommendation algorithms in real-world e-commerce systems.

CCS CONCEPTS
• Information systems → Recommender systems; Web log analysis; Test collections.

KEYWORDS
E-commerce; Recommender Systems; Echo Chamber; Filter Bubble

ACM Reference Format:
Yingqiang Ge, Shuya Zhao, Honglu Zhou, Changhua Pei, Fei Sun, Wenwu Ou, and Yongfeng Zhang. 2020. Understanding Echo Chambers in E-commerce Recommender Systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3397271.3401431

∗ Co-first authors with equal contributions.
† This work was done when Yingqiang Ge worked as an intern in Alibaba.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGIR '20, July 25–30, 2020, Virtual Event, China
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8016-4/20/07. . . $15.00
https://doi.org/10.1145/3397271.3401431

1 INTRODUCTION
Recommender systems (RS) came into play with the rise of online platforms, e.g., social networking sites, online media, and e-commerce [16, 18, 19]. Intelligent algorithms with the ability to offer personalized recommendations are increasingly used to help consumers seek contents that best match their needs and preferences in the form of products, news, services, and even friends [1, 49, 50]. Despite the significant convenience that RS has brought, the outcome of personalized recommendations, especially how they reform social mentality and public recognition — which could potentially reconfigure society, politics, labor, and ethics — remains unclear.

Extensive attention has been drawn to this front, arriving at two coined terms, echo chamber and filter bubble. Both effects might occur after the use of personalized recommenders and entail far-reaching implications. Echo chamber describes the rise of social communities whose members share similar opinions within the group [41], while filter bubble [36], the phenomenon of an overly narrow set of recommendations, was blamed for isolating users in information echo chambers [1].

Owing to the irreversible and striking impact that the internet has had on mass communication, echo chambers and filter bubbles are appearing in online media and social networking sites, such as MovieLens [33], Pandora [1], YouTube [23], Facebook [37], and Instagram [39]. Significant research efforts have been put forward in examining the two phenomena in online media and social networks [4, 6, 7, 14, 20, 30]. Recently, researchers have concluded that the decisions made by RS can influence user beliefs and preferences, which in turn affect the user feedback, e.g., the click and purchase behaviors received by the learning system, and this kind of user feedback loop might lead to echo chambers and filter bubbles [26]. On the other hand, the two concepts are not isolated, since filter bubble is a potential cause of echo chamber [1, 12].
Industry (SIRIP) Papers I SIGIR ’20, July 25–30, 2020, Virtual Event, China
In this work, we are primarily concerned with the existence and the characteristics of echo chambers in real-world e-commerce systems. We define echo chamber as the effect of users' interests being reinforced due to repeated exposure to similar items or categories of items, thereby generalizing the definition in [26]. This is because users' consuming preferences are so versatile and diverse that they cannot simply be classified into positive or negative directions as in political opinions [40]. Based on the above definition of echo chamber, we formulate the research in two steps by answering the following two related research questions:

• RQ1: Does the recommender system, to some extent, reinforce user click/purchase interests?
• RQ2: If user interests are indeed strengthened, is it caused by RS narrowing down the scope of items exposed to users?

To measure the effect of recommender systems on users, we first follow the idea introduced in [33] and separate all users into categories based on how often they actually "take" the recommended items. This separation helps us to compare recommendation followers against a control group, namely, the recommendation ignorers. The remaining problem is how to measure the effect of echo chamber on each group. Users in social network platforms have direct ways to interact with other users, potentially through actions of friending, following, commenting, etc. [38]. A similar analogy is that users in the recommender system could interact with other users indirectly through the recommendations offered by the platform, since recommendation lists are usually generated by considering the user's previous preferences and the preferences of similar users (i.e., collaborative filtering). Due to the absence of an explicit network of user-user interaction, which is naturally and commonly provided in social networks, we decide to measure echo chamber in e-commerce at the population level. This is because users who share similar interaction records (e.g., clicking the same products) will be closely located in a latent space, and the cluster of these users in that space, along with its temporal changes, could serve as a signal to detect echo chamber. Finally, we measure the content diversity in recommendation lists for each group to see whether the recommender system narrows down the scope of items exposed to users, so as to answer RQ2.

The key contributions of our paper can be summarized as follows:

• We study the echo chamber effect at a population level by implementing clustering on different user groups, and measure the shifts in user interests with cluster validity indexes.
• We design a set of controlled trials between recommendation followers and ignorers, and employ a wide range of technical metrics to measure the echo chamber effect to provide a broader picture.
• We conduct our experiments based on real-world data from Alibaba Taobao — one of the largest e-commerce platforms in the world. Our analytical results, grounded with reliable validity metrics, suggest the tendency of echo chamber in terms of user click behaviors, and a relatively mitigated effect in user purchase behaviors.

2 RELATED WORK
Today's recommender systems are criticized for bringing the dangerous byproducts of echo chamber and filter bubble. Sunstein argued that personalized recommenders would fragment users, making like-minded users aggregate [41]. The existing views or interests of these users would be reinforced and amplified since "group polarization often occurs because people are telling one another what they know" [41, 42]. Pariser later described filter bubble as the effect of recommenders isolating users from diverse content and trapping them in an unchanging environment [36]. Though both are concerned with the malicious effects that recommenders could pose, echo chamber emphasizes the polarized environment, while filter bubble lays stress on the undiversified environment.

Researchers are expressing their concerns about the two effects, and attempting to formulate a richer understanding of their potential characteristics [4, 10, 23, 32, 39]. Considering echo chambers a significant threat to modern society, as they might lead to polarization and radicalization [11], Risius et al. analyzed news "likes" on Facebook, and distinguished different types of echo chambers [37]. Mohseni et al. reviewed news feed algorithms as well as methods for fake news detection and focused on the unwanted outcomes of echo chamber and filter bubble after using personalized content selection algorithms [30]. They argued that personalized news feeds might cause polarized social media and the spread of fake content.

Another genre of research aims to clear up strategies to mitigate the potential issues of echo chamber and filter bubble, or to design new recommenders that alleviate such effects [2, 3, 13, 17, 22, 35]. Badami et al. proposed a new recommendation model for combating over-specialization in polarized environments after finding that matrix factorization models are easier to learn in polarized environments, and in turn, encourage filter bubbles that reinforce polarization. Tintarev et al. attempted to use visual explanations, i.e., chord diagrams and bar charts, to address the problems [43].

There is a certain amount of work focusing on the detection or measurement of echo chamber and filter bubble, questioning whether they do exist [1, 4, 15, 27, 31, 33]. For example, Hosanagar et al. used data from an online music service to find out whether personalization is, in fact, fragmenting the population, and concluded that it does not [24]. They claimed personalization is a tool that helps users widen their interests, which in turn creates commonality with others. Sasahara et al. suggested echo chambers are somewhat inevitable given the mechanisms at play in social media, specifically, basic influence and unfriending [38]. Their simulation dynamics showed that the social network rapidly devolves into segregated, homogeneous, and polarized communities, even with a minimal amount of influence and unfriending.

Despite the reasonableness of prior works, severe limitations do exist, making the claims only plausible. One major aspect is that most of the existing works draw conclusions by means of simulation, or rely on self-defined networks and measurements with simplified dynamics [6, 9, 15, 20, 26, 31]. Built upon subjective assumptions, whether the modeling and analysis have the capability to reflect the truth seems dubious [37]. On the other hand, many of the prior works confound the meaning of the two effects or solely examine one of them without consideration of the other [17, 23, 27, 28, 37]. Exceptions such as [26] disentangle echo chamber from filter bubble, but suffer from the previously mentioned deficiency, i.e., reliance on simulation and simplified artificial settings. With the desire to address these limitations, we aim to explore the existence of echo chamber in real-world e-commerce systems.
…sure that all users have the same amount of interactions with the recommender system throughout an interaction block.

We define an interval as a block consisting of 𝑛 consecutive interactions, where 𝑛 is a constant decided by experimental studies. Moreover, different interactions occur at different frequencies; for example, the number of browsed items is much higher than the number of clicked items, and the number of clicked items is again higher than the number of purchased items, indicating that the length of the block (i.e., 𝑛) may vary based on the corresponding interactions. We primarily set the length of the interval as 𝑛 = 200 for browsed items (named a browsing block), 𝑛 = 100 for clicked items (named a clicking block), and 𝑛 = 10 for purchased items (named a purchasing block). If there are not enough interactions to constitute the last block, we drop it to make sure that all blocks have the same number of interactions. Meanwhile, to ensure the temporal effect of RS, we only keep those users who have at least three intervals in the three months, for each type of user-item interaction. Finally, as shown in Table 1, we have 7,477 users to examine the echo chamber effect on click behaviors, 3,557 users for purchase behaviors, and 7,417 users for browsing behaviors.

                  Click    Purchase   Browse
Following group   5,025    2,099      5,507
Ignoring group    2,452    1,458      1,910
All users         7,477    3,557      7,417
Table 1: Statistics of each user group.
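The blocking scheme above can be sketched as follows (a minimal illustration; the function names, the dictionary-of-lists input format, and the `keep_eligible_users` helper are our own assumptions, not from the paper):

```python
# Block lengths per interaction type (Section 3.3).
BLOCK_SIZE = {"browse": 200, "click": 100, "purchase": 10}

def split_into_blocks(interactions, n):
    """Split one user's time-ordered interactions into consecutive blocks
    of n interactions, dropping the incomplete last block."""
    num_blocks = len(interactions) // n
    return [interactions[i * n:(i + 1) * n] for i in range(num_blocks)]

def keep_eligible_users(user_logs, kind):
    """Keep only users with at least three complete blocks of this
    interaction type, as required for the temporal analysis."""
    n = BLOCK_SIZE[kind]
    blocked = {u: split_into_blocks(log, n) for u, log in user_logs.items()}
    return {u: b for u, b in blocked.items() if len(b) >= 3}
```

Dropping the incomplete last block and requiring at least three complete blocks mirrors the filtering that produces the user counts in Table 1.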
Considering that the browse log is potentially noisy (i.e., possibly indifferent to a user, see Section 3.1), and that clicking and purchasing are commonly used to represent users' implicit feedback to RS, in the following we use click and purchase behaviors to represent users' preferences on the items (when we detect the echo chamber effect), and examine the temporal changes in the content diversity of recommended items via browsing behaviors (which may be the potential cause of echo chamber).
3.4 User Embeddings
As user interests are closely related to the user's interactions with items, we argue that the items a user clicked can reflect his/her click interests, and the items a user purchased can represent his/her purchase interests. However, only using discrete indexes to denote the interacted items is not sufficient to represent user interests, since we cannot know the collaborative relations between different items based on the indexes alone. Following the basic idea of collaborative filtering, we use the user-item interaction information to train an embedding model.

Items are encoded into item embeddings based on one of the state-of-the-art models [46]. To cluster and compare the items for different users at different times, we need to guarantee that the item embeddings are stable across the period of time under investigation. For this purpose, the embeddings are trained on all of the collected data until May 31, 2019, which is the last day that our dataset contains. This is for two reasons: (1) since the training data contains all of the user-item interactions, it helps to learn more accurate embeddings; and (2) since the training procedure includes all items under consideration, we can guarantee that all embeddings are learned in the same space. After that, we use average pooling on the item embeddings to compute the user embeddings. Specifically, we use the average of the embeddings of the items that the user clicked (or purchased) within a user-item interaction block to represent the user's click preferences (or purchase preferences) during a certain period of time. In this way, user embeddings and item embeddings are in the same representation space.

4 MEASURES FOR ECHO CHAMBERS
To answer RQ1, we propose to study the reinforcement of user interests at a population level. The pattern of reinforcement could appear in a simple scenario, where some members highly support one opinion while others believe in a competing opinion. The phenomenon is reflected as a dense distribution on the two sides of the opinion axis. However, user interest in e-commerce is much more complicated, such that it cannot be simply classified into positive and negative. What we observe is that users can congregate into multiple groups in terms of distinct preferences. As a result, we implement clustering on user embeddings and measure the change in user interests with cluster validity indexes (more details can be found in Section 4.2). We measure the changes in terms of clustering on the embeddings at the beginning and at the end of the user interaction record, i.e., we compute the user embeddings respectively for the first and the last interaction block and measure the changes. To be clear, we refer to these two blocks as the "first block" and the "last block".

To answer RQ2, we propose to measure the content diversity in recommendation lists at the beginning and the end; more details can be found in Section 4.3. We examine whether there exists a trend that the recommendation system narrows down the contents provided to the users. Before we cluster the Following Group and the Ignoring Group, respectively, we need to know whether the two groups are clusterable and what the appropriate number of clusters is for each group. Thus, we first examine the clustering tendency and select the proper clustering settings, which will be introduced in Section 4.1.

4.1 Measuring Clusters
4.1.1 Clustering Tendency.
Assessing clustering tendency evaluates whether there exist meaningful clusters in a dataset before applying clustering methods. We use the Hopkins statistic (𝐻) [5, 29] to measure the tendency, since it can examine the spatial randomness of the data by testing the given dataset against a uniformly random-distributed dataset. The value of 𝐻 ranges from 0 to 1. A result close to 1 indicates a highly clustered dataset, while a result around 0.5 indicates that the data is random.

Let 𝑋 ∈ 𝑅^𝐷 be the given dataset of 𝑁 elements, and let 𝑌 ∈ 𝑅^𝐷 be a uniformly random dataset of 𝑀 (𝑀 ≪ 𝑁) elements with the same variation as 𝑋. We draw a random sample {𝑥_1^𝐷, 𝑥_2^𝐷, …, 𝑥_𝑀^𝐷} from 𝑋, and let 𝑠_𝑖^𝐷 and 𝑡_𝑖^𝐷 be the distances from 𝑥_𝑖^𝐷 and 𝑦_𝑖^𝐷, respectively, to their nearest neighbor in 𝑋. The Hopkins statistic is computed as follows:

    𝐻 = Σ_{𝑖=1}^{𝑀} 𝑡_𝑖^𝐷 / ( Σ_{𝑖=1}^{𝑀} 𝑠_𝑖^𝐷 + Σ_{𝑖=1}^{𝑀} 𝑡_𝑖^𝐷 )    (1)

The results are shown in Table 3. The Hopkins statistic examines the datasets before applying further measurement to them. The characteristics of a clusterable dataset (𝐻 > 0.5) can be observed under its optimal setting in K-means clustering. Then we can select the proper number of clusters for each user group.
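Equation (1) can be computed directly; the sketch below is one possible reading, assuming Euclidean distance, scikit-learn's nearest-neighbor search, and uniform sampling from the bounding box of 𝑋 as the "same variation" reference distribution:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    """Hopkins statistic H (Eq. 1): around 0.5 for spatially random data,
    close to 1 for clearly clustered data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)  # sample size M << N
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # s_i: distance from a sampled real point to its nearest *other* point
    # in X (the second neighbor excludes the zero distance to itself)
    sample = X[rng.choice(n, m, replace=False)]
    s = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    # t_i: distance from a uniform random point (bounding box of X)
    # to its nearest neighbor in X
    Y = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    t = nn.kneighbors(Y, n_neighbors=1)[0][:, 0]

    return t.sum() / (s.sum() + t.sum())
```

For tightly clustered data the uniform reference points land far from the clusters, so 𝑡 dominates and 𝐻 approaches 1; for uniform data 𝑠 and 𝑡 are statistically similar and 𝐻 stays near 0.5.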
4.1.2 Clustering Settings.
We use the Bayesian Information Criterion (BIC) to determine the number of clusters for each group. Due to the high-dimensional characteristics of the user embeddings, it is hard to choose the optimal 𝐾 (i.e., the number of clusters) via common k-selection techniques, such as the elbow method or the average silhouette method. To deal with this, we use a model selection technique to compare the clustering results under different 𝐾s, and we choose BIC, which aims to select the model with maximum likelihood. Its revised formula for partition-based clustering suits our task well, and the formula also contains a penalty term that avoids overfitting:

    BIC = Σ_{𝑖=1}^{𝐾} [ 𝑛_𝑖 log(𝑛_𝑖/𝑁) − (𝑛_𝑖 𝐷 log(2𝜋Σ))/2 − 𝐷(𝑛_𝑖 − 1)/2 ] − (𝐾(𝐷 + 1) log 𝑁)/2    (2)

where the variance is defined as Σ = 1/(𝑁 − 𝐾) · Σ_{𝑖=1}^{𝐾} Σ_{𝑗=1}^{𝑛_𝑖} ∥𝑥_𝑗 − 𝑐_𝑖∥². The 𝐾-class clustering has 𝑁 points 𝑥_𝑗 ∈ 𝑋^𝐷, and 𝑐_𝑖 is the center of the 𝑖-th cluster with size 𝑛_𝑖, 𝑖 = 1, …, 𝐾. BIC evaluates the likelihood of different clustering settings. In our case, we use BIC to determine the number of clusters (i.e., 𝐾). The 𝐾 of the maximum BIC is the optimal number of clusters; we pick the 𝐾 (i.e., 𝐾*) corresponding to the first decisive local maximum (i.e., BIC*).
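A sketch of the K-selection procedure, assuming K-means as the partitioning algorithm and a spherical-Gaussian BIC with a 𝐾(𝐷+1)/2 · log 𝑁 penalty in the spirit of Eq. (2); the helper names are ours, and for simplicity we return the global maximum over the candidate range rather than the first decisive local maximum:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_bic(X, kmeans):
    """BIC of a fitted K-means model: spherical-Gaussian log-likelihood
    of the partition minus a K(D+1)/2 * log N complexity penalty."""
    N, D = X.shape
    K = kmeans.n_clusters
    labels, centers = kmeans.labels_, kmeans.cluster_centers_
    sizes = np.bincount(labels, minlength=K)
    # pooled variance over all clusters (the Σ below Eq. (2))
    var = np.sum((X - centers[labels]) ** 2) / (N - K)
    ll = np.sum(sizes * np.log(sizes / N)
                - sizes * D * np.log(2 * np.pi * var) / 2
                - D * (sizes - 1) / 2)
    return ll - K * (D + 1) * np.log(N) / 2

def select_k(X, k_range):
    """Pick K* as the K with the maximum BIC over the candidate range."""
    scores = {k: clustering_bic(X, KMeans(n_clusters=k, n_init=10,
                                          random_state=0).fit(X))
              for k in k_range}
    return max(scores, key=scores.get)
```

The penalty grows with 𝐾, so over-segmenting a dataset is rewarded only when the likelihood gain outweighs the added model complexity.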
4.2 Measuring Reinforcement of User Interests
We use cluster validity [44] to compare the user embeddings of the two user groups, and observe the changes in clustering across different months. Originally, this technique is known as the procedure for evaluating how a clustering algorithm performs on a given dataset. The process evaluates the results under different parameter settings via a set of cluster validity indexes. These indexes can be grouped into two types: internal indexes (Section 4.2.1) and external indexes (Section 4.2.2). The external indexes are based on ground-truth clustering information, which is not always available for a dataset. On the contrary, internal indexes can evaluate a clustering without knowing the optimal classification. We use both of them to measure the temporal changes in clustering in both user groups.

4.2.1 Internal Validity Indexes.
A good clustering algorithm is required to satisfy several valid properties, such as compactness, connectedness, and spatial separation [21]. One type of internal index evaluates to what extent the clusters satisfy these properties, and a prominent example is the Calinski-Harabasz index [8]. Another type is applied to crisp clustering or fuzzy clustering [34, 47]. Since we want to explore how user interests shift at the population level, we apply the former type of internal index to the clustering results of the user embeddings, in order to detect the polarization tendency in user preferences by tracking how the index changes over time.

The Calinski-Harabasz (𝐶𝐻_𝐾) index scores a clustering by the variation ratio between the sum-of-squares between clusters (𝑆𝑆𝐵_𝐾) and the sum-of-squares within clusters (𝑆𝑆𝑊_𝐾) under 𝐾-class clustering. Based on this, we can compare the clustering of the same group at different times under the same setting, and a higher score indicates a better clustering. Let 𝑁 denote the size of the dataset {𝑥_1, …, 𝑥_𝑁}, and 𝐾 denote the number of clusters. The centroids of the clusters are denoted as 𝐶_𝑖, 𝑖 = 1, 2, …, 𝐾. A data point 𝑥_𝑗 belongs to a cluster 𝑝_𝑗 with the corresponding cluster centroid 𝐶_{𝑝_𝑗}, where 𝑗 = 1, 2, …, 𝑁 and 𝑝_𝑗 ∈ {1, 2, …, 𝐾}. The Calinski-Harabasz index is calculated as follows:

    𝐶𝐻_𝐾 = (𝑆𝑆𝐵_𝐾 / 𝑆𝑆𝑊_𝐾) · ((𝑁 − 𝐾) / (𝐾 − 1))    (3)

where 𝑆𝑆𝑊_𝐾 = Σ_{𝑗=1}^{𝑁} ∥𝑥_𝑗 − 𝑐_{𝑝_𝑗}∥², 𝑆𝑆𝐵_𝐾 = Σ_{𝑖=1}^{𝐾} 𝑛_𝑖 ∥𝑐_𝑖 − 𝑋̄∥², and 𝑋̄ represents the mean of the whole dataset. Based on this definition, an ideal clustering result means that elements within a cluster congregate and elements between clusters disperse, leading to a high 𝐶𝐻_𝐾 value. Intuitively, we can assign users to clusters and calculate the CH index for this clustering result based on the user embeddings. After a certain period of time, the user embeddings would change due to the user's new interactions during that time, and we can use the new embeddings to calculate the CH index without changing the users' cluster assignments. By comparing the CH index before and after the user embeddings change, we are able to evaluate to what extent the user preferences have changed (see Figure 2, details to be introduced later). Furthermore, based on the user IDs, we can track how each user's preference changed in the latent space.

4.2.2 External Validity Indexes.
We use external validity indexes to measure the similarity between the "first block" and the "last block" embeddings in terms of clustering. This kind of index utilizes ground-truth class information to evaluate clustering results: a clustering close to the optimal clustering receives a high index score. In other words, the external indexes compute the similarity between the given clustering and the optimal clustering.

External validity indexes are constructed on the basis of contingency tables [48]. A contingency table is a matrix containing the interrelation between two partitions of a set of 𝑁 points. Partitions P = {𝑃_1, 𝑃_2, …, 𝑃_{𝐾1}} of 𝐾1 clusters and Q = {𝑄_1, 𝑄_2, …, 𝑄_{𝐾2}} of 𝐾2 clusters give us a 𝐾1 × 𝐾2 contingency table (𝐶_{𝑃𝑄}) consisting of the numbers of common points (𝑛_{𝑖𝑗}) in 𝑃_𝑖 and 𝑄_𝑗, as follows:

            | 𝑛_11     𝑛_12     ⋯   𝑛_{1𝐾2}  |
    𝐶_{𝑃𝑄} = | 𝑛_21     𝑛_22     ⋯   𝑛_{2𝐾2}  |    (4)
            | ⋮        ⋮        ⋱   ⋮        |
            | 𝑛_{𝐾1 1}  𝑛_{𝐾1 2}  ⋯   𝑛_{𝐾1 𝐾2} |

where 𝑃_𝑖 and 𝑄_𝑗 are the clusters in P and Q with sizes 𝑝_𝑖 and 𝑞_𝑗, 𝑖 = 1, 2, …, 𝐾1, 𝑗 = 1, 2, …, 𝐾2. Therefore, we have 𝑝_𝑖 = Σ_{𝑗=1}^{𝐾2} 𝑛_{𝑖𝑗}, 𝑞_𝑗 = Σ_{𝑖=1}^{𝐾1} 𝑛_{𝑖𝑗}, and Σ_{𝑖=1}^{𝐾1} Σ_{𝑗=1}^{𝐾2} 𝑛_{𝑖𝑗} = 𝑁.

The techniques for comparing clusters for external validation are divided into three groups: pair-counting, set-matching, and information-theoretic [45]. We use the Adjusted Rand Index (ARI) [25], a pair-counting index that counts the pairs of data points on which the two clusterings agree or disagree. In this way, we can evaluate the portion of users shifting to another cluster.
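Both measurements can be sketched with scikit-learn's stock metrics (the function names are ours). `ch_drop` follows the Section 4.2.1 procedure: cluster the first block, freeze the labels, and re-score the last-block embeddings with the Calinski-Harabasz index; `clustering_shift` follows Section 4.2.2: cluster the two blocks independently and compare the partitions via the contingency table and ARI:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def ch_drop(first_emb, last_emb, k):
    """Cluster the first-block embeddings, keep the labels fixed, and
    score both snapshots with the Calinski-Harabasz index (Eq. 3).
    A larger drop means the old partition fits the new embeddings worse."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(first_emb)
    ch_first = calinski_harabasz_score(first_emb, labels)
    ch_last = calinski_harabasz_score(last_emb, labels)
    return ch_first - ch_last

def clustering_shift(first_emb, last_emb, k):
    """Cluster each snapshot independently and compare the two partitions
    with the ARI; the contingency table (Eq. 4) underlies the pair counts."""
    p = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(first_emb)
    q = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(last_emb)
    return adjusted_rand_score(p, q), contingency_matrix(p, q)
```

ARI is invariant to label permutation, which is what makes the two independently obtained partitions comparable without any cluster alignment step.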
(a) Click – Following Group (b) Click – Ignoring Group (c) Purchase – Following Group (d) Purchase – Ignoring Group
Action User type Amount First Block Last Block P-value the Following Group uses 𝐾 with the range of [19, 29] and [6, 16] for
All users 4904 0.7742 0.7713 4.33e−9 the click embedding and the purchase embedding respectively, and
Following 2452 0.7756 0.7746 3.05e−2 the Ignoring Group uses 𝐾 with the range of [15, 25] and [4, 14].
Click
Ignoring 2452 0.7728 0.7680 1.53e−21
Between-group p-value 4904 1.58e−8 2.88e−28
5.3 Results Analysis
All users 2916 0.7264 0.7279 1.41e−2
Following 1458 0.7223 0.7248 1.25e−6 RQ1: Does RS reinforce user click or purchase interests?
Purchase
Ignoring 1458 0.7305 0.7310 0.27
Between-group p-value 2916 2.65e−28 1.02e−24
5.3.1 Internal Validity Indexes.
As introduced in Section 4.2.1, CH can measure the extent of varia-
Table 3: Hopkins statistic. tion of the within-group clustering at different times. We first use
CH to examine the tendency of reinforcement in user interests. The
average scores of the CH index are plotted in Figure 2, we can find
Comparing the changes of 𝐻 score in the first and last block, the
that both groups have a drop in CH after three months at all 𝐾s,
clustering tendency decreases in click embeddings but increases
and CH also decreases as 𝐾 becomes larger both in the first blocks
in purchase embeddings. Furthermore, the clustering tendency of
and the last blocks. The common decreasing trend over time in two
purchase embedding is less changeable and even slightly increased
groups might attribute to how we compute CH in the last blocks.
because there are fewer local shifts in the latent space of purchase
As we mentioned in Section 4.2.1, we assign the clustering parti-
embedding compared with the temporal changes in click embed-
tion results of the first blocks to the last blocks, which means that
ding. This might attribute to the fact that users’ tastes reflected in
the same user will have the same cluster label both at the begin-
purchased items are relatively stable since users cannot choose to
ning and at the end. However, the decrease in CH suggests that the
buy whatever they want as they need to pay the price for it.
temporal shifts in the user embeddings might have made the as-
5.2.2 Select the Number of Clusters. signment of clusters unsuitable, i.e., the ideal clustering partition of
Since we have shown that the Following Group and Ignoring Groups the first blocks, can no longer serve as the ideal clustering partition
are clusterable in the previous section, we now use BIC to detect of the last blocks due to the temporal shifts in the user embed-
the optimal number of clusters (𝐾 ∗ ) for each group of embeddings. dings. Besides the effect of RS, the changes in user embeddings can
The average BIC curves are plotted in Figure 3. We do not force also result from other factors. For instance, user interest can vary
the clustering on each dataset to use the same number of clusters a lot, along with changes in external conditions in e-commerce,
in consideration of underestimation caused by inappropriate 𝐾 such as sales campaigns. As a result, these temporal shifts of user
settings. Clustering settings, such as 𝐾, have to fit the datasets embeddings in the latent space are reflected as the decrease in CH.
well to guarantee the optimal clustering results. The inaccurate In practice, we can hardly avoid this “natural” reduction caused
results might lead to overestimating or underestimating of the echo by the e-commerce platform. As a result, we evaluate the difference
chamber effect. between the two user groups to find the effect that comes from the
We set the corresponding 𝐾 of the maximum BIC as the optimal RS. We compute the temporal decreases in CH for each group with
number of clusters 𝐾 ∗ . The 𝐾 ∗ s is 24, shown in Figure 3(a), for 𝐾 in [𝐾 ∗ − 5, 𝐾 ∗ + 5] (see Table 4). As is shown in the table, the
the Following Group, and 20, shown in Figure 3(b), for the Ignoring drops of CH at 𝐾 ∗ are 48.22 and 50.73 for the Following Group and
Group, with the click embeddings. Meanwhile, 𝐾 ∗ is 11, shown in Ignoring Group in click embedding, and the average drops of CH in
Figure 3(c), for the Followings Group, and 9, shown in Figure 3(d), [𝐾 ∗ −5, 𝐾 ∗ +5] are 48.41 and 50.95 respectively. Moreover, purchase
for the Ignoring Group, with the purchase embeddings. Intuitively, embeddings have similar results that the decreases of CH for the
we could directly compare the user groups with their own 𝐾 ∗ , but Following Group are 32.74 at 𝐾 ∗ and 33.26 for average, and the
the local areas around maximum in the curves seem to be quite reductions for the Ignoring Group are 35.45 and 37.29. We further
flat. As a result, we believe the measurements around 𝐾 ∗ would check the statistical significance of the difference between two
show more reliable and plausible results than the measurement groups, finding that all differences are at 95% confidence interval
exactly at 𝐾 ∗ . Thus, we execute the experiments in the range of (i.e., 𝑝-value is less than 0.05). Overall, CH drops slower in Following
[𝐾 ∗ −5, 𝐾 ∗ +5]. The average of results is used to examine the changes Groups, showing a more stable tendency than that in Ignoring Group.
of clustering via cluster validity indexes introduced next. Finally, Accordingly, the Ignoring Group, which falls faster in CH, disperses
2267
Industry (SIRIP) Papers I SIGIR ’20, July 25–30, 2020, Virtual Event, China
(a) Click – Following Group (b) Click – Ignoring Group (c) Purchase – Following Group (d) Purchase – Ignoring Group
Figure 3: Bayesian Information Criterion (BIC). The 𝐾 of the maximum BIC is the optimal number of clusters.
Table 4: Calinski-Harabasz score. The corresponding 𝐾s for Table 5: ARI scores. Same as 𝐶𝐻𝐾 , the corresponding 𝐾s for
each groups are introduced in Section 5.2.2 each groups are introduced in Section 5.2.2
to a wide range on the latent space, revealing that it may receive a milder influence of reinforcement on user preference.

The lower dispersion of the Following Group in the latent space is possibly due to multiple factors. On the one hand, the Following Group might have more users who hold on to their previous preferences in items; on the other hand, the Following Group has fewer changes in its interests than the Ignoring Group does. Either reason leads to the conclusion that user interest in the Following Group has a strengthening trend over time, so that its dispersion in the latent space is suppressed to some extent.

5.3.2 External Validity Indexes.
We also examine the temporal changes in clustering via the external validity index, ARI. Unlike CH, which uses the same labels on the two datasets, ARI compares the different clusterings of the first and the last block and measures the similarity between the clusterings. We plot the similarities (ARI) at different 𝐾s in Figure 4 and list the average ARI results in Table 5. We find that in the click embedding, the Following Group has a higher ARI than the Ignoring Group (average ARI of 0.0986 and 0.0765 respectively, with a 𝑝-value of 2.28e−51). In the purchase embedding, the difference between the two groups is not statistically significant; the 𝑝-value for the difference of the average ARI is 0.53. A similar observation appears in the curves in Figure 4: the curve in Figure 4(a) is higher than the curve in Figure 4(b), but the curves in Figure 4(c) and Figure 4(d) almost overlap. Looking at each pair of ARIs of the two user groups in the purchase embedding, the differences at half of the 𝐾s in [𝐾∗ − 5, 𝐾∗ + 5] are not significant.

To sum up, the Following Group has fewer temporal shifts in the click embedding but no evident difference in the purchase embedding. In other words, in terms of click interests, the partitions at the beginning and the end in the Following Group are more similar, indicating more connections; in purchase interests, the clustering at the end does not show a trend of sticking to the previous clustering in either group. The additional changes appearing in both groups in the purchase embedding could be caused by the fact that users have fewer choices to purchase because of objective constraints, such as their incomes and the item prices. RS cannot “force” users to buy items they cannot afford; thus, even if user interest has been reinforced, the shifts in preferences might not appear in purchase behaviors. However, in click behaviors, the Following Group does seem to strengthen its preferences under the effect of RS, since users are free to click items they are interested in, and the interests presented in the click embedding face no such constraints. The group with the higher ARI has fewer changes, which means its users stick to the items they interacted with before and intensify their preferences. This evidence confirms the conclusion in Section 5.3.1 that there is a tendency of reinforcement in user interests in the Following Group.

RQ2: If user interests are strengthened, is it caused by RS narrowing down the scope of items exposed to users?

After an affirmative answer to RQ1, we examine RQ2 to explore the potential cause of the reinforcement in user interests: narrowed contents offered to users. To do so, we measure the content diversity of the recommendation lists at the beginning and the end. The
Figure 4: ARI at different 𝐾s. Panels: (a) Click – Following Group; (b) Click – Ignoring Group; (c) Purchase – Following Group; (d) Purchase – Ignoring Group.
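The similarity measure behind Figure 4 and Table 5 is the adjusted Rand index between the first-block and last-block clusterings. A minimal pure-Python version is sketched below (it assumes non-degenerate partitions; the example labelings are hypothetical, not the paper's data):

```python
from collections import Counter

def _comb2(m):
    """Number of unordered pairs among m items: C(m, 2)."""
    return m * (m - 1) // 2

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two partitions of the same users: 1 = identical
    partitions (up to relabeling), ~0 = agreement expected by chance."""
    n = len(labels_a)
    # Contingency table between the two clusterings.
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(_comb2(c) for c in contingency.values())
    sum_a = sum(_comb2(c) for c in Counter(labels_a).values())
    sum_b = sum(_comb2(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / _comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# The same grouping under different cluster ids still scores 1.
first_block = [0, 0, 1, 1, 2, 2]
last_block = [2, 2, 0, 0, 1, 1]
assert adjusted_rand_index(first_block, last_block) == 1.0
```

A higher ARI between the two blocks means the partition changed less over time, which is how the curves in Figure 4 and the averages in Table 5 are read.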
Figure 5: Recommendation content diversity (pairwise distance between user embeddings) in two user groups. Panels: (a) First Block – Following Group; (b) Last Block – Following Group; (c) First Block – Ignoring Group; (d) Last Block – Ignoring Group.
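For reference, the content-diversity measure plotted in Figure 5, i.e., the average pairwise distance between the embeddings in a recommendation list, can be sketched as follows; the toy item vectors are hypothetical:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def content_diversity(embeddings):
    """Average pairwise Euclidean distance between the embeddings of the
    items in one recommendation list: lower = more homogeneous content."""
    pairs = list(combinations(embeddings, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical lists: the 'last' block recommends near-duplicate items,
# so its diversity should drop relative to the 'first' block.
first_block = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
last_block = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
assert content_diversity(last_block) < content_diversity(first_block)
```

The within-group 𝑝-values in Table 6 then test whether this quantity drops significantly from the first block to the last.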
distributions of content diversity (the average pairwise distance of item embeddings) in the first and last blocks for display are plotted in Figure 5, and the corresponding average content diversities are listed in Table 6. These distributions are approximately normal, and the first blocks have a larger density around higher content diversity than the last blocks do. Also, the distributions become more dispersed over time, lowering the average of the whole group. Furthermore, the average content diversity gives us a consistent observation. When paying attention to the overall temporal changes, we find that the content diversity among all users falls from 1.0960 to 1.0937 with a 𝑝-value of 6.10e−11. Even though the drop is tiny, it indicates that both groups go through the trend of narrowing down the scope of the content displayed to users. Additionally, the content diversity in the Following Group has a larger reduction, from 1.0945 to 1.0882, than the reduction in the Ignoring Group. In fact, the decrease in the Ignoring Group is negligible, given its high 𝑝-value of 0.67. This is because RS learns more about followers from their interactions, such as clicks and purchases; thus, it is more likely for RS to provide items similar to what these users have previously interacted with.

Table 6: The content diversity of recommended items.

                          Amount    First       Last        Within-group p-value
  All users               3820      1.0969      1.0937      6.10e−11
  Following group         1910      1.0945      1.0882      1.95e−20
  Ignoring group          1910      1.0992      1.0989      0.67
  Between-group p-value   3820      2.13e−12    2.16e−56    –

In e-commerce, users affect the recommendations exposed to them through their actions, and their actions are influenced in return; these procedures form a feedback loop, which strengthens the personalized recommendation and shrinks the scope of the content offered to users. As a result, the filter bubble effect occurs in the Following Group. Conversely, the Ignoring Group provides only minimal information about their tastes in items, since they do not interact with RS much. Hence, RS recommends items from a broad scope to explore these users' preferences. The difference between the two groups is also statistically significant at all times. The Ignoring Group has a higher diversity of 1.0992 in the beginning, and the difference is further enlarged in the end, since the content diversity in the Following Group drops a lot. Consequently, the scope of items recommended to the Following Group has been repeatedly narrowed down, strengthening user interests in this group.

As we claimed in the answer to RQ1, the reinforcement in preferences (the echo chamber effect) is reflected in the temporal shifts of user embeddings in clustering. In particular, the echo chamber appears in both user click interests and user purchase interests, but the effect on the latter is slight. However, on other RS platforms, such as movie recommendation [33], opposite observations appear in the Following Group, indicating that RS helps users explore more items and mitigates the reduction in content diversity. One possible reason is that, unlike products in e-commerce, promotional campaigns for movies mostly focus on commercial films. Therefore, movie recommendation platforms can still fill the recommendation list with niche movies and slow down the reduction in content diversity.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we examine and analyze the echo chamber effect in a real-world e-commerce platform. We found that a tendency toward echo chambers exists in personalized e-commerce RS in terms of user click behaviors, while in user purchase behaviors this tendency is mitigated. We further analyzed the underlying reason for these observations and found that a feedback loop exists between users and RS, which means that the continuous narrowed exposure of