The Evolution of Web Archiving
DOI 10.1007/s00799-016-0171-9
M. Costa et al.
ing and contributing to individual and community memory [5]. It is, therefore, important to preserve these data, not only for historical and social research [6–12], but also to support current technology, such as assessing the trustworthiness of statements [13], detecting web spam [14], improving web information retrieval [15] or forecasting events [16].

At least 68 web archiving initiatives undertaken by national libraries, national archives, private companies and consortia of organizations are acquiring and preserving parts of the web. Together, they hold more than 534 billion files (17 PB) and this number continues to grow as new initiatives arise. Some country code top-level domains and thematic collections are being archived regularly,3 while other collections related to important events, such as September 11, are created at particular points in time.4 Web archives also contribute to the preservation of content born in non-digital formats that were afterwards digitized and published online, such as The Times Archive5 with news since 1785. As a result, web archives often contain millions or billions of archived documents and cover decades or even centuries in the case of digitized publications. The historic interest in these documents is also growing as they age, becoming a unique source of past information for widely diverse areas, such as sociology, history, anthropology, politics, journalism, linguistics or marketing.

However, despite the existence of web archives since 1996 and their joint efforts to preserve digital information, information about web archiving initiatives and the services they provide is scarce. Without knowing the status of current web archiving it is impossible to understand its strengths, limitations and the developments that are still needed to turn these document repositories into useful sources of information. Without knowing the preferences, trends and needs of the web archiving community it is difficult to adapt current technology to the emerging challenges and develop strategies to anticipate future problems. Motivated by this lack of knowledge in the research community, we conducted two surveys to gather results about existing web archiving initiatives across the globe. The first survey, already published, provided a comprehensive characterization of world wide web archiving initiatives in 2010 [17]. The second survey was carried out in 2014 and provides an updated characterization of these initiatives. Both surveys analyzed the same metrics, which enabled us to study the evolution of the characteristics of web archiving initiatives, such as the location, creation year, selection policy, used formats, number of people engaged, volume of archived data, access type and employed technology. We also compared our two surveys against the results obtained from other surveys whenever possible.

The analysis evidences a significant growth in the number of initiatives, countries hosting these initiatives, volume of data and number of contents preserved, which indicates a growing effort by the web archiving community to preserve the web. A cause for concern is the small amount of archived data in comparison with the amount of data being published on the web. This will likely create a knowledge gap about the present time. On the other hand, the amount of archived data is larger and grows faster than the amount processed by any commercial web search engine, which raises scalability challenges in giving efficient and effective data access. In fact, the search tools have not changed in recent years, being essentially based on commonly used web search technology that does not take into account the specificities of web archiving. These tools perform poorly and greatly hinder the finding of historical information [18].

The remainder of this paper is organized as follows. Section 2 describes the background and covers related work. Section 3 describes the methodology for conducting the surveys on web archiving initiatives in 2010 and 2014. Section 4 presents the results obtained in the surveys and the analysis of the advancements made in web archiving during that period. Section 5 finalizes with the conclusions.

3 E.g., Internet Archive available at http://www.archive.org.
4 E.g., Library of Congress Web Archives available at http://www.loc.gov/minerva.
5 http://www.thetimes.co.uk/tto/archive/.

2 Related work

Cultural heritage institutions, such as museums, libraries and archives, have been preserving the intangible culture of our society (e.g., folklore, traditions, language) and the legacy of physical artifacts (e.g., monuments, books, works of art). Web archives are a novel form of cultural heritage institutions mandated to preserve similar artifacts. However, the artifacts of web archives are born-digital and digitized contents.

Web archives are a special type of digital libraries. Both share the responsibility of preserving information for future generations. This includes all types of multimedia, such as images and videos, besides the digital counterparts of printed documents. The main difference is that web archives usually grow to a data size that exceeds traditional organization and management of typical digital libraries. Digital libraries are based on meta-data describing manually curated artifacts and catalogs of these artifacts, which are usually used to explore and search digital collections, for instance, through faceted search. However, the experience from the Pandora (National Library of Australia)6 and the Minerva (Library of Congress)7 projects showed that this is not a viable option for web archives. The size of the web makes traditional methods for cataloging too time consuming and expensive, beyond the capability of library staff. One of the conclusions from the final report of the Minerva project is that automatic indexing should be the primary strategy for information discovery [19].

The first web site, presented in Fig. 1, was created by Tim Berners-Lee at the European Organisation for Nuclear Research (CERN) and published in August 1991. This site describes the basis of the world wide web and is back online at its original URL.8 The first web archives appeared only in 1996 and do not contain sites prior to this date with the exception of some pages recovered from backups stored in floppy disks or CDs. The Internet Archive, a USA-based non-profit foundation, was one of the first web archives and has been broadly archiving the web since 1996. It leads the most ambitious initiative. In 2013, the Internet Archive was preserving 240 billion archived documents with a total of about 5 PB of data [20]. In 2014, it held 376 billion archived web pages, which represent 13.8 PB of data. The Pandora and Tasmanian web archives from Australia, and the Kulturarw3 web archive from Sweden, were also created in 1996. Many other initiatives followed since then and a significant effort has been employed by the research community in the web archiving domain. Many of these initiatives are members of the International Internet Preservation Consortium (IIPC), which leads the development of several open-source tools, standards and best practices for web archiving [21]. A timeline of some of these initiatives can be obtained online.9

Fig. 1 A version of 1992 of the first web site. This earliest version found at CERN describes the world wide web project

The initiatives mentioned above archived a large number of web sites according to some selection policy. In addition to these, there are services that enable any person to permanently archive a web page given a URL, such as Perma.cc,10 WebCitation11 or Archive.is.12 Each archived page receives a unique link, such as a Digital Object Identifier, to direct readers to its original version that will remain available online. Several user needs are met by these services, such as scholars preserving web pages cited in their work [22] or Supreme Courts preserving citations in their published decisions [23].

6 http://pandora.nla.gov.au.
7 http://www.loc.gov/minerva.
8 http://info.cern.ch/hypertext/WWW/TheProject.html.
9 http://timeline.webarchivists.org.
10 https://perma.cc/.
11 http://webcitation.org/.
12 http://archive.is/.

2.1 Data access

Much of the effort on web archive development focuses on acquiring, storing, managing and preserving data [19]. However, data must also be accessible to users who need to exploit and analyze them. Due to the challenge of indexing all the collected data, the prevalent discovery method in web archives is based on URL search, which returns a list of chronologically ordered versions for a given URL, such as in the Internet Archive’s Wayback Machine [24,25]. Figure 2 depicts the user interface of the Wayback Machine after searching a URL. A survey on European web archives reported that 68 % of web archives support this type of search [26]. However, URL search is limited, as it forces the users to remember the URLs, some of which refer to content that ceased to exist many years ago.

Another type of access is meta-data search, i.e., the search by meta-data attributes, such as category or theme. Meta-data search is provided by 65 % of European web archives [26]. For instance, the Library of Congress Web Archives13 supports search on bibliographic records. Some web archives support filtering results by domain and media type, while others organize collections by subject or genre to provide browsing functionality, such as Australia’s Pandora web archive [27]. Most web archives support narrowing the search results by date range.

Full-text search has become the dominant form of information discovery, especially in web search systems such as Google. These systems have a strong influence on the way users search in other settings. This explains why full-text search was reported as the most desired web archive functionality [28] and the most used when supported [29]. Despite the high computational resources required for this purpose, 70 % of the European web archives surveyed support full-text search for at least a part of their collections. Still, previous studies showed that the search services provided by these web archives are poor and frequently deemed unsatisfactory [18,30].

There are several access tools created for web archiving. The site14 of the International Internet Preservation Consortium (IIPC) has a list with many tools for acquisition, curation, storage and access. Thomas et al. present a comprehensive list of available tools and services that can be used in web archives [31].

2.2 Data analysis

The existing search tools require a substantial human effort when exploring and analyzing complex topics. Hence, analytical tools are being researched to fulfill informational needs for specific users requiring richer answers such as historians or journalists [32,33]. Such tools would help to explain the stories of the past and predict future events through the analysis and modeling of the evolution of data. Web archives are an exceptional data source to extract and leverage this evolution. A good example is the work of Leskovec et al. who tracked short units of information (e.g., phrases) from news as they spread across the web and evolve throughout time [34]. This tracking provided a coherent representation of the news cycle, showing the rise and decline of main topics in the media. Another example is the work of Radinsky and Horvitz who mined news and the web to predict future events [16]. For instance, they found a relationship between droughts and storms in Angola that catalyze cholera outbreaks. Anticipating these events may have a huge impact on world populations. Hoffart et al. built a large knowledge base in which entities, facts, and events are anchored in both time and space [35]. Web archives can be the source to extract these data, which will then be used for temporal analysis. For instance, since the veracity of facts is time dependent, it would be interesting to identify whether and when they become inaccurate.

Novel types of interfaces are also being researched to support data analysis over time. The Time Explorer, depicted in Fig. 3, combines several interfaces integrated in the same application designed for analyzing how topics evolve over time [36]. The core of the interface is a timeline with the main titles extracted from the news and a frequency graph with the number of news and entities most frequently associated with a given query displayed over the time axis. The interface also displays a list of the most representative entities (people and locations) that occur on matching news and that can be used to narrow the search. The Zoetrope system

13 http://www.loc.gov/webarchiving.
14 http://www.netpreserve.org/web-archiving/tools-and-software.
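The URL search described in Sect. 2.1 (a list of chronologically ordered versions for a given URL, optionally narrowed by date range) can be sketched as follows. This is only an illustration: the class `UrlIndex`, its methods, the 14-digit timestamp strings and the sample capture dates are hypothetical and are not taken from the Wayback Machine or from any surveyed tool.

```python
from bisect import insort
from collections import defaultdict


class UrlIndex:
    """Toy in-memory URL index (hypothetical): URL -> sorted capture timestamps."""

    def __init__(self):
        # Timestamps are assumed to be YYYYMMDDhhmmss strings, so that
        # lexicographic order coincides with chronological order.
        self._versions = defaultdict(list)

    def add(self, url, timestamp):
        # insort keeps each per-URL list sorted as captures arrive.
        insort(self._versions[url], timestamp)

    def search(self, url, start="00000000000000", end="99999999999999"):
        """Return the captures of `url` within [start, end], oldest first."""
        return [t for t in self._versions.get(url, []) if start <= t <= end]


# Made-up captures of one URL, added out of order.
index = UrlIndex()
index.add("http://info.cern.ch/", "20140301120000")
index.add("http://info.cern.ch/", "19961101083000")
index.add("http://info.cern.ch/", "20050615000000")

print(index.search("http://info.cern.ch/"))
print(index.search("http://info.cern.ch/", start="20000101000000"))
```

The sketch also makes the limitation noted in Sect. 2.1 visible: the lookup key is the exact URL, so a user who does not remember it gets an empty list.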
than a typical one-shot online survey with closed answers. However, the cost of processing the results for statistical analysis was significantly higher.

This survey was published in 2011 [17]. The data collected and validated enabled the creation of a Wikipedia page named List of Web Archiving Initiatives,20 so that the published information could be collaboratively kept up-to-date. Since then, the web archiving community has been updating this information, making it a useful resource. Figure 4 shows the Wikipedia page that contains three tables populated with information about the web archiving initiatives, such as their name, country, creation year, employed technologies, number of employees, number and volume of archived contents, archived formats, type of crawl and access methods.

20 http://en.wikipedia.org/wiki/List_of_Web_Archiving_Initiatives.

To observe how web archiving changed since the first survey, in 2014 we conducted the same analysis on the data published in the Wikipedia page and compared it against the results of 2010. In case of doubt or lack of information, we consulted the official sites of the initiatives or their scientific publications.

3.1 Comparison with other surveys

After our first survey in 2010, three other surveys were conducted on web archiving which obtained related information, such as the access type provided by the initiatives and the technology used to support them. The first survey was conducted by the Internet Memory Foundation over European web archives in 2010, from now on referred to as the IMF2010 survey [26]. The second and third surveys were conducted by the National Digital Stewardship Alliance (NDSA) in 2011 and 2013, and they covered organizations of the USA involved or planning to archive content from the web [45,46]. These surveys are referred to from now on as the NDSA2011 and NDSA2013 surveys. In this paper, we analyze and compare the results of the surveys whenever possible, despite our surveys having covered world wide web archiving initiatives, while the IMF2010 survey focused just on initiatives from Europe and the NDSA surveys on initiatives from the USA. Still, all surveys took place between 2010 and 2014, which makes their results comparable in time.

4 Results

4.1 Web archiving initiatives

Table 1 shows general statistics about web archiving initiatives surveyed in 2010 and 2014. Web archiving initiatives are very heterogeneous in size and scope. For instance, the web archive (WA) of Čačak aims to preserve sites related to this Serbian city, while the Internet Archive has the objective of archiving the global web. The obtained results show that most web archives exclusively hold content related to their hosting country, region or institution. However, there are a few initiatives, such as the Internet Memory Foundation and the Portuguese Web Archive, that also preserve information related to foreign countries.

Table 1 General statistics of web archiving initiatives

Characteristics                  2010   2014   Change (%)
Total initiatives                  42     68   +61.9
Countries hosting initiatives      26     33   +26.9
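The last column of Table 1 is the relative change between the two surveys. As a minimal sketch, assuming the usual definition (2014 value minus 2010 value, divided by the 2010 value), the figures can be recomputed as follows; the function name `relative_change` is illustrative only.

```python
def relative_change(old, new):
    """Percentage change from `old` to `new`, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


# (2010 value, 2014 value) pairs copied from Table 1.
table1 = {
    "Total initiatives": (42, 68),
    "Countries hosting initiatives": (26, 33),
}

for name, (v2010, v2014) in table1.items():
    print(f"{name}: {relative_change(v2010, v2014):+.1f} %")
```

The same function applies to any other pair of 2010 and 2014 counts reported in the surveys.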
We detected an increase in the number of web archiving initiatives, from 42 in 2010 to 68 in 2014. Since the creation and operation of a web archive is complex and costly, several initiatives exist to provide web archiving services (WAS) that can be independently operated by third-party archivists to harvest, build and preserve collections of digital content. These WAS enable focused archiving of web content by organizations, such as universities or libraries, that otherwise could not manage their own archives. In 2014, there were 11 initiatives (16 %) providing WAS against the previous 3 (7 %) offered in 2010. Some of these new WAS are the Aleph Archives,21 Hanzo Archives22 and Reed Archives.23 The oldest WAS are the Archive-It,24 ArchiveTheNet25 and Web Archiving Service.26 Of the 11 WAS, 6 operate in the USA, where most of them offer electronic discovery (ediscovery) services for enterprises, which are required by law since 2006 for the discovery of information in civil litigation or government investigations. In 2014, at least 19 % of the initiatives were using WAS. In 2010, this percentage was 16 %.

21 http://aleph-archives.com/.
22 http://www.hanzoarchives.com/.
23 http://www.reedarchives.com/.
24 http://www.archive-it.org.
25 http://archivethe.net.
26 http://webarchives.cdlib.org.

4.1.1 Human resources

The measurement of human resources engaged in web archiving activities was not straightforward (question 2). Most respondents could not provide an effort measurement in person-months. The presented reasons were that the teams were too variable and some services were contracted from third-party organizations out of their control. Instead, most of the respondents described their staff and hiring conditions. The obtained results of 2010 show that web archiving engaged at least 112 people in full time and 166 in part-time. The web archive teams were typically small, presenting a median staff of 2.5 people in full time (average of 3.5) and 2 people in part-time (average of 5). The staff was mostly composed of librarians and computer engineers. The results show that 11 initiatives (26 %) did not have any person dedicated full time. The effort of part-time workers was variable, for instance, at the Library of Congress they spent only a few hours a month. Most of the human resources were invested on data acquisition and quality control. The IMF2010 survey corroborates that web archive teams are small, but the number of staff depends on the phase of the project. Its results show that 38 % of fully operational initiatives count more than five full-time employees, while 67 % that started a project count between two and five employees.

Table 2 Staff statistics of web archiving initiatives

Characteristics               2010   2014   Change
Total people (full time)       112    108   −3.6 %
Total people (part-time)       166    197   +18.7 %
Total people                   278    305   +9.7 %
Median people (full time)      2.5      2   −20.0 %
Median people (part-time)        2      2   0.0 %
Average people (full time)     3.5    2.2   −37.1 %
Average people (part-time)       5      4   −20.0 %

In 2014, the size of the teams continued to be highly variable, where initiatives had teams without any person working in full time, such as the University of Texas at San Antonio WA, while other teams had 12 people working in full time, such as the Internet Archive, or 80 people working in part-time, such as the Library of Congress. As shown in Table 2, in 2014, the web archiving initiatives had in total 108 people working in full time and 197 in part-time. There was an increase from 278 to 305 people working in this area. The teams continued to be mostly small, having a median staff of 2 people in full time (average of 2.2) and 2 people in part-time (average of 4). There were 3 initiatives that did not have any person dedicated full time, against the 11 of 2010. Despite the large increase of the number of initiatives, the total number of people working on them increased only slightly, which led to a decrease in the median and average team size. The NDSA2013 survey shows a different reality with fewer people working in web archiving. The USA initiatives have a median staff of 0.25 people in full time. Only 19 % of the USA initiatives devote at least one person to handle web archiving tasks. The small size of the teams is likely due to the high percentage of initiatives that use WAS instead of running their own web archiving system.

4.1.2 Geographic location

Figure 5a presents the countries that hosted web archiving initiatives in 2010. The 42 initiatives were spread across 26 countries. There were 23 initiatives hosted in Europe, 10 in North America, 6 in Asia and 3 in Oceania. Half of the initiatives were hosted in countries belonging to the Organisation for Economic Co-operation and Development (OECD). From the 34 countries that belong to the OECD, 21 (62 %) hosted at least one web archiving initiative, which is an indicator of the importance of web archiving in developed countries. Most of the countries hosted one (74 %) or two initiatives (22 %). The only country that hosted more than two was the USA with a total of nine initiatives. Although being part of a country, initiatives like the Tasmanian WA (Australia), North Carolina WA (USA) or Digital Heritage Catalonia (Spain) were hosted in autonomous states and aimed at preserving regional content.

Figure 5b presents the location of all countries hosting web archiving initiatives in 2014. The 68 web archiving initiatives are spread across 33 countries. In 2010, there were only 26 countries hosting web archiving initiatives, which shows a growing awareness of the importance of web archiving all over the world. The USA continues to be the country with the most initiatives, increasing from 9 in 2010 to 19 in 2014. The second country with most initiatives is France, with five initiatives. Germany and Switzerland share the third place with four initiatives each. The distribution of the initiatives over the world is 38 in Europe (previously 23), 22 in North America (previously 10), 8 in Asia (previously 6), 3 in Oceania (equal) and 1 in Africa (previously 0). Notice that some initiatives have more than one location. There were increases in almost all continents, especially in Europe and North America. Africa received its first initiative hosted in Egypt, while South America does not have any yet.

When comparing the number and location of initiatives with other surveys, we detected that many were missing. The IMF2010 survey found 41 European initiatives fully operational in 2010, while we found 38 in 2014. The NDSA2011 and NDSA2013 surveys found 49 and 64 active initiatives in the USA, but we found only 19 in 2014. This difference is mostly due to colleges and universities, i.e., 36 in 2011 and 48 in 2013, included in the NDSA surveys and that were not included in our surveys. Future surveys should make an effort to cover all these initiatives. Nevertheless, both NDSA and our surveys show a growing trend of initiatives.

4.1.3 Growth

Figure 6 displays the evolution of the number of web archiving initiatives created per year, including the new initiatives recorded on the Wikipedia page. There was a growth from 4 initiatives in 1996 to 14 initiatives in 2003, which represents an average of 1.8 new initiatives per year. After 2003, many new initiatives appeared to solve the web ephemerality problem. For instance, in 2005 and 2007, nine and eight initiatives were created, respectively. There was an average growth of 5.4 initiatives per year from 2004 to 2012. There is no information on new initiatives created in 2013. One possible explanation for the significant and constant growth since 2003 was the concern raised by the United Nations Educational, Scientific and Cultural Organization (UNESCO) regarding the preservation of the digital heritage [4]. The NDSA2013 survey also shows a constant growth, especially between 2006 and 2013, when there was a great increase of initiatives mainly due to universities starting their web archiving programs. Universities created 39 (out of 67) initiatives during these 8 years, which indicates an emergent awareness in the academic community of the USA about the importance of preserving web content.

[Fig. 6: chart; horizontal axis: creation year, 1996 to 2013]

4.2 Archived data

4.2.1 Selection policy

Since the resources are scarce and not all the web can be preserved, the selection policy of most web archiving initiatives is to preserve the most relevant parts of the web from their own perspective. In the survey of 2010, all web archives selected specific sites for archiving. This selection is determined by multiple factors such as consent by the authors or relevance for inclusion in thematic collections (e.g., elections or natural disasters). However, 80 % of the web archives exclusively held content related to their hosting country, region or institution. Of the 42 initiatives, 11 (26 %) also performed broad crawls of the web, including all sites hosted under a given domain name or geographical location. The IMF2010 survey reported that 23 % of European web archives ran domain crawls, while 71 % performed thematic or selective crawls. The NDSA2011 survey reported that all USA initiatives archived web content from their own institution, as well as content from other organizations or individuals for future research.

Our results show that in 2014, at least 45 initiatives (66 %) performed selective crawls and 20 (29 %) country code top-level domain (ccTLD) or broad crawls of the web. Almost all initiatives continue to exclusively hold content related to their hosting country, region or institution. There are three initiatives that archive ccTLD of other countries besides their own. The Internet Archive and the Internet Memory Foundation share a vision to preserve web content from all over the world. The Portuguese Web Archive preserves content from 4 countries that have Portuguese as their official language.

4.2.2 Volume size

Figure 7 presents the distribution of the size of archived collections measured in total volume of data and number of contents. Notice that one HTML page containing three embedded images results in the archive of four contents. There was an increase of initiatives with collections between 10 and 100 TB to the detriment of collections between 1 and 10 TB. While in 2010, 50 % of the initiatives preserved collections smaller than 10 TB and 31 % preserved collections between 10 and 100 TB, in 2014 these percentages were 42 and 40 %, respectively. The percentage of initiatives with collections larger than 100 TB continues to be 19 %. In accordance with this finding, the percentage of initiatives with collections between 100 and 1000 million contents decreased from 43 to 33 %, mostly because the percentage of initiatives with collections with more than 1000 million contents increased from 22 to 33 %.

[Fig. 7: two-panel chart, (a) and (b); vertical axis: % initiatives; series: 2010 and 2014]

World wide web archives preserved from 1996 to 2010 a total of 181,978 million contents (6.6 PB). The Internet Archive by itself held 150,000 million contents (5.5 PB). In 2014, all initiatives had archived together at least 534,604 million contents, which sums around 17 PB of data. The 2014 totals thus correspond to 294 % of the 2010 contents and 258 % of the 2010 volume of data, i.e., increases of about 194 and 158 %, respectively. The Internet Archive continues to be by far the web archive with the largest collection with 376,000 million contents. The information of its volume of data was not available in the Wikipedia page. Hence, we extrapolated from the 2010 results and estimated 13.8 PB of data, which is consistent with scaling the 5.5 PB of 2010 by the growth in contents (376,000/150,000).

The selection policies of some initiatives intersect, which leads to a replication of archived content [47,48]. For instance, initiatives hosted in the same country may preserve some of the same sites. Initiatives with a broader scope, such as the Internet Archive, preserve some content that is also archived by national initiatives. The overlap of archived content is not contemplated in this paper.

4.3 Access and technologies

4.3.1 Formats to store archived content

Figure 8 presents the evolution of file formats used to store archived content. The ARC format defined by the Internet Archive was the de facto standard in 2010 [44]. In 2009, the WARC format was published by the International Organization for Standardization (ISO) as the official standard format for archiving web content and it was exclusively used by 10 % of the initiatives in 2010 [49]. The ARC and WARC formats were dominant in 2010, being used by 54 % of the initiatives.

Fig. 8 Usage of file formats to store web content

There was a decrease, from 26 % in 2010 to 13 % in 2014, of initiatives using exclusively the ARC format. These initiatives likely changed to the WARC format, which increased 3 percentage points, and to the ARC/WARC formats, which also increased 3 percentage points. The ARC and WARC formats continue to be by far the most predominant, being used in 2014 by 47 % of web archiving initiatives against the 54 % in 2010. Besides historical reasons, the widespread adoption of the ARC/WARC formats was motivated by the Archive-Access project, which freely provides open-source tools to process this type of files [50]. Only 10 % of the initiatives used other file formats in 2014, such as the HTTrack format. Still, 43 % of the initiatives did not report the adopted format in the Wikipedia page.
In 2010, some initiatives held the copyright of the archived
respectively. However, respondents frequently mentioned that full-text search was hard to implement and that the performance of NutchWAX was unsatisfactory, which was one reason for the partial indexing of their collections. Nonetheless, in 2010, NutchWAX supported full-text search for the Finnish Web Archive (148 million), Canada Web Archive (170 million), Digital Heritage of Catalonia (200 million), California Digital Library (216 million) and BnF (15 % of a collection of 200 TB). The IMF2010 survey shows that the European initiatives used similar tools. They used Heritrix to crawl web content (80 %), and for search, they used the Wayback Machine (67.5 %) or NutchWAX (70 %).

Despite the increase in the number of web archive services (WAS) from 3 in 2010 to 11 in 2014, the share of initiatives that used WAS increased by just 3 percentage points, from 16 to 19 %. Archive-It is the most used service, with a total of seven initiatives. The share of initiatives doing some in-house development increased from 9 to 19 %. This software was mostly developed by WAS, such as the Hanzo Archives’ access tools, or consisted of curation tools developed by libraries, such as the DigiBoard of the Library of Congress Web Archives. These increases contributed to the decrease in the use of the Archive-Access tools. Still, the Archive-Access tools continue to predominate, with 57 % of the initiatives using at least one of these tools in 2014, against 62 % in 2010. Lucene and Solr together continue to be used by 10 % of the initiatives, with a growing trend toward Solr.

The NDSA surveys show different results: the USA initiatives contracted WAS much more often. In 2011, 60 % of initiatives exclusively used WAS, and 63 % did so in 2013. Archive-It is the dominant external service, used by approximately 70 % of the initiatives, and the California Digital Library WAS is the second most used, with 17 %. Regarding technology to capture web content, Heritrix is the tool most used by USA initiatives (29 %), followed by HTTrack (18 %). The Wayback Machine increased from 76 % in 2011 to 89 % in 2013 as the preferred tool to view contents.

5 Conclusion

Web archiving has been gaining interest and recognition from modern societies around the world. Still, there is a lack of knowledge in the research community about the most recent developments in web archiving and the existing initiatives. This paper provides an updated global overview of these issues and discusses evolution trends.

Based on two surveys, we observed that web archiving initiatives are typically hosted by developed countries, but they are spread all over the world, on almost every continent. Web archives are generally run by small teams that mainly work on the acquisition and curation of data. Almost all initiatives exclusively hold content related to their hosting country, region or institution, which stresses the need for each country to finance at least one initiative at the national level.

Web archiving initiatives have existed since 1996 and their number has been growing ever since. In particular, from 2010 to 2014 there was a large increase in the number of initiatives, hosting countries, number of contents and volume of archived data. Currently, web archiving initiatives hold 17 PB (534,604 million contents), which shows a growing awareness of the importance of web archiving all over the world and a continued effort by the community to mitigate the web ephemerality problem.

On the other hand, despite the social and economic impact of losing information that is published exclusively on the web, the results show that the human resources invested in web archiving are still scarce and team sizes are even decreasing. The lack of resources will probably create a historical void in the future about our current time. Our results already show that only a small part of the web has been preserved.

The web archiving community is adopting common data formats and tools. ARC and WARC are the predominant data formats for storing archived content, but in recent years there has been a shift from ARC to WARC, likely to take advantage of the enhancements of the new format, which make it possible, for instance, to manage duplicated content and record contextual meta-data. Regarding technology, most initiatives continue to use Lucene-based solutions, such as NutchWAX or Solr, to support full-text search, the Wayback Machine to support URL search and display archived content, and Heritrix to crawl web content. This continuity could be explained by the significant number of developers and web archive initiatives that contribute to enhancing these projects.

The predominant methods for discovering archived content have remained URL, meta-data and full-text search. However, the survey respondents mentioned that the existing technology provides unsatisfactory search results and that full-text search, which is the method preferred by users, is hard to implement. Moreover, recent studies show that these technologies provide poor search results, making it difficult for users to find the desired information. With the fast growth of archived data, this problem is only exacerbated. Hence, the development of efficient and effective search technology is urgent to access the massive data already stored in web archives.

Acknowledgments This work could not have been done without the support of the Portuguese Web Archive team. We also thank FCT for the financial support of the Research Units of LaSIGE (PEst-OE/EEI/UI0408/2014) and INESC-ID (UID/CEC/50021/2013), and the DataStorm Research Line of Excellency (EXCL/EEI-ESS/0257/2012).
References 21. Grotke, A.: IIPC—2008 member profile survey results. Techni-
cal report, International Internet Preservation Consortium (IIPC)
1. Ntoulas, A., Cho, J., Olston, C.: What’s new on the web? The (2008)
evolution of the web from a search engine perspective. In: Proc. of 22. Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Bal-
the 13th International Conference on World Wide Web, pp. 1–12 akireva, L., Zhou, K., Tobin, R.: Scholarly context not found: one
(2004) in five articles suffers from reference rot. PloS One 9(12), 1–39
2. Dellavalle, R., Hester, E., Heilig, L., Drake, A., Kuntzman, J., (2014)
Graber, M., Schilling, L.: Going, going, gone: lost internet ref- 23. Lazun, M.J.: “Link Rot” and legal resources on the web: a 2013
erences. Science 302(5646), 787–788 (2003) analysis by the chesapeake digital preservation group. Technical
3. SalahEldeen, H., Nelson, M.: Losing my revolution: how many Report, The Chesapeake Digital Preservation Group (2013)
resources shared on social media have been lost? In: Theory and 24. Tofel, B.: ‘Wayback’ for accessing web archives. In: Proc. of the
Practice of Digital Libraries, pp. 125–137 (2012) 7th International Web Archiving Workshop (2007)
4. UNESCO: Charter on the preservation of digital heritage. 25. Jaffe, E., Kirkpatrick, S.: Architecture of the Internet Archive. In:
In: Adopted at the 32nd Session of the General Conference Proc. of SYSTOR 2009: The Israeli Experimental Systems Con-
of UNESCO (2003). http://portal.unesco.org/ci/en/files/13367/ ference, pp. 1–10 (2009)
10700115911Charter_en.pdf/Charter_en.pdf. Accessed 17 Oct 26. Internet Memory Foundation: Web archiving in Europe. Technical
2003 Report, Internet Memory Foundation (2010)
5. UNESCO: Universal declaration on archives. In: Adopted at the 27. Niu, J.: Functionalities of web archives. D-Lib Mag. 18(3/4) (2012)
ICA Annual General Meeting in Malta (2010). http://www.ica.org/ 28. Ras, M., van Bussel, S.: Web archiving user survey. Technical
6573/reference-documents/universal-declaration-on-archives. Report, National Library of the Netherlands (Koninklijke Biblio-
html. Accessed 17 Sept 2010 theek) (2007)
6. Kitsuregawa, M., Tamura, T., Toyoda, M., Kaji, N.: Socio-sense: a 29. Costa, M., Silva, M.J.: Characterizing search behavior in web
system for analysing the societal behavior from long term web archives. In: Proc. of the 1st International Temporal Web Analytics
archive. In: Proc. of the 10th Asia-Pacific Web Conference on Workshop, pp. 33–40 (2011)
Progress in WWW Research and Development, pp. 1–8 (2008) 30. Costa, M., Silva, M.J.: Evaluating web archive search systems. In:
7. Arms, W.Y., Aya, S., Dmitriev, P., Kot, B., Mitchell, R., Walle, L.: Proc. of the 13th International Conference on Web Information
A research library based on the historical collections of the Internet Systems Engineering, pp. 440–454 (2012)
Archive. D-Lib Mag. 12(2) (2006) 31. Thomas, A., Meyer, E.T., Dougherty, M., Van den Heuvel, C.,
8. Arms, W., Huttenlocher, D., Kleinberg, J., Macy, M., Strang, D.: Madsen, C., Wyatt, S.: Researcher engagement with web archives:
From Wayback Machine to Yesternet: new opportunities for social challenges and opportunities for investment. Technical Report,
science. In: Proc. of the 2nd International Conference on e-Social Joint Information Systems Committee (JISC) (2010)
Science (2006) 32. Spaniol, M., Masanès, J., Baeza-Yates, R.: The 5th temporal web
9. Ackland, R.: Virtual observatory for the study of online networks analytics workshop (tempweb’15). In: Proc. of the Companion
(VOSON)—progress and plans. In: Proc. of the 1st International Publication of the 24th International Conference on World Wide
Conference on e-Social Science (2005) Web, pp. 863–864 (2015)
10. Foot, K., Schneider, S.: Web Campaigning. The MIT Press, Cam- 33. Spaniol, M., Masanès, J., Baeza-Yates, R.: The 4th temporal web
bridge (2006) analytics workshop (tempweb’14). In: Proc. of the Companion
11. Franklin, M.: Postcolonial Politics, the Internet, and Everyday Life: Publication of the 23rd International Conference on World Wide
Pacific Traversals Online. Routledge (2004) Web, pp. 863–864 (2014)
12. Gomes, D., Costa, M.: The importance of web archives for human- 34. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the
ities. Int. J. Humanit. Arts Comput. 8(1), 106–123 (2014) dynamics of the news cycle. In: Proc. of the 15th ACM SIGKDD
13. Yamamoto, Y., Tezuka, T., Jatowt, A., Tanaka, K.: Honto? Search: International Conference on Knowledge Discovery and Data Min-
estimating trustworthiness of web information by search results ing, pp. 497–506 (2009)
aggregation and temporal analysis. In: Advances in Data and Web 35. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: a
Management, pp. 253–264 (2007) spatially and temporally enhanced knowledge base from wikipedia.
14. Chung, Y., Toyoda, M., Kitsuregawa, M.: A study of link farm Artif. Intell. 194, 28–61 (2013)
distribution and evolution using a time series of web snapshots. In: 36. Matthews, M., Tolchinsky, P., Blanco, R., Atserias, J., Mika, P.,
Proc. of the 5th International Workshop on Adversarial Information Zaragoza, H.: Searching through time in the New York Times. In:
Retrieval on the Web, pp. 9–16 (2009) Proc. of the 4th Workshop on Human–Computer Interaction and
15. Elsas, J., Dumais, S.: Leveraging temporal dynamics of document Information Retrieval, pp. 41–44 (2010)
content in relevance ranking. In: Proc. of the 3rd ACM International 37. Adar, E., Dontcheva, M., Fogarty, J., Weld, D.S.: Zoetrope: inter-
Conference on Web Search and Data Mining, pp. 1–10 (2010) acting with the ephemeral web. In: Proc. of the 21st Annual ACM
16. Radinsky, K., Horvitz, E.: Mining the web to predict future events. Symposium on User Interface Software and Technology, pp. 239–
In: Proc. of the 6th ACM International Conference on Web Search 248 (2008)
and Data Mining, pp. 255–264 (2013) 38. Teevan, J., Dumais, S., Liebling, D., Hughes, R.: Changing how
17. Gomes, D., Miranda, J., Costa, M.: A survey on web archiving people view changes on the web. In: Proc. of the 22nd Annual
initiatives. In: Proc. of the International Conference on Theory and ACM Symposium on User Interface Software and Technology, pp.
Practice of Digital Libraries, pp. 408–420 (2011) 237–246 (2009)
18. Costa, M., Couto, F.M., Silva, M.J.: Learning temporal-dependent 39. Masanès, J.: LiWA news #3: living web archives (2011). http://
ranking models. In: Proc. of the 37th Annual ACM SIGIR Confer- liwa-project.eu/images/videos/Liwa_Newsletter-3.pdf. Accessed
ence (2014) March 2011
19. Masanès, J.: Web Archiving. Springer, New York (2006) 40. Weikum, G., Ntarmos, N., Spaniol, M., Triantafillou, P., Benczur,
20. Kahle, B.: Wayback machine: now with 240,000,000,000 (2013). A.A., Kirkpatrick, S., Rigaux, P., Williamson, M.: Longitudinal
http://blog.archive.org/2013/01/09/updated-wayback/. Accessed analytics on web archive data: it’s about time! In: Proc. of the 5th
30 Apr 2016 Conference on Innovative Data Systems Research, pp. 199–202
(2011)
123
The evolution of web archiving
41. Huurdeman, H.C., Ben-David, A., Sammar, T.: Sprint methods for 47. Ainsworth, S.G., Alsum, A., SalahEldeen, H., Weigle, M.C., Nel-
web archive research. In: Proc. of the 5th Annual ACM Web Sci- son, M.L.: How much of the web is archived? In: Proc. of the
ence Conference, pp. 182–190 (2013) 11th Annual International ACM/IEEE joint Conference on Digital
42. Risse, T., Peters, W.: ARCOMEM: from collect-all ARchives to Libraries, pp. 133–136 (2011)
COmmunity MEMories. In: Proc. of the 21st International Confer- 48. AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.:
ence Companion on World Wide Web, pp. 275–278 (2012) Profiling web archive coverage for top-level domain and content
43. Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., language. Int. J. Digit. Libr. 14(3–4), 149–166 (2014)
Ainsworth, S., Shankar, H.: Memento: time travel for the web. 49. ISO 28500:2009: Information and documentation—WARC
CoRR (2009). arXiv:0911.1112 file format (2009). http://www.iso.org/iso/catalogue_detail.htm?
44. Burner, M., Kahle, B.: Arc file format (1996). http://www.archive. csnumber=44717. Accessed 30 Apr 2016
org/web/researcher/ArcFileFormat.php. Accessed Sept 1996 50. IIPC: Internet Archive ARC access tools (2009). http://
45. NDSA Content Working Group: Web archiving survey report. archive-access.sourceforge.net/. Accessed 30 Apr 2016
Technical Report, National Digital Stewardship Alliance (2012)
46. Bailey, J., Grotke, A., Hanna, K., Hartman, C., McCain, E., Moffatt,
C., Taylor, N.: Web archiving in the United States: a 2013 survey.
Technical Report, National Digital Stewardship Alliance (2014)
123