Lecture 5
(ECC 4209)
(Archive Management)
sjh@upm.edu.my
Contents
1. Why, what, and where to archive
2. Archiving files and file systems using tar
3. Searching for system files
4. Securing files with object permissions and ownership
5. Archiving entire partitions with dd
6. Synchronizing remote archives with rsync
Why Archive?
• An archive is a single file containing a collection of objects: files, directories, or a combination of both.
• Bundling objects within a single file (Figure 4.1) makes it
easier to move, share, or store multiple objects that might
otherwise be unwieldy and disorganized.
Figure 4.1: Files and directories can be bundled into an archive file and saved to the file system
Types of Archives
• There are several types of archives and archiving tools
– tar creates copies of directories and their contents so you can easily share or back them up
– dd creates an exact copy of a partition or even an entire hard disk
– rsync provides an ongoing solution for regular system backups
Compression
• Do not confuse archiving with compression
• A compressor is a software tool that applies a special algorithm to a file or archive to reduce the disk space it occupies (Figure 4.2)
• Files are unreadable while compressed, and the algorithm can also be applied in reverse to decompress them
• Applying compression to a tar archive is useful when transferring large archives over a network, because compression can reduce transmission times significantly
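For instance, a compressed tar archive can be produced in two steps or in one (the file and directory names here are illustrative):

```shell
# Sample data to work with (hypothetical directory name)
mkdir -p project-docs && echo "sample text" > project-docs/notes.txt

# Two steps: create an uncompressed archive, then compress it with gzip
tar cvf project.tar project-docs/
gzip project.tar                    # replaces project.tar with project.tar.gz

# One step: tar's z flag applies gzip compression while archiving
tar czvf project-onestep.tar.gz project-docs/
```

Either way, the result is a .tar.gz file that is both bundled and compressed.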
Compression (continued)
Figure 4.2: The compressed version of the original data occupies less space
Data Backups
• Reasons for backup:
– Hardware can - and will - fail
– Typing errors can mangle configuration files, locking anyone out of their encrypted system
– Data insecurely stored on cloud infrastructure providers like AWS can be suddenly and unpredictably lost
– You could become the victim of a ransomware attack that encrypts or disables all your files unless you pay a large ransom
• Please note that untested data backups may not work
– there could be flaws on backup device
– the archive file might become corrupted
– the initial backup itself might have been unable to properly process all
of your files
• Generating and monitoring log messages can help you spot problems, but the only way to be truly confident about a backup is to run a trial restore onto matching hardware.
What to archive?
• You can use scp to securely transfer files to remote places
• But backing up many files spread across multiple directories (e.g. a complicated project with source code) or even entire partitions (e.g. an OS) requires better tools
• The df command provides information about partitions and file systems
– Adding the -h flag converts partition sizes to human-readable formats like GB or MB, rather than bytes
• Running df on a container file system:
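For example (the exact output will vary by system):

```shell
# Report disk usage for all mounted file systems, human-readable sizes
df -h

# Limit the report to the file system holding a particular path
df -h /
```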
What to archive? (continued)
• The first partition listed is designated as /dev/sda2, which
means that it’s the second partition on Storage Device A and
that it’s represented as a system resource through the
pseudo file system directory, /dev/
• This happens to be the primary OS partition in this case and
all devices associated with a system will be represented by a
file in the /dev/ directory.
• The partition used by your accounting software would appear
somewhere on this list, perhaps designated using something
like /dev/sdb1
• It’s important to distinguish between real and pseudo file systems
– A pseudo file system’s files aren’t actually saved to disk; they live in volatile memory and disappear when the machine shuts down
– If the file system designation is tmpfs and the number of bytes reported in the Used column is 0, it’s temporary
What to archive? (continued)
• df run on a physical computer instead of a container:
Where to back up?
• From an OS perspective, backups can be written almost anywhere:
– legacy tape drives
– USB-mounted SATA storage drives
– network-attached storage (NAS)
– storage area networks (SAN)
– cloud storage solutions (AWS S3 or Cloudflare R2)
Where to back up?
• Be sure to carefully follow best practices for any backups:
– Reliable - Use only storage media that are reasonably likely to retain
their integrity for the length of time you intend to use them
– Tested - Test restoring as many archive runs as possible in simulated
production environments
– Rotated - Maintain at least a few historical archives older than the
current backup in case the latest one should somehow fail
– Distributed - Make sure that at least some of your archives are stored
in a physically remote location in case of fire or other disaster
– Secure - Never expose your data to insecure networks or storage sites
at any time during the process.
– Compliant - Honor all relevant regulatory and industry standards at all
times
– Up to date - What’s the point of keeping archives that are weeks or months behind the current live version?
– Scripted - Never rely on a human being to remember to perform an ongoing task; automate it
Archiving files and file systems
using tar
• To successfully create your archive, there are three things that
will have to happen:
– (1) Find and identify the files that need to be included
– (2) Identify the location on a storage drive that your archive will use
– (3) Add your files to an archive, and save it to its storage location
• Want to knock off all three steps in one go? Use tar.
Simple archive and compression
examples
• This example copies all the files and directories within and below the current working directory and builds an archive file called archivename.tar
– c tells tar to create a new archive
– v sets the screen output to verbose so you can get updates
– f points to the filename you’d like the archive to get
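Putting those flags together, the command might look like this (the archive and directory names are illustrative):

```shell
# Sample content to archive (hypothetical directory name)
mkdir -p myproject && echo "data" > myproject/readme.txt

# c = create a new archive, v = verbose, f = the archive's filename
tar cvf archivename.tar myproject/
```

Listing the archive's contents afterward with tar tvf archivename.tar is a quick sanity check.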
Figure 4.3: An archive is a file that can be copied or moved using normal Bash tools - created with tar, then transferred to a remote server for storage using SSH
Streaming file system archives
(continued)
• Rather than entering the archive name right after the command arguments, use a dash (czvf -)
• The dash sends the archive to standard output; pushing the archive filename details to the end of the command tells tar to expect the source content for the archive there instead
• The unnamed, compressed archive is then piped (|) to an SSH login on a remote server (which prompts for a password)
• The command enclosed in quotation marks executes cat against the archive data stream, writing the stream contents to a file called myfiles.tar.gz in the home directory on the remote host
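Sketched out (the user and host names are hypothetical), the streamed transfer looks like the commented lines below; the final line demonstrates the same stdout mechanism through a local pipe, so it can be tried without a remote server:

```shell
# Sample source directory (hypothetical name)
mkdir -p importantstuff && echo "keep me" > importantstuff/notes.txt

# The dash after czvf streams the compressed archive to standard output;
# piped into ssh, the remote cat writes the stream to a file:
#   tar czvf - importantstuff/ | ssh username@10.0.3.141 "cat > myfiles.tar.gz"

# The same mechanism, demonstrated with a local pipe instead of ssh:
tar czvf - importantstuff/ | cat > myfiles.tar.gz
```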
Streaming file system archives
(continued)
Figure 4.4: Streaming an archive as it’s created avoids the need to first save it to a local drive
Aggregating files with find
• The find command searches through a file system looking for
objects that match rules you provide
– The search outputs the names and locations of the files to stdout
– output can just as easily be redirected to another command like tar,
which would then copy those files to an archive
• For example, a web server might provide lots of .mp4 video files spread across many directories within the /var/www/html/ tree
– A single command can search the /var/www/html/ hierarchy for files with names that include the file extension .mp4
– When a file is found, tar will be executed with the argument -r to append (as opposed to overwrite) the video file to a file called videos.tar
– find /var/www/html -name "*.mp4" -exec tar -rvf videos.tar {} \;
Aggregating files with find
(continued)
• locate is a faster alternative to find
– locate searches the entire system for files matching the specified string
– this example looks for files whose names end with the string video.mp4:
– $ locate *video.mp4
• locate will almost always return results far faster because
locate isn’t actually searching the file system, but simply
running the search string against entries in a preexisting index
• The catch is that if the index is allowed to fall out of date, the
searches become less and less accurate
• Normally the index is updated every time the system boots, but
you can also manually do the job by running updatedb:
– # updatedb
Preserving permissions and
ownership and extracting archives
• Have to make sure that archive operations don’t corrupt file
permissions and file-ownership attributes
• As you’ve seen, running ls -l lists the contents of a directory in long form, showing each file’s name, size, and modification date
• But it also repeats a name (root, in this example) and provides
some rather cryptic strings made up of the letters r, w, and x:
Permissions
• Consider the two leftmost sections of the listing (Figure 4.5)
– The 10 characters on the left are made up of four separate sections
– The first dash (1 in the figure) means that the object being listed is a file; it would be replaced with a d if it were a directory
– The next three characters (2) are a representation of the file’s permissions as they apply to its owner
– The next three (3) are the permissions as they apply to its group
– The final three (4) represent the permissions all other users have over the file
• In this example, the file owner has full authority - including read (r), write (w), and execute (x) rights
• Members of the group and those in others can read and execute, but not write
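A quick sketch of how those r, w, and x positions respond to chmod (the file name is hypothetical):

```shell
# Create a file and give it explicit permissions with octal notation:
# owner = rwx (7), group = r-x (5), others = none (0)
touch report.txt
chmod 750 report.txt
ls -l report.txt      # the permissions string reads -rwxr-x---

# The symbolic form targets one class at a time:
# add write permission for the group
chmod g+w report.txt
ls -l report.txt      # now -rwxrwx---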
Figure 4.5: The four numbered sections (1-4) of the permissions string in a long file listing
Ownership
• Suppose a file is too large to email or contains sensitive data that shouldn’t be emailed
• To copy it (locally, or over the network with scp) into another user’s directory, you need sudo, which means the copy’s owner will be root
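The repair is a chown after the copy. Since assigning files to another user requires root privileges, the sketch below (the user name otheruser is hypothetical) shows the pattern in comments, then demonstrates chown's syntax harmlessly by reassigning a file to the user and group we already are:

```shell
# Pattern: sudo cp leaves root as the owner; chown hands it back
#   sudo cp datafile.txt /home/otheruser/
#   sudo chown otheruser:otheruser /home/otheruser/datafile.txt

# chown syntax demonstrated on a local file, using our own user and group
touch datafile.txt
chown "$(id -un)":"$(id -gn)" datafile.txt
ls -l datafile.txt
```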
dd operations
• Suppose want to create an exact image of an entire disk of
data that’s been designated as /dev/sda
• You’ve plugged in an empty drive (ideally having the same capacity as your /dev/sda drive), which shows up as /dev/sdb
• The syntax is simple: if= defines the source drive, and of=
defines the file or location where you want your data saved:
– # dd if=/dev/sda of=/dev/sdb
• The next example will create a .img archive of the /dev/sda
drive and save it to the home directory:
– # dd if=/dev/sda of=/home/username/sdadisk.img
• Those commands create images of entire drives; dd can also focus on a single partition of a drive
dd operations
• The next example does that, and also uses the bs flag to set the number of bytes to copy at a single time (4,096, in this case)
• Playing with the bs value can have an impact on the overall speed of a dd operation, although the ideal setting will depend on hardware and other considerations:
– # dd if=/dev/sda2 of=/home/username/partition2.img bs=4096
• Restoring is simple: you effectively reverse the values of if and of; here, if= takes the image that you want to restore, and of= takes the target drive to which you want to write the image:
– # dd if=sdadisk.img of=/dev/sdb
• Always test your archives to confirm they’re working
– If it’s a boot drive, stick it into a computer and see if it launches as expected
– If it’s a normal data partition, mount it to make sure the files both exist and are appropriately accessible
Wiping disks with dd
• Given enough time and motivation, nearly anything can be retrieved from virtually any digital media, with the possible exception of drives that have been thoroughly hammered
• Use dd to make it a whole lot more difficult for the bad guys
to get at your old data
• This command will spend some time writing millions of zeros over every nook and cranny of the /dev/sda1 partition:
– # dd if=/dev/zero of=/dev/sda1
• Using the /dev/urandom file as your source, you can write over a disk with random characters:
– # dd if=/dev/urandom of=/dev/sda1
Synchronizing archives with rsync
• For backups to be effective, they absolutely have to happen regularly
• One problem is that daily transfers of huge archives can place a lot of strain on your network resources
• Wouldn’t it be nice if you only had to transfer the small handful of files that had been created or updated since the last time, rather than the whole file system?
– rsync is the solution
• How do you create a remote copy of a directory full of files and maintain the accuracy of the copy even after the local files change?
• To illustrate this happening between a local machine and a remote server (perhaps an LXC container you’ve got running), create a directory and populate it with a handful of empty files:
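The setup might look like this (the directory name mynewdir is illustrative):

```shell
# Create a working directory and fill it with ten empty files
mkdir -p mynewdir
touch mynewdir/file{1..10}    # bash brace expansion: file1 ... file10
ls mynewdir
```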
Synchronizing archives with rsync
• Use ssh to create a new directory on the remote server where the copied files will go, and then run rsync with the -av arguments
• The v tells rsync to display a verbose list of everything it does
• The a is a bit more complicated, but also a lot more important
• Specifying the -a super-argument makes rsync synchronize recursively:
– subdirectories and their contents will also be included
– special files, modification times, and ownership and permissions attributes will be preserved
Synchronizing archives with rsync
• If everything went fine, the remote server’s /syncdirectory should now contain 10 empty files
• To give rsync a proper test run, you could add a new file to the local mynewdir directory and use nano to, say, add a few words to one of the existing files
• Then run the exact same rsync command as before. When it’s done, see if the new file and the updated version of the old one have made it to the remote server:
Planning Considerations
• Careful consideration will go a long way toward determining how much money and effort to invest in backups
• The more valuable the data, the more reliable the backups should be
• The goal is to measure the value of your data against these questions:
– How often should you create new archives, and how long will you retain
old copies?
– How many layers of validation will you build into your backup process?
– How many concurrent copies of your data will you maintain?
– How important is maintaining geographically remote archives?
• Another equally important question: should you use incremental or differential backups?
Planning considerations (continued)
• Using a differential system, you might run a full backup once a
week (Monday), and smaller and quicker differential backups
on each of the next six days.
– The Tuesday backup will include only files changed since Monday’s
backup
– The Wednesday, Thursday, and Friday backups will each include all files
changed since Monday
– Friday’s backup will, obviously, take up more time and space than
Tuesday’s
• On the plus side, restoring a differential archive requires only
the last full backup and the most recent differential backup
Planning considerations (continued)
• An incremental system might likewise perform a full backup only on Monday and then run a backup covering only changed files on Tuesday
• Wednesday’s backup, unlike the differential approach, will
include only files added or changed since Tuesday, and
Thursday’s will have only those changed since Wednesday
• Incremental backups will be fast and efficient; but, as the
updated data is spread across more files, restoring incremental
archives can be time-consuming and complicated
• This is illustrated in Figure 4.6.
Figure 4.6: The differences between incremental and differential backup systems over a week (Sunday through Saturday). A full recovery from incremental backups requires the most recent full backup and all subsequent incremental backups; a full recovery from differential backups requires only the most recent full backup and the most recent differential backup.
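One way to implement an incremental scheme is GNU tar's --listed-incremental snapshot file (a sketch; the paths and file names are illustrative):

```shell
# Sample data
mkdir -p docs && echo "v1" > docs/monday.txt

# Monday's full backup: tar records file metadata in the snapshot file
tar --listed-incremental=backup.snar -czf full-monday.tar.gz docs/

# Tuesday: only objects added or changed since the snapshot are archived
echo "v1" > docs/tuesday.txt
tar --listed-incremental=backup.snar -czf incr-tuesday.tar.gz docs/
```

Because each run updates backup.snar, every subsequent archive is incremental relative to the previous run; copying the snapshot aside after the full backup and reusing that copy each day would give differential behavior instead.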
Summary
• Not having good backups can ruin your morning.
• The tar command is generally used for archiving full or partial
file systems, whereas dd is more suited for imaging partitions.
• Adding compression to an archive not only saves space on
storage drives, but also bandwidth during a network transfer.
• Directories containing pseudo file systems don’t need backup.
• You can incorporate the transfer of an archive into the command
that generates it, optionally avoiding any need to save the
archive locally.
• It’s possible - and preferred - to preserve the ownership and
permissions attributes of objects restored from an archive.
• You can use dd to (fairly) securely wipe old disks.
• You can incrementally synchronize archives using rsync, greatly reducing the time and network resources needed for ongoing backups.
Key Terms
• An archive is a specially formatted file in which file system
objects are bundled.
• Compression is a process for reducing the disk space used by a
file through the application of a compression algorithm.
• An image is an archive containing the files and directory
structure necessary to re-create a source file system in a new
location.
• Permissions are the attributes assigned to an object that
determine who may use it and how.
• Ownership is the owner and group that have authority over an
object.
• A group is an account used to manage permissions for multiple
users.
Command-line Review
• df -h displays all currently active partitions with sizes shown in a human
readable format.
• tar czvf archivename.tar.gz /home/myuser/Videos/*.mp4 creates a
compressed archive from video files in a specified directory tree.
• split -b 1G archivename.tar.gz archivename.tar.gz.part
splits a large file into smaller files of a set maximum size.
• find /var/www/ -iname "*.mp4" -exec tar -rvf videos.tar
{} \; finds files meeting a set criteria and streams their names to tar to
include in an archive.
• chmod o-r /bin/zcat removes read permissions for others.
• dd if=/dev/sda2 of=/home/username/partition2.img creates an image of the sda2 partition and saves it to your home directory.
• dd if=/dev/urandom of=/dev/sda1 overwrites a partition with random
characters to obscure the old data.
References
• Linux in Action, David Clinton:
– https://www.manning.com/books/linux-in-action
• Learning Modern Linux, Michael Hausenblas:
– https://www.oreilly.com/library/view/learning-modern-linux/9781098108939/
• Linux Administration Best Practices, Scott Alan Miller:
– https://www.packtpub.com/product/linux-administration-best-practices/9781800568792
• Tarsnap Mastery: Online Backups for the Truly Paranoid,
Michael Lucas
– https://mwl.io/nonfiction/tools#tarsnap
• Linux Cookbook: Essential Skills for Linux Users and System &
Network Administrators (2nd Edition)
– https://www.oreilly.com/library/view/linux-cookbook-2nd/9781492087151/
References (continued)
• Amazon S3: Object storage built to retrieve any amount of data
from anywhere
– https://aws.amazon.com/s3/
• Announcing Cloudflare R2 Storage: Rapid and Reliable Object
Storage, minus the egress fees
– https://blog.cloudflare.com/introducing-r2-object-storage/
• The rsync algorithm, TR-CS-96-05, Andrew Tridgell and Paul Mackerras:
– https://openresearch-repository.anu.edu.au/handle/1885/40765
• Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression, Dan Teodosiu et al.:
– https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2006-157.pdf