Author(s):           Adam Zajac
Contributor(s):
Last Revised:        2008.12.12
This work licensed under Creative Commons Attribution-Share Alike 3.0
Unported License.
Working with Tar Files in Python
1. Introduction
   1.a Background Reading
2. Tutorial
   2.a Adding Files
   2.b File Information
   2.c Extracting Files
3. Examples
   3.a Archiving Select Files from a Directory
4. Extending
   4.a Removing Files
1. Introduction
       "Tar" is an archiving format that has become rather popular in the open
  source world. In essence, it takes several files and bundles them into one
  file. Originally, the tar format was made for tape archives, hence the name;
  today it is often used for distributing source code or for making backups of
  data. Most Linux distributions have tools in the standard installation for
  creating and unpacking tar files.
       Python's standard library comes with a module which makes creating and
  extracting tar files very simple. Examples of when individuals might want such
  functionality include programming a custom backup script or a script to create
  a snapshot of other personal projects.
Background Reading
       There is significant documentation of both tar files and Python's tarfile
  module. In addition to this document, the following resources are recommended
  reading:
  Wikipedia: tar file
  Python Library Reference 12.5: tarfile
2. Tutorial
       This is a basic tutorial designed to teach three things: how to add files
  to an archive, how to retrieve information on files in the archive, and how to
 extract files from the archive.
Adding Files
      To begin, import the tarfile module. Then, create what is called a
 "TarFile Object". This is an object with special functions for interacting with
 the tar file. In this case, we are opening the file "archive.tar.gz". Note that
 the mode is "w:gz", which opens the file for writing and with gzip compression.
 As usual, "w" does not preserve previous contents of the file. If the tarfile
 already exists, use "a" to append files to the end of the archive (n.b.: you
 cannot use append with a compressed archive - there is no such mode as "a:gz").
   Create a TarFile Object
   >>> import tarfile
   >>> tar = tarfile.open("archive.tar.gz", "w:gz")
   >>> tar
   <tarfile.TarFile object at 0x2af77c060990>
      Adding files to the archive is very simple. If you want the file to have a
 different name in the archive, use the arcname option.
   Adding a File to the Archive
   >>> tar.add("file.txt")
   >>> tar.add("file.txt", arcname="new.txt")
      Adding directories works in the same way. Note that by default a directory
 will be added recursively: every file and folder under it will be included.
 This behavior can be changed by setting recursive to False.
   Adding a Directory to the Archive
   >>> tar.add("docs/")
   >>> tar.add("financial/", recursive=False)
      As with normal file objects, always be sure to close a TarFile Object.
   Close the TarFile Object
   >>> tar.close()
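      The steps above can also be written with a "with" statement, which closes
 the TarFile Object automatically even if an error occurs. The following is a
 minimal sketch, not the only way to do it: the function names build_archive
 and append_to_archive are hypothetical, and an uncompressed archive is assumed
 so that the "a" mode can be used.

```python
import os
import tarfile


def build_archive(archive_path, file_paths):
    """Create a new uncompressed tar containing the given files."""
    # The "with" block closes the archive even if tar.add() raises.
    with tarfile.open(archive_path, "w") as tar:
        for path in file_paths:
            # Store each file under its base name rather than its full path.
            tar.add(path, arcname=os.path.basename(path))


def append_to_archive(archive_path, file_path):
    """Append one file to an existing uncompressed tar."""
    # "a" only works on uncompressed archives; there is no "a:gz".
    with tarfile.open(archive_path, "a") as tar:
        tar.add(file_path, arcname=os.path.basename(file_path))
```

      For example, build_archive("archive.tar", ["file.txt"]) followed by
 append_to_archive("archive.tar", "notes.txt") should leave both files in the
 archive.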
File Information
       The tarfile module includes the ability to retrieve information about the
 individual contents of a tar file. Each item is accessed as a "TarInfo Object".
 For example, getmembers() will return a list of all TarInfo objects in a tar
 file:
   Listing TarInfo Objects
   >>> import tarfile
   >>> tar = tarfile.open("archive.tar.gz", "r:gz")
   >>> members = tar.getmembers()
   >>> members
   [<TarInfo 'text.txt' at 0x2b0b73e46a90>, <TarInfo 'text2.txt' at
   0x2b0b73e46ad0>]
      Each TarInfo object has several attributes and methods associated with it.
 Some examples are below; a full list can be found in the tarfile section of
 the Python Library Reference.
   TarInfo information
   >>> members[0].name
   'text.txt'
   >>> members[0].isfile()
   True
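      To show how TarInfo objects might be used together, here is a small
 sketch that summarizes every item in an archive. The function name describe is
 hypothetical; the mode "r:*" asks tarfile to detect the compression itself.

```python
import tarfile


def describe(archive_path):
    """Return a (name, kind, size) tuple for every item in the archive."""
    rows = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if member.isdir():
                kind = "dir"
            elif member.isfile():
                kind = "file"
            else:
                kind = "other"  # links, devices and so on
            # member.size is the item's size in bytes.
            rows.append((member.name, kind, member.size))
    return rows
```

      Calling describe("archive.tar.gz") returns a plain list that can be
 printed or filtered like any other list.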
Extracting Files
      Extracting the contents is a very simple process. To extract the entire
 tar file, simply use extractall(). This will extract its contents into the
 current working directory. Optionally, a path may be specified to have the tar
 extract elsewhere.
   Extracting an entire tar file
   >>> import tarfile
   >>> tar = tarfile.open("archive.tar.gz", "r:gz")
   >>> tar.extractall()
   >>> tar.extractall("/tmp/")
      If only specific files need to be extracted, use extract().
   Extracting a single file from a tar file
   >>> import tarfile
   >>> tar = tarfile.open("archive.tar.gz", "r:gz")
   >>> tar.extract("text.txt")
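      When several files are wanted, extractall() also accepts a list of
 TarInfo objects through its members argument. The sketch below extracts only
 the regular files whose names end with a given suffix. The function name
 extract_matching is hypothetical, and the sketch assumes an archive you trust.

```python
import tarfile


def extract_matching(archive_path, suffix, destination="."):
    """Extract only regular files ending in suffix; return their names."""
    with tarfile.open(archive_path, "r:*") as tar:
        wanted = [
            member for member in tar.getmembers()
            if member.isfile() and member.name.endswith(suffix)
        ]
        # Only the selected members are written to the destination.
        tar.extractall(destination, members=wanted)
    return [member.name for member in wanted]
```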
      You should be aware that there is at least one security concern to take
 into account when extracting tar files. Namely, a tar can be designed to
 overwrite files outside of the current working directory (/etc/passwd, for
 example). Never extract a tar as the root user if you do not trust it.
3. Examples
Archiving Select Files from a Directory
   archiver.py
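   import os
   import tarfile

   # Only files with these extensions are archived.
   whitelist = ['.odt', '.pdf']
   contents = os.listdir(os.getcwd())
   tar = tarfile.open('backup.tar.gz', 'w:gz')
   for item in contents:
       # splitext() handles extensions of any length, unlike a fixed-width slice.
       if os.path.splitext(item)[1] in whitelist:
           tar.add(item)
   tar.close()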
4. Extending
Removing Files
      The tarfile module does not contain any function to remove an item from an
 archive. It is presumed that this is because of the nature of tape drives,
 which were not designed to move back and forth (consider this post to the
 Python tutor mailing list). Nevertheless, other programs for creating tar
 archives do have a delete feature.
      The following code uses the popular GNU tar program that comes with most
 Linux distributions. The GNU tar manual documents the "--delete" flag and warns
 against using it on an actual tape drive. The reliance on an external program
 obviously makes the code far less portable, but it is suitable for personal
 scripts.
   Removing an Item from a Tar
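   import subprocess

   def remove(archive, unwanted):
       """Delete one item from an uncompressed tar; return tar's exit status."""
       # "--delete" is a GNU extension, so confirm that GNU tar is installed.
       version = subprocess.run(["tar", "--version"],
                                capture_output=True, text=True).stdout
       if not version.startswith("tar (GNU tar)"):
           raise Exception("err: need GNU tar to delete individual files.")
       # Passing a list avoids shell quoting problems with unusual file names.
       command = ["tar", "--delete", "--file=" + archive, unwanted]
       return subprocess.run(command).returncode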
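      Where GNU tar is not available, a similar result can be had with the
 tarfile module alone by copying every item except the unwanted one into a new
 archive and then replacing the original. This is only a sketch under stated
 assumptions: remove_member is a hypothetical name, the archive is assumed to be
 uncompressed, and only the item whose name matches exactly is dropped (the
 contents of a directory are not removed along with it).

```python
import os
import tarfile
import tempfile


def remove_member(archive_path, unwanted):
    """Rewrite an uncompressed tar without `unwanted`; return True if found."""
    directory = os.path.dirname(os.path.abspath(archive_path))
    # Build the copy next to the original so os.replace() stays on one
    # filesystem.
    handle, temp_path = tempfile.mkstemp(suffix=".tar", dir=directory)
    os.close(handle)
    found = False
    try:
        with tarfile.open(archive_path, "r:") as source, \
                tarfile.open(temp_path, "w") as target:
            for member in source:
                if member.name == unwanted:
                    found = True
                    continue
                # Only regular files carry data; directories and links do not.
                data = source.extractfile(member) if member.isfile() else None
                target.addfile(member, data)
        if found:
            # Swap the rewritten archive into place.
            os.replace(temp_path, archive_path)
    finally:
        # Clean up the temporary file if it was not moved into place.
        if os.path.exists(temp_path):
            os.remove(temp_path)
    return found
```

      Because the whole archive is rewritten, this is slower than "--delete"
 on large archives, but it does not depend on an external program.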