Bioinformatics
A Theoretical and Practical Approach
Edited by
Stephen A. Krawetz
David D. Womble
Includes
CD-ROM
GCG File Management
Sittichoke Saisanit
Introduction
Users are most likely to encounter the Wisconsin Package (GCG) through its web
interface, SeqWeb. As the name implies, SeqWeb is a web interface product that
allows access to many programs in the GCG package. However, there are still a
number of advantages to using GCG from the UNIX command line. First, the
command-line interface is more amenable to batch processing of large datasets.
Second, it allows access to all GCG programs, not just the subset exposed through
the web interface. This chapter presents the use of GCG from the UNIX command line.
List File
When working with a family of gene or protein sequences, there is often a need to
manage a number of sequences simultaneously. For this, GCG provides a powerful
feature called a list file. A list file is simply a text file in which two dots (..)
are followed by a list of individual sequences, one per line. The GCG programs
ignore any text before the two dots and any text after an exclamation mark (!), so
comments or descriptions of sequences can be added. An example of a list file:
Sequences of mammalian EGF receptors and related family members.
..
sw:EGFR_HUMAN
sw:EGFR_MOUSE
/usr/home/newdata/myseq.pep ! unpublished EGFR-related sequence
As shown in this example, a list file can contain either database sequences or
local user sequences or both. A list file is accessed by preceding the file name with
the @ symbol. For example, to retrieve all sequences in the list file named egfr.list
to the current working directory, use the GCG Fetch command:
% fetch @egfr.list
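The parsing rules described above (ignore everything before the two dots, ignore everything after an exclamation mark) can be illustrated with a minimal Python sketch. This is not GCG code: the function name parse_list_file is hypothetical, and the sketch assumes one sequence specification per line, ignoring any further list-file features that GCG may support.

```python
def parse_list_file(text):
    """Return the sequence specifications named in a GCG-style list file.

    Illustrative only: assumes one specification per line after the "..".
    """
    # Text up to and including the first ".." is free-form description.
    _, divider, body = text.partition("..")
    if not divider:
        return []  # no ".." found, so nothing counts as a sequence entry
    entries = []
    for line in body.splitlines():
        # Anything after "!" on a line is a comment.
        spec = line.split("!", 1)[0].strip()
        if spec:
            entries.append(spec)
    return entries


example = """Sequences of mammalian EGF receptors and related family members.
..
sw:EGFR_HUMAN
sw:EGFR_MOUSE
/usr/home/newdata/myseq.pep ! unpublished EGFR-related sequence
"""

for spec in parse_list_file(example):
    print(spec)
```

Run on the example list file, the sketch yields the two database entries and the local file path, with the description line and the trailing comment dropped.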
In addition to making a list file, multiple sequences can be aligned and written to a
single Multiple Sequence Format (MSF) file. Several GCG programs can output files
in an MSF format. For example, the reformat program with the -MSF parameter can
be used to convert a group of sequences from a list file into an MSF formatted file.
Example: % reformat -msf @egfr.list
However, reformat does not align the input sequences. The file resulting from
reformat can be named egfr.msf, for example. This MSF file can then be used as input
for other GCG programs. One, several, or all of the sequences in an MSF file can be
used. To specify a single sequence from an MSF file, type the MSF file name followed
by the sequence name in curly brackets, for example egfr.msf{egfr1}. To specify
multiple sequences, use the asterisk wildcard character. For example,
egfr.msf{egfr*} specifies the sequences in the egfr.msf file whose names begin
with egfr. Similarly, egfr.msf{*} indicates that all sequences in the MSF file will
be used. Note that a plain file name is not sufficient to specify sequences from
an MSF file: either a sequence name or a wildcard in curly brackets must follow
the file name.
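The file{name} notation can be mimicked in a few lines of Python to show which sequences a given specification would select. This is a sketch under stated assumptions, not a description of how GCG resolves names: the function select_sequences and the sequence names egfr1, egfr2, and erbb2 are hypothetical, and matching is assumed to be case-sensitive.

```python
import re
from fnmatch import fnmatchcase

# file name, then a name or wildcard pattern in curly brackets
SPEC = re.compile(r"^(?P<file>[^{}]+)\{(?P<pattern>[^{}]+)\}$")


def select_sequences(spec, names_in_file):
    """Split 'file{pattern}' and return (file, matching sequence names)."""
    match = SPEC.match(spec)
    if match is None:
        # Mirrors the rule that a plain file name is not a valid specification.
        raise ValueError("expected file{name-or-wildcard}, got %r" % spec)
    pattern = match.group("pattern")
    selected = [name for name in names_in_file if fnmatchcase(name, pattern)]
    return match.group("file"), selected


names = ["egfr1", "egfr2", "erbb2"]  # hypothetical contents of egfr.msf
print(select_sequences("egfr.msf{egfr1}", names))
print(select_sequences("egfr.msf{egfr*}", names))
print(select_sequences("egfr.msf{*}", names))
```

Note that fnmatchcase also understands ? and [...] patterns; only the asterisk is described in this chapter.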
SeqLab, the X Windows interface for GCG, can also output MSF files from a list of
sequences. GCG command-line programs that can output MSF files are listed below.
Programs that require the -MSF parameter are marked accordingly.
• LineUp -MSF
• PileUp
• PrettyBox
• ProfileGap -MSF
• ProfileSegments -MSF
• Reformat -MSF
Below are two examples of using MSF files: one with a program that does not require
the -MSF option (PileUp) and one with a program that does (LineUp).
Example: % pileup egfr.msf{*}
% lineup -msf egfr.msf
Graphic Files
Several GCG programs have an option to generate output in a graphic format. In
order to use the graphic feature, a graphical language and a graphic device must be
defined. The command ShowPlot displays the current graphic device, while the
command SetPlot changes it. After setting the graphic device, the command PlotTest
can generate a test graphic output. It is a quick and easy way to determine whether
the device is properly configured.
Graphic files require specific applications in order to be displayed correctly. They
cannot be displayed from the command-line interface like plain text files. Files
with the .figure extension are generally graphic output from GCG programs.
Graphics can be displayed directly on the screen if an appropriate device is selected.
For example, on an X Windows terminal, ColorX can be used. ColorX is a graphic
language and a device for the X Windows environment.
File management in GCG requires knowledge of the operating system on which
GCG runs, most likely one of the many flavors of UNIX. Common sense should be
applied to maintain naming consistency and to facilitate file organization. This is
helped by the various file utilities for creating sequence files and converting them
into proper formats. Learning how to manage and use graphic files will help in
visualizing the output from many GCG programs.
Appendices
1. CD Contents
2. A Collection of Useful Bioinformatic Tools and Molecular Tables
3. Simple UNIX Commands
Appendix 1
CD Contents
BioDiscovery
This folder contains software packages for microarray analysis that may be installed
on IBM-PC computers. Installation instructions are included in the file named
Readme.pdf. You will need to use the Acrobat Reader utility to read the file (see
Section “Adobe Acrobat Reader”). The BioDiscovery software was kindly provided
by Sorin Draghici, author of Chapter 35.
ClustalX
This folder contains the graphical interface versions of the Clustal multiple
sequence alignment program. Versions for both IBM-PC (clustalx1.81.msw.zip) and
Macintosh (clustalx1.81.PPC.sea.Hqx) are included. The files in the packages will
need to be unpacked with common unzipping utilities. ClustalX versions for various
flavors of UNIX are also available from the original source FTP site (see
Website: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/), as described in the readme file.
Permission to include ClustalX on this CD was kindly provided by Julie Thompson.
The program is described by Steven Thompson in Chapter 31.
Ensembl
This folder contains the files needed to install the Ensembl package on a UNIX
server. Installation instructions are located in the additional docs subfolder in the
file named EnsemblInstall100.pdf. You will need to use the Acrobat Reader utility
to read the file (see Section “Adobe Acrobat Reader”). The source code subfolder
contains the required source code for both Ensembl and Bioperl. Note that the files
in the source code folder are in UNIX format. Please use BINARY FTP mode to
transfer those files to your UNIX server. Up-to-date versions of Ensembl and Bioperl
are available at their respective Websites (see Websites: http://www.ensembl.org/
and http://www.bioperl.org). The Ensembl software was kindly provided by James
Stalker, author of Chapter 25.
MicroAnalyser
This folder contains a software package for microarray analysis that may be installed
on Macintosh computers. Up-to-date versions of the software are available (see Website:
http://imru.bham.ac.uk/MicroAnalyser/). Permission to include the MicroAnalyser
software on this CD was kindly provided by Adrian Platts.
Oligo
This folder contains demo versions of the Oligo primer design and analysis
software for both IBM-PC and Macintosh computers. This software was kindly provided
by Wojciech Rychlik, author of Chapter 21.
Sequencealign
This folder contains a PowerPoint demonstration of sequence alignment. It was
kindly contributed by David S. Wishart, author of Chapter 27.
Singh_perl_scripts
This folder contains Perl scripts for statistical analysis that were generously
contributed by Gautam Singh, author of Chapters 22 and 23. They can be used for
solving the problems described in Chapter 23.
Staden
This folder contains the Staden Sequence Analysis Package and the Gap4 Viewer
software that can be installed on an IBM-PC computer. For up-to-date versions see
Website: http://www.mrc-lmb.cam.ac.uk/pubseq/. This software was kindly provided
by Roger Staden, author of Chapters 20 and 24.
TreeView
This folder contains the TreeView tree drawing software for both IBM-PC
and Macintosh computers. TreeView is a free program for displaying phylogenies.
Up-to-date versions, including UNIX versions, can be found (see Website:
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). Please visit the Website to
register TreeView if you wish to use it. Permission to include TreeView on this CD
was kindly provided by Roderic D. M. Page.
Appendix 2
A Collection of Useful Bioinformatic Tools and Molecular Tables
[Table not recovered; surviving axis labels: 1st Position, 3rd Position]
Symbol  Bases             Name or mnemonic
A       A                 Adenine
C       C                 Cytosine
G       G                 Guanine
T       T                 Thymine (DNA)
U       U                 Uracil (RNA)
Y       C or T(U)         pYrimidine
R       A or G            puRine
M       A or C            aMino
K       G or T(U)         Keto
S       G or C            Strong interaction (3 H bonds)
W       A or T(U)         Weak interaction (2 H bonds)
H       A or C or T(U)    not-G
B       G or T(U) or C    not-A
V       G or C or A       not-T
D       G or A or T(U)    not-C
N       G, A, C or T(U)   aNy base
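One common use of the nucleotide ambiguity codes is searching a sequence for a degenerate motif. The sketch below encodes the table as a Python dictionary (DNA alphabet, so T rather than U) and turns a motif into a regular expression. The function name iupac_to_regex and the motif GAYNR are made up for illustration.

```python
import re

# Ambiguity codes from the nucleotide table, written for DNA (T rather than U).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate a motif written with ambiguity codes into a regex string."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC_DNA[symbol]  # raises KeyError for symbols outside the table
        # Single bases stay literal; ambiguous symbols become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


pattern = iupac_to_regex("GAYNR")            # hypothetical motif
print(pattern)                               # GA[CT][ACGT][AG]
print(bool(re.fullmatch(pattern, "GATCA")))  # True
```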
Three-letter  One-letter  Amino acid
Ala           A           Alanine
Arg           R           Arginine
Asn           N           Asparagine
Asp           D           Aspartic acid
Cys           C           Cysteine
Gln           Q           Glutamine
Glu           E           Glutamic acid
Gly           G           Glycine
His           H           Histidine
Ile           I           Isoleucine
Leu           L           Leucine
Lys           K           Lysine
Met           M           Methionine
Phe           F           Phenylalanine
Pro           P           Proline
Ser           S           Serine
Thr           T           Threonine
Trp           W           Tryptophan
Tyr           Y           Tyrosine
Val           V           Valine
Asx           B           Aspartic acid or Asparagine
Glx           Z           Glutamic acid or Glutamine
Xaa           X           Any amino acid
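The amino acid table translates directly into a lookup for converting between the three-letter and one-letter notations. The following is a minimal sketch. The function name to_one_letter and the hyphen-separated input format are assumptions made for the example.

```python
# Three-letter to one-letter codes, copied from the amino acid table.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Asx": "B", "Glx": "Z", "Xaa": "X",
}


def to_one_letter(peptide):
    """Convert a hyphen-separated three-letter peptide, e.g. 'Met-Ala-Asx'."""
    residues = [r.strip().capitalize() for r in peptide.split("-") if r.strip()]
    # A KeyError here means the residue is not in the table above.
    return "".join(THREE_TO_ONE[r] for r in residues)


print(to_one_letter("Met-Ala-Asx-Xaa"))  # MABX
```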
Organism                   Genome size (Mb)
Human                      3000.0
M. musculus (mouse)        3000.0
Drosophila (fruit fly)     135.6
Arabidopsis (plant)        100.0
C. elegans (round worm)    97.0
S. cerevisiae (yeast)      12.1
E. coli (bacteria)         4.7
H. influenzae (bacteria)   1.8
was calculated given that 10 µg of human genomic DNA contains 2.86 × 10^6 genome copies.
b Average of all values given in each tissue for that species.
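The arithmetic behind the genome-copy footnote can be checked with a short calculation. The sketch below works backward from the stated figure to the DNA mass per genome copy, then scales to other organisms. Two assumptions are not stated in the source: that the numbers in the genome table are sizes in megabases, and that DNA mass per genome copy is proportional to genome size. The function name copies_in_10_ug is hypothetical.

```python
MASS_G = 10e-6    # 10 µg of genomic DNA, expressed in grams
COPIES = 2.86e6   # human genome copies in that mass, as given in the footnote

# Mass of a single human genome copy implied by the footnote, in picograms.
mass_per_copy_pg = MASS_G / COPIES * 1e12
print(round(mass_per_copy_pg, 2))  # 3.5

HUMAN_MB = 3000.0  # human genome size from the table (assumed to be Mb)


def copies_in_10_ug(genome_size_mb):
    """Estimate genome copies in 10 µg of DNA for another genome size.

    Assumes mass per copy scales linearly with genome size.
    """
    return COPIES * HUMAN_MB / genome_size_mb


for organism, size_mb in [("S. cerevisiae", 12.1), ("E. coli", 4.7)]:
    print(organism, "%.2e" % copies_in_10_ug(size_mb))
```

The smaller the genome, the more copies fit into the same mass of DNA, which is why the estimate rises by orders of magnitude for yeast and bacteria.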
Appendix 3
Simple UNIX Commands
The following tables contain a brief list of simple but useful UNIX commands (see footnote 1).
These commands can be used to move around the file system, examine files, and copy,
delete, or rename files. They can also be used to do housekeeping on a user’s account,
and to communicate with other users on the local system or on remote systems.
Directory Operations
Command Action
File Operations
Command  Action
ls       list files
cp       copy files: cp /path/name newname
rm       remove (i.e. delete) files: rm name
mv       move or rename files: mv name newname
more     page file contents (spacebar to continue): more name
cat      scroll file contents: cat name
less     better pager than more? (q to quit): less name
vi       visual text editor (:wq to save and quit): vi name
pico     pico text editor (Ctrl-X to quit): pico name
chmod    change mode of file permissions: chmod xxx name
1. Most commands have options. To see what options are available, use the man command to
open the manual pages for that command, e.g. type man ls to open the manual for the ls command.
Manual Pages
Command Action
Communications
Command Action
System Operations
Command  Action
df       show free disk space
du       show disk usage
ps       list your processes
kill     kill a process: kill ###
passwd   change your password
date     show date and time
w        who is doing what on the system
who      who is connected to the system
ping     ping another computer (is it alive?)
finger   get information on users
exit     exit, or logout, from the system
X Windows
Command Action