Introduction to Bioinformatics
A Theoretical and Practical Approach
Edited by
Stephen A. Krawetz
David D. Womble

Includes
CD-ROM

18 GCG File Management

Sittichoke Saisanit

Introduction
Users are most likely to encounter the Wisconsin Package (GCG) via its web
interface, SeqWeb. As the name implies, SeqWeb is a web interface product that
allows access to many programs in the GCG package. However, there are still a
number of advantages to using GCG from the UNIX command line. First, the
command-line interface is more amenable to batch processing of large
datasets. Second, the command-line interface allows access to all programs, not
just the subset available through the web interface. The use of GCG under the
UNIX command line is presented in this chapter.

UNIX Commands and Overview


Familiarity with UNIX as presented in Chapter 13 and Appendix 3 is recommended
prior to studying this chapter. Here are a few UNIX commands and rules that can
help you get started. Unlike DOS and VMS, UNIX is case-sensitive. For example, the
file name mygene.seq is different from Mygene.seq or any other mixed-case
combination. The man command is short for manual; it is equivalent to help in other
operating systems and programs. To find out how a certain UNIX command is used,
type man followed by the command of interest at the command prompt (%), then
press Enter. For example: % man cd
The manual pages for the cd command will be displayed. The cd command is used to
change directory from one location to another. For example, issued from /usr/home:
% cd /usr/common/myproject
This command changes the current working directory from /usr/home to
/usr/common/myproject.
What if the command itself is not known? One powerful feature of the man command
is the -k modifier, which searches the manual pages by keyword. For example, to
find a command to delete a file, enter % man -k remove.
The command lists the titles of man pages that contain the word remove.
Chapter 13 provides an introduction to UNIX and describes many UNIX commands.
In order to use GCG effectively with the command line interface, it is important
to learn how to manipulate files and directories. This is fundamental to any
operating system.

Using Database Sequences and Sequence Formats


The GCG package needs to be initialized by sourcing two scripts. This can be
automated at user log-in by adding the commands to the .cshrc file.
There are several formats of sequence data. Users may be familiar with GenBank
or EMBL format. GCG has its own format. It also provides several utility programs to
convert sequences from one format to another. The GCG format has one notable
signature: 2 dots (..) separate the annotation from the sequence itself. The annotation
precedes the 2 dots and the sequence follows them. In most cases, users should not
worry about the GCG sequence format. All one needs to do is learn how to retrieve or
specify a sequence from these databases when executing a GCG program. The GCG
convention for specifying a sequence is database:accession or database:locus_name.
Here, database is the GCG logical name for a database. These names have been set by
the GCG administrator. For example, it is customary to set gb as the logical name of
the GenBank database. To find out whether or not a local installation of GCG has gb
as one of its logical names, issue the following command: % name gb
To list the logical names, issue the command name without any database name.
Example: % name
To retrieve a sequence, the GCG Fetch command can be used.
Example: % fetch gb:m97796
Assuming that the GenBank database is installed locally and gb is set as its logical
name, the above command will retrieve a GenBank sequence which has an accession
number of M97796. The result will be written to a file with a default name unless a
name is specifically given to the program.
The Fetch program can also work without database name specification.
Example: % fetch m97796
However, if the accession number appears in more than one database, Fetch will
retrieve all of the sequence records. To ensure uniqueness and speed of retrieval, it is
best to use Fetch with full specification of the database name and sequence accession
number.
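The database:accession convention can be illustrated with a minimal Python sketch. It is not part of GCG; the function name parse_sequence_spec is hypothetical, and the sketch assumes that the first colon, if any, separates the logical database name from the accession number or locus name.

```python
def parse_sequence_spec(spec):
    """Split a specification such as 'gb:m97796' into (database, identifier).

    Hypothetical helper for illustration only. When no database is given
    (for example 'm97796'), the database part is returned as None.
    """
    if ":" in spec:
        database, identifier = spec.split(":", 1)
        return database, identifier
    return None, spec


print(parse_sequence_spec("gb:m97796"))      # fully specified
print(parse_sequence_spec("sw:EGFR_HUMAN"))  # database plus locus name
print(parse_sequence_spec("m97796"))         # no database given
```

A script that prepares many Fetch commands could use such a check to flag entries that lack a database name, since those are the ones that may match records in more than one database.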
If a GCG sequence file is created or modified with a text editor, its checksum
no longer matches the sequence, and GCG programs will not recognize the file.
Users need to run a utility called reformat (shown below) to correct the checksum.
Example: % reformat myseq.seq
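To see why hand editing invalidates a file, it helps to look at how a checksum depends on every residue. The chapter does not describe the calculation, so the Python sketch below is an assumption: it follows the position-weighted scheme commonly attributed to GCG, the same scheme implemented by Biopython's Bio.SeqUtils.CheckSum.gcg. The function name gcg_checksum is illustrative.

```python
def gcg_checksum(sequence):
    """Return a GCG-style checksum in the range 0-9999.

    Assumed scheme: each residue's character code is multiplied by a
    weight that cycles from 1 to 57 along the sequence; the weighted sum
    is reduced modulo 10000. Case is ignored.
    """
    total = 0
    for index, residue in enumerate(sequence):
        weight = index % 57 + 1
        total += weight * ord(residue.upper())
    return total % 10000


original = "ACGT"
edited = "ACGA"  # one residue changed by hand
print(gcg_checksum(original), gcg_checksum(edited))
```

Changing a single residue changes the result, so a checksum stored in the file before the edit no longer agrees with the sequence. Running reformat, as in the example above, is what corrects the checksum.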
Another useful file format in GCG is the Rich Sequence Format (RSF). In SeqLab,
which is a graphical user interface for GCG run under X Windows, RSF is particularly
useful because sequence annotations such as domains and phosphorylation sites can
be displayed for visualization. SeqLab can only be run from a UNIX workstation or an
X Windows emulation program. There are two modes of working when inside SeqLab:
main list and editor. The graphical sequence viewer is available in SeqLab editor
mode. The reformat command can be used to convert a GCG sequence into an RSF
format sequence by including the -RSF parameter in the command line: % reformat
-rsf myseq.seq.
It is recommended that users name sequence files consistently. By default, GCG
does not require consistent naming and UNIX does not insist on file types. In
contrast, DOS and Windows usually require a 3-letter file type extension for files to be
recognized correctly by the programs. However, after accumulating files it will be
difficult to recognize older files. Users should make a habit of naming sequence
files with meaningful names and consistent name extensions. Appending a .dna or
.seq extension to nucleic acid sequence files, or a .pep or .pro extension to protein
sequence files, will help in recognizing these files. In addition, storing sequence files
in specific directories for each project is generally a good idea. Once a certain project
has been completed, the entire directory can be archived or removed.
Another file format users may encounter is the FASTA format. Most public
sequence utilities on the web can accept or produce sequences in FASTA format. GCG
has a utility to convert GCG sequences into FASTA format sequences and vice versa.
This is useful because it allows one to use other available tools and external sequences.
Tofasta converts a GCG sequence into a FASTA format sequence.
Example: % tofasta gb:m97796
FromFasta converts a FASTA sequence into a GCG format sequence.
Example: % fromfasta pubseq.fasta
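The relationship between the two formats can be sketched in a few lines of Python. This is not how ToFastA is implemented; it is a simplified illustration that assumes a single-sequence GCG file in which the divider line ends with the 2 dots and the sequence lines that follow carry position numbers and spaces. The function name gcg_to_fasta and the sample record are made up.

```python
def gcg_to_fasta(gcg_text, name, width=60):
    """Convert the text of a single-sequence GCG file into a FASTA record."""
    lines = gcg_text.splitlines()
    # Find the divider line. Testing the end of the line avoids being
    # misled by annotation such as feature ranges written as 1..12.
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            body = lines[i + 1:]
            break
    else:
        raise ValueError("no line ending in '..' found")
    # Keep letters (and '*'); drop position numbers and spacing.
    residues = "".join(ch for ch in "".join(body) if ch.isalpha() or ch == "*")
    wrapped = [residues[j:j + width] for j in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


example = (
    "Illustrative annotation; the feature 1..12 is made up\n"
    " myseq.seq  Length: 12  Check: 5688  ..\n"
    "\n"
    "       1  ACGTACGTAC GT\n"
)
print(gcg_to_fasta(example, "myseq"), end="")
```

Everything before the divider is discarded, which is why annotation is lost when a sequence is exported to FASTA format.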

Editing GCG Formatted Sequences


The need to edit sequences may come from users’ own sequencing efforts. Addi-
tionally, users may want to track recombinant sequences such as products of mutagen-
esis. SeqEd is a utility to edit sequences in a much more efficient manner than a text
editor. SeqEd has another advantage in that the edited sequence will be recognized by
other GCG programs without the need to run the reformat command. Annotations to
specific residues can also be placed within an edited sequence. SeqEd can be started
by entering: % seqed myseq.seq.
Once inside the SeqEd editor, press Control-D to enter an editor command. Entering
help in the editor brings up a list of commands that can be used inside the editor.

List File
When working with a family of gene or protein sequences, there is often a need to
simultaneously manage a number of sequences. GCG provides a powerful function
called a list file. A list file is simply a text file that contains a list of individual
sequences beginning with 2 dots (..) and separated by new lines. The GCG programs
ignore any text before the 2 dots and any text after an exclamation mark (!). Therefore,
comments or descriptions of sequences can be added. An example of a list file:
Sequences of mammalian EGF receptors and related family members.
..
sw:EGFR_HUMAN
sw:EGFR_MOUSE
/usr/home/newdata/myseq.pep ! unpublished EGFR-related sequence
As shown in this example, a list file can contain either database sequences or
local user sequences or both. A list file is accessed by preceding the file name with
the @ symbol. For example, to retrieve all sequences in the list file named egfr.list
to the current working directory, use the GCG Fetch command:
% fetch @egfr.list.
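The list file rules described above (ignore text before the 2 dots, ignore text after an exclamation mark) can be expressed as a short Python sketch. It only illustrates the rules as stated in this chapter and is not GCG code; the function name parse_list_file is hypothetical, and the sketch assumes the introductory comment itself contains no 2 consecutive dots.

```python
def parse_list_file(text):
    """Return the sequence specifications named in a list file."""
    # Everything before the first '..' is free-text commentary.
    header, divider, body = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider found")
    entries = []
    for line in body.splitlines():
        # Text after '!' is a comment on that entry.
        spec = line.split("!", 1)[0].strip()
        if spec:
            entries.append(spec)
    return entries


egfr_list = """Sequences of mammalian EGF receptors and related family members.
..
sw:EGFR_HUMAN
sw:EGFR_MOUSE
/usr/home/newdata/myseq.pep ! unpublished EGFR-related sequence
"""
for entry in parse_list_file(egfr_list):
    print(entry)
```

Applied to the example list file, the sketch yields the two SWISS-PROT entries and the local file path, with the description and the trailing comment removed.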

In addition to making a list file, multiple sequences can be aligned and written to a
single Multiple Sequence Format (MSF) file. Several GCG programs can output files
in an MSF format. For example, the reformat program with the -MSF parameter can
be used to convert a group of sequences from a list file into an MSF formatted file.
Example: % reformat -msf @egfr.list
However, reformat does not align the input sequences. The file resulting from
reformat can be named egfr.msf, for example. This MSF file can then be used as input
for other GCG programs. One sequence, a subset, or all of the sequences in an MSF
file can be used. To specify a single sequence from an MSF file, type the MSF file
name followed by the sequence name in curly brackets, for example egfr.msf{egfr1}.
To specify multiple sequences, an asterisk wildcard character must be used. For
example, egfr.msf{egfr*} specifies the sequences in the egfr.msf file whose names
begin with egfr. Similarly, egfr.msf{*} indicates that all sequences in the MSF file
will be used. Note that a plain file name is not sufficient to specify sequences from
an MSF file. Either a sequence name or a wildcard in curly brackets must be used with
the file name.
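The curly-bracket notation can be illustrated with a small Python sketch that separates the file name from the name pattern and applies the pattern to a list of sequence names. The function names and the sequence names are hypothetical, and the case-sensitive matching done by fnmatchcase is an assumption made for simplicity, not a statement about how GCG compares names.

```python
import re
from fnmatch import fnmatchcase


def parse_msf_spec(spec):
    """Split 'egfr.msf{egfr*}' into ('egfr.msf', 'egfr*')."""
    match = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<pattern>[^{}]+)\}", spec)
    if match is None:
        raise ValueError("expected file{name} or file{pattern}: " + spec)
    return match.group("file"), match.group("pattern")


def select_sequences(spec, names_in_file):
    """Return the names from names_in_file selected by the specification."""
    _, pattern = parse_msf_spec(spec)
    return [name for name in names_in_file if fnmatchcase(name, pattern)]


names = ["egfr1", "egfr2", "erbb2"]  # made-up contents of egfr.msf
print(select_sequences("egfr.msf{egfr1}", names))
print(select_sequences("egfr.msf{egfr*}", names))
print(select_sequences("egfr.msf{*}", names))
```

A plain file name such as egfr.msf is rejected by parse_msf_spec, mirroring the rule that a sequence name or wildcard must accompany the file name.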
SeqLab, the X Windows interface for GCG, can also output MSF files from a list of
sequences. GCG command-line programs that can output MSF files are listed below.
Programs that require -MSF parameter are listed accordingly.
• LineUp -MSF
• PileUp
• PrettyBox
• ProfileGap -MSF
• ProfileSegments -MSF
• Reformat -MSF
Below are two examples of how to use MSF files in a program without (PileUp)
and with (LineUp) “-MSF” option requirement.
Example: % pileup egfr.msf{*}
% lineup -msf egfr.msf

Graphic Files
Several GCG programs have an option to generate output in a graphic format. In
order to use the graphic feature, a graphical language and a graphic device must be
defined. The command ShowPlot displays the current graphic device while the com-
mand SetPlot changes it. After setting the graphic device, the command PlotTest can
generate a test graphic output. It is a quick and easy way to determine whether the
device is properly configured.
Graphic files require specific applications in order to be displayed correctly. They
cannot be displayed from the command-line interface like plain text files. Many GCG
programs write their graphic output to .figure files.
Graphics can be displayed directly on the screen if an appropriate device is selected.
For example, on an X Windows terminal, ColorX can be used. ColorX is a graphic
language and a device for the X Windows environment.
File management in GCG requires knowledge of the operating system on which
GCG runs. Most likely, it is one of many flavors of UNIX. Common sense should be
applied to maintain naming consistency and to facilitate the task of file
organization. This is helped by the various file utilities for creating sequence files
and converting them into proper formats. Learning how to manage and use graphic
files will be helpful to visualize the output from many GCG programs.

Glossary and Abbreviations


EMBL Nucleotide Database Europe’s primary collection of all publicly available
nucleotide sequences. It is maintained in collaboration with GenBank and DDBJ (Japan).
GCG Genetics Computer Group started in 1982 within the Department of Genetics
at the University of Wisconsin. It went private in 1990 and was acquired by Oxford
Molecular Group in 1997. In 2000, Oxford Molecular was acquired by Pharmacopeia,
resulting in a new company called Accelrys (see Website: http://www.accelrys.com),
which is currently the commercial distributor of the GCG® Wisconsin Package™.
GenBank An annotated collection of all publicly available nucleotide sequences.
The protein sequence collection is referred to specifically as GenPept. GenBank is
maintained by the National Center for Biotechnology Information (NCBI), a unit of the
US National Institutes of Health (NIH).
SWISS-PROT An annotated protein sequence database maintained and curated
by the Swiss Institute of Bioinformatics (SIB). The database designation is often
abbreviated as SW in GCG.
Wisconsin Package A suite of tools and programs for Bioinformatics sequence
analysis developed by GCG. It runs on various UNIX operating systems including
SUN Solaris, SGI IRIX, Compaq Tru64 UNIX, IBM AIX, and Red Hat Linux.

Appendices

1. CD Contents
2. A Collection of Useful Bioinformatic Tools
and Molecular Tables
3. Simple UNIX Commands

1 Appendix
CD Contents

What is Included on the CD?


The CD that comes with this book includes:
1. All Figures and Tables with legends from the various chapters, many of which
are in color. This is an excellent source of illustrative material for presentations.
2. Several bioinformatics software packages that the readers can install on their own
computer workstations or servers.
3. Several useful basic tables and charts for understanding genome properties.
The CD is organized into folders and subfolders. The readers should be able to load
the CD into the CD-drive of any IBM-Personal Computer or Apple Macintosh and
browse through the folders.
The color figures can be found in the Color Figures folder, organized into sub-
folders by chapter.
The software packages can be found in the Programs folder, organized into sub-
folders by the name of each package. For each program subfolder, there is a ReadmeCD
file that provides further information about the software, including how to install it, use
it, and where up-to-date versions can be downloaded from the Web. There is also infor-
mation on licensing and registration, and restrictions that may apply.

BioDiscovery
This folder contains software packages for microarray analysis that may be installed
on IBM-PC computers. Installation instructions are included in the file named
Readme.pdf. You will need to use the Acrobat Reader utility to read the file (see
Section “Adobe Acrobat Reader”). The BioDiscovery software was kindly provided
by Sorin Draghici, author of Chapter 35.

ClustalX
This folder contains the graphical interface versions of the Clustal multiple
sequence alignment program. Versions for both IBM-PC (clustalx1.81.msw.zip) and
Macintosh (clustalx1.81.PPC.sea.Hqx) are included. The files in the packages will
need to be unpacked with common unzipping utilities. ClustalX versions for various
flavors of UNIX are also available from the original source FTP website (see
Website: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/), described in the readme file.
Permission to include ClustalX on this CD was kindly provided by Julie Thompson
and is described by Steven Thompson in Chapter 31.


Ensembl
This folder contains the files needed to install the Ensembl package on a UNIX
server. Installation instructions are located in the additional docs subfolder in the
file named EnsemblInstall100.pdf. You will need to use the Acrobat Reader utility
to read the file (see Section “Adobe Acrobat Reader”). The source code subfolder
contains the required source code for both Ensembl and Bioperl. Note that the files
in the source code folder are in UNIX format. Please use BINARY FTP mode to
transfer those files to your UNIX server. Up-to-date versions of Ensembl and Bioperl
are available at their respective Websites (see Websites: http://www.ensembl.org/
and http://www.bioperl.org). The Ensembl software was kindly provided by James
Stalker, author of Chapter 25.

MicroAnalyser
This folder contains a software package for microarray analysis that may be installed
on Macintosh computers. Up-to-date versions of the software are available (see Website:
http://imru.bham.ac.uk/MicroAnalyser/). Permission to include the MicroAnalyser
software on this CD was kindly provided by Adrian Platts.

Oligo
This folder contains demo versions of the Oligo primer design and analysis soft-
ware for both IBM-PC and Macintosh computers. This software was kindly provided
by Wojciech Rychlik, author of Chapter 21.

Sequencealign
This folder contains a PowerPoint demonstration of sequence alignment. It was
kindly contributed by David S. Wishart, author of Chapter 27.

Singh_perl_scripts
This folder contains perl scripts for statistical analysis that were generously con-
tributed by Gautam Singh, author of Chapters 22 and 23. They can be used for solving
the problems described in Chapter 23.

Staden
This folder contains the Staden Sequence Analysis Package and the Gap4 Viewer
software that can be installed on an IBM-PC computer. For up-to-date versions see
Website: http://www.mrc-lmb.cam.ac.uk/pubseq/. This software was kindly provided
by Roger Staden, author of Chapters 20 and 24.

TreeView
This folder contains the TreeView tree drawing software for both IBM-PC
and Macintosh computers. TreeView is a free program for displaying phylogenies.
Up-to-date versions, including UNIX versions, can be found (see Website:
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). Please visit the Website to register
TreeView if you wish to use it. Permission to include TreeView on this CD was
kindly provided by Roderic D. M. Page.

Adobe Acrobat Reader


In several of the folders on the CD, there are information files that may be in PDF
format. To read PDF format files, you will need to have the free Acrobat Reader utility
installed on your computer. If you do not already have Acrobat Reader installed, you can
download it (see Website: http://www.adobe.com/products/acrobat/readstep.html).

Other Sources for Bioinformatics Software


There are many sources available for downloading software that may be useful.
Here are two of our favorites: EBI FTP Server (see Website: http://www.ebi.ac.uk/FTP/)
and IUBio Archive for Biology data and software (see Website:
http://iubio.bio.indiana.edu/).

2 Appendix
A Collection of Useful Bioinformatic Tools and Molecular Tables

The Genetic Code


                              2nd Position
1st                                                          3rd
Position   U           C           A            G            Position

U          UUU Phe     UCU Ser     UAU Tyr      UGU Cys      U
           UUC Phe     UCC Ser     UAC Tyr      UGC Cys      C
           UUA Leu     UCA Ser     UAA Stop     UGA Stop     A
           UUG Leu     UCG Ser     UAG Stop     UGG Trp      G

C          CUU Leu     CCU Pro     CAU His      CGU Arg      U
           CUC Leu     CCC Pro     CAC His      CGC Arg      C
           CUA Leu     CCA Pro     CAA Gln      CGA Arg      A
           CUG Leu     CCG Pro     CAG Gln      CGG Arg      G

A          AUU Ile     ACU Thr     AAU Asn      AGU Ser      U
           AUC Ile     ACC Thr     AAC Asn      AGC Ser      C
           AUA Ile     ACA Thr     AAA Lys      AGA Arg      A
           AUG Met     ACG Thr     AAG Lys      AGG Arg      G

G          GUU Val     GCU Ala     GAU Asp      GGU Gly      U
           GUC Val     GCC Ala     GAC Asp      GGC Gly      C
           GUA Val     GCA Ala     GAA Glu      GGA Gly      A
           GUG Val     GCG Ala     GAG Glu      GGG Gly      G

The codons are read as triplets in the 5' → 3' direction, i.e., left to right.
Termination codons are marked Stop.
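The table above can be put to work with a short Python sketch that builds the 64 codons in the same U, C, A, G order and translates an open reading frame. The names CODON_TABLE and translate are illustrative; amino acids are given in 1-letter code with * for a termination codon, and translation is assumed to start at the first base and stop at the first termination codon.

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, in table order: 1st position changes slowest,
# 3rd position fastest (UUU, UUC, UUA, UUG, UCU, ...). '*' marks Stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate an RNA or DNA coding sequence read 5' to 3'."""
    rna = sequence.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break  # termination codon reached
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUGGUAA"))
print(translate("ATGTTTTAG"))
```

Because DNA input is converted by replacing T with U, the same function accepts sequences copied from either DNA or RNA records; ambiguous bases are not handled in this sketch.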


IUPAC Nucleotide Codes


Code    Members             Nucleotide

A       A                   Adenine
C       C                   Cytosine
G       G                   Guanine
T       T                   Thymine (DNA)
U       U                   Uracil (RNA)
Y       C or T(U)           pYrimidine
R       A or G              puRine
M       A or C              aMino
K       G or T(U)           Keto
S       G or C              Strong interaction (3 H bonds)
W       A or T(U)           Weak interaction (2 H bonds)
H       A or C or T(U)      not-G
B       G or T(U) or C      not-A
V       G or C or A         not-T
D       G or A or T(U)      not-C
N       G, A, C or T(U)     aNy base
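A common use of these codes is to test whether a concrete sequence fits an ambiguous pattern, or to list every sequence a pattern stands for. The Python sketch below encodes the table for DNA (T rather than U); the names IUPAC_MEMBERS, matches, and expand are illustrative.

```python
from itertools import product

# Members of each IUPAC nucleotide code, written for DNA.
IUPAC_MEMBERS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def matches(pattern, sequence):
    """True if the unambiguous DNA sequence is described by the pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_MEMBERS[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


def expand(pattern):
    """List every unambiguous DNA sequence the pattern stands for."""
    choices = [IUPAC_MEMBERS[code] for code in pattern.upper()]
    return ["".join(bases) for bases in product(*choices)]


print(matches("GANTC", "GATTC"))
print(expand("RY"))
```

The number of expansions grows quickly with the number of ambiguous positions, so expand is only practical for short patterns.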

IUPAC Amino Acid Codes


3-Letter Code 1-Letter Code Amino Acid

Ala A Alanine
Arg R Arginine
Asn N Asparagine
Asp D Aspartic acid
Cys C Cysteine
Gln Q Glutamine
Glu E Glutamic acid
Gly G Glycine
His H Histidine
Ile I Isoleucine
Leu L Leucine
Lys K Lysine
Met M Methionine
Phe F Phenylalanine
Pro P Proline
Ser S Serine
Thr T Threonine
Trp W Tryptophan
Tyr Y Tyrosine
Val V Valine
Asx B Aspartic acid or Asparagine
Glx Z Glutamic acid or Glutamine
Xaa X Any amino acid

Converting Base Size of a Nucleic Acid → Mass of Nucleic Acid


Number of Bases                        Mass of Nucleic Acid

1 kb ds DNA (Na+)                      6.6 × 10⁵ Da
1 kb ss DNA (Na+)                      3.3 × 10⁵ Da
1 kb ss RNA (Na+)                      3.4 × 10⁵ Da
1.52 kb ds DNA                         1 MDa ds DNA (Na+)
Average MW of a ds DNA base pair       660 Da
Average MW of a ss DNA nucleotide      330 Da
Average MW of an RNA nucleotide        340 Da

Converting Base Size of a Nucleic Acid → Maximum Moles of Protein

             Molecular      Amino
DNA          Weight (Da)    Acids    1 µg                                 1 nmol

270 bp       10,000         90       100 pmol or 6 × 10¹³ molecules       10 µg
1.35 Kbp     50,000         450      20 pmol or 1.2 × 10¹³ molecules      50 µg
2.7 Kbp      100,000        900      10 pmol or 6 × 10¹² molecules        100 µg
4.05 Kbp     150,000        1350     6.7 pmol or 4 × 10¹² molecules       150 µg

Average MW of an amino acid = 110 (Da).


3 bp are required to encode 1 amino acid.
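The entries in the table follow from the two rules just stated. A minimal Python sketch of the arithmetic is given below; the function names are illustrative, and the value used for Avogadro's number is the usual rounded constant rather than something taken from this appendix. Note that 90 amino acids at 110 Da come to 9,900 Da, which the table rounds to 10,000.

```python
AVERAGE_AMINO_ACID_DA = 110   # average MW of an amino acid (Da)
BP_PER_AMINO_ACID = 3         # 3 bp encode 1 amino acid
AVOGADRO = 6.022e23           # molecules per mole (rounded constant)


def max_protein_size(coding_bp):
    """Return (amino acids, approximate MW in Da) encoded by coding_bp."""
    amino_acids = coding_bp // BP_PER_AMINO_ACID
    return amino_acids, amino_acids * AVERAGE_AMINO_ACID_DA


def pmol_per_microgram(molecular_weight_da):
    """1 µg is 10^6 pg, and pg divided by Da (g/mol) gives pmol."""
    return 1e6 / molecular_weight_da


def micrograms_per_nanomole(molecular_weight_da):
    """1 nmol weighs MW ng, that is MW / 1000 µg."""
    return molecular_weight_da / 1000


print(max_protein_size(270))
print(pmol_per_microgram(10000), "pmol in 1 µg")
print(pmol_per_microgram(10000) * 1e-12 * AVOGADRO, "molecules in 1 µg")
print(micrograms_per_nanomole(10000), "µg in 1 nmol")
```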

Sizes of Common Nucleic Acids


Nucleic Acid           Number of Nucleotides    Molecular Weight

lambda DNA             48,502 (dsDNA)           3.2 × 10⁷
pBR322 DNA             4361 (dsDNA)             2.8 × 10⁶
28S rRNA               4800                     1.6 × 10⁶
23S rRNA (E. coli)     2900                     1.0 × 10⁶
18S rRNA               1900                     6.5 × 10⁵
16S rRNA (E. coli)     1500                     5.1 × 10⁵
5S rRNA (E. coli)      120                      4.1 × 10⁴
tRNA (E. coli)         75                       2.5 × 10⁴

Mass of Nucleic Acid ↔ Moles of Nucleic Acid


Mass Moles

1 µg/ml of nucleic acid 3.0 µM phosphate


1 µg of a 1 kb DNA fragment 1.5 pmol; 3.0 pmol ends
0.66 µg of a 1 kb DNA fragment 1 pmol
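The conversions between length, mass, and moles can be reproduced with the average molecular weights given earlier in this appendix (660 Da per base pair of ds DNA, 330 Da per nucleotide of ss DNA, 340 Da per nucleotide of RNA). The Python sketch below is illustrative; the function names are not from any package, and the fragment is assumed to be linear when counting ends.

```python
AVERAGE_MW_DA = {"dsDNA": 660.0, "ssDNA": 330.0, "RNA": 340.0}  # per bp or nt


def mass_da(length, kind="dsDNA"):
    """Approximate molecular weight of a nucleic acid of the given length."""
    return length * AVERAGE_MW_DA[kind]


def picomoles(micrograms, length, kind="dsDNA"):
    """Convert a mass in µg to pmol; 1 µg is 10^6 pg and pg / Da = pmol."""
    return micrograms * 1e6 / mass_da(length, kind)


def picomoles_of_ends(micrograms, length):
    """A linear double-stranded fragment has two ends per molecule."""
    return 2 * picomoles(micrograms, length, "dsDNA")


print(mass_da(1000, "dsDNA"))                  # 1 kb ds DNA, in Da
print(round(picomoles(1, 1000), 2))            # 1 µg of a 1 kb fragment
print(round(picomoles_of_ends(1, 1000), 2))    # ends in the same sample
print(round(picomoles(0.66, 1000), 2))         # 0.66 µg of a 1 kb fragment
```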

Sizes of Various Genomes


Organism Approximate Size (million bases)

Human 3000.0
M. musculus (mouse) 3000.0
Drosophila (fruit fly) 135.6
Arabidopsis (plant) 100.0
C. elegans (round worm) 97.0
S. cerevisiae (yeast) 12.1
E. coli (bacteria) 4.7
H. influenzae (bacteria) 1.8

Genomic Equivalents of Species


                                pg/haploid               µg Quantity for       Number of
Organism      Source of DNA     Genome (a)    Avg. (b)   Genome Equivalence    Genomes × 10⁶

Human         diploid           3.50          3.16       10.0                  2.86
Mouse         diploid           3.00          3.21       8.57                  2.86
Rat           diploid           3.00          3.68       8.57                  2.86
Bovine        haploid           3.24          3.24       9.26                  2.86
Annelid       haploid           1.45          1.45       4.14                  2.86
Drosophila    diploid           0.17          0.18       0.486                 2.86
Yeast         haploid           0.016         0.0245     0.0457                2.86

(a) pg/haploid genome was calculated as a function of the tissue source. Genomic equivalence
was calculated given that 10 µg of human genomic DNA contains 2.86 × 10⁶ genome copies.
(b) Average of all values given in each tissue for that species.
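The genome equivalence column can be recomputed from the pg/haploid genome column. The Python sketch below shows the arithmetic under the assumption stated in the footnote, namely that the reference point is 10 µg of human DNA at 3.50 pg per haploid genome; the function names are illustrative.

```python
PG_PER_MICROGRAM = 1e6


def genome_copies(micrograms, pg_per_haploid_genome):
    """Number of haploid genome copies contained in a mass of DNA."""
    return micrograms * PG_PER_MICROGRAM / pg_per_haploid_genome


def micrograms_for_copies(copies, pg_per_haploid_genome):
    """Mass of DNA (µg) that contains the requested number of genome copies."""
    return copies * pg_per_haploid_genome / PG_PER_MICROGRAM


reference = genome_copies(10, 3.50)  # 10 µg of human DNA
print(round(reference / 1e6, 2), "x 10^6 genome copies")
for organism, pg in [("Mouse", 3.00), ("Bovine", 3.24), ("Yeast", 0.016)]:
    print(organism, round(micrograms_for_copies(reference, pg), 4), "µg")
```

Each row of the table is the mass of DNA from that organism that contains the same number of genome copies as 10 µg of human DNA.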

3 Appendix
Simple UNIX Commands

The following tables contain a brief list of simple but useful UNIX commands.¹
These commands can be used to move around the file system, examine files, and copy,
delete, or rename files. They can also be used to do housekeeping on a user’s account,
and to communicate with other users on the local system or on remote systems.

Directory Operations
Command Action

pwd present working directory (show directory name)


cd change directory: cd /path/name
cd change to your home directory: cd
mkdir make (create) new directory: mkdir Name
rmdir remove directory (if empty): rmdir Name
quota check disk space quota: quota -v

File Operations
Command Action
ls list files
cp copy files: cp /path/name newname
rm remove (i.e. delete) files: rm name
mv move or rename files: mv name newname
more page file contents (spacebar to continue): more name
cat scroll file contents: cat name
less better pager than more? (q to quit): less name
vi visual text editor (:wq to save and quit): vi name
pico pico text editor (Ctrl-X to quit): pico name
chmod change mode of file permissions: chmod xxx name

¹ Most commands have options. To see what options are available, use the man command to
open the manual pages for that command, e.g., type man ls to open the manual for the ls command.


Manual Pages
Command Action

man open the man pages for a command: man command

Communications
Command Action

write write messages to another user’s screen


talk talk split-screen with another user: talk username
mail UNIX email command
pine send or read E-mail with pine mail system
telnet connect to another computer via the network
ftp file transfer over the network
lynx text-based Web browser

System Operations
Command Action
df show free disk space
du show disk usage
ps list your processes
kill kill a process: kill ###
passwd change your password
date show date and time
w who is doing what on the system
who who is connected to the system
ping ping another computer (is it alive?)
finger get information on users
exit exit, or logout, from the system

X Windows
Command Action

clock & display a clock (&: run in background)


cmdtool & command tool window
filemgr & file manager
mailtool & email program
perfmeter & system performance meter
seqlab & SeqLab interface for GCG
setenv DISPLAY for setting the DISPLAY environment variable
shelltool & shell tool window
textedit & text editor
xterm & X terminal window
