This document provides an introduction and overview of using the Galaxy platform for bioinformatics analyses. It describes the National Center for Genome Analysis Support (NCGAS) and its role in providing computational resources and support. It then gives a high-level overview of the Galaxy platform and interface. The document walks through an example analysis pipeline for transcriptome assembly from RNA-Seq data, including obtaining data from the shared data library, quality control and trimming of reads, assembly using Trinity, and assessment of assembly quality.

Intro to Using Galaxy – For Bioinformatics

Tom Doak
Carrie Ganote
National Center for Genome Analysis Support

September 17, 2013

Summary

•  Who is NCGAS?
•  Galaxy – what is it?
•  Galaxy 101 – a guided tour
•  Short intro to transcriptome assembly, as an example

National Center for Genome Analysis Support: http://ncgas.org

Who is NCGAS?

The National Center for Genome Analysis Support (NCGAS) is based at Indiana University (IU) in Bloomington, but serves a national audience with support from the NSF. We provide computational resources and support for genomics, transcriptomics, and meta projects.

Our Services
NCGAS provides support in the form of long- and short-term consultation
for genomics, proteomics, transcriptomics, and meta projects. We are
happy to answer questions about software, methods, and pipelines; basic
Linux use; experimental setup; and interpretation of results.

We administer bioinformatics software installation and upgrades on the Mason cluster at IU, as well as provide access to Mason to users of XSEDE’s national infrastructure. We provide support letters for NSF proposals pledging our compute resources.

Last, but not least, we install and maintain the local Galaxy instances for
Indiana University: IU, NCGAS, and Rockhopper.


What is Galaxy?
Galaxy is a web-based framework for running command-line utilities from
a snazzy graphical user interface.
The Galaxy web server that we will be using today is hosted at Indiana
University on the XSEDE virtual machines. This is a different “instance”
than Galaxy Main, which is hosted at Penn State.

Why choose our instance at IU over Galaxy Main:

•  IU only – less busy!
•  Large RAM jobs possible
•  Custom tools on request
•  On-site support

[Diagram labels: Galaxy@IU virtual machine, jobs, data, Mason, Data Capacitor; Galaxy Main, Penn State resources.]

Galaxy Anatomy and Physiology

Tool bar – contains the available steps to apply to data sets.
History – shows the steps previously taken to manipulate input data.
Focus pane – shows options, parameters, and output for the current item.

Galaxy 101 – Quick Start

We will depart this slideshow for a short time as we go through the basics of Galaxy using the Galaxy 101 tutorial. You can find a link to it on the home page for galaxy.indiana.edu.

You can choose to follow along either on IU Galaxy or on Galaxy Main – the tool layout is slightly different between the two instances.

Today’s Menu Item

We will be assembling the DNA polymerase protein units from the H37Rv strain of Mycobacterium tuberculosis, the causative agent of TB, historically known as consumption.

The raw reads originated from the Sequence Read Archive (SRA, formerly the Short Read Archive) at NCBI. The accession number for the set is SRX212035.

This dataset consists of paired-end, ~75 bp RNA-Seq reads.

[Image: Cristobal Rojas, La miseria (1886), from Wikipedia.]

Let’s get some sequence data

Galaxy allows users to publish their data to the entire user base.

•  Start with “Shared Data” at the top, then select Data Libraries from the menu.
•  Choose Workshop Data.
•  Expand the folder and check both boxes.
•  Import the data sets to the current history.
•  Once the data sets are imported, click on Analyze Data to return.

Step 1: Assess the Quality of Inputs

We will first get an idea of the quality of our input data sets.

The FastQC tool will produce graphical output that makes it easy to gauge the characteristics of the data: quality, patterns, biases, GC content, etc.

Choose either the left or right reads. Compare the results with your neighbor.
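To make one of the FastQC characteristics concrete, here is a minimal Python sketch of what "GC content" measures for a single read. It is only an illustration, not how FastQC itself is implemented; the function name `gc_fraction` and the example reads are made up for this sketch.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T).

    Ambiguous bases such as N are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # A read made only of ambiguous bases has no defined GC content; report 0.0.
    return gc / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical reads, not taken from the workshop data.
    for read in ["ACGTACGT", "GGGCCCAT", "NNNN"]:
        print(read, round(gc_fraction(read), 3))
```

A strong shift of the per-read GC distribution away from what the organism is expected to have is one of the biases a report like this can reveal.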

Base quality usually declines as the reads progress, so the later positions in each read tend to score lower.

The quality score is assigned by the sequencing machine as it reads each base. It is a rough estimate of how ambiguous the signal is: the higher the score, the lower the estimated probability that the base call is wrong.
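The relationship between a quality score and the error estimate can be sketched in a few lines of Python. This assumes Phred-scaled scores stored with the common ASCII offset of 33; the encoding of the workshop files is not stated here, so treat the offset as an assumption. The function names are invented for this sketch.

```python
from typing import Iterable, List


def phred_to_error_prob(q: int) -> float:
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality line into integer scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_per_position(quality_strings: Iterable[str], offset: int = 33) -> List[float]:
    """Average the scores at each read position across reads of any length."""
    totals: List[int] = []
    counts: List[int] = []
    for qs in quality_strings:
        for i, q in enumerate(decode_quality(qs, offset)):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Hypothetical quality lines: 'I' decodes to 40, '5' to 20, '!' to 0.
    print(decode_quality("I5!"))
    print(phred_to_error_prob(20))
    print(mean_quality_per_position(["II5", "I5"]))
```

A per-position average like this is roughly what the per-base quality plot summarizes, which is why a decline toward the end of the reads shows up so clearly there.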

Step 2: Trim Input Sequences

We’ve determined that the input data sets need some work before they are used in downstream processes. We’ll use the FASTQ Quality Trimmer by sliding window to trim reads based on quality score.

Run this tool for both input data sets.
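The idea behind sliding-window trimming can be sketched as follows. This is a simplified illustration that trims only the 3' end and uses the window mean; it is not the Galaxy tool's actual implementation, and the function name, window size, and threshold are assumptions chosen for the example.

```python
from typing import List, Tuple


def trim_3prime(seq: str, quals: List[int], window: int = 4,
                threshold: float = 20.0) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until the last `window` scores average >= threshold.

    Simplified sketch: the window shrinks when fewer than `window` bases remain.
    """
    end = len(quals)
    while end > 0:
        start = max(0, end - window)
        current = quals[start:end]
        if sum(current) / len(current) >= threshold:
            break  # the trailing window is good enough; stop trimming
        end -= 1   # otherwise drop one more base from the 3' end
    return seq[:end], quals[:end]


if __name__ == "__main__":
    # Hypothetical read whose quality decays toward the 3' end.
    read = "ACGTACGTAC"
    scores = [38, 38, 37, 36, 35, 30, 22, 10, 5, 2]
    print(trim_3prime(read, scores))
```

Because the decision is made on a window average rather than on single bases, one poor base surrounded by good ones does not end the read, which is the main reason to prefer a window over a per-base cutoff.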

Step 3: Rinse, Repeat

Now that the files are trimmed, we will re-assess their quality. If necessary, keep trimming away until you are satisfied with the input files.

I renamed my trimmed files to help me keep them straight.

Pictured are the left and right reads after trimming is complete. These will do!

Step 4: Assembly

Next we will put the reads together to create a complete picture of the actively transcribed genes of the sample organism.

Trinity is a de novo assembler that has been optimized for use on Mason. We will use it to assemble our reads.

It finished! We’re done, right?

An assembler solves the computational problem of putting together a puzzle from tiny pieces. The output of the assembler is a guess, and we don’t yet know how accurate it is. We could look at:

•  Basic stats of the assembly (the “contigs”)
•  Number of contigs vs. expected number
•  N50 – a length-weighted median of contig length
•  Average length
•  Max length
•  Check contigs against known genes with Blast (large or rare transcripts)
Step 5: Assessing Quality of Assembly

Important statistics for assembly quality: contig length distribution.

Assemblies will typically produce a number of complete contigs representing whole transcripts, and a large number of partial transcripts. This biases the average contig length toward the low end. The N50 is a measure weighted by total sequence length in the assembly: it is the contig length at which contigs of that length or longer contain at least half of all assembled bases.
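A short Python sketch shows how N50 is computed and why it differs from the plain average. The function name and the contig lengths are made up for illustration; the definition used is the common one, the length of the contig at which the running total of lengths (longest first) reaches half of the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the contig length at which the cumulative length, taken from the
    longest contig downward, first reaches half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for safety


if __name__ == "__main__":
    # Hypothetical assembly: one long contig and several short fragments.
    contigs = [100, 200, 300, 400, 500, 1500]
    print("mean:", sum(contigs) / len(contigs))
    print("N50:", n50(contigs))
```

In this toy assembly the many short fragments pull the mean down to 500, while the N50 is 1500 because the single long contig already holds half of the assembled bases.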

Getting these stats in Galaxy:

Run assemblystats to get a summary and histograms of your contig length distribution.
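For readers who want to see what such a summary involves, here is a rough Python sketch that reads contig lengths from FASTA text and bins them. It is not the assemblystats tool or its output format; the function names, the bin size, and the example records are assumptions for illustration.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_histogram(lengths: Iterable[int], bin_size: int = 100) -> Dict[int, int]:
    """Count contigs per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Basic summary: number of contigs, mean, minimum, and maximum length."""
    return {"count": len(lengths), "mean": mean(lengths),
            "min": min(lengths), "max": max(lengths)}


if __name__ == "__main__":
    # Hypothetical two-record FASTA, not real assembler output.
    example = ">contig_1\nACGT\nAC\n>contig_2\nGG\n".splitlines()
    sizes = fasta_lengths(example)
    print(summarize(sizes))
    print(length_histogram(sizes, bin_size=5))
```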

Step 6: Check Against Database

For this last step, we’ll check to see how well our assembled transcripts compare to what we already know.

Use this step to give a rough annotation of genes, to make sure that your transcripts are from nuclear genes, or to gauge how complete your sequence is.

For the sake of time, we’ll just Blast one gene. Filter the assembly output to get the smallest sequence.
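Outside of Galaxy, picking the smallest sequence from a FASTA file can be sketched like this. It only illustrates the filtering idea, not the Galaxy tools used in the workshop; the function names and the example records are hypothetical.

```python
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted lines into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def shortest_record(records: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the record with the shortest sequence (first one wins on ties)."""
    return min(records, key=lambda record: len(record[1]))


if __name__ == "__main__":
    # Hypothetical contigs, not real assembler output.
    text = ">contig_1\nACGTACGT\n>contig_2\nACG\n>contig_3\nACGTA\n"
    name, sequence = shortest_record(read_fasta(text.splitlines()))
    print(f">{name}\n{sequence}")
```

Searching a single short sequence keeps the database search quick, which is the point of filtering before this step.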

We will use Blastx to search the NR database for our gene.

Use default search settings for this test set.

Make sure to choose Pairwise HTML output for readability.
We see the expected genes as the top hits!

We could limit the number of hits depending on the output desired.

Step 7..?

RNA-Seq is a very versatile technology. You can use the data for:

•  Gene discovery based on transcripts
•  Genome evidence – introns, exons, junctions
•  Gene expression patterns
•  SNP calling/other variants
•  Protein divergence between samples

We have gotten to the assembly step, but there is a lot to learn about the data now that it is put together. A foundation in the use of Galaxy coupled with Indiana University resources will enable you to reach these goals.

Fin

Thanks for watching!

Questions and comments:
Email help@ncgas.org
