NGDM07v1 Wei Wang

This document discusses efficient data mining methods for enabling genome-wide computing. It describes the massive amount of genotype and phenotype data that will be available from mouse and human populations, including millions of genetic markers and phenotypic measurements. It outlines some of the challenges of analyzing such high-dimensional, heterogeneous and dynamic data, such as coping with dimensionality, normalizing disparate data types, and developing incremental and robust algorithms. Specific techniques discussed include compatible interval analysis, local perfect phylogeny trees, and sample selection algorithms to maximize genetic diversity.

Efficient Data Mining Methods for Enabling Genome-wide Computing

Wei Wang

University of North Carolina at Chapel Hill

Genotype codes for phenotype
Systems Genetics View
[Figure; labels: Cancer, Obesity]

Current View of Genome-wide Association Studies
[Figure; labels: ?, Cancer, Obesity]

mouse populations ≅ human populations

Total mouse SNPs = ~40M

musculus, domesticus, castaneus

Total human SNPs = ~20M

mouse populations ≅ human populations

 fast generation time

 reproducibility

 gene modification
Collaborative Cross

Parental Strains
1000 Independent Iterations
Recombinant Inbred Intercrosses (RIX)
Reproducible Outbred Population

~1,000,000 possible genomes

Genetics 170:1299, 2005
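
As a rough check on the order of magnitude quoted above, the following sketch counts the crosses obtainable from 1000 recombinant inbred lines. It assumes (the slide does not say so explicitly) that each RIX is a cross between two distinct RI lines; the variable names are illustrative only.

```python
from math import comb

n_ri_lines = 1000  # independent recombinant inbred lines, per the slide

# Unordered pairs: each pair of distinct RI lines is crossed once.
unordered_crosses = comb(n_ri_lines, 2)

# Ordered pairs: reciprocal crosses (A x B and B x A) counted separately.
ordered_crosses = n_ri_lines * (n_ri_lines - 1)

print(unordered_crosses)  # 499500
print(ordered_crosses)    # 999000
```

Under either counting convention the result is on the order of 10^6, which is consistent with the "~1,000,000 possible genomes" figure.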

Data and Knowledge Integration over TIME and SPACE

Phenotypes
What we are facing …
DATA DATA …… and more DATA
 In the near future, we will have
 a thousand RILs → a million RIXs
• How to select lines to design crosses having desired features?
 tens of millions of SNPs
• Can we infer phylogenetic structures?
• Can we estimate historical recombination events?
 millions of phenotypic measurements (molecular and physiological) and other derived variables.
• How to dissect complex correlations and causal relationships between variables?
• How to efficiently assess the statistical significance of the results?

A Data Miner’s View
 The dimensionality is extremely high
• How do we cope with the curse of dimensionality?
• Is it just a dimensionality reduction problem?
 The data matrix consists of disparate measurements, including both continuous and discrete variables, which may not be directly comparable to each other.
• How do we normalize data?
 The data matrix is not static, but grows as both new samples and new measurements are added.
• How do we make the algorithms incremental and adaptive?

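
One standard building block for the incremental setting is a running mean and variance that can be updated one observation at a time, which also supports z-score normalization of a continuous measurement without revisiting old data. The sketch below uses Welford's online algorithm; it is a generic illustration, not a method taken from the talk, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Welford's online algorithm for mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; undefined for fewer than two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    def zscore(self, x):
        """Normalize x against the statistics seen so far."""
        return (x - self.mean) / self.variance() ** 0.5


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # each new measurement costs O(1)

print(stats.mean, stats.variance())
```

Each update is constant time and constant memory, so new samples can be absorbed without recomputing over the whole data matrix. Discrete variables would need a different treatment.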
A Data Miner’s View
 Individual items may be contaminated, noisy or simply missing, which makes detectable relationships hard to “see”, and thus hard to interpret.
• How do we model noise?
• How to make the algorithms robust to noise?
• How to infer the missing values?
• Can we formulate it as a classification or regression problem?
 The number of unknowns far exceeds the number of knowns
• How to incorporate knowns in the methods?
 A large number of permutation tests are often needed to establish statistical significance
• How to speed up this repeated (but necessary) computation?

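
To make the cost of permutation testing concrete, here is a minimal sketch of a permutation test for a single biallelic marker against one continuous phenotype. The test statistic (absolute difference in group means), the function names, and the toy data are all assumptions chosen for illustration; the talk does not specify a statistic.

```python
import random


def mean_difference(genotypes, phenotypes):
    """Difference in mean phenotype between allele-1 and allele-0 carriers."""
    ones = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    zeros = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def permutation_p_value(genotypes, phenotypes, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling phenotype labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(genotypes, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if abs(mean_difference(genotypes, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: allele 1 carriers have visibly higher phenotype values.
genotypes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
phenotypes = [1.1, 0.9, 1.0, 1.2, 0.8, 2.1, 1.9, 2.0, 2.2, 1.8]
p = permutation_p_value(genotypes, phenotypes)
print(p)
```

Every permutation recomputes the statistic over all samples, and the whole loop is repeated per marker. With tens of millions of SNPs, that multiplication is the repeated computation the slide asks how to speed up.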
Human Interaction

Phenotypes
Sample Selection Maximizing Genetic Diversity
select two
genotypes, at biomolecular level,
Single Nucleotide Polymorphisms (SNP)
NP-complete
 maximizing the diversity within targeted regions
 minimizing the diversity outside the regions

Searching Algorithms
 Systematically enumerates all possible combinations of samples, from smaller subsets to larger ones, with effective pruning strategies
 based on pair-wise diversity

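
The enumeration-with-pruning idea can be sketched as a small branch-and-bound search. This is a simplified stand-in, not the algorithm from the talk: it assumes diversity is the sum of pairwise Hamming distances over the supplied SNP strings, it only maximizes (the "outside the regions" term is omitted), and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of SNP positions at which two samples differ."""
    return sum(x != y for x, y in zip(a, b))


def most_diverse_subset(samples, k):
    """Find k samples maximizing the sum of pairwise Hamming distances."""
    n = len(samples)
    dist = [[hamming(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in dist)
    total_pairs = k * (k - 1) // 2
    best = {"score": -1, "subset": None}

    def extend(subset, score, start):
        m = len(subset)
        if m == k:
            if score > best["score"]:
                best["score"], best["subset"] = score, tuple(subset)
            return
        # Optimistic bound: every pair still to be formed attains d_max.
        pairs_left = total_pairs - m * (m - 1) // 2
        if score + pairs_left * d_max <= best["score"]:
            return  # prune: this branch cannot beat the incumbent
        # Stop early enough that k - m samples can still be chosen.
        for i in range(start, n - (k - m) + 1):
            gain = sum(dist[i][j] for j in subset)
            subset.append(i)
            extend(subset, score + gain, i + 1)
            subset.pop()

    extend([], 0, 0)
    return best["subset"], best["score"]


samples = ["00000", "00001", "11111", "11110", "01010"]
print(most_diverse_subset(samples, 2))
print(most_diverse_subset(samples, 3))
```

The bound is deliberately crude. Because the underlying problem is NP-complete, the worst case remains exponential, and how much work is saved depends on how tight the pruning bound is.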
Compatible Intervals
 The genetic variation within the interval can be described by a perfect phylogeny tree
 No recombination event within a compatible interval
 An important step towards understanding the phylogenetic structures of the genome

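
For binary SNPs, a set of markers admits a perfect phylogeny exactly when every pair of markers passes the four-gamete test, that is, when at most three of the combinations 00, 01, 10, 11 occur across samples. The sketch below uses that test to split a haplotype matrix into compatible intervals with a greedy left-to-right scan. The greedy partition and the function names are assumptions made for illustration; the talk's own interval computation may differ (for example, by reporting overlapping maximal intervals).

```python
def compatible(col_a, col_b):
    """Four-gamete test: True if at most three of 00, 01, 10, 11 occur."""
    return len(set(zip(col_a, col_b))) < 4


def greedy_compatible_intervals(haplotypes):
    """Split SNP columns left to right into pairwise-compatible intervals.

    Returns a list of (first_snp, last_snp) index pairs, inclusive.
    """
    n_snps = len(haplotypes[0])
    columns = [[row[j] for row in haplotypes] for j in range(n_snps)]
    intervals = []
    start = 0
    for j in range(1, n_snps):
        # SNP j must be compatible with every SNP already in the interval.
        if not all(compatible(columns[i], columns[j]) for i in range(start, j)):
            intervals.append((start, j - 1))
            start = j
    intervals.append((start, n_snps - 1))
    return intervals


# Rows are samples, columns are SNPs (0/1 alleles).
haplotypes = ["0000", "1000", "1101", "0011"]
print(greedy_compatible_intervals(haplotypes))  # [(0, 2), (3, 3)]
```

In the toy data, SNPs 0 and 3 together show all four gametes, so the scan closes the first interval before SNP 3. Checking each new SNP against the whole current interval is quadratic in interval length in the worst case.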
Local Perfect Phylogeny Trees
 When quadratic time/space is too much, what is the minimal number of trees needed to describe an entire genome?
Linear Complexity
 How to compute all local perfect phylogeny trees efficiently?
 What are the common trees/subtrees?
 How to perform phylogeny tree-based association studies efficiently?

Concluding Remarks
 The ability to gather, organize, analyze, model, and visualize large, multi-scale, heterogeneous data sets rapidly is crucial.
 The massive scale and dynamic nature of data dictate that data mining technologies be fast, flexible, and capable of operating at multiple levels of abstraction.
 Novel data mining techniques are required to extract information, expose knowledge, and understand complex data.

Acknowledgements
 This is a joint project with

http://compgen.unc.edu/

 NSF IIS 0534580: “Visualizing and Exploring High-dimensional Data”
 EPA STAR RD832720: “Environmental Bioinformatics Research Center to Support Computational Toxicology Applications”
 NSF IIS 0448392: “CAREER: Mining Salient Localized Patterns in Complex Data”
 NIH U01 CA105417: “Integrative Genetics of Cancer Susceptibility”

