Architecture-Conscious Data Mining
    Srinivasan Parthasarathy
    Data Mining Research Lab
      Ohio State University
           KDD & Next Generation Challenges
• KDD is an iterative and interactive process the goal of
  which is to extract interesting and actionable
  information from potentially large data stores efficiently
• Young field, long laundry list of technical challenges
   – Theoretical foundations in various sub-fields
   – Interestingness and Ranking
   – New and Exciting Applications
       • Embedding domain knowledge effectively
   – Visualization for data & model understanding
   – Efficient and scalable algorithms (focus of this talk)
• Other challenges
   – Educational (talk a bit about this at the end)
   – Reproducibility (need for benchmarks)
   – Socio-Political
       Efficiency in the KDD process
• Why is it important?
  – Interactive nature of KDD
  – Real-time constraints
• What makes it challenging?
  – Dataset properties (large,
    heterogeneous, distributed)
  – Computational complexity
• Example Applications
  –   Clinical data
  –   Biological data
  –   Large scale simulation data
  –   Social network data
  –   Sensor data, WWW data, …
  [Figures: mining simulation data; diagnosing disease and
  modeling progression (Twa et al. 2005); analyzing (dynamic)
  networks – a protein interaction network (yeast)]
  Toward Efficient Realizations
• Data driven approach
   – Compression, Sampling, Dimensionality Reduction, Feature
     Selection, Matrix Factorization etc.
• Computation-driven approach
   – Intelligent search space pruning to reduce complexity
   – Approximate algorithms, streaming algorithms
   – Parallel and distributed algorithms
• Architecture-Conscious approach (this talk)
   – Largely orthogonal to the above alternatives
   – Objective is to understand limitations and novel features of
     modern and emerging architecture(s)
   – Subsequently, re-architect algorithms to better utilize system
     resources.
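Sampling, one of the data-driven techniques above, can be sketched with reservoir sampling, which keeps a uniform fixed-size sample of a stream whose length is unknown in advance. The function name and seed below are illustrative, not from the talk:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Reservoir sampling: maintain a uniform random sample of k items
// from a stream seen one element at a time.
std::vector<int> reservoir_sample(const std::vector<int>& stream,
                                  std::size_t k, unsigned seed = 42) {
    std::vector<int> sample;
    std::mt19937 gen(seed);
    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (sample.size() < k) {
            sample.push_back(stream[i]);        // fill the reservoir
        } else {
            // Keep element i with probability k / (i + 1).
            std::uniform_int_distribution<std::size_t> dist(0, i);
            std::size_t j = dist(gen);
            if (j < k) sample[j] = stream[i];   // evict a random slot
        }
    }
    return sample;
}
```

The mining algorithm then runs on the small sample rather than the full data store.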
    Houston, do we have a problem?
•   Turns out we do
    – Many state-of-the-art data mining algorithms grossly
      under-utilize processor resources [Ghoting 2005]
•   Why?
    1. Data intensive algorithms – lots of memory accesses
       – high latency penalty.
    2. Mining algorithms are extremely irregular in nature –
       data and parameter driven – hard to predict
    3. Use of pointer-based data structures – poor ILP
    4. Do not leverage important features of modern
       architectures – automated compiler/runtime systems
       are handicapped because of 1, 2 and 3.
                Spatial Locality
• Improve spatial locality of dynamic data structures
   – Memory pooling
   – Loss-less compression – store only the data that is
     needed – allows more data per cache line
   – Memory placement to match the dominant access order
   – Side benefit – enables effective hardware prefetching
     (a latency-alleviating mechanism)
   [Figure: a pointer-based tree laid out in DFS allocation order]
  Temporal Locality and Leveraging SMT
• Data Structure Tiling
   – Operate on a tile-by-tile basis
       • Non-overlapping (traditional)
       • Overlapping
• Smart data partitioning
   – Jigsaw puzzle analogy
• SMT
   – Co-scheduling tasks that operate on the same data tile
     helps improve performance
   [Figure: a tree partitioned into Tile 1 … Tile N]
              Sample Benefits
• Gains in performance can be staggering
   – Frequent patterns (itemsets, trees, graphs)
   – Outlier detection
   – Clustering
• Benefits to end applications
   – Scientific simulation data
   – Web data
   – Molecular and clinical data
• For networks of workstations
   – Minimize communication and leverage remote memory
   – Enables efficient mining of terabyte-scale distributed
     datasets
                     [VLDB’05, KDD’06, VLDBJ’07, PPoPP’07]
             CMPs (next frontier)
• Why the push from industry?
   – Increasing clock frequencies is no longer returning
     improved IPC, and it is increasing power costs and
     thermal issues
• Two new PCs in my den, no need for the heat vent!
   – Great for winters!
• Importantly
   – Parallel computing meets the mainstream commodity market
• Challenges
   – Existing applications need to be rewritten to use
     multiple threads of execution
   – Compiler and runtime techniques already have a hard
     time – the application must help
   – Fine-grained sharing of processor resources (cache,
     bus/channel, etc.)
   – Memory hierarchy issues are even more challenging
• Potential solution
   – Adaptable algorithms
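In its simplest form, the "rewrite for multiple threads" challenge means explicitly partitioning work across cores. This hypothetical sketch (names are mine, not the talk's) splits a reduction over a given number of threads:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Partition the input into one contiguous chunk per thread, reduce
// each chunk privately, then combine the partial results.
long long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(lo + chunk, data.size());
            for (std::size_t i = lo; i < hi; ++i)
                partial[t] += data[i];   // no sharing between threads
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Even this toy exposes the slide's points: per-thread partial sums avoid fine-grained sharing, and the chunking decision is exactly what compilers struggle to make without application help.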
            Adaptive algorithms
• Key idea: trading off memory for redundant computation
   – Benefits:
       • Reduced working-set sizes
       • Likely to have reduced bandwidth pressure
       • Utilizes the strengths of the CMP
   – Challenges:
       • Sensing the problem
       • Re-architecting the algorithm to reduce memory
         consumption
• Key idea: moldable partitioning and adaptive scheduling
  of tasks
   – Benefits:
       • Better CPU utilization
       • With co-scheduling, reduced cache miss rates
   – Challenges:
       • Sensing the problem
       • Re-architecting the algorithm
           – Moldable task decomposition
           – Passing on enough state to move a task to
             another core
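The "reduced working set" side of the trade-off can be illustrated with a classic dynamic-programming trick: computing a binomial coefficient with a single reusable row instead of the full table gives the same answer with an O(k) rather than O(n·k) working set. This is a toy example of the principle, not the talk's tree- or graph-mining reduction:

```cpp
#include <algorithm>
#include <vector>

// Binomial coefficient C(n, k) via a single DP row: same result as
// the full n x k table, but only O(k) memory stays resident.
long long binom_low_memory(int n, int k) {
    std::vector<long long> row(k + 1, 0);
    row[0] = 1;
    for (int i = 1; i <= n; ++i)
        // Sweep right-to-left so row[j-1] still holds the old value.
        for (int j = std::min(i, k); j >= 1; --j)
            row[j] += row[j - 1];
    return row[k];
}
```

The smaller footprint is what eases bandwidth pressure when several cores on a CMP share one memory channel.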
   Adaptive algorithms performance
• Graph mining
   – Gaston vs. gSpan vs. Hybrid (adaptive)
• Tree mining
   – Converted to sequence space (dynamic arrays)
       • Better locality, ILP
   – Reduced-memory LCS matching + structure checks
   – Leveraged hybrid scheduling
   – Sequential performance
       • Two orders of magnitude reduction in memory footprint
       • Three orders of magnitude improvement in processing
         time
   – Parallel performance
       • Linear scalability on a dual-chip, 4-cores-per-chip
         system (8 cores)
       • Adapted a similar idea to XML indexing with similar
         results!
                     [ICDM’06, CIKM’06, VLDB’07]
              Esoteric CMPs (CELL)
•   Interesting design point in the commodity CMP space
     – 25 GB/s OC bandwidth
     – 8 cores (SPUs) + 1 PPU
     – FP computation: 200 GFlops
     – Breakthroughs in commodity processing
•   Challenges
     – Hard to program
     – Need to explicitly manage memory and data transfers
       between the PPU and SPUs
     – Probably not suitable for all programs
     – An interesting class of algorithms and kernels can
       benefit significantly!
    [Figure: log-scale performance of kMeans, KNN, and Orca
    on Itanium 2, Xeon, Opteron 250, Pentium D, Cell-6, and
    Cell-8. Cell-6 on a Sony Playstation; Cell-8 is simulated.
    In all cases the codes were optimized and implemented with
    the appropriate compiler.]
              Mining on Clusters
• Heavily researched over the last 15 years
   – DDM Wiki (a very nice starting-point resource)
• What are the “new” challenges?
   – Non-homogeneous “hybrid” clusters – (e.g. Roadrunner)
   – Multi-level parallelism (on chip, on node, on cluster)
   – Leveraging the networking features of high-end systems
       • Infiniband makes it feasible and cheaper to access remote memory
         than local disk – how to leverage?
   – KDD may be particularly amenable to pipelined parallelism – a
     largely ignored approach
   – KDD and the grid (heard about this yesterday)
   – Application specific challenges -- e.g. astronomy, folding@home
     etc.
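Pipelined parallelism, the "largely ignored approach" above, overlaps stages of the KDD process: one thread produces intermediate results while another consumes them. A minimal two-stage sketch with a shared queue (all names hypothetical):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Two-stage pipeline: stage 1 transforms elements and enqueues them;
// stage 2 aggregates concurrently, overlapping the two phases.
long long pipeline_sum(const std::vector<int>& input) {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    long long total = 0;

    std::thread producer([&] {
        for (int x : input) {
            { std::lock_guard<std::mutex> lk(m); q.push(x * x); }  // stage 1
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    std::thread consumer([&] {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return !q.empty() || done; });
            while (!q.empty()) { total += q.front(); q.pop(); }    // stage 2
            if (done) break;  // queue drained and producer finished
        }
    });

    producer.join();
    consumer.join();
    return total;
}
```

On a cluster, the two stages would live on different nodes with the queue replaced by a message channel; the structure is the same.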
                        Discussion
• KDD is an iterative and interactive process the goal of
  which is to extract interesting and actionable information
  from potentially large data stores efficiently
• This talk was primarily about the last of these
  (efficiency), but all three are important.
• Architecture conscious data mining is a viable orthogonal
  approach to achieve efficiency (references in paper)
   –   Tangible benefits to applications, algorithms and kernels
   –   Lower memory footprints + significantly faster performance
   –   Adaptive algorithms are necessary for emerging architectures
   –   What's next? Service-oriented architectures
        • Plug-and-Play naturally connects with KDD process
        • An effective mechanism to keep cores busy.
             Broadly Speaking
• Education
  – As an aside, parallel algorithms and high-performance
    computing have to be part of the basic CS curriculum.
  – We, as a data-intensive science, need to better
    understand the key systems issues, with help from our OS
    and architecture friends.
• Broader Scientific Impact
  – Interactions between Systems and Data Mining
     • Data mining for software engineering, invariant tracking,
       testing, bug detection in sequential and parallel codes
     • Data mining for performance modeling
     • Leveraging systems features for data mining
                         Thanks
• Students
   – A. Ghoting, G. Buehrer, S. Tatikonda
• Collaborating Colleagues
   – OSU-Physics, OSU-Biomedical Informatics, Intel, IBM
• Funding agencies
   – NSF CCF0702587, CNS-0406386, CAREER-IIS-0347662, RI-
     CNS-0403342.
   – DOE Early career principal investigator grant
   – IBM Faculty partnership
• Organizers of this workshop
• Additional Information: dmrl.cse.ohio-state.edu or
  srini@cse.ohio-state.edu