EECE 5642
Data Visualization
Dimensionality Reduction
Y. Raymond Fu
Professor
Electrical and Computer Engineering (ECE), COE
Khoury College of Computer Sciences (KCCS)
Northeastern University
Attribute Dimensions and Orders
• Dimensions
– 1D: scalar
– 2D: two-dimensional vector
– 3D: three-dimensional vector
– >3D: multi-dimensional vector
• Orders
– scalars
– vectors (1st order)
– matrices (2nd order)
– tensors (higher order)
Data Table
www.many-eyes.com
Courtesy of Prof. Hanspeter Pfister, Harvard University.
Univariate Data Representations
MATLAB Box Plot
Courtesy of Prof. Hanspeter Pfister, Harvard University.
Original figures were from the slides of Stasko
Bivariate Data Representations
Courtesy of Prof. Hanspeter Pfister, Harvard University.
Original figures were from the slides of Stasko
Trivariate Data Representations
Courtesy of Prof. Hanspeter Pfister, Harvard University.
Original figures were from the slides of Stasko
Multi-Dimensional Data
Courtesy of Prof. Hanspeter Pfister, Harvard University.
Original figures were from the slides of Stasko
Multi-Dimensional Data Visualization
https://www.youtube.com/watch?v=wvsE8jm1GzE
What if the dimension of the data is 4, 5, 6, or even higher?
A world of high-dimensional measurements!
Dimensionality Reduction (DR)
High Dimensional Data
• Multimedia
– High-resolution images
– High-resolution videos
– Data from multiple sensors
• Bioinformatics
– Expressions of genes
– Neurons
• Social networks
– Tweets/likes/friendships
– Other interactions
• Weather and climate
– Multiple measurements (e.g., temperature)
– Time series data
• Finance
– Stock markets
– Time series data
Motivation and Goal of DR
• Reduce the degree of freedom in measurements
Replace a large set of measured variables with a small set of more
“condensed” variables
Simpler models are more robust on small datasets
• Reduce the computational load
By reducing the dimensionality of the data, the computational burden
(time and space) can be greatly decreased.
• Visualization
“Looking at the data”—more interpretable; simpler explanations
Make sense of the data before processing
Goal
• Extract information hidden in the data
Detect which variables are relevant for a specific task and how they interact
with each other. Reformulate the data with fewer variables.
Samuel Kaski, Jaakko Peltonen: Dimensionality Reduction for Data Visualization [Applications Corner]. IEEE Signal Process. Mag. 28(2): 100-104 (2011)
Motivation and Goal of DR
This is easier to interpret … than this
Courtesy of Prof. Jaakko Peltonen, Aalto University.
Feature Selection vs. Feature Extraction
Given a data set X consisting of n samples, where each sample has
dimension d.
• Feature Selection
Choose k important features (k < d), ignoring the remaining d – k.
Example: microarray data analysis
• Feature Extraction
Transform the original data set X from the d-dimensional space to a
k-dimensional space (k < d).
A general formulation: Y = P^T X, where X ∈ R^d and Y ∈ R^k.
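To make the projection concrete, here is a minimal NumPy sketch of Y = P^T X; the orthonormal matrix P is just a random placeholder for a basis that a DR method (e.g., PCA or LDA) would actually learn, and all names and sizes are illustrative.

```python
import numpy as np

# Minimal sketch of feature extraction as a linear projection Y = P^T X.
# P is a d x k matrix with orthonormal columns; here it is random (via QR),
# standing in for a basis learned by a DR method such as PCA or LDA.
rng = np.random.default_rng(0)
d, k, n = 64, 2, 500

X = rng.normal(size=(d, n))                   # data matrix: one sample per column, in R^d
P, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random orthonormal basis (placeholder)

Y = P.T @ X                                   # projected data: one sample per column, in R^k
print(X.shape, Y.shape)                       # (64, 500) (2, 500)
```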
Statistics & Linear Algebra Background
• Given a set of n data points {X_k} in R^d
– The mean is E{x}
– The variance is Var{x}
– The covariance between two variables x and y is Cov{x, y} = E{(x − E{x})(y − E{y})}
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
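A short NumPy sketch of these empirical quantities (toy data and illustrative variable names):

```python
import numpy as np

# Sketch: empirical mean, variance, and covariance for d-dimensional samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 samples in R^3, one sample per row

mean = X.mean(axis=0)                    # E{x}, per dimension
var = X.var(axis=0, ddof=1)              # Var{x}, per dimension (unbiased estimate)
cov = np.cov(X, rowvar=False)            # d x d covariance matrix
print(mean.shape, var.shape, cov.shape)  # (3,) (3,) (3, 3)
```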
Eigenvectors
• For the transform Y = AX, if there exists a vector e and a scalar λ such that Ae = λe, then
• e = [e1, e2, e3]^T is an eigenvector and λ is the eigenvalue associated
with this eigenvector.
• Under the transform A, e is only rescaled; its direction is unchanged.
• Example (a numerical sketch follows below)
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
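As a quick numerical check of the eigenvector property, here is a sketch with an arbitrary 3 × 3 matrix (not tied to any particular example on the slide):

```python
import numpy as np

# Sketch: verify A e = lambda e for one eigenpair of a small matrix.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
e, lam = eigvecs[:, 0], eigvals[0]

# Applying A to e only rescales it by lambda (up to floating-point error).
print(np.allclose(A @ e, lam * e))    # True
```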
Dimensionality Reduction Methods
• Linear Methods
– Principal Component Analysis (PCA), M.A. Turk & A.P. Pentland
– Multidimensional Scaling (MDS), T.F. Cox and M.A.A. Cox
– Locality Preserving Projections (LPP), X.F. He, S.C. Yan, Y.X. Hu
– Locality Pursuit Embedding (LPE), W.L. Min, K. Lu, and X.F. He
– Locally Embedded Analysis (LEA), Y. Fu and T.S. Huang
• Nonlinear Methods
– Locally Linear Embedding (LLE), S.T. Roweis & L.K. Saul
– Laplacian Eigenmaps, M. Belkin & P. Niyogi
– Isomap, J.B. Tenenbaum, V. de Silva, and J.C. Langford
– Hessian LLE, D.L. Donoho & C.E. Grimes
– Semidefinite Embedding (SDE), K.Q. Weinberger & L.K. Saul
• Fisher Graph Methods
– Linear Discriminant Analysis (LDA), R.A. Fisher
– Marginal Fisher Analysis (MFA), S.C. Yan, et al.
– Local Discriminant Embedding (LDE), H.-T. Chen, et al.
– Discriminant Simplex Analysis (DSA), Y. Fu and T.S. Huang
– Correlation Embedding Analysis (CEA), Y. Fu and T.S. Huang
Parametric vs. Nonparametric Learning
• Parametric Model
– Use a parameterized family of probability distributions to describe the nature
of a set of data (Moghaddam & Pentland, 1997).
– The data distribution is empirically assumed or estimated.
– Learning is conducted by measuring a set of fixed parameters, such as mean
and variance.
– Effective for large samples, but degrades for complicated data distributions.
• Nonparametric Model
– Distribution free.
– Learning is conducted by measuring the pair-wise data relationship in both
global and local manners.
– Effective and robust due to the reliance on fewer assumptions and parameters.
– Works for cases with small samples, high dimensionality, and complicated data
distributions.
Parametric Model
• Principal Component Analysis (PCA) and Linear Discriminant
Analysis (LDA)
• PCA captures the “principal” variations in the data
• It is computed by finding the eigenvectors of the covariance
matrix of the data
• Geometrically, PCA finds the directions of largest variation in
the underlying data
• Can be applied in data compression, pattern recognition, etc.
• Find a line going through the
data mean and along the direction
of maximum variation of the data.
• Assuming zero mean, the line is
represented as y = w^T x, where
w is the basis vector and w^T w = 1.
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
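A minimal PCA sketch, assuming centered (zero-mean) data and using the eigendecomposition of the covariance matrix; this is illustrative code, not the exact implementation referenced on the slides.

```python
import numpy as np

# Minimal PCA sketch: center the data, eigendecompose the covariance matrix,
# and project onto the top-k eigenvectors (directions of largest variation).
def pca(X, k):
    """X: n x d data matrix (one sample per row). Returns the n x k projection and the d x k basis W."""
    Xc = X - X.mean(axis=0)               # center so the zero-mean assumption holds
    C = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues (symmetric matrix)
    W = eigvecs[:, ::-1][:, :k]           # top-k principal directions, each with w^T w = 1
    return Xc @ W, W

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))  # correlated toy data
Y, W = pca(X, k=2)
print(Y.shape, W.shape)                   # (300, 2) (10, 2)
```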
Principal Component Analysis
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
Principal Component Analysis
Courtesy of Prof. Jaakko Peltonen, Aalto University.
Principal Component Analysis
• OptDigits Dataset
The data set contains 5,620 instances of digitized handwritten digits in
the range 0–9.
Each digit is a vector in R^64: 8 × 8 = 64 pixels.
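A hedged sketch of this kind of visualization using scikit-learn, which ships a 1,797-sample subset of the UCI OptDigits data rather than all 5,620 instances; each sample is still an 8 × 8 image flattened to R^64.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Sketch: project 8x8 handwritten digits (R^64) to 2D with PCA for visualization.
digits = load_digits()                               # 1,797-sample subset of OptDigits
Y = PCA(n_components=2).fit_transform(digits.data)

plt.scatter(Y[:, 0], Y[:, 1], c=digits.target, cmap="tab10", s=8)
plt.colorbar(label="digit class")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```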
Principal Component Analysis
Courtesy of Prof. Jaakko Peltonen, Aalto University.
Principal Component Analysis
Eigenvectors visualized as eigenfaces
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
Linear Discriminant Analysis
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
Linear Discriminant Analysis
• Unlike PCA, LDA finds a discriminant subspace by including class label information
in the subspace modeling (supervised learning).
– Compute the within-class scatter
– Compute the between-class scatter
– Maximize the between-class scatter while minimizing the within-class scatter
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
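A minimal two-class LDA sketch, assuming the classic closed form w ∝ Sw^{-1}(m1 − m2); the toy data and variable names are illustrative only.

```python
import numpy as np

# Sketch: two-class LDA / Fisher discriminant. Maximize between-class scatter
# relative to within-class scatter; the solution is w ~ Sw^{-1} (m1 - m2).
def lda_two_class(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the (unnormalized) class scatter matrices.
    Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(Sw, m1 - m2)      # discriminant (projection) direction
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X2 = rng.normal(loc=[3.0, 1.0], size=(100, 2))
print(lda_two_class(X1, X2))              # unit-norm projection direction
```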
LDA Definition
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
LDA Two-Class Case
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
LDA Multiple-Class Case
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
Different Subspace Base Vectors
• Different subspace basis vectors give
different projection directions
• Each subspace basis vector w forms a Fisherface
PCA vs. LDA
Digits data after PCA Digits data after LDA
Courtesy of Prof. Jaakko Peltonen, Aalto University.
PCA vs. LDA
PCA LDA
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
PCA vs. LDA
PCA LDA
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
PCA vs. LDA
• PCA performs worse under this condition
• LDA (FLD, Fisher Linear Discriminant) provides a better low-dimensional representation.
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
When LDA Fails
• LDA fails in the right figure (v1 is the projection
direction). Think about why.
Courtesy of Prof. Zhu Li, Hong Kong Polytechnic University.
Criteria of Nonparametric Model
• Manifold Learning: effective to model sample distributions
• Fisher Graph: effective to classify different classes
• Similarity Metric: effective to measure sample distances
• High-Order Data Structure: effective to describe intrinsic data structures
Manifold
• “A manifold is an abstract mathematical space in which every
point has a neighborhood which resembles Euclidean space,
but in which the global structure may be more complicated.”
--- from Wikipedia
• “A manifold is a topological space that is locally Euclidean.” ---
from MathWorld
• e.g., a 2D map of the 3D Earth is a manifold.
• A manifold representation can be obtained by projecting the original
data to a low-dimensional representation via subspace learning.
• The manifold criterion can provide more effective ways to model
the data distribution than conventional learning methods
based on the Gaussian assumption.
Manifold
http://en.wikipedia.org/wiki/Manifold
Manifold Learning
Swiss Roll → Dimensionality Reduction
Courtesy of Sam T. Roweis and Lawrence K. Saul, Science 2000
Locally Linear Embedding
http://www.cs.toronto.edu/~roweis/lle/
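A short sketch of LLE on a synthetic Swiss roll using scikit-learn; the neighborhood size k = 12 is an arbitrary illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Sketch: "unroll" a 3D Swiss-roll point cloud into 2D with LLE.
X, t = make_swiss_roll(n_samples=1500, random_state=0)   # t = position along the roll
Y = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=t, cmap="viridis", s=8)  # color by position along the roll
plt.title("Swiss roll unrolled by LLE")
plt.show()
```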
LEA for Pose Manifold
Linear embedding and subspace projection of 400 rotating teapot images. The number of nearest neighbors is k = 6.
Yun Fu, et al., “Locally Adaptive Subspace and Similarity Metric Learning for Visual Clustering and Retrieval”, CVIU, Vol. 110, No. 3, pp. 390-402, 2008.
LEA for Expression Manifold
Manifold visualization of 1,965 Frey’s face images by LEA using k = 6 nearest neighbors.
Yun Fu, et al., “Locally Adaptive Subspace and Similarity Metric Learning for Visual Clustering and Retrieval”, CVIU, Vol. 110, No. 3, pp. 390-402, 2008.
LEA for Emotion State Manifold
Manifold visualization for 11,627 AAI sequence images of a male subject using the LLE algorithm. (a) A video frame
snapshot and the 3D face tracking result. The yellow mesh visualizes the geometric motion of the face. (b) Manifold
visualization with k=5 nearest neighbors. (c) k=8 nearest neighbors. (d) k=15 nearest neighbors and labeling results.
Yun Fu, et al., “Locally Adaptive Subspace and Similarity Metric Learning for Visual Clustering and Retrieval”, CVIU, Vol. 110, No. 3, pp. 390-402, 2008.
LEA for Head Pose Manifold
Fisher Graph
• Graph Embedding (S. Yan, IEEE TPAMI, 2007)
– G={X, W} is an undirected weighted graph.
– W measures the similarity between a pair of vertices.
– Laplacian matrix: L = D − W, where D is the diagonal degree matrix with D_ii = Σ_j W_ij
– Most manifold learning methods can be reformulated as
y* = arg min_{y^T B y = d} Σ_{i≠j} ||y_i − y_j||^2 W_ij = arg min_{y^T B y = d} y^T L y,
where d is a constant and B is the constraint matrix.
Figure: between-locality graph and within-locality graph. Courtesy of Shuicheng Yan.
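A small sketch of the graph quantities above: build a k-NN similarity graph W and form the Laplacian L = D − W that appears in the embedding objective. The toy data and the 0/1 connectivity weights are simplifications of the weighting schemes used in the cited methods.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Sketch: k-NN graph W (0/1 weights) and its Laplacian L = D - W.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                             # 100 samples in R^5

W = kneighbors_graph(X, n_neighbors=6, mode="connectivity").toarray()
W = np.maximum(W, W.T)                                    # symmetrize: undirected graph
D = np.diag(W.sum(axis=1))                                # degree matrix
L = D - W                                                 # graph Laplacian
print(np.allclose(L.sum(axis=1), 0))                      # each row of L sums to zero: True
```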
Discriminant Simplex Analysis
Y. Fu, et al., IEEE Transactions on Information Forensics and Security, 2008.
Similarity Metric
• Single-Sample Metric
– Euclidean Distance and Pearson Correlation Coefficient.
• Multi-Sample Metric
– k-Nearest-Neighbor Simplex
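A tiny sketch of the two single-sample metrics on illustrative vectors:

```python
import numpy as np

# Sketch: Euclidean distance and Pearson correlation between two samples x and y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 5.0])

euclidean = np.linalg.norm(x - y)        # Euclidean distance
pearson = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
print(euclidean, pearson)
```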
Correlation Embedding Analysis
Objective function: correlation distance combined with a Fisher graph.
Y. Fu, et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
High-Order Data Structure
– m-th order tensors: A ∈ R^{n_1 × n_2 × ⋯ × n_m}
– Here, tensor means a multilinear representation.
– 1st order: vector; 2nd order: matrix; higher orders: tensor
Y. Fu, et al., IEEE Transactions on Circuits and Systems for Video Technology, 2009.
Correlation Tensor Analysis
Given two m-th order tensors A and B, the Pearson Correlation Coefficient (PCC) measures their similarity.
CTA objective function: correlation distance combined with a Fisher graph.
Multilinear representation: m different subspaces, one for each tensor mode.
Y. Fu, et al., IEEE Transactions on Image Processing, 2008.
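One simple way to compute a PCC between two same-sized tensors is over their vectorized entries, as sketched below; the exact CTA formulation in the cited paper may differ in how centering and the multilinear subspaces are handled.

```python
import numpy as np

# Sketch: Pearson correlation between two 3rd-order tensors of the same shape,
# computed on their flattened (vectorized) entries.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8, 3))
B = A + 0.1 * rng.normal(size=(8, 8, 3))   # a noisy copy, so the PCC is near 1

pcc = np.corrcoef(A.ravel(), B.ravel())[0, 1]
print(pcc)
```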
Manifold with Noise Effect
Robust Manifold by Low-Rank Recovery
• Real-world ATR data are large scale, unbalanced in dynamic sampling, and
easily affected by noises and outliers, which are difficult to represent.
• Goal: automated, real-time, and robust description of the ATR data space
under uncertainty.
• Low-rank matrix recovery can deal with noises and outliers for data
reconstruction.
Stabilized Manifold Learning
Figure: raw data vs. an existing method (LLE) vs. the new method, under noise and outliers.
Voting for Outlier Detection
Stabilized Manifold Learning
Large Scale Manifold Learning
Graph-based methods require spectral decomposition of n × n matrices,
where n denotes the number of samples.
The storage and computational costs are O(n^2) and O(n^3), respectively,
so it is almost intractable to apply these methods to large-scale
scenarios.
Neighborhood search is also a large-scale concern.
Large Scale Manifold Learning
Graph-oriented clustering vs. K-means clustering
Robust Matching of Sub-Manifolds
A robust visual representation must be insensitive to durations in the case of
dynamics or time series, such as action/activity videos.
A generalized manifold can be considered as a union of sub-manifolds with
different durations which characterize different instances with similar structures,
such as different individuals performing the same action, instead of a single
continuous manifold as conventionally regarded.
Robust matching of these sub-manifolds can be achieved through both low-rank
matrix recovery and simplex synchronization.
Applications
Chemical data visualization
DR algorithm: multidimensional scaling (MDS)
Seung-Hee Bae, Jong Youl Choi, Judy Qiu, Geoffrey Fox: Dimension reduction and visualization of large high-dimensional data via interpolation. HPDC 2010: 203-214
Applications
Biology data visualization
DR algorithm: principal component analysis (PCA)
Andreas Lehrmann, Michael Huber, Aydin Can Polatkan, Albert Pritzkau, Kay Nieselt: Visualizing dimensionality reduction of systems biology data. Data Min. Knowl. Discov. 27(1): 146-165 (2013)
Applications
Biology data visualization
DR algorithm: locally linear embedding (LLE)
Andreas Lehrmann, Michael Huber, Aydin Can Polatkan, Albert Pritzkau, Kay Nieselt: Visualizing dimensionality reduction of systems biology data. Data Min. Knowl. Discov. 27(1): 146-165 (2013)
Applications
Bioinformatics
DR algorithm: multidimensional scaling (MDS)
Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox: Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets. BMC Bioinformatics 13(S-2): S9 (2012)
Applications
Metagenomic data visualization
DR algorithm: stochastic neighbor embedding (SNE)
CC Laczny, N Pinel, N Vlassis, P Wilmes: Alignment-free Visualization of Metagenomic Data by Nonlinear Dimension Reduction. Scientific Reports 4 (2014).
Applications
Neuroscience
DR algorithm: multiple algorithms
J. P. Cunningham and B. M. Yu: Dimensionality reduction for large-scale neural recordings. Nature Neuroscience (2014), doi:10.1038/nn.3776.
Applications
Semantic visualization in data mining
DR algorithm: spherical semantic embedding (SSE).
Tuan M. V. Le, Hady Wirawan Lauw: Semantic visualization for spherical representation. KDD 2014: 1007-1016.
Applications
Visualization of machine learning datasets
DR algorithm: stochastic neighbor embedding (SNE)
Zhirong Yang, Jaakko Peltonen, Samuel Kaski: Scalable Optimization of Neighbor Embedding for Visualization. ICML (2) 2013: 127-135
Transfer Learning in Dimension Reduction
• We are facing huge amounts of unlabeled data nowadays
• Only a few databases are labeled
• Training and test data are often inconsistent with each other
• Transfer learning can help: train in one domain and test in another
• Knowledge is better utilized
Recent Advances: Transfer Learning in DR
Motivation:
• We are facing huge amounts of unlabeled data nowadays
• Only a few databases are labeled
• Knowledge is better utilized
Basic Idea:
Given two data sets A and B, use the knowledge learned from A to
help the learning task for B.
Recent Advances: Transfer Learning in DR
Object Recognition and Face Recognition
Learning Framework
Ming Shao, Carlos Castillo, Zhenghong Gu, Yun Fu: Low-Rank Transfer Subspace Learning. ICDM 2012: 1104-1109.
Recent Advances: Robust Subspace Discovery
• Low-rank matrix recovery: the observation (e.g., noisy images from Twitter) is
decomposed into a low-rank part (clean images) and a sparse error part.
• Subspace Learning: find a low-dimensional projection with specific properties.
Unsupervised (e.g., PCA, LPP) or supervised (e.g., LDA).
• Subspace Clustering: discover the underlying subspaces in a data set and correct
errors. Sparse subspace clustering (SSC), low-rank representation (LRR).
Sheng Li, Yun Fu: Robust Subspace Discovery through Supervised Low-Rank Constraints. SDM 2014: 163-171
Recent Advances: Robust Subspace Discovery
Learning Framework
Sheng Li, Yun Fu: Robust Subspace Discovery through Supervised Low-Rank Constraints. SDM 2014: 163-171
Self-Taught Low-Rank Coding for Visual Learning
• Self-taught Learning (Raina et al., 2007)
– Transferring knowledge from an auxiliary domain with minimum restrictions.
– A special type of transfer learning.
• Objective Function
• Our Motivations and Contributions
– Learn effective feature representations for the target domain.
– A high-quality dictionary bridges the auxiliary domain and the target domain.
– The low-rank constraint characterizes the structure information.
– The first general self-taught learning framework is developed, including
supervised and unsupervised learning tasks.
Self-Taught Low-Rank Coding for Visual Learning
Application I: Subspace Clustering
Application II: Image Classification
Summary
• The motivation of using dimensionality reduction (DR) for
visualization
• DR mainly includes feature selection and feature extraction.
• Two basic linear DR methods: PCA and LDA.
• Nonlinear DR methods: LLE, SNE, etc.
• Applications of DR methods.
• Recent advances in DR.