Multichannel Variable-Size Convolution for
Sentence Classification
- Wenpeng Yin
- Hinrich Schütze
K.Vinay Sameer Raja
IIT Kanpur
INTRODUCTION
Enhance word vector representations by combining several word embedding
versions trained on different corpora.
Extract features of multi-granular phrases using variable-filter-size CNNs.
CNNs have been used to extract features over phrases, but the filter size
is a hyperparameter in such models.
Mutual learning and pre-training are used to enhance MVCNN.
ARCHITECTURE
Multi-Channel Input :
The input layer is a 3-dimensional array of size c x d x s,
where s is the sentence length, d is the word embedding
dimension, and c is the number of embedding versions.
In practice, when using mini-batches, sentences are padded
to the same length, and words unknown in a given embedding
version are initialized randomly in that version.
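As a rough illustration (not from the paper), a minimal sketch of assembling the c x d x s input, assuming each embedding version is a plain dict from word to vector; unknown words get random vectors and short sentences are zero-padded:

    import numpy as np

    def build_multichannel_input(tokens, embeddings, d, s):
        """Stack one d x s matrix per embedding version into a c x d x s array.

        `embeddings` is a list of dicts mapping word -> d-dim vector, one per
        version. Words missing from a version are initialized randomly;
        remaining positions are left as zero padding (an assumption).
        """
        c = len(embeddings)
        x = np.zeros((c, d, s), dtype=np.float32)
        for v, emb in enumerate(embeddings):
            for pos, w in enumerate(tokens[:s]):
                vec = emb.get(w)
                if vec is None:                       # unknown word in this version
                    vec = np.random.uniform(-0.25, 0.25, d)
                x[v, :, pos] = vec
        return x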
Convolution Layer :
The computations in this layer are the same as in a standard CNN,
but additional feature maps are obtained from the variable filter sizes.
Mathematical Formulation :
Denote the jth feature map of layer i by F_i^j and assume layer i-1 has
n feature maps. Let l be the filter size and let the weights form a
matrix V_{i,l}^{j,k}; then

    F_{i,l}^j = Σ_{k=1..n} V_{i,l}^{j,k} * F_{i-1}^k

where * is the convolution operator.
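One simplified way to realize the variable-filter-size convolution is sketched below, treating the c embedding versions as extra Conv1d input channels; the filter sizes (3, 5) and the wide-convolution padding are illustrative assumptions, not values from the poster:

    import torch
    import torch.nn as nn

    class VariableSizeConv(nn.Module):
        """Apply convolutions with several filter sizes to the same c x d x s input."""

        def __init__(self, c, d, n_maps, filter_sizes=(3, 5)):  # sizes are illustrative
            super().__init__()
            self.convs = nn.ModuleList([
                nn.Conv1d(in_channels=c * d, out_channels=n_maps,
                          kernel_size=l, padding=l - 1)          # wide convolution
                for l in filter_sizes
            ])

        def forward(self, x):                      # x: (batch, c, d, s)
            b, c, d, s = x.shape
            x = x.view(b, c * d, s)                # merge channels and embedding dims
            return [conv(x) for conv in self.convs]  # one feature-map set per size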
Pooling Layer :
Normal k-max pooling keeps the k maximum values of a feature map,
preserving their original order.
Dynamic k-max pooling lets the value of k change from layer to layer.
The choice of k for a feature map in layer i is given by

    k_i = max( k_top , ⌈ (L - i) / L * s ⌉ )

where i ∈ {1, . . ., L} is the order of the convolution layer from bottom to top,
L is the total number of convolution layers, and
k_top is an empirically determined constant: the k value used in the
top layer.
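A minimal NumPy sketch of k-max pooling and of the dynamic choice of k under the formula above (names are illustrative):

    import math
    import numpy as np

    def k_max_pool(feature_map, k):
        """Keep the k largest values of a 1-D feature map, preserving their order."""
        idx = np.argsort(feature_map)[-k:]      # indices of the k largest values
        return feature_map[np.sort(idx)]        # restore original ordering

    def dynamic_k(i, L, s, k_top):
        """k_i = max(k_top, ceil((L - i) / L * s)) for convolution layer i of L."""
        return max(k_top, math.ceil((L - i) / L * s))

For example, with L = 3 layers, sentence length s = 20 and k_top = 4, the layers keep k_1 = 14, k_2 = 7 and k_3 = 4 values.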
Hidden Layer :
On top of the final k-max pooling layer, a fully connected layer is
stacked to learn a sentence representation of the required dimension d.
Logistic Regression Layer :
The outputs of the hidden layer are forwarded to a logistic regression
layer for classification.
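A hedged sketch of these last two layers, assuming the pooled feature maps have been flattened into a single vector and that tanh is the hidden activation (an assumption, not stated on the poster):

    import torch
    import torch.nn as nn

    class SentenceClassifierHead(nn.Module):
        """Hidden layer producing a d-dim sentence vector, then logistic regression."""

        def __init__(self, pooled_dim, d, n_classes):
            super().__init__()
            self.hidden = nn.Linear(pooled_dim, d)   # sentence representation
            self.logreg = nn.Linear(d, n_classes)    # logistic regression layer

        def forward(self, pooled_features):          # (batch, pooled_dim)
            sent = torch.tanh(self.hidden(pooled_features))
            return torch.log_softmax(self.logreg(sent), dim=-1)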
MODEL ENHANCEMENTS :
Mutual Learning of Embedding Versions :
As the different embedding versions are trained on different corpora,
some words may not have an embedding in every version.
Let V1, V2, . . ., Vc be the vocabularies of the c embedding versions.
V* = V1 ∪ V2 ∪ . . . ∪ Vc is the total vocabulary of the final embedding.
Vi- = V* \ Vi is the set of words that have no embedding in Vi.
Vij is the overlapping vocabulary between the ith and jth versions.
We project (or learn) embeddings from the ith to the jth version by
    ŵj = fij(wi)
The squared error between wj and ŵj is the training loss to minimize.
The element-wise average of f1i(w1), f2i(w2), . . ., fki(wk) is treated as
the representation in version i of a word w ∈ Vi-.
A total of c(c-1)/2 projections are computed to find embeddings for
every word across all versions.
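A sketch of learning one projection f_ij on the overlap vocabulary Vij, assuming f_ij is a linear map trained with the squared-error loss described above (the optimizer and epoch count are illustrative):

    import torch
    import torch.nn as nn

    def learn_projection(W_i, W_j, epochs=200, lr=0.01):
        """Learn f_ij mapping version-i vectors to version-j vectors on Vij.

        W_i, W_j: (|Vij|, d) tensors of embeddings for the shared words,
        row-aligned. A linear map is assumed here.
        """
        f_ij = nn.Linear(W_i.size(1), W_j.size(1), bias=False)
        opt = torch.optim.SGD(f_ij.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = ((f_ij(W_i) - W_j) ** 2).mean()   # squared-error training loss
            loss.backward()
            opt.step()
        return f_ij

For a word missing from version i, averaging the outputs of the learned projections from the versions that do contain it then yields its vector, as described above.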
Pre-Training :
In pre-training, the sentence representation is used to predict the
component words of the sentence, instead of predicting the sentence
label (Y/N) as in supervised learning.
Given the sentence representation s ∈ R^d and initialized representations
of the 2t context words (t left words and t right words)
w_{i-t}, . . ., w_{i-1}, w_{i+1}, . . ., w_{i+t}, each in R^d, we average
all 2t + 1 vectors element-wise.
Noise-contrastive estimation (NCE) is then used to predict the true middle
word w_i from this averaged (predicted) vector.
In pre-training, initializations are needed for
1. each word of the sentence in the multichannel input layer
(multichannel initialization)
2. each context word as input to the average layer (random
initialization)
3. each target word as the output of the NCE layer (random
initialization)
During pre-training, the model parameters are updated so that they
extract better sentence representations. These parameters are then
fine-tuned in the supervised tasks.
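A rough sketch of the pre-training objective, with negative sampling standing in for the exact NCE formulation; the function name and the number of noise words are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def pretrain_loss(s, context, target_vec, negative_vecs):
        """Predict the middle word from the sentence vector and its 2t context words.

        s:             (d,) sentence representation from the MVCNN
        context:       (2t, d) initialized context-word vectors
        target_vec:    (d,) output-side vector of the true middle word
        negative_vecs: (k, d) output-side vectors of k sampled noise words
        """
        pred = torch.mean(torch.cat([s.unsqueeze(0), context], dim=0), dim=0)  # 2t+1 avg
        pos = F.logsigmoid(pred @ target_vec)               # score the true word
        neg = F.logsigmoid(-negative_vecs @ pred).sum()     # penalize noise words
        return -(pos + neg)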
RESULTS :
Datasets :
Stanford Sentiment Treebank (Socher et al., 2013) - Binary and Fine-grained
Sentiment140 (Go et al., 2009) - Senti140
Subjectivity classification dataset (Pang and Lee, 2004) - Subj
Questions ?
Thank You!