A brief glance at Convolutional Neural Networks (CNNs)
A CNN is basically a feed-forward neural network made up of several kinds of layers: convolution layers, pooling layers, and the densely connected layers that we are already familiar with.
Firstly, as seen in the above picture, we feed the data (an image in this case) into the convolution layer. The convolution layer works by sliding a window across the input data. As it slides, the window (the filter) multiplies its weights element-wise with the underlying data that falls inside the window and sums the products into a single number. When you collect all of these numbers, you have a condensed output in another matrix (we call it a feature map).
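To make the sliding window concrete, here is a minimal sketch in plain Python. The 4 x 4 "image" and the 2 x 2 filter values are made up for illustration, and the function name conv2d_valid is hypothetical. It assumes a stride of 1 and no padding; real frameworks also learn the filter weights rather than hard-coding them.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the patch under it, element by element, and sum.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy input and filter (illustrative values only).
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[2, 5, 1], [1, 1, 3], [2, 2, 2]]
```

The 4 x 4 input shrinks to a 3 x 3 feature map because a 2 x 2 window only fits in 3 positions along each axis.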
With the resulting matrix at hand, you do max pooling, which basically down-samples it, or in other words shrinks the matrix while keeping its strongest signals.
Consider the simplified image of the max pooling operation above. In that example, we slide a 2 x 2 filter window across our data in strides of 2. As it slides, it grabs the maximum value in each window and puts it into a smaller matrix.
There are different ways to down-sample the data, such as min-pooling and average-pooling. In max-pooling, you simply take the maximum value in the window. Imagine that you have a list: [1,4,0,8,5]. When you do max-pooling over this whole list, you retain only the value “8”. In effect, we care only about the existence of 8, not its location. Despite its simplicity, it works quite well, and it’s a pretty nifty way to reduce the data size.
Finally, you can feed the down-sized, pooled matrix to a densely connected layer, which eventually leads to a prediction.
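Here is a rough sketch of both flavours of max pooling described above: a 2 x 2 window moved in strides of 2 over a matrix, and a single max over the whole list. The matrix values and the function name max_pool2d are made up for illustration.

```python
def max_pool2d(matrix, size=2, stride=2):
    """Keep the largest value in each `size` x `size` window, moving `stride` cells at a time."""
    pooled = []
    for i in range(0, len(matrix) - size + 1, stride):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy 4 x 4 matrix (illustrative values only).
matrix = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool2d(matrix)
print(pooled)  # [[6, 5], [7, 9]]

# Pooling over the whole list from the text keeps only the largest value.
values = [1, 4, 0, 8, 5]
global_max = max(values)
print(global_max)  # 8
```

Notice that the pooled output records that a 9 exists somewhere in the bottom-right window, but not which of the four cells it came from.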
How does this apply to NLP in our case?
Now, forget about real pixels for a minute and imagine using each tokenized character as a kind of pixel in our input matrix. Just like word vectors, we could also have character vectors that give a lower-dimensional representation. So for a list of 10 sentences of 50 characters each, a 30-dimensional embedding lets us feed a 10x50x30 tensor into our convolution layer.
Looking at the above picture, let’s just focus (for now) on 1 sentence instead of a list. Each character is represented by a row (8 characters), and each embedding dimension is represented by a column (5 dimensions) in this starting matrix.
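As a quick sanity check on those dimensions, here is a small sketch in plain Python. The toy sentences, the padding scheme (index 0 for padding and unknown characters), and the randomly initialized embedding table are all assumptions for illustration; in a real model the embedding values are learned.

```python
import random

random.seed(0)

NUM_SENTENCES, MAX_LEN, EMBED_DIM = 10, 50, 30

# Toy data: 10 short sentences (hypothetical, for shape-checking only).
sentences = ["hello world"] * NUM_SENTENCES

# Character vocabulary; index 0 is reserved for padding / unknown characters.
vocab = {ch: idx + 1 for idx, ch in enumerate(sorted(set("".join(sentences))))}


def encode(sentence, max_len=MAX_LEN):
    """Turn a sentence into exactly `max_len` character ids, truncating or zero-padding."""
    ids = [vocab.get(ch, 0) for ch in sentence[:max_len]]
    return ids + [0] * (max_len - len(ids))


# One random EMBED_DIM-dimensional vector per character id.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab) + 1)
]

# Look up the vector of every character of every sentence.
batch = [[embedding_table[i] for i in encode(s)] for s in sentences]

shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (10, 50, 30)
```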
You would begin the convolution process by using filters of different dimensions to “slide” across your initial matrix to get a lower-dimensional feature map. There’s something I deliberately left out earlier: filters.
The sliding windows that I mentioned earlier are actually filters that are designed to capture different distinctive features in the input data. By defining the dimensions of the filter, you control the window of information you want to “summarize”. To translate this back to the picture, each of the feature maps could contain 1 high-level representation of the embeddings of the characters under the filter.
Next, we would apply max pooling to get the maximum value in each feature map. In our context, this means that for each filter, the window of characters that produced the strongest response is the one that gets selected. As usual, we would then feed the pooled values into a normal densely connected layer whose output goes through a softmax function, which gives the probability of each class.
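The whole pipeline for one sentence can be sketched in plain Python as follows. This is only an illustration under several assumptions: the 8 x 5 input matches the picture, but the filter widths (2, 3 and 4 characters), the choice to let each filter span all 5 embedding dimensions, the 2 output classes, and the random untrained weights are all made up here.

```python
import math
import random

random.seed(0)

SEQ_LEN, EMBED_DIM, NUM_CLASSES = 8, 5, 2
FILTER_WIDTHS = [2, 3, 4]  # number of characters each filter covers (assumed)

# One sentence: 8 characters x 5 embedding dimensions (random stand-in values).
sentence = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(SEQ_LEN)]


def conv1d(seq, filt):
    """Slide a (width x EMBED_DIM) filter down the character axis, one step at a time."""
    width = len(filt)
    return [
        sum(seq[t + k][d] * filt[k][d] for k in range(width) for d in range(len(filt[0])))
        for t in range(len(seq) - width + 1)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One random filter per width; each spans the full embedding dimension.
filters = [
    [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(width)]
    for width in FILTER_WIDTHS
]

feature_maps = [conv1d(sentence, f) for f in filters]  # lengths 7, 6 and 5
pooled = [max(fm) for fm in feature_maps]              # one value per filter

# Dense layer (random, untrained weights) followed by softmax.
dense_w = [[random.uniform(-1, 1) for _ in pooled] for _ in range(NUM_CLASSES)]
dense_b = [0.0] * NUM_CLASSES
logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(dense_w, dense_b)]
probs = softmax(logits)

print([len(fm) for fm in feature_maps])
print(probs)
```

Because max pooling keeps exactly one number per filter, the dense layer always sees a fixed-size input no matter how long the feature maps are.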
Note that my explanation hides some technical details to facilitate understanding. There’s a whole load of things that you could tweak in a CNN. For instance, the stride size, which determines how far the filter moves between applications (and therefore how many times it is applied), narrow vs. wide convolution, etc.
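To get a feel for how these knobs change the size of a feature map, here is a small helper. It assumes the usual convention that a "narrow" convolution uses no padding, while a "wide" convolution zero-pads the input with (filter width - 1) positions on each side; the function name conv_output_length is hypothetical.

```python
def conv_output_length(input_len, filter_width, stride=1, wide=False):
    """Number of positions at which the filter is applied along one axis."""
    # Assumed convention: wide = zero-pad (filter_width - 1) on both sides, narrow = no padding.
    padding = filter_width - 1 if wide else 0
    return (input_len + 2 * padding - filter_width) // stride + 1


# 8 characters, filter covering 3 characters.
print(conv_output_length(8, 3))             # 6  (narrow, stride 1)
print(conv_output_length(8, 3, wide=True))  # 10 (wide, stride 1)
print(conv_output_length(8, 3, stride=2))   # 3  (narrow, stride 2)
```

A larger stride means fewer filter applications and a shorter feature map, while wide convolution gives the characters at the edges as many chances to be seen as the ones in the middle.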
Okay! Let’s see how we could implement CNN in our competition.