By James Bradbury
In deep learning, there are two very different ways to process input (like an image or a document). Typically, images are processed all at once, with the same kind of computation happening for every part of the image simultaneously. But researchers have usually assumed that you can't do this for text data, and that you instead need to process it the way people read: word by word, from beginning to end, taking context into account as you go. In this paper, we showed that this doesn't have to be true: you can process text all at once like an image, then quickly take context into account and fix the parts that you got wrong. Since the conventional, one-word-at-a-time approach to deep learning for text uses what's called a "recurrent neural network," we called our system a "quasi-recurrent neural network," or QRNN. Because most of the computation happens all at once, in parallel, it's up to 16 times faster than the old approach. Amazingly, it also gives better results than conventional deep-learning models on all three tasks that we tried (sentiment analysis, next-word prediction, and translation).
The diagram above shows the basic structure of our model. A continuous block of red means that complicated, slow computations can proceed in parallel (i.e., much faster). Blue signifies fast, simple functions (in the case of the QRNN, the components that "fix" the parts the out-of-context red computations got wrong). Depending on the length of text inputs we need to process (whether they're sentences, paragraphs, etc.) and other properties of the dataset, our architecture, shown on the right, runs anywhere between 30% faster and 16 times faster than the architecture usually used for text, shown on the left. The advantage is highest for long sequences like paragraphs.
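The split between the two stages can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the code behind the results: the function names `candidate_and_gate` and `f_pool` and the weight values are made up for the example, the input is a single number per word rather than a vector, and the sequential stage uses the simplest pooling rule from the paper, h_t = f_t * h_{t-1} + (1 - f_t) * z_t.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def candidate_and_gate(x_prev, x_curr, w):
    """'Red' stage: looks only at the inputs (a width-2 window), never at
    earlier outputs, so every timestep could be computed in parallel."""
    wz_prev, wz_curr, wf_prev, wf_curr = w
    z = math.tanh(wz_prev * x_prev + wz_curr * x_curr)  # candidate value
    f = sigmoid(wf_prev * x_prev + wf_curr * x_curr)    # forget gate in (0, 1)
    return z, f


def f_pool(zs, fs, h0=0.0):
    """'Blue' stage: a cheap sequential pass that blends each candidate
    with the running state, h_t = f_t * h_{t-1} + (1 - f_t) * z_t."""
    h = h0
    hs = []
    for z, f in zip(zs, fs):
        h = f * h + (1.0 - f) * z
        hs.append(h)
    return hs


xs = [0.5, -1.0, 2.0, 0.0]   # toy input sequence, one number per word
w = (0.3, 0.8, -0.5, 1.2)    # hypothetical weights, chosen arbitrarily

padded = [0.0] + xs          # left-pad so the first word has a "previous" input
pairs = [candidate_and_gate(padded[t], padded[t + 1], w) for t in range(len(xs))]
zs = [z for z, _ in pairs]
fs = [f for _, f in pairs]

hs = f_pool(zs, fs)
print(hs)
```

The list comprehension that builds `pairs` has no dependence between timesteps, which is what lets the expensive part run in parallel; only the short loop inside `f_pool` has to go word by word.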
We compared our new building block with the traditional LSTM architecture (a special type of recurrent neural network) by building a pair of models for each of three tasks. In each case, the two models are identical except that one uses the LSTM and the other uses the QRNN. In all three cases, the one with the QRNN performs better:
Sentiment analysis:

| Model | Time to Run Through Dataset Once (s) | Test Accuracy (%) |
|---|---|---|
| Deeply Connected 4-layer LSTM | 480 | 90.9 |
| Deeply Connected 4-layer QRNN | 150 | 91.4 |

Next-word prediction:

| Model | Time to Run Through Dataset Once (s) | Test Perplexity (Lower is Better) |
|---|---|---|
| Ordinary 2-layer LSTM | 128 | 82.0 |
| Ordinary 2-layer QRNN | 66 | 79.9 |
| 2-layer LSTM with State-of-the-Art Regularization | — | 78.9 |
| 2-layer QRNN with Zoneout Regularization | 66 | 78.3 |

Translation:

| Model | Time to Run Through Dataset Once (hrs) | Test BLEU Score |
|---|---|---|
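For a quick sense of scale, the per-epoch times in the first two tables can be turned into speedup factors. The snippet below only divides the numbers already listed; the dictionary keys are labels made up for this example, and the comparison uses the unregularized LSTM and QRNN rows.

```python
# Seconds per pass over the dataset, copied from the tables above.
epoch_seconds = {
    "accuracy table": {"LSTM": 480, "QRNN": 150},    # 4-layer models
    "perplexity table": {"LSTM": 128, "QRNN": 66},   # ordinary 2-layer models
}

# Speedup = LSTM time / QRNN time for the same task.
speedups = {
    name: times["LSTM"] / times["QRNN"]
    for name, times in epoch_seconds.items()
}

for name, factor in speedups.items():
    print(f"{name}: QRNN is {factor:.2f}x faster per epoch")
```

Both factors (about 3.2x and 1.9x) fall inside the range of 30% to 16 times faster quoted earlier.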
Deep learning models are often said to be "black boxes": You feed data in, and results come out without a human-readable explanation of why the neural network decided to produce that particular output. In principle, it is always possible to attempt to understand the internal workings of a neural network by looking at the individual neuron activations, but this hasn't been very productive—especially for natural language tasks. One reason why it's so difficult to assign a meaning to individual neurons is that traditional approaches like the LSTM allow every neuron's activation for one word in a sentence to depend on every other neuron's activation for the previous word. So the activations of all the neurons mix together with each other, and it's unlikely for any one neuron to have a single well-defined meaning.
This new QRNN approach may help interpretability of neurons, because each neuron's activation doesn't depend at all on the past history of any other neurons. This means that neurons are more likely, although not guaranteed, to have independent and well-defined meanings, and these meanings are more likely to be simpler and more human-interpretable. One way to see this is to plot the activations of all neurons in one layer as the QRNN reads an input paragraph from the sentiment analysis dataset. The input is a movie review from IMDb.com which contains some positive and some negative commentary about the movie. Individually, each neuron measures some aspect of positive or negative sentiment and isn't directly affected by the activations of other neurons; together, they make it clear how the network perceives the positive and negative swings in the writer's comments.
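The independence claim concerns the recurrent part of the layer, and it can be checked on a toy example. The sketch below is an illustration under assumptions, not the model from the paper: `elementwise_pool` follows the paper's simplest pooling rule, while `dense_recurrence` is a deliberately simplified stand-in for an LSTM-style update in which every neuron reads every neuron's previous state. All values and the mixing matrix `U` are arbitrary.

```python
import math


def elementwise_pool(zs, fs):
    """QRNN-style pooling: neuron i's state only ever sees neuron i's own past."""
    h = [0.0] * len(zs[0])
    out = []
    for z, f in zip(zs, fs):
        h = [fi * hi + (1.0 - fi) * zi for zi, fi, hi in zip(z, f, h)]
        out.append(h)
    return out


def dense_recurrence(xs, U):
    """Simplified conventional recurrence: each neuron's new state mixes in
    the previous state of every neuron through the matrix U."""
    n = len(xs[0])
    h = [0.0] * n
    out = []
    for x in xs:
        h = [math.tanh(x[i] + sum(U[i][j] * h[j] for j in range(n)))
             for i in range(n)]
        out.append(h)
    return out


def channel(seq, i):
    """Trajectory of neuron i across all timesteps."""
    return [h[i] for h in seq]


# Two neurons, three timesteps; the numbers are arbitrary.
zs = [[0.9, 0.1], [-0.5, 0.4], [0.2, -0.7]]
fs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.9]]
U = [[0.5, 0.5], [0.5, 0.5]]

# Perturb only neuron 1's value at the first timestep.
zs_perturbed = [[0.9, -0.9], [-0.5, 0.4], [0.2, -0.7]]

same_elementwise = (channel(elementwise_pool(zs, fs), 0)
                    == channel(elementwise_pool(zs_perturbed, fs), 0))
same_dense = (channel(dense_recurrence(zs, U), 0)
              == channel(dense_recurrence(zs_perturbed, U), 0))

print("neuron 0 unchanged under elementwise pooling:", same_elementwise)
print("neuron 0 unchanged under dense recurrence:", same_dense)
```

Under elementwise pooling, neuron 0 follows exactly the same trajectory whether or not neuron 1 is perturbed; under the dense recurrence the perturbation leaks into neuron 0 from the second timestep onward.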
Visualization of neuron activations for the last QRNN layer of a network processing the sentiment of an IMDb movie review. Time (the number of words read) is on the horizontal axis; different neurons are along the vertical axis. Colors denote neuron activations. After an initial positive statement, "This movie is simply gorgeous" (at timestep 9), timestep 117 triggers a reset of many neurons towards negative sentiment due to the phrase "not exactly a bad story" (soon after "main weakness is its story"). Only at timestep 158, after "I recommend this movie to everyone, even if you've never played the game," do the neurons recover. The (correct) positive prediction prevails.
We're happy to see community interest in the QRNN architecture. To help people develop their own implementations, we've embedded the core of ours, written in Chainer, below.
STRNNFunction is a CUDA implementation of the forward and backward passes of the recurrent pooling function, while QRNNLayer implements a QRNN layer composed of convolutional and pooling subcomponents, with optional attention and state-saving features for the three tasks described in the paper.
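For readers who want to see how the convolutional and pooling subcomponents fit together without Chainer or CUDA, here is a plain-Python reference sketch. It is not the STRNNFunction or QRNNLayer code described above and makes no attempt at speed. It follows the equations in the paper as we understand them: a masked (causal) convolution produces candidates Z = tanh(W_z * X) and gates F = sigmoid(W_f * X), O = sigmoid(W_o * X), followed by pooling with c_t = f_t * c_{t-1} + (1 - f_t) * z_t and h_t = o_t * c_t. The function names, the weight layout, and the random initialization are assumptions made for the example; attention and state saving are left out.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def masked_conv(xs, weights, k):
    """Causal convolution: the output at time t sees inputs t-k+1 .. t only.

    xs:      list of T input vectors, each of length n_in
    weights: weights[c][j][i] for output channel c, filter offset j, input i
    """
    n_in = len(xs[0])
    padded = [[0.0] * n_in for _ in range(k - 1)] + xs  # left zero-padding
    out = []
    for t in range(len(xs)):
        window = padded[t:t + k]
        out.append([
            sum(w_c[j][i] * window[j][i] for j in range(k) for i in range(n_in))
            for w_c in weights
        ])
    return out


def fo_pool(Z, F, O):
    """Elementwise recurrent pooling with forget and output gates."""
    c = [0.0] * len(Z[0])
    hs = []
    for z, f, o in zip(Z, F, O):
        c = [fi * ci + (1.0 - fi) * zi for zi, fi, ci in zip(z, f, c)]
        hs.append([oi * ci for oi, ci in zip(o, c)])
    return hs


def qrnn_layer(xs, Wz, Wf, Wo, k=2):
    """One QRNN layer: three masked convolutions, then fo-pooling."""
    Z = [[math.tanh(v) for v in row] for row in masked_conv(xs, Wz, k)]
    F = [[sigmoid(v) for v in row] for row in masked_conv(xs, Wf, k)]
    O = [[sigmoid(v) for v in row] for row in masked_conv(xs, Wo, k)]
    return fo_pool(Z, F, O)


def random_weights(rng, n_out, k, n_in):
    """Hypothetical initialization, for demonstration only."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
             for _ in range(k)] for _ in range(n_out)]


rng = random.Random(0)
n_in, n_hidden, k, T = 3, 4, 2, 5
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(T)]
Wz, Wf, Wo = (random_weights(rng, n_hidden, k, n_in) for _ in range(3))

hs = qrnn_layer(xs, Wz, Wf, Wo, k)
print(len(hs), "timesteps x", len(hs[0]), "hidden units")
```

Because the convolution is masked, changing a later word cannot alter the outputs for earlier words, which is what lets the layer be used for next-word prediction.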
James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-Recurrent Neural Networks.