Modern Recurrent Neural Networks
:label:chap_modern_rnn
The previous chapter introduced the key ideas
behind recurrent neural networks (RNNs).
However, just as with convolutional neural networks,
there has been a tremendous amount of innovation
in RNN architectures, culminating in several complex
designs that have proven successful in practice.
In particular, the most popular designs
feature mechanisms for mitigating the notorious
numerical instability faced by RNNs,
as typified by vanishing and exploding gradients.
Recall that in :numref:chap_rnn
we dealt
with exploding gradients by applying a blunt
gradient clipping heuristic.
Despite the efficacy of this hack,
it leaves open the problem of vanishing gradients.
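As a reminder, clipping simply rescales the gradient whenever its norm exceeds a threshold, leaving its direction unchanged. The following is a minimal PyTorch-style sketch, not the code from :numref:chap_rnn; the function name and threshold are illustrative placeholders.

```python
import torch

def clip_gradients(model, theta=1.0):
    """Rescale all gradients so that their global L2 norm is at most theta."""
    params = [p for p in model.parameters() if p.grad is not None]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for p in params:
            p.grad[:] *= theta / norm
```

PyTorch packages the same idea as `torch.nn.utils.clip_grad_norm_`. Note that clipping only caps gradients that grow too large; it does nothing for gradients that shrink toward zero.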
In this chapter, we introduce the key ideas behind
the most successful RNN architectures for sequences,
which stem from two papers.
The first, Long Short-Term Memory :cite:Hochreiter.Schmidhuber.1997,
introduces the memory cell, a unit of computation that replaces
traditional nodes in the hidden layer of a network.
With these memory cells, networks are able
to overcome difficulties with training
encountered by earlier recurrent networks.
Intuitively, the memory cell avoids
the vanishing gradient problem
by keeping values in each memory cell's internal state
cascading along a recurrent edge with weight 1
across many successive time steps.
A set of multiplicative gates helps the network
to determine not only which inputs to allow
into the memory state,
but also when the content of the memory state
should influence the model's output.
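To make this intuition concrete, here is a schematic sketch of one gated memory-cell update in plain PyTorch. The weight names are hypothetical placeholders, and the exact parameterization is derived in the LSTM section that follows; the point to notice is that the memory state is updated additively.

```python
import torch

def gated_cell_step(x, h, c, p):
    """One illustrative step of a gated memory cell (schematic sketch).
    x: input at this time step, h: hidden state, c: memory cell state,
    p: dict of (hypothetical) weight matrices and biases."""
    i = torch.sigmoid(x @ p['W_xi'] + h @ p['W_hi'] + p['b_i'])     # input gate
    f = torch.sigmoid(x @ p['W_xf'] + h @ p['W_hf'] + p['b_f'])     # forget gate
    o = torch.sigmoid(x @ p['W_xo'] + h @ p['W_ho'] + p['b_o'])     # output gate
    c_tilde = torch.tanh(x @ p['W_xc'] + h @ p['W_hc'] + p['b_c'])  # candidate memory
    c = f * c + i * c_tilde   # additive update: the path through c is nearly linear
    h = o * torch.tanh(c)     # the output gate decides when c influences the output
    return h, c
```

When the forget gate saturates near 1 and the input gate near 0, the memory state is carried forward essentially unchanged, which is precisely the "recurrent edge with weight 1" described above.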
The second paper, Bidirectional Recurrent Neural Networks :cite:Schuster.Paliwal.1997,
introduces an architecture in which information
from both the future (subsequent time steps)
and the past (preceding time steps)
is used to determine the output
at any point in the sequence.
This is in contrast to previous networks,
in which only past input can affect the output.
Bidirectional RNNs have become a mainstay
for sequence labeling tasks in natural language processing,
among a myriad of other tasks.
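Mechanically, a bidirectional RNN runs one recurrent layer over the sequence from left to right and a second layer from right to left, then combines the two hidden states at every time step. A minimal sketch using PyTorch's built-in GRU layer follows; the sizes are arbitrary placeholders.

```python
import torch
from torch import nn

num_steps, batch_size, num_inputs, num_hiddens = 10, 4, 8, 16
X = torch.randn(num_steps, batch_size, num_inputs)

# One GRU reads the sequence forward, the other reads it reversed.
fwd, bwd = nn.GRU(num_inputs, num_hiddens), nn.GRU(num_inputs, num_hiddens)
H_fwd, _ = fwd(X)                          # summarizes the past at each step
H_rev, _ = bwd(torch.flip(X, dims=[0]))    # summarizes the future at each step
H = torch.cat((H_fwd, torch.flip(H_rev, dims=[0])), dim=-1)
print(H.shape)   # torch.Size([10, 4, 32]): both directions at every time step
```

In practice the same effect is obtained by passing `bidirectional=True` to `nn.GRU` or `nn.LSTM`; the manual version above only makes the flow of information explicit.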
Fortunately, the two innovations are not mutually exclusive,
and have been successfully combined for phoneme classification
:cite:Graves.Schmidhuber.2005
and handwriting recognition :cite:graves2008novel.
The first sections of this chapter explain the LSTM architecture, a lighter-weight variant called the gated recurrent unit (GRU), the key ideas behind bidirectional RNNs, and how RNN layers are stacked together to form deep RNNs. Subsequently, we will explore the application of RNNs in sequence-to-sequence tasks, introducing machine translation along with key ideas such as encoder--decoder architectures and beam search.
:maxdepth: 2
lstm
gru
deep-rnn
bi-rnn
machine-translation-and-dataset
encoder-decoder
seq2seq
beam-search