Chapter 8.1 Bahdanau Attention: From Information Compression to Dynamic Retrieval

Author: Brench

Published: 2026-05-19

Modified: 2026-06-22

In earlier sequence models, we have already seen a direct approach: the encoder first reads the full input sequence, compresses the information into a single vector, and then the decoder generates the output step by step based on that vector. In machine translation, this means that the encoder processes the source-language sentence and produces a context representation; the decoder then uses this representation to generate the target-language sentence.

This process matches the most intuitive encoder-decoder paradigm: the input is a sentence, the model encodes the whole sentence into a representation, and then uses that representation to generate another sentence.

The problem is that all information in the input sequence has to be squeezed into a fixed-length vector.

For short sentences, this limitation may not be obvious. Once the sentence becomes longer, the fixed-length vector can easily become an information bottleneck. The encoder’s final hidden state has to summarize the sentence-level meaning while retaining local words, syntactic structure, and long-distance dependencies as much as possible. In effect, we ask the encoder to pack every detail into one vector, and then ask the decoder to rely on the same compressed representation no matter which target word it is generating.

One way to understand this constraint is through an analogy: one person reads an entire article but is allowed to leave only a one-sentence summary; another person then tries to reconstruct the original article word by word from that sentence. This may barely work for a short text. For a longer text, many details will inevitably be lost.

Bahdanau attention starts from exactly this issue: how can we reduce the compression bottleneck caused by a fixed-length context vector?

8.1.1 The Bottleneck of a Fixed-Length Context Vector

First, return to the structure of a traditional seq2seq model.

Assume the sentence to be translated has \(T_x\) tokens. The encoder reads these tokens sequentially and produces a sequence of hidden states:

\[ h_1, h_2, \dots, h_{T_x} \]

In the simplest seq2seq model, the encoder often passes only its last hidden state to the decoder as the representation of the whole input sentence:

\[ c = h_{T_x} \]

Here \(c\) is the context vector. After that, every target token generated by the decoder depends on the same \(c\). In other words, whether the decoder is generating the first word of the target sentence or a later word, the context it receives is still this fixed vector.
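The fixed size of \(c\) can be seen in a minimal sketch. It assumes a vanilla tanh RNN as the encoder and random weights and inputs; `rnn_encode`, `W_x`, `W_h`, and the chosen sizes are hypothetical names and values used only for illustration, not taken from any particular model.

```python
import math
import random


def rnn_encode(inputs, W_x, W_h):
    """Run a vanilla tanh RNN and return every hidden state h_1 .. h_Tx."""
    hidden_size = len(W_h)
    h = [0.0] * hidden_size  # initial hidden state
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(W_x[r][k] * x[k] for k in range(len(x)))
                + sum(W_h[r][k] * h[k] for k in range(hidden_size))
            )
            for r in range(hidden_size)
        ]
        states.append(h)
    return states


random.seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes
W_x = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]

# Random vectors stand in for token embeddings of a 3-token and a 30-token sentence.
short_sentence = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(3)]
long_sentence = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(30)]

c_short = rnn_encode(short_sentence, W_x, W_h)[-1]  # c = h_{T_x}
c_long = rnn_encode(long_sentence, W_x, W_h)[-1]

print(len(c_short), len(c_long))  # prints "4 4"
```

Whether the source has 3 tokens or 30, the decoder receives the same number of values, which is the constraint the rest of this section examines.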

This creates two problems.

The first problem is information compression. The longer the source sentence, the harder it is to squeeze all of its content into a fixed-length vector. Even though LSTMs and GRUs are better than vanilla RNNs at preserving long-range information, this bottleneck does not disappear.

The second problem is the lack of dynamic selection. During translation, different target words often need to refer to different parts of the source sentence. When generating a verb, the model may need predicate information from the source sentence; when generating a noun, it may rely more on a person, place, or object. However, the context vector provided by a traditional seq2seq model is always the same, so the decoder cannot flexibly choose which source positions matter most at each generation step.

From the perspective of human translation, we usually do not read an entire sentence, keep only a global summary, and then translate solely from that summary. A more natural process is to preserve the meaning and position of each word while reading, and then dynamically look back at different positions in the source sentence during translation, selecting the information most relevant to the word being generated.

A more reasonable approach is therefore not to force the whole sentence into a single vector. Instead, we preserve the hidden state produced by the encoder at every position. Each time the decoder generates a word, it dynamically extracts the information it needs from these hidden states according to its current state.

This is the core idea of Bahdanau attention.

8.1.2 Bahdanau Attention: Decide Where to Look During Generation

The key change in Bahdanau attention is that the encoder no longer passes only one fixed context vector. Instead, it keeps the hidden states from all time steps:

\[ h_1, h_2, \dots, h_{T_x} \]

These hidden states can be viewed as contextualized representations of each source position. They are not isolated word vectors; each \(h_i\) is produced after the encoder has read up to position \(i\), so it already carries context from the preceding words. (The original Bahdanau model uses a bidirectional RNN encoder, so each \(h_i\) also summarizes the words that follow.)

Then, when the decoder is about to generate the \(t\)-th target word, the model no longer directly uses the same fixed \(c\). Instead, it computes a separate context vector \(c_t\) for the current step. This \(c_t\) is a weighted sum of all encoder hidden states:

\[ c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i \]

Here \(\alpha_{t,i}\) denotes the attention weight assigned to the \(i\)-th source position when generating the \(t\)-th target word. For example, if \(\alpha_{t,3}\) is large, the current generation step depends more on the third source position. If \(\alpha_{t,7}\) is small, the seventh position contributes less to the current decision.

In this way, the decoder receives a context vector customized for the current word at every generation step:

\[ c_1, c_2, \dots, c_{T_y} \]
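A small worked example may help here. The sketch below computes the weighted sum for two generation steps. The encoder states and the weights are made-up numbers chosen for easy arithmetic, and `context_vector` is a hypothetical helper name.

```python
def context_vector(weights, encoder_states):
    """Compute c_t = sum_i alpha_{t,i} * h_i, one dimension at a time."""
    dim = len(encoder_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]


# Three source positions, each with a 2-dimensional hidden state (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

# Hand-picked weights for two different generation steps; each row sums to 1.
alpha_step_1 = [0.8, 0.1, 0.1]  # mostly looks at source position 1
alpha_step_2 = [0.1, 0.1, 0.8]  # mostly looks at source position 3

c_1 = context_vector(alpha_step_1, encoder_states)  # approximately [0.7, 0.3]
c_2 = context_vector(alpha_step_2, encoder_states)  # approximately [-0.7, 1.7]
print(c_1, c_2)
```

The encoder states are identical in both calls; only the weights differ, and that alone is enough to give each step its own context vector.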

This is clearly different from the traditional seq2seq approach, which uses only one fixed \(c\). Attention turns the context vector into a quantity that changes over time:

What the model is generating now determines where it should look now.

This intuition is the key point of attention. It no longer requires the model to store all information once and for all during encoding. Instead, the decoder can repeatedly look back at the source sentence during generation and select information from different positions according to the current need.

8.1.3 Where Do the Attention Weights Come From?

After understanding the intuition behind attention, the next question is where the weights \(\alpha_{t,i}\) come from.

When the decoder is preparing to generate the \(t\)-th target word, it has a current state. This state can be seen as the model’s generation need at that moment: the previous words have already been generated, and the model now has to decide what the next word should be.

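Assume the most recent decoder hidden state is \(s_{t-1}\), that is, the state after the previous target word was produced and before the \(t\)-th word is generated, and the hidden state at the \(i\)-th encoder position is \(h_i\). One idea is to use a scoring function to measure their relevance: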

\[ e_{t,i} = a(s_{t-1}, h_i) \]

Here \(e_{t,i}\) is an unnormalized attention score that measures how relevant the \(i\)-th source position is to the current decoder state.

Bahdanau attention uses a small feed-forward neural network to compute this score:

\[ e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i) \]

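There is no need to expand the derivation of this formula here. Intuitively, it projects the decoder state \(s_{t-1}\) and an encoder hidden state \(h_i\) into a shared space with the learnable matrices \(W_a\) and \(U_a\), adds the two projections, applies \(\tanh\), and uses the learnable vector \(v_a\) to reduce the result to a single relevance score. Because the two projections are added, this form is also known as additive attention.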
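As a rough illustration, the scoring formula can be written out for a single pair \((s_{t-1}, h_i)\). This is a sketch with hand-set weights: in a real model \(W_a\), \(U_a\), and \(v_a\) are learned, and the helper names `matvec` and `additive_score` as well as all dimensions are assumptions made for this example.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def additive_score(s_prev, h_i, W_a, U_a, v_a):
    """e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i)"""
    projected_s = matvec(W_a, s_prev)  # decoder state in the shared space
    projected_h = matvec(U_a, h_i)     # encoder state in the shared space
    hidden = [math.tanh(a + b) for a, b in zip(projected_s, projected_h)]
    return sum(v * z for v, z in zip(v_a, hidden))


# Illustrative sizes: decoder state 2, encoder state 3, shared space 2.
W_a = [[0.5, -0.2], [0.1, 0.3]]
U_a = [[0.4, 0.0, -0.1], [-0.3, 0.2, 0.6]]
v_a = [1.0, -1.0]

s_prev = [1.0, 2.0]
h_i = [0.5, -1.0, 1.0]

score = additive_score(s_prev, h_i, W_a, U_a, v_a)
print(score)  # one unnormalized scalar for this (decoder state, source position) pair
```

Note that the decoder and encoder states may have different sizes; the two projections exist partly to bring them to a common dimension before they are added.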

Next, applying softmax over the scores for all source positions gives the attention weights:

\[ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})} \]
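The normalization can be sketched in a few lines. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result mathematically unchanged; it is an implementation detail assumed here, not something the formula above requires. The example scores are made up.

```python
import math


def softmax(scores):
    """Turn unnormalized scores e_{t,1..Tx} into weights alpha_{t,1..Tx}."""
    m = max(scores)  # shift so the largest exponent is 0 (avoids overflow)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


scores = [2.0, 1.0, 0.1]  # made-up scores for three source positions
weights = softmax(scores)
print(weights)       # the largest score receives the largest weight
print(sum(weights))  # approximately 1.0

# Large scores would overflow math.exp without the shift.
print(softmax([1000.0, 1001.0]))
```

Softmax preserves the ordering of the scores, so the position with the highest relevance score always ends up with the highest weight.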

Finally, these weights are used to compute a weighted sum of the encoder hidden states, producing the context vector needed at the current step:

\[ c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i \]

The computation in Bahdanau attention can therefore be summarized in three steps:

  1. Compute relevance scores between the current decoder state and each encoder hidden state.
  2. Apply softmax to these scores and convert them into attention weights.
  3. Aggregate the encoder hidden states according to the attention weights to form the context vector for the current step.
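The three steps above can be put together in one function. This is a minimal sketch with randomly initialized parameters rather than a trained model; `bahdanau_step`, `random_matrix`, and every dimension below are hypothetical choices made for illustration.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def bahdanau_step(s_prev, encoder_states, W_a, U_a, v_a):
    """Return (c_t, alpha_t) for one decoder step."""
    # Step 1: relevance score between s_{t-1} and every h_i.
    proj_s = matvec(W_a, s_prev)
    scores = []
    for h in encoder_states:
        proj_h = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(v * z for v, z in zip(v_a, hidden)))
    # Step 2: normalize the scores into attention weights.
    alpha = softmax(scores)
    # Step 3: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    c_t = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(dim)]
    return c_t, alpha


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T_x, enc_dim, dec_dim, att_dim = 5, 4, 3, 6  # illustrative sizes
encoder_states = random_matrix(T_x, enc_dim)  # stand-ins for h_1 .. h_Tx
W_a = random_matrix(att_dim, dec_dim)
U_a = random_matrix(att_dim, enc_dim)
v_a = [random.uniform(-1, 1) for _ in range(att_dim)]

# Two different decoder states query the same encoder states.
s_prev_a = [1.0, 0.0, -1.0]
s_prev_b = [-1.0, 0.5, 1.0]
c_a, alpha_a = bahdanau_step(s_prev_a, encoder_states, W_a, U_a, v_a)
c_b, alpha_b = bahdanau_step(s_prev_b, encoder_states, W_a, U_a, v_a)
print(alpha_a)
print(alpha_b)
```

In practice the term \(U_a h_i\) does not depend on the decoder step, so an implementation can compute it once per source sentence and reuse it; the sketch recomputes it on every call for clarity.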

Essentially, this is dynamic information retrieval. The current decoder state asks “what do I need now”; all encoder hidden states provide candidate information; attention assigns weights according to relevance and aggregates the most useful information back into the decoder.

8.1.4 Why Is It Called Soft Alignment?

Bahdanau attention first appeared in the context of machine translation. Translation exhibits a natural regularity: a word in the target language often corresponds to one or several words in the source language. For example, when translating English into Chinese, generating a Chinese word may depend mainly on a few words in the English sentence. Traditional machine translation calls this correspondence alignment.

The attention weights \(\alpha_{t,i}\) can play a similar role.

For the \(t\)-th position in the target sentence, \(\alpha_{t,i}\) represents its association strength with the \(i\)-th position in the source sentence. If all \(\alpha_{t,i}\) values are plotted as a matrix, we get a visualization similar to an alignment map: rows correspond to target words, columns correspond to source words, and the color intensity of each cell indicates the size of the weight.
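To give a feel for such a map, the sketch below renders a small weight matrix as text, with denser characters standing for larger weights. The sentence pair and every weight are hand-written for illustration and do not come from a trained model; `shade`, `render`, and the character ramp are hypothetical choices.

```python
source = ["the", "cat", "sat"]
target = ["le", "chat", "s'est", "assis"]

# alpha[t][i]: weight on source position i while generating target word t.
# Hand-written illustrative values; each row sums to 1.
alpha = [
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.05, 0.05, 0.90],
]

SHADES = " .:*#"  # from small weight to large weight


def shade(weight):
    """Map a weight in [0, 1] to one character of the ramp."""
    index = min(int(weight * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


def render(alpha, source, target):
    """One text line per target word, one character per source word."""
    lines = []
    for word, row in zip(target, alpha):
        cells = "".join(shade(w) for w in row)
        lines.append(f"{word:>6} |{cells}|")
    return "\n".join(lines)


print("columns:", " ".join(source))
print(render(alpha, source, target))
```

With real models the same matrix is usually drawn as a heat map, where a roughly diagonal band suggests similar word order in the two languages.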

However, this alignment is not a hard one. Hard alignment requires the current target word to correspond to exactly one position in the source sentence. For example, the third target word may align only to the fifth source word. Bahdanau attention uses soft alignment: it does not force the model to choose only one source position, but allows it to assign continuous weights to multiple positions:

\[ \alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,T_x} \]

These weights are non-negative and sum to 1, so every position can contribute part of the information. In other words, the current target word can mainly refer to one source word while also absorbing smaller amounts of information from other related positions. This is what “soft” means.
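The contrast between the two kinds of alignment can be sketched directly. Here hard alignment is modeled as a one-hot choice of the highest-scoring position, which is a simplification assumed for this example; the scores are made up and `hard_alignment` is a hypothetical helper.

```python
import math


def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def hard_alignment(scores):
    """Put all weight on the single best-scoring source position."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


scores = [0.2, 2.5, 1.9, -1.0]  # made-up scores for four source positions

hard = hard_alignment(scores)
soft = softmax(scores)

print(hard)  # [0.0, 1.0, 0.0, 0.0]
print(soft)  # position 2 dominates, but position 3 still receives a clear share
```

Both vectors sum to 1, but only the soft version lets the nearly-as-relevant third position contribute to the context vector.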

The advantage of soft alignment is that it is continuous and differentiable. The model does not need additional labels telling it which word should align with which word, and it does not need to train an independent alignment module first. It can learn this correspondence gradually through backpropagation while training on the translation task itself.
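To make the differentiability claim concrete, the following worked equations sketch how a gradient reaches the scores. The loss symbol \(\mathcal{L}\) and the Kronecker delta \(\delta_{ij}\) (equal to 1 when \(i = j\) and 0 otherwise) are notation introduced here for illustration.

The context vector is linear in the weights, and the softmax has a closed-form derivative:

\[ \frac{\partial c_t}{\partial \alpha_{t,i}} = h_i, \qquad \frac{\partial \alpha_{t,i}}{\partial e_{t,j}} = \alpha_{t,i} \, (\delta_{ij} - \alpha_{t,j}) \]

By the chain rule, the gradient of the loss with respect to a score is:

\[ \frac{\partial \mathcal{L}}{\partial e_{t,j}} = \sum_{i=1}^{T_x} \frac{\partial \mathcal{L}}{\partial \alpha_{t,i}} \, \alpha_{t,i} \, (\delta_{ij} - \alpha_{t,j}), \qquad \frac{\partial \mathcal{L}}{\partial \alpha_{t,i}} = \left( \frac{\partial \mathcal{L}}{\partial c_t} \right)^{\top} h_i \]

Every factor is defined for all values of the scores, so the translation loss can push gradients through \(e_{t,j}\) into the scoring parameters. A hard selection of a single position, by contrast, is piecewise constant in the scores, so its gradient is zero almost everywhere and carries no learning signal.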

In other words, the model does not first learn alignment separately and then learn translation. It learns, during translation training, which source positions should be attended to when generating the current word. This also explains the phrase “jointly learning to align and translate” in the title of the Bahdanau paper: alignment and translation are learned together in the same training process.

8.1.5 What Did Bahdanau Attention Change?

Bahdanau attention is not just a small module inserted into seq2seq. More importantly, it changes how information is passed between the encoder and the decoder.

In traditional seq2seq, there is only one fixed-length information channel between the encoder and the decoder. The encoder must compress all input information into one vector, and the decoder can only rely on this vector to generate the full output.

After adding attention, the communication between the two sides becomes more flexible. The encoder preserves the hidden state at each position. At every generation step, the decoder can revisit these states and aggregate information dynamically according to the current need.

Attention brings at least three important changes.

First, it reduces the information bottleneck caused by a fixed-length vector. Source-sentence information is no longer passed only through the last hidden state, but is jointly provided by all encoder hidden states.

Second, it gives the context vector dynamic retrieval capability. Different target words can have different \(c_t\) values, and the model can focus on different source regions when generating different words.

Finally, it produces an intermediate structure with interpretive value. Although attention weights should not be equated with a complete explanation, in machine translation they can show which source positions the model tends to refer to when generating a target word.

From this perspective, the core of attention is not a particular formula, but an idea:

Do not compress all information into a fixed representation in advance. Retrieve relevant information dynamically when it is needed.

This idea was later extended repeatedly. Initially, it mainly served RNN-based seq2seq encoder-decoder architectures. Later, it was abstracted into the more general query, key, and value form. After that, self-attention moved this dynamic retrieval mechanism inside a single sequence and eventually became the core component of the Transformer.

8.1.6 The Relationship Between Bahdanau Attention and Modern Attention

Using the terminology of modern attention, Bahdanau attention can be understood as an early form of the mechanism.

Modern attention is often described with three concepts: query, key, and value. The query represents what the current position is looking for; the key is what each candidate exposes for matching against the query; the value is the content that is actually retrieved.

If we reinterpret Bahdanau attention in this language, the decoder hidden state \(s_{t-1}\) is similar to the query, while the encoder hidden state \(h_i\) plays the role of both key and value.

It is like a key because the model uses \(h_i\) and \(s_{t-1}\) to compute a matching score:

\[ e_{t,i} = a(s_{t-1}, h_i) \]

It is also like a value because the vectors that are finally weighted and passed back to the decoder are still these \(h_i\):

\[ c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i \]

That is, in Bahdanau attention, the information used for matching and the information retrieved at the end have not yet been clearly separated. In later forms of modern attention, the model usually obtains \(Q\), \(K\), and \(V\) explicitly through different linear transformations, so that “what is used for matching” and “what content is retrieved” can enter two separately learnable representation spaces. The core logic remains unchanged: first compute relevance between the current need and candidate information, then use the relevance weights to aggregate information.
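The separation described above can be sketched as follows. Note that this example swaps the additive scoring function for scaled dot-product scoring, which is the Transformer's choice and not part of Bahdanau attention. The function name `qkv_attention`, the projection matrices `W_q`, `W_k`, `W_v`, and all sizes are assumptions made for illustration, with random values instead of learned ones.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def qkv_attention(query_input, memory, W_q, W_k, W_v):
    """Attend from one query over a list of memory vectors."""
    q = matvec(W_q, query_input)               # what is being looked for
    keys = [matvec(W_k, m) for m in memory]    # used only for matching
    values = [matvec(W_v, m) for m in memory]  # used only as retrieved content
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dec_dim, enc_dim, match_dim, value_dim = 3, 4, 5, 2  # illustrative sizes
W_q = random_matrix(match_dim, dec_dim)
W_k = random_matrix(match_dim, enc_dim)
W_v = random_matrix(value_dim, enc_dim)

s_prev = [0.5, -1.0, 0.2]                  # plays the role of the query input
encoder_states = random_matrix(6, enc_dim)  # play the role of the memory

context, weights = qkv_attention(s_prev, encoder_states, W_q, W_k, W_v)
print(weights)
print(context)  # has value_dim entries, independent of match_dim
```

Because keys and values come from different projections of the same \(h_i\), the space used for matching (`match_dim`) and the space of retrieved content (`value_dim`) can be sized and learned independently.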

This is also why Bahdanau attention is a good starting point for understanding the Transformer. It explains, in the concrete seq2seq translation setting, what problem attention is meant to solve: fixed-length representations are too rigid, while generation needs the ability to dynamically look back at the input.

After this point is clear, cross-attention, self-attention, and multi-head attention no longer look like disconnected concepts. They all answer the same question:

When the model processes one position, how should it dynamically find the most relevant parts from a set of candidate information?

8.1.7 Summary

This section started from the fixed-length context vector in traditional seq2seq and introduced the core idea of Bahdanau attention.

Traditional seq2seq compresses the whole source sentence into one fixed vector. This easily creates an information bottleneck and prevents the decoder from dynamically selecting information when generating different target words. Bahdanau attention preserves the hidden states from all encoder time steps, allowing the decoder to recompute attention weights at every generation step and obtain a context vector specific to the current step.

Intuitively, attention dynamically decides “where to look now” during generation. From the perspective of machine translation, it can also be understood as soft alignment: instead of hard-selecting one source word, the model assigns continuous weights to all source positions and learns this correspondence automatically through end-to-end training.

The point of this section is not to memorize a specific scoring function, but to understand what problem attention tries to solve and what it changes. It moves the model from “compress the input once” to “dynamically retrieve relevant information when needed”. The next section will further abstract this idea into its modern form, discuss cross-attention and self-attention, and formally introduce query, key, and value representations.