I am trying to understand the multi-head attention introduced in the paper “Attention Is All You Need”. My reason for studying multi-head attention is to understand the style token layer, which contains multi-head attention and was introduced in the paper “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis”.

Multi-head attention is composed of scaled dot-product attention. Scaled dot-product attention is parameterized by query $Q$, key $K$, and value $V$. The paper “Attention Is All You Need” does not describe what query, key, and value mean, so their meanings have to be found in other literature.
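As a reference point, here is a minimal NumPy sketch of scaled dot-product attention, following the formula $\mathrm{softmax}(QK^T / \sqrt{d_k})V$ from “Attention Is All You Need”. The function name, the array shapes, and the random inputs are my own assumptions for illustration; this is not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    # One score per (query, key) pair, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    # Subtracting the row maximum does not change the softmax result;
    # it only avoids overflow in exp.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the rows of V.
    return weights @ V, weights

# Illustrative sizes only (assumptions): 2 queries, 3 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # (2, 5)
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
```

Each row of `weights` is one attention distribution over the three values, and the matching row of `output` is the corresponding weighted sum.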

The video of Lecture 8 of Stanford CS224N, from 1:22:17, provides an understandable definition of attention. The slides used there to explain the general concept of attention are pages 73 to 76.

**Key statements from the slides:**

- Attention is a technique to compute a weighted sum of the values, dependent on the query.
- [In more general terms] Attention is a way to obtain a *fixed-sized representation of an arbitrary set of representations* (the values), dependent on some other representation (the query).
- The weighted sum is a *selective summary* of the information contained in the values, where the query determines which values to focus on. We sometimes say that the query attends to the values.
- Key components of attention:
  - a query
  - values
  - attention scores
  - attention distribution = attention weights
  - an attention output = a context vector

- Attention variants: attention is classified by whether elements of the query and the values are multiplied or added.
  - **Multiplicative attention**
    - An attention score $e_i$ of multiplicative attention is a weighted sum of products of one query element $s_q$ and one value element $h_{i,p}$: $e_i = \mathbf{s}^T W \mathbf{h}_i = \sum_{p=1}^{d_1} \sum_{q=1}^{d_2} w_{q,p} s_q h_{i,p}$, where $\mathbf{s} \in \mathbb{R}^{d_2}$, $\mathbf{h}_i \in \mathbb{R}^{d_1}$, and $W \in \mathbb{R}^{d_2 \times d_1}$.
    - As the magnitudes of the elements get larger, the magnitude of the attention score gets larger.
    - Multiplicative attention becomes **basic dot-product attention** if $W$ is an identity matrix and $d_1 = d_2$.
  - **Additive attention**
    - Each of $d_3$ weighted sums of the elements of $\mathbf{s}$ and $\mathbf{h}_i$ is fed to a tanh unit. The $d_3$ tanh outputs are then weighted-summed with $\mathbf{v}$, resulting in a scalar, i.e., the attention score.
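To make the two score functions concrete, here is a small NumPy sketch that computes one multiplicative score and one additive score for a single query and a single value. The sizes `d1`, `d2`, `d3`, the random parameters, and the function names are assumptions chosen for illustration. I take the query $\mathbf{s}$ to have $d_2$ elements and the value $\mathbf{h}_i$ to have $d_1$ elements, so $W$ has shape $d_2 \times d_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3 = 4, 3, 5               # illustrative sizes (assumptions)

s = rng.normal(size=d2)            # query
h_i = rng.normal(size=d1)          # one value
W = rng.normal(size=(d2, d1))      # multiplicative attention weights

def multiplicative_score(s, W, h):
    """e_i = s^T W h_i, a scalar."""
    return s @ W @ h

# The matrix form equals the explicit double sum over query and value elements.
double_sum = sum(W[q, p] * s[q] * h_i[p]
                 for p in range(d1) for q in range(d2))
assert np.isclose(multiplicative_score(s, W, h_i), double_sum)

# With W = I and d1 == d2, the score reduces to a plain dot product.
s_same = rng.normal(size=d1)       # a query with the same size as the value
dot_score = multiplicative_score(s_same, np.eye(d1), h_i)
assert np.isclose(dot_score, s_same @ h_i)

def additive_score(s, h, W1, W2, v):
    """e_i = v^T tanh(W1 h_i + W2 s): d3 tanh units, then a weighted sum with v."""
    return v @ np.tanh(W1 @ h + W2 @ s)

W1 = rng.normal(size=(d3, d1))
W2 = rng.normal(size=(d3, d2))
v = rng.normal(size=d3)
e_add = additive_score(s, h_i, W1, W2, v)
print(e_add)
```

One visible difference: because tanh is bounded in $[-1, 1]$, the additive score can never exceed $\sum_k |v_k|$ in magnitude, whereas the multiplicative score grows with the magnitudes of the query and value elements.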

Now it is clear which kind of multi-head attention is used in the Style Token paper. The paper employs multi-head attention made up of multiple additive attention networks, each structured as a multi-layer perceptron. The paper does not state the exact structure of the attention network. However, I presume its structure is the same as Bahdanau’s attention network, which takes as input the concatenation of a reference encoding (a query) and a style token (a value) and has a single 1000-unit tanh hidden layer followed by the attention score layer. For each value $\mathbf{h}_i$, an attention score $e_i$ is computed:

$$e_i = \mathbf{v}^T \tanh(W_1 \mathbf{h}_i + W_2 \mathbf{s})$$

Then, from all $N$ attention scores, softmax gives the attention weights $\alpha_i$.

$$\alpha_i = \mathrm{softmax}(\mathbf{e})_i = \frac{\exp(e_i)}{\sum_{j=1}^{N} \exp(e_j)}$$

Finally, we get a context vector $\mathbf{c}$ as the weighted sum of the $N$ values.

$$\mathbf{c} = \sum_{i=1}^{N} \alpha_i \mathbf{h}_i$$
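The three steps above (scores, softmax, weighted sum) can be sketched for a single attention head in NumPy as follows. This is only a sketch under my presumption about the network structure, not the implementation from the Style Token paper. The function name, the sizes, and the random parameters are all assumptions, and I write `N` for the number of values (style tokens).

```python
import numpy as np

def additive_attention(s, H, W1, W2, v):
    """Single-head additive attention.

    s: (d2,) query, H: (N, d1) values, W1: (d3, d1), W2: (d3, d2), v: (d3,).
    Returns the context vector (d1,) and the attention weights (N,).
    """
    # e_i = v^T tanh(W1 h_i + W2 s), computed for all N values at once.
    scores = np.tanh(H @ W1.T + W2 @ s) @ v      # (N,)
    # Softmax over the N scores; subtracting the maximum leaves the
    # result unchanged and only avoids overflow in exp.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()      # (N,)
    # c = sum_i alpha_i h_i
    context = weights @ H                        # (d1,)
    return context, weights

# Illustrative sizes only (assumptions): 10 values of dimension 8.
rng = np.random.default_rng(0)
N, d1, d2, d3 = 10, 8, 6, 16
s = rng.normal(size=d2)           # stands in for the reference encoding
H = rng.normal(size=(N, d1))      # stands in for the style tokens
W1 = rng.normal(size=(d3, d1))
W2 = rng.normal(size=(d3, d2))
v = rng.normal(size=d3)

context, weights = additive_attention(s, H, W1, W2, v)
print(context.shape)    # (8,)
print(weights.sum())    # the attention weights sum to 1
```

Because the weights are positive and sum to 1, the context vector is a convex combination of the values, which is the "selective summary" from the lecture slides.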