# Sequence Modeling | Deep Learning Specialization | Coursera

#### Course planning

##### Week 1: Recurrent neural networks

Learn about recurrent neural networks. This type of model has been proven to perform extremely well on temporal data. It has several variants including LSTMs, GRUs and Bidirectional RNNs, which you are going to learn about in this section.

• Lectures: Recurrent neural networks
• C4W1L01 Why sequence models
• C4W1L02 Notation
• C4W1L03 Recurrent neural network model
• C4W1L04 Backpropagation through time
• C4W1L05 Different types of RNNs
• C4W1L06 Language model and sequence generation
• C4W1L07 Sampling novel sequences
• C4W1L08 Vanishing gradients with RNNs
• C4W1L09 Gated recurrent unit (GRU)
• C4W1L10 Long short term memory (LSTM)
• C4W1L11 Bidirectional RNN
• C4W1L12 Deep RNNs
• Practice questions
• C4W1Q1 Recurrent neural networks
• Programming assignments
• C4W1P1 Building a recurrent neural network – Step by step
• C4W1P2 Dinosaur island – Character-level language modeling
##### Week 2: Natural language processing & word embeddings

Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers, you can train recurrent neural networks that achieve outstanding performance in a wide variety of industries. Example applications are sentiment analysis, named entity recognition, and machine translation.

• Lectures: Introduction to word embeddings
• C4W2L01 Word representation
• C4W2L02 Using word embeddings
• C4W2L03 Properties of word embeddings
• C4W2L04 Embedding matrix
• Lectures: Learning word embeddings: Word2vec & GloVe
• C4W2L05 Learning word embeddings
• C4W2L06 Word2vec
• C4W2L07 Negative sampling
• C4W2L08 GloVe word vectors
• Lectures: Applications using word embeddings
• C4W2L09 Sentiment classification
• C4W2L10 Debiasing word embeddings
• Practice questions
• C4W2Q1 Natural language processing & word embeddings
• Programming assignments
• C4W2P1 Operations on word vectors – Debiasing
• C4W2P2 Emojify
##### Week 3: Sequence models & attention mechanism

Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs. This week, you will also learn about speech recognition and how to deal with audio data.

• Lectures: Various sequence to sequence architectures
• C4W3L01 Basic models
• C4W3L02 Picking the most likely sentence
• C4W3L03 Beam search
• C4W3L04 Refinements to beam search
• C4W3L05 Error analysis in beam search
• C4W3L06 Bleu score (optional)
• C4W3L07 Attention model intuition
• C4W3L08 Attention model
• Lectures: Speech recognition – Audio data
• C4W3L09 Speech recognition
• C4W3L10 Trigger word detection
• Lectures: Conclusion
• C4W3L11 Conclusion and thank you
• Practice questions
• C4W3Q1 Sequence models & attention mechanism
• Programming assignments
• C4W3P1 Neural machine translation with attention
• C4W3P2 Trigger word detection

C4W1L01 Why sequence models

C4W1L02 Notation

C4W1L03 Recurrent neural network model

C4W1L04 Backpropagation through time

C4W1L05 Different types of RNNs

C4W1L06 Language model and sequence generation

#### C4W1L07 Sampling novel sequences

• Question: Why $a^{<0>}=0$ and $x^{<0>}=0$?
• Setting $x^{<0>}=0$ and $a^{<0>}=0$ is a modeling assumption: there is no previous input and no previous hidden state.
• $x^{<0>}=0$: no previous input
• $a^{<0>}=0$: no previous hidden state
###### Sampling novel sequences

The output layer is a softmax over the vocabulary. Based on the word distribution produced by the softmax, randomly choose (i.e., sample) one word, for example with np.random.choice. This choosing process is called ‘sampling.’ The sampled word becomes $\hat{y}^{<1>}$.

• Every $\hat{y}^{<t>}$ is sampled by the process above.
• Teacher forcing: during training, the current ground-truth target becomes the next input.
• Sampling a sequence is done only at generation time, not during training.
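The sampling loop described above can be sketched with toy, untrained parameters; everything here (the weight names Wax, Waa, Wya, the vocabulary size, the index chosen to play the role of <EOS>) is illustrative, not the course's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10    # toy vocabulary; index 9 plays the role of <EOS>
hidden_size = 16
EOS = 9

# Toy parameters standing in for a trained RNN (names are illustrative).
Wax = rng.normal(size=(hidden_size, vocab_size)) * 0.1
Waa = rng.normal(size=(hidden_size, hidden_size)) * 0.1
Wya = rng.normal(size=(vocab_size, hidden_size)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(max_len=20):
    # a^{<0>} = 0 and x^{<0>} = 0: no previous hidden state, no previous input.
    a = np.zeros(hidden_size)
    x = np.zeros(vocab_size)
    seq = []
    for _ in range(max_len):                    # stop rule 2: fixed length limit
        a = np.tanh(Wax @ x + Waa @ a)
        y_hat = softmax(Wya @ a)                # distribution over the vocabulary
        idx = rng.choice(vocab_size, p=y_hat)   # the 'sampling' step
        seq.append(int(idx))
        if idx == EOS:                          # stop rule 1: <EOS> generated
            break
        x = np.zeros(vocab_size)                # sampled word becomes next input
        x[idx] = 1.0
    return seq

seq = sample_sequence()
```

The same loop, with np.random.choice replaced by np.argmax, would instead always generate the single most likely sequence.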
###### When do we stop generating a sequence?
1. When the network generates the end of sentence token, i.e., <EOS>.
2. Stop once the sequence reaches a fixed maximum length.

If you do not want the unknown token <UNK> in the output, simply reject any sample that contains it and sample again.

###### Character-level language model
• Vocabulary = [a,b,c,d,…,z,A,B,…,Z,0,…,9,_]
• Pros
• No <UNK>
• Cons
• You end up with much longer sequences.
• Word-level models are better at capturing long-range dependencies in the generated sequence.
• Character-level models are more computationally expensive to train than word-level models.
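A quick illustration of the "much longer sequences" point: tokenizing the same sentence at the word level and at the character level (the sentence is an arbitrary example):

```python
# A character-level model consumes far more time steps than a word-level
# model for the same text, which is the main source of its extra cost.
sentence = "The cat sat on the mat"

word_tokens = sentence.split()   # word-level: one step per word
char_tokens = list(sentence)     # character-level: one step per character

print(len(word_tokens))  # 6
print(len(char_tokens))  # 22
```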

C4W1L09 Gated recurrent unit (GRU)

C4W1L10 Long short term memory (LSTM)

C4W1L11 Bidirectional RNN

C4W1L12 Deep RNNs

#### C4W2L01 Word representation

###### One-hot word representation
• Every pair of distinct one-hot vectors is the same distance apart. It would be better if two similar words were closer together than two unrelated ones.
• The inner product of any two distinct one-hot vectors is zero.
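Both properties can be checked directly in numpy with a toy 4-word vocabulary:

```python
import numpy as np

# One-hot vectors for a toy 4-word vocabulary: the rows of the identity matrix.
V = 4
onehots = np.eye(V)

# Every pair of distinct one-hot vectors is the same distance apart ...
dists = [np.linalg.norm(onehots[i] - onehots[j])
         for i in range(V) for j in range(V) if i != j]
assert np.allclose(dists, np.sqrt(2))   # always sqrt(2), regardless of the words

# ... and has inner product zero, so one-hot codes carry no similarity signal.
assert all(onehots[i] @ onehots[j] == 0
           for i in range(V) for j in range(V) if i != j)
```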
###### Featurized representation: word embedding
• A featurized representation encodes each word as a vector of feature values, where each element expresses the degree to which the word has that feature. For example, Man can be represented as (Gender, Royal, Age, Food) = (-1, 0.01, 0.03, 0.09) and Woman as (Gender, Royal, Age, Food) = (1, 0.02, 0.02, 0.01).
###### Visualizing word embeddings

[van der Maaten and Hinton., 2008. Visualizing data using t-SNE]

• t-SNE is a method to visualize a set of N-dimensional data points in the 2-dimensional plane.

#### C4W2L02 Using word embeddings

• Named entity recognition
• Use bidirectional RNNs
###### Transfer learning and word embeddings
1. Get a word embedding.
   • Learn the word embedding from a large text corpus (1–100B words).
2. Transfer the embedding to a new task with a smaller training set (say, 100k words).
3. Optional: continue to adjust (fine-tune) the word embedding with the new data.

Transfer learning is not applicable for some machine translation tasks.

###### Relation to face encoding
• Face encoding and word embedding are quite similar in terms of clustering based on feature representation.
• On the other hand, face encoding aims to encode arbitrary, previously unseen faces, whereas a word embedding only covers a fixed vocabulary of words.
• It is impossible to embed words not in the vocabulary.

#### C4W2L03 Properties of word embeddings

[Mikolov et. al., 2013, Linguistic regularities in continuous space word representations]

###### Word embedding as analogy reasoning
$e_{\textup{man}}-e_{\textup{woman}}\approx e_{\textup{king}}-e_{w}$

$w=?$ We can find $w$ by solving the following optimization problem.

$w^{*} = \arg \max_{w} \textup{sim}(e_{w}, e_{\textup{king}} - e_{\textup{man}} + e_{\textup{woman}})$

One candidate of $w^{*}$ is queen.

The result above works because each component of the vectors stands for one particular feature. Subtracting one vector from another reduces the corresponding components, while adding enhances them, so the difference vector can be interpreted as a kind of relationship.
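A minimal sketch of this analogy reasoning, using made-up 2-feature embeddings (Gender, Royal); the vectors are illustrative, not learned:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-feature embeddings (Gender, Royal); values are made up for illustration.
emb = {
    "man":   np.array([-1.00, 0.01]),
    "woman": np.array([ 1.00, 0.02]),
    "king":  np.array([-0.95, 0.93]),
    "queen": np.array([ 0.97, 0.95]),
    "apple": np.array([ 0.00, 0.01]),
}

# w* = argmax_w sim(e_w, e_king - e_man + e_woman)
target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in ("king", "man", "woman")]
w_star = max(candidates, key=lambda w: cosine(emb[w], target))
print(w_star)  # queen
```

Excluding the query words themselves from the candidates matters in practice, since $e_{\textup{king}}$ is often the nearest vector to the target.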

###### t-SNE
• t-SNE projects high dimensional vectors into the 2 dimensional plane.
• Only the strongest vector addition/subtraction relationships remain visible on the t-SNE plane; most relationships disappear when projecting onto the 2D plane.
###### Similarity between two word embeddings: Cosine similarity

The cosine similarity between two word vectors: $e_{1}$, $e_{2}$.

$\frac{e_{1} \cdot e_{2}}{\lVert e_{1} \rVert \, \lVert e_{2} \rVert}$
###### Dissimilarity

The dissimilarity between two word vectors: $e_{1}$, $e_{2}$.

$\lVert e_{1} - e_{2} \rVert^{2}$
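Both measures can be written in a few lines of numpy (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(e1, e2):
    # Close to 1 for similar directions, 0 for orthogonal, -1 for opposite.
    return (e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

def dissimilarity(e1, e2):
    # Squared Euclidean distance: 0 for identical vectors, grows as they separate.
    return float(np.sum((e1 - e2) ** 2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, different length

print(cosine_similarity(a, b))  # 1.0 (parallel vectors)
print(dissimilarity(a, a))      # 0.0
```

Note the design difference: cosine similarity ignores vector length and compares only direction, while the squared-distance dissimilarity is sensitive to both.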

#### C4W2L04 Embedding matrix

###### Where is the embedding matrix used?
• Vocabulary size: $|v|$
• One-hot encoded word $j$: $o_j \in \{0,1\}^{|v|}$
• Embedded word $j$: $e_j$
• Embedding size: $|e|$ (hyperparameter)
• Embedding matrix: $E \in \mathbb{R}^{|e| \times |v|}$
• The $j$th column of $E$ is $e_j$.
• All elements of $E$ are trainable and randomly initialized at first.

To get the embedding of word $j$, we compute the following.

$
e_{j}=E \cdot o_{j}
$
• In practice, we do not compute this matrix–vector product. Instead, we look up column $j$ of $E$ directly, which is much cheaper.
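A toy numpy check that the matrix product and the direct column lookup agree (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_size = 6, 3                  # |v| and |e|, toy sizes

# Embedding matrix: randomly initialized, trainable in a real model.
E = rng.normal(size=(emb_size, vocab_size))

j = 2
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                 # one-hot vector for word j

e_j = E @ o_j                                # e_j = E · o_j (the matrix product)

# The product just selects a column, so a direct lookup gives the same vector
# without the O(|e|·|v|) multiply.
assert np.allclose(e_j, E[:, j])
```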

#### C4W2L05 Learning word embeddings

[Bengio et. al., 2003, A neural probabilistic language model]

#### C4W2L06 Word2vec

[Mikolov et. al., 2013. Efficient estimation of word representations in vector space.]

Reading the paper above is recommended.

• Skip-grams
• Two layers: one for the embedding matrix, the other for classification.
• Hierarchical softmax
• Binary tree
• Branching classes into two groups at each depth.
• The more frequent a word, the shallower it sits in the softmax tree.
• How to sample the context $c$?
• If the context is sampled uniformly from the corpus, common words like ‘the’, ‘a’, ‘and’, and ‘to’ can dominate the training set. The sampling should therefore balance common and uncommon words.
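One common way to do this balancing is the subsampling heuristic from the word2vec papers, which keeps word $w$ with probability $\sqrt{t/f(w)}$ (capped at 1) for a small threshold $t$; the frequencies below are made up for illustration:

```python
import math

# Word frequencies f(w) in a toy corpus, as fractions of total tokens
# (values are invented for illustration).
freq = {"the": 0.05, "to": 0.03, "orange": 0.0004, "durian": 0.0001}

t = 1e-3  # subsampling threshold used in the word2vec paper

def keep_prob(w):
    # P(keep w) = sqrt(t / f(w)), capped at 1: frequent words are
    # aggressively down-sampled, rare words are almost always kept.
    return min(1.0, math.sqrt(t / freq[w]))

for w in freq:
    print(w, round(keep_prob(w), 3))
```

Running this shows that 'the' survives only about 14% of the time, while 'orange' and 'durian' are always kept, so rare words are no longer drowned out.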

#### C4W2L07 Negative sampling

[Mikolov et. al., 2013. Distributed representation of words and phrases and their compositionality]

C4W2L08 GloVe word vectors

C4W2L09 Sentiment classification

C4W2L10 Debiasing word embeddings

C4W3L01 Basic models

C4W3L02 Picking the most likely sentence

C4W3L03 Beam search

C4W3L04 Refinements to beam search

C4W3L05 Error analysis in beam search

C4W3L06 Bleu score (optional)

C4W3L07 Attention model intuition

C4W3L08 Attention model

C4W3L09 Speech recognition

C4W3L10 Trigger word detection

C4W3L11 Conclusion and thank you