A **conditional language model** assigns probabilities to sequences of words, \(w = (w_1, w_2, …, w_l)\), given some conditioning context, **x**.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability: \[ p(w|x) = \prod_{t=1}^lp(w_t|x, w_1, w_2, ..., w_{t-1}) \]
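
The chain-rule decomposition turns directly into a scoring routine: sum the conditional log-probabilities of each word given the context and the prefix. A minimal sketch (the `toy_model` with its uniform 4-word distribution is a hypothetical stand-in, not from the notes):

```python
import math

def sequence_log_prob(cond_log_prob, x, w):
    """Score a sequence w under a conditional LM via the chain rule:
    log p(w|x) = sum_t log p(w_t | x, w_<t)."""
    return sum(cond_log_prob(x, tuple(w[:t]), w[t]) for t in range(len(w)))

# Hypothetical toy model: uniform over a 4-word vocabulary, ignoring context.
def toy_model(x, prefix, w_t):
    return math.log(1 / 4)

lp = sequence_log_prob(toy_model, "bonjour", ("hello", "</s>"))  # 2 * log(1/4)
```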

| \(x\) ("input") | \(w\) ("text output") |
|---|---|
| An author | A document written by that author |
| A topic label | An article about that topic |
| {SPAM, NOT_SPAM} | An email |
| A sentence in French | Its English translation |
| A sentence in English | Its French translation |
| A sentence in English | Its Chinese translation |
| An image | A text description of the image |
| A question + a document | Its answer |
| A question + an image | Its answer |

To **train** conditional language models, we need paired samples, \(\{(x_i, w_i)\}\).

**Algorithmic challenges**

We often want to find the most likely \(w\) given some \(x\). Unfortunately, this is generally an *intractable problem*: \[
w^* = \arg \max_w p(w|x)
\] We therefore approximate it with a **beam search**, or with Monte Carlo methods, since drawing samples \(w^{(i)} \sim p(w|x)\) is often computationally easy.
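
The Monte Carlo approach can be sketched as ancestral sampling followed by rescoring: draw candidate sequences word by word from the model, then keep the highest-scoring draw. The `step_probs` interface (a function returning a token-to-probability dict for the next word) and the toy model are hypothetical illustrations:

```python
import math
import random

def sample_and_rescore(step_probs, x, n_samples=50, max_len=10, eos="</s>"):
    """Approximate argmax_w p(w|x): draw w^(i) ~ p(w|x) by ancestral
    sampling, then return the sample with the highest log-probability."""
    best, best_lp = None, -math.inf
    for _ in range(n_samples):
        prefix, lp = [], 0.0
        for _ in range(max_len):
            dist = step_probs(x, tuple(prefix))
            tokens, probs = zip(*dist.items())
            w_t = random.choices(tokens, weights=probs)[0]
            lp += math.log(dist[w_t])
            prefix.append(w_t)
            if w_t == eos:
                break
        if lp > best_lp:
            best, best_lp = prefix, lp
    return best, best_lp

# Hypothetical toy next-word distribution: usually emits "a", then ends.
def toy_model(x, prefix):
    return {"a": 0.9, "</s>": 0.1} if not prefix else {"</s>": 1.0}

best, lp = sample_and_rescore(toy_model, "some input")
```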

**Evaluating conditional LMs**

We can use **cross-entropy** or **perplexity**: easy to implement, but hard to interpret.

**Task-specific evaluation**. Compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric \(L\): \[
w^* = \arg \max_wp(w|x)\ \ \ \ \ L(w^*, w_{ref})
\] Examples of \(L\): BLEU, METEOR, WER, ROUGE. These are easy to implement, but only okay to interpret.
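
Of the metrics above, WER is the simplest to state precisely: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference length. A minimal implementation:

```python
def wer(hyp, ref):
    """Word error rate: Levenshtein distance over words between the
    hypothesis and the reference, normalized by reference length."""
    h, r = hyp.split(), ref.split()
    # One-row dynamic program for edit distance.
    d = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # delete hypothesis word
                       d[j - 1] + 1,        # insert reference word
                       prev + (hw != rw))   # substitute (or match)
            prev = cur
    return d[len(r)] / len(r)

score = wer("the cat sat on mat", "the cat sat on the mat")  # 1 error / 6 words
```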

**Encoder-Decoder**

Two questions:

- How do we encode \(x\) as a fixed-size vector, \(c\)?
- How do we condition on \(c\) in the decoding model?

## Kalchbrenner and Blunsom 2013

Encoder \[ c = embed(x) \]

\[ s = Vc \]

Recurrent decoder \[ h_t = g(W[h_{t-1}; w_{t-1}] + s + b) \]

\[ u_t = Ph_t + b' \]

\[ p(w_t|x, w_{<t}) = softmax(u_t) \]
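
One decoder step of this model can be sketched in numpy. The conditioning term \(s = Vc\) is added into the recurrence at every step; all dimensions and the random parameters are hypothetical stand-ins, and \(g\) is taken to be \(\tanh\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_w, d_c, vocab = 8, 8, 5, 20   # hypothetical sizes

V = rng.normal(size=(d_h, d_c))      # maps encoding c to conditioning term s
W = rng.normal(size=(d_h, d_h + d_w))
b = rng.normal(size=d_h)
P = rng.normal(size=(vocab, d_h))
b2 = rng.normal(size=vocab)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decoder_step(c, h_prev, w_prev):
    """h_t = g(W [h_{t-1}; w_{t-1}] + s + b), with s = V c; then
    u_t = P h_t + b' and a softmax over the vocabulary."""
    s = V @ c
    h_t = np.tanh(W @ np.concatenate([h_prev, w_prev]) + s + b)
    return h_t, softmax(P @ h_t + b2)

c = rng.normal(size=d_c)                        # stand-in for c = embed(x)
h, p = decoder_step(c, np.zeros(d_h), rng.normal(size=d_w))
```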

**CSM Encoder**

How should we define \(c = embed(x)\) ?

Convolutional sentence model (CSM)

Good

- By stacking them, longer range dependencies can be learnt
- Convolutions learn interactions among features in a local context
- Deep ConvNets have a branching structure similar to trees, but no parser is required

Bad

- Sentences have different lengths, need different depth trees; convnets are not usually so dynamic, but see Kalchbrenner et al. (2014). A convolutional neural network for modelling sentences. In Proc. ACL.

## Sutskever et al. (2014)

LSTM encoder

- \((c_0, h_0)\) are parameters
- \((c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})\)

The encoding is \((c_l, h_l)\) where \(l = |x|\)

LSTM decoder

- \(w_0 = \langle s \rangle\)
- \((c_{t+l}, h_{t+l}) = LSTM(w_{t-1}, c_{t+l-1}, h_{t+l-1})\)
- \(u_t = Ph_{t+l} + b\)
- \(p(w_t|x, w_{<t}) = softmax(u_t)\)
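
The whole encoder-decoder pipeline can be sketched with a minimal numpy LSTM cell: run the cell over the source to get \((c_l, h_l)\), then continue from that state with the decoder's parameters. Sizes, the random "sentence", and the small vocabulary are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_in = 6, 4   # hypothetical hidden and embedding sizes

def make_lstm(d_in, d):
    return {"W": rng.normal(scale=0.1, size=(4 * d, d_in + d)),
            "b": np.zeros(4 * d)}

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(params, x_t, c_prev, h_prev):
    """(c_t, h_t) = LSTM(x_t, c_{t-1}, h_{t-1}) with the usual gates."""
    z = params["W"] @ np.concatenate([x_t, h_prev]) + params["b"]
    i, f, o, g = np.split(z, 4)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return c_t, h_t

enc, dec = make_lstm(d_in, d), make_lstm(d_in, d)

# Encoder: fold the source into (c_l, h_l); (c_0, h_0) start at zero here.
c, h = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(3, d_in)):     # a 3-word "source sentence"
    c, h = lstm_step(enc, x_t, c, h)

# Decoder: continue from the encoder's final state.
P = rng.normal(scale=0.1, size=(10, d))    # hypothetical 10-word vocabulary
w_prev = rng.normal(size=d_in)             # embedding of <s>
c, h = lstm_step(dec, w_prev, c, h)
u = P @ h
p = np.exp(u - u.max()); p /= p.sum()      # softmax over the next word
```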

Good

- RNNs deal naturally with sequences of various lengths
- LSTMs can, in principle, propagate gradients over long distances
- Very simple architecture

Bad

- The hidden state has to remember a lot of information!

**Tricks**

Read the input sequence "backwards": **+4 BLEU**

Use an ensemble of \(J\) **independently trained** models:

- Ensemble of 2 models: **+3 BLEU**
- Ensemble of 5 models: **+4.5 BLEU**

**A word about decoding**

In general, we want to find the most probable (MAP) output given the input, i.e. \[
\begin{align}
w^* = \arg\max_{w}p(w|x) = \arg \max_w\sum_{t=1}^{|w|}\log p(w_t|x, w_{<t})
\end{align}
\] This is, for general RNNs, a hard problem. We therefore approximate it with a **greedy search**: \[
\begin{array}{lcl}
w_1^* = \arg\max_{w_1}p(w_1|x)\\
w_2^* = \arg\max_{w_2}p(w_2|x, w_1^*)\\
...\\
w^*_t = \arg\max_{w_t}p(w_t|x, w^*_{<t})
\end{array}
\] A slightly better approximation is a **beam search** with beam size \(b\). Key idea: **keep track of the top \(b\) hypotheses**. Using beam search: **+1 BLEU**
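
A minimal beam search sketch: at each step, expand every live hypothesis, keep the top \(b\) by cumulative log-probability, and set finished hypotheses aside. The `step_log_probs` interface and the toy model are hypothetical illustrations:

```python
import math

def beam_search(step_log_probs, x, b=3, max_len=10, eos="</s>"):
    """Keep the top-b partial hypotheses at each step instead of only the
    single best (greedy) one. `step_log_probs(x, prefix)` returns a dict
    token -> log p(token | x, prefix)."""
    beams = [((), 0.0)]                    # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok, tok_lp in step_log_probs(x, prefix).items():
                candidates.append((prefix + (tok,), lp + tok_lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:b]:
            (finished if prefix[-1] == eos else beams).append((prefix, lp))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Hypothetical toy model: the best full sequence goes through "a" then "cat".
def toy_log_probs(x, prefix):
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"</s>": math.log(0.3), "cat": math.log(0.7)}
    return {"</s>": 0.0}

best, lp = beam_search(toy_log_probs, None, b=3)
```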

## Kiros et al. (2013)

**Image caption generation**

- Neural networks are great for working with multiple modalities: **everything is a vector!**
- Image caption generation can therefore use the same techniques as translation modeling
- A word about data:
  - Relatively few captioned images are available
  - Pre-train an image embedding model on another task, like image identification (e.g., ImageNet)

Looks a lot like Kalchbrenner and Blunsom (2013):

- convolutional network on the input
- n-gram language model on the output

Innovation: **multiplicative interactions** in the decoder n-gram model

Encoder: **x** = embed(\(x\))

Simple conditional n-gram LM: \[ \begin{array}{lcl} h_t = W[w_{t-n+1}; w_{t-n+2};...; w_{t-1}] + Cx\\ u_t = Ph_t+b\\ p(w_t|x, w_{t-n+1}^{t-1}) = softmax(u_t) \end{array} \] Multiplicative n-gram LM:

- \(w_i = r_{i,j,w}x_j\)
- \(w_i = u_{w,i}v_{i,j}\ \ \ \ \ \ \ \ (U\in \mathbb{R}^{|V|\times d}, V \in \mathbb{R}^{d\times k})\)
- \(r_t = W[w_{t-n+1}; w_{t-n+2};...; w_{t-1}] + Cx\)
- \(h_t = (W^{fr}r_t)\odot (W^{fx}x)\)
- \(u_t = Ph_t + b\)
- \(p(w_t|x, w_{<t}) = softmax(u_t)\)
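
The distinguishing step, \(h_t = (W^{fr}r_t)\odot (W^{fx}x)\), can be sketched in numpy: word-history features and image features interact elementwise (Hadamard product) rather than only additively. All sizes and random parameters below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d_w, d_x, d_f, d_h, vocab, n = 5, 7, 6, 6, 12, 3   # hypothetical sizes

W   = rng.normal(scale=0.1, size=(d_h, (n - 1) * d_w))
C   = rng.normal(scale=0.1, size=(d_h, d_x))
Wfr = rng.normal(scale=0.1, size=(d_f, d_h))       # factors for r_t
Wfx = rng.normal(scale=0.1, size=(d_f, d_x))       # factors for x
P   = rng.normal(scale=0.1, size=(vocab, d_f))
b   = np.zeros(vocab)

def multiplicative_step(context_words, x):
    """r_t mixes the n-gram context with x additively; h_t then gates the
    two feature sets against each other with an elementwise product."""
    r_t = W @ np.concatenate(context_words) + C @ x
    h_t = (Wfr @ r_t) * (Wfx @ x)                  # Hadamard interaction
    u_t = P @ h_t + b
    e = np.exp(u_t - u_t.max())
    return e / e.sum()                             # softmax over the vocab

x = rng.normal(size=d_x)                           # stand-in for embed(image)
p = multiplicative_step([rng.normal(size=d_w) for _ in range(n - 1)], x)
```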

Two messages:

- Feed-forward n-gram models can be used in place of RNNs in conditional models
- Modeling interactions between input modalities holds a lot of promise
  - Although MLP-type models can approximate higher-order tensors, multiplicative models appear to make learning interactions easier