A conditional language model assigns probabilities to sequences of words, \(w = (w_1, w_2, \ldots, w_l)\), given some conditioning context, \(x\).
As with unconditional models, it is again helpful to use the chain rule to decompose this probability: \[ p(w|x) = \prod_{t=1}^lp(w_t|x, w_1, w_2, ..., w_{t-1}) \]
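In practice, the decomposition means that scoring a whole output reduces to summing per-token log-probabilities. A minimal sketch, assuming a hypothetical `log_prob_fn(x, prefix, token)` interface that any conditional LM could expose:

```python
def sequence_log_prob(log_prob_fn, x, w):
    """Chain-rule scoring: log p(w | x) = sum_t log p(w_t | x, w_{<t}).

    `log_prob_fn(x, prefix, token)` is a hypothetical interface returning
    log p(token | x, prefix) under some conditional language model.
    """
    total = 0.0
    for t in range(len(w)):
        total += log_prob_fn(x, w[:t], w[t])  # condition on x and the prefix w_{<t}
    return total  # exponentiate to recover p(w | x)
```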
x "input" | w "text output" |
---|---|
An author | A document written by that author |
A topic label | An article about that topic |
{SPAM, NOT_SPAM} | An email |
A sentence in French | Its English translation |
A sentence in English | Its French translation |
A sentence in English | Its Chinese translation |
An image | A text description of the image |
A question + a document | Its answer |
A question + an image | Its answer |
To train conditional language models, we need paired samples, \(\{(x_i, w_i)\}\).
Algorithmic challenges
We often want to find the most likely \(w\) given some \(x\): \[ w^* = \arg \max_w p(w|x) \] Unfortunately, this is in general an intractable problem. We therefore approximate it using a beam search, or with Monte Carlo methods, since drawing samples \(w^{(i)} \sim p(w|x)\) is often computationally easy.
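Sampling only requires running the model forward one token at a time (ancestral sampling). A minimal sketch, assuming a hypothetical `next_token_probs(x, prefix)` interface that returns the model's distribution over the next token:

```python
import numpy as np

def ancestral_sample(next_token_probs, x, eos, max_len=50, rng=None):
    """Draw w^(i) ~ p(w | x) one token at a time (ancestral sampling).

    `next_token_probs(x, prefix)` is a hypothetical interface returning a
    probability vector over the vocabulary; `eos` is the end-of-sequence id.
    """
    rng = rng or np.random.default_rng()
    w = []
    for _ in range(max_len):
        p = next_token_probs(x, w)
        token = int(rng.choice(len(p), p=p))  # exact sample from the next-token distribution
        if token == eos:
            break
        w.append(token)
    return w
```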
Evaluating conditional LMs
We can use cross-entropy or perplexity: okay to implement, but hard to interpret.
Task-specific evaluation: compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric \(L\). \[ w^* = \arg \max_w p(w|x)\ \ \ \ \ L(w^*, w_{ref}) \] Examples of \(L\): BLEU, METEOR, WER, ROUGE. These are easy to implement, okay to interpret.
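As one concrete example of such a metric \(L\), here is a minimal word error rate (WER) sketch: the Levenshtein distance between the hypothesis and reference token sequences, normalised by the reference length. BLEU, METEOR and ROUGE follow the same compare-to-a-reference pattern with different scoring rules.

```python
def word_error_rate(hyp, ref):
    """WER: edit distance between hypothesis and reference word sequences,
    divided by the reference length (lower is better)."""
    hyp, ref = hyp.split(), ref.split()
    # d[i][j] = edit distance between hyp[:i] and ref[:j]
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(hyp)][len(ref)] / len(ref)

word_error_rate("the cat sat on mat", "the cat sat on the mat")  # 1/6 ≈ 0.17
```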
Encoder-Decoder
Two questions
- How do we encode \(x\) as a fixed-size vector, \(c\) ?
- How do we condition on \(c\) in the decoding model?
Kalchbrenner and Blunsom 2013
Encoder \[ c = embed(x) \]
\[ s = Vc \]
Recurrent decoder \[ h_t = g(W[h_{t-1}; w_{t-1}] + s + b) \]
\[ u_t = Ph_t + b' \]
\[ p(W_t|x, w_{<t}) = softmax(u_t) \]
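A minimal sketch of one decoder step implementing the equations above, with tanh standing in for the nonlinearity \(g\). The weight shapes are assumptions chosen so the products line up; the source conditioning vector \(s = Vc\) is computed once and added at every step:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decoder_step(h_prev, w_prev_embed, s, W, b, P, b_out):
    """One step of the recurrent decoder:
    h_t = tanh(W [h_{t-1}; w_{t-1}] + s + b),  u_t = P h_t + b',
    p(w_t | x, w_{<t}) = softmax(u_t).
    Assumed shapes: h_prev (d,), w_prev_embed (e,), W (d, d+e),
    s and b (d,), P (|V|, d), b_out (|V|,)."""
    h = np.tanh(W @ np.concatenate([h_prev, w_prev_embed]) + s + b)
    p = softmax(P @ h + b_out)
    return h, p
```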
CSM Encoder
How should we define \(c = embed(x)\) ?
Convolutional sentence model (CSM); a code sketch follows the lists below.
Good
- Convolutions learn interactions among features in a local context
- By stacking them, longer-range dependencies can be learnt
- Deep ConvNets have a branching structure similar to trees, but no parser is required
Bad
- Sentences have different lengths and need trees of different depths; ConvNets are not usually so dynamic, but see Kalchbrenner et al. (2014), "A convolutional neural network for modelling sentences", Proc. ACL.
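A minimal sketch of a convolutional sentence encoder in this spirit: stacked 1-D convolutions over word embeddings, pooled to a fixed-size vector \(c = embed(x)\). The kernel size, depth and max-pooling are assumptions, not the exact Kalchbrenner and Blunsom architecture.

```python
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):
    """Stacked 1-D convolutions over word embeddings, max-pooled over
    positions to give a fixed-size sentence vector."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, depth=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layers, in_dim = [], embed_dim
        for _ in range(depth):  # each stacked layer widens the receptive field
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1), nn.ReLU()]
            in_dim = hidden_dim
        self.convs = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, src_len) token ids
        e = self.embed(x).transpose(1, 2)     # (batch, embed_dim, src_len)
        h = self.convs(e)                     # (batch, hidden_dim, src_len)
        return h.max(dim=2).values            # c = embed(x): (batch, hidden_dim)
```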
Sutskever et al. (2014)
LSTM encoder
- \((c_0, h_0)\) are parameters
- \((c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})\)
The encoding is \((c_l, h_l)\) where \(l = |x|\)
LSTM decoder
- \(w_0 = <s>\)
- \((c_{t+l}, h_{t+l}) = LSTM(w_{t-1}, c_{t+l-1}, h_{t+l-1})\)
- \(u_t = Ph_{t+l} + b\)
- \(P(W_t|x, w_{<t}) = softmax(u_t)\)
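A minimal sketch of this encoder-decoder, assuming PyTorch: the encoder LSTM's final state initialises the decoder LSTM, and a linear projection gives the logits \(u_t\). The single layers and dimensions are assumptions; the paper used deep LSTMs and reversed the source sequence.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, tgt_vocab)        # u_t = P h_{t+l} + b

    def forward(self, x, w_prev):
        # x: (batch, src_len) source ids; w_prev: (batch, tgt_len) targets shifted right (starting with <s>)
        _, state = self.encoder(self.src_embed(x))   # state = (h_l, c_l), the encoding of x
        h, _ = self.decoder(self.tgt_embed(w_prev), state)
        return self.proj(h)                          # logits; softmax gives p(w_t | x, w_{<t})
```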
Good
- RNNs deal naturally with sequences of various lengths
- LSTMs can, in principle, propagate gradients over long distances
- Very simple architecture
Bad
- The hidden state has to remember a lot of information!
Tricks
Read the input sequence "backwards": +4 BLEU
Use an ensemble of \(J\) independently trained models (see the sketch after this list).
- Ensemble of 2 models: +3 BLEU
- Ensemble of 5 models: +4.5 BLEU
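A minimal sketch of one way to combine the ensemble at decoding time: average the members' next-token distributions (averaging log-probabilities is another common choice). The `next_token_log_probs` method is a hypothetical interface, not a fixed API.

```python
import numpy as np

def ensemble_next_token_log_probs(models, x, prefix):
    """Average the predictive distributions of J independently trained models
    and return the result in log space, for use by greedy or beam search."""
    probs = np.mean([np.exp(m.next_token_log_probs(x, prefix)) for m in models], axis=0)
    return np.log(probs)
```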
A word about decoding
In general, we want to find the most probable (MAP) output given the input, i.e. \[ w^* = \arg\max_{w}p(w|x) = \arg \max_w\sum_{t=1}^{|w|}\log p(w_t|x, w_{<t}) \] For general RNNs, this is a hard problem. We therefore approximate it with a greedy search: \[ \begin{array}{lcl} w_1^* = \arg\max_{w_1}p(w_1|x)\\ w_2^* = \arg\max_{w_2}p(w_2|x, w_1^*)\\ \vdots\\ w^*_t = \arg\max_{w_t}p(w_t|x, w^*_{<t}) \end{array} \] A slightly better approximation is a beam search with beam size \(b\). Key idea: keep track of the top \(b\) hypotheses at each step. Use beam search: +1 BLEU
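A minimal beam-search sketch, assuming a hypothetical `next_token_log_probs(x, prefix)` interface; with `beam_size=1` it reduces to the greedy search above.

```python
import numpy as np

def beam_search(next_token_log_probs, x, bos, eos, beam_size=5, max_len=50):
    """Keep the top-b partial hypotheses by total log-probability at each step."""
    beams = [([bos], 0.0)]                  # (prefix, log p(prefix | x))
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = next_token_log_probs(x, prefix)
            for token in np.argsort(log_p)[-beam_size:]:   # top-b extensions per beam suffice
                candidates.append((prefix + [int(token)], score + float(log_p[token])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                        # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```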
Kiros et al. (2013)
Image caption generation
- Neural networks are great for working with multiple modalities: everything is a vector!
- Image caption generation can therefore use the same techniques as translation modeling
A word about data
- Relatively few captioned images are available
- Pre-train an image embedding model using another task, like image classification (e.g., ImageNet)
Looks a lot like Kalchbrenner and Blunsom (2013):
- convolutional network on the input
- n-gram language model on the output
Innovation: multiplicative interactions in the decoder n-gram model
Encoder \[ x = embed(image) \]
Simple conditional n-gram LM: \[ \begin{array}{lcl} h_t = W[w_{t-n+1}; w_{t-n+2};...; w_{t-1}] + Cx\\ u_t = Ph_t+b\\ p(W_t|x, w_{t-n+1}^{t-1}) = softmax(u_t) \end{array} \] Multiplicative n-gram LM:
- \(\tilde{w}_i = \sum_j r_{i,j,w}\, x_j\)   (word representations depend on the conditioning vector \(x\) through a tensor \(r\))
- \(r_{i,j,w} = u_{w,i}\, v_{i,j}\ \ \ \ \ \ \ \ (U\in \mathbb{R}^{|V|\times d},\ V \in \mathbb{R}^{d\times k})\)   (factorise the tensor to keep the parameter count manageable)
- \(r_t = W[w_{t-n+1}; w_{t-n+2};...; w_{t-1}] + Cx\)
- \(h_t = (W^{fr}r_t)\odot (W^{fx}x)\)
- \(u_t = Ph_t + b\)
- \(p(W_t|x, w_{<t}) = softmax(u_t)\)
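A minimal sketch of one step of this multiplicative decoder, directly following the equations above. The weight shapes are assumptions chosen so the products line up, and `context_embeds` stands for the concatenation of the previous \(n-1\) word embeddings:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def multiplicative_ngram_step(context_embeds, x, W, C, W_fr, W_fx, P, b):
    """r_t = W [w_{t-n+1}; ...; w_{t-1}] + C x
    h_t = (W^{fr} r_t) ⊙ (W^{fx} x)    # elementwise product couples text context and image
    p(w_t | x, w_{<t}) = softmax(P h_t + b)"""
    r = W @ context_embeds + C @ x
    h = (W_fr @ r) * (W_fx @ x)
    return softmax(P @ h + b)
```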
Two messages:
- Feed-forward n-gram models can be used in place of RNNs in conditional models
- Modeling interactions between input modalities holds a lot of promise
  - Although MLP-type models can approximate higher-order tensors, multiplicative models appear to make learning interactions easier