Text summarization study on CNN/Daily Mail. (a) Global norm of the gradients over time; (b) norm of the last hidden state over time; (c) encoder gradients of the cost w.r.t. the bi-directional output (400 encoder steps); (d) decoder gradients of the cost w.r.t. the decoder output (100 decoder steps). Note that (c, d) are evaluated upon convergence on a specific batch, and the norm at each time step is averaged jointly over the batch and the hidden dimension.