Text summarization study on CNN/Daily Mail. (a) Global norm of the gradients over time; (b) norm of the last hidden state over time; (c) encoder gradients of the cost w.r.t. the bi-directional output (400 encoder steps); (d) decoder gradients of the cost w.r.t. the decoder output (100 decoder steps). Note that (c, d) are evaluated upon convergence on a specific batch, and the norm at each time step is averaged jointly over the batch and the hidden dimension.