Language model training hyperparameters.
| Hyperparameter | Value |
|---|---|
| Hidden size | 768 |
| Embedding size | 768 |
| Vocab size | 30004 |
| Max sequence length | 128 |
| Batch size | 128 |
| Train steps | 1M |
| Learning rate decay | Linear |
| Warmup steps | 10000 |
| Learning rate | 1e-4 |
| Adam ϵ | 1e-6 |
| Adam β1 | 0.9 |
| Adam β2 | 0.999 |
| Dropout | 0.1 |

| Transformer hyperparameter | Value |
|---|---|
| Transformer layers | 12 |
| Intermediate hidden size | 3072 |
| Attention heads | 12 |
| Attention head size | 64 |
| Attention dropout | 0.1 |
| BERT mask proportion | 0.15 |

| LSTM hyperparameter | Value |
|---|---|
| LSTM layers | 3 |
| Context size | 768 |
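For concreteness, here is a minimal sketch of how the shared optimization settings in the table could be wired up in PyTorch. Only the hyperparameter values come from the table; the stand-in model, function names, and training-loop details are hypothetical, not the authors' implementation.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Shared optimization hyperparameters from the table above.
TRAIN_STEPS = 1_000_000
WARMUP_STEPS = 10_000
PEAK_LR = 1e-4

# Hypothetical stand-in: the real model would be the Transformer or
# LSTM language model described above (hidden size 768, vocab 30004).
model = torch.nn.Linear(768, 30004)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),  # Adam β1, β2
    eps=1e-6,            # Adam ϵ
)

def linear_warmup_decay(step: int) -> float:
    """Linear warmup to the peak LR over 10k steps, then linear decay
    to zero by 1M steps, matching the table's schedule."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TRAIN_STEPS - step) / (TRAIN_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

# In the training loop, call optimizer.step() then scheduler.step()
# once per training step.
```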