Table 9: Pre-training speed and computation of mT5 vs. ByT5. Left: Sequences per second during pre-training on a TPUv3-64 device. Right: Total einsum operations for a forward pass, as logged by the T5 framework. Values in parentheses give ByT5 relative to mT5.

         sequences / sec          einsum ops × 1e12
         mT5     ByT5             mT5     ByT5
Small    1646    1232 (0.75×)     87      98 (1.13×)
Base     747     576 (0.77×)      168     194 (1.15×)
Large    306     232 (0.76×)      346     416 (1.20×)
XL       94      70 (0.74×)       1000    1220 (1.22×)
XXL      33      25 (0.76×)       1660    2070 (1.25×)
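The parenthesized factors in Table 9 appear to be the ByT5 value divided by the mT5 value for the same model size. The short Python sketch below recomputes them from the table entries as a sanity check. The dictionary and function names (`SEQS_PER_SEC`, `EINSUM_OPS_1E12`, `relative`) are illustrative only and do not come from the T5 framework.

```python
# Values transcribed from Table 9 as (mT5, ByT5) pairs per model size.
# Names here are hypothetical helpers, not part of any T5 codebase.
SEQS_PER_SEC = {
    "Small": (1646, 1232),
    "Base": (747, 576),
    "Large": (306, 232),
    "XL": (94, 70),
    "XXL": (33, 25),
}
EINSUM_OPS_1E12 = {
    "Small": (87, 98),
    "Base": (168, 194),
    "Large": (346, 416),
    "XL": (1000, 1220),
    "XXL": (1660, 2070),
}


def relative(table):
    """Return the ByT5 / mT5 ratio for each model size."""
    return {size: byt5 / mt5 for size, (mt5, byt5) in table.items()}


speed_ratio = relative(SEQS_PER_SEC)
ops_ratio = relative(EINSUM_OPS_1E12)

for size in SEQS_PER_SEC:
    print(f"{size:<6} speed {speed_ratio[size]:.2f}x  "
          f"einsum ops {ops_ratio[size]:.2f}x")
```

Under this reading, ByT5 pre-trains at roughly three quarters of mT5's sequence throughput at every size, while its forward-pass einsum count exceeds mT5's by a margin that grows from about 13% at Small to about 25% at XXL.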