Table 6: Jensen-Shannon divergence, per layer, between the attention distributions of a random local attention head and a random head that routes attention as in Section 4.1, on the WikiText-103 dataset. We report means and standard deviations computed over 10 runs and use the natural logarithm, so that divergences are upper-bounded by ln 2 ≈ 0.6931.
         JSD(local ‖ local)   JSD(local ‖ routing)   JSD(routing ‖ routing)
layer 0  0.0038 ± 0.0018      0.4706 ± 0.0319        0.1579 ± 0.0576
layer 1  0.3071 ± 0.1217      0.6674 ± 0.0153        0.5820 ± 0.0104
layer 2  0.2164 ± 0.0803      0.5896 ± 0.0249        0.4015 ± 0.0121
layer 3  0.1163 ± 0.0336      0.6047 ± 0.0181        0.4144 ± 0.0264
layer 4  0.1840 ± 0.0562      0.6266 ± 0.0062        0.4191 ± 0.0879
layer 5  0.2284 ± 0.0225      0.6463 ± 0.0155        0.4687 ± 0.0449
layer 6  0.1901 ± 0.0525      0.6471 ± 0.0040        0.5175 ± 0.0469
layer 7  0.1566 ± 0.0685      0.5798 ± 0.0235        0.4350 ± 0.0139
layer 8  0.1638 ± 0.0739      0.5993 ± 0.0148        0.4268 ± 0.0291
layer 9  0.2095 ± 0.0560      0.6127 ± 0.0053        0.3581 ± 0.0019
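For reference, the quantity reported above is JSD(P ‖ Q) = ½ KL(P ‖ M) + ½ KL(Q ‖ M) with M = ½(P + Q), which under the natural logarithm attains its maximum of ln 2 ≈ 0.6931 for distributions with disjoint support. A minimal NumPy sketch follows; the `jsd` helper, its epsilon smoothing, and the example inputs are illustrative assumptions, not the paper's code.

import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    computed with the natural logarithm so the result is bounded
    above by ln(2) ~= 0.6931. Illustrative sketch, not the authors' code."""
    p = np.asarray(p, dtype=np.float64) + eps  # smooth to avoid log(0)
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()  # renormalize after smoothing
    q /= q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * kl_pm + 0.5 * kl_qm

# Distributions with disjoint support attain the upper bound:
# jsd([1.0, 0.0], [0.0, 1.0]) ~= ln(2) ~= 0.6931
# Identical distributions give 0, matching the near-zero
# JSD(local ‖ local) value at layer 0 in Table 6.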