Table 5:

Categorization of 150 CLEVR-Humans examples, together with the accuracy of the GLT model (for many categories, the sample is too small to generalize about model performance). Spelling mistakes are underlined. Some examples were labeled with multiple categories. Six examples with gold-label mistakes were ignored.

| Category | Examples | Correct |
|---|---|---|
| Negation | How many objects are not shiny? ✓ | 1/2 (50%) |
| | What color is the cylinder that is not the same colr as the sphere? ✗ | |
| Spelling mistakes | Are there two green cumes? ✓ | 1/5 (20%) |
| | How many rubber spehres are there here? ✗ | |
| Superlatives | What color is the object furthest to the right? ✓ | 8/10 (80%) |
| | What shape is the smallest matte object? ✓ | |
| Visual Concepts: obscuring, between | Is the sphere the same color as the object that is obscuring it? ✓ | 5/7 (71%) |
| | What color is the object in between the two large cubes? ✓ | |
| Visual Concepts: reflection, shadow | Which shape shows the largest reflection ✗ | 0/2 (0%) |
| | Are all of the objects casting a shadow? ✗ | |
| Visual Concepts: relations | What color is the small ball near the brown cube? ✓ | 3/8 (38%) |
| | What is behind and right of the cyan cylinder? ✗ | |
| All quantifier | Are all the spheres the same size? ✓ | 10/12 (83%) |
| | Are all the cylinders brown? ✗ | |
| Counting abstract concepts | How many different shapes are there? ✓ | 1/4 (25%) |
| | How many differently colored cubes are there? ✗ | |
| Complex logic | if these objects were lined up biggest to smallest, what would be in the middle? ✓ | 3/4 (75%) |
| | if most of the items shown are shiny and most of the items shown are blue, would it be fair to say most of the items are shiny and blue? ✗ | |
| Different question structure | Are more objects metallic or matte? ✓ | 1/2 (50%) |
| | Each shape is present 3 times except for the ✗ | |
| Uniqueness | What color object is a different material from the rest? ✓ | 1/2 (50%) |
| | What color is the object that does not match the others? ✗ | |
| Long-tail concepts | Can you roll all the purple objects? ✓ | 2/5 (40%) |
| | How many of these things could be stacked on top of each other? ✗ | |
| Operators used differently than CLEVR | Are the large cylinders the same color? ✓ | 2/5 (40%) |
| | Are there more rubber objects than matte cylinders and green cubes? ✗ | |
| Same operators as CLEVR, possibly different phrasing | What color is the cube directly in front of the blue cylinder? ✓ | 71/80 (89%) |
| | What color is the little cylinder? ✗ | |
