Word sense disambiguation (WSD) is an old and important task in computational linguistics that still remains challenging, to machines as well as to human annotators. Recently there have been several proposals for representing word meaning in context that diverge from the traditional use of a single best sense for each occurrence. They represent word meaning in context through multiple paraphrases, as points in vector space, or as distributions over latent senses. New methods of evaluating and comparing these different representations are needed. In this paper we propose two novel annotation schemes that characterize word meaning in context in a graded fashion. In WS sim annotation, the applicability of each dictionary sense is rated on an ordinal scale. U sim annotation directly rates the similarity of pairs of usages of the same lemma, again on a scale. We find that the novel annotation schemes show good inter-annotator agreement, as well as a strong correlation with traditional single-sense annotation and with annotation of multiple lexical paraphrases. Annotators make use of the whole ordinal scale, and give very fine-grained judgments that “mix and match” senses for each individual usage. We also find that the U sim ratings obey the triangle inequality, justifying models that treat usage similarity as metric. There has recently been much work on grouping senses into coarse-grained groups. We demonstrate that graded WS sim and U sim ratings can be used to analyze existing coarse-grained sense groupings to identify sense groups that may not match intuitions of untrained native speakers. In the course of the comparison, we also show that the WS sim ratings are not subsumed by any static sense grouping.