Large Language Models Enable Few-Shot Clustering

Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.


Introduction
Unsupervised clustering aims to do an impossible task: organize data in a way that satisfies a domain expert's needs without any specification of what those needs are. Clustering, by its nature, is fundamentally an underspecified problem. According to Caruana (2013), this underspecification makes clustering "probably approximately useless." Semi-supervised clustering, on the other hand, aims to solve this problem by enabling the domain expert to guide the clustering algorithm (Bae et al., 2020). Prior works have introduced different types of interaction between an expert and a clustering algorithm, such as initializing clusters with handpicked seed points (Basu et al., 2002), specifying pairwise constraints (Basu et al., 2004; Zhang et al., 2019), providing feature feedback (Dasgupta and Ng, 2010), splitting or merging clusters (Awasthi et al., 2013), or locking one cluster and refining the rest (Coden et al., 2017). These interfaces have all been shown to give experts control of the final clusters. However, they require significant effort from the expert. For example, in a simulation that uses split/merge, pairwise constraint, and lock/refine interactions (Coden et al., 2017), it took between 20 and 100 human-machine interactions to get any clustering algorithm to produce clusters that fit the human's needs. Therefore, for large, real-world datasets with a large number of possible clusters, the feedback cost required by interactive clustering algorithms can be immense.

Figure 1: In traditional semi-supervised clustering, a user provides a large amount of feedback to the clusterer. In our approach, the user prompts an LLM with a small amount of feedback. The LLM then generates a large amount of pseudo-feedback for the clusterer.
Building on a body of recent work that uses Large Language Models (LLMs) as noisy simulations of human decision-making (Fu et al., 2023; Horton, 2023; Park et al., 2023), we propose a different approach for semi-supervised text clustering. In particular, we answer the following research question: Can an expert provide a few demonstrations of their desired interaction (e.g., pairwise constraints) to a large language model, then let the LLM direct the clustering algorithm?
We explore three places in the text clustering process where an LLM could be leveraged: before clustering, during clustering, and after clustering. We leverage an LLM before clustering by augmenting the textual representation. For each example, we generate keyphrases with an LLM, encode these keyphrases, and add them to the base representation. We incorporate an LLM during clustering by adding cluster constraints. Adopting a classical algorithm for semi-supervised clustering, pairwise constraint clustering, we use an LLM as a pairwise constraint pseudo-oracle. We then explore using an LLM after clustering by correcting low-confidence cluster assignments using the pairwise constraint pseudo-oracle. In every case, the interaction between a user and the clustering algorithm is enabled by a prompt written by the user and provided to a large language model. We test these three methods on five datasets across three tasks: canonicalizing entities, clustering queries by intent, and grouping tweets by topic. We find that, compared to traditional K-Means clustering on document embeddings, using an LLM to enrich each document's representation empirically improves cluster quality on every metric for all datasets we consider. Using an LLM as a pairwise constraint pseudo-oracle can also be highly effective when the LLM is capable of providing pairwise similarity judgements, but requires a larger number of LLM queries to be effective. However, LLM post-correction provides limited upside. Importantly, LLMs can also approach the performance of traditional semi-supervised clustering with a human oracle at a fraction of the cost.
Our work stands out from recent deep-learning-based text clustering methods (Zhang et al., 2021, 2023) in its remarkable simplicity. Using an LLM to expand documents' representations or correct clustering outputs can be added as a plug-in to any text clustering algorithm using any set of text features, while our pseudo-oracle pairwise constraint clustering approach requires using K-Means as the underlying clustering algorithm. In our investigation of what aspect of the LLM prompt is most responsible for the clustering behavior, we find that just using an instruction alone (with no demonstrations) adds significant value. This can motivate future research directions for integrating natural language instructions with a clustering algorithm.

Methods to Incorporate LLMs
In this section, we describe the methods that we use to incorporate LLMs into clustering.

Clustering via LLM Keyphrase Expansion
Before any cluster is produced, experts typically know what aspects of each document they wish to capture during clustering. Instead of forcing clustering algorithms to mine such key factors from scratch, it could be valuable to globally highlight these aspects (and thereby specify the task emphases) beforehand. To do so, we use an LLM to make every document's textual representation task-dependent, by enriching and expanding it with evidence relevant to the clustering need. Specifically, each document is passed through an LLM which generates keyphrases, these keyphrases are encoded by an embedding model, and the keyphrase embedding is then concatenated to the original document embedding. We generate keyphrases using GPT-3.5. We provide a short prompt to the LLM, starting with an instruction (e.g. "I am trying to cluster online banking queries based on whether they express the same intent. For each query, generate a comprehensive set of keyphrases that could describe its intent, as a JSON-formatted list."). The instruction is followed by four demonstrations of keyphrases (example shown in Figure 2). Examples of full prompts are shown in Appendix B.
We then encode the generated keyphrases into a single vector, and concatenate this vector with the original document's text representation. To disentangle the knowledge from an LLM from the benefits of a better encoder, we encode the keyphrases using the same encoder as the original text. (An exception is entity clustering: there, the BERT encoder has been specialized for clustering Wikipedia sentences, so we use DistilBERT to encode keyphrases.)
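The expansion step can be sketched in a few lines. The text above does not specify how multiple keyphrase embeddings are pooled into a single vector, so mean-pooling is an illustrative assumption here, and the random vectors merely stand in for encoder outputs:

```python
import numpy as np

def expand_representation(doc_embedding, keyphrase_embeddings):
    """Concatenate a document embedding with a pooled keyphrase embedding."""
    # Average the per-keyphrase vectors into a single keyphrase vector
    # (pooling choice is an assumption), then concatenate it to the
    # base document representation.
    keyphrase_vector = np.mean(keyphrase_embeddings, axis=0)
    return np.concatenate([doc_embedding, keyphrase_vector])

# Toy example with random "embeddings" standing in for encoder outputs.
rng = np.random.default_rng(0)
doc = rng.normal(size=384)           # base document embedding
phrases = rng.normal(size=(5, 384))  # embeddings of 5 generated keyphrases
expanded = expand_representation(doc, phrases)
print(expanded.shape)  # (768,)
```

The expanded vectors can then be fed to any off-the-shelf clusterer in place of the original embeddings.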

Pseudo-Oracle Pairwise Constraint Clustering
We explore the situation where a user conceptually describes which kinds of points to group together and wants to ensure the final clusters follow this grouping.
Arguably, the most popular approach to semi-supervised clustering is pairwise constraint clustering, where an oracle (e.g. a domain expert) selects pairs of points which must be linked or cannot be linked (Wagstaff and Cardie, 2000), such that more abstract clustering needs of experts can be implicitly induced from the concrete feedback.
We use this paradigm to investigate the potential of LLMs to amplify expert guidance during clustering, using an LLM as a pseudo-oracle.
To select pairs to classify, we take different strategies for entity canonicalization and for other text clustering tasks. For text clustering, we adapt the Explore-Consolidate algorithm (Basu et al., 2004) to first collect a diverse set of pairs from embedding space (to identify pairs of points that must be linked), then collect points that are nearby to already-chosen points (to find pairs of points that cannot be linked). For entity canonicalization, where there are so many clusters that very few pairs of points must be linked, we simply identify the closest distinct pairs of points in embedding space.
We prompt an LLM with a brief domain-specific instruction (provided in its entirety in Appendix A), followed by up to 4 demonstrations of pairwise constraints, obtained from test set labels. We use these pairwise constraints to generate clusters with the PCKMeans algorithm of Basu et al. (2004). This algorithm applies penalties for cluster assignments that violate any constraints, weighted by a hyperparameter w. Following prior work (Vashishth et al., 2018), we tune this parameter on each dataset's validation split.
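A minimal sketch of the penalized assignment step behind PCKMeans is shown below. Real PCKMeans alternates this step with centroid updates and can weight each constraint individually; here a single shared weight w penalizes any violated must-link or cannot-link constraint, matching the hyperparameter described above:

```python
import numpy as np

def pckmeans_assign(X, centroids, must_link, cannot_link, labels, w):
    """One PCKMeans-style assignment sweep: squared distance plus
    w-weighted penalties for violated pairwise constraints."""
    new_labels = labels.copy()
    for i, x in enumerate(X):
        costs = np.sum((centroids - x) ** 2, axis=1)
        for k in range(len(centroids)):
            for (a, b) in must_link:
                if i in (a, b):
                    other = b if i == a else a
                    if new_labels[other] != k:  # splitting a must-link pair
                        costs[k] += w
            for (a, b) in cannot_link:
                if i in (a, b):
                    other = b if i == a else a
                    if new_labels[other] == k:  # joining a cannot-link pair
                        costs[k] += w
        new_labels[i] = int(np.argmin(costs))
    return new_labels

# With a heavy penalty, the cannot-link constraint forces points 0 and 1
# into different clusters even though they are close in embedding space.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = pckmeans_assign(X, centroids,
                         must_link=[], cannot_link=[(0, 1)],
                         labels=np.array([0, 0, 1]), w=100.0)
print(labels.tolist())  # [1, 0, 1]
```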

Using an LLM to Correct a Clustering
We finally consider the setting where one has an existing set of clusters, but wants to improve their quality with minimal local changes. We use the same pairwise constraint pseudo-oracle as in Section 2.2 to achieve this, and we illustrate this procedure in Figure 3.
We identify the low-confidence points by finding the k points with the least margin between the nearest and second-nearest clusters (setting k = 500 for our experiments). We textually represent each cluster by the entities nearest to the centroid of that cluster in embedding space. For each low-confidence point, we first ask the LLM whether or not this point is correctly linked to any of the representative points in its currently assigned cluster. If the LLM predicts that this point should not be linked to the current cluster, we consider the 4 next-closest clusters in embedding space as candidates for reranking, sorted by proximity. To rerank the current point, we ask the LLM whether this point should be linked to the representative points in each candidate cluster. If the LLM responds positively, then we reassign the point to this new cluster. If the LLM responds negatively for all alternative choices, we maintain the existing cluster assignment.
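The margin-based selection of low-confidence points can be sketched as follows (a hypothetical helper written for illustration, not the authors' code):

```python
import numpy as np

def least_margin_points(X, centroids, k):
    """Indices of the k points with the smallest margin between their
    nearest and second-nearest cluster centroids."""
    # Distance from every point to every centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    sorted_d = np.sort(dists, axis=1)
    margins = sorted_d[:, 1] - sorted_d[:, 0]
    return np.argsort(margins)[:k]

# The middle point sits almost equidistant between the two centroids,
# so it has the smallest margin and is flagged for LLM review.
X = np.array([[0.0, 0.0], [2.4, 0.0], [5.0, 0.0]])
centroids = np.array([[0.0, 0.0], [5.0, 0.0]])
print(least_margin_points(X, centroids, 1))  # [1]
```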

Entity Canonicalization
Task. In entity canonicalization, we must group a collection of noun phrases M such that m_1 ∈ C_j and m_2 ∈ C_j if and only if m_1 and m_2 refer to the same entity. For example, the noun phrases President Biden (m_1), Joe Biden (m_2), and the 46th U.S. President (m_3) should be clustered in one group (e.g., C_1). The set of noun phrases M are usually the nodes of an "open knowledge graph" produced by an OIE system. Unlike the related task of entity linking (Bunescu and Pasca, 2006; Milne and Witten, 2008), we do not assume that any curated knowledge graph, gazetteer, or encyclopedia contains all the entities of interest.
Entity canonicalization is valuable for motivating the challenges of semi-supervised clustering. Here, there are hundreds or thousands of clusters and relatively few points per cluster, making this a difficult clustering task that requires a large amount of human feedback to be effective.
Datasets. We experiment with two datasets:
• OPIEC59k (Shen et al., 2022) contains 22K noun phrases (with 2138 unique entity surface forms) belonging to 490 ground truth clusters. The noun phrases are extracted by an OIE system (Gashteovski et al., 2017, 2019), and the ground truth entity clusters are anchor texts from Wikipedia that link to the same Wikipedia article.
• ReVerb45k (Vashishth et al., 2018) contains 15.5K mentions (with 12295 unique entity surface forms) belonging to 6700 ground truth clusters. The noun phrases are the output of the ReVerb (Fader et al., 2011) system, and the "ground-truth" entity clusters come from automatically linking entities to the Freebase knowledge graph. We use the version of this dataset from Shen et al. (2022), who manually removed samples containing labeling errors.

Canonicalization Metrics. We follow the standard metrics used by Shen et al. (2022):
• Macro Precision and Recall
  - Prec: For what fraction of predicted clusters is every element in the same gold cluster?
  - Rec: For what fraction of gold clusters is every element in the same predicted cluster?
• Micro Precision and Recall
  - Prec: How many points are in the same gold cluster as the majority of their predicted cluster?
  - Rec: How many points are in the same predicted cluster as the majority of their gold cluster?
• Pairwise Precision and Recall
  - Prec: How many pairs of points predicted to be linked are truly linked by a gold cluster?
  - Rec: How many pairs of points linked by a gold cluster are also predicted to be linked?
We finally compute the harmonic mean of each pair to obtain Macro F1, Micro F1, and Pairwise F1.
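As a concrete reference, the three metric families can be sketched over dictionary-based cluster assignments (the data representation here is our own choice for illustration; F1 is then the harmonic mean of each returned precision-recall pair):

```python
from collections import Counter, defaultdict
from itertools import combinations

def groups(assign):
    """Turn an element -> cluster-id mapping into a list of element sets."""
    g = defaultdict(set)
    for elem, cid in assign.items():
        g[cid].add(elem)
    return list(g.values())

def macro(pred, gold):
    # Prec: fraction of predicted clusters lying entirely in one gold cluster.
    prec = sum(len({gold[e] for e in c}) == 1 for c in groups(pred)) / len(groups(pred))
    rec = sum(len({pred[e] for e in c}) == 1 for c in groups(gold)) / len(groups(gold))
    return prec, rec

def micro(pred, gold):
    # Prec: points agreeing with the majority gold label of their predicted cluster.
    def score(a, b):
        hits = 0
        for c in groups(a):
            counts = Counter(b[e] for e in c)
            hits += counts.most_common(1)[0][1]
        return hits / len(a)
    return score(pred, gold), score(gold, pred)

def pairwise(pred, gold):
    # Prec/Rec over linked pairs of points.
    pred_pairs = {frozenset(p) for c in groups(pred) for p in combinations(c, 2)}
    gold_pairs = {frozenset(p) for c in groups(gold) for p in combinations(c, 2)}
    inter = len(pred_pairs & gold_pairs)
    return inter / len(pred_pairs), inter / len(gold_pairs)

pred = {"a": 0, "b": 0, "c": 0, "d": 1}
gold = {"a": 0, "b": 0, "c": 1, "d": 1}
print(macro(pred, gold))  # (0.5, 0.5)
print(micro(pred, gold))  # (0.75, 0.75)
print(pairwise(pred, gold))
```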

Text Clustering
Task. We then consider the case of clustering short textual documents. This clustering task has been extensively studied in the literature (Aggarwal and Zhai, 2012).
Metrics. Following prior work (Zhang et al., 2021), we compare our text clusters to the ground truth using normalized mutual information and accuracy (obtained by finding the best alignment between ground truth and predicted clusters using the Hungarian algorithm (Kuhn, 1955)).
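Both metrics are straightforward to compute with standard libraries; a sketch using scipy's Hungarian solver (`linear_sum_assignment`) and scikit-learn's NMI follows, with illustrative toy labels:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(pred, gold):
    """Accuracy under the best one-to-one cluster alignment (Hungarian algorithm)."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    k = max(pred.max(), gold.max()) + 1
    # Contingency table: cost[p, g] counts points with predicted label p, gold label g.
    cost = np.zeros((k, k), dtype=int)
    for p, g in zip(pred, gold):
        cost[p, g] += 1
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return cost[rows, cols].sum() / len(pred)

# Predicted labels are a pure relabeling of the gold labels, so both metrics are 1.
pred = [0, 0, 1, 1, 2]
gold = [1, 1, 0, 0, 2]
print(clustering_accuracy(pred, gold))  # 1.0
print(normalized_mutual_info_score(gold, pred))  # 1.0
```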

Baselines

K-Means on Embeddings
We build our methods on top of a baseline of K-Means clustering (Lloyd, 1982) over encoded data with K-Means++ cluster initialization (Arthur and Vassilvitskii, 2007). We choose the features and number of cluster centers that we use by task, largely following previous work.
Entity Canonicalization Following prior work (Vashishth et al., 2018; Shen et al., 2022), we cluster individual entity mentions (e.g. "ever since the ancient Greeks founded the city of Marseille in 600 BC.") by representing unique surface forms (e.g. "Marseille") globally, irrespective of their particular mention context. After clustering unique surface forms, we compose this cluster mapping onto the individual mentions (extracted from individual sentences) to obtain mention-level clusters. We build off of the "multi-view clustering" approach of Shen et al. (2022), and represent each noun phrase using textual mentions from the Internet and the "open" knowledge graph extracted from an OIE system, as shown in Figure 4. They use a BERT encoder (Devlin et al., 2019) to represent the textual context where an entity occurs (called the "context view"), and a TransE knowledge graph encoder (Bordes et al., 2013) to represent nodes in the open knowledge graph (called the "fact view"). They improve these encoders by finetuning the BERT encoder using weak supervision of coreferent entities and improving the knowledge graph representations using data augmentation on the knowledge graph. These two views of each entity are then combined to produce a representation.
In their original paper, they propose an alternating multi-view K-Means procedure where cluster assignments that are computed in one view are used to initialize cluster centroids in the other view. After a certain number of iterations, if the per-view clusterings do not agree, they perform a "conflict resolution" procedure to find a final clustering with low inertia in both views. One of our secondary contributions is a simplification of this algorithm. We find that by simply using their finetuned encoders, concatenating the representations from each view, and performing K-Means clustering with K-Means++ initialization (Arthur and Vassilvitskii, 2007) in a shared vector space, we can match their reported performance.
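The simplified procedure can be sketched as below. How (or whether) the two views are normalized before concatenation is not specified above, so the per-view L2 normalization is an assumption we add to keep either view from dominating the joint distance:

```python
import numpy as np
from sklearn.cluster import KMeans

def multiview_kmeans(context_view, fact_view, n_clusters, seed=0):
    """Concatenate per-entity views and cluster once with K-Means++ initialization."""
    def l2norm(v):
        # Assumed normalization step so both views contribute comparably.
        return v / np.linalg.norm(v, axis=1, keepdims=True)
    joint = np.concatenate([l2norm(context_view), l2norm(fact_view)], axis=1)
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=seed)
    return km.fit_predict(joint)

# Toy views with two clearly separated groups of entities.
context = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
fact = context.copy()
labels = multiview_kmeans(context, fact, n_clusters=2)
```

A single K-Means pass over the joint space replaces the alternating per-view clustering and conflict-resolution machinery.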
Finally, regarding the number of cluster centers, following the Log-Jump method of Shen et al. (2022), we choose 490 and 6687 clusters for OPIEC59k and ReVerb45k, respectively.
Intent Clustering For the Bank77 and CLINC datasets, we follow Zhang et al. (2023) and encode each user query using the Instructor encoder. We use a simple prompt to guide the encoder: "Represent utterances for intent classification". Again following previous work, we choose 150 and 77 clusters for CLINC and Bank77, respectively.

Clustering via Contrastive Learning
In addition to the methods described in Section 2, we also include two other methods for text clustering whose results have been previously reported: SCCL (Zhang et al., 2021) and ClusterLLM (Zhang et al., 2023). Both use contrastive learning of deep encoders to improve clusters, making these significantly more complicated and compute-intensive than our proposed methods. SCCL combines deep embedding clustering (Xie et al., 2015) with unsupervised contrastive learning to learn features from text. ClusterLLM uses LLMs to improve the learned features. After running hierarchical clustering, they also use triplet feedback from the LLM ("is point A more similar to point B or point C?") to decide the cluster granularity from the cluster hierarchy and generate a flat set of clusters. To compare effectively with these approaches, we use the same encoders reported for SCCL and ClusterLLM in prior work: Instructor (Su et al., 2022) for Bank77 and CLINC, and DistilBERT (finetuned for sentence similarity classification) (Sanh et al., 2019; Reimers and Gurevych, 2019) for Tweet.

Summary of Results
We summarize empirical results for entity canonicalization in Table 1 and text clustering in Table 2. We find that using the LLM to expand textual representations is the most effective, achieving state-of-the-art results on both canonicalization datasets and significantly outperforming a K-Means baseline for all text clustering datasets. Pairwise constraint K-Means, when provided with 20,000 pairwise constraints pseudo-labeled by an LLM, achieves strong performance on 3 of 5 datasets (beating the current state-of-the-art on OPIEC59k). Below, we conduct more in-depth analyses on what makes each method (in-)effective.

LLMs excel at text expansion
In Table 1 and Table 2, we see that the "Keyphrase Clustering" approach is our strongest approach, achieving the best results on 3 of 5 datasets (and giving comparable performance to the next strongest method, pseudo-oracle PCKMeans, on the other 2 datasets). This suggests that LLMs are useful for expanding the contents of text to facilitate clustering. What makes LLMs useful in this capacity? Is it the ability to specify task-specific modeling instructions, the ability to implicitly specify a similarity function via demonstrations, or do LLMs contain knowledge that smaller neural encoders lack?
We answer this question with an ablation study. For OPIEC59k and CLINC, we consider the "Keyphrase Clustering" technique but omit either the instruction or the demonstration examples from the prompt. For CLINC, we also compare with K-Means clustering on features from the Instructor model, which allows us to specify a short instruction to a small encoder. We find empirically that providing either instructions or demonstrations in the prompt to the LLM enables the LLM to improve cluster quality, but that providing both gives the most consistent positive effect. Qualitatively, providing instructions but omitting demonstrations leads to a larger set of keyphrases with less consistency, while providing demonstrations without any instructions leads to a more focused group of keyphrases that sometimes fail to reflect the desired aspect (e.g. topic vs. intent).

Table 3: We compare the effect of LLM intervention without demonstrations or without instructions. We see that GPT-3.5-based Keyphrase Clustering outperforms instruction-finetuned encoders of different sizes, even when we provide the same prompt.
Why is keyphrase clustering using GPT-3.5 in the instruction-only ("without demonstrations") setting better than Instructor, which is an instruction-finetuned encoder? While GPT-3.5's size is not published, GPT-3 contains 175B parameters, and Instructor-base/large/xl contain 110M, 335M, and 1.5B parameters, respectively. The modest scaling curve across Instructor sizes suggests that scale is not solely responsible.
Our prompts for Instructor are brief (e.g. "Represent utterances for intent classification"), while our prompts for GPT-3.5 (in Appendix B) are very detailed. Instructor-XL does not handle long prompts well; in the bottom row of Table 3, we see that Instructor-XL performs poorly when given the same prompt that we give to GPT-3.5. We speculate that today's instruction-finetuned encoders are insufficient to support the detailed, task-specific prompts that facilitate few-shot clustering.

The limitations of LLM post-correction
LLM post-correction consistently provides only small gains across datasets and metrics: between 0.1 and 5.2 absolute points of improvement. In Table 4, we see that when we provide the top 500 most-uncertain cluster assignments to the LLM to reconsider, the LLM only reassigns points in a small minority of cases. Though the LLM pairwise oracle is usually accurate, it is disproportionately inaccurate for points where the original clustering already had low confidence.

How much does LLM guidance cost?
We've shown that using an LLM to guide the clustering process can improve cluster quality. However, large language models can be expensive; using a commercial LLM API during clustering imposes additional costs on the clustering process.
In Table 5, we summarize the pseudo-labeling cost of collecting LLM feedback using our three approaches. Among our three proposed approaches, pseudo-labeling pairwise constraints using an LLM (where the LLM must classify 20K pairs of points) incurs the greatest LLM API cost. While PCKMeans and LLM Correction both query the LLM the same number of times for each dataset, Keyphrase Clustering's cost scales linearly with the size of the dataset, making it infeasible for clustering very large corpora.

Using an LLM as a pseudo-oracle is cost-effective
Using large language models increases the cost of clustering. Does the improved performance justify this cost? By employing a human expert to guide the clustering process instead of a large language model, could one achieve better results at a comparable cost? Since pseudo-labeling pairwise constraints requires the greatest API cost in our experiments, we take this approach as a case study. Given a sufficient amount of pseudo-oracle feedback, we see in Figure 5 that pairwise constraint K-Means is able to yield an improvement in Macro F1 (suggesting better purity of clusters) without dramatically reducing Pairwise or Micro F1.
Is this cost reasonable? For the $41 spent on the OpenAI API for OPIEC59k (as shown in Table 5), one could hire a worker for 3.7 hours of labeling time, assuming an $11-per-hour wage (Hara et al., 2017). We observe that an annotator can label roughly 3 pairs per minute. Then, $41 in worker wages would generate fewer than 700 human labels at the same cost as 20K GPT-3.5 labels.
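The arithmetic behind this comparison is worth making explicit (the wage and labeling rate are the assumptions stated above):

```python
# Back-of-envelope check of the labeling-cost comparison in the text.
api_cost = 41.0          # USD spent on GPT-3.5 labels for OPIEC59k (Table 5)
hourly_wage = 11.0       # assumed annotator wage, USD per hour
pairs_per_minute = 3     # observed human labeling rate

hours = api_cost / hourly_wage
human_labels = hours * 60 * pairs_per_minute
print(round(hours, 1), int(human_labels))  # 3.7 670
```

At the same $41 budget, a human oracle provides roughly 670 constraints versus 20,000 from the LLM.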
Based on the feedback curve in Figure 5, we see that GPT-3.5 is remarkably more cost-effective than a true pairwise constraint oracle at this price point; unless at least 2500 pairs labeled by a true oracle are provided, pairwise constraint K-Means fails to deliver any value for entity canonicalization. This suggests that if the goal is maximizing empirical performance, querying an LLM is more cost-effective than employing a human labeler.

Conclusion
We find that using LLMs in simple ways can provide consistent improvements to the quality of clusters for a variety of text clustering tasks.We find that LLMs are most consistently useful as a means of enriching document representations, and we believe that our simple proof-of-concept should motivate more elaborate approaches for document expansion via LLMs.

A Pairwise Constraint Pseudo-Oracle Prompt
We use a task-specific prompt for the pairwise constraint pseudo-oracle, using 4 entities from each dataset as demonstration examples.We provide an example of one of the exact prompts used, for reference:

OPIEC59k Prompt
You are tasked with clustering entity strings based on whether they refer to the same Wikipedia article. To do this, you will be given pairs of entity names and asked if their anchor text, if used separately to link to a Wikipedia article, is likely referring to the same article. Entity names may be truncated, abbreviated, or ambiguous.
To help you make this determination, you will be given up to three context sentences from Wikipedia where the entity is used as anchor text for a hyperlink. Amongst each set of examples for a given entity, the entity in all three sentences links to the same Wikipedia article.
Based on these examples, you will decide whether the first entity and the second entity listed would likely link to the same Wikipedia article if used as separate anchor text.
Please note that the context sentences may not be representative of the entity's typical usage, but should aid in resolving the ambiguity of entities that have similar or overlapping meanings.
To avoid subjective decisions, the decision should be based on a strict set of criteria, such as whether the entities will generally be used in the same contexts, whether the context sentences mention the same topic, and whether the entities have the same domain and scope of meaning.
Your task will be considered successful if the entities are clustered into groups that consistently refer to the same Wikipedia articles.

B Keyphrase Expansion Prompt
We provide a domain-specific prompt to the keyphrase expansion clusterer, using 4 entities from each dataset as demonstration examples. We provide the exact prompts used for reference:

OPIEC59k Prompt

I am trying to cluster entity strings on Wikipedia according to the Wikipedia article title they refer to. To help me with this, for a given entity name, please provide me with a comprehensive set of alternative names that could refer to the same entity. Entities may be weirdly truncated or ambiguous, e.g. "Wind" may refer to the band "Earth, Wind, and Fire" or to "rescue service".

Figure 2: We expand document representations by concatenating them with keyphrase embeddings. The keyphrases are generated by a large language model.

Figure 4: Using the CMVC architecture, we encode a knowledge graph-based "fact view" and a text-based "context view" to represent each entity.

Figure 5: Collecting more pseudo-oracle feedback for pairwise constraint K-Means on OPIEC59k improves the Macro F1 metric without reducing other metrics. Compared to the same algorithm with true oracle constraints, we see the sensitivity of this algorithm to a noisy oracle.

Table 1: Comparison with Shen et al. (2022) for integrating LLMs into entity canonicalization. "CMVC" refers to the multi-view clustering method of Shen et al. (2022), while "KMeans" refers to our simplified reimplementation of the same method. Where applicable, standard deviations are obtained by running clustering 5 times with different seeds.

Table 2: Comparison with prior work for integrating LLMs into text clustering. "SCCL" refers to Zhang et al. (2021), while "ClusterLLM" refers to Zhang et al. (2023). We use the same base encoders as those methods in our experiments. Where applicable, standard deviations are obtained by running clustering 5 times with different seeds.

Table 4: When re-ranking the top 500 points in each dataset, the LLM rarely disagrees with the original clustering, and when it does, it is frequently wrong.
For each entity, I will provide you with a sentence where this entity is used to help you understand what this entity refers to.Gen-erate a comprehensive set of alternate entity names as a JSON-formatted list.Ray was honored to be a host of the informal lunch for Queen Elizabeth 's visit to Toronto ." 3) "Its 1977 premiere was staged at the Royal Festival Hall in London as part of Queen Elizabeth 's Silver Jubilee .""