Abstract
Few-shot Aspect Category Sentiment Analysis (ACSA) is a crucial task for aspect-based sentiment analysis, which aims to detect the sentiment polarity of a given aspect category in a sentence with limited data. However, existing few-shot learning methods focus on distance metrics between the query and support sets to classify queries, heavily relying on aspect distributions in the embedding space. Thus, they suffer from overlapping distributions of aspect embeddings caused by irrelevant sentiment noise among sentences with multiple sentiment aspects, leading to misclassifications. To solve these issues, we propose a metric-free method for few-shot ACSA, which models the associated relations among the aspects of support and query sentences by Dual Relations Propagation (DRP), addressing the adverse effect of overlapping distributions. Specifically, DRP uses the dual relations (similarity and diversity) among the aspects of support and query sentences to explore intra-cluster commonality and inter-cluster uniqueness, alleviating sentiment noise and enhancing aspect features. Additionally, the dual relations are transformed from support-query to class-query to promote query inference by learning class knowledge. Experiments show that our method achieves convincing performance on few-shot ACSA, notably an average improvement of 2.93% in accuracy and 2.10% in F1 score in the 3-way 1-shot setting.
1 Introduction
Aspect Category Sentiment Analysis (ACSA) (Seoh et al., 2021; Cai et al., 2021; Xiao et al., 2021; Chen et al., 2022a; Li et al., 2022a, b) is a fine-grained sentiment analysis task, which aims to identify sentiment polarity for a given aspect category in a sentence. For example, given a predefined aspect category “Staff” and a sentence “High rates for just ok room but the server keeps me waiting 1.5 hours”, ACSA aims to identify sentiment polarity towards the aspect “Staff” in the sentence. Briefly, given an aspect category and a sentence, an aspect embedding is obtained from the original sentence to predict the sentiment polarity of the aspect category in the sentence.
Existing methods mostly rely on sufficient labeled data for each aspect category. Though effective, they assume training and testing share a predefined set of aspects. However, this assumption becomes problematic in real-world scenarios with many unseen aspect categories. Annotating abundant data for these emerging aspects poses a significant challenge, and there is a burden of retraining models for newly encountered aspects. Therefore, generalizing experiences from seen aspect categories to unseen ones becomes crucial. This is where few-shot ACSA becomes indispensable.
Existing few-shot learning methods (e.g., meta-learning) mostly focus on distance metrics (Yang et al., 2020; Wang et al., 2021; Lv et al., 2021; Assran et al., 2022; Liu et al., 2022a). Among these methods, the prototypical network is a well-known distance-metric method due to its impressive performance. The prototypical network uses the support set to generate a prototype for each class and then classifies a query by measuring its distance (e.g., Euclidean distance or cosine similarity) to the different prototypes in the embedding space.
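For illustration, a minimal sketch of prototypical-network classification is given below; the function name and tensor layout are our own, and support embeddings are assumed to arrive grouped by class.

```python
import torch

def prototype_classify(support: torch.Tensor, query: torch.Tensor,
                       n_way: int, k_shot: int) -> torch.Tensor:
    """Predict query classes by nearest-prototype Euclidean distance.

    support: [n_way * k_shot, D] embeddings grouped by class; query: [Q, D].
    """
    # One prototype per class: the mean of its k_shot support embeddings.
    prototypes = support.view(n_way, k_shot, -1).mean(dim=1)  # [n_way, D]
    # Euclidean distance between each query and each prototype.
    dists = torch.cdist(query, prototypes, p=2)               # [Q, n_way]
    return dists.argmin(dim=1)                                # [Q]
```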
Though few-shot learning achieves impressive progress, challenging issues remain for the few-shot ACSA task. Specifically, simple distance metrics struggle to address overlapping distributions of aspect embeddings caused by irrelevant sentiment noise in scenarios (e.g., Table 1) where each sentence may include many aspects with different sentiment polarities. Generally, overlapping distributions present an unclear decision boundary among aspect embeddings with different sentiment polarities, causing misclassifications. Take the example in Figure 1: the aspects “Service” and “Price” are closer to each other than “Service” is to the other aspects, indicating that they tend to have the same sentiment polarity. In reality, however, they have opposite sentiment polarities, resulting in wrong predictions. Recent efforts have been devoted to these issues. Liang et al. (2023) explored external knowledge (e.g., aspect-associated words and aspect semantics) to alleviate irrelevant sentiment noise and enhance aspect embeddings. However, maintaining and updating the knowledge base requires domain experts, making it resource-intensive, and the difficulty of collecting abundant knowledge for unseen aspect categories limits scalability. Therefore, the mentioned issues remain a considerable challenge, and as research on few-shot ACSA is still in its early stages, a novel method for this task is needed.
To solve the above issues, we propose a metric-free method that addresses the few-shot ACSA task by modeling Dual Relations Propagation (DRP). Following the meta-learning formulation, DRP is designed on a relation graph to explicitly model the dual relations (i.e., the similarity relation¹ and the diversity relation²) among support and query nodes. In the relation graph, each node is a sentence-aspect pair, and its aspect embedding is taken as the node feature in the embedding space. Additionally, the dual relations are formalized as two undirected edges between a node pair, and each relation has an associated strength measure. Briefly, the similarity relation gives a similarity strength between a connected node pair, and the diversity relation gives a discrepancy strength between them. With the relation graph, the proposed method propagates and aggregates the dual relations to explore intra-cluster commonality and inter-cluster uniqueness, alleviating irrelevant sentiment noise and enhancing node and edge features. Also, the dual relations are transformed from support-query to class-query to guide query inference by learning sentiment class knowledge from the relation graph. Extensive experiments show that the proposed method outperforms strong baselines. Notably, it surpasses the latest baseline by 2.93% accuracy and 2.10% F1 score on average in the 3-way 1-shot setting. The contributions are summarized as follows:
An effective metric-free method for the few-shot ACSA task is proposed by modeling dual relations propagation, which exploits the similarity and diversity among the support and query sets to explore intra-cluster commonality and inter-cluster uniqueness, counteracting the adverse effect of overlapping distributions caused by irrelevant sentiment noise.
The proposed method transforms the dual relations from support-query to class-query to promote query inference by learning sentiment class knowledge.
Extensive experiments on four benchmark datasets show that the proposed method outperforms strong baselines and achieves strong performance on few-shot ACSA.
2 Related Work
2.1 Aspect Category Sentiment Analysis
The ACSA task aims to detect sentiment polarity for a specific aspect mentioned in a sentence. Generally, it is used in recommendation systems (Cui et al., 2020; Jannach et al., 2021; Ahmadian et al., 2022) and intention detection (Hou et al., 2021; Chen et al., 2022c; Zhou et al., 2022) to understand the fine-grained sentiment of users. In recent years, ACSA has attracted the attention of researchers and developers.
Conventional methods focus on handcraft-based and attention-based methods. Handcraft-based methods (Ding et al., 2015; Liu et al., 2015) utilize handcrafted features to establish the dependency between a specific aspect and its context. Attention-based methods (Su et al., 2021; Wu et al., 2021; Liu et al., 2021) capture the interaction between an aspect and its context. Recently, some syntax-aware methods (Tian et al., 2021; Li et al., 2021b; Xiao et al., 2022; Effland and Collins, 2023) utilized Graph Neural Networks (GNN) based on syntactical dependency trees to exploit syntactic structure information. However, these methods heavily rely on labeled data and may fail to solve unseen aspect categories. Therefore, few-shot learning is of great importance.
2.2 Few-Shot Learning
Few-shot learning (Tsendsuren and Hong, 2017; Lee et al., 2019b; Zhang et al., 2022a) matches the human learning process in that the few-shot learner leverages a few labeled samples to obtain new knowledge based on prior knowledge. Few-shot learning has achieved promising progress in Computer Vision (CV) (Huang et al., 2021; Hu et al., 2022; Liu et al., 2022b; Ouyang et al., 2022), Natural Language Processing (NLP) (Hu et al., 2021; Tan et al., 2022; Chen et al., 2022b; Gao et al., 2022), etc. Especially in NLP, a number of research works exist on few-shot learning, such as few-shot aspect category detection (Zhao et al., 2023), few-shot named entity recognition (Fang et al., 2023; Xu et al., 2023; Ma et al., 2023), few-shot relation extraction (Chen et al., 2023; Li et al., 2023), etc.
Few-shot learning mainly contains meta-learning, prompt learning, and data augmentation. Meta-learning (Sung et al., 2018) leverages prior experiences to enable the model to obtain learning abilities and generalize them to new fields. Prompt learning (Lu et al., 2022) constructs task-related prompts to guide large language models to generate task-specific outputs. Data augmentation (Zhang et al., 2020) transforms existing samples to expand the dataset to promote the model to learn data patterns and features.
2.3 Meta-Learning
In recent years, meta-learning has been the main few-shot learning method due to its impressive performance, including model-based (Tsendsuren and Hong, 2017), optimization-based (Lee et al., 2019b), and metric-based (Assran et al., 2022; Wang et al., 2021; Lv et al., 2021) methods. Among them, metric-based methods are the most popular line of research in meta-learning due to their simplicity and effectiveness. The main idea (Yu et al., 2022; Zhang et al., 2022b) is to use an episode paradigm to project support and query samples to an embedding space and then measure their distances to predict query labels. However, these methods heavily rely on aspect distributions in the embedding space. Therefore, they suffer from overlapping distributions of aspect embeddings caused by irrelevant sentiment noise among sentences with different sentiment aspects. Recently, Hosseini-Asl et al. (2022) proposed a generative method to explore aspect semantics to capture the interactions between a specific aspect and its context. More recently, Liang et al. (2023) leveraged aspect-associated words from an external knowledge base (Cambria et al., 2020) to construct two auxiliary sentences to enhance aspect embeddings. Unfortunately, their improvements are limited due to the complexity of semantic relations and knowledge structures. Therefore, these methods still struggle to handle irrelevant sentiment noise in scenarios where each sentence contains different sentiment aspects.
Unlike the mentioned methods, the proposed method explores the shared features among samples in a class and diverse features in separate classes to model the dual relations (similarity and diversity) among samples. With relation propagation and aggregation, the proposed method alleviates irrelevant sentiment noise and enhances sample features to improve performance on the few-shot ACSA task. Compared to previous methods, the proposed method works well in scenarios where each sentence contains different sentiment aspects.
3 Proposed Method
The overall architecture of the proposed method is shown in Figure 2. Broadly, the proposed method includes four components: relation graph construction, support-query relation propagation, class-query relation transformation, and training objective. Here, we present the proposed method in detail.
3.1 Problem Formulation
In the meta-val or meta-test phase, the meta-learner aims to verify the effectiveness of the model on $\mathcal{D}_{val}$ or $\mathcal{D}_{test}$. Unlike the meta-train phase, the meta-learner constructs a fixed support set $\mathcal{S}$ for each aspect based on $\mathcal{D}_{val}$ or $\mathcal{D}_{test}$ and then takes the remaining samples in the dataset as the query set $\mathcal{Q}$, as shown in Table 1. The meta-learner uses $\mathcal{S}$ to predict the labels of the query samples in $\mathcal{Q}$ and evaluates the performance of the proposed method. Finally, we report the meta-test results obtained at the point where the meta-val phase achieves its best results.
3.2 Overall Framework
As shown in Figure 2, the proposed method includes four components: relation graph construction, support-query relation propagation, class-query relation transformation, and training objective. Specifically, the proposed method designs a simple yet effective relation graph. The relation graph is an undirected, fully connected graph that aims to model dual relations (i.e., similarity and diversity) among support and query nodes. In the relation graph, each node is a sentence-aspect pair, and its aspect embedding (see Equation 7) is considered as the node feature. Additionally, two edges are used to connect two nodes, and edge features indicate the similarity and diversity strength between these two nodes.
With the relation graph, the proposed method propagates dual relations to enrich node features from edges to nodes and aggregates dual relations to update edge features from nodes to edges. Briefly, the propagation and aggregation of the dual relations enhance node and edge features and alleviate irrelevant sentiment noise by exploring intra-cluster commonality and inter-cluster uniqueness. Besides, the dual relations are transformed from support-query to class-query to promote query inference effectively.
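To make this concrete, here is a schematic, non-authoritative sketch of one propagation-aggregation round under our own simplifying assumptions; the update networks `f_node`, `f_sim`, and `f_div` are hypothetical stand-ins for the paper's learned modules.

```python
import torch
import torch.nn as nn

class DualRelationLayer(nn.Module):
    """One round: propagate edge strengths into nodes, then re-estimate edges."""
    def __init__(self, dim: int):
        super().__init__()
        self.f_node = nn.Linear(2 * dim, dim)                        # fuses both contexts
        self.f_sim = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # similarity strength
        self.f_div = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # diversity strength

    def forward(self, v, e_sim, e_div):
        # v: [M, D] node features; e_sim, e_div: [M, M] edge strengths.
        # Propagation (edges -> nodes): gather neighbors weighted by each relation.
        sim_ctx = (e_sim / (e_sim.sum(-1, keepdim=True) + 1e-8)) @ v  # intra-cluster commonality
        div_ctx = (e_div / (e_div.sum(-1, keepdim=True) + 1e-8)) @ v  # inter-cluster uniqueness
        v = v + self.f_node(torch.cat([sim_ctx, div_ctx], dim=-1))
        # Aggregation (nodes -> edges): re-estimate strengths from updated node pairs.
        pair = (v.unsqueeze(0) - v.unsqueeze(1)).abs()                # [M, M, D]
        return v, self.f_sim(pair).squeeze(-1), self.f_div(pair).squeeze(-1)
```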
3.3 Relation Graph Construction
We present the relation graph construction, including node initialization and edge initialization, as shown on the left of Figure 2. The relation graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}^{s}, \mathcal{E}^{d})$, where $\mathcal{V} = \{V_i\}_{i=1}^{M}$ denotes the set of nodes, $\mathcal{E}^{s}$ and $\mathcal{E}^{d}$ indicate the similarity edges and diversity edges, respectively, and $M$ is the total number of nodes. Briefly, two edges are built as dual bridges between two adjacent nodes to represent the similarity and diversity relations of the node pair. Besides, $v_i$ represents the features of node $V_i$, while $e^{s}_{ij}$ and $e^{d}_{ij}$ indicate the features of edges $E^{s}_{ij}$ and $E^{d}_{ij}$. The initialization of nodes and edges in the relation graph is presented below.
3.3.1 Node Initialization
3.3.2 Edge Initialization
Between a node pair $V_i$ and $V_j$, two edges $E^{s}_{ij}$ and $E^{d}_{ij}$ represent the similarity and diversity relations of the two nodes, and the edge features $e^{s}_{ij}$ and $e^{d}_{ij}$ represent the strengths of the similarity and diversity relations. Briefly, $e^{s}_{ij}$ is the probability that nodes $V_i$ and $V_j$ belong to the same class, while $e^{d}_{ij}$ is the probability that they belong to different classes.
In the relation graph, for inter-cluster nodes, the lower the similarity strength, the greater the difference between them; conversely, for intra-cluster nodes, the lower the diversity strength, the more common features they share.
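As a concrete illustration, the following is a hedged sketch of one plausible edge initialization, assuming support labels are known and query-related edges start from an uninformative 0.5; the function name and the complement-based diversity initialization are our own assumptions, not necessarily the paper's exact scheme.

```python
import torch

def init_edges(support_labels: torch.Tensor, num_query: int):
    """support_labels: [S] class ids; returns (e_sim, e_div), each [M, M]."""
    s = support_labels.numel()
    m = s + num_query
    # Pairs involving a query node start from an uninformative prior of 0.5.
    e_sim = torch.full((m, m), 0.5)
    # Support-support pairs: 1 if the two nodes share a class, else 0.
    same = (support_labels.unsqueeze(0) == support_labels.unsqueeze(1)).float()
    e_sim[:s, :s] = same
    e_div = 1.0 - e_sim  # assumed: diversity initialized as the complement
    return e_sim, e_div
```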
3.4 Support-Query Relation Propagation
The relation graph promotes the modeling of support-query relations. As shown in Figure 2, this component includes dual relations propagation and aggregation, which aim to learn discriminative node and edge features.
3.4.1 Dual Relations Propagation
3.4.2 Dual Relations Aggregation
During relation propagation, the sentiment label of a query node can be predicted by final edge voting with support labels. However, edge voting makes the prediction difficult because the relation graph contains many edges. Therefore, the proposed method transforms support-query relations into class-query relations to promote query inference effectively.
3.5 Class-Query Relation Transformation
As shown in Figure 2, the proposed method transforms dual relations from support-query to class-query by learning sentiment class knowledge, which models the relations between a query and different sentiment classes to promote query inference further.
3.5.1 Class Node Generation
3.5.2 Dual Relations Transformation
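Since these two steps are best seen end to end, the following is a speculative sketch under our own assumptions: each class node is a simple mean of its support nodes (the paper learns this aggregation), and `f_sim`/`f_div` are hypothetical edge-strength networks like those in the earlier propagation sketch.

```python
import torch

def class_query_relations(v, support_labels, query_idx, f_sim, f_div):
    """v: [M, D] node features with support nodes first; support_labels: [S]."""
    s = support_labels.numel()
    classes = support_labels.unique()
    # Class node generation: aggregate each class's support nodes.
    class_nodes = torch.stack([v[:s][support_labels == c].mean(0) for c in classes])
    # Dual relations transformation: re-estimate strengths between queries and classes.
    q = v[query_idx]                                          # [Q, D]
    pair = (q.unsqueeze(1) - class_nodes.unsqueeze(0)).abs()  # [Q, C, D]
    z_sim = f_sim(pair).squeeze(-1)                           # class-query similarity [Q, C]
    z_div = f_div(pair).squeeze(-1)                           # class-query diversity  [Q, C]
    return z_sim.argmax(dim=1), z_sim, z_div                  # predict by strongest similarity
```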
3.6 Training Objective
During the meta-train phase, suppose there are $k$ sentiment classes. Given a query $q_i$, the strength scores of the similarity relation between it and the sentiment classes are defined as $Z_i = \{z_{i1}, z_{i2}, \ldots, z_{ik}\} \in \mathbb{R}^{k}$, where $z_{ij}$ ($j \in \{1, 2, \ldots, k\}$) denotes the similarity strength between $q_i$ and the $j$-th class. The one-hot label of $q_i$ is defined as $y_i = \{y_{i1}, y_{i2}, \ldots, y_{ik}\} \in \{0, 1\}^{k}$, where $y_{ij} = 1$ indicates that $q_i$ belongs to the $j$-th class. For training, we define the positive set $\mathcal{P} = \{z_{ij} \mid y_{ij} = 1\}$ and the negative set $\mathcal{N} = \{z_{ij} \mid y_{ij} = 0\}$.
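The exact loss is omitted here, but a hedged sketch of a margin-based contrastive objective over $\mathcal{P}$ and $\mathcal{N}$, consistent with the definitions above though not necessarily the paper's exact formulation, might look as follows.

```python
import torch
import torch.nn.functional as F

def contrastive_objective(z: torch.Tensor, y: torch.Tensor, margin: float = 0.5):
    """z: [Q, k] similarity strengths; y: [Q, k] one-hot labels (float)."""
    pos = (z * y).sum(dim=1, keepdim=True)  # positive-set scores (y_ij = 1)
    neg = z * (1 - y)                       # negative-set scores (y_ij = 0)
    # Each negative score should trail the positive score by at least `margin`.
    return (F.relu(margin - pos + neg) * (1 - y)).sum(dim=1).mean()
```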
4 Experiments
4.1 Experimental Setup
Datasets.
Extensive experiments are conducted on four datasets: Rest I, Rest II, Lap, and Mams, all collected by Liang et al. (2023) for the few-shot ACSA task. Rest I and Rest II originate from the restaurant domain, with Rest II providing fine-grained aspects (entity#attribute) compared to Rest I (entity). Lap is obtained from the laptop domain to explore performance in another domain. In these three datasets, most sentences contain only one aspect or multiple aspects with the same sentiment polarity. In contrast, Mams presents a more complex scenario, where each sentence includes many aspects with different sentiment polarities. The detailed statistics are presented in Table 2. Our code and data are available at https://github.com/sentiments-Ananda/FSACSA.
Evaluation Metric.
Implementation Details.
The proposed method is implemented with PyTorch (version 1.10.0). The uncased English version of BERT is our encoder for H (see Equation 6). In practice, fine-tuning the bottom layers of large language models is often unnecessary (Lee et al., 2019a), so we freeze the first six layers of BERT to reduce the number of trainable parameters. We conduct experiments on a single GPU (RTX 3090 Ti) with CUDA version 11.3. The model is trained with the AdamW optimizer. To ensure a fair comparison, we follow Liang et al. (2023) and obtain experimental results using four-fold cross-validation. For example, if a dataset has eight aspects, these aspects are divided into four folds. We take each fold in turn as the testing set and the remaining folds as the validation and training sets, with a splitting proportion of 1:1:2 for testing, validation, and training. A schematic diagram of the four-fold cross-validation is shown in Figure 3. We thus obtain four results, whose average is reported to evaluate the performance of the proposed method.
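For reference, a minimal sketch of the layer-freezing setup described above, written with the HuggingFace `transformers` BERT implementation (the authors' exact code may differ):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# Freeze the bottom six of the twelve BERT-base encoder layers.
for layer in model.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```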
4.2 Baselines
We compare the proposed method with a series of strong baselines to evaluate performance on the few-shot ACSA task.
Question-Driven (Sun et al., 2019): For an aspect, it designs a corresponding question prompt to guide a large language model (e.g., BERT) to identify sentiment polarity towards the aspect. For example, given the prompt “The polarity of the aspect safety is positive”, the language model outputs the probability of yes as the matching score to determine whether the sentiment of safety is positive. Although the method achieves impressive performance, it heavily relies on the quality of prompts, and finding an optimal prompt is difficult.
MIMLLN (Li et al., 2020): In a sentence, it first extracts some aspect-associated words to depict the context of a specific aspect. Then, the method combines the sentiment information of these words to predict the overall sentiment polarity towards the aspect. Though effective, it focuses on the sentiment of individual words and fails to capture the entire semantic content.
CapsNet (Jiang et al., 2019): It designs a capsule-guided routing method to model the interactions between an aspect and its contexts. Specifically, the method constructs a set of capsules by linear transformation and squashing activation (Sabour et al., 2017). These capsules use aspect-associated words to construct a sentiment matrix to learn some sentiment knowledge for a specific aspect. Then, the method utilizes the sentiment matrix to learn the relationship between an aspect and its contexts for predicting the sentiment label of the aspect.
Relation Network (Sung et al., 2018): In the meta-learning formulation, a neural network computes the similarity scores between each query sample and all support samples. The similarity score represents the relation strength between the query sample and different support samples. Therefore, the method leverages the support label that exhibits the highest similarity to the query to deduce the label of a query sample.
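A schematic sketch of this scoring scheme is shown below, assuming 768-dimensional BERT embeddings; the MLP architecture and names are illustrative rather than the original configuration.

```python
import torch
import torch.nn as nn

relation_mlp = nn.Sequential(
    nn.Linear(2 * 768, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
)

def relation_scores(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """query: [Q, 768]; support: [S, 768]; returns [Q, S] relation strengths."""
    q = query.unsqueeze(1).expand(-1, support.size(0), -1)  # [Q, S, 768]
    s = support.unsqueeze(0).expand(query.size(0), -1, -1)  # [Q, S, 768]
    # Concatenate each query-support pair and score its relation strength.
    return relation_mlp(torch.cat([q, s], dim=-1)).squeeze(-1)
```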
Induction Network (Geng et al., 2019): It performs a matrix transformation on support samples to generate a class embedding for each sentiment label. Then, a neural tensor network (Geng et al., 2017) computes the similarity scores between each query and all class embeddings to determine which class matches the query.
MTM (Deng et al., 2020): It designs a meta-pretraining strategy for a large language model (e.g., BERT) to learn task-agnostic general features that extract linguistic properties to benefit downstream few-shot learning tasks. Then, task-specific parameters are fine-tuned on the large language model for the few-shot ACSA task to enable predictions aligning with its specific requirements.
AFML (Liang et al., 2023): It uses an existing knowledge-based method (Liang et al., 2021) to collect highly aspect-associated words from an external knowledge source (e.g., SenticNet [Cambria et al., 2020]). Then, it constructs two auxiliary sentences by masking aspect-associated words and masking non-aspect words in the original sentence. Finally, it combines these two auxiliary sentences and the original sentence to enhance the features of a specific aspect and highlight the significant contextual sentiment clues of the specific aspect to promote the sentiment prediction of the aspect.
T5 (Raffel et al., 2020): It adopts an encoder-decoder architecture, where the few-shot ACSA task could be formulated as a text-to-text problem. Specifically, the encoder part encodes a sentence into hidden states, and the decoder part takes the encoder outputs and a specific aspect as inputs to identify the sentiment polarity of the aspect. In experiments, we use T5-base to evaluate the performance of T5 for the few-shot ACSA task.
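As an illustration, casting ACSA as text-to-text with T5-base could look like the sketch below; the prompt template is our own guess, not the paper's exact format.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = ("acsa aspect: Staff sentence: High rates for just ok room "
        "but the server keeps me waiting 1.5 hours")
out = model.generate(**tok(text, return_tensors="pt"), max_new_tokens=4)
print(tok.decode(out[0], skip_special_tokens=True))  # e.g., "negative"
```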
MetaAdapt (Yue et al., 2023): Based on meta-learning, it proposes a few-shot domain adaptation method. The method divides a dataset into a source domain and a target domain, constructing the support set in the source domain and the query set in the target domain. It then trains the model on the support set to obtain gradients and evaluates the model on the query set to get second-order gradients w.r.t. the original parameters. Additionally, it computes the similarity between the original and second-order gradients to select more “informative” support samples, which are used to reweight the support set and optimize model performance on the query set. Therefore, the model can optimally adapt to the target distribution with the provided source-domain knowledge.
5 Analysis & Discussion
5.1 Overall Performance
We conduct extensive experiments with 3/2-way and 1/5-shot settings on the Rest I, Rest II, Lap, and Mams datasets. The results are reported in Tables 3, 4, 5, and 6, where the best scores are highlighted in bold, with the following observations.
Table 3: Accuracy (%) in the 3-way and 2-way settings on Rest I and Rest II. Best scores are in bold.

| Models | Rest I 3-way 1-shot | Rest I 3-way 5-shot | Rest I 2-way 1-shot | Rest I 2-way 5-shot | Rest II 3-way 1-shot | Rest II 3-way 5-shot | Rest II 2-way 1-shot | Rest II 2-way 5-shot |
|---|---|---|---|---|---|---|---|---|
| Relation Network (Sung et al., 2018) | 53.25 | 70.36 | 70.72 | 87.13 | 58.32 | 73.50 | 78.72 | 82.25 |
| MTM (Deng et al., 2020) | 53.93 | 57.15 | 66.23 | 69.10 | 63.12 | 63.71 | 72.13 | 73.87 |
| Induction Network (Geng et al., 2019) | 72.03 | 74.75 | 82.77 | 85.96 | 76.53 | 78.17 | 82.70 | 83.55 |
| MIMLLN (Li et al., 2020) | 74.63 | 74.07 | 87.19 | 87.32 | 79.21 | 78.46 | 81.26 | 81.97 |
| Question-Driven (Sun et al., 2019) | 74.66 | 74.79 | 87.52 | 86.83 | 78.69 | 79.84 | 82.63 | 83.31 |
| CapsNet (Jiang et al., 2019) | 75.18 | 73.92 | 87.10 | 87.15 | 78.72 | 80.11 | 81.57 | 82.18 |
| MetaAdapt (Yue et al., 2023) | 65.06 | 74.74 | 87.69 | 86.42 | 70.20 | 79.44 | 82.68 | 83.58 |
| T5 (Raffel et al., 2020) | 77.01 | 78.15 | 81.61 | 85.85 | 81.26 | 82.36 | 85.48 | **87.57** |
| AFML (Liang et al., 2023) | 77.13 | 77.53 | 88.12 | **88.79** | 81.56 | 81.95 | 83.89 | 84.15 |
| Our method | **78.17** | **78.64** | **88.75** | 87.74 | **81.58** | **84.18** | **87.22** | 86.85 |
Table 4: Accuracy (%) in the 3-way and 2-way settings on Lap and Mams. Best scores are in bold.

| Models | Lap 3-way 1-shot | Lap 3-way 5-shot | Lap 2-way 1-shot | Lap 2-way 5-shot | Mams 3-way 1-shot | Mams 3-way 5-shot | Mams 2-way 1-shot | Mams 2-way 5-shot |
|---|---|---|---|---|---|---|---|---|
| Relation Network (Sung et al., 2018) | 57.15 | 69.03 | 80.80 | 85.91 | 37.20 | 36.91 | 58.32 | 62.19 |
| MTM (Deng et al., 2020) | 51.99 | 53.22 | 66.15 | 68.19 | 37.58 | 36.26 | 58.33 | 57.90 |
| Induction Network (Geng et al., 2019) | 70.01 | 70.53 | 87.18 | 86.43 | 38.20 | 35.46 | 62.75 | 59.31 |
| MIMLLN (Li et al., 2020) | 68.79 | 70.03 | 87.03 | 86.18 | 36.52 | 37.43 | 62.30 | 63.17 |
| Question-Driven (Sun et al., 2019) | 70.30 | 68.82 | 86.17 | 87.15 | 36.08 | 35.44 | 63.17 | 61.05 |
| CapsNet (Jiang et al., 2019) | 71.53 | 69.82 | 86.82 | 86.73 | 37.12 | 36.98 | 61.61 | 63.75 |
| MetaAdapt (Yue et al., 2023) | 60.76 | 69.45 | 87.36 | 87.52 | 42.03 | 43.60 | 62.04 | 64.48 |
| T5 (Raffel et al., 2020) | 74.72 | 75.12 | **89.18** | **89.42** | 46.09 | **47.38** | 68.82 | **72.31** |
| AFML (Liang et al., 2023) | 72.96 | 73.80 | 88.17 | 88.67 | 40.07 | 40.35 | 65.57 | 66.30 |
| Our method | **76.05** | **75.51** | 88.21 | 87.75 | **47.66** | 45.86 | **69.77** | 68.56 |
Table 5: F1 score (%) in the 3-way and 2-way settings on Rest I and Rest II. Best scores are in bold.

| Models | Rest I 3-way 1-shot | Rest I 3-way 5-shot | Rest I 2-way 1-shot | Rest I 2-way 5-shot | Rest II 3-way 1-shot | Rest II 3-way 5-shot | Rest II 2-way 1-shot | Rest II 2-way 5-shot |
|---|---|---|---|---|---|---|---|---|
| Relation Network (Sung et al., 2018) | 52.19 | 60.75 | 67.28 | 83.41 | 52.87 | 61.34 | 71.34 | 78.95 |
| MTM (Deng et al., 2020) | 53.83 | 53.79 | 63.54 | 65.19 | 61.14 | 60.05 | 70.47 | 71.45 |
| Induction Network (Geng et al., 2019) | 60.89 | 60.51 | 79.62 | 81.34 | 63.08 | 62.41 | 80.49 | 80.64 |
| MIMLLN (Li et al., 2020) | 61.24 | 61.53 | 82.08 | 83.44 | 64.21 | 63.20 | 78.84 | 79.51 |
| Question-Driven (Sun et al., 2019) | 62.69 | 62.52 | 83.93 | 83.40 | 64.18 | 64.61 | 80.03 | 80.15 |
| CapsNet (Jiang et al., 2019) | 60.84 | 60.82 | 82.44 | 83.01 | 65.78 | 64.29 | 81.22 | 79.85 |
| MetaAdapt (Yue et al., 2023) | 58.36 | 61.22 | 82.42 | 82.20 | 62.02 | 62.43 | 81.40 | 82.20 |
| T5 (Raffel et al., 2020) | 55.12 | 63.15 | 75.55 | 80.14 | 59.12 | 61.33 | 84.25 | **85.05** |
| AFML (Liang et al., 2023) | 64.05 | 62.87 | 74.53 | 74.68 | 66.19 | 63.58 | 81.74 | 81.28 |
| Our method | **64.49** | **63.30** | **86.06** | **85.16** | **67.62** | **66.51** | **85.03** | 84.55 |
Table 6: F1 score (%) in the 3-way and 2-way settings on Lap and Mams. Best scores are in bold.

| Models | Lap 3-way 1-shot | Lap 3-way 5-shot | Lap 2-way 1-shot | Lap 2-way 5-shot | Mams 3-way 1-shot | Mams 3-way 5-shot | Mams 2-way 1-shot | Mams 2-way 5-shot |
|---|---|---|---|---|---|---|---|---|
| Relation Network (Sung et al., 2018) | 49.10 | 53.35 | 80.61 | 80.59 | 34.79 | 35.36 | 56.00 | 60.79 |
| MTM (Deng et al., 2020) | 50.11 | 51.47 | 62.23 | 64.69 | 36.96 | 35.21 | 52.23 | 50.65 |
| Induction Network (Geng et al., 2019) | 54.67 | 54.79 | 83.92 | 84.56 | 37.15 | 34.54 | 60.08 | 58.78 |
| MIMLLN (Li et al., 2020) | 53.71 | 53.66 | 84.59 | 83.87 | 36.03 | 36.92 | 59.41 | 60.59 |
| Question-Driven (Sun et al., 2019) | 54.80 | 54.43 | 84.75 | 84.21 | 35.79 | 34.66 | 60.37 | 60.13 |
| CapsNet (Jiang et al., 2019) | 54.30 | 53.26 | 83.53 | 83.39 | 35.92 | 35.12 | 58.63 | 62.65 |
| MetaAdapt (Yue et al., 2023) | 50.28 | 53.34 | 85.72 | 86.35 | 35.64 | 35.57 | 60.31 | 63.70 |
| T5 (Raffel et al., 2020) | 53.63 | 55.37 | **87.43** | **88.12** | 38.53 | **41.46** | 66.48 | **70.53** |
| AFML (Liang et al., 2023) | 54.75 | 52.06 | 85.92 | 86.27 | 38.46 | 34.09 | 64.36 | 65.33 |
| Our method | **58.91** | **56.49** | 86.15 | 87.52 | **40.84** | 38.82 | **66.54** | 66.48 |
(1) Overall, the proposed method outperforms most baselines. We also observe that two strong baselines, AFML and T5, achieve competitive results, but their overall performance is still worse than that of our method. Specifically, in terms of accuracy, the proposed method improves upon the strong baseline AFML by an average of 0.43%, 2.07%, 0.98%, and 4.89% on Rest I, Rest II, Lap, and Mams, respectively. Compared to T5, the proposed method achieves average accuracy improvements of 2.67% and 0.79% on Rest I and Rest II, respectively. Although T5 obtains competitive performance on Lap and Mams, it outperforms our method by margins of only 0.23% and 0.68% in accuracy. More specifically, as shown in Table 3 and Table 4, T5 obtains convincing accuracy in five scenarios thanks to the abundant pre-trained knowledge in its encoder-decoder architecture, but it performs worse than our method in the other 11 scenarios. Therefore, our method performs better than T5 on the few-shot ACSA task overall. Regarding F1 score, the proposed method improves upon AFML by 0.23%–11.53% across Rest I, Rest II, Lap, and Mams, and upon T5 by an average of 6.26%, 3.49%, and 1.13% on Rest I, Rest II, and Lap, respectively. Although T5 obtains competitive F1 results in the 2-way setting on Lap, its average performance is much worse due to its low F1 score in the 3-way setting. As shown in Table 5 and Table 6, T5 also obtains convincing F1 scores in five scenarios, but our method still outperforms it in the other 11 scenarios. These results demonstrate the effectiveness of the proposed method for the few-shot ACSA task: it learns the similarity and diversity relations among support and query samples to alleviate irrelevant sentiment noise and effectively predicts query labels by exploiting intra-cluster commonality and inter-cluster uniqueness.
(2) For all mentioned methods, performance in the 3-way setting on Rest I is inferior to that on Rest II. This is because Rest II provides a more fine-grained aspect (i.e., entity#attribute), whereas Rest I only gives a general aspect (i.e., entity). In the 3-way setting, the proposed method surpasses the strong baseline AFML by an average improvement of 1.07% accuracy and 0.43% F1 score on Rest I, and 1.12% accuracy and 2.18% F1 score on Rest II. It also surpasses the strong baseline T5 by an average improvement of 0.82% accuracy and 4.76% F1 score on Rest I, and 1.07% accuracy and 6.84% F1 score on Rest II. These results indicate that our method performs better when given fine-grained aspects.
(3) Compared to the generative model T5, the proposed method achieves accuracy/F1 improvements on Rest I, Rest II, Lap, and Mams. In our experiments, the proposed method is based on the BERT-base model with half of its parameters frozen, whereas T5 has far more parameters owing to its encoder-decoder architecture. To ensure a fair comparison, we freeze the encoder of T5 (i.e., T5-base) when comparing with our method. Although T5 obtains some competitive results thanks to its abundant pre-trained knowledge, it performs worse than our method in most experiments. Notably, our method improves upon T5 by up to 10.51% and by an average of 6.26% F1 score on Rest I, and achieves an average F1 improvement of 6.36% in the 3-way 1-shot setting. These results indicate that T5 performs poorly on the few-shot ACSA task due to irrelevant sentiment noise in such complex few-shot scenarios. In short, our method alleviates irrelevant sentiment noise by exploring intra-cluster commonality and inter-cluster uniqueness, improving performance in few-shot scenarios.
5.2 Impact of Dual Relations Propagation
DRP is used to enhance the similarity and diversity relations among samples by exploiting intra-cluster commonality and inter-cluster uniqueness. To verify its impact, we design two variants of the proposed method. The first variant uses the cosine distance between a node pair in place of the learned dual relations to evaluate the importance of dual relations propagation and aggregation. The second variant learns only the similarity relation among samples to analyze whether dual relations are essential; for this, we redefine the relation graph with a single similarity edge $E^{s}_{ij}$ between each node pair instead of dual edges.
The experimental results on the Rest I, Rest II, Lap, and Mams datasets are presented in Figure 4. We observe that DRP outperforms both variants. Specifically, DRP enhances the similarity and diversity relations among samples better than simple cosine distances, promoting query inference and improving performance. DRP also obtains more convincing results than single similarity relation propagation because it exploits the contrastive enhancement between similarity and diversity relations. Furthermore, the cosine-distance variant outperforms single similarity relation propagation in most cases, indicating that single similarity relation propagation is weak in few-shot scenarios. Besides, all methods perform worse in the 3-way setting than in the 2-way setting due to the increased sentiment complexity; nevertheless, DRP still achieves the best results in Figure 4. These results verify the effectiveness of DRP in the proposed method, revealing that it is vital for good performance.
5.3 Impact of Propagation Layer
To investigate the impact of the number of DRP layers, we evaluate the proposed method with one to eight layers on Rest I, Rest II, Lap, and Mams. Figure 5 shows the performance of the proposed method as the number of layers increases in the 3-way 1-shot and 2-way 1-shot scenarios. In the 3-way 1-shot scenario, DRP with two layers obtains the best results on Rest II, while DRP with three layers performs best on Rest I, Lap, and Mams. In the 2-way 1-shot scenario, DRP with two layers obtains the best results on Rest I and Mams. The results indicate that relation propagation has a positive effect as layers are added, but excessive layers degrade performance due to over-fitting. In short, the results demonstrate the effectiveness of DRP within a limited number of layers.
5.4 Ablation Study
To investigate the significance of the proposed method, we conduct an ablation study on the most competitive Mams dataset to compare performance. Due to the complexity of Mams, we can observe convincing differences in ablation experiments. Experimental results are shown in Figure 6, where the comparison results are presented in Figure 6a, and the gap values are reported in Figure 6b, with the following observations.
(1) In the “w/o_Class”, we remove the relation transformation from support-query to class-query and only use support-query relations to induce the labels of query samples. The performance is significantly reduced when removing the class-query relation transformation mechanism. This fact indicates that modeling the relations between a query and classes can promote query inference. Therefore, the class-query relation transformation plays an important role in the performance of the proposed method.
(2) In the “w/o_ContraLoss”, we remove the training objective described in Section 3.6 and use a cross-entropy loss instead. When the proposed training objective is removed, the performance drops considerably. This drop suggests that the proposed training objective plays a positive role in capturing dual relation features: it encourages the model to explore intra-cluster commonality and inter-cluster uniqueness to learn discriminative dual relations for query inference. Therefore, the training objective is of considerable importance.
(3) In the “w/o_DirNet”, we replace the learning of the diversity relation with a simple mechanism. Specifically, we set the maximum relation strength to 1 and use $(1 - e^{s}_{ij})$ as the diversity score, analyzing the necessity of the diversity network in Equation 15. For example, if the similarity score is 0.7, we set 0.3 as its corresponding diversity score, i.e., there is a 0.7 similarity relation and a 0.3 diversity relation between a sample pair. Experimental results demonstrate that the diversity network provides more supplementary information and encourages DRP to learn more robust dual relation features among samples. Therefore, learning the diversity relation is essential for the few-shot ACSA task.
Obviously, removing any component of the proposed method decreases performance. In short, the full model consistently surpasses all ablated variants and achieves the best performance.
5.5 Relation Strength Visualization
The relation strength between query samples and sentiment classes is visualized to compare DRP with conventional distance metrics (e.g., cosine distance). Specifically, we use the similarity relation score and the cosine distance to draw the heat maps in Figure 7. Experiments are conducted on the Rest I, Rest II, Lap, and Mams datasets, and we make the following observations.
(1) On Rest I, Rest II, and Lap, the proposed method significantly improves the results for the positive and negative classes. Regarding the neutral class, both methods are weak at drawing the relation between query samples and the neutral class because many queries with neutral sentiment are predicted as positive or negative. Nevertheless, the proposed method performs better than the conventional distance metric: by modeling the dual relations among samples, it learns discriminative relation features that improve performance on the few-shot ACSA task.
(2) Mams provides a more complex scenario where each sentence contains many aspects with different sentiment polarities. Inevitably, there are overlapping distributions of aspect embeddings caused by irrelevant sentiment noise among sentences with multiple sentiment aspects. As a result, the conventional distance method classifies almost everything into the neutral class, failing to identify sentiment features. In contrast, the proposed method learns valuable features to distinguish different sentiment classes, but the relation strength between query samples and sentiment classes remains weak. Therefore, in future work, we could focus on refining feature extraction and exploring additional domain knowledge to enhance the relation strength between query samples and sentiment classes for query inference.
6 Conclusion & Future Work
We propose an effective metric-free method for the few-shot ACSA task, which explicitly models the associated relations among the aspects of query and support samples, addressing the adverse effect of overlapping distributions caused by irrelevant sentiment noise. Specifically, the proposed method designs a fully connected relation graph to model the dual relations (similarity and diversity) among support and query nodes in the embedding space. With the relation graph, the proposed method uses the dual relations among nodes to explore intra-cluster commonality and inter-cluster uniqueness, alleviating irrelevant sentiment noise and enhancing aspect features, which mitigates the adverse effect of overlapping distributions. Additionally, the dual relations are transformed from support-query to class-query to guide query inference by learning sentiment class knowledge from the relation graph. Experiments show that the proposed method outperforms strong baselines and achieves strong performance.
The proposed method is not limited to few-shot ACSA; it can be applied to more complex tasks, e.g., fake news detection, text classification, and intention detection, since it can model the semantic similarity and diversity between input texts and ground-truth classes. Therefore, we will extend our method to these tasks in follow-up work.
7 Limitations
The proposed method obtains convincing performance compared with baselines but still has a few limitations.
(1) Neutral classification for the few-shot ACSA task remains a considerable challenge. Neutral samples lack opinionated sentiment context, making them susceptible to irrelevant sentiment noise and resulting in misclassifications, as shown in Figure 7. Therefore, we hope to draw more attention from researchers and developers to the challenge of neutral classification.
(2) We follow the meta-learning formulation to perform few-shot ACSA. The meta-task structure can generalize experiences from seen aspects to newly encountered ones, but it requires annotating a few samples of the newly encountered aspects to construct the support set for query inference. Therefore, future work could focus on reducing the number of shots per class in the support set, easing the burden of data annotation.
Acknowledgments
The authors are grateful for helpful comments from the anonymous reviewers and the TACL action editor. This work is supported by the National Key Research and Development Program of China (no. 2021ZD0111202), the National Natural Science Foundation of China (no. 62176005), and the Art Project of the National Social Science Fund of China (no. 2022CC02195).
Notes
1. Similarity relation indicates that nodes with the same sentiment label share similar sentiment features.
2. Diversity relation indicates that nodes with different sentiment labels express contrasting sentiment features.
Action Editor: Sebastian Padó