ABSTRACT
The expansion of Chinese natural language processing (NLP) has stimulated research in the broader NLP domain. However, existing large language models remain limited in comprehending and reasoning over Chinese text. This paper addresses these limitations by enhancing the comprehension and reasoning capabilities of Chinese language models while minimizing resource requirements. We propose LLaMA-LoRA, a neural prompt engineering framework that builds upon the LLaMA-13B model and incorporates the Low-Rank Adaptation (LoRA) of Large Language Models technique for refinement. Chain-of-Thought (CoT) prompts are crucial for generating intermediate reasoning chains in language models, but their effectiveness can be limited by isolated language patterns, and erroneous reasoning triggered by conventional prompts degrades model performance. We introduce automatic prompts to encourage reasoning-chain generation and accurate answer inference. Training the model on an extensive corpus of Chinese CoT data further enhances its comprehension and reasoning abilities. The LLaMA-LoRA model demonstrates exceptional performance across numerous Chinese language tasks, surpassing benchmark performance achieved by related language models such as GPT-3.5, ChatGLM, and OpenAssistant, and delivering accurate, comprehensive, and professional answers. The availability of our open-source model code facilitates further research on logical reasoning chains for Chinese text.
1. INTRODUCTION
Pre-trained language models [1-5] like GPT-3.5 [6] have achieved considerable advancement in Natural Language Processing (NLP). These exceptional models have shown remarkable abilities in language comprehension, generation, and reasoning; by leveraging extensive amounts of high-quality human text, they can adapt to new tasks quickly and be effectively utilized across various scenarios. However, these models still encounter critical issues and limitations [7]. Firstly, it has been observed that increasing the number of model parameters does not consistently lead to significant performance improvements [8]. In fact, smaller models have been found to outperform larger ones in specific NLP tasks [9]. This highlights the necessity of in-depth investigations into model structures and training strategies to identify the optimal performance trade-off for different tasks. Secondly, large pre-trained models have limitations in comprehending non-English texts, particularly languages like Chinese. This limitation arises from the imbalanced nature of training data and the limited adaptability of model structures to diverse linguistic characteristics. Addressing this challenge requires further research on optimization methods and adaptive training strategies tailored to Chinese (non-English) texts.
Significant advancements have been achieved in NLP through the utilization of Large Language Models (LLMs). Among these models, the Large Language Model Meta AI (LLaMA) [2] stands out as a family of foundational language models trained on a vast corpus comprising billions of samples. The LLaMA model has many benefits compared with traditional language models: trained on a more extensive corpus, it performs better in reasoning tasks. The LLaMA-13B [2] model has shown exceptional performance, outperforming the GPT-3 model in most benchmark assessments. In addition, the LLaMA-13B model is valuable for large-model applications because it can be executed on a single graphics processing unit (GPU). Nonetheless, challenges persist in existing large models, particularly their consumption of significant amounts of memory and storage resources. The Low-Rank Adaptation (LoRA) method has been introduced to address this issue, providing substantial advantages in mitigating resource requirements. LoRA effectively reduces parameter storage and the consumption of video random access memory (VRAM). Importantly, when the rank (k) is significantly smaller than the model dimension, LoRA eliminates the need to store optimizer states for the frozen parameters, resulting in a notable reduction in VRAM usage.
The LoRA model offers several advantages during task switching. To streamline computation, the model exchanges only the low-rank weights rather than all parameters, since gradients for most parameters are not required; this targeted approach significantly reduces computational costs. Like the common fine-tuning technique of training only a selected subset of pre-trained parameters, LoRA introduces no additional inference latency. By employing an adaptive process, LoRA can recover full-rank expressiveness without requiring cumulative full-rank gradient updates on weight matrices: setting the rank (k) of the update matrices to the rank of the pre-trained weight matrices allows the fine-tuned model to closely match the performance of the original model, and as the number of trainable parameters increases, LoRA training effectively converges to training the original model. This convergence behavior contrasts with adapter-based methods, which converge to a Multi-Layer Perceptron (MLP), and with prefix-based methods, which are unsuitable for models processing lengthy input sequences because the prefix occupies positions within the input sequence. Consequently, LoRA offers significant advantages in handling downstream tasks. Its low-rank structure reduces hardware requirements and enables the parallel execution of multiple experiments, fostering a deeper understanding of the relationship between weight updates and pre-trained weights.
This study aims to leverage the LLaMA model to enhance the efficiency and performance of LLMs. The approach employed involves reducing the number of trainable parameters while maintaining task performance. The utilization of LoRA has been expanded to encompass various attention weight types within the LLaMA model, and its accuracy has been evaluated on relevant datasets. Recent investigations have highlighted the potential of integrating LLMs with the Chain of Thought (CoT) approach to partially enhance the reasoning capabilities of extensive pre-trained models while reducing training costs and reliance on datasets. Building upon this concept, we introduce a comprehensive fine-tuning framework called "LLaMA-LoRA Neural Prompt Engineering" that uses automatically constructed chain-of-thought prompts during training to generate logical reasoning chains for Chinese text. By employing the proposed LLaMA-LoRA neural prompt engineering, the comprehension, reasoning, and consistency of LLMs for the Chinese language can be dramatically improved.
LLMs, as evidenced in the study [10], encounter limitations in reasoning abilities, specifically in domains such as mathematical reasoning, symbolic reasoning, and contextual understanding. These limitations can manifest as common sense errors and contradictions when these models handle complex tasks.
This research focuses on enhancing the capabilities of LLMs in Chinese expression and comprehension through fine-tuning with the LoRA method. Furthermore, the CoT training method enhances the model's expressive power and logical reasoning abilities. Extensive use of open-source Chinese datasets ensures compatibility within the open environment, distinguishing this approach from existing models that rely heavily on undisclosed or inadequately documented data, which poses challenges for practical applications. This paper evaluates the model across multiple tasks, including text generation, Chinese reasoning, and mathematical reasoning. State-of-the-art models such as GPT-3.5, ChatGLM, Alpaca, ERNIEBot, and their generalized counterparts serve as benchmarks for a comprehensive comparative analysis. Through extensive testing, our model outperforms the other models, particularly in Chinese reasoning and multi-turn dialogue tasks, and also demonstrates improved diversity and fluency across various text generation tasks. Our main contributions are summarized as follows:
LLaMA-LoRA Neural Prompt Engineering Framework: This study introduces the LLaMA-LoRA framework, an extension of the LLaMA-13B model that incorporates the LoRA technique. By optimizing the model's parameter efficiency while maintaining task performance, this framework enhances the model's reasoning capabilities and reduces resource requirements.
Automatic Chain of Thought prompting: To overcome the limitations of traditional cognitive chain prompts, this paper proposes the use of Automatic Chain of Thought (Auto-CoT) in the refined model. These prompts are dynamically sampled to encourage the generation of reasoning chains and improve answer inference, effectively enhancing the model's reasoning performance and mitigating errors in answer generation.
Enhanced Comprehension and Reasoning with Chinese CoT Data: Leveraging a comprehensive corpus of Chinese CoT data, this research enriches the model's comprehension and reasoning abilities. By training the model on this specific dataset, it gains a deeper understanding of Chinese language tasks, leading to improved performance and addressing the challenges associated with Chinese text comprehension and reasoning.
Superior Performance and Benchmark Surpassing: Through extensive comparative experimentation, this study demonstrates the outstanding performance of the proposed LLaMA-LoRA model across various Chinese language tasks. It outperforms state-of-the-art models such as GPT-3.5, Chat-GLM, and OpenAssistant, offering more accurate, comprehensive, and professional answers. This achievement establishes the LLaMA-LoRA framework as a new benchmark in the field of Chinese natural language processing, providing valuable insights and resources for further research.
Open Model Data: The open-source model data of LLaMA-LoRA is readily available, making it easier for researchers to conduct studies in this area and inspiring future endeavors.
2. RELATED WORK
2.1 Large Language Models
Transformers have paved the way for significant progress in NLP through the development of LLMs. These models have exhibited exceptional proficiency in diverse NLP tasks, including dialogue generation [11], entity identification [12], and basic reasoning [6], guided by simple human prompts (instructions). Overall, LLMs deliver impressive performance across a broad spectrum of NLP tasks. However, unlocking the complete potential of these models demands meticulous deliberation over prompt information: choosing correct or appropriate prompts is crucial, since distinct NLP tasks entail specific prompt requirements that substantially influence the model's performance.
In response to this challenge, a promising approach known as prompt tuning has emerged, aiming to enhance the comprehension ability of language models by providing prompts that facilitate accurate understanding and generation of responses. Early prompt methods were categorized as hard or discrete prompts [13], requiring domain expertise and a comprehensive understanding of the underlying model's characteristics to achieve state-of-the-art performance. In 2020, soft prompts [14] were introduced to overcome the limitations of hard prompts. Soft prompts treat prompt generation as an independent task, transforming the process from discrete manual attempts to machine-driven continuous learning and exploration. Noteworthy examples of soft prompt methods include P-tuning [13] and Prefix-tuning [14]. Since 2022, researchers have acknowledged the benefits and drawbacks of continuous prompt learning techniques, such as instability [15] and the absence of explicit reasoning steps provided by LLMs [16].
Google introduced the Chain of Thought (CoT) method in 2022 [16], aimed at enhancing the performance of LLMs on tasks involving mathematical calculation and commonsense reasoning. This method integrates a series of intermediate reasoning steps, facilitating the generation of more coherent logical chains for intricate reasoning tasks and offering improved interpretability for the answers generated by the model. During logical chain formation, pertinent questions are grouped into populations, and related problems are then extracted from each population. Numerous extraction methods exist; one is the diversified top-K maximal clique detection method proposed by Hao et al. [17], which is applicable to alignment analysis within a broad spectrum of social networks, enables the extensive dissemination of relational connections, and can be employed to extract problems from the population of thought chains. Remarkably, when applied to the more challenging GSM8K [18] task with initially low performance levels, the CoT method exhibited a performance improvement exceeding twofold compared to GPT-3 and PaLM. However, its impact on the simpler MAWPS [19] task was minimal and sometimes even adverse.
2.2 LoRA
Low-Rank Adaptation (LoRA) [20], introduced by Microsoft, is a training technique developed explicitly for LLMs to expedite training and enhance model performance by injecting trainable low-rank decomposition matrices while freezing the pre-trained weights. Pre-trained models with strong interdependencies among parameters across different layers can impede training efficiency, requiring significant time and computational resources. The LoRA model facilitates efficient adaptation across layers through these low-rank updates, resulting in improved model performance and accelerated training speed. Current research [20] demonstrates that the LoRA model significantly enhances model performance, accelerates training, and reduces computational resource consumption. This methodology has been successfully applied to various pre-trained language models, including GPT and BERT, yielding remarkable outcomes.
We employ the adaptive strategy of LoRA to make the fine-tuning of LLMs for independent tasks more efficient. This reduces costs related to hardware resources and storage/switching overhead while preserving high-quality model performance, without introducing inference latency or reducing the usable input sequence length. LoRA excels at fast task switching in service deployment scenarios by sharing the majority of model parameters. By approximating the performance achieved through full training, LoRA fine-tuning effectively mitigates resource waste. LoRA adapts the attention-related matrices Wq and Wv, while also considering Wk, to achieve optimal overall performance.
The standard deviation of performance remains consistent across different random seeds for a given dataset. Placing all trainable parameters in ΔWq or ΔWk alone causes a significant performance decrease, whereas adapting both Wq and Wv yields the best results. Hence, utilizing multiple smaller-rank weight matrices is preferable to a single matrix type with a larger rank. Experimental evidence from the related study [20] demonstrates the high utility of the top singular-vector directions when the matrix rank is set to 8, since the remaining directions mostly contain random noise accumulated during training. Therefore, in our LoRA model training, the rank is set to 8. In current production environments, incorporating the LoRA approach does not introduce additional inference latency, since W = W0 + BA can be explicitly computed and stored, where the product BA has the same dimensions as W0. When switching to another downstream task, W0 can be restored by subtracting BA and adding a different B′A′. This restoration operation is fast and incurs minimal memory overhead.
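To make the merge-and-switch mechanics concrete, the following is a minimal PyTorch sketch of a LoRA-augmented linear layer, written for illustration only (the class and method names are ours, not from the paper's released code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer W0 plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # W0 stays frozen during fine-tuning
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # B is zero-initialized and A is Gaussian, so BA = 0 at the start of training.
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.scale = alpha / rank
        self.merged = False

    def forward(self, x):
        if self.merged:  # update already folded into W0
            return self.base(x)
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

    def merge(self):
        """Fold BA into W0 so inference incurs no extra latency."""
        if not self.merged:
            self.base.weight.data += self.scale * (self.B @ self.A)
            self.merged = True

    def unmerge(self):
        """Subtract BA to restore W0, e.g. before switching to another task's B'A'."""
        if self.merged:
            self.base.weight.data -= self.scale * (self.B @ self.A)
            self.merged = False
```

Switching tasks then amounts to calling `unmerge()` and merging a different adapter's B′A′, which touches only the small low-rank factors rather than the full weight matrix.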
Previous studies [21-23] have suggested adapting to downstream tasks by inserting adapter layers between existing layers in neural networks. The LoRA fine-tuning method differs from this earlier approach: it uses a bottleneck structure to enforce low-rank constraints on the weight updates, and the trained weights can be merged smoothly into the primary weights at prediction time, eliminating concerns about latency caused by adapter layers. Another relevant adapter extension is COMPACTER [24], which utilizes Kronecker products with predefined weight-allocation schemes to parameterize adapter layers. Integrating LoRA with other tensor-product techniques may likewise increase parameter efficiency, though more research is necessary to fully understand the potential benefits in this area.
Several subsequent researchers [13-14, 25-26] have proposed an alternative approach to fine-tuning that optimizes input word embeddings using continuous and differentiable prompt engineering techniques. However, these methods are restricted in that they enlarge the input by adding special tokens to the prompt; these extra tokens occupy sequence length that could otherwise be used for task tokens when learning positional embeddings. The prevalence of low-rank structures [27-30] has been observed in various machine learning studies, as many machine learning problems inherently exhibit intrinsic low-rank characteristics. In many deep learning tasks, especially those using over-parameterized neural networks, the trained networks often exhibit low-rank properties [31]. Early research [32-35] explicitly imposed low-rank constraints while training the original neural networks; in contrast, later works focus on low-rank updates to frozen models for adaptation to downstream tasks.
In the academic literature, neural networks are known to outperform classical learning methods, including the corresponding finite-width neural tangent kernels; multiple studies [36-37] support this recognition. Some studies [38-40] have shown that neural networks perform exceptionally well when the underlying concepts have a low-rank structure. Furthermore, the research conducted by Allen-Zhu and Li [41] emphasizes the benefits of low-rank adaptive updates in the context of adversarial training.
2.3 Chain of Thought
Prompt engineering plays a crucial role in the inference process of LLMs, as it generates a sequence of reasoning chains that guide the model toward producing the final answer. CoT prompting has emerged as a promising technique for generating accurate answers by gradually producing intermediate reasoning steps within LLMs without relying on gradients. Wei et al. [42] explored the use of CoT prompts to guide LLMs in producing coherent intermediate reasoning steps, thereby improving the accuracy of their answers. Additionally, LLMs can leverage zero-shot prompts or manually designed demonstrations to conduct reasoning analysis [42]. This capability is advantageous for generating intermediate reasoning steps while addressing various tasks, and the generated steps are commonly referred to as CoT prompts.
CoT prompts can be classified into two scenarios. The first involves incremental reasoning before generating the answer: the model reasons through intermediate steps to arrive at the final answer, which facilitates the generation of accurate answers. The second involves manual demonstrations, where a reasoning chain accompanies each question. The manual demonstration approach has limitations, since it depends heavily on engineers' expertise and is therefore less scalable: designing questions and constructing the corresponding reasoning chains by hand incurs significant overhead. Recent research efforts [43] have primarily focused on manually creating complex problem demonstrations or employing integrated methods that progressively decompose complex problems into subproblems and solve them step by step, thereby significantly enhancing answer quality. Some researchers [44] have prioritized improving question-answering performance by including supplementary prompts for reasoning steps. Much of this research relies on a voting mechanism that scores multiple reasoning paths for a question. Wang et al. [45] proposed a self-consistency decoding approach that samples diverse outputs from LLMs and takes a majority decision; Wang [45] and Li [10] presented techniques that use randomness in the input space to produce a wider range of voting results. Typically, these methods require manually handpicking a question and inputting it into an LLM, which then uses its language generation ability to create arguments matching the input question, producing a statement specific to that question. The studies above all require manual design, meaning substantial human effort to create the prompts and reasoning chains.
This paper follows the approach of Auto-CoT to address the shortcomings of the Retrieval-Q-CoT method described in [46]. Retrieval-Q-CoT builds demonstrations from the rationales and answers produced by Zero-Shot-CoT, which can yield incorrect reasoning chains, and retrieval by similarity then amplifies these errors: when errors cluster together, similar questions receive inconsistent answers. Such a defect would weaken the reliability of the constructed demonstrations. One way to address this problem is to reduce the influence of any single group of questions on in-context learning: questions are sorted into categories, and a few representative questions are chosen from each category to create sample thinking chains. Minimizing the impact of incorrect reasoning chains in this way improves the overall effectiveness of Auto-CoT.
The reasoning methods mentioned above share a limitation: forming Chinese reasoning chains requires refining the intermediate reasoning steps, and Chinese language models remain comparatively weak in both expression and reasoning, which leads to lower performance on Chinese reasoning tasks. This study incorporates CoT data into the model and employs joint training to improve Chinese reasoning skills.
3. PROBLEM DEFINITION
3.1 Overview
This study first examines the fundamental aspects of the problem at hand. Our model endeavors to augment Chinese comprehension and reasoning capabilities, which necessitates a mathematical definition of our input. We designate Qtest as the variable representing the input question, which undergoes encoding and decoding procedures to generate a testing matrix; multiple input questions may be considered. By subjecting our "LLaMA-LoRA" model to extensive training, our primary objective is to derive the answer matrix Qresult corresponding to the input questions. This matrix encompasses the answer inference matrix rresult and the final result matrix aresult.
3.2 Multi-Step Training Definitions
This section provides a precise mathematical description of our model training process, which consists of two separate training stages. In the first stage, relevant contextualized information in Chinese is used as training data. The encoded and decoded result is represented by a sentence matrix denoted by Hchinese = [h1,h2,…,hk,…]. Each vector matrix hk ∈ Hchinese is derived from a question and its corresponding answer sentence. Following the initial training phase, a new matrix Mchinese = [m1,m2,…,mk,…], called the understanding skill matrix, is introduced; it is constructed by evaluating the comprehension level of our model through the association of questions with their corresponding answers. Each mk ∈ Mchinese represents a matrix of understanding skills specific to a particular question and its corresponding answer.
In the second training stage, CoT training data is incorporated into the pre-trained model. To accommodate this, the sentence matrix obtained through encoding and decoding is redefined as HCoT = [h1,h2,…,hk,…], where hk ∈ HCoT represents a vector matrix generated from question and answer sentences in the CoT data. Furthermore, this method introduces a reasoning skill matrix labeled MCoT = [m1,m2,…,mk,…]. Each mk ∈ MCoT represents a reasoning skill matrix obtained from a question and its corresponding answer. After the training process, the two skill matrices are merged to create a concatenated matrix denoted by Qskill = [Mchinese;MCoT]. The concatenated matrix captures the model's understanding and reasoning skills as training progresses.
When creating examples of thinking chains, the questions are categorized into n populations. For each population j ∈ {1,2,…,n}, a problem matrix table tj = [tj1, tj2,…,tjk,…] is defined, where each problem is a vector matrix. During the creation of example instances, a sample matrix table is defined as S = [s1, s2,…,si,…,sn], where si = [ti;ri;ai] consists of the question matrix, the reasoning matrix, and the final answer matrix. During testing, this study defines the question matrix formed through encoding and decoding as Qtest. After the model provides a response, it produces a result matrix Qresult = [rresult;aresult], the concatenation of rresult and aresult.
4. MAIN WORK
4.1 Overview
This section provides a comprehensive explanation of the model's construction, training, and workflow. The model consists of three primary components: the model training segment, the generation of thinking chain examples, and the question testing phase. The overall architecture of our model is illustrated in Figure 1. During the training phase, the model receives vectorized Chinese sentences as input. These sentences are simultaneously processed by both the LLaMA base model and the LoRA model for parallel training. While the LoRA model undergoes fine-tuning, the parameters of the LLaMA model remain fixed. The resulting sentences are then stored within the trained model. For a more detailed description of the workflow, please refer to Section 4.2.
To address reasoning errors in constructing thinking chain examples, we employ the method of diversity clustering analysis [46]. This approach is crucial for improving the performance and accuracy of LLMs in logical reasoning tasks. It provides adequate guidance and training examples to generate coherent and precise thinking chains, significantly improving LLM performance in reasoning tasks.
The process begins by applying the K-means clustering algorithm to partition questions into distinct clusters based on their types. Each cluster contains multiple question examples that are encoded and decoded using the BERT model, yielding fixed-size vector representations. These questions are then sorted in descending order within each cluster based on their distance from the cluster center, resulting in an ordered list. To create a reasoning chain demonstration, we select representative questions from the various clusters for answer demonstrations. The selected questions are organized into a sample matrix table denoted as S = [s1, s2,…,si,…,sn]. Each sample matrix si ∈ S undergoes standardization and includes both the question and answer matrices; the answer matrix is formed by concatenating the reasoning matrix ri and the final answer matrix ai. The detailed procedure for generating thinking chain examples is given in Section 4.3.
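As an illustrative sketch of this clustering step (ours; the pipeline described above uses BERT sentence vectors, and the function name here is hypothetical), the questions can be partitioned with K-means and each cluster ordered by distance to its centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(question_embeddings: np.ndarray, questions: list, n_clusters: int):
    """Cluster questions by type and pick one representative per cluster.

    question_embeddings: fixed-size sentence vectors (e.g., from a BERT encoder),
    with shape (num_questions, dim).
    """
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    labels = km.fit_predict(question_embeddings)

    representatives = []
    for j in range(n_clusters):
        idx = np.where(labels == j)[0]
        # Order cluster members by distance to the cluster centre, mirroring
        # the ordered list t_j described in the text.
        dists = np.linalg.norm(question_embeddings[idx] - km.cluster_centers_[j], axis=1)
        ordered = idx[np.argsort(dists)]
        representatives.append(questions[ordered[0]])  # question closest to the centre
    return representatives
```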
During the question testing phase, the questions to be evaluated are input into the trained model to search for a matching sample table S, enabling the establishment of associations between questions and specific question-type clusters. Answers can then be inferred by leveraging the thinking chains within sample table S, thereby obtaining the underlying generated reasoning process and the model's provided answers.
The evaluation process begins by inputting the test question into the trained model and subsequently matching it with the corresponding sample table S based on its question type. The thinking chains in sample table S provide valuable insights for generating the answer. By leveraging these thinking chains, we are able to deduce the answer and closely analyze the model's reasoning process and final response. For a comprehensive understanding of the testing procedures and steps involved, please refer to Section 4.4, which presents a detailed depiction and flowchart of the question-testing process. The overall framework of our model is visually presented in Figure 1. Through rigorous question testing, we can thoroughly evaluate and validate the performance and accuracy of our model in generating reasoning and answers using thinking chains. This research addresses the challenges faced by LLMs in logical reasoning tasks, enhancing their reliability and resilience.
4.2 Training
This section presents a detailed account of the model training process, which encompasses two distinct stages. The primary stage focuses on augmenting the model's proficiency in Chinese comprehension, whereas the subsequent stage endeavors to refine the model's aptitude in logical reasoning.
In the initial training phase, the model undergoes training on diverse datasets, including Chinese dialogue data, Alpaca data [47] translated into Chinese, and other relevant datasets. This process generates a sentence matrix Hchinese = [h1, h2,…,hk,…] through sentence encoding and decoding techniques.
Simultaneously, training is performed using a composite model combining the LLaMA base language model and the LoRA model. Throughout this training phase, the parameters of the base language model remain fixed, while matrix low-rank decomposition is applied to the input of the LoRA model. As a result, the final comprehension skill matrix Mchinese = [m1,m2,…,mk,…] is derived by merging the outputs of the original model and the LoRA model. Subsequently, these comprehension skill matrices are integrated into the newly trained model, initiating the fine-tuning process of the LoRA model. The mathematical representation of the fine-tuning procedure for the LoRA model is given by Eq. 1.
where W is the combined weight matrix, and Wpm and WLoRA represent the weight matrices of the base model and of LoRA during training, respectively. DLoRAzeros and CLoRAgaussian are the weight matrices used for low-rank processing in LoRA training, initialized as a zero matrix and from a Gaussian distribution, respectively. Incorporating LoRA into the mapping matrix W of Query and Value further enhances its effectiveness. The weight calculation for the mapping matrices of Query and Value in the attention mechanism is expressed in Eq. 2 and Eq. 3.
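The displayed equations are not reproduced in this version of the text. Based on the definitions above and the standard LoRA formulation, Eq. 1-3 plausibly take the following form (our reconstruction, not the original typesetting):

```latex
% Reconstruction (ours) of Eq. 1--3 from the surrounding definitions.
\begin{align}
W   &= W_{pm} + W_{LoRA} = W_{pm} + D^{LoRA}_{zeros}\, C^{LoRA}_{gaussian} \tag{1}\\
W_Q &= W^{pm}_{Q} + D^{Q}_{zeros}\, C^{Q}_{gaussian} \tag{2}\\
W_V &= W^{pm}_{V} + D^{V}_{zeros}\, C^{V}_{gaussian} \tag{3}
\end{align}
```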
The active form of the LoRA model for the Query (Q), Key (K), and Value (V) matrices in multi-head self-attention is calculated as in Eq. 4, Eq. 5, and Eq. 6: the training data Y is first mapped into these matrices as follows:
When utilizing Softmax, the computational inference for matrices Q and K using the LoRA layer is expressed in Eq. 7.
The final attention calculation is shown in Eq. 8.
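Eq. 4-8 are likewise missing from this version. A plausible reconstruction, following standard scaled dot-product attention with LoRA applied to the Query and Value projections (ours, under that assumption), is:

```latex
% Reconstruction (ours) of Eq. 4--8.
\begin{align}
Q &= W_Q Y = \bigl(W^{pm}_{Q} + D^{Q}_{zeros} C^{Q}_{gaussian}\bigr) Y \tag{4}\\
K &= W^{pm}_{K} Y \tag{5}\\
V &= W_V Y = \bigl(W^{pm}_{V} + D^{V}_{zeros} C^{V}_{gaussian}\bigr) Y \tag{6}\\
&\operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \tag{7}\\
\operatorname{Attention}(Q, K, V) &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{8}
\end{align}
```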
In the second phase of training, CoT data is incorporated into the newly trained model after the initial training, aiming to further enhance the model's reasoning capability. Throughout the training of the reasoning process, the Auto-CoT clusters and analyzes diverse categories of queries, culminating in the formulation of appropriate reasoning steps tailored to each question type. This compilation establishes a fundamental array of instances that serve as guidance for the model's reasoning procedure across a spectrum of questions. This phase of training follows a similar process as the first stage, involving the utilization of sentence encoding and decoding layers to form a sentence matrix denoted as HCoT = [h1,h2,…,hk,…]. By leveraging the base language model and the LoRA model, the final reasoning skill matrix MCoT = [m1,m2,…,mk,…] is generated and stored within the currently trained model. The Qskill feature skill matrix is then created by concatenating these two skill matrices, as illustrated in Eq. 9. Through this approach, the LoRA model effectively completes the final phase of fine-tuning.
4.3 CoT Sample Construction and Formation
This section presents a comprehensive exposition of the procedure for constructing thinking chain examples. The concept of Automatic Chain-of-Thought (Auto-CoT) prompting is introduced: during the reasoning phase, prompts are generated based on the initially produced examples, and as the model proceeds through the stages of reasoning for a given question, it generates answers corresponding to each stage based on the example prompts from different stages of the automatic thought chain. This approach ensures that the model's reasoning process aligns more closely with the desired human-like outcomes. The procedure uses K-means clustering to organize the questions into n distinct clusters. Each cluster j is then arranged in descending order based on the distance between the questions and the centroid of their corresponding cluster, yielding a matrix table tj = [tj1,tj2,…,tjk,…] for each j ∈ [1,2,…,n]. Using the zero-shot thinking chain approach, we generate thinking chain examples by inputting the extracted questions into our trained model. The model leverages the feature memory matrix Qskill from the training process to enhance the quality of the examples. Subsequently, the model combines each input question with the corresponding answer steps, resulting in the final example matrix table S = [s1,s2,…,sn].
Each example sj with j ∈ [1,2,…, n] is standardized to ensure consistent matrix dimensions. The relevant answer steps within the example matrix are generated by concatenating the answer reasoning with the outcome. The procedure for forming and constructing thinking chain examples is outlined in Pseudocode 1.
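Pseudocode 1 is not reproduced in this version; the following Python sketch (ours, with a hypothetical `generate` callable standing in for the trained model) illustrates the demonstration-construction loop it describes:

```python
def zero_shot_cot(generate, question: str):
    """Hypothetical Zero-Shot-CoT step: elicit reasoning steps, then a final answer.

    `generate` is any callable mapping a prompt string to the model's completion.
    """
    reasoning = generate(f"Q: {question}\nA: Let's think step by step.")
    answer = generate(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\nTherefore, the answer is"
    )
    return reasoning.strip(), answer.strip()

def build_demonstrations(clusters, generate):
    """Build the sample table S = [s_1, ..., s_n]: one demonstration per cluster.

    clusters: lists of questions, each ordered by distance to its cluster centroid.
    """
    S = []
    for t_j in clusters:
        question = t_j[0]  # representative question t_j1 of cluster j
        reasoning, answer = zero_shot_cot(generate, question)
        # s_j = [t_j; r_j; a_j]: question, reasoning chain, and final answer.
        S.append({"question": question, "reasoning": reasoning, "answer": answer})
    return S
```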
4.4 Test Answer Generation
This section provides a comprehensive description of the testing process for answer generation. The model initially takes the test questions as input and performs encoding and decoding operations to create the question matrix Qtest. In the result construction phase, the model incorporates the automatically generated example matrix S = [s1,s2,…,sj,…,sn], which contains instances sj ∈ S that are similar to the test question. By searching for examples resembling the test question, the model gradually generates the final result matrix Qresult, excluding the question itself. This matrix is obtained by concatenating the reasoning process matrix rresult and the final result matrix aresult. Pseudocode 2 illustrates the evaluation procedure during the testing process.
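Pseudocode 2 is likewise not reproduced here; a minimal sketch of the test-time procedure, assuming the demonstration table S built above and the same hypothetical `generate` callable, might look as follows:

```python
def answer_with_cot(test_question: str, S: list, generate):
    """Assemble Auto-CoT demonstrations into the prompt and infer the answer.

    Returns (r_result, a_result): the generated reasoning chain and final answer,
    whose concatenation corresponds to Q_result in the text.
    """
    demo_blocks = [
        f"Q: {s['question']}\nA: {s['reasoning']} The answer is {s['answer']}."
        for s in S
    ]
    prompt = "\n\n".join(demo_blocks) + f"\n\nQ: {test_question}\nA: Let's think step by step."
    reasoning = generate(prompt)
    answer = generate(prompt + f" {reasoning}\nTherefore, the answer is")
    return reasoning.strip(), answer.strip()
```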
This paper also introduces a methodology for evaluating the performance of multi-turn dialogue systems. Current evaluation criteria for multi-turn dialogue primarily rely on measures such as perplexity, F1-score, and engagement rate, which require extensive manual intervention. In contrast, our fully automated method provides a more objective evaluation. Our evaluation methodology employs entity extraction and comprises the following steps:
Step 1: Task assignment and generation of the initial statement, i.e., < sentence1 >, for the dialogue.
Step 2: Iteratively providing prompts to the model, appending attributes (emotion, tense, role) to the previous sentence, and sequentially generating < sentence2,…,sentencei >.
Step 3: Gradually removing attributes from the sentences generated in Step 2, resulting in < sentencei, sentencei+1,…,sentencen >.
Step 4: Utilizing the Regularized Canonical Correlation Analysis (RCCA) [48] method to project the generated sentences; the pairwise correlation ρi is computed between sentence pairs. In Eq. 21, PPLi and BLEUi are each averaged over a sentence pair (e.g., < sentence1 > and < sentencen >, < sentence2 > and < sentencen-1 >).
In this study, the generated sentences undergo encoding and decoding, forming the sentence feature matrices X and Z. The principal objective of the RCCA evaluation approach is to identify a set of linear projection functions for the two sets of variables: matrix vectors of different dimensions are projected into a common dimensional space, where inter-vector distances can be computed and correlation magnitudes established from these distance relationships. The method is designed to maximize post-projection correlation among the variables. In this paper, RCCA is used to appraise the correlation between sentences within multi-turn dialogues: after entity content is extracted from the sentences, the correlation between antecedent and subsequent sentences is analyzed, yielding insights into the consistency of the model's outputs across turns. The RCCA framework has a notable advantage over alternative correlation measures when the feature dimensionality exceeds the number of observed samples, making it well suited to scenarios with scarce labeled data and resulting in efficient resource utilization.

We construct two linear projection matrices, H = [h1,h2,…,hi,…,hl] ∈ Rd×l and Q = [q1, q2,…,qj,…,ql] ∈ Rd×l, to project the feature matrices X and Z into the latent vector space W. The goal is to maximize the correlation between HTX and QTZ. The correlation between the canonical variables hi ∈ Rd×1 and qj ∈ Rd×1 is calculated as follows:
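The displayed formula is missing from this version of the text; the standard canonical-correlation objective, which we take Eq. 12 to be (our reconstruction), reads:

```latex
% Reconstruction (ours) of the canonical correlation objective (Eq. 12).
\begin{equation}
\rho(h_i, q_j) =
\frac{h_i^{\top} C_{XZ}\, q_j}
     {\sqrt{h_i^{\top} C_{XX}\, h_i}\,\sqrt{q_j^{\top} C_{ZZ}\, q_j}}
\tag{12}
\end{equation}
```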
where hT and qT denote the transposes of the vectors, and CXX, CXZ, and CZZ represent the covariance matrices of the feature matrices X and Z. Since X and Z have zero mean and unit standard deviation, these covariance matrices can be computed using the following formulas:
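Reconstructing the missing display from standard practice (ours; N denotes the number of observed sentence samples):

```latex
% Empirical covariance estimates for zero-mean, unit-variance X and Z (reconstruction, ours).
\begin{equation}
C_{XX} = \frac{1}{N} X X^{\top}, \qquad
C_{ZZ} = \frac{1}{N} Z Z^{\top}, \qquad
C_{XZ} = \frac{1}{N} X Z^{\top}
\end{equation}
```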
The optimization objective of Eq. 12 is invariant to the scaling of the canonical variables hi and qj, so the correlation does not depend on the magnitude of the coefficients.
The convergence of the above optimization objective can be ensured, especially when the number of observed samples is small, by incorporating regularization terms rX > 0 and rZ > 0 into the covariance matrices, as shown in Eq. 13, where E denotes the identity matrix.
The projection matrices H and Q can be obtained through generalized eigenvalue decomposition in Eq. 14.
Consequently, the sentences’ feature representations in the same latent vector space W are denoted as Wx = HTX and Wz = QTZ, respectively. The distances between the two vectors were computed and analyzed using Eq. 15.
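Eq. 13-15, referenced above, are also not reproduced in this version; under the standard RCCA formulation they plausibly read (our reconstruction):

```latex
% Reconstruction (ours) of Eq. 13--15: regularization, eigenproblem, and distance.
\begin{gather}
\hat{C}_{XX} = C_{XX} + r_X E, \qquad \hat{C}_{ZZ} = C_{ZZ} + r_Z E \tag{13}\\
\hat{C}_{XX}^{-1}\, C_{XZ}\, \hat{C}_{ZZ}^{-1}\, C_{ZX}\, h_i = \lambda^{2} h_i \tag{14}\\
d(W_x, W_z) = \lVert W_x - W_z \rVert_2 \tag{15}
\end{gather}
```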
The degree of sentence similarity is determined by measuring the distance in the shared latent vector space. A smaller distance indicates a higher level of semantic similarity between the sentences.
5. EXPERIMENTS
To enhance the model's understanding of Chinese, we conducted thorough pre-training, using 8 A100 GPUs for one month and four days. The pre-training datasets are as follows:
Belle-dataset [49] is a Chinese dataset comprising 1.5 million samples, created with reference to the Stanford Alpaca dataset and its 175 seed tasks. Low-quality data was removed during processing, such as data claiming to come from GPT models, data where the model could not respond due to incomplete input, and data with Chinese instructions but English input or target.
Moss-003-sft-data [50] is an open-source collection of Chinese-English multi-turn dialogue data from Fudan University's MOSS team, containing over one million samples. The multi-turn dialogue data is constructed based on approximately 100,000 user input data collected during the MOSS-002 internal testing phase. It more closely aligns with real user intent distribution and includes finer-grained usefulness category labels, a wider range of harmless data, and longer conversation lengths, totaling around 1.1 million dialogue samples.
InstructMT [51] comprises instruction data and scripts for machine translation. The generated files primarily follow ParroT's format, with some compatibility with LLaMA's format. In this paper, we mainly focus on documents translated between Chinese and English.
This study extends prior research [52] by investigating zero-shot learning and conducting a comparative analysis of 9 advanced Chinese models. The findings are presented for benchmark tasks in 6 distinct domains, spanning fundamental natural language comprehension and generation, as well as applications involving natural language reasoning. These applications include knowledge question-answering, open-ended questioning, multi-turn dialogue understanding, simple numerical computation, and reasoning tasks. Multiple datasets were employed, with training conducted on the A100 GPU and testing on GPUs with 32GB of memory. This meticulous experimental setup ensures the effectiveness and reliability required for handling large datasets and complex models. The comprehensive experimental design enables the evaluation and comparison of model performance across diverse task domains.
5.1 Metrics
We employ a comprehensive set of carefully chosen evaluation metrics to thoroughly assess the performance of our model. These metrics have been specifically selected to provide in-depth insights into various aspects of its effectiveness. The evaluation metrics utilized in our study encompass a wide range of indicators, including the following:
- Accuracy (acc), defined in Eq. 16, measures the proportion of correct predictions made by the model on a given input.
- Precision, defined in Eq. 17, represents the ratio of correctly predicted positive cases to all predicted positive cases.
- Perplexity (PPL), defined in Eq. 18, evaluates the linguistic modeling ability of the model. It quantifies how well the predicted distribution matches the actual outcomes, with lower values indicating better performance.
- F1-score, defined in Eq. 19, assesses the accuracy of the model for Dialogue State Tracking (DST). It measures the match between the model's predicted and actual conversation states, combining precision and recall.
- BLEU, defined in Eq. 20, is a widely used machine translation evaluation metric measuring the similarity between the machine-generated translation and the reference translation. The BLEU computation involves multiple steps and incorporates a Brevity Penalty (BP) term to account for overly short translations; BP is computed from the ratio of the candidate translation length to the reference translation length, using the natural logarithm.
- Human is a metric established through a manual scoring mechanism: human evaluators comprehensively assess the model's linguistic fluency, answer relevance, and other qualities when it handles open-ended questions and answers.
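The metric formulas themselves (Eq. 16-20) are not reproduced in this version; the standard definitions we assume are:

```latex
% Standard definitions assumed for Eq. 16--20 (reconstruction, ours).
\begin{gather}
\mathrm{acc} = \frac{\#\,\text{correct predictions}}{\#\,\text{total predictions}} \tag{16}\\
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}\\
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\Bigr) \tag{18}\\
F_1 = \frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{19}\\
\mathrm{BLEU} = BP \cdot \exp\Bigl(\sum_{n=1}^{4} w_n \log p_n\Bigr), \qquad
BP = \min\bigl(1,\, e^{\,1 - r/c}\bigr) \tag{20}
\end{gather}
```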
5.2 Baseline
We faithfully replicated the configurations and findings of previous studies for comprehensive comparisons. However, it is important to note that specific baseline models may be applicable only in certain experimental contexts. Our selection of comparative models prioritized those with parameter counts similar to our model (13B). We also collected publicly available, extensively pre-trained models known for their exceptional proficiency in the Chinese language. The chosen comparative models for our study are as follows:
GPT-3.5 [6] is an LLM from OpenAI, utilizing the GPT-3.5 architecture. It is the latest iteration in the GPT series, exhibiting remarkable proficiency in NLP. The model effectively understands and generates natural language text by leveraging a large corpus of internet text data during training. It has broad applicability across various language processing tasks, including question answering, task fulfillment, and dialogue generation.
ChatGLM [53] is an open-source bilingual conversational language model based on the General Language Model (GLM) architecture with 6.2 billion parameters. ChatGLM-6B employs technology similar to GPT-3.5 but is optimized for Chinese question answering and conversation. It undergoes bilingual training on approximately 1 trillion tokens, along with supervised fine-tuning, feedback bootstrapping, and reinforcement learning with human feedback.
OpenAssistant is an open-source project developed by LAION to train a scaled-down alternative to GPT-3.5. Similar to Stable Diffusion's relation to DALL-E, OpenAssistant enables easy adoption and widespread dissemination, empowering ordinary individuals to utilize the technology effectively.
Chinese-LLaMA-Alpaca [54] is an extension of the original LLaMA model that incorporates an expanded Chinese vocabulary and undergoes a second round of pre-training using Chinese data. This enhancement further enhances the model's Chinese semantic understanding capabilities and improves the efficiency of Chinese encoding and decoding processes. We will compare its three generalized models.
StableLM-Tuned-Alpha-13B [55] is an RLHF fine-tuning of Vicuna-13B, which is a fine-tuned version of LLaMA-13B. This model is developed by the CarperAI team at StabilityAI.
Chinese-Alpaca-LoRA [56] is a language model for the Chinese language developed by the open-source community.
ERNIEBot [57] is a chatbot developed by Baidu Inc., capable of interacting with users, answering questions, and collaborating on creative tasks. The media have described it as a Chinese counterpart and competitor to the internationally renowned GPT-3.5.
The availability of Chinese-specific LLMs is currently limited, necessitating our reliance on high-quality models from the open-source community. Unless stated otherwise, the following parameters remained constant for experimental analysis: (Temperature: 0.1, Top P: 0.75, Top K: 40, Beams: 4, Max tokens: 128).
The Temperature parameter controls text diversity, with higher values yielding more varied and random outputs, and lower values generating consistent and deterministic text. The Top P parameter selects words surpassing a predefined probability threshold, ensuring concise and sensible text while preserving diversity. The Top K parameter identifies the K words with the highest predicted probabilities, providing choices for the next word and enhancing text quality when used in conjunction with Top P. The Beams parameter maintains top partial sequences, extending them at each step and selecting the sequence with the highest score as the final output. The Max tokens parameter limits the length of generated text. Keeping these parameters fixed ensures consistent evaluation and comparison, validating model performance and stability. Their selection aims to balance diversity, coherence, and length control for reliable experimental results.
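As an illustration of how these settings map onto a standard generation call (a sketch assuming the model is served through the Hugging Face `transformers` API; the checkpoint name is hypothetical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("llama-lora-13b-zh")
model = AutoModelForCausalLM.from_pretrained("llama-lora-13b-zh")

inputs = tokenizer("请简要介绍大熊猫。", return_tensors="pt")  # "Briefly introduce the giant panda."
outputs = model.generate(
    **inputs,
    temperature=0.1,     # low temperature: near-deterministic text
    top_p=0.75,          # nucleus-sampling probability threshold
    top_k=40,            # restrict sampling to the 40 most likely tokens
    num_beams=4,         # beam search width ("Beams: 4")
    max_new_tokens=128,  # "Max tokens: 128"
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```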
Based on the tasks above, we devised our testing score (Eq. 21) and evaluated performance with metrics including PPL, BLEU, and ρ. The results, depicted in Figure 2, show the cumulative score over multiple sentences as "Score". Through extensive experimentation, we set λ1, λ2, and λ3 to 0.2, 0.3, and 0.5, respectively.
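Eq. 21 itself is not reproduced in this version; from the description in Section 4.4 (Step 4) and the weights above, it plausibly takes the form (our reconstruction):

```latex
% Reconstruction (ours) of Eq. 21: score combining PPL, BLEU, and the RCCA correlation.
\begin{equation}
\mathrm{Score} = \sum_{i} \bigl(\lambda_1\, \mathrm{PPL}_i + \lambda_2\, \mathrm{BLEU}_i + \lambda_3\, \rho_i\bigr),
\qquad \lambda_1 = 0.2,\; \lambda_2 = 0.3,\; \lambda_3 = 0.5
\end{equation}
```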
5.3 Knowledge Quiz
Building upon Chinese-LLaMA-Alpaca [54], we expanded the knowledge question-answering (QA) test set and evaluated our model's performance. To ensure language inference consistency, prompt statement coherence was maintained across different models. The evaluation was conducted in a zero-shot setting, comparing our model with GPT-3.5, Chinese-LLaMA-Alpaca, and other models.
We employed a closed-book approach for this task, limiting the model's access to relevant evidence for QA. Although our model exhibited relatively lower QA accuracy, it achieved the lowest PPL in language generation (Table 1). Due to its reduced parameter count, our model has weaker information-retrieval capability than GPT-3.5, specifically for knowledge QA. Nevertheless, it provided concise and accurate responses to basic knowledge questions. Our model has only 13 billion parameters, about 5 to 10 times fewer than GPT-3.5; despite this, it still performs well on the benchmark task.
Table 1. Knowledge question-answering results.

| Model | Human | PPL | acc |
|---|---|---|---|
| GPT-3.5 | 83 | 11.8036 | 0.54 |
| ERNIEBot | 73 | 1.1753 | 0.35 |
| Chinese-LLaMA-Alpaca-13B | 67 | 17.2122 | 0.26 |
| Chinese-LLaMA-Alpaca-Plus-7B | 70 | 15.2103 | 0.29 |
| Chinese-LLaMA-Alpaca-Plus-13B | 75 | 12.9571 | 0.32 |
| ChatGLM | 76 | 9.8173 | 0.42 |
| OpenAssistant | 68 | 14.1269 | 0.28 |
| ours | 72 | 9.6654 | 0.39 |
5.4 Chinese Reading Comprehension and Reasoning Skills
We chose the LCSTS task [58] as the standardized measure to evaluate our models. This task is designed for Chinese short-text summarization and includes a comprehensive collection of Chinese text pairs and corresponding summaries. Each sample in the dataset consists of an original text and a manually generated summary. The original text provides a concise and descriptive phrase or sentence, while the summary offers a condensed version of the original content. We performed a comparative analysis with well-known models like OpenAssistant and StableLM-Tuned-Alpha-13B to assess our model's effectiveness. Additionally, manual assessments were conducted to evaluate the performance of the models under study.
In Table 2, we present the performance of our models on the LCSTS task. GPT-3.5 achieved optimal performance with fixed parameter configurations, while our model performed slightly lower than GPT-3.5. The LCSTS task focuses on information summarization and text comprehension. Our model demonstrates a comprehensive understanding of the text, enabling the generation of summaries that capture essential information. However, our model tends to produce slightly longer headlines than the desired standard length. Notably, GPT-3.5 and ChatGLM generate the most concise summaries. By adjusting the maximum token count to 64, we observed a reduction in the summary length.
Table 2. Human evaluation on the LCSTS summarization task.

| Model | Human |
|---|---|
| GPT-3.5 | 82 |
| ChatGLM | 75 |
| OpenAssistant | 64 |
| StableLM-Tuned-Alpha-13B | 62 |
| Chinese-LLaMA-Alpaca-13B | 65 |
| Chinese-alpaca-LoRA | 62 |
| ours | 78 |
It is important to highlight that while GPT-3.5 excels in generating informative summaries, it often exhibits uniformity and template-like structure. In contrast, as depicted in Figure 5, our model can generate summaries in diverse styles based on the content. This feature is valuable for practical applications such as news summarization, as it enables the creation of relevant and informative headlines tailored to the content's characteristics and requirements.
5.5 Multi-Round Dialogue Understanding
We assessed our model on the C3 [59] benchmark, a free-form multiple-choice Chinese machine reading comprehension dataset built from multi-turn dialogues. The benchmark requires models to select the correct option among the given choices based on the multi-turn dialogue. We employed two evaluation strategies: one that required the model to select from the options and another that asked the model to provide its reasoning process without given choices. We compared our model against GPT-3.5, StableLM-Tuned-Alpha-13B, and other models. As reported in Table 3, our model achieved the best performance on this benchmark, with an accuracy of 0.81, significantly higher than the other models. Within the remaining 0.19 of cases, 0.6 of the errors were suboptimal answers, indicating some level of contextual relevance. Under the second strategy, most of the thinking chains generated by our model were logically consistent, whereas the ChatGLM model failed to fulfill the requirements of the prompts in some test cases. Through experiments on the C3 benchmark, we demonstrated the superiority of our model in Chinese dialogue understanding. Figure 6 shows that our model performed remarkably well in selecting the correct answers and generating reasonable thinking chains.
5.6 Open-Ended Response
We selected WebQA [60] as the evaluation framework for our experiments. WebQA is a curated dataset of question-answer pairs obtained from the Baidu Zhidao platform. In this task, individual questions were presented to the model, and its performance was assessed based on accuracy and manual evaluation. Our model was compared against GPT-3.5, Chinese-LLaMA-Alpaca, and other models. As depicted in Table 4, owing to its more limited training on extensive open-domain data, our model exhibits relatively weaker proficiency in answering open-ended questions. In contrast, GPT-3.5 outperformed our model on this particular task, highlighting a notable limitation.
Table 4. Human evaluation on the WebQA open-ended QA task.

| Model | Human |
|---|---|
| GPT-3.5 | 79 |
| ChatGLM | 72 |
| StableLM-Tuned-Alpha-13B | 56 |
| Chinese-LLaMA-Alpaca-13B | 62 |
| ours | 69 |
Nonetheless, despite this limitation, our model showcases a certain level of competence in addressing questions within the WebQA task. It can respond to simpler inquiries, albeit with a discernible gap compared to other models. Nevertheless, our model remains competitive within this benchmark task. It is crucial to underscore that our research primarily revolves around logical reasoning and the generation of thinking chains, prioritizing these aspects over large-scale corpus training or information retrieval. Consequently, the weakness of our model in addressing open-ended questions does not signify a comprehensive failure but rather reflects the specific focus and experimental design of our study.
5.7 Mathematical Reasoning Ability
We evaluated the mathematical reasoning capabilities of our model using the Math23K dataset, which consists of 23,162 Chinese math problems sourced from the internet, covering topics from elementary and junior high school levels. During the evaluation, specific parameters were fixed, and a comparative analysis was conducted using a greedy approach across multiple models. In contrast to the other models in the comparison, our model did not undergo fine-tuning specifically for math problems; instead, it directly engaged in question querying and reasoning. The performance of our model was compared with GPT-3.5, ChatGLM, and OpenAssistant models, as illustrated in Table 5. The results unequivocally demonstrate the exceptional superiority of our model in the Math23K task, particularly in showcasing remarkable reasoning abilities for Chinese language tasks.
Table 5. Results on the Math23K mathematical reasoning task.

| Model | Human | acc |
|---|---|---|
| GPT-3.5 | 71 | 0.58 |
| ChatGLM | 65 | 0.27 |
| OpenAssistant | 59 | 0.21 |
| ours | 76 | 0.65 |
During the testing phase, our model faced challenges in providing precise answers to pure mathematical equations, such as 1 + 23/6 = ?. This difficulty arises from a limitation in the training dataset, which does not sufficiently cover equations involving basic arithmetic operators such as +, -, *, and %. As a result, our model's performance in this particular aspect could be enhanced. Nevertheless, we have made progress in addressing this concern through fine-tuning endeavors.
The thinking process employed by our model serves as a crucial criterion for manual evaluation. As depicted in Figure 7, our model consistently demonstrates its ability to engage in rational and rigorous reasoning, surpassing reliance on prior knowledge from the dataset and exhibiting comprehensive logical reasoning skills. Furthermore, our model exhibits a relatively stable performance in answer generation even when the question description is perturbed to some extent, such as altering the question order or punctuation. This stability is achieved through the incorporation of Auto-CoT functionality, which enables the model to maintain consistency regardless of prompt variations and independent of a specific answer format. In contrast, applying similar modifications to other models leads to a decrease in accuracy. Consequently, our model outperforms the compared models significantly in this aspect.
Our model demonstrates exceptional performance in Chinese mathematical reasoning tasks, effectively employing sound reasoning processes and maintaining stability. While the model's accuracy may not yet match human-level performance, it represents a significant breakthrough for LLMs.
5.8 Chinese Idiom Comprehension
We conducted experimental evaluations using the ChiD task [61], sourced from “The Complete Collection of Chinese Idioms.” This task specifically focuses on four-character idioms, and the experimental passages were extracted from various literary works, articles, and news sources. For the purpose of gathering more comprehensive information, paragraphs shorter than 100 characters were merged with the subsequent paragraph, while paragraphs exceeding 600 characters were excluded. Moreover, paragraphs consisting solely of high-frequency idioms were also omitted. To assess the performance of our model, we compared the experimental results with those of the Chinese-Alpaca-LoRA, GPT-3.5, ChatGLM, and StableLM-Tuned-Alpha-13B models.
The experimental results presented in Table 6 reveal that our model achieved an accuracy of 0.62 on the ChiD task, surpassing the performance of other models. Among the incorrect answers, 0.13 were manually determined to be suboptimal, highlighting the exceptional performance of our model in this task. In contrast, the ChatGLM model, which we compared against, exhibited a lower accuracy of approximately 0.34. Our model's superior understanding of idioms enables a deeper comprehension of their underlying meanings.
5.9 Ablation Experiments
To accurately evaluate the impact of the LoRA and Auto-CoT modules in the LLaMA-LoRA neural prompting engineering framework on model reasoning and Chinese comprehension abilities, we conducted a series of ablation experiments and performed tests on the Math23K, C3, and ChiD tasks. Through manual evaluation and accuracy assessment, we conducted a comprehensive evaluation of the model to reveal the necessity and effectiveness of these modules.
Specifically, we conducted ablation experiments for the LoRA and Auto-CoT modules; the experimental results are provided in Table 7. The findings indicate that the inclusion of the LoRA module significantly enhances the model's Chinese comprehension and, to some extent, its ability to reason with mathematical symbols, particularly excelling in multi-turn dialogue comprehension. However, the LoRA module exhibits a relatively poor understanding of idioms and requires further improvement. On the other hand, the inclusion of the Auto-CoT module improves performance across the Math23K, C3, and ChiD tasks, with the most significant gains on Math23K. The Auto-CoT module primarily enhances the model's reasoning capabilities and improves its comprehension of Chinese to a certain degree, particularly its understanding of Chinese conversations. The examples presented in Figure 8 illustrate the improvement in logical thought process achieved through the Auto-CoT module.
6. CONCLUSION
This paper presents a novel LoRA fine-tuning framework based on the LLaMA-13B model, incorporating the Auto-CoT mechanism. Our framework effectively addresses limitations in Chinese comprehension, expression, and logical reasoning of language models through adaptive enhancements. The training process involves two rounds: parallel training followed by the introduction of the thinking chain for adaptive training. This methodology improves the model's Chinese comprehension and logical reasoning abilities while maintaining a relatively low parameter count.
We introduce the Auto-CoT training method, significantly enhancing the model's logical reasoning and demonstrating commendable performance even with “perturbed” questioning. To evaluate multi-turn dialogues more effectively, we propose an RCCA-based evaluation method.
Comparative experiments demonstrate that our two-round trained model outperforms various models (GPT-3.5, ChatGLM, ERNIEBot) in benchmark tasks including Chinese dialogue comprehension, mathematical reasoning, and idiom comprehension. The diminished efficacy of our model in the realms of knowledge question-answering and open-ended questioning, when contrasted with GPT-3.5, is ascribed to the dearth of comprehensive corpus training specialized in these domains. As a result, the model remains bereft of pertinent domain-specific knowledge. In subsequent endeavors, our concerted attention will be channeled towards the refinement of data through meticulous cleansing processes, aimed at elevating the quality of data inputs. This pursuit will be complemented by the utilization of advanced training sets to amplify the model's cognitive capacities. Furthermore, our exploratory endeavors will encompass the implementation of the “knowledge injection” paradigm, fostering the expeditious assimilation of novel insights by the model, thereby mitigating the imperative for extensive fine-tuning measures. Although our model slightly underperforms GPT-3.5 in knowledge question-answering and open-ended questioning, it exhibits superiority compared to models with similar parameters (Chinese-alpaca-LoRA, Chinese-LLaMA-Alpaca). Additionally, we propose an RCCA-based evaluation method for multi-turn dialogues, which yields results similar to manual evaluation and addresses related shortcomings.
DATA AVAILABILITY
Data are available for download at the following web link: https://github.com/oulaxiaoge/LLaMA-LoRA-Neural-Prompt-Engineering. Generally, the following parameters remained constant for experimental analysis: (Temperature: 0.1, Top P: 0.75, Top K: 40, Beams: 4, Max tokens: 128).
AUTHOR CONTRIBUTIONS
S.L. Chen, Conceptualization; S.L. Chen, Methodology; S.L. Chen, Software; S.L. Chen, Writing original draft; WC. Wang, Data curation; WC. Wang, Formal analysis; X. L. Chen, and Y. J. Du, Funding acquisition; X.L. Chen, Investigation; X.L. Chen and P. Lu, Writing-review and editing. X.L. Chen, Project administration; X.L. Chen, Supervision; Z.Y. Yang, Visualization.
ACKNOWLEDGEMENT
This work is supported by the Science and Technology Program of Sichuan Province (Grant no. 2023YFS0424), the "Open bidding for selecting the best candidates" Science and Technology Project of Chengdu (Grant no. 2023-JB00-00020-GX), and the National Natural Science Foundation (Grant nos. 61902324, 11426179, and 61872298).