LLaMA-LoRA Neural Prompt Engineering: A Deep Tuning Framework for Automatically Generating Chinese Text Logical Reasoning Thinking Chains

ABSTRACT The expansion of Chinese natural language processing (NLP) has stimulated research in the broader NLP domain. However, existing large language models have limitations in comprehending and reasoning in Chinese. This paper addresses these limitations by enhancing Chinese language models' comprehension and reasoning capabilities while minimizing resource requirements. We propose LLaMA-LoRA, a neural prompt engineering framework that builds upon the LLaMA-13B model and incorporates the Low-Rank Adaptation (LoRA) of Large Language Models technique for refinement. Chain-of-Thought (CoT) prompts are crucial for generating intermediate reasoning chains in language models, but their effectiveness can be limited by isolated language patterns. Erroneous reasoning resulting from conventional prompts negatively impacts model performance. Automatic prompts are introduced to encourage reasoning chain generation and accurate answer inference. Training the model with an extensive corpus of Chinese CoT data enhances its comprehension and reasoning abilities. The LLaMA-LoRA model demonstrates exceptional performance across numerous Chinese language tasks, surpassing the benchmark performance achieved by related language models such as GPT-3.5, ChatGLM, and OpenAssistant, and delivering accurate, comprehensive, and professional answers. The availability of our open-source model code facilitates further research in the field of Chinese text logical reasoning thinking chains.


INTRODUCTION
Pre-trained language models [1][2][3][4][5] like GPT-3.5 [6] have achieved considerable advancement in Natural Language Processing (NLP). These models have shown remarkable abilities in language comprehension, generation, and reasoning. By leveraging extensive amounts of high-quality human text, they can adapt to and complete new tasks quickly and can be effectively utilized across various scenarios. However, these models still encounter vital issues and limitations [7]. Firstly, it has been observed that increasing the number of model parameters does not consistently lead to significant performance improvements [8]. In fact, smaller models have been found to outperform larger ones in specific NLP tasks [9]. This highlights the necessity for in-depth investigations into model structures and training strategies to identify the optimal performance trade-off for different tasks. Secondly, large pre-trained models have limitations in comprehending non-English texts, particularly languages like Chinese. This limitation arises from the imbalanced nature of training data and the limited adaptability of model structures to diverse linguistic characteristics. Addressing this challenge requires further research on optimization methods and adaptive training strategies tailored to Chinese (non-English) texts.
Significant advancements have been achieved in the field of NLP through the utilization of Large Language Models (LLMs). Among these models, the Large Language Model Meta AI (LLaMA) [2] stands out as a family of foundational language models that have undergone training on a vast corpus comprising billions of samples. The LLaMA model has many benefits when compared to traditional language models. It performs better in reasoning tasks since it is trained on a more extensive vocabulary. The LLaMA-13B [2] model has shown exceptional performance and is reported to outperform GPT-3 (175B) on most benchmarks [2]. In addition, the LLaMA-13B model is valuable for large model applications because it can be executed on a single graphics processing unit (GPU). Nonetheless, challenges persist in existing large models, particularly concerning their consumption of significant amounts of memory and storage resources. Low-Rank Adaptation (LoRA) has been introduced to address this issue, providing substantial advantages in mitigating resource requirements. LoRA effectively reduces the storage of parameters and the consumption of video random-access memory (VRAM). Importantly, when the rank is significantly smaller than the model dimension, LoRA does not need to store optimizer states for the frozen parameters, resulting in a notable reduction in VRAM usage.
The LoRA model offers several advantages during task switching. To streamline the computation process, only the LoRA weights are exchanged instead of all parameters, since most parameter gradients are not required. This targeted approach significantly reduces computational costs. One common fine-tuning technique involves training a selected group of pre-trained parameters, which does not result in additional inference latency. By employing an adaptive process, LoRA achieves full-rank performance without requiring cumulative gradient updates on weight matrices. LoRA allows the model to closely match the performance of the original model during fine-tuning by setting the rank of the update for all weight matrices to match the rank of the pre-trained weight matrices. When the number of trainable parameters increases, LoRA training effectively mimics the training impact of the original model. LoRA's approach to convergence differs from those that rely on adapters for Multi-Layer Perceptrons (MLP). Instead, LoRA's adapter layers are designed to mirror the structure of MLP. In contrast, prefix-based methods are unsuitable for models processing lengthy input sequences because they require prefix layers at every position within the input sequence, leading to excessive model parameters. Consequently, LoRA offers significant advantages in handling downstream tasks. Its adaptation to the low-rank structure reduces hardware requirements and enables the parallel execution of multiple experiments. This capability fosters a deeper understanding of the relationship between weight updates and pre-trained weights.
This study aims to leverage the LLaMA model to enhance the efficiency and performance of LLMs. The approach employed involves reducing the number of trainable parameters while maintaining task performance. The utilization of LoRA has been expanded to encompass various attention weight types within the LLaMA model, and its accuracy has been evaluated using relevant datasets. Recent investigations have highlighted the potential of integrating LLMs with the Chain of Thought (CoT) approach to partially enhance the reasoning capabilities of extensive pre-trained models while reducing training costs and reliance on datasets. Building upon this concept, we introduce a comprehensive fine-tuning framework called "LLaMA-LoRA Neural Prompt Engineering" that uses prompts from automated thinking training to generate a logical reasoning chain for Chinese text. By employing the proposed LLaMA-LoRA neural prompt engineering, the comprehension, reasoning, and consistency of LLMs for the Chinese language can be dramatically improved.
LLMs, as evidenced in the study [10], encounter limitations in reasoning abilities, specifically in domains such as mathematical reasoning, symbolic reasoning, and contextual understanding. These limitations can manifest as common sense errors and contradictions when these models handle complex tasks. This research focuses on enhancing the capabilities of LLMs in Chinese expression and comprehension through fine-tuning with the LoRA method. Furthermore, the CoT training method enhances the model's expressive power and logical reasoning abilities. Extensive use of open-source Chinese datasets ensures compatibility within the open environment, distinguishing this approach from existing models that rely heavily on undisclosed or inadequately documented data, which poses challenges for practical applications. This paper evaluates the model across multiple tasks, including text generation, Chinese reasoning, and mathematical reasoning. State-of-the-art models such as GPT-3.5, ChatGLM, Alpaca, ERNIEBot, and their generalized counterparts serve as benchmarks to provide a comprehensive comparative analysis. Extensive testing shows that the model outperforms the other models, particularly in Chinese reasoning and multi-turn dialogue tasks. The model also demonstrates improved diversity and fluency in various text-generation tasks. Our main contributions are summarised as follows:

Large Language Models
Transformers have paved the way for significant progress in NLP by developing LLMs. These models have exhibited exceptional proficiency in diverse NLP tasks, including dialogue generation [11], entity identification [12], and basic reasoning [6], guided by simple human prompts (instructions). Overall, LLMs deliver impressive performance across a broad spectrum of NLP tasks. However, unlocking the complete potential of these models demands meticulous deliberation of prompt information. Choosing the correct or appropriate prompts is crucial since distinct NLP tasks entail specific prompt requirements that substantially influence the model's performance.
In response to this challenge, a promising approach known as prompt tuning has emerged, with the aim of enhancing the comprehension ability of language models by providing prompts that facilitate accurate understanding and generation of responses. Early prompt methods were categorized as hard or discrete prompts [13], requiring domain expertise and a comprehensive understanding of the underlying model's characteristics to achieve state-of-the-art performance. In 2020, soft prompts [14] were introduced to overcome the limitations of hard prompts. Soft prompts treat prompt generation as an independent task, transforming the process from discrete manual attempts to machine-driven continuous learning and exploration. Noteworthy examples of soft prompt methods include P-tuning [13] and Prefix-tuning [14]. Since 2022, researchers have acknowledged both the benefits and the drawbacks of continuous prompt learning techniques, such as instability [15] and the absence of explicit reasoning steps provided by LLMs [16].
Google introduced the Chain of Thought (CoT) method in 2022 [16], aimed at enhancing the performance of tasks involving mathematical calculations and commonsense reasoning using LLMs. This method integrates a series of intermediate reasoning steps, facilitating the generation of more coherent logical chains for intricate reasoning tasks and offering improved interpretability for the answers generated by the model. During logical chain formation, the relevant questions are first grouped into populations, and related problems are then extracted from each population. Numerous extraction methods exist, one of which is the diversified top-K maximal clique detection method proposed by Hao in his research. This method is applicable to alignment analysis within a broad spectrum of social networks, enabling the extensive dissemination of relational connections. Moreover, it can be employed to extract problems from the population of thought chains [17]. Remarkably, when applied to the more challenging GSM8K [18] task with initially low performance levels, the CoT method exhibited a notable performance improvement exceeding twofold compared to GPT-3 and PaLM. However, its impact on performance in the simpler MAWPS [19] task was minimal and sometimes even resulted in adverse effects.

LoRA
Low-Rank Adaptation (LoRA) [20], introduced by Microsoft, is a fine-tuning technique developed explicitly for LLMs. It freezes the pre-trained weights and trains a pair of small low-rank matrices that are added to selected weight matrices, with one matrix of the pair initialized to zero and the other initialized randomly. Fully fine-tuning a pre-trained model updates every parameter in every layer, which can impede training efficiency and requires significant time and computational resources. By restricting the weight update to a low-rank form, LoRA reduces the number of trainable parameters and accelerates training. Current research [20] demonstrates that LoRA maintains or improves model performance, accelerates training, and reduces computational resource consumption. This methodology has been successfully applied to various pre-trained language models, including GPT- and BERT-family models, yielding remarkable outcomes.
We employ the adaptation strategy of LoRA to make the fine-tuning of LLMs for independent tasks more efficient. This results in reduced costs related to hardware resources and storage/switching overhead and ensures the preservation of high-quality model performance without introducing inference latency or reducing the input sequence length. LoRA excels in enabling fast task switching in service deployment scenarios by sharing the majority of model parameters. By approximating the performance achieved through global training, LoRA fine-tuning effectively mitigates resource waste. LoRA leverages attention-related matrices such as $W_q$ and $W_v$ while considering $W_k$ to achieve optimal overall performance.
The standard deviation of performance remains consistent across different random seeds for a given dataset. When all trainable parameters are placed in $\Delta W_q$ or $\Delta W_k$ alone, a significant performance decrease occurs, whereas adapting both $W_q$ and $W_v$ yields the best results. Hence, adapting several weight matrices with a smaller rank is preferable to adapting a single type with a larger rank. Experimental evidence from the related study [20] demonstrates the high utility of the top singular vector directions when the matrix rank is set to 8, since the other directions often contain most of the random noise accumulated during training. Therefore, in our LoRA model training, the rank is set to 8. In current production environments, incorporating the LoRA approach does not introduce additional inference latency, since $W = W_0 + BA$ can be explicitly computed and stored, where $W_0$ and $BA$ have the same dimensions ($B$ is $d \times r$ and $A$ is $r \times k$, so $BA$ is $d \times k$). When switching to another downstream task, $W_0$ can be restored by subtracting $BA$, after which a different $B'A'$ is added. This restoration operation is fast and incurs minimal memory overhead.
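The merge-and-switch arithmetic described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the matrix sizes are toy values, the function names are hypothetical, and the hidden size of 5120 used in the parameter count is an assumption about a 13B-scale model rather than a figure taken from this paper.

```python
import numpy as np


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: B is d x r, A is r x k."""
    return r * (d + k)


def merge(w0: np.ndarray, b: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Fold the low-rank update into the base weight: W = W0 + BA."""
    return w0 + b @ a


def switch_task(w, b_old, a_old, b_new, a_new):
    """Recover W0 by subtracting BA, then add a different B'A'."""
    return w - b_old @ a_old + b_new @ a_new


rng = np.random.default_rng(0)
d, k, r = 64, 64, 8  # toy sizes; r = 8 follows the rank chosen in the text
w0 = rng.normal(size=(d, k))
b1, a1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))
b2, a2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))

w_task1 = merge(w0, b1, a1)                      # deployed weight for task 1
w_task2 = switch_task(w_task1, b1, a1, b2, a2)   # swap to task 2 in place

# Assumed hidden size 5120 for a 13B-scale model (not stated in the text).
full, low_rank = 5120 * 5120, lora_param_count(5120, 5120, r)
print(full, low_rank, full // low_rank)
```

Because the merged matrix has the same shape as $W_0$, inference uses a single matrix multiplication per layer, which is why no extra latency is incurred. Under the assumed sizes, one rank-8 pair holds 320 times fewer parameters than the full square matrix it adapts.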
Researchers have suggested using adapter layers to adjust to downstream tasks by adding them between existing layers in neural networks, as per previous studies [21][22][23]. The LoRA fine-tuning method utilizes a bottleneck structure to enforce low-rank constraints for weight updates, which differs from the earlier approach. Our approach enables the trained weights to merge with the primary weights at inference time, eliminating concerns about latency caused by adapter layers. Another relevant adapter extension is COMPACTER [24], which utilizes Kronecker products with predefined weight allocation schemes to parameterize adapter layers. Additionally, integrating LoRA with other tensor-product techniques may increase parameter efficiency. However, more research is necessary to fully understand the potential benefits in this area.
Several subsequent researchers [13][14][25][26] have proposed an alternative approach to fine-tuning, which involves optimizing input word embeddings using continuous and differentiable prompt engineering techniques. However, these methods have a restriction in that they expand the model size by utilizing additional special tokens in the prompt. Consequently, these extra tokens occupy sequence lengths that could be used for task tokens when learning positional embeddings. The prevalence of low-rank structures [27][28][29][30] has been observed in various machine learning studies, as many machine learning problems inherently exhibit certain intrinsic low-rank characteristics. In many deep learning tasks, especially those that use overparameterized neural networks, the trained neural networks often show low-rank characteristics [31]. Early research [32][33][34][35] explicitly imposed low-rank constraints during the training of original neural networks. However, none of these works considered low-rank updates to a frozen model for adapting to downstream tasks.
In the academic literature, neural networks are well known to outperform classical learning methods, including the corresponding finite-width neural tangent kernels. Multiple studies [36][37] support this recognition. Some studies [38][39][40] have shown that neural networks perform exceptionally well when the underlying concepts have a low-rank structure. Furthermore, the research conducted by Allen-Zhu and Li [41] emphasizes the benefits of utilizing low-rank adaptive updates in the context of adversarial training.

Chain of Thought
Prompt engineering plays a crucial role in the inference process of LLMs as it generates a sequence of reasoning chains that guide the model toward producing the final answer. CoT prompting has emerged as a promising technique for generating accurate answers by gradually generating intermediate reasoning steps within LLMs without relying on gradients. Wei et al. [42] conducted a study that explored the use of CoT prompts to guide LLMs in producing coherent intermediate reasoning steps, thereby improving the accuracy of their answers. Additionally, LLMs can leverage zero-shot prompts or manually designed demonstrations to conduct reasoning analysis [42]. This capability is advantageous for generating intermediate reasoning steps while addressing various tasks, and these generated steps are commonly referred to as CoT prompts.
CoT prompts can be classified into two scenarios. The first scenario involves incremental reasoning before generating the answer, which facilitates the generation of accurate answers. The model gradually reasons through intermediate steps to arrive at the final answer. The second scenario involves manual demonstrations, where a reasoning chain accompanies a question. The manual demonstration approach has limitations since it depends heavily on engineers' expertise, making it less scalable. Designing manual questions and constructing the corresponding reasoning chains incurs significant overhead. Recent research efforts [43] have primarily focused on manually creating complex problem demonstrations or employing integrated methods. These methods aim to progressively decompose complex problems into subproblems and solve them step by step, thereby significantly enhancing the quality of problem answers. Certain researchers [44] have prioritized improving question-answering performance by including supplementary prompts for reasoning steps. Most research has relied on a voting system to evaluate various reasoning paths for a question and to score them. Wang et al. [45] suggested a self-consistency decoding approach that samples various outputs from an LLM and reaches a majority decision. Wang [45] and Li [10] presented techniques that use randomness in the input space to produce a wider range of voting results. Usually, these methods require manually handpicking a question and inputting it into an LLM. The model uses its language generation ability to create arguments that match the input question, producing a statement that is specific to that question. The studies mentioned above require manual design, which means much human effort is required to create the prompts and reasoning chains. This paper follows the approach of Auto-CoT, which addresses the shortcomings of the Retrieval-Q-CoT system, as described in [46]. The Retrieval-Q-CoT system relies on the rationales and answers provided by Zero-Shot-CoT, which can sometimes lead to incorrect reasoning chains and worsen retrieval errors. When errors occur together, they can cause distinct answers to be given for similar questions. This defect can weaken the reliability of Auto-CoT. One way to address this problem is by reducing the impact of a particular group on the in-context learning model. We can sort the questions into categories and choose a few representative questions from each category to create a sample thinking chain. Minimizing the impact of incorrect reasoning chains can improve the overall effectiveness of Auto-CoT.
The reasoning methods mentioned above share similar limitations. To form Chinese reasoning chains, it is necessary to refine the intermediate reasoning steps. Existing Chinese language models remain weak in both expression and reasoning, which leads to lower performance in Chinese reasoning tasks. This study suggests incorporating CoT data into the model and utilizing joint training to improve Chinese reasoning skills.

Overview
This study examines the fundamental aspects of the problem at hand. Our model aims to augment Chinese comprehension and reasoning capabilities, which requires a mathematical definition of our input. We designate $Q_{test}$ as the variable representing the input question, which undergoes encoding and decoding procedures to generate a testing matrix. Multiple input questions may be considered. By subjecting our "LLaMA-LoRA" model to extensive training, our primary objective is to derive the answer matrix $Q_{result}$ corresponding to the input questions. This matrix encompasses the answer inference matrix $r_{result}$ and the final result matrix $a_{result}$.

Data Intelligence Just Accepted MS. https://doi.org/10.1162/dint_a_00251

Multi-Step Training Definitions
This section aims to provide a precise mathematical description of our model training process, which consists of two separate training stages. In the first stage, relevant contextualized information in Chinese is used as training data. The encoded and decoded result is represented by a sentence matrix.
To address reasoning errors in constructing thinking chain examples, we employ the method of diversity clustering analysis [46]. This approach is crucial for improving the performance and accuracy of LLMs in logical reasoning tasks. It provides adequate guidance and training examples to generate coherent and precise thinking chains, significantly improving LLM performance in reasoning tasks. The process begins with applying the K-means clustering algorithm to partition questions into distinct clusters based on their types. Each cluster contains multiple question examples that are encoded and decoded using the BERT model, yielding fixed-size vector representations. These questions are then sorted in descending order within each cluster based on their distance from the cluster center, resulting in an ordered list. To create a reasoning chain demonstration, we select representative questions from various clusters for answer demonstrations. These selected questions are organized into a sample matrix table denoted as $S$.
During the question testing phase, the questions to be evaluated are inputted into the trained model to search for a matching sample table $S$, enabling the establishment of associations between questions and specific question type clusters. Answers can be inferred by leveraging the thinking chains within sample table $S$, thereby obtaining the underlying generated reasoning process and the model's provided answers.
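The clustering and ordering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the question embeddings are assumed to be precomputed fixed-size vectors (for example from a BERT encoder), the K-means routine is a plain NumPy version with a deterministic farthest-first initialization chosen here for reproducibility, and the `descending` flag simply follows the ordering stated in the text. All function names are hypothetical.

```python
import numpy as np


def farthest_first_centers(x: np.ndarray, n_clusters: int) -> np.ndarray:
    """Deterministic initialization: start at x[0], then repeatedly add the
    point farthest from the centers chosen so far."""
    centers = [x[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[int(d.argmax())])
    return np.array(centers, dtype=float)


def kmeans(x: np.ndarray, n_clusters: int, n_iter: int = 50):
    """Plain Lloyd iterations; returns (labels, centers)."""
    centers = farthest_first_centers(x, n_clusters)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


def build_sample_table(questions, embeddings, n_clusters, descending=True):
    """Map each cluster id to its questions ordered by distance to the center."""
    labels, centers = kmeans(embeddings, n_clusters)
    table = {}
    for j in range(n_clusters):
        idx = np.flatnonzero(labels == j)
        dist = np.linalg.norm(embeddings[idx] - centers[j], axis=1)
        order = np.argsort(dist)
        if descending:
            order = order[::-1]
        table[j] = [questions[i] for i in idx[order]]
    return table, centers


# Toy data: two well-separated groups of hypothetical question embeddings.
questions = ["a1", "a2", "a3", "b1", "b2", "b3"]
embeddings = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
table, centers = build_sample_table(questions, embeddings, n_clusters=2)
demos = [ordered[0] for ordered in table.values()]  # one question per cluster
print(table, demos)
```

Taking one question per cluster keeps the demonstrations diverse, which is the property the clustering step is meant to provide.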
The evaluation process begins by inputting the test question into the trained model and subsequently matching it with the corresponding sample table $S$ based on its question type. The thinking chains in sample table $S$ provide valuable insights for generating the answer. By leveraging these thinking chains, we are able to deduce the answer and closely analyze the model's reasoning process and final response. For a comprehensive understanding of the testing procedures and steps involved, please refer to Section 4.4, which presents a detailed depiction and flowchart of the question-testing process. The overall framework of our model is visually presented in Figure 1. Through rigorous question testing, we can thoroughly evaluate and validate the performance and accuracy of our model in generating reasoning and answers using thinking chains. This research addresses the challenges faced by LLMs in logical reasoning tasks, enhancing their reliability and resilience.
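The matching step at test time can be sketched in the same spirit. The snippet below assumes that cluster centers and a sample table of (question, reasoning chain) pairs already exist; the centers, the demonstrations, the prompt layout, and the function names are all hypothetical placeholders rather than details taken from the paper.

```python
import numpy as np


def nearest_cluster(q_vec: np.ndarray, centers: np.ndarray) -> int:
    """Index of the cluster whose center is closest to the question vector."""
    return int(np.linalg.norm(centers - q_vec, axis=1).argmin())


def build_prompt(test_question: str, demos) -> str:
    """Prepend the cluster's demonstrations, then leave the answer open."""
    blocks = [f"Q: {q}\nA: {chain}" for q, chain in demos]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical cluster centers and sample table S.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
sample_table = {
    0: [("What is 3 + 4?", "3 plus 4 equals 7. The answer is 7.")],
    1: [("Is ice colder than steam?",
         "Ice is frozen water and steam is boiling water. The answer is yes.")],
}

cluster = nearest_cluster(np.array([9.0, 9.5]), centers)
prompt = build_prompt("Is snow colder than tea?", sample_table[cluster])
print(prompt)
```

The prompt ends with an open answer slot so that the model continues with its own reasoning chain before stating the final answer.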

Training
This section presents a detailed account of the model training process, which encompasses two distinct stages. The primary stage focuses on augmenting the model's proficiency in Chinese comprehension, whereas the subsequent stage endeavors to refine the model's aptitude in logical reasoning.
In the initial training phase, the model undergoes training using diverse datasets, including Chinese dialogue data, Alpaca-data [47] translated into Chinese, and other relevant datasets. This process generates a sentence matrix. In the LoRA weight update, $W$ is the weight matrix, and $W_{plm}$ and $W_{LoRA}$ represent the weight matrices of the base model and of LoRA during the training process, respectively. $D_{LoRAzeros}$ and $C_{LoRAgaussian}$ are the weight matrices utilized for low-rank processing in LoRA training, initialized as a zero matrix and from a Gaussian distribution, respectively. Incorporating LoRA into the mapping matrices $W$ of Query and Value further enhances its effectiveness. The weight calculation for the mapping matrices of Query and Value in the attention mechanism is expressed in Eq. 2 and Eq. 3.
The active form of the LoRA model for the Query (Q), Key (K), and Value (V) matrices after multiple self-attention is calculated as in Eq. 4, Eq. 5, and Eq. 6. First, the training data $Y$ is mapped and transformed into a matrix. When utilizing Softmax, the computational inference for matrices Q and K using the LoRA layer is expressed in Eq. 7.
The final attention calculation is shown in Eq. 8. In the second phase of training, CoT data is incorporated into the newly trained model after the initial training, aiming to further enhance the model's reasoning capability. Throughout the training of the reasoning process, Auto-CoT clusters and analyzes diverse categories of queries, culminating in the formulation of appropriate reasoning steps tailored to each question type. This compilation establishes a fundamental array of instances that serve as guidance for the model's reasoning procedure across a spectrum of questions. This phase of training follows a similar process to the first stage, using sentence encoding and decoding layers to form a sentence matrix denoted as $M_{CoT} = [m_1, m_2, \ldots, m_k]$, which is generated and stored within the currently trained model. The $Q_{skill}$ feature matrix is then created by concatenating these two skill matrices, as illustrated in Eq. 9.
Through this approach, the LoRA model effectively completes the final phase of fine-tuning.
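The effect of adding LoRA to the Query and Value projections, as described in this section, can be sketched with a single-head attention in NumPy. This is an illustrative sketch, not the paper's Eq. 2 to Eq. 8: it assumes standard scaled dot-product attention, toy dimensions, and hypothetical function names. It shows that with the zero-initialized matrix the adapted model starts out identical to the base model.

```python
import numpy as np


def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def lora_attention(y, w_q, w_k, w_v, lora_q, lora_v):
    """Single-head attention with low-rank updates on Query and Value only.

    lora_q and lora_v are (D, C) pairs; the effective weight is W + D @ C.
    """
    q = y @ (w_q + lora_q[0] @ lora_q[1])
    k = y @ w_k
    v = y @ (w_v + lora_v[0] @ lora_v[1])
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


def init_lora(d_model, r, rng):
    """D starts as a zero matrix and C as Gaussian noise, so D @ C = 0."""
    return np.zeros((d_model, r)), rng.normal(size=(r, d_model))


rng = np.random.default_rng(0)
n_tokens, d_model, r = 5, 16, 8  # toy sizes; r = 8 as in the text
y = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Base model: equivalent to having no low-rank update at all.
zero_pair = (np.zeros((d_model, r)), np.zeros((r, d_model)))
base_out = lora_attention(y, w_q, w_k, w_v, zero_pair, zero_pair)

# Freshly initialized LoRA: the output matches the base model exactly.
lora_q, lora_v = init_lora(d_model, r, rng), init_lora(d_model, r, rng)
init_out = lora_attention(y, w_q, w_k, w_v, lora_q, lora_v)

# After a (simulated) training step D is no longer zero and the output moves.
trained_q = (lora_q[0] + 0.1, lora_q[1])
trained_out = lora_attention(y, w_q, w_k, w_v, trained_q, lora_v)
print(np.allclose(base_out, init_out), np.allclose(init_out, trained_out))
```

Starting from a zero update means fine-tuning begins at the pre-trained model's behavior, and only the small $D$ and $C$ matrices receive gradients.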

CoT Sample Construction and Formation
This section presents a comprehensive exposition of the construction procedure employed for generating thinking chain examples. The concept of Automatic Chain-of-Thought (Auto-CoT) prompts is introduced, wherein, during the reasoning phase of the model, prompts are generated based on the initially produced examples. As the model proceeds through the stages of reasoning for a given question, it generates answers corresponding to each stage based on the example prompts from different stages of the automatic thought chain. This approach ensures that the model's reasoning process aligns more closely with the desired human-like outcomes. The procedure uses K-means clustering to organize the questions into n distinct clusters. The questions in each cluster j are then arranged in descending order of their distance to the centroid of the cluster. Consequently, a matrix table denoted as

Test Answer Generation
This section provides a comprehensive description of the testing process for answer generation. The model initially takes the test questions as input and performs encoding and decoding operations to create the question matrix $Q_{test}$. In the result construction phase, the model incorporates the automatically generated example matrix

This paper introduces a methodology for evaluating the performance of multi-turn dialogue systems. Current evaluation criteria for multi-turn dialogue primarily rely on subjective approaches such as perplexity, F1-score, and engagement rate, which require extensive manual intervention. In contrast, our fully automated method provides a more objective evaluation. Our evaluation methodology employs entity extraction and comprises the following steps:
• Step 1: Task assignment and generation of the initial statement, i.e., <sentence_1>, for the dialogue.
• Step 4: Utilizing the Regularized Canonical Correlation Analysis (RCCA) [48] method to project the generated sentences. The pairwise correlation $\rho_i$ is computed between two sentences. In Formula 21, $PPL_i$ and $BLEU_i$ are the averages over the two sentences.
Algorithm 2: Test Answer Generation.
In this study, the generated sentences undergo encoding and decoding processes, forming the sentence feature matrices $X$ and $Z$. The principal objective of the RCCA evaluation approach is to identify a set of linear projection functions applicable to two distinct sets of variables. Matrix vectors of dissimilar dimensions are projected into a common dimensional space, which facilitates the computation of inter-vector distance relationships. The magnitudes of correlation are then established from these distances. The methodology is designed to maximize the post-projection correlation among the variables. In this manuscript, RCCA is used primarily to appraise the extent of correlation between sentences within multi-turn dialogues. The evaluation takes place after entity content has been extracted from the sentences, which allows the correlation between antecedent and subsequent sentences to be analyzed. This correlation analysis yields valuable insights into the consistency of the model's outputs during multi-turn dialogues. RCCA has a notable advantage over alternative methods of correlation assessment in cases where the dimensionality of the features surpasses the number of observed samples. This makes the approach well suited to scenarios with scarce labeled data, resulting in efficient resource utilization. We construct two linear projection matrices, $H$ (with column vectors $h_i$) and $Q$ (with column vectors $q_j$), to project the feature matrices $X$ and $Z$ into the latent vector space $W$. The goal is to maximize the correlation between $H^T X$ and $Q^T Z$. In the correlation between canonical variables, $h_i^T$ and $q_j^T$ denote the transposes of the vectors, while $C_{XX}$, $C_{XZ}$, and $C_{ZZ}$ represent the covariance matrices of the feature matrices $X$ and $Z$. Since $X$ and $Z$ have mean values of 0 and standard deviations of 1, these covariance matrices can be computed directly from $X$ and $Z$. The optimization objective of Eq. 12 is equivalent to considering the independence of any pair of canonical variables $h_i$ and $q_j$ from the coefficients.
The convergence of the above optimization objective can be ensured, especially when the number of observed samples is small, by incorporating the regularization terms $r_X > 0$ and $r_Z > 0$ into the covariance matrices, as in Eq. 13, where $E$ represents the identity matrix.
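The formulas referred to in this passage are not reproduced in the text. For orientation, the standard regularized CCA quantities that are consistent with the symbols defined above can be written as follows. This is a sketch of the textbook formulation, not necessarily the exact form of the paper's Eq. 12 and Eq. 13; the column-per-sample convention, the sample count $n$, the dimensions $d_X$ and $d_Z$, and the $1/n$ normalization are assumptions.

```latex
% Assumed convention: X \in \mathbb{R}^{d_X \times n}, Z \in \mathbb{R}^{d_Z \times n},
% one column per sample, rows standardized to mean 0 and standard deviation 1.
\begin{align}
  \rho_{ij} &= \frac{h_i^{T} C_{XZ}\, q_j}
                    {\sqrt{h_i^{T} C_{XX}\, h_i}\;\sqrt{q_j^{T} C_{ZZ}\, q_j}}, \\
  C_{XX} &= \tfrac{1}{n} X X^{T}, \qquad
  C_{XZ} = \tfrac{1}{n} X Z^{T}, \qquad
  C_{ZZ} = \tfrac{1}{n} Z Z^{T}, \\
  C_{XX} &\leftarrow C_{XX} + r_X E, \qquad
  C_{ZZ} \leftarrow C_{ZZ} + r_Z E, \qquad r_X, r_Z > 0 .
\end{align}
```

Adding $r_X E$ and $r_Z E$ makes both covariance matrices positive definite even when the number of features exceeds the number of samples, which is what keeps the optimization well posed in the small-sample case.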

The projection matrices $H$ and $Q$ can be obtained through generalized eigenvalue decomposition in Eq. 14.
Consequently, the sentences' feature representations in the same latent vector space W are denoted as = X T

W H X and =
Z T W Q Z , respectively.The distances between the two vectors were computed and analyzed using Eq. 15.
( ) The degree of sentence similarity is determined by measuring the distance in the shared latent vector space.A smaller distance indicates a higher level of semantic similarity between the sentences.
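The RCCA procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of standard regularized CCA, not the authors' implementation: the function names, the default values of `r_x`, `r_z`, and `k`, the `1/n` covariance normalization, and the use of Euclidean distance in the latent space are all assumptions, since the corresponding equations are not reproduced here. The whitening-plus-SVD step is one common way to solve the generalized eigenvalue problem.

```python
import numpy as np


def standardize(M):
    """Scale each feature (row) to zero mean and unit standard deviation."""
    return (M - M.mean(axis=1, keepdims=True)) / (M.std(axis=1, keepdims=True) + 1e-12)


def rcca(X, Z, r_x=0.1, r_z=0.1, k=2):
    """Regularized CCA for standardized X (d_x, n) and Z (d_z, n).

    Returns projection matrices H (d_x, k), Q (d_z, k) and the k largest
    (regularized) canonical correlations.
    """
    n = X.shape[1]
    # Covariance blocks; r * E is added to the auto-covariances (E = identity).
    C_xx = X @ X.T / n + r_x * np.eye(X.shape[0])
    C_zz = Z @ Z.T / n + r_z * np.eye(Z.shape[0])
    C_xz = X @ Z.T / n
    # Whiten with Cholesky factors; the SVD of the whitened cross-covariance
    # is equivalent to the generalized eigenvalue decomposition.
    L_x = np.linalg.cholesky(C_xx)
    L_z = np.linalg.cholesky(C_zz)
    T = np.linalg.solve(L_x, C_xz) @ np.linalg.inv(L_z).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    H = np.linalg.solve(L_x.T, U[:, :k])
    Q = np.linalg.solve(L_z.T, Vt.T[:, :k])
    return H, Q, s[:k]


def latent_distances(H, Q, X, Z):
    """Per-sample Euclidean distance between H^T X and Q^T Z (assumed metric)."""
    return np.linalg.norm(H.T @ X - Q.T @ Z, axis=0)
```

Under these assumptions, sentence pairs that are strongly related should end up closer together in the shared space than unrelated pairs, which is the property the similarity judgment relies on.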

EXPERIMENTS
To enhance the model's understanding of Chinese, we conducted thorough pre-training, using 8 A100 GPUs for a duration of one month and four days. The pre-training datasets are as follows:
• Belle-dataset [48] is a Chinese dataset comprising 1.5 million samples, created with reference to the Stanford Alpaca dataset. It includes 175 sub-characters. Low-quality data was removed during processing, such as data claiming to be from GPT models, data where the model could not respond due to incomplete input, and data with Chinese instructions but English input or target.
• Moss-003-sft-data [50] is an open-source collection of Chinese-English multi-turn dialogue data from Fudan University's MOSS team, containing over one million samples. The multi-turn dialogue data is constructed from approximately 100,000 user inputs collected during the MOSS-002 internal testing phase. It aligns more closely with the real distribution of user intent and includes finer-grained usefulness category labels, a wider range of harmless data, and longer conversations, totaling around 1.1 million dialogue samples.
• InstructMT Data [51] comprises instruction data and scripts for machine translation. The generated files are primarily suitable for ParroT's format, with some compatibility with LLaMA's format. In this paper, we mainly focus on documents translated between Chinese and English.
This study extends prior research [52] by investigating zero-shot learning and conducting a comparative analysis of 9 advanced Chinese models. The findings are presented for benchmark tasks in 6 distinct domains, spanning fundamental natural language comprehension and generation, as well as applications involving natural language reasoning. These applications include knowledge question-answering, open-ended questioning, multi-turn dialogue understanding, simple numerical computation, and reasoning tasks. Multiple datasets were employed, with training conducted on the A100 GPU and testing on GPUs with 32GB of memory. This setup supports large datasets and complex models, and the experimental design enables the evaluation and comparison of model performance across diverse task domains.

Metrics
We employ a set of carefully chosen evaluation metrics to assess the performance of our model. Each metric was selected to give insight into a different aspect of its effectiveness:
• Accuracy (acc) measures the proportion of correct predictions made by the model on a given input: $acc = \text{Correctly predicted samples} / \text{All samples}$ (16)
• Precision represents the ratio of correctly predicted positive cases to all predicted positive cases.
• Perplexity (PPL) evaluates the language modeling ability of the model. It quantifies how far the model's predicted distribution is from the actual text, with lower values indicating better performance.
• F1-score assesses the accuracy of the model for Dialogue State Tracking (DST). It measures the match between the model's predicted and actual conversation states, combining precision with the recall rate.
• BLEU is a widely used machine translation evaluation metric. It measures the similarity between the machine-generated translation and the reference translation. The BLEU formula involves multiple steps and incorporates a penalty term called BP (Brevity Penalty) to penalize translations that are too short. BP equals 1 when the candidate translation is longer than the reference; otherwise it is $\exp(1 - r/c)$, where $c$ is the total length of the candidate translation and $r$ is the total length of the reference translation.
• Human is a manual scoring mechanism in which human evaluators assess the model's linguistic fluency, the relevance of its answers to the questions, and other qualities when handling open-ended questions and answers.
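The automatic metrics in the list above can be written out as small functions. The sketch below uses only the standard library and the textbook definitions; the function names are illustrative, precision and F1 are shown for a generic binary labeling rather than the paper's DST setup, and perplexity assumes natural-log token probabilities are available.

```python
import math


def accuracy(preds, golds):
    """Fraction of predictions that equal the gold label (Eq. 16)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def precision_recall_f1(preds, golds, positive=1):
    """Precision, recall, and F1 for one positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 if the candidate is longer, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)
```

For example, a candidate half as long as its reference receives a penalty of `exp(-1)`, roughly 0.37, which multiplies the n-gram precision term of BLEU.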

Baseline
We faithfully replicated the configurations and findings of previous studies for comprehensive comparisons. However, it is important to note that specific baseline models may be applicable only in certain experimental contexts. Our selection of comparative models prioritized those with parameter counts similar to our model (13B). We also collected publicly available, extensively pre-trained models known for their proficiency in the Chinese language. The chosen comparative models are as follows:
• GPT-3.5 [6] is an LLM from OpenAI, utilizing the GPT-3.5 architecture. It is the latest iteration in the GPT series, exhibiting remarkable proficiency in NLP. The model understands and generates natural language text by leveraging a large corpus of internet text data during training. It has broad applicability across various language processing tasks, including question answering, task fulfillment, and dialogue generation.
• ChatGLM [53] is an open-source bilingual conversational language model based on the General Language Model (GLM) architecture with 6.2 billion parameters. ChatGLM-6B employs technology similar to GPT-3.5 but is optimized for Chinese question answering and conversation. It undergoes bilingual training on approximately 1 trillion tokens, along with supervised fine-tuning, feedback bootstrapping, and reinforcement learning with human feedback.
• OpenAssistant is an open-source project developed by LAION to train a scaled-down alternative to GPT-3.5. Similar to Stable Diffusion's relation to DALL-E, OpenAssistant enables easy adoption and widespread dissemination, empowering ordinary individuals to use the technology effectively.
• Chinese-LLaMA-Alpaca [54] is an extension of the original LLaMA model that incorporates an expanded Chinese vocabulary and undergoes a second round of pre-training on Chinese data. This extension improves the model's Chinese semantic understanding and the efficiency of Chinese encoding and decoding. We compare against its three generalized models.
• StableLM-Tuned-Alpha-13B [55] is an RLHF fine-tuning of Vicuna-13B, which is a fine-tuned version of LLaMA-13B. This model is developed by the CarperAI team at StabilityAI.
• Chinese-Alpaca-LoRA [56] is a language model for the Chinese language developed by the open-source community.
• ERNIEBot [57] is a chatbot developed by Baidu Inc. capable of interacting with users, answering questions, and collaborating on creative tasks. The media has described it as a Chinese counterpart of, and competitor to, the internationally renowned chatbot GPT-3.5.
The availability of Chinese-specific LLMs is currently limited, necessitating our reliance on high-quality models from the open-source community. Unless stated otherwise, the following parameters remained constant for experimental analysis: (Temperature: 0.1, Top P: 0.75, Top K: 40, Beams: 4, Max tokens: 128).
The Temperature parameter controls text diversity, with higher values yielding more varied and random outputs, and lower values generating more consistent and deterministic text. The Top P parameter restricts the choice of the next word to the smallest set of most likely words whose cumulative probability reaches the threshold, keeping the text sensible while preserving diversity. The Top K parameter restricts the choice to the K words with the highest predicted probabilities and is used in conjunction with Top P. The Beams parameter sets how many top partial sequences are maintained; they are extended at each step, and the sequence with the highest score is selected as the final output. The Max tokens parameter limits the length of the generated text. Keeping these parameters fixed ensures consistent evaluation and comparison across models. The values were selected to balance diversity, coherence, and length control.
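To make the roles of Temperature, Top K, and Top P concrete, the sketch below applies them to a vector of next-token logits and returns the filtered distribution from which a token would be drawn. It is an illustration in NumPy rather than the decoding code used in the experiments; the dictionary keys and the function name are hypothetical, and beam search and the length limit are not shown.

```python
import numpy as np

# Fixed settings reported in the text; the key names are illustrative only.
DECODING_SETTINGS = {"temperature": 0.1, "top_p": 0.75, "top_k": 40,
                     "beams": 4, "max_tokens": 128}


def filtered_distribution(logits, temperature=0.1, top_k=40, top_p=0.75):
    """Apply temperature, then top-k, then top-p filtering to next-token logits."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    # Top K: keep the k most probable token ids and renormalize.
    keep = np.argsort(-probs)[:top_k]
    kept_probs = probs[keep] / probs[keep].sum()
    # Top P: keep the smallest prefix whose cumulative mass reaches top_p
    # (always at least one token).
    cutoff = int(np.searchsorted(np.cumsum(kept_probs), top_p)) + 1
    keep, kept_probs = keep[:cutoff], kept_probs[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = kept_probs / kept_probs.sum()
    return out
```

With a temperature as low as 0.1, the scaled logits are spread far apart, so the filtered distribution usually collapses onto the single most likely token, which is consistent with the near-deterministic outputs the fixed settings aim for.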
Based on the mentioned tasks, we devised our testing approach scores (Eq. 21) and evaluated performance with metrics including PPL, BLEU, and ρ. The results, depicted in Figure 2, show the cumulative score of multiple sentences as "Score". Through extensive experimentation, we determined the values of λ_1, λ_2, and λ_3 to be 0.2, 0.3, and 0.5, respectively.
Overall, our evaluation results for the six main tasks are presented in Figure 3 and Figure 4. These figures provide a comprehensive depiction of the performance of our approach across these tasks, shedding light on its effectiveness and suitability.
Data Intelligence Just Accepted MS. https://doi.org/10.1162/dint_a_00251


Knowledge Quiz
Building upon Chinese-LLaMA-Alpaca [54], we expanded the knowledge question-answering (QA) test set and evaluated our model's performance. To ensure language inference consistency, prompt statement coherence was maintained across different models. The evaluation was conducted in a zero-shot setting, comparing our model with GPT-3.5, Chinese-LLaMA-Alpaca, and other models.
We employed a closed-book approach for this task, limiting the model's access to relevant evidence for QA. Although our model exhibited relatively lower QA accuracy, it achieved the lowest PPL in language generation (Table 1). Due to its reduced parameter count, our model had a relatively weaker information retrieval capability than GPT-3.5, specifically for knowledge QA. Nevertheless, it provided concise and accurate responses to basic knowledge questions. Our model has only 13 billion parameters, which is much smaller (about 5 to 10 times) than GPT-3.5. Despite this, it still performs well in the benchmark task.

Chinese Reading Comprehension and Reasoning Skills
We chose the LCSTS task [58] as the standardized measure to evaluate our models. This task is designed for Chinese short-text summarization and includes a comprehensive collection of Chinese text pairs and corresponding summaries. Each sample in the dataset consists of an original text and a manually generated summary. The original text provides a concise and descriptive phrase or sentence, while the summary offers a condensed version of the original content. We performed a comparative analysis with well-known models like OpenAssistant and StableLM-tuned-alpha-13B to assess our model's effectiveness. Additionally, manual assessments were conducted to evaluate the models' performance.

In Table 2, we present the performance of our models on the LCSTS task. GPT-3.5 achieved optimal performance with fixed parameter configurations, while our model performed slightly lower than GPT-3.5. The LCSTS task focuses on information summarization and text comprehension. Our model demonstrates a comprehensive understanding of the text, enabling the generation of summaries that capture essential information. However, our model tends to produce slightly longer headlines than the desired standard length. Notably, GPT-3.5 and ChatGLM generate the most concise summaries. By adjusting the maximum token count to 64, we observed a reduction in the summary length.
It is important to highlight that while GPT-3.5 excels in generating informative summaries, its output is often uniform and template-like. In contrast, as depicted in Figure 5, our model can generate summaries in diverse styles based on the content. This feature is valuable for practical applications such as news summarization, as it enables the creation of relevant and informative headlines tailored to the content's characteristics and requirements.

Multi-Round Dialogue Understanding
We assessed our model on the C3 [59] benchmark. The C3 benchmark is a free-form multiple-choice Chinese machine reading comprehension dataset of extracted multi-turn dialogues. The benchmark requires models to select the correct option among the given choices based on the multi-turn dialogue. We employed two strategies for evaluation: one that required the model to select from the options and another that demanded the model to provide its reasoning process without given choices. We compared our model against GPT-3.5, StableLM-tuned-alpha-13B, and other models. As reported in Table 3, our model achieved the best performance on this benchmark. Its accuracy (0.81) was significantly higher than that of the other models. Among the remaining 0.19, the model made 0.6 selections of suboptimal answers, indicating some level of contextual relevance. When evaluated using the second strategy, most of the thinking chains generated by our model were logically consistent. However, the ChatGLM model failed to fulfill the requirements of the prompts in some test cases. Through experiments on the C3 benchmark, we demonstrated the superiority of our model in Chinese dialogue understanding. Figure 6 shows that our model performed remarkably well in selecting the correct answers and generating reasonable thinking chains.

Open-Ended Response
We selected WebQA [60] as the evaluation framework for our experiments. WebQA is a curated dataset of question-answer pairs obtained from the Baidu Zhidao platform. In this task, individual questions were presented to the model, and its performance was assessed based on accuracy and manual evaluation. Our model was compared against GPT-3.5, Chinese-LLaMA-Alpaca, and other models. As depicted in Table 4, because it was trained on a comparatively limited amount of data, our model exhibits a relatively weaker proficiency in answering open-ended questions. GPT-3.5 outperformed our model in this task, which marks a notable limitation.
Despite this limitation, our model shows a certain level of competence on the WebQA task. It can respond to simpler inquiries, albeit with a discernible gap compared to other models, and it remains competitive within this benchmark task. Our research primarily revolves around logical reasoning and the generation of thinking chains, prioritizing these aspects over large-scale corpus training or information retrieval. Consequently, the weakness of our model in addressing open-ended questions does not signify a comprehensive failure but rather reflects the specific focus and experimental design of our study.

Mathematical Reasoning Ability
We evaluated the mathematical reasoning capabilities of our model using the Math23K dataset, which consists of 23,162 Chinese math problems sourced from the internet, covering topics from elementary and junior high school levels. During the evaluation, specific parameters were fixed, and a comparative analysis was conducted using a greedy approach across multiple models. In contrast to the other models in the comparison, our model did not undergo fine-tuning specifically for math problems; instead, it directly engaged in question querying and reasoning. The performance of our model was compared with GPT-3.5, ChatGLM, and OpenAssistant models, as illustrated in Table 5. The results show that our model outperforms the compared models on the Math23K task, with particularly strong reasoning abilities on Chinese language tasks. During the testing phase, our model faced challenges in providing precise answers to pure mathematical equations, such as 1 + 23 / 6 = ?. This difficulty arises from a limitation in the training dataset, which does not sufficiently cover equations involving basic arithmetic operators such as +, -, *, and %. As a result, our model's performance in this particular aspect could be enhanced. Nevertheless, we have made progress in addressing this concern through fine-tuning endeavors.
The thinking process employed by our model serves as a crucial criterion for manual evaluation. As depicted in Figure 7, our model consistently demonstrates its ability to engage in rational and rigorous reasoning, going beyond reliance on prior knowledge from the dataset and exhibiting comprehensive logical reasoning skills. Furthermore, our model exhibits a relatively stable performance in answer generation even when the question description is perturbed to some extent, such as by altering the question order or punctuation. This stability is achieved through the incorporation of Auto-CoT functionality, which enables the model to maintain consistency regardless of prompt variations and independent of a specific answer format. In contrast, applying similar modifications to other models leads to a decrease in accuracy. Consequently, our model outperforms the compared models significantly in this aspect. Our model performs strongly in Chinese mathematical reasoning tasks, employing sound reasoning processes and maintaining stability. While the model's accuracy may not yet match human-level performance, it represents a significant step forward for LLMs.

Chinese Idiom Comprehension
We conducted experimental evaluations using the ChiD task [61], sourced from "The Complete Collection of Chinese Idioms." This task specifically focuses on four-character idioms, and the experimental passages were extracted from various literary works, articles, and news sources. For the purpose of gathering more comprehensive information, paragraphs shorter than 100 characters were merged with the subsequent paragraph, while paragraphs exceeding 600 characters were excluded. Moreover, paragraphs consisting solely of high-frequency idioms were also omitted. To assess the performance of our model, we compared the experimental results with those of the Chinese-Alpaca-LoRA, GPT-3.5, ChatGLM, and StableLM-Tuned-Alpha-13B models.
The experimental results presented in Table 6 reveal that our model achieved an accuracy of 0.62 on the ChiD task, surpassing the performance of other models. Among the incorrect answers, 0.13 were manually determined to be suboptimal, highlighting the strong performance of our model in this task. In contrast, the ChatGLM model, which we compared against, exhibited a lower accuracy of approximately 0.34. Our model's superior understanding of idioms enables a deeper comprehension of their underlying meanings.
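The length-based paragraph handling described above can be sketched as a single pass over the paragraphs. This is one possible reading of the rule, not the dataset's actual preprocessing script: the function name is hypothetical, the sketch assumes that merging happens before the 600-character check, that a short trailing paragraph with no successor is dropped, and it leaves out the high-frequency idiom filter because the text does not specify how it is computed.

```python
def prepare_passages(paragraphs, min_len=100, max_len=600):
    """Merge paragraphs shorter than min_len into the next one; drop those over max_len."""
    passages = []
    buffer = ""
    for paragraph in paragraphs:
        buffer += paragraph
        if len(buffer) < min_len:
            continue  # still too short: carry it into the subsequent paragraph
        if len(buffer) <= max_len:
            passages.append(buffer)
        # Whether kept or excluded, start fresh for the next paragraph.
        buffer = ""
    return passages
```

For instance, paragraphs of 40 and 80 characters are merged into one 120-character passage, while a 700-character paragraph is excluded.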

Ablation Experiments
To accurately evaluate the impact of the LoRA and Auto-CoT modules in the LLaMA-LoRA neural prompt engineering framework on model reasoning and Chinese comprehension abilities, we conducted a series of ablation experiments on the Math23K, C3, and ChiD tasks. The model was assessed through both manual evaluation and accuracy to reveal the necessity and effectiveness of these modules. The results of the ablation experiments for the LoRA and Auto-CoT modules are provided in Table 7. The findings indicate that the inclusion of the LoRA module significantly enhances the model's Chinese comprehension and, to some extent, its ability to reason with mathematical symbols, particularly excelling in multi-turn dialogue comprehension. However, the LoRA module exhibits a relatively poor understanding of idioms and requires further improvement. On the other hand, the inclusion of the Auto-CoT module leads to improved performance across the Math23K, C3, and ChiD tasks, with significant enhancements observed in the Math23K task. The Auto-CoT module primarily enhances the model's reasoning capabilities and improves its comprehension of Chinese to a certain degree, particularly in understanding Chinese conversations. The examples presented in Figure 8 illustrate the improvement in logical thought process achieved through the Auto-CoT module.

CONCLUSION
This paper presents a novel LoRA fine-tuning framework based on the LLaMA-13B model, incorporating the Auto-CoT mechanism. Our framework effectively addresses limitations in Chinese comprehension, expression, and logical reasoning of language models through adaptive enhancements. The training process involves two rounds: parallel training followed by the introduction of the thinking chain for adaptive training. This methodology improves the model's Chinese comprehension and logical reasoning abilities while maintaining a relatively low parameter count.
This section provides a comprehensive explanation of the model's construction, training, and workflow. The model consists of three primary components: the model training segment, the generation of thinking chain examples, and the question testing phase. The overall architecture of our model is illustrated in Figure 1. During the training phase, the model receives vectorized Chinese sentences as input. These sentences are simultaneously processed by both the LLaMA base model and the LoRA model for parallel training. While the LoRA model undergoes fine-tuning, the parameters of the LLaMA model remain fixed. The resulting sentences are then stored within the trained model. For a more detailed description of the workflow, please refer to Section 4.2.

Figure 1. Overview of our model: A comprehensive view of the model, comprising three components: model training, thinking chain sample formation, and test answer generation.
The sample matrix $s_i \in S$ undergoes standardization and includes both the question and answer matrices. The answer matrix is formed by concatenating the reasoning matrix $r_i$ and the final answer matrix $a_i$. A detailed process for generating thinking chain examples is established, providing further specifics in Section 4.3.
leveraging the base language model and the LoRA model, the final reasoning skill matrix   1 2

Figure 3. Comparison of the performance of the six primary models across multiple tasks.

Figure 5. An example from the LCSTS dataset. Our model is able to generate more diverse and stylized summaries, avoiding the problem of homogenization.

Figure 6. Example of the model's thought chain, with a Max tokens value of 512.

• LLaMA-LoRA Neural Prompt Engineering Framework: This study introduces the LLaMA-LoRA framework, an extension of the LLaMA-13B model that incorporates the LoRA technique. By optimizing the model's parameter efficiency while maintaining task performance, this framework enhances the model's reasoning capabilities and reduces resource requirements.
• Automatic Chain of Thought prompting: To overcome the limitations of traditional cognitive chain prompts, this paper proposes the use of Automatic Chain of Thought (Auto-CoT) in the refined model. These prompts are dynamically sampled to encourage the generation of reasoning chains and improve answer inference, enhancing the model's reasoning performance and mitigating errors in answer generation.
• Enhanced Comprehension and Reasoning with Chinese CoT Data: Leveraging a comprehensive corpus of Chinese CoT data, this research enriches the model's comprehension and reasoning abilities. By training on this dataset, the model gains a deeper understanding of Chinese language tasks, leading to improved performance and addressing the challenges associated with Chinese text comprehension and reasoning.
• Superior Performance and Benchmark Surpassing: Through extensive comparative experimentation, this study demonstrates the strong performance of the proposed LLaMA-LoRA model across various Chinese language tasks. It outperforms state-of-the-art models such as GPT-3.5, Chat-GLM, and OpenAssistant, offering more accurate, comprehensive, and professional answers. This achievement establishes the LLaMA-LoRA framework as a new benchmark in the field of Chinese natural language processing, providing valuable insights and resources for further research.
• Open Model Data: The open-source model data of LLaMA-LoRA is readily available, making it easier for researchers to conduct studies in this area and inspiring future endeavors.

To accommodate this, the sentence matrix obtained through encoding and decoding is redefined, where $m^{k}_{chinese} \in M$ represents a matrix of understanding skills specific to a particular question and its corresponding answer. In the second training step, CoT training data is incorporated into the pre-trained model.

Simultaneously, training is performed using a composite model combining the LLaMA base language model and the LoRA model. Throughout this training phase, the parameters of the base language model remain fixed, while matrix low-rank decomposition is applied to the input of the LoRA model. As a result, the final comprehension skill matrix is obtained from the base language model and the LoRA model. Subsequently, these comprehension skill matrices are integrated into the newly trained model, initiating the fine-tuning process of the LoRA model. The mathematical representation of the fine-tuning procedure for the LoRA model is given by Eq. 1.
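The idea of keeping the base weights fixed while training only a low-rank update can be illustrated with a single linear layer. The sketch below follows the commonly used LoRA formulation, in which the output is the frozen projection plus a scaled product of two small matrices; it is not a transcription of Eq. 1, and the class name, the `rank` and `alpha` values, the initialization, and the plain SGD step are assumptions made for illustration.

```python
import numpy as np


class LoRALinear:
    """One linear layer with a frozen weight W and a trainable low-rank update B @ A."""

    def __init__(self, W, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W  # base weight: never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))  # trainable; zero so training starts at W
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def sgd_step(self, x, grad_out, lr=1e-2):
        """Update A and B from the gradient of the loss w.r.t. the layer output."""
        grad_B = self.scale * np.outer(grad_out, self.A @ x)
        grad_A = self.scale * np.outer(self.B.T @ grad_out, x)
        self.B -= lr * grad_B
        self.A -= lr * grad_A
```

Because only `A` and `B` receive updates, the number of trained values per layer is `rank * (d_in + d_out)` rather than `d_in * d_out`, which is where the resource savings for large layers come from.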
By searching for examples resembling the test question, the model gradually generates the final result matrix $Q_{result}$, excluding the question itself. This matrix is obtained by concatenating the reasoning process matrix $r_{result}$ and the final result matrix $a_{result}$. The pseudocode in Algorithm 2 provides an illustration of the evaluation outcomes during the testing process.
Formation and Construction of Sample Thought Chains.
$s_j \in S$ that are similar to the test question.
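The retrieval of samples that resemble the test question can be sketched as a nearest-neighbor search followed by prompt assembly. Everything specific here is an assumption for illustration: the sentence vectors are taken as given (the encoder is not shown), cosine similarity stands in for the similarity measure, and the `Q:`/`A:` prompt layout and the dictionary keys are hypothetical rather than the format used in the experiments.

```python
import numpy as np


def select_demonstrations(question_vec, sample_vecs, k=2):
    """Return indices of the k samples most similar to the question (cosine)."""
    q = np.asarray(question_vec, dtype=float)
    S = np.asarray(sample_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    scores = S @ q
    return [int(i) for i in np.argsort(-scores)[:k]]


def build_prompt(question, samples, indices):
    """Concatenate the chosen thinking-chain samples, then append the test question."""
    parts = [
        f"Q: {samples[i]['question']}\nA: {samples[i]['reasoning']} {samples[i]['answer']}"
        for i in indices
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Each demonstration carries its reasoning followed by its final answer, mirroring the way an answer is formed by concatenating the reasoning part and the final result.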

Table 1. Performance of multiple models on question answering.

Table 3. Multiple rounds of dialogues for enhanced understanding.

Table 6. Model performance on idiom comprehension.

Table 7. Results of ablation experimental data. Different stages of the model in the logical improvement.