ChatGPT has attracted extensive attention from academia and industry. This paper aims to evaluate ChatGPT's Chinese language understanding capability on 6 tasks using 11 datasets. Experiments indicate that ChatGPT achieves competitive results on Chinese sentiment analysis, summarization, and reading comprehension, while it is prone to factual errors in closed-book QA. Further, on two more difficult Chinese understanding tasks, namely idiom fill-in-the-blank and cant understanding, we found that a simple chain-of-thought prompt can improve the accuracy of ChatGPT in complex reasoning. This paper further analyzes the possible risks of using ChatGPT based on the results. Finally, we briefly describe the research and development progress of our ChatBIT.

Large Language Models (LLMs) have become one of the most indispensable techniques in natural language processing research in recent years, and a large body of research has been produced in academia and industry around LLMs, such as GPT-3 [1], LaMDA [2], Ernie 3.0 titan [3], PANGU-α [4], WeLM [5], and GLM-130B [6]. According to Kiela et al. [7], LLMs have reached or even exceeded the human average on multiple tasks such as language understanding and dialogue generation. Most studies assess the capabilities of language models by their performance on a range of specific datasets, and research on whether language models can correctly respond to diverse instructions, satisfy reasonable requests, and generate content aligned with users is still at an early stage.

ChatGPT has demonstrated impressive language ability when interacting with humans: it tracks the progress of a dialogue to answer questions, admits mistakes, and challenges or rejects unreasonable requests. There have been some studies on the impact of ChatGPT on academia and the education industry, and it is generally believed that ChatGPT can improve work efficiency in a variety of fields but needs to be used wisely [8]. On the impact of LLMs on scientific research, van Dis et al. believe that AI has some benefits, but that academics need to establish more explicit accountability for using AI [9]. In the scientific research process, researchers need to be more open and transparent about the usage of AI, and Thorp [10] takes a negative view on questions such as whether LLMs can be authors of papers or whether artificial intelligence can hold copyright.

Previous performance tests of ChatGPT and other GPT-3.5 series models have focused on zero-shot and few-shot evaluation on public datasets and on comparing the results with supervised fine-tuned models. These studies show that ChatGPT performs extremely well on many English datasets and is even comparable to supervised fine-tuned models on some evaluation metrics [11]. In machine translation, sentiment analysis, and fake information detection, ChatGPT surpassed SOTA zero-shot performance, but it performed poorly on open-domain knowledge question answering and task-oriented dialogue [12], and its machine translation performance in low-resource languages is also weaker than that of existing translation products [14]. However, there are currently few comprehensive evaluations of ChatGPT on Chinese language understanding tasks.

In this paper, we first give a brief introduction to the GPT family of models and related research on large language models. Then we use several public datasets to evaluate the Chinese language performance of ChatGPT, compare it with models developed in China, and study its specific behavior in Chinese. We found that although ChatGPT has impressive performance on classic NLP tasks, it still faces the challenge of fabricating facts due to hallucination. In addition, on two more difficult Chinese understanding tasks, we show that the performance of ChatGPT can be improved by chain-of-thought prompts.

We analyze its impact and potential risks based on the performance evaluation results. Finally, we introduce our ChatBIT, a knowledge-augmented dialogue system that uses citations to enhance the credibility of the generated content and mitigate the impact of hallucination.

In summary, our main contributions include:

  • We evaluate ChatGPT with zero-shot prompting on sentiment analysis, summarization, reading comprehension, closed-book question answering, cant understanding, and idiom fill-in-the-blank. We also improve ChatGPT's answer accuracy on the last two tasks by using simple chain-of-thought prompts.

  • We analyze the potential risks of ChatGPT from three aspects: hallucination, ethics, and privacy and security.

  • We briefly present research progress on ChatBIT, our more trusted chat system.

Language modeling is one of the central research topics in natural language processing; BERT [15], GPT [1], and other models are widely influential and innovative research works on LLMs, and ChatGPT is a groundbreaking product based on GPT-3.5. This section introduces the main technological innovations behind ChatGPT.

Generative pre-trained models. The Generative Pre-trained Transformer (GPT) is an autoregressive pre-trained model that takes the token sequence of the current text as input and predicts the next token. Generative models such as RNNs [18], LSTMs [19], and the Transformer [20] are widely used in deep learning. The GPT series of models is based on the Transformer decoder. In each decoder layer, multi-head self-attention can focus on different aspects of the sentence, and as the number of layers increases, more advanced and abstract semantic information is extracted, making the model more capable of understanding textual features. In addition to strong semantic extraction capabilities, Transformer-based language models can also learn from unlabeled data and have better transfer learning capabilities. The GPT-1 [16] model follows the decoder structure of the Transformer. In GPT-2 [17], layer normalization is placed before each submodule, and an additional layer normalization is added after the last block.
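
For illustration only, a minimal PyTorch sketch of the causal (masked) single-head self-attention pattern that underlies this decoder-only design; the random projection matrices below are placeholders, not trained weights:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a sequence of token embeddings.

    x: (seq_len, d_model). Each position may attend only to itself and to
    earlier positions, which is what lets a decoder predict the next token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    # Upper-triangular mask blocks attention to future positions.
    mask = torch.ones_like(scores).triu(diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 16)                         # 5 tokens, embedding size 16
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)  # shape (5, 16)
```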

In-context learning. When LLMs are self-supervised on large-scale corpora, surprising pattern recognition and reasoning capabilities emerge; in-context learning is one of the capabilities found in GPT-3 [1]. With in-context learning, no gradient backpropagation is needed: a small number of labeled samples and a prompt are fed into the model to guide GPT-3's inference, which means that LLMs can automatically identify the type of task during decoding and make predictions based on the task and the samples. Noting that GPT-3 has weak code generation capabilities, OpenAI further designed Codex [21], a GPT model fine-tuned on publicly available GitHub code. The results of Codex indicate that GPT models can generate computer program code.
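
As a concrete illustration, an in-context (few-shot) prompt simply concatenates a handful of labeled demonstrations with the new input and lets the model continue the pattern, with no parameter update; the examples below are made up for illustration:

```python
# Hypothetical few-shot prompt for Chinese sentiment classification: the model
# is expected to infer the task format from the demonstrations alone.
demonstrations = [
    ("这部电影太精彩了。", "积极"),    # "This movie is wonderful." -> positive
    ("物流太慢,体验很差。", "消极"),   # "Shipping was slow, bad experience." -> negative
]
query = "服务态度非常好。"             # "The service attitude is very good."

prompt = "".join(f"评论: {text}\n情感: {label}\n\n" for text, label in demonstrations)
prompt += f"评论: {query}\n情感:"
print(prompt)  # fed to the model as-is; the generated continuation is the prediction
```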

Instruction tuning. To give large language models better task recognition ability in the zero-shot setting, instruction tuning describes a variety of tasks in natural language and then feeds the task descriptions and training samples into the model for fine-tuning [22]. With instruction tuning, language models can generalize their natural language understanding ability to the recognition and understanding of tasks, so that they can correctly recognize new tasks during zero-shot inference.

Chain-of-Thought. Chain-of-Thought (CoT) refers to a series of intermediate steps that spell out the reasoning process. For large-scale language models, CoT prompts can be introduced into the context to improve performance on arithmetic, commonsense, and symbolic reasoning [37]. CoT-style reasoning can also be generated automatically by LLMs and combined with techniques such as majority voting over multiple reasoning paths [38] or generating code to solve problems [39]. A large amount of high-quality labeled data can be obtained through CoT prompts and used to further improve the performance of the language model [40].

Reinforcement learning from human feedback. The results of DialoGPT [23], an open-domain dialogue model based on GPT-2, show that it is necessary to use reinforcement learning to prevent the model from producing abnormal responses. To obtain higher-quality output, WebGPT [24] uses human feedback to optimize the quality of the answers generated by the model and obtains results more aligned with human preferences through imitation learning and reinforcement learning. Generative language models need to control the generated content to avoid producing controversial or incorrect text. InstructGPT [25] applies reinforcement learning from human feedback (RLHF) to control the content generated by GPT: supervised fine-tuning is first performed on GPT-3, a reward model is then trained from human annotators' rankings of the model's outputs [26], and finally the supervised fine-tuned model is trained with the PPO [27] algorithm to maximize the reward. Through this training method, the model outputs text that is better aligned with humans, avoids toxic speech, and is less likely to respond to unreasonable requests.
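
As a concrete illustration of the reward-modeling step, the pairwise ranking objective used in this line of work [25, 26] amounts to maximizing log sigmoid(r_chosen - r_rejected); a minimal PyTorch sketch with dummy reward scores (the reward model itself is omitted):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model [25, 26].

    r_chosen / r_rejected are the scalar rewards assigned to the human-preferred
    and the less-preferred response for the same prompt; minimizing the loss
    pushes the preferred response's reward above the other one's.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scores for a batch of 4 comparison pairs, in place of a real reward model.
r_chosen = torch.randn(4, requires_grad=True)
r_rejected = torch.randn(4, requires_grad=True)
loss = reward_ranking_loss(r_chosen, r_rejected)
loss.backward()
```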

GPT-3.5 is trained with code generation and instruction fine-tuning; ChatGPT is the language model obtained after supervised instruction fine-tuning and RLHF on top of GPT-3.5. This makes the text generated by ChatGPT more in line with human language habits.

Existing evaluations show that ChatGPT has achieved good results in machine translation [14], natural language understanding, and sentiment analysis [12]. To quantify the performance of ChatGPT in Chinese and compare it with large-scale language models developed in China, this section uses publicly available Chinese datasets to test and evaluate ChatGPT. We mainly test ChatGPT with zero-shot prompting on sentiment analysis, automatic summarization, reading comprehension, and closed-book question answering; for the idiom fill-in-the-blank and cant understanding tasks, we conduct additional chain-of-thought reasoning tests. Note that, except for ChatGPT, we directly use the published results for the other models in the evaluation tables; although the prompts used by these models differ, we assume that these publicly available results reflect the best performance of the baseline models. In addition, the system message and hyperparameters used in our experiments follow the default settings of the OpenAI API.
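
For reference, a minimal sketch of how such a zero-shot query can be issued with the pre-1.0 openai Python package and the gpt-3.5-turbo model used in our experiments; no system message or sampling hyperparameters are passed, so the API defaults apply (the prompt text and key handling here are illustrative):

```python
import openai  # pre-1.0 interface of the openai package

openai.api_key = "YOUR_API_KEY"  # placeholder

def zero_shot_query(prompt: str) -> str:
    """Send a single zero-shot prompt to gpt-3.5-turbo with default settings."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

# Example: sentiment-analysis prompt template from Section 3.1 (illustrative input).
print(zero_shot_query("判断以下句子的情感是积极或是消极: 这家餐厅的服务很好。"))
```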

3.1 Sentiment Analysis

Sentiment analysis is a classic NLP task in which the model needs to determine the sentiment category of the input context. This paper uses the test sets of ChnSentiCorp and EPRSTMT [28], in which each sample is labeled with one of two sentiment polarities, positive or negative. The results are shown in Table 1.

Table 1.

Evaluation Results of Sentiment Analysis.

Datasets | Metric | WeLM | ChatGPT | Ernie 3.0 titan | GLM-130B
ChnSentiCorp | Acc. | 81.58 | 87.50 | - | -
EPRSTMT | Acc. | 85.90 | 88.75 | 92.50 | -

ChatGPT demonstrated competitive performance on the sentiment analysis task. We observed a bias in ChatGPT's understanding of the instructions: for example, although we used a prompt stating that the input text contains only positive and negative sentiment, "neutral" still appears in ChatGPT's output. In addition, some special samples in ChnSentiCorp contain sentences with different sentiment tendencies; in such cases, an advantage we find in a generative model like ChatGPT over a discriminative model is that ChatGPT can automatically analyze the sentiment sentence by sentence and return the result, as shown in Table 2. A sketch of how such free-form replies can be scored for accuracy follows the table.

Table 2.

An Input/Output Example of Sentiment Analysis Task.

Prompt 判断以下句子的情感是积极或是消极: {Context}\n
(Judge whether the sentiment of the following sentences is positive or negative: {Context}\n) 
Context 看过于丹解读论语的视频,受益匪浅,想要买本书来收藏。明明都提交成功了,却没到货,
无比郁闷……
(I have benefited a lot from watching the video of Yu Dan interpreting the Analects of Confucius, and I want to buy a book for my collection. All were submitted successfully, but the book did not arrive, which made me feel extremely depressed…) 
Ground-truth 0 (消极, negative). 
ChatGPT 第一句话是积极的,表达了受益匪浅的感受;第二句话是消极的,表达了无比郁闷的感受。
(The first sentence is positive, expressing a feeling of having benefited a lot; the second sentence is negative, expressing a feeling of great depression.)
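
To score replies like the one in Table 2 against the binary gold labels, the free-form output has to be mapped back to positive/negative; a minimal post-processing sketch (the keyword rules are illustrative assumptions, not the exact procedure used in this paper):

```python
def to_label(reply: str) -> int:
    """Map a free-form Chinese reply to 1 (positive) or 0 (negative).

    Illustrative heuristic: if the reply mentions only one polarity keyword, use it;
    replies mentioning both polarities (sentence-by-sentence analyses) or neither
    (e.g. "中性", neutral) are treated as unresolved.
    """
    pos, neg = "积极" in reply, "消极" in reply
    if pos and not neg:
        return 1
    if neg and not pos:
        return 0
    return -1  # ambiguous or out-of-scope answer

def accuracy(replies, gold_labels):
    """Accuracy over parsed labels; unresolved replies count as wrong."""
    hits = sum(1 for r, g in zip(replies, gold_labels) if to_label(r) == g)
    return hits / len(gold_labels)
```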

3.2 Summarization

The automatic text summarization task requires the language model to read a long input text and output a concise and accurate summary. Text summarization is divided into extractive and abstractive summarization; the experiments in this paper focus on abstractive summarization and use the test sets of LCSTS [29] and TTNews [30]. Considering that some samples in TTNews are too long for ChatGPT to return results, we randomly sampled 1k samples with fewer than 1k Chinese characters. The evaluation metric is ROUGE-1 [35].
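
For reference, a sketch of a character-level ROUGE-1 F1 computation and the length filtering described above; treating Chinese text at the character level and the exact sampling procedure are our assumptions for illustration, not necessarily the precise setup used here:

```python
import random
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Character-level ROUGE-1 F1 between a generated summary and a reference."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def sample_short_articles(articles, max_chars=1000, k=1000, seed=0):
    """Keep articles shorter than max_chars characters and randomly sample k of them."""
    short = [a for a in articles if len(a) < max_chars]
    random.seed(seed)
    return random.sample(short, min(k, len(short)))
```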

For summary and headline generation of news texts, ChatGPT achieves excellent performance. However, we noticed on the LCSTS dataset that if the prompt is set to "生成30字以内的摘要:" ("generate a summary of no more than 30 characters:"), the average length of the output summary reaches 70 characters, far beyond the "no more than 30 characters" limit stated in the prompt. We therefore changed the prompt to "Generate headlines for the following news:", and the text generated by ChatGPT became much closer in length to the ground truth. The results are shown in Table 3, and Table 4 gives an input/output example.

Table 3.

Evaluation Results of Chinese Text Summary.

Dataset | Metrics | WeLM | ChatGPT
LCSTS | ROUGE-1 | 23.74 | 31.30
TTNews | ROUGE-1 | 35.06 | 42.92
Table 4.

An Input/Output Example of Text Summary Task.

Prompt 为以下新闻生成标题: {Context}\n
(Generate headlines for the following news: {Context}\n) 
Context 随着微信等的崛起,电信传统业务正面临巨大冲击。截至2012年末,全国短信发送量同比仅
增2.1%,远低于往年。据知名电信咨询公司Ovum估算,由于免费通讯软件普及,2012年全
球电信运营商短信营收损失将达230亿美元,2016年或增至540亿美元。
(With the rise of WeChat and others, the traditional telecom business is facing a huge impact. As of the end of 2012, the national SMS sending volume increased by only 2.1% year-on-year, far lower than in previous years. According to Ovum, a well-known telecom consulting company, due to the popularization of free messaging software, global telecom operators’ SMS revenue loss will reach USD 23 billion in 2012, and may increase to USD 54 billion in 2016.) 
Ground-truth “微信系”快速崛起冲击传统电信业务
(The rapid rise of “WeChat” and others impacts traditional telecom business) 
ChatGPT “免费通讯软件”对电信传统业务造成巨大冲击,全球电信运营商短信营收或将损失540亿
美元。
(“Free communication software” has caused a huge impact on traditional telecommunications business, and global telecom operators may lose US$54 billion in SMS revenue.) 

3.3 Machine Reading Comprehension

Machine Reading Comprehension (MRC) is one of the mainstream tasks for assessing the comprehension and reasoning ability of LLMs. In extractive MRC, given a context and a question, the language model needs to identify a subsequence of the context that answers the question, and performance is evaluated by the matching score between the output sequence and the ground truth. Considering that ChatGPT is a generative language model, this paper converts the extraction task into a generation task and tests the zero-shot performance of ChatGPT by constructing the same prompt template and hyperparameter settings as in PANGU-α [4]. The experiments use the validation sets of CMRC2018 [31], which is in Simplified Chinese, and DRCD [32], which is in Traditional Chinese. The evaluation metrics are EM (Exact Match) and F1.
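
For reference, a minimal sketch of the EM and character-level F1 scoring used for Chinese span-extraction MRC; official CMRC2018/DRCD scripts additionally normalize punctuation and handle multiple reference answers, which is omitted here:

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> int:
    """EM scores 1 only if the two strings are identical after trimming."""
    return int(prediction.strip() == ground_truth.strip())

def char_f1(prediction: str, ground_truth: str) -> float:
    """Overlap F1 computed over Chinese characters."""
    common = Counter(prediction) & Counter(ground_truth)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

# The example in Table 6 shows why EM can be 0 while F1 stays relatively high:
print(exact_match("短兵的一般长度为80-95cm。", "80-95cm"))        # 0
print(round(char_f1("短兵的一般长度为80-95cm。", "80-95cm"), 2))  # full recall, partial precision
```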

In common-sense MRC, the answer to the question is not necessarily given explicitly in the context; the model needs to reason over the context with common sense and select the most suitable answer from the provided candidates. In this paper, the validation set of C3 [33] is used for evaluation, and the metric is accuracy. The test results are shown in Table 5.

Table 5.

Evaluation Results of MRC.

Datasets | Metrics | WeLM | PANGU-α | ChatGPT | Ernie 3.0 titan | GLM-130B
CMRC2018 | EM/F1 | -/31.31 | 1.46/19.28 | 0.86/49.45 | 16.62/44.20 | 55.70/-
DRCD | Avg. (EM/F1) | - (-/39.33) | 5.61 (0.66/10.55) | 21.67 (7.01/36.32) | 29.46 (21.08/37.83) | 77.10 (-/-)
C3 | Acc. | 54.30 | 54.47 | 85.14 | 54.85 | 77.50

The reading comprehension results show that Ernie 3.0 titan, with its larger parameter scale of 260B, has a clear advantage over PANGU-α (13B parameters) and WeLM (10B parameters), confirming that larger general models usually perform better on generative tasks. On individual datasets, ChatGPT is comparable to Ernie 3.0 titan in reading comprehension in both Simplified and Traditional Chinese, while overall Ernie 3.0 titan has more of an advantage. The reason ChatGPT's EM score is low while its F1 is higher is presumably its training method: EM requires two texts to match exactly at the character level, whereas ChatGPT is trained with RLHF to generate question-answering text that takes human users' reading experience into account, that is, to produce semantically coherent sentences rather than give the answer span directly. Table 6 is an intuitive example: the ground truth is a segment extracted directly from the context and may not form a grammatically and semantically complete sentence. For long input contexts, the table shows only the beginning and omits the rest.

Table 6.

An Input/Output Example of MRC Task.

CMRC2018 DRCD Prompt 阅读文章:{Context}\n问: {Question}\n答:
(Read article: {Context}\n Question: {Question}\n Answer:) 
Context 短兵在中国文化当中是对尺寸较短的冷兵器,如短刀、剑等的统称。在现当代中
国,短兵除了具有其传统意义外,还特指一种新发展起来的类似击剑和剑道的运动
形式……
(In Chinese culture, the dagger is a general term for a short-sized cold weapon, such as knives and swords. In modern and contemporary China, in addition to its traditional meaning, the dagger also refers to a newly developed form of sport like fencing and kendo…) 
Question 短兵的一般长度是多少?
(What is the general length of the dagger) 
Ground-truth 80-95cm 
ChatGPT 短兵的一般长度为80-95cm。
(The general length of the dagger is 80-95cm.) 
C3 Prompt 阅读文章: {Context}\n问: {Question}\n以下选项中正确的答案是: \n{Choices} (Read the article: {Context}\n Question: {Question}\n The correct answer of the following options is:\n {Choices}) 
Context 现在电视节目越来越多了,但是有意思的节目越来越少了,所以我现在很少看电
视,除了看新闻节目。
(Now there are more and more TV programs, but there are fewer and fewer interesting programs, so I seldom watch TV now, except watching news programs.) 
Question 根据这段话,可以知道:
(According to this passage, it can be known that:) 
Choices 我经常看电视\n 好看的节目很多\n 我只看新闻节目
(I often watch TV \n There are many good programs\n I only watch news programs) 
Ground-truth 我只看新闻节目
(I only watch news programs) 
ChatGPT 根据这段话,可以知道:“我现在很少看电视,除了看新闻节目。”因此,正确答案
是“我只看新闻节目”。
(According to this passage, it can be known: “I rarely watch TV now, except for news programs.” Therefore, the correct answer is “I only watch news programs.”) 

3.4 Closed-book Question Answering

Closed-book question answering does not provide the context for the answer in the input; the language model needs to answer based on the knowledge and common sense learned from the pre-training data. The experiments in this paper use the test sets of WebQA [34] and CKBQA. The evaluation metrics are EM/F1 or accuracy, and the results are shown in Table 7. Closed-book question answering requires the model to have a certain reserve of Chinese knowledge and common sense, and ChatGPT is more prone to factual errors on this task.

Table 7.

Evaluation Results of Chinese Closed Book QA.

Datasets | Metrics | WeLM | PANGU-α | ChatGPT | Ernie 3.0 titan
WebQA | EM/F1 | -/50.90 | 5.13/14.47 | 0.10/0.34 | 37.97/52.57
CKBQA | Acc. | 14.21 | 16.47 | 24.12 | -

We further analyzed ChatGPT's output text on the WebQA test set and found that only 56.96% of the labeled correct answers were contained in its output text, which means that in roughly half of the cases ChatGPT makes mistakes; Table 8 is an example of a closed-book question and answer, and a sketch of this containment check is given after the table. We speculate that the possible reasons are that the Chinese training corpus used is small in scale or of poor quality.

Table 8.

An Input/Output Example of Chinese Closed Book Question and Answer Task.

Prompt 问:{Question}\n答:
(Question: {Question}\n Answer:) 
Question 味精的化学成分是:
(The chemical composition of monosodium glutamate is:) 
Ground-truth 谷氨酸钠
(sodium glutamate) 
ChatGPT 味精的化学成分是谷氨酸钠,化学式为NaC5H8NO4。
(The chemical composition of monosodium glutamate is sodium glutamate, and its chemical formula is NaC5H8NO4.) 
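
A sketch of the containment check behind the 56.96% figure, i.e., whether a labeled answer string appears anywhere in ChatGPT's reply; the exact text normalization is not specified in the paper, so the details below are assumptions:

```python
def answer_contained(reply: str, gold_answers) -> bool:
    """True if any labeled answer string occurs verbatim in the model reply."""
    reply = reply.strip()
    return any(answer.strip() in reply for answer in gold_answers)

def containment_rate(replies, gold_answer_lists):
    """Fraction of questions whose reply contains at least one labeled answer."""
    hits = sum(answer_contained(r, g) for r, g in zip(replies, gold_answer_lists))
    return hits / len(replies)
```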

3.5 Chain-of-Thought Enhances Chinese Language Understanding

Understanding cant on the Chinese Internet usually requires a certain knowledge of Chinese homophones, character shapes, and common sense. The DogWhistle dataset [41] is adapted from the online puzzle game Decrypto, in which four players are divided into two teams; each team must try to understand the code words sent by teammates and crack the code words of the opposing team. We tested ChatGPT's ability to associate and interpret code words using the validation set of its Insider subset.

The idiom comprehension and fill-in-the-blank task requires the language model to fill the blanks in a given sentence with appropriate idioms chosen from a set of candidates according to their meanings. This task requires the model to have a fuller understanding of the meaning of Chinese idioms. Experiments are evaluated using the validation set of the Chinese idiom reading comprehension fill-in-the-blank dataset ChID [28].

Although CoT prompts are usually applied to natural language tasks such as arithmetic and symbolic reasoning [37], our experiments show that CoT prompts can also be used to improve ChatGPT's ability in cant and idiom understanding. Specifically, we first ask ChatGPT to explain the meaning of each cant or idiom through prompts, and then let it choose an answer or fill in the blank according to its previous explanation.
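
Concretely, this two-round CoT setup can be run as a single chat session in which the model's own explanation is kept in the message history before the final question is asked. A minimal sketch with the gpt-3.5-turbo chat interface (pre-1.0 openai package; the prompt strings would be the ones shown in Tables 10 and 11):

```python
import openai  # pre-1.0 interface of the openai package

def cot_two_round(explain_prompt: str, answer_prompt: str) -> str:
    """Round 1 asks for explanations (the CoT step); round 2 asks for the final
    answer, with the model's own explanation kept in the message history."""
    messages = [{"role": "user", "content": explain_prompt}]
    explanation = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages
    )["choices"][0]["message"]["content"]
    messages += [
        {"role": "assistant", "content": explanation},
        {"role": "user", "content": answer_prompt},
    ]
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages
    )["choices"][0]["message"]["content"]

# The zero-shot baseline sends only answer_prompt, skipping the explanation round.
```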

Our experimental results are shown in Table 9, and the evaluation metric is accuracy.

Table 9.

Evaluation Results Using Zero-shot or CoT on Cant and Idiom Understanding Tasks.

Dataset | Zero-shot | CoT | Fine-tuning
DogWhistle | 62.2 | 63.1 | 75.4 [41]
ChID | 52.0 | 58.9 | 61.2 [28]
Table 10.

An Example of Cant Understanding.

User 现在我们玩一个游戏,给定以下短语候选集:“矛,奇异果,音乐家,丸子”\n请发挥想象力解释候 选集中每一个短语可能的含义。 (Now we play a game, given the following candidate set of phrases: “spear, kiwi fruit, musician, meatball”\n Please use your imagination to explain the possible meaning of each phrase in the candidate set.) 
ChatGPT 好的,我来解释一下:\n\n1. 矛:矛是一种古代兵器,也可以比喻为尖锐的批评或指责。所以这 个短语可能的含义是指某人或某事被人指责或批评了。\n\n2. 奇异果:奇异果是一种水果,有着 独特的外观和味道。所以这个短语可能的含义是指某人或某事具有独特、与众不同的特点。\n\n3.
音乐家:音乐家是指从事音乐创作、演奏等相关工作的人。所以这个短语可能的含义是指某人是一 位音乐家,或者在某个场景中与音乐家有关。\n\n4. 丸子:丸子是一种中式食品,通常是用肉、
鱼、蔬菜等搅拌成团状后加工而成。所以这个短语可能的含义是指某人正在吃丸子或者在做丸子这
道菜。(Okay, let me explain:\n\n1. Spear: Spear is an ancient weapon, which can also be compared to sharp criticism or accusation. So the possible meaning of this phrase is that someone or something has been accused or criticized. \n\n2. Kiwi: The kiwi is a fruit that has a unique look and taste. So the possible meaning of the phrase is that someone or something has unique, distinctive characteristics. \n\n3. Musician: A musician refers to a person engaged in music creation, performance, and other related work. So the possible meaning of this phrase is that someone is a musician, or is related to a musician in a certain scene. \n\n4. Meatballs: Meatballs are a kind of Chinese food, which are usually processed by mixing meat, fish, vegetables, etc. into balls. So the possible meaning of this phrase is that someone is eating meatballs or cooking meatballs.) 
User 对于候选集“矛,奇异果,音乐家,丸子”\n请你找出暗号“盾牌”可能与候选集中的哪个短语有关
联。你只能输出一个最相关的结果。 (For the candidate set “spear, kiwi fruit, musician, meatball"\n Please find out which phrase in the candidate set the password “shield” might be associated with. You can output only one most relevant result.) 
ChatGPT 经过分析,我认为暗号“盾牌”最可能与候选集中的短语“矛”有关联。因为矛和盾牌在古代战争中通
常是一起使用的,可以互相配合进行攻防。所以这个短语与“盾牌”有一定的联系。
(After analysis, I think the password “shield” is most likely related to the phrase “spear” in the candidate set. Because spears and shields are usually used together in ancient warfare, they can cooperate with each other for offense and defense. So this phrase has a certain connection with “shield”.) 

The experimental results show that ChatGPT achieves higher accuracy with CoT prompts on these two difficult tasks, but it is still not as good as the supervised fine-tuned models. Tables 10 and 11 show the input examples used for evaluation: the first round of dialogue in each example generates the CoT reasoning process, while the zero-shot group without CoT takes only the second round of dialogue as input.

In cant understanding, we first prompt ChatGPT to explain the meaning of each candidate phrase, and then prompt it to guess which phrase the code word refers to based on the context and the code word itself. In idiom fill-in-the-blank, we first prompt ChatGPT to explain the meaning of each idiom and then ask it to fill in the blank. These two rounds of dialogue simulate the thinking process of an ordinary person, and in both tasks the CoT prompts make ChatGPT more accurate than asking it to give the answer directly.

For idiom fill-in-the-blank, explaining the meanings of the idioms is the more important step for the model to understand the task, because an idiom's meaning is highly semantically related to the sentence around it. The associations required for cant understanding are vaguer: the meanings that ChatGPT is prompted to associate with each candidate phrase in the first round of dialogue are not necessarily related to the code word.

In the experiments, we noticed that the choice of CoT prompt is very important for ChatGPT to generate high-quality replies. For example, in the idiom fill-in-the-blank task, if we replace "explain the meaning of each idiom separately" with "make a sentence for each idiom separately", the accuracy drops to 45.0%. Table 12 shows an example of using this different CoT prompt on the same sample as Table 11.

Table 11.

An Example of Idiom Explanation before fill-in-the-blank.

User 给定成语候选集“大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大意,势如破竹”\n分
别解释每个成语的含义。
(Given a candidate set of idioms, “大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大 意,势如破竹”\n Explain the meaning of each idiom separately.) 
ChatGPT 1. 大逆无道:指极端残暴、不顾法纪、胡作非为的行为。常用来形容暴君、暴政等。\n2. 正经八
百:形容说话或做事很认真,不偏不倚,不偏离主题。\n3. 熙熙攘攘:形容人来人往,热闹非凡的
场面。\n4. 穿凿附会:强行把没有关联的事物硬凑在一起,牵强附会地加以解释,不合逻辑。\n5.
身临其境:形容感受非常强烈,好像亲身置身于事物之中,真实而直观。\n6. 麻痹大意:形容因得
意忘形、轻敌自大而导致精神松懈、失去警惕。\n7. 势如破竹:形容势头非常强劲,像砍竹子一样
不可阻挡。
(1. 大逆无道:Refers to acts of extreme cruelty, disregard for law and order, and misconduct. Often used to describe tyrants, tyranny, etc. \n2. 正经八百: Describes speaking or doing things very seriously, impartial, and does not deviate from the topic. \n3. 熙熙攘攘: describe the scene of people coming and going, lively and extraordinary. \n4. 穿凿附会: Forcibly put together unrelated things and explain them, which is illogical. \n5. 身临其境: Describe the feeling is very real, as if you are in the scene yourself, real and intuitive. \n6. 麻痹大意: Describes the mental relaxation and loss of vigilance due to complacency and underestimation of the enemy. \n7. 势如破竹: the momentum is very strong, as unstoppable as chopping bamboo.) 
User 根据你的解释,从成语候选集:“大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大
意,势如破竹”中选出最恰当的一个替换下文中的#idiom#。\n“在本文中,Google将研究此过程背后
的技术,并演示Cinematic照片如何将过去的2D照片转换为更具#idiom#的3D动画。深度估计像许多
最新的计算摄影功能(例如人像模式和增强现实(AR))一样,电影照片也需要深度图来提供有关场景3D
结构的信息。”
(According to your explanation, choose the most appropriate one from the idiom candidate collection:
大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大意,势如破竹灯 to replace #idiom# below. \n “In this article, Google examines the technology behind this process and demonstrates how Cinematic Photos can transform past 2D photos into more #idiom# 3D animations. Depth estimation like many of the latest computational photography features such as portrait mode Like augmented reality (AR), cinematic photos require depth maps to provide information about the 3D structure of the scene.”) 
ChatGPT 在本文中,Google将研究此过程背后的技术,并演示Cinematic照片如何将过去的2D照片转换为更
具“身临其境”的3D动画。深度估计像许多最新的计算摄影功能(例如人像模式和增强现实(AR))一
样,电影照片也需要深度图来提供有关场景3D结构的信息。
(In this article, Google will examine the technology behind this process and demonstrate how Cinematic photos can transform past 2D photos into more “immersive” 3D animations. Depth estimation Like many of the latest computational photography features (such as portrait mode and augmented reality (AR)), cinematic photos require depth maps to provide information about the 3D structure of a scene.) 

The hallucination of generative models can be observed in the example in Table 12, where the model generates fluent but meaningless text [42]. During training, the model memorizes frequently occurring segments of the training data, so similar text may be sampled with higher probability during inference, and generative language models built on neural networks find it hard to avoid such hallucinations; here, the phrase repeated in the input dominates the generation. ChatGPT finally fills "#idiom#" with "深度估计" (depth estimation), but this phrase does not appear in the idiom candidate set and is not even an idiom.

Table 12.

An Example of Making a Sentence using an Idiom before Fill-in-the-blank.

User 给定成语候选集“大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大意,势如破竹”\n
分别为每个成语造句。
(Given a candidate set of idioms “大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大 意,势如破竹” \n Make a sentence for each idiom.) 
ChatGPT 1. 这个人说话总是说大话,根本不正经八百。2. 今天商场里人真的熙熙攘攘,我都被挤得透不过气
来。3. 他的推理过程总是穿凿附会,毫无说服力。4. 看这个电影,就像身临其境一样,仿佛置身于
电影情节中。5. 他平时麻痹大意,结果考试时才发现自己没复习好。6. 这支队伍一路势如破竹,连
胜数场比赛。
1. This person is always bragging, not serious at all. 2. The shopping mall is really crowded with people today, and I can hardly breathe. 3. His reasoning is always a far-fetched way of speaking and unconvincing. 4. Watching this movie is immersive, as if you are in the plot of the movie. 5. He usually drops his guard, lacks vigilance, and it is only when he takes the exam that he finds out that he has not studied well. 6. This team has been unstoppable all the way, winning several games in a row. 
User 根据你的解释,从成语候选集:“大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大
意,势如破竹”中选出最恰当的一个替换下文中的#idiom#。\n“在本文中,Google将研究此过程背
后的技术,并演示Cinematic照片如何将过去的2D照片转换为更具#idiom#的3D动画。 深度估计 像
许多最新的计算摄影功能(例如人像模式和增强现实(AR))一样,电影照片也需要深度图来提供有关
场景3D结构的信息。”
According to your explanation, choose the most appropriate one from the candidate set of idioms:
大逆无道,正经八百,熙熙攘攘,穿凿附会,身临其境,麻痹大意,势如破竹灯 to replace the #idiom# in the following example: \n In this article, Google examines the technology behind this process and demonstrates how Cinematic Photos can transform past 2D photos into more #idiom# 3D animations. Depth Estimation Like many of the latest computational photography features, such as portrait mode and augmented reality (AR), cinema photos require a depth map to provide information about the 3D structure of the scene. 
ChatGPT 在本文中,Google将研究此过程背后的技术,并演示Cinematic照片如何将过去的2D照片转换为更
具深度估计的3D动画。深度估计像许多最新的计算摄影功能(例如人像模式和增强现实(AR))
一样,电影照片也需要深度图来提供有关场景3D结构的信息。
In this article, Google examines the technology behind this process and demonstrates how Cinematic Photos can transform past 2D photos into more depth-estimating 3D animations. Depth Estimation Like many of the latest computational photography features, such as portrait mode and augmented reality (AR), cinematic photos require a depth map to provide information about the 3D structure of the scene. 

In the performance tests, we observed that ChatGPT frequently makes errors in Chinese knowledge and common-sense Q&A and presents fabricated facts in fluent sentences, which can be misleading for groups with limited sources of information. This section analyzes the potential usage risks of ChatGPT based on the preceding performance tests.

Risk of Hallucination. The issue of hallucination in generative language models is currently a critical concern, as it undermines the credibility of the content produced by such models. Unfaithful or nonsensical hallucinated text may mislead users and raise community concerns about the usage of language models. ChatGPT, as a generative language model, may not be able to correctly answer academic questions and some common-sense questions, which means that general language models cannot yet replace dedicated models in professional fields. In addition, because the content generated by LLMs such as ChatGPT may contain factual errors, improper use in the scientific process may distort scientific facts and spread misinformation.

Ethical Risks. The development and application of LLMs will inevitably be accompanied by some ethical issues. On the one hand, the content generated by LLMs may lead creators to inadvertently plagiarize the work of others, creating the risk of intellectual property infringement; on the other hand, the reasoning and creative abilities of LLMs may be used to aid illegal acts such as cheating in exams. It is necessary to restrict the use of LLMs more strictly in education, because the information contained in LLMs is difficult to analyze quantitatively and the impact of the generated content on students' values is uncontrollable. Additionally, because students can easily get answers from LLMs, they may be inclined to use them to complete homework and cheat on exams, which will affect their ability to think independently.

Risks of Privacy Security. Users need to upload the content of the conversation to the server to obtain the content generated by LLMs. These conversations may contain information such as the user's personality, preferences, and even identity. Users may inadvertently disclose this information to service providers while working with LLMs. Therefore, how to balance the use of LLMs and the protection of personal information may be a problem that AGI will face in the future.

To address the challenges of LLMs and ensure trusted and controllable generated content, we are developing ChatBIT with a focus on knowledge extraction, reasoning, and reliable citation of sources. The objective is to achieve believable, instant, and transparent LLMs. The model supports domain knowledge question answering, multi-round dialogue, information retrieval, and other functions. It supports multiple languages such as Chinese, English, and Japanese, and further languages can be added according to requirements. Currently, the system provides web and API services and supports multiple users.

The entire development process of ChatBIT, including data preparation, training, and inference, is secure and dependable. In addition, we use mixed precision training [43], model quantization, and pruning to accelerate computation and reduce the time cost of training and inference. The domain data and instructions we use for fine-tuning are of high quality, including information related to the Beijing Institute of Technology and a Chinese knowledge encyclopedia. Fully autonomous and controllable code is used for model construction and training, and a controllable scheme is employed for the external service. Table 13 shows an example of a trusted citation from ChatBIT.

Table 13.

A QA Example of ChatBIT with Trusted Citations.

User 讲一下北京理工大学的历史。
(Tell me about the history of Beijing Institute of Technology) 
ChatBIT 北京理工大学是中国工业和信息化部、副部级建制的全国重点大学,位列国家“双一流”、“211工
程”、“985工程”重点建设高校,是全国高校联盟、全国高等军工院校课程思政联盟、中国人工智能
教育联席会、卓越联盟高校、延河高校人才培养联盟成员的一员单位。\n 北京理工大学的历史可以
追溯到 1940 年,当时它是延安的自然科学院。1949 年,在新中国成立后,它被重新定位为北京市
的工业专门学校……
(Beijing Institute of Technology is a national key university established by the Ministry of Industry and Information Technology of China at the vice-minister level, and is one of the key universities of the “Double First Class”, “211 Project” and “985 Project”. It also is a member of the National Association of Universities, the National Higher Military Industry Colleges and Universities Curriculum Civic Government Alliance, the China Joint Association for Artificial Intelligence Education, the Alliance of Excellence Colleges and Universities, and the Yanhe University Talent Training Alliance.\n The history of Beijing Institute of Technology dates back to 1940, when it was the Academy of Natural Sciences in Yan'an, and in 1949, after the founding of the new China, it was repositioned as a specialized industrial school in Beijing) 
Citation https://baike.baidu.com/item/北京理工大学集成电路与电子学院/61867331
https://baike.baidu.com/item/龙腾/69139
https://baike.baidu.com/item/北京理工大学
https://baike.baidu.com/item/无人车技术工业和信息化部重点实验室 
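
ChatBIT's internal implementation is not detailed in this paper; purely to illustrate the general retrieve-then-cite pattern behind examples like Table 13, here is a toy sketch in which retrieved source URLs are returned alongside the generated answer (the retriever and generator interfaces below are hypothetical):

```python
from typing import Callable, List, Tuple

def answer_with_citations(
    question: str,
    retrieve: Callable[[str], List[Tuple[str, str]]],  # -> [(passage, source_url), ...]
    generate: Callable[[str], str],                     # any dialogue LLM backend
) -> Tuple[str, List[str]]:
    """Toy retrieve-then-cite loop: ground the answer in retrieved passages and
    return their source URLs so users can verify the generated content."""
    evidence = retrieve(question)
    context = "\n".join(passage for passage, _ in evidence)
    prompt = f"根据以下资料回答问题。\n资料:\n{context}\n问题:{question}\n答:"
    answer = generate(prompt)
    citations = [url for _, url in evidence]
    return answer, citations
```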

This paper evaluates ChatGPT on 11 Chinese datasets. Our findings indicate that ChatGPT performs notably well on the classic natural language processing tasks of sentiment analysis, automatic summarization, and reading comprehension, and that the accuracy of its answers can be enhanced by CoT-style prompts, but it is prone to factual errors in closed-book knowledge question answering. Based on the evaluation results, this paper further analyzes the possible risks and challenges of ChatGPT. Finally, this paper presents ChatBIT, our dependable dialogue language model. We hope this evaluation provides some inspiration for designing trustworthy and controllable language models.

The datasets generated and/or analyzed during the current study are available in the CLUE, FewCLUE, CKBQA, DogWhistle, ChnSentiCorp, and related repositories:

https://github.com/pengming617/bert_classification

https://github.com/CLUEbenchmark/FewCLUE

http://tcci.ccf.org.cn/conference/2017/taskdata.php

https://github.com/CLUEbenchmark/CLUE

https://github.com/pkumod/CKBQA

https://github.com/JetRunner/dogwhistle.

Li Linhan: designed the research framework and wrote the manuscript. Zhang Huaping: proposed the research problems. Li Chunjin: wrote and revised the manuscript. You Haowen: revised the manuscript. Cui Wenyao: collected and analyzed the data and revised the manuscript.

All evaluation experiments are based on the gpt-3.5-turbo model. The evaluations of the cant understanding and idiom fill-in-the-blank tasks were completed between April 7 and April 9, and the rest of the experiments were completed between March 5 and March 12.

[1] Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877-1901 (2020)
[2] Thoppilan, R., De Freitas, D., Hall, J., et al.: LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239 (2022)
[3] Wang, S., Sun, Y., Xiang, Y., et al.: Ernie 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2112.12731 (2021)
[4] Zeng, W., Ren, X., Su, T., et al.: PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369 (2021)
[5] Su, H., Zhou, X., Yu, H., et al.: WeLM: A well-read pre-trained language model for Chinese. arXiv preprint arXiv:2209.10372 (2022)
[6] Zeng, A., Liu, X., Du, Z., et al.: GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022)
[7] Kiela, D., Bartolo, M., Nie, Y., et al.: Dynabench: Rethinking benchmarking in NLP. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4110-4124. Association for Computational Linguistics, Online (2021)
[8] Zhou, J., Ke, P., Qiu, X., et al.: ChatGPT: potential, prospects, and limitations. Frontiers of Information Technology & Electronic Engineering, 1-6 (2023)
[9] van Dis, E.A.M., Bollen, J., Zuidema, W., et al.: ChatGPT: five priorities for research. Nature 614(7947), 224-226 (2023)
[10] Thorp, H.H.: ChatGPT is fun, but not an author. Science 379(6630), 313 (2023)
[11] Qin, C., Zhang, A., Zhang, Z., et al.: Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476 (2023)
[12] Bang, Y., Cahyawijaya, S., Lee, N., et al.: A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023 (2023)
[13] Chen, X., Ye, J., Zu, C., et al.: How robust is GPT-3.5 to predecessors? A comprehensive study on language understanding tasks. arXiv preprint arXiv:2303.00293 (2023)
[14] Jiao, W., Wang, W., Huang, J., et al.: Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745 (2023)
[15] Devlin, J., Chang, M.W., Lee, K., et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019)
[16] Radford, A., Narasimhan, K., Salimans, T., et al.: Improving language understanding by generative pre-training. Available at: https://openai.com/research/language-unsupervised. Accessed 28 May 2023
[17] Radford, A., Wu, J., Child, R., et al.: Language models are unsupervised multitask learners. Available at: https://openai.com/research/better-language-models. Accessed 28 May 2023
[18] Elman, J.L.: Finding structure in time. Cognitive Science 14(2), 179-211 (1990)
[19] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)
[20] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
[21] Chen, M., Tworek, J., Jun, H., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
[22] Wei, J., Bosma, M., Zhao, V.Y., et al.: Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)
[23] Zhang, Y., Sun, S., Galley, M., et al.: DialoGPT: Large-scale generative pre-training for conversational response generation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270-278. Association for Computational Linguistics, Online (2020)
[24] Nakano, R., Hilton, J., Balaji, S., et al.: WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021)
[25] Ouyang, L., Wu, J., Jiang, X., et al.: Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
[26] Christiano, P.F., Leike, J., Brown, T., et al.: Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 4299-4307 (2017)
[27] Schulman, J., Wolski, F., Dhariwal, P., et al.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[28] Xu, L., Lu, X., Yuan, C., et al.: FewCLUE: A Chinese few-shot learning evaluation benchmark. arXiv preprint arXiv:2107.07498 (2021)
[29] Hu, B., Chen, Q., Zhu, F.: LCSTS: A large scale Chinese short text summarization dataset. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1967-1972. Association for Computational Linguistics, Lisbon, Portugal (2015)
[30] Hua, L., Wan, X., Li, L.: Overview of the NLPCC 2017 shared task: single document summarization. In: Natural Language Processing and Chinese Computing: 6th CCF International Conference, NLPCC 2017, Dalian, China, November 8-12, 2017, Proceedings 6, pp. 942-947. Springer International Publishing (2018)
[31] Cui, Y., Liu, T., Che, W., et al.: A span-extraction dataset for Chinese machine reading comprehension. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5883-5889. Association for Computational Linguistics, Hong Kong, China (2019)
[32] Shao, C.C., Liu, T., Lai, Y., et al.: DRCD: A Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920 (2018)
[33] Sun, K., Yu, D., Yu, D., et al.: Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics 8, 141-155 (2020)
[34] Li, P., Li, W., He, Z., et al.: Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv preprint arXiv:1607.06275 (2016)
[35] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004) (2004)
[36] OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[37] Wei, J., Wang, X., Schuurmans, D., et al.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
[38] Wang, X., Wei, J., Schuurmans, D., et al.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)
[39] Chen, W., Ma, X., Wang, X., et al.: Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588 (2022)
[40] Huang, J., Gu, S.S., Hou, L., et al.: Large language models can self-improve. arXiv preprint arXiv:2210.11610 (2022)
[41] Xu, C., Zhou, W., Ge, T., et al.: Blow the dog whistle: A Chinese dataset for cant understanding with common sense and world knowledge. arXiv preprint arXiv:2104.02704 (2021)
[42] Ji, Z., Lee, N., Frieske, R., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1-38 (2023)
[43] Micikevicius, P., Narang, S., Alben, J., et al.: Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.