ABSTRACT
Text-to-SQL aims to translate textual questions into the corresponding SQL queries. Aggregate tables are widely created for frequently issued queries. Although text-to-SQL has emerged as an important task, recent studies have paid little attention to the task over aggregate tables. The added aggregate tables bring two challenges: (1) the mapping between natural language questions and relational databases suffers from more ambiguity, and (2) modern models usually adopt a self-attention mechanism to encode the database schema and question; this mechanism has quadratic time complexity, so inference becomes more time-consuming as the input sequence grows. In this paper, we introduce a novel approach named WAGG for text-to-SQL over aggregate tables. To effectively select among ambiguous items, we propose a relation selection mechanism for relation computing. To deal with the high computation cost, we introduce a dynamic pruning strategy to discard unrelated items, which are common for aggregate tables. We also construct a new large-scale dataset, SpiderwAGG, extended from the Spider dataset for validation, where extensive experiments show the effectiveness and efficiency of our proposed method, with a 4% increase in accuracy and a 15% decrease in inference time w.r.t. a strong baseline, RAT-SQL.
1. INTRODUCTION
Text-to-SQL unlocks the ability to translate textual questions into the corresponding SQL queries. It has many applications such as Augmented Business Intelligence (ABI), question answering and fact searching. Recently, a series of datasets and models [1, 2, 3, 4, 5, 6, 7, 8] have been proposed.
In database querying, aggregate tables are widely used to move time-intensive calculations into pre-computed storage: based on existing tables, they change the granularity on specific dimensions and aggregate data up along these dimensions [9]. Although text-to-SQL receives much attention, recent studies have paid little attention to this scenario. Figure 1 shows an aggregate table and its construction statement. The database contains three fact tables: Projects, Scientists and AssignedTo. The aggregate table agg avg hours of projects scientist stores each scientist's name and the average hours of all the projects assigned to that scientist. When one wants to know a specific scientist's average hours over all assigned projects, the question can be answered by running the query against this aggregate table.
The new scenario poses new challenges to current complex text-to-SQL models in terms of both effectiveness and efficiency. Firstly, a central problem addressed by previous models [5, 10] is how to determine the correct relations between questions and table schema. Researchers typically rely on textual matching to build the initial relations (as shown in Figure 2, the token projects is matched with 6 column and table names, 2 from fact tables and 4 from aggregate tables; the thick link is the correct one). However, these models lack a mechanism to select among multiple relations. This deficiency becomes critical in the new scenario, where determining the correct relation is highly ambiguous: since aggregate table names should manifest both the aggregation and the original table information, they reuse existing names, so a question token can match multiple tables and columns of the original tables. Figure 2 demonstrates that the average number of relations per token in the question sentence is 4.17. Furthermore, Figure 3(a) compares the number of relations between the Spider dataset and the dataset in the new scenario; relation selection becomes even harder to determine as the SQL becomes harder. Secondly, the scenario suffers from high computational cost. The self-attention mechanism is widely used in state-of-the-art text-to-SQL models to contextually encode the question and DB schema. It has computational complexity O(L^2), where L is the length of the question plus table schema sequence, so the cost grows rapidly as L grows. As shown in Figure 3(b), the introduction of aggregate tables leads to longer input sequences.
To this end, we propose a new model, WAGG, which includes two innovative components. Firstly, we propose Relation Selection, which uses a fully-connected network with a residual connection to identify candidate items to be matched and to discard original relations that were incorrectly matched due to textual similarity. Secondly, we propose Dynamic Pruning, which rests on a simple intuition: many tables are not relevant for prediction and can therefore be removed. The closest research is the ProbSparse mechanism proposed in Informer [11], which retains only the most informative elements by measuring sparsity, reducing the time and storage complexity to O(L log L). Our method differs in two ways: first, we keep the pruned items consistent across heads, whereas Informer calculates the pruned items for each head separately; secondly, both the query and value lengths are reduced to c ln(L), so the time complexity is further reduced to O((log L)^2).
Compared with the RAT-SQL [5] baseline (state-of-the-art on the Spider dataset), the overall efficiency and accuracy increase significantly.
To support training and evaluation of text-to-SQL over aggregate tables, we also construct a dataset called SpiderwAGG based on Spider. The dataset contains 406 aggregate tables and 9,074 NL-SQL data pairs on aggregate tables. We devise strategies to generate the dataset automatically while keeping the aggregate tables diverse and representative of real usage. Our text-to-SQL data pairs cover multiple domains. Considering that people often only perform simple queries on aggregate tables, we do not include nested SQL.
The contributions of this paper are three-fold:
We present a new scenario for text-to-SQL for the first time and identify two challenges: difficult keyword mapping and high computational cost;
We propose a model, WAGG, which contains two innovative components: a Dynamic Pruning module and a Relation Selection module. Dynamic Pruning improves computation and storage efficiency, and Relation Selection helps to choose the correct one among multiple possible keywords;
We construct a large-scale dataset containing 406 aggregate tables and more than 9,000 data pairs to support learning and evaluation.
The rest of the paper is organized as follows. Section 2 introduces related work, including datasets and models from recent progress in text-to-SQL. Section 3 introduces our model and the methods for constructing the dataset. Section 4 presents the dataset analysis and experimental results. Section 5 concludes the paper.
2. RELATED WORK
2.1 Existing Text-to-SQL Models
Recent progress in text-to-SQL models mainly comes from two aspects. Firstly, the rich expressiveness of natural language requires understanding not only syntactic but also semantic information. Recent advances in pre-trained models [12, 13] produce better word embeddings that capture richer semantics and help with implicit mappings. The second aspect derives from the fact that SQL depends not only on the question but also on the schema of the database. Bogin et al. [14] first noted the importance of understanding the schema. Researchers have also incorporated database content into the encoding. RAT-SQL [5] proposes to incorporate contextual information from the NL question and the schema and to encode them together with a self-attention mechanism. On this basis, Lin et al. [10] represented the table schema, the question and their textual matching relationships as a single tagged sequence and let a Transformer [15] model compute their relationships. However, these models are prone to mapping tokens to SQL keywords and schema content separately, and the field has ignored the consequences because existing datasets hardly contain operator-related schema names. The scenario that we propose not only emphasizes the challenge of understanding the schema but also considers a more complex type of schema, which brings multiple challenges.
2.2 Existing Text-to-SQL Datasets
New datasets for text-to-SQL have been proposed in recent years. Zhong et al. [1] constructed a cross-domain text-to-SQL dataset, WikiSQL, presenting a new challenge that requires model generalization under a database-split setting. Based on it, Yu et al. [2] further proposed the Spider dataset, which contains more complex analytical calculations (e.g., GROUP BY, JOIN) and involves more tables and nested SQL queries. Many other datasets examine the problem from different aspects. Gu et al. [16] built a complex and diverse dataset to deal with out-of-distribution questions, promoting the generalization of question answering models. There are also cases where the model is required to handle information in diverse data forms. Wolfson et al. [17] introduced the QDMR formalism to decompose complex questions and allow querying over multiple information sources, which effectively improves open-domain question answering. Baik et al. [18] proposed a dual-specification query synthesis system that combines NLQ and programming-by-example (PBE), enabling expressive SQL queries. To combine multiple sources, Chen et al. [19] presented a large-scale dataset with heterogeneous data forms and designed a baseline architecture to cope with hybrid information. Wenhu et al. [20] proposed early fusion and cross-block reader techniques for open-domain question answering, facilitating the retrieval of evidence distributed across tabular and textual data. However, the scenario with aggregate tables has been relatively unexplored.
3. METHOD
3.1 Preliminary Introduction of RAT-SQL
To answer a question for unseen table schema, RAT-SQL [5] proposes relation-aware attention mechanism to contextually encode questions, table schema and relations between them.
The researchers first represent the natural language question and its table schema with common NLP embedding methods such as GloVe and BERT, then adopt a Bi-LSTM to process the embeddings. The final input is thus X^init = (x_1^init, ..., x_n^init), in which each x_i^init denotes the representation of a question token or a table schema item. These elements are then encoded together by relation-aware attention layers: for each element x_i in the input, a relation-aware self-attention layer with H heads transforms it into y_i,

e_{ij}^{(h)} = \frac{x_i W_Q^{(h)} \left(x_j W_K^{(h)} + r_{ij}\right)^\top}{\sqrt{d_z / H}}, \qquad \alpha_{ij}^{(h)} = \mathrm{softmax}_j\left(e_{ij}^{(h)}\right), \qquad z_i^{(h)} = \sum_j \alpha_{ij}^{(h)} \left(x_j W_V^{(h)} + r_{ij}\right),

with y_i obtained by concatenating the H head outputs z_i^{(h)} and applying the usual residual connection, layer normalization and feed-forward sublayers,
where r_ij represents the relation between element i and element j. Specifically, each relation is computed via textual matching between the question token and the database schema or database values. Each type of relation is assigned a specific exclusive value. The types include QUESTION-COLUMN-M, QUESTION-TABLE-M, COLUMN-QUESTION-M and TABLE-QUESTION-M, where M is one of EXACTMATCH, PARTIALMATCH, or NOMATCH.
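To make this concrete, the following is a minimal single-head sketch of relation-aware self-attention in PyTorch. It assumes one learned embedding per discrete relation type for the key and value sides; the class, parameter names and shapes are illustrative and are not taken from the RAT-SQL codebase.

```python
import math
import torch
import torch.nn as nn

class RelationAwareAttentionHead(nn.Module):
    """Sketch of one head of relation-aware self-attention (Section 3.1)."""
    def __init__(self, d_model, d_head, num_relation_types):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        # one learned vector per discrete relation type, for keys and values
        self.rel_k = nn.Embedding(num_relation_types, d_head)
        self.rel_v = nn.Embedding(num_relation_types, d_head)
        self.scale = math.sqrt(d_head)

    def forward(self, x, relation_ids):
        # x: (L, d_model); relation_ids: (L, L) relation type between items i and j
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)            # each (L, d_head)
        r_k = self.rel_k(relation_ids)                             # (L, L, d_head)
        r_v = self.rel_v(relation_ids)                             # (L, L, d_head)
        # e_ij = q_i . (k_j + r_ij) / sqrt(d_head)
        scores = (q.unsqueeze(1) * (k.unsqueeze(0) + r_k)).sum(-1) / self.scale  # (L, L)
        alpha = scores.softmax(dim=-1)
        # z_i = sum_j alpha_ij * (v_j + r_ij)
        z = (alpha.unsqueeze(-1) * (v.unsqueeze(0) + r_v)).sum(dim=1)            # (L, d_head)
        return z
```

In the full model, H such heads run in parallel and their outputs are concatenated before the residual, layer-norm and feed-forward sublayers mentioned above.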
To capture relations between the final encoder output and the schema, they also compute a memory-schema linking matrix for columns, and an analogous matrix for tables: each entry is a softmax-normalized relevance score between a question token's final encoding and a column (or table) encoding.
In the decoder, RAT-SQL employs the tree-structured architecture of [21]. An LSTM generates a sequence of actions that eventually forms an abstract syntax tree (AST) of the SQL query. There are two types of actions: the first, ApplyRule[r], applies the grammar rule r to the last generated node; the second, SelectColumn[v] or SelectTable[v], selects a column or table v from the schema when completing a leaf node.
Therefore, the process of generating SQL can be represented formally as

P(a_1, \ldots, a_T \mid Y) = \prod_{t} P(a_t \mid a_{<t}, Y), \qquad m_t, h_t = f_{\mathrm{LSTM}}\left([a_{t-1} \,\|\, z_t \,\|\, h_{p_t} \,\|\, a_{p_t} \,\|\, n_{f_t}],\; m_{t-1}, h_{t-1}\right),

where Y is the representation given by the encoder, m_t is the LSTM cell state, h_t is the LSTM output at step t, a_{t-1} is the embedding of the previous action, p_t is the step at which the parent AST node of the current node was expanded, and n_{f_t} is the embedding of the current node type. Finally, z_t is the context representation, computed using multi-head attention (with 8 heads) of h_{t-1} over Y.
To determine actions: for ApplyRule[r], they compute P(a_t = ApplyRule[r] | a_{<t}) = softmax_r(g(h_t)), where g is a 2-layer MLP with a tanh nonlinearity. For SelectColumn[v] or SelectTable[v], the probability is computed from the decoder state and the memory-schema linking matrices described above.
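The sketch below shows one way the two action distributions could be scored, assuming hypothetical module and tensor names. The real RAT-SQL decoder additionally handles parent feeding and AST bookkeeping, and routes schema selection through the memory-schema linking matrices, which is simplified here to a plain pointer.

```python
import torch
import torch.nn as nn

class ActionScorer(nn.Module):
    """Sketch of the two action distributions used at each decoding step."""
    def __init__(self, d_hidden, num_rules):
        super().__init__()
        # g: 2-layer MLP with tanh, scoring grammar rules for ApplyRule
        self.rule_mlp = nn.Sequential(
            nn.Linear(d_hidden, d_hidden), nn.Tanh(), nn.Linear(d_hidden, num_rules)
        )
        self.pointer = nn.Linear(d_hidden, d_hidden, bias=False)

    def apply_rule_probs(self, h_t):
        # P(a_t = ApplyRule[r] | a_<t) = softmax_r(g(h_t))
        return self.rule_mlp(h_t).softmax(dim=-1)

    def select_item_probs(self, h_t, schema_memory):
        # pointer-style scores of the decoder state against encoded columns/tables
        # (illustrative; RAT-SQL additionally uses the memory-schema linking matrices)
        scores = schema_memory @ self.pointer(h_t)   # (num_items,)
        return scores.softmax(dim=-1)
```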
3.2 Relation Selection
Retaining all the relations brings useful information but also additional noise, which is the main reason for wrong predictions. Noise is especially common in this setting because a question token can be matched multiple times. Therefore, we design the relation selection module to help the model determine the relationships between question tokens and the tables' schema.
Specifically, we provide a way to learn to select the relations needed for further prediction and to abandon the irrelevant ones. The information for making this distinction comes from an overall understanding of the SQL, the table schema and the question, and is learned through the attention mechanism and the rest of the encoder. As shown in Figure 5(b), the token projects has several relations computed by textual matching: it is matched to projects, avg-hours-in-projects and number-of-scientist-in-different-project. We therefore formalize a separate module that encodes both the target token and the matched items and learns to choose the correct relation.
The selection score is produced by a function f applied to the encodings of the question token and each candidate schema item, where f is a two-layer fully-connected network with a tanh nonlinearity; relations judged irrelevant are scaled toward zero. As shown in Figure 5(b), the embedding of the token projects is originally matched with multiple words (in dark orange), and after selection the relation for the incorrect mapping with number-of-scientists-in-different-projects is reduced to zero.
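Since the exact formula is not reproduced here, the following is a minimal sketch of one plausible realization of the selector: f scores each (question token, schema item) pair from their encodings and gates the corresponding relation embedding, pushing incorrect matches toward zero. All names and shapes are illustrative assumptions, not the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class RelationSelector(nn.Module):
    """Sketch: gate textual-match relations with a learned score in [0, 1]."""
    def __init__(self, d_model):
        super().__init__()
        # f: two-layer fully-connected network with tanh nonlinearity
        self.f = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.Tanh(), nn.Linear(d_model, 1)
        )

    def forward(self, x, rel_emb):
        # x:       (L, d_model) encoded question tokens and schema items
        # rel_emb: (L, L, d_rel) relation embeddings from textual matching
        L = x.size(0)
        pairs = torch.cat(
            [x.unsqueeze(1).expand(L, L, -1), x.unsqueeze(0).expand(L, L, -1)], dim=-1
        )                                      # (L, L, 2 * d_model)
        gate = torch.sigmoid(self.f(pairs))    # (L, L, 1)
        # incorrect matches are driven toward zero, keeping only selected relations
        return gate * rel_emb
```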
3.3 Dynamic Pruning
Dynamic pruning reduces the number of items to be encoded (Figure 5(c)). The new aggregate tables in a database increase the amount of table schema to be encoded, and for a given SQL query many tables are irrelevant and need not be encoded, so they can be removed. Informer [11] proposes a method to measure the sparsity between queries and keys and uses it to find the most dominant queries, computing a different dominant set for each head. In our case, whether a text token matches a given table is the same across all heads, so we compute the set only once and share it over all heads. Informer first defines a sparsity measurement for each element in the sequence (Equation 4),

M(q_i, K) = \max_j \left\{ \frac{q_i k_j^\top}{\sqrt{d}} \right\} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}},

and based on it defines ProbSparse self-attention, which retains only the top-U most dominant queries to be attended on. The queries are thus reduced to shape U x d, with U chosen as c * ln(L_K), where c is a constant sampling factor.
We further prune the relations by selecting, from the relation matrix, the rows and columns indexed by the top-U queries, yielding pruned relations r′. Since in the self-attention setting the queries, keys and values are all the same sequence, we prune the keys and values in the same way. Finally, we obtain a pruned sequence X′ of length c * ln(L), where L is the original length, and the relation-aware encoder of Section 3.1 is applied to the pruned sequence X′ with the pruned relations r′.
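A minimal sketch of this pruning step is given below. It assumes the Informer-style max-mean sparsity measurement computed over full dot products (Informer itself samples them for efficiency) and reuses one shared index set for every head; the function and variable names are illustrative.

```python
import math
import torch

def dynamic_prune(x, rel_ids, c=5):
    """Keep the U = c * ln(L) most informative positions and prune the rest.

    x:       (L, d)  encoded question + schema sequence
    rel_ids: (L, L)  relation ids between positions
    """
    L, d = x.shape
    u = max(1, min(L, int(c * math.log(L))))
    # sparsity M(q_i) = max_j(q_i . k_j / sqrt(d)) - mean_j(q_i . k_j / sqrt(d))
    scores = x @ x.t() / math.sqrt(d)              # self-attention: queries == keys
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)
    keep = sparsity.topk(u).indices.sort().values  # same index set shared by all heads
    x_pruned = x[keep]                             # (u, d)   pruned sequence X'
    rel_pruned = rel_ids[keep][:, keep]            # (u, u)   pruned relations r'
    return x_pruned, rel_pruned, keep
```

The default factor c = 5 matches the pruning factor reported in the experimental setup (Section 4).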
3.4 Data Construction
3.4.1 Overview
The generation of the dataset mainly includes two steps: generating aggregate tables and generating text-SQL data pairs. Our method builds on previous work, which we briefly introduce first.
Wang et al. [22] introduced an effective method for generating a semantic parsing dataset "overnight". They first generate logical forms and their paired canonical text using predefined rules, then use crowd-sourcing to paraphrase the canonical text into natural language. Traditionally, people first collect natural text and then annotate it with logical forms, and both steps require massive manual labeling. This method wisely reverses the direction of data-pair generation, hugely decreasing the human effort and speeding up the whole process. More information can be found in the original paper.
We adopt similar procedures but differ in how we create logical forms and natural utterances. Specifically, we adopt the process proposed in GAZP [16], which samples logical forms with respect to a grammar (such as SQL) and then generates utterances from the logical forms.
3.4.2 Generate Aggregate Tables
To generate realistic aggregate tables, we use the SQL queries in the Spider dataset as a source, because they are realistic queries that interest people.
We adopt two algorithms to create them. The first is based on SQL templates: we convert all SQLs in the Spider dataset into templates, fill them with appropriate contents from other databases to form new SQL queries, save each query's result as a new aggregate table, and finally name the table and its columns using rules and a SQL-to-text model. The second method directly uses all SQLs from the Spider dataset and filters out detailed conditions that are unlikely to occur in aggregate tables. Detailed descriptions of the two algorithms are given in Algorithms 1 and 2, and a sketch of the template-based materialization step is shown below.
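As an illustration of the first algorithm, the sketch below shows how a templated SQL could be re-instantiated on a database, executed, and materialized as an aggregate table using SQLite. The helper name, slot syntax and example values are hypothetical and stand in for the rule-based and SQL-to-text naming steps described above.

```python
import sqlite3

def materialize_aggregate_table(db_path, template, slots, table_name):
    """Fill a SQL template, run it, and save the result as a new aggregate table.

    template: e.g. "SELECT {group_col}, COUNT(*) AS n FROM {table} GROUP BY {group_col}"
    slots:    e.g. {"group_col": "country", "table": "singer"}
    """
    query = template.format(**slots)
    conn = sqlite3.connect(db_path)
    try:
        # CREATE TABLE ... AS SELECT materializes the query result as an aggregate table
        conn.execute(f'CREATE TABLE "{table_name}" AS {query}')
        conn.commit()
    finally:
        conn.close()

# Hypothetical usage, echoing the "how many singer are there for country" example:
# materialize_aggregate_table(
#     "example.sqlite",
#     "SELECT {group_col}, COUNT(*) AS n FROM {table} GROUP BY {group_col}",
#     {"group_col": "country", "table": "singer"},
#     "agg_count_singer_per_country",
# )
```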
3.4.3 Generate NL-SQL Pairs
To create SQLs on the new aggregate tables, we follow two methods. Firstly, we extend the SQL used to create an aggregate table, for example by adding conditions or by joining/intersecting/unioning with other tables, and then transform it into SQL over the aggregate table. Secondly, we transform SQLs into templates and fill them with the contents of the aggregate tables.
To create the corresponding text, we use a SQL2NL model to transform the SQLs back into questions. We also parse each generated question back into SQL to ensure cycle consistency, as sketched below.
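A minimal sketch of the cycle-consistency filter, assuming hypothetical sql_to_text and text_to_sql model wrappers and a simple normalized string comparison; an execution-based comparison could be used instead.

```python
def cycle_consistent(sql, sql_to_text, text_to_sql,
                     normalize=lambda s: " ".join(s.lower().split())):
    """Keep an NL-SQL pair only if SQL -> text -> SQL reproduces the original query."""
    question = sql_to_text(sql)        # generated natural-language question
    sql_back = text_to_sql(question)   # parse the question back into SQL
    return question, normalize(sql_back) == normalize(sql)

# pairs = [(q, sql) for sql in generated_sqls
#          for q, ok in [cycle_consistent(sql, sql2nl_model, nl2sql_model)] if ok]
```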
4. EXPERIMENT
We next conduct an experimental evaluation on SpiderwAGG. To facilitate simultaneous training on Spider and SpiderwAGG, our dataset retains the same train and validation splits of databases as Spider.
We build our model on top of the GloVe version of the state-of-the-art RAT-SQL as the baseline, implemented in the PyTorch framework. Experimental settings are consistent with the original paper; readers may refer to [5] for details. The pruning factor used by Dynamic Pruning is 5, and the pruning layer is inserted at the second layer of the stack of attention layers.
4.1 Dataset Analysis
Our dataset SpiderwAGG has 406 aggregate tables and 9,752 NL-SQL pairs. We summarize the NL-SQL statistics of SpiderwAGG and the Spider dataset in Table 2. As shown in the table, the two datasets have comparable distributions on all indicators, which means SpiderwAGG reaches a SQL complexity similar to Spider's, but with significantly less manpower. Table 1 shows some SQL query examples together with the corresponding database names.
DB ID | Question | SQL
---|---|---
consert singer | how many singer are there for country? | select count (*) from count singer there for each country
consert singer | how many singer are there for each age? | select T1.age, count single from singer as T1 join count singer there as T2
employee hire evaluation | how many employee live from each city? | select T1.City, count employee from employee as T1 join count employee we have as T2
Dataset | #Sample | #Q | #SQL | #DB | #Domain | #Table/DB | GROUP BY | AS | ON | JOIN | INTERSECT
---|---|---|---|---|---|---|---|---|---|---|---
Spider | 8659 | 4714 | 4713 | 146 | 138 | 5.1 | 1110 | 2029 | 2044 | 2046 | 145
SpiderwAGG | 8671 | 6248 | 6248 | 145 | 138 | 5.1 | 1056 | 2778 | 1010 | 2778 | 234
Table and column names are critical for correct linking. As shown in Figure 3, the new scenario brings a 30% increase in the average number of candidates to link and a 10% increase in the average encoding length.
4.2 Metrics
We adopt the new metric proposed in [23], which utilizes a distilled test suite of databases to approximate semantic accuracy: at evaluation time, the gold SQL and the predicted SQL are executed on every database in the suite and their results are compared. The metric gives a tight upper bound on semantic accuracy.
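To make the metric concrete, below is a minimal sketch of test-suite comparison over SQLite databases, comparing result rows as multisets; this is a simplification of the official evaluation script of [23], whose behavior (e.g. ordering rules) may differ.

```python
import sqlite3
from collections import Counter

def test_suite_match(gold_sql, pred_sql, db_paths):
    """Predicted SQL counts as correct only if it matches gold on every test-suite DB."""
    for path in db_paths:
        conn = sqlite3.connect(path)
        try:
            gold_rows = Counter(map(tuple, conn.execute(gold_sql).fetchall()))
            try:
                pred_rows = Counter(map(tuple, conn.execute(pred_sql).fetchall()))
            except sqlite3.Error:
                return False              # predicted SQL fails to execute
            if gold_rows != pred_rows:
                return False              # semantics differ on this database
        finally:
            conn.close()
    return True
```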
4.3 SpiderwAGG Results
Table 3 shows the accuracy of our model WAGG and the RAT-SQL baseline at each difficulty level and overall. WAGG exceeds the baseline by a large margin: 5% in exact match accuracy and 3.8% in execution accuracy. As the difficulty of the SQL increases, the performance of our model decreases. Notably, WAGG does not surpass RAT-SQL at every hardness level; a possible explanation is that relation pruning makes more mistakes as hardness increases. A more accurate pruning strategy should be investigated in the future.
Model | EM Accuracy | Execution Accuracy | Easy EM | Medium EM | Hard EM | Extra hard EM
---|---|---|---|---|---|---
RAT-SQL (baseline) | 42.7 | 54.9 | 64.7 | 44.3 | 60.7 | 53.3
WAGG | 47.7 | 58.7 | 71.6 | 46.3 | 57.4 | 46.7
Table 4 shows ablations of the different modules in our model on SpiderwAGG. Dynamic pruning yields a statistically significant improvement of 2.7 points, and the relation selection module further improves by 1.1 points. To evaluate the new model on the original dataset, we also run an experiment on Spider. Table 5 shows accuracy on the Spider and SpiderwAGG datasets. WAGG shows a minor increase on the Spider dataset, which is not surprising given its lower ambiguity.
4.4 Discussions
4.4.1 Dynamic pruning
WAGG applies the dynamic pruning module to reduce inference time. Figure 6 compares the inference time of 200 queries at different hardness levels between RAT-SQL and our model. From the figure, we can see that our model is faster than RAT-SQL at all hardness levels. Since the input sequences are reduced to length c * ln(L), the inference time varies little across hardness levels for our model. We also inspect how the layer index at which dynamic pruning is inserted affects the result (see Table 6), where prune layer index = 0 means inserting dynamic pruning at the very first layer. The best index is 2. A possible explanation is that values at the first layer are not yet contextualized with meaningful information, so pruning too early causes a drop in performance.
4.4.2 Error Analysis
We randomly select 50 mis-predicted SQL samples and find that the errors fall into two kinds: 64% contain an incorrect column or table in the SQL and 36% require extra knowledge. The first error type results from wrong relation selections or excessive dynamic pruning. The main reason is that some questions can be expressed by multiple SQLs, each calling for a different group of relations, which implies a correlation between the relation of two words and the other words. Whether the model can produce such grouped relations is left for future research.
The second error type could potentially be addressed by defining rule-based operator mappings or by adding pre-trained models that contain the required knowledge.
5. CONCLUSION
In this paper, we identify a new scenario for text-to-SQL that has been ignored by the text-to-SQL community and that brings new challenges. To investigate it, we construct a large-scale dataset, SpiderwAGG, for the scenario based on the Spider dataset. The construction methods rely lightly on human effort and largely exploit machine automation. To address the challenges, we propose a new model for the scenario with two innovative modules: relation selection and dynamic pruning. Experiments show that our model mitigates these challenges.
We also point out drawbacks that we plan to address in the future. Since the dataset is built fully automatically, it contains some unreasonable items; for example, some SQL queries aggregate over an id column. We plan to include human checks for meaningfulness.
AUTHOR CONTRIBUTIONS
S.Q. Li ([email protected]) discovered the problem, designed and implemented the overall framework, and conducted the experiments. K.B. Zhou ([email protected]) and Z.Y. Zhuang ([email protected]) implemented and ran the code for constructing the dataset, analyzed it, and helped with the overall experiments. H.F. Wang ([email protected]) gave much solid advice on the whole paper from beginning to end and helped to proofread it. J. Ma (majun [email protected]) gave advice.
ACKNOWLEDGEMENTS
Many thanks to my supervisor Mr. Wang for giving me much advice and many opportunities. Thanks also to Huifang Du, Xing Huang, and Siqi Wang in our laboratory for the open discussions.