DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset

                      Abstract

Due to the lack of labeled data, previous research on text-to-SQL parsing mainly focuses on English. Representative English datasets include ATIS, WikiSQL, Spider, etc. This paper presents DuSQL, a large-scale and pragmatic Chinese dataset for the cross-domain text-to-SQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Our new dataset has three major characteristics. First, by manually analyzing questions from several representative applications, we try to figure out the true distribution of SQL queries in real-life needs. Second, DuSQL contains a considerable proportion of SQL queries involving row or column calculations, motivated by our analysis of the SQL query distributions. Finally, we adopt an effective data construction framework via human-computer collaboration. The basic idea is to automatically generate SQL queries based on the SQL grammar and constrained by the given database. This paper describes in detail the construction process and data statistics of DuSQL. Moreover, we present and compare the performance of several open-source text-to-SQL parsers with minor modifications to accommodate Chinese, including a simple yet effective extension to IRNet for handling calculation SQL queries.

1   Introduction

In the past few decades, a large amount of research has focused on finding answers in unstructured texts given natural questions, which is also known as the question answering (QA) task (Burke et al., 1997; Kwok et al., 2001; Allam and Haggag, 2012; Nguyen et al., 2016). However, a lot of high-quality knowledge and data are actually stored in databases in the real world. It is thus extremely useful to allow ordinary users to directly interact with databases via natural questions. To meet this need, researchers have proposed the text-to-SQL task with released English datasets for model training and evaluation, such as ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018b).

Figure 1: Illustration of the text-to-SQL task.

Formally, given a natural language (NL) question and a relational database, the text-to-SQL task aims to produce a legal and executable SQL query that leads directly to the correct answer, as depicted in Figure 1. A database is composed of multiple tables and denoted as DB = {T1, T2, ..., Tn}. A table is composed of multiple columns and denoted as Ti = {col1, col2, ..., colm}. Tables are usually linked with each other by foreign keys.

The earliest datasets include ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), Restaurants (Tang and Mooney, 2001), Academic (Li and Jagadish, 2014), etc. Each dataset has only a single database containing a certain number of tables. All question/SQL pairs of the train/dev/test sets are generated against the same database. Many interesting approaches have been proposed to handle these datasets (Iyer et al., 2017; Yaghmazadeh et al., 2017; Finegan-Dollak et al., 2018).

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6923–6935, November 16–20, 2020. © 2020 Association for Computational Linguistics

However, real-world applications usually involve more than one database, and require the model to be able to generalize to and handle unseen databases during evaluation. To accommodate this need, the WikiSQL dataset was released by Zhong et al. (2017). It consists of 80,654 question/SQL pairs for 24,241 single-table databases. They propose a new data split setting to ensure that the databases in train/dev/test do not overlap. However, they focus on very simple SQL queries containing one SELECT statement with one WHERE clause. In addition, Sun et al. (2020) released TableQA, a Chinese dataset similar to the WikiSQL dataset.

Yu et al. (2018b) released the more challenging Spider dataset, consisting of 10,181 question/SQL pairs against 200 multi-table databases. Compared with WikiSQL and TableQA, Spider is much more complex for two reasons: 1) the need to select relevant tables; 2) many nested queries and advanced SQL clauses like GROUP BY and ORDER BY.

As far as we know, most existing datasets are constructed for English. Another issue is that they do not refer to the question distribution in real-world applications during data construction. Take Spider as an example. Given a database, annotators are asked to write many SQL queries from scratch. The only requirement is that the SQL queries have to cover a list of SQL clauses and nested queries. Meanwhile, the annotators write NL questions corresponding to the SQL queries. In particular, all these datasets contain very few questions involving calculations between rows or columns, which we find are very common in real applications.

This paper presents DuSQL, a large-scale and pragmatic Chinese text-to-SQL dataset, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Specifically, our contributions are summarized as follows.

• In order to determine a more realistic distribution of SQL queries, we collect user questions from three representative database-oriented applications and perform manual analysis. In particular, we find that a considerable proportion of questions require row/column calculations, which are not included in existing datasets.

• We adopt an effective data construction framework via human-computer collaboration. The basic idea is automatically generating SQL queries based on the SQL grammar and constrained by the given database. For each SQL query, we first generate a pseudo question by traversing it in the execution order and then ask annotators to paraphrase it into a NL question.

• We conduct experiments on DuSQL using three open-source parsing models. In particular, we extend the state-of-the-art IRNet (Guo et al., 2019) model to accommodate the characteristics of DuSQL. Results and analysis show that DuSQL is a very challenging dataset. We will release our data at https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing.

2   SQL Query Distribution

As far as we know, existing text-to-SQL datasets mainly consider the complexity of SQL syntax when creating SQL queries. For example, WikiSQL has only simple SQL queries containing SELECT and WHERE clauses. Spider covers 15 SQL clauses including SELECT, WHERE, ORDER BY, GROUP BY, etc., and allows nested queries.

Figure 2: The SQL query distributions of the three applications. Please kindly note that a query may belong to multiple types.

However, to build a pragmatic text-to-SQL system that allows ordinary users to directly interact with databases via NL questions, it is very important to know the SQL query distribution in real-world applications, from the aspect of user need rather than SQL syntax. Our analysis shows that Spider mainly covers three types of SQL queries, i.e., matching, sorting, and clustering, whereas WikiSQL only has matching queries. Neither of them contains the calculation type, which we find composes a large portion of questions in certain real-world applications.

To find out the SQL query distribution in real-life applications, we consider the following three
representative types of database-oriented applications, and conduct manual analysis against user questions. We ask annotators to divide user questions into five categories (see Appendix B for details), i.e., matching, sorting, clustering, calculation, and others.

Information retrieval applications. We use Baidu, the Chinese search engine, as a typical in-

Row Calculation
How much bigger is Guangzhou than Shenzhen?
SELECT a.area(km2) - b.area(km2) FROM
(SELECT area(km2) FROM T1 WHERE name = 'Guangzhou') a,
(SELECT area(km2) FROM T1 WHERE name = 'Shenzhen') b

Column Calculation
What is the population density of Hefei?
SELECT population / area(km2) FROM T1 WHERE name = 'Hefei'
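Both calculation queries above can be executed end-to-end. The sketch below uses Python's sqlite3 on an invented toy table: the column is renamed area_km2 (since "area(km2)" is not a legal bare SQLite identifier), and the populations and areas are made-up values, not DuSQL data.

```python
import sqlite3

# Toy single-table database standing in for T1 in the examples above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (name TEXT, population INTEGER, area_km2 REAL)")
conn.executemany("INSERT INTO T1 VALUES (?, ?, ?)", [
    ("Guangzhou", 15300000, 7434.0),
    ("Shenzhen", 13440000, 1997.5),
    ("Hefei", 8190000, 11445.1),
])

# Row calculation: "How much bigger is Guangzhou than Shenzhen?"
row_sql = """
SELECT a.area_km2 - b.area_km2
FROM (SELECT area_km2 FROM T1 WHERE name = 'Guangzhou') a,
     (SELECT area_km2 FROM T1 WHERE name = 'Shenzhen') b
"""
diff = conn.execute(row_sql).fetchone()[0]  # 7434.0 - 1997.5 = 5436.5

# Column calculation: "What is the population density of Hefei?"
col_sql = "SELECT population / area_km2 FROM T1 WHERE name = 'Hefei'"
density = conn.execute(col_sql).fetchone()[0]
print(diff, density)
```

Note how both answers require arithmetic over cell values rather than simple value matching, which is exactly what distinguishes the calculation type from matching, sorting, and clustering.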

as illustrated in Figure 4. The key idea is to automatically generate SQL queries paired with pseudo questions given a database. Then the pseudo questions are paraphrased into NL questions by humans. Finally, to guarantee data quality, low-confidence SQL queries and NL questions are detected according to overlap and similarity metrics, and are further checked by humans.

3.1   Database Creation

Most mature databases used in industry are not publicly available, so we collect our databases mainly from the Internet. However, databases available on the Internet are in the form of independent tables, which need to be linked with other tables. We create databases in three steps: table acquisition, table merging, and foreign key creation.

We collect websites to crawl tables, ensuring that they cover multiple domains. As the largest Chinese encyclopedia, Baidu Baike contains more than 17 million entries across more than 200 domains. We start with all the entries in Baike as the initial sites, and extend the collection based on the reference sites in each entry page. We keep the sites where tables are crawled. The final collection contains entries of Baike, annual report websites[2], vertical domain websites[3], and other websites such as community forums[4]. Table 1 shows the data distribution regarding database sources.

Source                      Proportion
Baike                       40.3
Vertical domain websites    31.3
Annual report               23.4
Others                       5.0

Table 1: The distribution of database sources.

To make a domain correspond to a database, we merge tables with the same schema into a new table with a new schema, e.g., tables about China cities with the schema {population, area, ...} are merged into a new table with the schema {termid, name, population, area, ...}, where termid is randomly generated as the primary key and name is the name of the city. Meanwhile, we add a type for each column according to the form of its values, where the column type is one of text, number, and date.

We create foreign keys between two tables via entity linking, e.g., a table named "Livable cities in 2019" with the schema {city_name, ranker, ...} joins to a table named "China cities" with the schema {term_id, name, area, ...} through the links of entities in "city_name" and "name". According to the foreign keys, all tables are split into separate graphs, each of which consists of several joined tables. We choose 200 graphs to create databases, and manually check and correct the foreign keys for each database.

Overall, we create 200 databases with 813 tables, covering about 70% of Baike entries from more than 160 domains such as movies, actors, cities, animals, foods, etc. Since some tables are sensitive, we use the column header of each table, and populate it with randomly selected values from the original table.

3.2   Automatic Generation of SQL Queries

Given a database, we want to generate as many common SQL queries as possible. Both manually writing SQL queries and quality-checking take a significant amount of time. Obviously, SQL queries can be automatically generated from the grammar. We utilize production rules from the grammar to automatically generate SQL queries, instead of asking annotators to write them. According to the difficulty[5] and semantic correctness of a SQL query, we prune the rule paths during generation. Then, we sample the generated SQL queries according to the distribution in Figure 2 and carry out the follow-up work based on them.

As illustrated in Figure 5, a SQL query can be represented as a tree using the rule sequence {SQLs = SQL, SQL = Select Where, Select = SELECT A, Where = WHERE Conditions, ...}, all of which are production rules of the grammar. Guided by the SQL query distributions in real applications, we design production rules to ensure that all common SQL queries can be generated, e.g., the rule {C = table.column mathop table.column} allows calculations between columns or rows. By exercising every rule of the grammar, we can generate SQL queries covering patterns of different complexity.

We consider two aspects in the automatic SQL generation: the difficulty and semantic correct-

[2] QuestMobile, 199it, tianyancha, etc.
[3] State Statistical Bureau, China Industrial Information Network, shopping websites, booking websites, etc.
[4] Baidu Tieba, Newsmth, Hupu, etc.
[5] We observe that very complex queries are rare in search logs. Since our SQL queries are automatically generated, without complexity control the proportion of complex queries would dominate the space, thus deviating from the real query distribution.
Figure 5: An example of SQL query generation from the grammar. We show a part of production rules (all rules
are shown in Appendix A). The leaf nodes in red are from the database.
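The rule-tree expansion illustrated in Figure 5 can be sketched as a recursive sampler over production rules. The grammar below is a small subset of the full grammar in Appendix A, and the difficulty/precondition pruning of Section 3.2 is reduced to a fixed, non-recursive rule set; the toy lexicon entries are invented.

```python
import random

# A small subset of the Figure 9 grammar: nonterminals map to lists of
# alternative symbol sequences. Terminal choices (columns, values) come
# from a toy database schema.
GRAMMAR = {
    "SQL": [["Select"], ["Select", "Where"]],
    "Select": [["select", "A"]],
    "Where": [["where", "Condition"]],
    "Condition": [["A", "op", "value"]],
    "A": [["C"], ["agg", "C"]],
    "C": [["column"], ["column", "mathop", "column"]],
}
LEXICON = {
    "agg": ["min", "max", "count", "sum", "avg"],
    "op": ["=", "!=", ">", "<"],
    "mathop": ["+", "-", "*", "/"],
    "column": ["T1.population", "T1.area"],  # taken from the given database
    "value": ["10000000", "5000"],
}

def generate(symbol="SQL", rng=random.Random(0)):
    """Expand a nonterminal by recursively sampling production rules."""
    if symbol in GRAMMAR:
        rule = rng.choice(GRAMMAR[symbol])
        return " ".join(generate(s, rng) for s in rule)
    if symbol in LEXICON:
        return rng.choice(LEXICON[symbol])
    return symbol  # keyword terminal such as "select" or "where"

for _ in range(3):
    print(generate())
```

The full system additionally prunes rule paths by difficulty, enforces the semantic preconditions of Appendix C, and samples the results to match the real-application distribution in Figure 2.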
for each component, e.g., list for SELECT, average for the aggregator avg. Appendix D shows the descriptions for all components. To ensure that the pseudo question is clear and reflects the meaning of the SQL query, intermediate variables are introduced to express sub-SQL queries, e.g., "v1" in the example of Figure 6 represents the result of the nested query and is used as a value in other expressions.

We ask two annotators[8] to reformulate the pseudo questions into NL questions[9], and filter out two kinds of questions: 1) incomprehensible ones, which are semantically unclear; 2) unnatural ones, which are not the focus of humans[10]. During the paraphrasing process, 6.7% of question/SQL pairs are filtered out, among which 76.5% are complex queries. Then we ask other annotators to check the correctness of the reformulated questions, and find that 8% of the questions are inaccurate.

3.4   Review and Checking

To guarantee data quality, we automatically detect low-quality question/SQL pairs according to the following evaluation metrics.

• Overlap. To ensure the naturalness of our questions, we calculate the overlap between the pseudo question and the corresponding NL question. A question with an overlap higher than 0.6 is considered to be of low quality.

• Similarity. To ensure that the question contains enough information for the SQL query, we train a similarity model based on question/SQL pairs. A question with a similarity score less than 0.8 is considered to be of low quality.

In the first round, about 18% of question/SQL pairs are of low quality. We ask annotators to check these pairs and correct the erroneous ones. This process iterates through human-computer collaboration until the above metrics no longer change. It iterates twice in the construction of DuSQL.

3.5   Dataset Statistics

We summarize the statistics of DuSQL and other cross-domain datasets in Table 2, and give some examples in Figure 7. DuSQL contains enough question/SQL pairs for all common types. WikiSQL and TableQA are simple datasets, only containing matching questions. Spider and CSpider (Min et al., 2019) mainly cover matching, sorting, clustering, and their combinations. They contain very few questions of the calculation type, and all of them only need column calculations. Spider does not focus on questions that require common knowledge and math operations. According to our analysis in Figure 2, the calculation type is very common, accounting for 8% to 65% in different applications. DuSQL, a pragmatic industry-oriented dataset, conforms to the distribution of SQL queries in real applications. Meanwhile, DuSQL is larger, twice the size of other complex datasets. DuSQL contains 200 databases, covering about 70% of entries in Baike and more than 160 domains, e.g., cities, singers, movies, animals, etc. We provide content for each database. All the values of a SQL query can be found in the database, except for numeric values. All table and column names in the database are clear and self-contained. In addition, we provide an English schema for each database, including table names and column headers.

Figure 7: SQL query examples.

4   Benchmark Approaches

All existing text-to-SQL works focus on English datasets. Considering that DuSQL is the most similar to Spider, we choose the following three representative publicly available parsers as our benchmark approaches, to understand the performance of existing approaches on our new Chinese dataset.

[8] They are full-time employees familiar with the SQL language. Meanwhile, they have a lot of experience in annotating QA data.
[9] Some values in SQL queries are rewritten as synonyms.
[10] E.g., "When province is Sichuan, list the total rank of these cities." for the SQL query {SELECT sum(rank) FROM T2 WHERE province = 'Sichuan'} is considered an unnatural question, as the total rank would not be asked by humans.
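The overlap filter of Section 3.4 can be sketched as below. The paper does not specify the exact overlap measure, so character-level Jaccard similarity is our assumed stand-in (reasonable for Chinese text, where word boundaries are implicit); the 0.6 threshold is from the text.

```python
def char_overlap(pseudo_question, nl_question):
    """Character-level Jaccard overlap between a pseudo question and its
    human paraphrase. One plausible instantiation of the unspecified metric."""
    a, b = set(pseudo_question), set(nl_question)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_low_quality(pseudo_question, nl_question, threshold=0.6):
    # A paraphrase that copies too much of the mechanical pseudo question
    # is flagged as unnatural and sent back for manual checking (Sec. 3.4).
    return char_overlap(pseudo_question, nl_question) > threshold

print(is_low_quality("list name when population > 10000000",
                     "list name when population > 10000000"))  # True: verbatim copy
print(is_low_quality("list name when population > 10000000",
                     "which cities have over ten million residents?"))
```

The similarity filter works in the opposite direction: it flags paraphrases that drift too far from the SQL query's meaning, using a trained question/SQL similarity model rather than a surface metric.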
Dataset  | Size   | DB     | Table/DB | Matching | Sorting | Clustering | Calc-Column | Calc-Row | Calc-Constant | Others
WikiSQL  | 80,654 | 26,251 | 1        | 80,654   | 0       | 0          | 0           | 0        | 0             | 0
TableQA  | 49,974 | 5,291  | 1        | 49,974   | 0       | 0          | 0           | 0        | 0             | 0
Spider   | 9,693  | 200    | 5.1      | 6,450    | 863     | 1,059      | 13          | 0        | 0             | 1,308
CSpider  | 9,691  | 166    | 5.3      | 6,448    | 862     | 1,048      | 13          | 0        | 0             | 1,318
Ours     | 23,797 | 200    | 4.1      | 6,440    | 2,276   | 3,768      | 1,760       | 1,385    | 1,097         | 7,071

Table 2: Statistics and comparisons of all existing cross-domain text-to-SQL datasets. Calc-Column/Row/Constant are the three calculation subtypes. The statistics of Spider are based on the published data, containing only the train and development sets. Others consists of combinations of the matching, sorting, and clustering types.
Methods  | Calc-Column | Calc-Row | Calc-Constant | Calc-All | Matching | Sorting | Clustering | Others
IRNet    | 0           | 0        | 0             | 0        | 25.0     | 32.8    | 34.2       | 8.7
IRNetExt | 22.0        | 34.3     | 37.9          | 29.7     | 52.1     | 68.7    | 60.8       | 52.5
tions. Overnight focuses on logic form (LF) based semantic representation, while our work focuses on SQL representation. The differences are two-fold. First, the databases of Overnight are much simpler, composed of a set of entity-property-entity triples. Second, the LF operations of Overnight are much simpler, consisting of only matching and aggregation operations, such as count, min, max. Our dataset is more complex and thus imposes more challenges on data construction.

Acknowledgments

We thank the three anonymous reviewers for their helpful feedback and discussion on this work. Zhenghua Li and Min Zhang were supported by the National Natural Science Foundation of China (Grant No. 61525205, 61876116), and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.
Tong Guo and Huilin Gao. 2019. Content enhanced BERT-based text-to-SQL generation. arXiv preprint arXiv:1910.07179.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: Reinforce schema representation with context. arXiv preprint arXiv:1908.08113.

Jonathan Herzig and Jonathan Berant. 2019. Don't paraphrase, detect! Rapid and effective data collection for semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3801–3811.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973.

Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. In Proceedings of the 10th International Conference on World Wide Web, pages 150–161.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1):73–84.

Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609.

Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019. A pilot study for Chinese SQL semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3643–3649.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human-generated machine reading comprehension dataset.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157.

Ningyuan Sun, Xuefeng Yang, and Yunfeng Liu. 2020. TableQA: A large-scale Chinese text-to-SQL dataset for table-aware SQL generation. arXiv preprint arXiv:2006.06434.

Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou. 2018. Semantic parsing with syntax- and table-aware SQL generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 361–372.

Lappoon R. Tang and Raymond J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In European Conference on Machine Learning, pages 466–477. Springer.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: Query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–26.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
SQLs        ::= SQL intersect SQLs | SQL union SQLs | SQL except SQLs | SQL
SQL         ::= Select | Select Where | Select GroupC | Select Where GroupC
                | Select OrderC | Select Where OrderC | Select from SQL, SQL
Select      ::= select A | select A A | select A A A | select A A A A
Where       ::= where Conditions
GroupC      ::= group by C | group by C having Conditions | group by C OrderC
OrderC      ::= order by C Dir | order by C Dir limit value | order by A Dir limit value
Dir         ::= asc | desc
Conditions  ::= Condition | Condition and Conditions | Condition or Conditions
Condition   ::= A op value | A op SQL
A           ::= min C | max C | count C | sum C | avg C | C
C           ::= table.column | table.column1 mathop table.column2
                | table1.column mathop table2.column
mathop      ::= + | - | * | /
op          ::= = | != | > | >= | < | <= | like | in | not in

Figure 9: The production rules for SQL generation.

Matching
List cities with a population less than 10 million.
SELECT name FROM T1 WHERE population < 10000000

Sorting
Give the top 5 cities with the largest population.
SELECT name FROM T1 ORDER BY population DESC LIMIT 5

Clustering
Give the total population of each province.
SELECT province, sum(population) FROM T1 GROUP BY province

Figure 10: Examples of types in Figure 2. All of them are based on the database in Figure 1.

A   The Grammar for SQL Generation

Figure 9 shows the production rules used for SQL generation. All kinds of SQL queries can be generated by exercising each rule, e.g., the rule {Condition ::= A op SQL} for nested query generation, and the rules {C ::= table.column1 mathop table.column2} and {C ::= table1.column mathop table2.column} for calculation query generation.

do not have corresponding SQL queries.

C   Preconditions in SQL Generation

To ensure the semantic correctness of the generated SQL query, we define preconditions for each production rule, and abide by these preconditions during SQL query generation.

• For the generation of a SQL query with multiple SQLs, e.g., {SQLs ::= SQL union SQLs}: the columns in the select clause of the previous SQL must match the columns in the select clause of the subsequent SQL, i.e., the columns of the two select clauses are the same or connected by foreign keys.

• For the rule generating GroupC: the C is generated from the rule {C ::= table.column}, where the column can perform the clustering operation, that is to say, the table can be divided into several sub-tables according to the values of this column.
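The first precondition can be phrased as a simple compatibility check. The dict-based SQL representation and the foreign-key map below are our own simplifications, not DuSQL's internal format.

```python
# Sketch of the union/intersect/except precondition (Appendix C): two SQLs
# may be combined only if their select columns are the same or linked by
# foreign keys. Table and column names are illustrative.
FOREIGN_KEYS = {("Livable.city_name", "Cities.name")}

def columns_compatible(col_a, col_b):
    return (col_a == col_b
            or (col_a, col_b) in FOREIGN_KEYS
            or (col_b, col_a) in FOREIGN_KEYS)

def union_allowed(sql_a, sql_b):
    """Check the precondition for {SQLs ::= SQL union SQLs}: same arity,
    and pairwise-compatible select columns."""
    sel_a, sel_b = sql_a["select"], sql_b["select"]
    if len(sel_a) != len(sel_b):
        return False
    return all(columns_compatible(a, b) for a, b in zip(sel_a, sel_b))

q1 = {"select": ["Cities.name"]}
q2 = {"select": ["Livable.city_name"]}
q3 = {"select": ["Cities.population"]}
print(union_allowed(q1, q2))  # True: linked by a foreign key
print(union_allowed(q1, q3))  # False: incompatible columns
```

During generation, a candidate expansion that violates such a precondition is simply pruned, so semantically ill-formed queries never reach the annotators.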
Dataset  | Size   | DB     | Table/DB | Order | Group | Having | Nest  | Calc-Column | Calc-Row | Calc-Constant
WikiSQL  | 80,654 | 26,251 | 1        | 0     | 0     | 0      | 0     | 0           | 0        | 0
TableQA  | 49,974 | 5,291  | 1        | 0     | 0     | 0      | 0     | 0           | 0        | 0
Spider   | 10,181 | 200    | 5.1      | 1,335 | 1,491 | 388    | 844   | 13          | 0        | 0
CSpider  | 9,691  | 166    | 5.3      | 1,052 | 1,123 | 505    | 913   | 13          | 0        | 0
Ours     | 23,797 | 200    | 4.1      | 4,959 | 3,029 | 3,432  | 2,208 | 1,760       | 1,385    | 1,097

Table 5: Comparisons of cross-domain text-to-SQL datasets. The statistics of Spider are from Yu et al. (2018b). The statistics of CSpider are based on the released data.
Components                    | Pseudo Descriptions
SQL intersect SQLs            | SQL, as set1, SQLs, as set2, belong to set1 and set2
SQL union SQLs                | SQL, as set1, SQLs, as set2, belong to set1 or set2
SQL except SQLs               | SQL, as set1, SQLs, as set2, belong to set1 but not belong to set2
select A ... A                | list A, ... and A
where Conditions              | when Conditions
group by C                    | for each C
group by C having Conditions  | the C that Conditions
group by C OrderC             | the C with OrderC
order by C Dir                | sorted by C Dir
order by C Dir limit value    | the top value sorted by C Dir
order by A Dir limit value    | the top value sorted by A Dir
A op value                    | A op value
A op SQL                      | SQL as v1, A op v1
agg C                         | agg C
count *                       | the number of table
T1.C + T2.C                   | the sum of T1.C and T2.C
T1.C - T2.C                   | the difference between T1.C and T2.C
T1.C * T2.C                   | the product of T1.C and T2.C
T1.C / T2.C                   | T1.C is times of T2.C
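Applying these descriptions amounts to traversing a generated query in execution order and emitting one phrase per component. This sketch hard-codes a single Select-Where pattern with invented helper names; the full system covers every component in the table above, including the v1 intermediate variables for nested queries.

```python
# Pseudo-question rendering for one simple pattern, using the descriptions
# above ("list ..." for select, "when ..." for where). The dict-based query
# structure is illustrative; the real system walks the generated rule tree.
DESCR = {"avg": "average", "sum": "total", "count": "the number of",
         "max": "the largest", "min": "the smallest"}

def pseudo_question(query):
    """Render {select, where} in execution order: verbalize the where
    clause first, then the select list."""
    parts = []
    if "where" in query:
        col, op, val = query["where"]
        parts.append(f"when {col} {op} {val}")
    sel = ", ".join(f"{DESCR.get(agg, agg)} {col}" if agg else col
                    for agg, col in query["select"])
    parts.append(f"list {sel}")
    return ", ".join(parts)

q = {"select": [(None, "name"), ("avg", "population")],
     "where": ("province", "=", "Sichuan")}
print(pseudo_question(q))  # when province = Sichuan, list name, average population
```

The resulting mechanical sentence is exactly the kind of pseudo question that Section 3.3's annotators then paraphrase into a natural one, with the overlap filter of Section 3.4 catching paraphrases that stay too close to this template.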