SQL Generation from Natural Language: A Seq2Seq Model – Transformers Architecture

Novelis technical experts have once again achieved a new state-of-the-art result. Discover our study, "SQL Generation from Natural Language: A Sequence-to-Sequence Model Powered by the Transformers Architecture and Association Rules", published in the Journal of Computer Science.

Thanks to the Novelis Research Team for their knowledge and expertise.


Using natural language (NL) to interact with relational databases allows users of any background to easily query and analyze large amounts of data. This requires a system that understands user questions and automatically translates them into a structured query language such as SQL. The best-performing Text-to-SQL systems use supervised learning (usually formulated as a classification problem) and either treat the task as a sketch-based slot-filling problem or first convert the question into an intermediate logical form (ILF) and then convert that form into the corresponding SQL query. However, unsupervised modeling that translates questions directly into SQL queries has proven to be more difficult. To address this, we propose a method that converts NL questions directly into SQL statements.

In this research, we propose a sequence-to-sequence (Seq2Seq) parsing model for the NL-to-SQL task, powered by the Transformers architecture and exploring two language models (LMs): the Text-to-Text Transfer Transformer (T5) and the multilingual pre-trained Text-to-Text Transformer (mT5). In addition, we use a transformation-based learning algorithm to update the aggregation predictions based on association rules. The resulting model achieves a new state of the art on the WikiSQL dataset for weakly supervised SQL generation.
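To make the two ideas above more concrete, here is a minimal Python sketch. It is not the implementation from the study. It only illustrates (1) how a question and a table header might be flattened into a single source string for a text-to-text model, and (2) how a rule could revise the aggregation operator of a predicted query. The task prefix "translate to SQL:", the helper names, and the three cue-word rules are all hypothetical choices made for illustration. The rules learned in the paper are not reproduced here.

```python
import re

# Hypothetical rules (NOT taken from the paper): question cue -> aggregation operator.
AGG_RULES = [
    (re.compile(r"\bhow many\b", re.I), "COUNT"),
    (re.compile(r"\baverage\b", re.I), "AVG"),
    (re.compile(r"\btotal\b", re.I), "SUM"),
]

# Matches "SELECT AGG(col) FROM" or "SELECT col FROM" at the start of a query.
SELECT_PATTERN = re.compile(
    r"^SELECT\s+(?:(MAX|MIN|COUNT|SUM|AVG)\s*\(\s*(.+?)\s*\)|(.+?))\s+FROM\b",
    re.I,
)


def build_source(question, columns):
    """Flatten the question and the table header into one Seq2Seq input string.

    The prefix and separator are assumptions, not the paper's exact format.
    """
    return f"translate to SQL: {question} | columns: {', '.join(columns)}"


def apply_agg_rules(question, sql):
    """Rewrite the aggregation operator of a predicted query when a rule fires."""
    match = SELECT_PATTERN.match(sql)
    if match is None:
        return sql  # Unrecognized shape: leave the prediction untouched.
    column = match.group(2) or match.group(3)
    for cue, agg in AGG_RULES:
        if cue.search(question):
            return f"SELECT {agg}({column}) FROM" + sql[match.end():]
    return sql  # No rule fired: keep the model's prediction.
```

In this sketch, a model would be trained on pairs of `build_source(...)` strings and target SQL strings, and `apply_agg_rules` would run as a post-processing step on each predicted query.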

About the study

“In this study, we treat the Text-to-SQL task with WikiSQL (Zhong et al., 2017). This DataSet is the first large-scale dataset for Text-to-SQL, with about 80 K human-annotated pairs of Natural Language question and SQL query. WikiSQL is very challenging because tables and questions are very diverse. This DataSet contains about 24K different tables.

There are two leaderboards for the WikiSQL challenge: Weakly supervised (without using logical form during training) and supervised (with logical form during training). On the supervised challenge, there are two results: Those with Execution Guided (EG) inference and those without EG inference.”
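For readers unfamiliar with the "logical form" mentioned in the quote, the following Python sketch shows how such an annotation can be rendered as a SQL string. It assumes the layout used in the public WikiSQL release: a selected column index, an aggregation index, and a list of (column index, operator index, value) conditions. The function name and the placeholder table name "table" are hypothetical. A weakly supervised system, as described in the quote, is trained without access to this structure.

```python
# Operator lists, assuming the conventions of the public WikiSQL release.
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


def logical_form_to_sql(form, columns, table="table"):
    """Render a WikiSQL-style logical form as a SQL string (illustrative only)."""
    column = columns[form["sel"]]
    agg = AGG_OPS[form["agg"]]
    select = f"{agg}({column})" if agg else column
    # Each condition is [column index, operator index, value].
    conds = [
        f"{columns[col]} {COND_OPS[op]} {value!r}"
        for col, op, value in form["conds"]
    ]
    where = f" WHERE {' AND '.join(conds)}" if conds else ""
    return f"SELECT {select} FROM {table}{where}"


example = {"sel": 0, "agg": 3, "conds": [[1, 0, "France"]]}
print(logical_form_to_sql(example, ["player", "country"]))
```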

Journal of Computer Science – Volume 17 No. 5, 2021, 480-489 (10 pages)

Journal of Computer Science aims to publish research articles on the theoretical basis of information and computing, and practical technologies for implementation and application in computer systems.