Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications

Discover the first version of our scientific publication “Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications”, published on arXiv and submitted to the Engineering Applications of Artificial Intelligence journal. The article is already publicly available.

Thanks to the Novelis research team for their know-how and expertise.

Abstract

In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA). However, the accessibility and practicality of utilizing these models for industrial applications pose significant challenges, particularly concerning cost-effectiveness, inference speed, and resource efficiency. This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering. Our objective is to identify open-source alternatives capable of delivering comparable performance to proprietary models while being lightweight in terms of resource requirements and suitable for Central Processing Unit (CPU)-based inference. Through rigorous evaluation across various metrics including accuracy, inference speed, and resource consumption, we aim to provide insights into selecting efficient LLMs for real-world applications. Our findings shed light on viable open-source alternatives that offer acceptable performance and efficiency, addressing the pressing need for accessible and efficient NLP solutions in industry settings.


A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text

Discover our scientific publication “A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text”, published by Elsevier and available on ScienceDirect.

Thanks to the Novelis research team – notably Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, El Hassane Ettifouri, Walid Dahhane – for their know-how and expertise.

Abstract

Java code generation consists of automatically generating Java code from natural language text. This NLP task helps increase programmers’ productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is a challenging task because of the strict syntactic rules and the need for a deep understanding of the semantics of the programming language. Many works have tried to tackle this task using either RNN-based or Transformer-based models. The latter have achieved remarkable advances in the domain and can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder–decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models for the Java code generation task. We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community. In addition, we provide a detailed description of the datasets and evaluation metrics used in the literature. Finally, we discuss the results of different models on the CONCODE dataset and then propose some future directions.

Elsevier is a data analytics company that helps institutions, health and science professionals improve their performance for the benefit of humanity.

ScienceDirect is the world’s leading source for scientific, technical and medical research.

Top 10 great language models that have transformed NLP in the last 5 years

GPT-4, released by OpenAI in 2023, is built on one of the largest neural networks ever created, far larger than the language models that came before it. It is also the latest large multimodal model, capable of processing images and text as input and producing text as output. Not only does GPT-4 outperform existing models by a considerable margin in English, it also demonstrates strong performance in other languages. GPT-4 is an even more powerful and sophisticated model than GPT-3.5, showing unparalleled performance in many NLP (natural language processing) tasks, including translation and Q&A.

In this article, we present ten Large Language Models (LLMs) that have had a significant impact on the evolution of NLP in recent years. These models have been specifically designed to tackle various tasks in Natural Language Processing (NLP), such as question answering, automatic summarization, text-to-code generation, and more. For each model, we provide an overview of its strengths and weaknesses compared to other models in its category.

An LLM (Large Language Model) is trained on a large corpus of text data and is designed to generate text the way humans do. The emergence of LLMs such as GPT-1 (Radford et al., 2018) and BERT (Devlin et al., 2018) was a breakthrough for artificial intelligence.

The first LLM, GPT-1 (Generative Pre-trained Transformer), was developed by OpenAI in 2018 (Radford et al., 2018). It is based on the Transformer (Vaswani et al., 2017) neural network, with 12 layers and 768 hidden units per layer. The model was trained to predict the next token in a sequence, given the context of the previous tokens (a short code sketch of this objective follows the list below). GPT-1 is capable of performing a wide range of language tasks, including answering questions, translating text, and generating creative writing. Since it was the first LLM, GPT-1 has some limitations, for example:

  1. Bias: GPT-1 is trained on a large corpus of text data, which can introduce biases into the model;
  2. Lack of common sense: being trained on text alone, it has difficulty linking its knowledge to any form of understanding of the world;
  3. Limited interpretability: since it has millions of parameters, it is difficult to interpret how it makes decisions and why it generates certain outputs.
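To make the next-token objective mentioned above concrete, here is a minimal, illustrative sketch using the Hugging Face transformers library. GPT-2 is used as a stand-in checkpoint, and the prompt and generation settings are example choices, not a reproduction of the original GPT-1 setup.

```python
# Minimal sketch of the next-token (causal language modeling) objective.
# "gpt2" is a stand-in checkpoint; prompt and settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: each step predicts the next token given the previous ones.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Each decoding step scores every vocabulary token given the tokens generated so far, which is exactly the pre-training objective described above.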

In the same year as GPT-1, Google AI introduced BERT (Bidirectional Encoder Representations from Transformers). Unlike GPT-1, BERT (Devlin et al., 2018) focused on pre-training the model on a masked language modeling task, where the model was trained to predict missing words in a sentence given the context. This approach allowed BERT to learn rich contextual representations of words, which led to improved performance on a range of NLP tasks, such as sentiment analysis and named entity recognition. BERT shares some limitations with GPT-1, for example the lack of common-sense knowledge about the world and limited interpretability: it is hard to know how the model makes decisions and why it generates certain outputs. Moreover, BERT only uses a limited context to make predictions, which can result in unexpected or nonsensical outputs when the model is presented with new or unconventional information.
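For contrast with the next-token sketch above, the following minimal example illustrates the masked language modeling objective using the transformers fill-mask pipeline; the checkpoint and example sentence are illustrative choices.

```python
# Illustrative sketch of masked language modeling with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from its bidirectional context.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```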

In early 2019 came the third LLM on our list, introduced by OpenAI and known as GPT-2 (Generative Pre-trained Transformer 2). GPT-2 (Radford et al., 2019) was designed to generate coherent and human-like text by predicting the next word in a sentence based on the preceding words. Its architecture is based on a Transformer neural network, similar to its predecessor GPT-1, which uses self-attention to process input sequences. However, GPT-2 is a significantly larger model than GPT-1, with 1.5 billion parameters compared to GPT-1’s 117 million. This increased size enables GPT-2 to generate higher-quality text and perform well on a wide range of natural language processing tasks. Additionally, GPT-2 can perform a wider range of tasks than GPT-1, such as summarization, translation, and text completion. However, one limitation of GPT-2 is its computational requirements, which can make it difficult to train and deploy on certain hardware. Additionally, some researchers have raised concerns about the potential misuse of GPT-2 for generating fake news or misleading information, leading OpenAI to initially limit its release.

GPT-2 was followed by other models such as XLNet and RoBERTa. XLNet (Generalized Autoregressive Pretraining for Language Understanding) was introduced by Google AI. XLNet (Yang et al., 2019) is a variant of the Transformer-based architecture. It differs from traditional Transformer-based models such as BERT and RoBERTa in that it uses a permutation-based training method, which allows the model to consider all possible word orderings in a sequence rather than just a fixed left-to-right or right-to-left order. This approach leads to improved performance on NLP tasks such as text classification, question answering, and sentiment analysis. XLNet achieved state-of-the-art results on NLP benchmark datasets, but like any other model it has some limitations. For instance, it has a complex training algorithm (a permutation-based training procedure), and it needs a large amount of high-quality, diverse training data to perform well.

Simultaneously, RoBERTa (Robustly Optimized BERT Pretraining Approach) was also introduced in 2019, but by Facebook AI. RoBERTa (Liu et al., 2019) improves upon BERT by training on a larger corpus of data, using dynamic masking, and training on the whole sentence rather than just the masked tokens. These modifications lead to improved performance on a wide range of NLP tasks, such as question answering, sentiment analysis, and text classification. RoBERTa is a high-performing LLM, but it also has some limitations. For example, since RoBERTa has a large number of parameters, inference can be slow; and while the model is highly proficient in English, it does not reach the same performance in other languages.

A few months later, the Salesforce Research team released CTRL (Conditional Transformer Language Model). CTRL (Keskar et al., 2019) is designed to generate text conditioned on specific prompts or topics, allowing it to produce coherent and relevant text for specific tasks or domains. CTRL is based on a Transformer neural network, similar to other large language models such as GPT-2 and BERT. However, it also includes a novel conditioning mechanism, which allows the model to be fine-tuned for specific tasks or domains. One advantage of CTRL is its ability to generate highly relevant and coherent text for specific tasks or domains, thanks to this conditioning mechanism. However, one limitation is that it may not perform as well as more general-purpose language models on more diverse or open-ended tasks. Moreover, the conditioning mechanism used by CTRL may require additional preprocessing steps or specialized knowledge to set up effectively.

In the same month as the CTRL model, NVIDIA introduced Megatron-LM (Shoeybi et al., 2019). Megatron-LM is designed to be highly efficient and scalable, enabling researchers and developers to train massive language models with billions of parameters using distributed computing techniques. Its architecture is similar to other large language models such as GPT-2 and BERT, but Megatron-LM uses a combination of model parallelism and data parallelism to distribute the workload across multiple GPUs, allowing it to train models with up to 8 billion parameters. However, one limitation of Megatron-LM is its complexity and high computational requirements, which can make it challenging to set up and use effectively. Additionally, the distributed computing techniques used by Megatron-LM can introduce extra overhead and communication costs, which can affect training time and efficiency.

A few months later, Hugging Face released a model called DistilBERT (Sanh et al., 2019). DistilBERT is a lighter version of the BERT model. It was designed to provide a more efficient and faster alternative to BERT while still retaining a high level of performance on a variety of NLP tasks. The model achieves up to 40% smaller size and 60% faster inference compared to BERT, without sacrificing much accuracy. DistilBERT performs well in tasks such as sentiment analysis, question answering, and named entity recognition. However, DistilBERT does not perform as well as BERT on some NLP tasks. In addition, it has been pre-trained on a smaller dataset than BERT, which limits its ability to transfer its knowledge to new tasks and domains.

Around the same time, in 2019, Facebook AI released BART (Bidirectional and Auto-Regressive Transformers). BART (Lewis et al., 2019) is a sequence-to-sequence (Seq2Seq) model pre-trained for natural language generation, translation, and comprehension. BART is a denoising autoencoder that uses a combination of denoising objectives during pre-training; these objectives help the model learn robust representations. BART has limitations for multilingual translation, its performance can be sensitive to the choice of hyperparameters, and finding the optimal hyperparameters can be a challenge. Additionally, its autoencoder design has limitations, such as a limited ability to model long-range dependencies between input and output variables.

Finally, we highlight the T5 (Text-to-Text Transfer Transformer) model, which was introduced by Google AI. T5 (Raffel et al., 2020) is a sequence-to-sequence Transformer-based model. It uses a Masked Span Prediction (MSP) objective during pre-training, which consists of randomly masking spans of text of arbitrary lengths; the model then predicts the masked spans. Although T5 achieved state-of-the-art results, it is designed as a general-purpose text-to-text model, which can sometimes produce predictions that are not directly relevant to a specific task or not in the desired format. Moreover, T5 is a large model: it requires high memory usage and can be slow at inference time.
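As an illustration of the text-to-text, span-masking interface described above, here is a minimal sketch using the Hugging Face transformers library; the "t5-small" checkpoint and the example sentence are illustrative choices, not the exact setup from the T5 paper.

```python
# Minimal sketch of T5's masked-span, text-to-text interface.
# Checkpoint and input sentence are example choices.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# In pre-training, masked spans are replaced by sentinel tokens
# (<extra_id_0>, <extra_id_1>, ...) and the model generates the
# missing spans in order.
text = "The <extra_id_0> walks in the <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```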

In this article, we have pointed out the pros and cons of ten groundbreaking LLMs that have emerged over the last five years. We have also delved into the architectures these models were built upon, showcasing the significant contributions they have made to advancing the NLP domain.

Anonymization of sensitive data by the combined approach of NLP and neural models

Data exploitation is more than ever a major issue within any type of organization. Several use cases are covered, from exploration to extraction of relevant and usable information, in order to:

  • Understand the environment of an organization
  • Better understand its employees
  • Improve its services, products and processes (use case of production data in a test and/or development environment)

Handling this mass of information is not without consequences. It contains sensitive information whose disclosure may harm legal entities and/or individuals. This is why, in May 2016, the European Parliament adopted the General Data Protection Regulation (GDPR), which aims to regulate data processing uniformly throughout the European Union. Its objectives: to strengthen the rights of individuals, to make actors that process data more accountable, and to promote cooperation between data protection authorities. Pseudonymization and anonymization thus appear to be indispensable techniques for protecting personal data and complying with regulations.

What is Pseudonymization and Anonymization?

ENISA [1] (the European Union’s cybersecurity agency) defines pseudonymization as a de-identification process: the processing of sensitive data in such a way that a natural person can no longer be directly identified without additional information. Anonymization, by contrast, is a process by which personal data are irreversibly altered in such a way that the data subject can no longer be identified, directly or indirectly, either by the controller alone or in collaboration with other third parties [1].

When considering the following text: “Emmanuel MACRON is the eighth President of the Fifth French Republic. Founder of the “En Marche!” movement, created on April 6, 2016, he led it until his first victory in the presidential election on May 7, 2017.”

There are three types of information:

  • the named entities: Emmanuel MACRON, April 6, 2016, May 7, 2017, En Marche, eighth
  • the mentions: President of the French Fifth Republic, Founder
  • Other identifying morphemes: first victory, the presidential election

The following table summarizes the expected result when applying these two techniques.

A third category of approach for processing sensitive data is emerging with the advances of neural algorithms for natural language processing: advanced pseudonymization. This approach can handle the vast majority of sensitive “identifying” information in a text. However, there remain edge cases that can only be detected if the context of the subject is known. Take the following text: “LinkedIn is a social network. In France, in 2022, LinkedIn has more than 25 million members and 12 million estimated monthly active members, making it the 6th largest social network.” Here the phrase “6th largest social network”, which is difficult to detect, can identify LinkedIn after a quick search on the Internet.

What is “sensitive data”?

Sensitive data is information that can identify a natural or legal person. This is the case for the following information when associated with a natural person: full name (surname and first name), location, organization, date of birth, addresses (email, home), identifying numbers (credit card, social security, telephone), and so on; or for information related to a legal person, such as the company name, its address, and its SIREN and SIRET identifiers.

How to pseudonymize data?

The CNIL [2] describes two types of pseudonymization techniques: those that rely on the creation of relatively basic pseudonyms (counter, random number generator) and those that rely on cryptographic techniques (secret key encryption, hash function). All of these methods explain how sensitive data should be handled in the context of pseudonymization. They do not explain how to identify it. The identification process can be simple when the data is tabular. In this case, it is sufficient to delete or encrypt the contents of the relevant columns.
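As a rough illustration of the two families of techniques mentioned by the CNIL, the sketch below pseudonymizes one column of tabular data with a simple counter and with a keyed hash (HMAC). It is an illustrative example only; the placeholder names, the secret key handling and the replacement tags are assumptions, not recommendations.

```python
# Illustrative sketch of two pseudonymization techniques for tabular data:
# a sequential counter and a keyed hash (HMAC). Values are placeholders.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-securely-stored-key"  # assumption: key managed elsewhere

def counter_pseudonyms(values):
    """Map each distinct value to an opaque sequential identifier."""
    mapping = {}
    result = []
    for value in values:
        if value not in mapping:
            mapping[value] = f"PERSON_{len(mapping) + 1}"
        result.append(mapping[value])
    return result

def hmac_pseudonym(value: str) -> str:
    """Keyed hash: deterministic, but not reversible without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

names = ["Alice Martin", "Bob Durand", "Alice Martin"]
print(counter_pseudonyms(names))            # ['PERSON_1', 'PERSON_2', 'PERSON_1']
print([hmac_pseudonym(n) for n in names])   # same input -> same pseudonym
```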

At Novelis, we are working on advanced pseudonymization of sensitive data contained in free text. Identification in this context is complex and is often performed manually by humans, which comes at a cost in time and skilled human resources. However, artificial intelligence (AI) and natural language processing (NLP) techniques are now robust enough to automate this task. Two types of approaches are generally distinguished for sensitive data extraction: neural approaches and rule-based approaches. Although they provide excellent results, especially since the emergence of Transformers (a deep learning architecture), neural approaches require large datasets to be relevant, which is not always the case in the industrial world. They also require an annotation effort by experts in order to provide the models with a quality dataset for training. As for rule-based models, they suffer from generalization problems: a rule-based model will tend to have good accuracy on the sample used as a training base but will be harder to apply to a new dataset not covered by the initial assumptions.

The approach proposed by the Novelis R&D team

We propose a hybrid approach exploiting the strengths of NLP techniques and neural models. First, we built a corpus containing addresses in order to train a neural model able to detect an address in a text. A benchmark of candidate models was performed in order to choose the most suitable one, and the selected model was then improved using a fine-tuning strategy. Combined with Python NLP libraries, the model provides a robust solution for extracting addresses and named entities such as people’s names, places, and organizations. Patterns (regular expressions) were designed by Novelis experts for the extraction of other identified sensitive data. Finally, heuristics were used to disambiguate and correct the extracted information.
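The sketch below gives a rough idea of what such a hybrid pipeline can look like; it is not the Novelis system itself. The public NER checkpoint, the regular expressions and the replacement tags are assumptions made purely for illustration.

```python
# Minimal illustration of a hybrid (neural + rule-based) pseudonymization
# pipeline; checkpoint, patterns and tags are example assumptions.
import re
from transformers import pipeline

# Neural part: a public multilingual NER model groups sub-tokens into entities.
ner = pipeline("ner", model="Davlan/bert-base-multilingual-cased-ner-hrl",
               aggregation_strategy="simple")

# Rule-based part: simple patterns for emails and French-style phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"),
}

def pseudonymize(text: str) -> str:
    # Replace regex matches first.
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"<{tag}>", text)
    # Then replace detected persons, locations and organizations,
    # working backwards so character offsets stay valid.
    for entity in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        text = text[:entity["start"]] + f"<{entity['entity_group']}>" + text[entity["end"]:]
    return text

print(pseudonymize("Contact Jean Dupont at jean.dupont@example.org or 06 12 34 56 78."))
```

In a real system, heuristics would then be applied to disambiguate overlapping matches and correct the extracted spans, as described above.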

With this approach, we have built a reliable and robust system to process sensitive information contained in any type of document (PDF, Word, email, etc.). The goal is to relieve data processors of low value-added tasks through automated assistance.

References:

  • [1] : https://www.enisa.europa.eu/news/enisa-news/enisa-proposes-best-practices-and-techniques-for-pseudonymisation
  • [2] : https://www.cnil.fr/fr/recherche-scientifique-hors-sante/enjeux-avantages-anonymisation-pseudonymisation

SQL Generation from Natural Language: A Seq2Seq Model – Transformers Architecture

Novelis technical experts have once again achieved a new state of the art. Discover our study SQL Generation from Natural Language: A Sequence-to-Sequence Model Powered by the Transformers Architecture and Association Rules, published in the Journal of Computer Science.

Thanks to the Novelis Research Team for their knowledge and expertise.

Abstract

Using natural language (NL) to interact with relational databases allows users of any background to easily query and analyze large amounts of data. This requires a system that understands user questions and automatically translates them into a structured query language (such as SQL). The best-performing Text-to-SQL systems use supervised learning (usually expressed as a classification problem) and either treat this task as a sketch-based slot-filling problem, or first convert the question into an intermediate logical form (ILF) and then convert it into the corresponding SQL query. However, unsupervised modeling that directly translates the question into a SQL query has proven to be more difficult. In this sense, we propose a method to directly convert NL questions into SQL statements.

In this research, we propose a sequence-to-sequence (Seq2Seq) parsing model for the NL-to-SQL task, powered by the Transformer architecture and exploring two language models (LMs): the Text-to-Text Transfer Transformer (T5) and the multilingual pre-trained text-to-text Transformer (mT5). In addition, we use a transformation-based learning algorithm to update aggregation predictions based on association rules. The resulting model achieves a new state of the art on the WikiSQL dataset for weakly supervised SQL generation.
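To give an idea of how a T5-style Seq2Seq model can be framed for Text-to-SQL, here is a hedged sketch of the fine-tuning step with the Hugging Face transformers library. The checkpoint, the serialization of the question and table schema, and the example SQL target are illustrative assumptions, not the exact setup used in the paper.

```python
# Sketch of Seq2Seq fine-tuning for Text-to-SQL with a T5-style model.
# Checkpoint, input serialization and the example pair are assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The question and the table schema are serialized into one source string;
# the target during fine-tuning is the SQL query itself.
source = ("translate to SQL: How many presidents were born in New York? "
          "table: presidents | columns: name, birth_place, term_start")
target = "SELECT COUNT(name) FROM presidents WHERE birth_place = 'New York'"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Standard Seq2Seq cross-entropy over the SQL tokens; in a real training
# loop this loss would be backpropagated over many such pairs.
loss = model(**inputs, labels=labels).loss
print(float(loss))
```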

About the study

“In this study, we address the Text-to-SQL task with WikiSQL (Zhong et al., 2017). This dataset is the first large-scale dataset for Text-to-SQL, with about 80K human-annotated pairs of natural language questions and SQL queries. WikiSQL is very challenging because the tables and questions are very diverse. The dataset contains about 24K different tables.

There are two leaderboards for the WikiSQL challenge: Weakly supervised (without using logical form during training) and supervised (with logical form during training). On the supervised challenge, there are two results: Those with Execution Guided (EG) inference and those without EG inference.”

Read the full article

Journal of Computer Science – Volume 17 No. 5, 2021, 480-489 (10 pages)

Journal of Computer Science aims to publish research articles on the theoretical basis of information and computing, and practical technologies for implementation and application in computer systems.

Novelis ranks 2nd in international NLP Research Challenge

One more step towards the democratization of Artificial Intelligence and NLP (Natural Language Processing), Challenge SPIDER

Paris, March 25, 2021 – Novelis, an innovative consulting and technology company, is currently taking part in two international research challenges aiming to automatically generate SQL queries from natural language. Following the recent publication of its work, Novelis is positioned alongside Artificial Intelligence leaders such as Microsoft, Salesforce, and Google.

The worldwide volume of data processed daily has never been so large. These data are mostly stored in so-called relational databases, which require mastering the Structured Query Language (SQL) to store or manipulate them. Novelis’ project aims to democratize access to these data by automatically generating technically complex queries from human language, a field known as Natural Language Processing (NLP).

Novelis in major international challenges SPIDER and WikiSQL

Led by Yale University, the Spider Challenge brings together a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset. The goal is to transform natural English text into executable SQL queries, a task also called “Text-to-SQL”. The challenge consists of 10,181 questions and 5,693 unique complex SQL queries over 200 databases with multiple tables, covering 138 domains. Following the publication of its work, and at the time of publication of this article, Novelis is ranked 2nd in the world, alongside Salesforce, only 2.9 points behind the first-placed team (Tel Aviv University & Allen Institute for AI). It is important to note that this type of challenge is evolving and results may change. Find out more and discover the results: Spider: Yale Semantic Parsing and Text-to-SQL Challenge (yale-lily.github.io)

The objective of the WikiSQL Challenge is the same as for Spider, but with different constraints and contexts. Here, participants deal with a single table at a time, using models trained with either unsupervised learning (where the machine works on its own) or supervised learning (where the machine relies on hints from which it generates predictions). Leading companies in Artificial Intelligence and NLP are taking part in this challenge, along with renowned universities: Microsoft, Google, Alibaba, Salesforce, the University of California, Berkeley, Fudan University, and more. For this challenge, Novelis has developed a hybrid learning model that ranks 7th out of 31 scientific projects. Follow the link for more information and complete results: GitHub – salesforce/WikiSQL: A large annotated semantic parsing corpus for developing natural language interfaces.

Innovation and R&D: A strategic priority for Novelis’ development

Since its beginning, Novelis has invested massively (30% of its turnover) in Research and Development. According to Mehdi Nafe, CEO of Novelis: “Beyond the impact on fundamental research, our objective is to change the software design model to achieve operational excellence, change the relationship we have with technologies, and have a sustainable impact on innovation processes within society. In recent years, the major progress in data science, AI and, more recently, NLP represents huge potential in terms of business process optimization and use. The creation of an R&D Lab is one of Novelis’ founding acts. For a technology company, engaging in research is a key element. It is essential for better serving our customers.”

NL2Code: A Corpus and Semantic Parser for Natural Language to Code

Discover our conference paper NL2Code: A Corpus and Semantic Parser for Natural Language to Code, presented at the International Conference on Smart Information & Communication Technologies and published in the Lecture Notes in Electrical Engineering book series (LNEE, volume 684).

Thanks to the Novelis Research Team for their knowledge and experience.

Abstract

In this work, we propose a new semantic parsing method and dataset that allow the automatic generation of source code from specifications and descriptions written in natural language (NL2Code). Our long-term goal is to allow any user to create applications based on specifications that describe the requirements of the complete system. This involves researching, designing, and implementing intelligent systems that allow the automatic generation of computer projects (skeleton, configuration, initialization scripts, etc.) from user needs expressed in natural language. We take a first step in this area by providing a new dataset, built specifically by Novelis, and by implementing a method that enables machines to understand user needs expressed in natural language in specific domains.

About the study

“The dream of using French or any other natural language to generate code in a specific programming language has existed for almost as long as the task of programming itself. Although significantly less precise than a formal language, natural language as a programming medium would be universally accessible and would support the automation of application development. However, the diversity and ambiguity of texts, the compositional nature of code, and the layered abstractions in software make it difficult to generate code from functional specifications written in natural language. The use of artificial intelligence offers interesting potential for supporting new tools in almost all areas of software engineering and program analysis. This work presents a new dataset and semantic parsing method for a novel and ambitious domain: program synthesis.

Our long-term goal is to enable any user to generate complete frontend/backend web applications based on Java/JEE technology that respect an n-tier (multilayer) architecture. As a first step in this direction, we provide a dataset (corpus), proposed by the company Novelis and built from Java questions and answers on various topics from the website “Stack Overflow”, together with a new semantic parsing method.”

Read the full article

Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 684) 

SpringerLink provides researchers with access to millions of scientific documents from journals, books, series, protocols, reference works and proceedings.

Doctor in AI/ML/NLP – M/F

Lab R&D – Permanent Contract – Paris – PhD

Novelis at the Ecole Polytechnique Féminine (EPF) for Research Day

Research Day at EPF: organized for 20 years, this day is dedicated to research and innovation.

On the occasion of the EPF Research Day, Novelis will be present at the school to host a round table on innovation in digital technology. Following this presentation, students will be able to meet our team at our stand and learn more about the work of Novelis’ internal R&D laboratory by talking directly with members of the research and recruitment teams.

At Novelis, we aim to use new technologies to meet our clients’ business needs and thus offer them adapted solutions to support them in their digital transformation.
This is reflected in our R&D Lab, in which we invest more than 25% of our revenue. Our doctoral researchers work daily on fundamental and experimental research around AI (machine learning, image processing and NLP) with the objective of exceeding the state of the art in AI and NLP.

We are very proud to invest in scientific research to help build our future, so we are delighted to be able to share the results of our work with the students of the EPF engineering school.