Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation

Discover the first version of our scientific publication “Low-cost language models: Survey and performance evaluation on Python code generation”, published on arXiv and submitted to the journal Engineering Applications of Artificial Intelligence. The article is already publicly available.

Thanks to the Novelis research team – including Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri – for their know-how and expertise.

Abstract

“Large Language Models (LLMs) have become the go-to solution for many Natural Language Processing (NLP) tasks due to their ability to tackle various problems and produce high-quality results. Specifically, they are increasingly used to automatically generate code, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has led to high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model in problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.”
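To give a concrete flavor of the Chain-of-Thought prompting mentioned in the abstract, here is a minimal sketch of how such a prompt might be sent to a CPU-compatible model. It assumes the llama-cpp-python bindings for CPU inference; the model file, prompt wording, and sampling parameters are illustrative placeholders, not the exact setup from the paper.

```python
from llama_cpp import Llama

# Load a quantized model that runs on CPU (the file name is a placeholder).
llm = Llama(model_path="codellama-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# A Chain-of-Thought style prompt: ask the model to reason step by step
# before emitting the final Python function.
prompt = (
    "You are a careful Python programmer.\n"
    "Problem: return the second largest element of a list of integers.\n"
    "First, think step by step: restate the problem, outline the algorithm,\n"
    "and list the edge cases. Then write the final Python function.\n"
)

output = llm(prompt, max_tokens=512, temperature=0.2)
print(output["choices"][0]["text"])
```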


A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text

Discover our scientific publication “A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text”, published by Elsevier and available on ScienceDirect.

Thanks to the Novelis research team – notably Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, El Hassane Ettifouri, Walid Dahhane – for their know-how and expertise.

Abstract

Java code generation consists of automatically generating Java code from natural language text. This NLP task helps increase programmers’ productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is challenging because of strict syntactic rules and the need for a deep understanding of the semantics of the programming language. Many works have tried to tackle this task using either RNN-based or Transformer-based models. The latter have achieved remarkable advances in the domain and can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder–decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models on the Java code generation task. We focus on the most important methods, presenting their merits and limitations as well as the objective functions used by the community. In addition, we provide a detailed description of the datasets and evaluation metrics used in the literature. Finally, we discuss the results of different models on the CONCODE dataset and propose some future directions.
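To make the three-family taxonomy concrete, the sketch below loads one representative of each group with Hugging Face Transformers. The specific checkpoints are illustrative examples, not the models surveyed in the paper.

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# (1) Encoder-only: produces contextual embeddings, typically used for
#     code understanding tasks (e.g. code search, clone detection).
encoder_only = AutoModel.from_pretrained("microsoft/codebert-base")

# (2) Decoder-only: autoregressive, left-to-right generation.
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# (3) Encoder-decoder: encodes the natural-language description,
#     then decodes the target code sequence.
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")
```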

Elsevier is a data analytics company that helps institutions and health and science professionals improve their performance for the benefit of humanity.

ScienceDirect is the world’s leading source for scientific, technical and medical research.

JaCoText: A Pretrained Model for Java Code-Text Generation

Discover our article on “JaCoText: A Pretrained Model for Java Code-Text Generation”, published in the journal “World Academy of Science, Engineering and Technology”.

Our research team also presented its work at the International Conference on Code Generation and Implementation: Watch the replay

Thanks to the Novelis research team for their knowledge and experience.

Authors: Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Walid Dahhane, El Hassane Ettifouri

Abstract

Pretrained transformer-based models have shown high performance in natural language generation tasks. However, a new wave of interest has surged: automatic programming language generation. This task consists of translating natural language instructions into programming code. Although well-known pretrained models for language generation have achieved good performance in learning programming languages, effort is still needed in automatic code generation. In this paper, we introduce JaCoText, a model based on the Transformer neural network. It aims to generate Java source code from natural language text. JaCoText leverages the advantages of both natural language and code generation models. More specifically, we study some findings from the state of the art and use them to (1) initialize our model from powerful pretrained models, (2) explore additional pretraining on our Java dataset, (3) carry out experiments combining unimodal and bimodal data in training, and (4) scale the input and output length during fine-tuning of the model. Experiments conducted on the CONCODE dataset show that JaCoText achieves new state-of-the-art results.

About the article

“In this paper, we present JaCoText, a pretrained model based on Transformers [5]. First, we initialize our model from pretrained weights of CoTexT-1CC and CoTexT-2CC, instead of performing a training from scratch. Later, we conduct an additional pretraining step using data that belongs to a specific programming language (Java in our case). Moreover, unlike works that based their pretraining on CodeSearchNet [18] such as CodeBERT [19] and CoTexT [2], we use more java data in the pretraining stage of our model, as [13] and [14] have shown that Transformers neural network improves its performance significantly from increasing the amount of pretraining data. Furthermore, we carry out experiments to measure the impact of the input and output sequences length on code generation task. Finally, we test the unimodal data and study its impact on the model’s performance. This study is crucial to evaluate the model in the pretraining stage.”
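Since CoTexT is T5-based, the recipe described above can be approximated in a few lines with Hugging Face Transformers. The sketch below is a simplified illustration: the `t5-base` checkpoint, the toy example, and the sequence lengths stand in for the paper’s actual CoTexT weights, Java pretraining corpus, and configuration.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# (1) Initialize from pretrained weights instead of training from scratch.
#     ("t5-base" is a stand-in for the CoTexT checkpoints used in the paper.)
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# (4) Scale input/output lengths during fine-tuning (values are placeholders).
nl = "concatenate two strings and return the result"
java = "String concat(String a, String b) { return a + b; }"

inputs = tokenizer(nl, max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(java, max_length=256, truncation=True,
                   return_tensors="pt").input_ids

# One illustrative training step on a (text, code) pair; steps (2) and (3),
# continued pretraining on Java and mixing unimodal/bimodal data, amount to
# running the same loop over the corresponding corpora.
loss = model(**inputs, labels=labels).loss
loss.backward()  # optimizer and scheduler omitted for brevity
```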

Read the full article


Novelis ranked 1st in the Microsoft CodeXGLUE international challenge

Novelis ranks 1st worldwide in the international CodeXGLUE challenge, organized by Microsoft, on Java code generation from natural language.

Last March, Novelis was already in the spotlight thanks to its participation in two international challenges: the Spider challenge organized by Yale University and the WikiSQL challenge organized by Cornell University. In these challenges, Novelis took second and seventh place respectively, alongside the biggest names in AI and RPA.

The Novelis R&D Lab team won 1st place in the international CodeXGLUE challenge on Java code generation from natural language:

The CodeXGLUE challenge – the General Language Understanding Evaluation benchmark for CODE – organized by Microsoft, brings together large companies such as IBM and Microsoft and international universities such as Case Western Reserve University, UCLA/Columbia University, and INESC-ID/Carnegie Mellon University.

With CodeXGLUE, Microsoft seeks to “support the development of models that can be applied to various code intelligence problems, with the goal of increasing the productivity of software developers”. Microsoft wants to encourage researchers to take part in current challenges to further advance code intelligence.

According to Evans Data Corporation, there were 23.9 million professional developers worldwide in 2019, and the number is expected to reach 28.7 million by 2024. “With the growing population of developers, code intelligence, which aims to leverage AI to help software developers improve the productivity of the development process, is growing increasingly important in both communities of software engineering and artificial intelligence.” (github.com)

The Challenge includes 14 datasets for 10 diverse programming language tasks covering:

  • Code-Code (clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation),
  • Text-Code (natural language code search, text-to-code generation),
  • Code-Text (code summarization),
  • Text-Text (documentation translation).

Novelis participated in the Text-Code task, which consists of automatically generating Java source code from natural language.

Currently, the Text-Code task leaderboard has 9 participants. Once we had built a model that met our expectations, we submitted our test results for official evaluation by the Microsoft community, based on 3 criteria (see the sketch after the list):

  • Exact Match (EM),
  • BLEU score,
  • CodeBLEU.
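For reference, the first two metrics can be computed with standard tooling, as in the minimal sketch below (using NLTK’s sentence-level BLEU; CodeBLEU additionally scores weighted n-grams, AST match, and data flow, and requires Microsoft’s reference implementation):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "public int add ( int a , int b ) { return a + b ; }".split()
candidate = "public int add ( int a , int b ) { return a + b ; }".split()

# Exact Match: 1 if the generated token sequence equals the reference.
em = int(candidate == reference)

# BLEU: n-gram overlap between candidate and reference (smoothed, since
# short sequences often have zero higher-order n-gram matches).
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

print(f"EM={em}, BLEU={bleu:.2f}")
```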

The Microsoft community then updated the ranking on the leaderboard that you can find below.

[Leaderboard screenshot: Novelis ranked 1st in the Microsoft CodeXGLUE international challenge]

“We have been working for more than two years on the problem of generating programming-language code from a need described in natural language. Our work draws on several approaches, designed and implemented by the Novelis R&D Lab team, and has led to several results in the task of generating business code in Python and Java. Until now, we did not have a benchmark or challenge that would allow us to evaluate our results objectively. Microsoft’s CodeXGLUE challenge gives us this credibility because we could officially evaluate our results. Moreover, the 1st place obtained in the code generation task proves that we are on the right track. Note that the scores published in this challenge are not very high because, on the one hand, the code generation task is very complex and, on the other hand, the proposed models are not yet mature enough.”

Novelis has placed innovation and R&D at the heart of its development strategy

Since its creation, Novelis has chosen to invest massively (30% of its turnover) in research and development.

For El Hassane Ettifouri, CIO and Director of the Novelis R&D Lab, this is no small matter:

“Today, very few companies are willing to invest a quarter of their turnover in research. It is this risk-taking that differentiates Novelis from other companies. We want to have a foothold in the future and take part in building that future by investing in research on technologies. Innovation is an integral part of Novelis’ DNA.

Moreover, our research work is concrete and has a real impact not only on our customers – who reap all the benefits of our technologies for the automation of their processes – but also on our employees, who work in an innovative environment.”

International Conference on Code Generation and Implementation

During the “International Conference on Code Generation and Implementation”, the team of PhDs from our internal research laboratory presented the results of their work on a new approach to generating Java code from descriptions written in natural language.

This international research conference aims to bring together academic scientists, researchers and leading academics to exchange and share their experiences and research results on all aspects of code generation and implementation.  

It also helps to create an interdisciplinary platform for researchers and practitioners to discuss innovations, trends and solutions in the fields of code generation and implementation. 

Our PhD team, composed of Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Walid Dahhane, and El Hassane Ettifouri, contributed to the conference by submitting the results of their research.

In particular, Jessica López Espejel, PhD Research and Development Engineer in Novelis’ R&D Lab, presented our new approach to generating Java code from descriptions written in natural language. This is also the approach we showcased in the international CodeXGLUE challenge organized by Microsoft on Java code generation from natural language, for which we were ranked 1st.

At this conference we introduced JaCoText, a model based on the Transformer neural network. It aims at generating Java source code from natural language text. JaCoText exploits the advantages of both code generation and natural language models. Specifically, we study some state-of-the-art findings and use them to (1) initialize our model from powerful pretrained models, (2) explore additional pretraining on our Java dataset, (3) conduct experiments combining unimodal and bimodal data in training, and (4) scale the length of the input and output during model fine-tuning. Experiments conducted on the CONCODE dataset show that JaCoText achieves new state-of-the-art results.

Watch the replay of our presentation.  

Jessica López Espejel was proud to receive the “Best Presentation Award” from the program committee, in accordance with the conference awards program.

CodeWEup, you’ve already got the codes.

Register for the first ever Novelis Code Competition on February 17, 2022!

On February 17th, enter the virtual arena and participate in the CodeWEup challenge, the code competition organized by Novelis.

From 8pm to 10pm, you will be able to challenge other developers during an evening of 100% Java coding!


The principle is simple

For two hours, all participants will be able to tackle four programming exercises of varying difficulty. The most difficult exercise will count for 45% of the overall score.

The originality of the contest: coders will compete in one single programming language, Java.


How to win CodeWEup?

By getting the best score as quickly as possible!

An important point: participants may start with the exercise of their choice and divide the two hours allowed among the different problems as they wish.
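Purely as a toy illustration of how such a weighted score with a speed tiebreak might work: only the 45% weight for the hardest exercise comes from the rules above; the remaining weights and the tiebreak rule are hypothetical.

```python
# Hypothetical scoring: only the 45% weight for the hardest exercise is
# stated in the rules; the other weights are illustrative placeholders.
WEIGHTS = [0.15, 0.15, 0.25, 0.45]  # exercises 1-4, hardest last

def overall_score(per_exercise_scores):
    """Weighted average of per-exercise scores in [0, 100]."""
    return sum(w * s for w, s in zip(WEIGHTS, per_exercise_scores))

def rank_key(participant):
    # Higher score wins; on equal scores, less time used wins (assumed rule).
    return (-participant["score"], participant["minutes_used"])

participants = [
    {"name": "A", "score": overall_score([100, 90, 80, 60]), "minutes_used": 110},
    {"name": "B", "score": overall_score([100, 90, 80, 60]), "minutes_used": 95},
]
print(sorted(participants, key=rank_key)[0]["name"])  # -> "B" (faster on a tie)
```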


Several prizes to be won

1st place: an iPhone 13 256 GB worth €1029


Are you looking for a job opportunity?

By participating in our CodeWEup competition you can also, if you wish, be spotted by one of our talent recruiters and perhaps land a job within our teams.

So, will you dare to take up the challenge? If you want to test your skills before the big day, we suggest you complete this test exercise in under 20 minutes, accessible from your registration confirmation.

Good luck to all of you!

>>Read the competition rules