Anonymization of sensitive data by the combined approach of NLP and neural models

Exploiting data is more than ever a major concern for organizations of every kind. Use cases range from exploration to the extraction of relevant, usable information, in order to:

  • Understand the environment of an organization
  • Better understand its employees
  • Improve its services, products and processes (use case of production data in a test and/or development environment)

Handling this mass of information is not without consequences. It contains sensitive information whose disclosure may harm legal entities and/or individuals. This is why the European Parliament adopted the General Data Protection Regulation (GDPR) in April 2016, with the aim of framing the processing of data uniformly throughout the European Union. Its objectives are to strengthen the rights of individuals, to make the actors that process data more accountable, and to promote cooperation between data protection authorities. Pseudonymization and anonymization thus appear to be indispensable techniques for protecting personal data and supporting regulatory compliance.

What is Pseudonymization and Anonymization?

ENISA [1] (the European Union’s cybersecurity agency) defines pseudonymization as a de-identification process: sensitive data is processed in such a way that a natural person can no longer be directly identified without additional information. Anonymization, by contrast, is a process by which personal data is irreversibly altered so that the data subject can no longer be identified, directly or indirectly, either by the controller alone or in collaboration with third parties [1].

Consider the following text: “Emmanuel MACRON is the eighth President of the Fifth French Republic. Founder of the “En Marche!” movement, created on April 6, 2016, he led it until his first victory in the presidential election on May 7, 2017.”

It contains three types of information:

  • Named entities: Emmanuel MACRON, April 6, 2016, May 7, 2017, En Marche, eighth
  • Mentions: President of the Fifth French Republic, Founder
  • Other identifying morphemes: first victory, the presidential election

The following table summarizes the expected result when applying these two techniques.
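The practical difference between the two techniques can also be sketched in a few lines of Python. This is a minimal illustration, not the method described in this article: the entity and mention lists are hand-labelled assumptions (in practice they would come from a detection step), the placeholder format is hypothetical, and the example sentence is reproduced without its inner quotation marks. Pseudonymization keeps a lookup table (the "additional information") that allows re-identification; anonymization keeps nothing.

```python
from collections import defaultdict

TEXT = ("Emmanuel MACRON is the eighth President of the Fifth French Republic. "
        "Founder of the En Marche! movement, created on April 6, 2016, he led it "
        "until his first victory in the presidential election on May 7, 2017.")

# Hand-labelled spans (assumption: a real system would detect these automatically).
ENTITIES = [
    ("Emmanuel MACRON", "PERSON"),
    ("En Marche!", "ORG"),
    ("April 6, 2016", "DATE"),
    ("May 7, 2017", "DATE"),
]
MENTIONS = [("eighth President of the Fifth French Republic", "ROLE")]


def pseudonymize(text, spans):
    """Replace each span with a numbered placeholder and return the lookup table."""
    counters = defaultdict(int)
    mapping = {}
    for value, label in spans:
        counters[label] += 1
        placeholder = f"<{label}_{counters[label]}>"
        mapping[placeholder] = value
        text = text.replace(value, placeholder)
    return text, mapping


def reidentify(text, mapping):
    """Reverse a pseudonymization; only possible for whoever holds the mapping."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


def anonymize(text, spans):
    """Replace each span with its bare category; no mapping is kept."""
    for value, label in spans:
        text = text.replace(value, f"<{label}>")
    return text


pseudo_text, lookup = pseudonymize(TEXT, ENTITIES)
anon_text = anonymize(TEXT, ENTITIES + MENTIONS)
print(pseudo_text)
print(anon_text)
```

Note that the anonymized version also has to generalize the mentions: leaving "eighth President of the Fifth French Republic" in place would still identify the person even with the name removed.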

A third category of approach for processing sensitive data is emerging with the advances of neural algorithms for natural language processing: advanced pseudonymization. It can process the vast majority of sensitive, identifying information in a text. Marginal cases remain, however, in which the subject can still be identified by someone who knows the context. Take the following text: “LinkedIn is a social network. In France, in 2022, LinkedIn has more than 25 million members and 12 million estimated monthly active members, making it the 6th largest social network”. Here the phrase “6th largest social network” is difficult to detect, yet a quick Internet search is enough to link it to LinkedIn.

What is “sensitive data”?

Sensitive data is information that can identify a natural or legal person. For a natural person, this includes the full name (surname and first name), location, organization, date of birth, addresses (email, postal), and identifying numbers (credit card, social security, telephone). For a legal person, it includes the company name, its address, and its SIREN and SIRET identifiers.

How to pseudonymize data?

The CNIL [2] describes two types of pseudonymization techniques: those that rely on the creation of relatively basic pseudonyms (counter, random number generator) and those that rely on cryptographic techniques (secret key encryption, hash function). These methods describe how sensitive data should be transformed once it has been located; they do not explain how to identify it. Identification is simple when the data is tabular: it is sufficient to delete or encrypt the contents of the relevant columns.
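As an illustration of these families applied to one column of tabular data, here is a minimal Python sketch using only the standard library. It is not taken from the CNIL guidance: the function names, the sample rows and the key are hypothetical. Secret key encryption is not shown because the standard library ships no cipher; a keyed hash (HMAC-SHA256) is used instead as the cryptographic, key-dependent example.

```python
import hashlib
import hmac
import secrets
from itertools import count


def counter_pseudonymizer(prefix="ID"):
    """Basic technique: assign PREFIX-1, PREFIX-2, ... in order of first appearance."""
    table, counter = {}, count(1)

    def pseudonym(value):
        if value not in table:
            table[value] = f"{prefix}-{next(counter)}"
        return table[value]

    return pseudonym, table


def random_pseudonymizer(nbytes=8):
    """Basic technique: draw one random token per distinct value."""
    table = {}

    def pseudonym(value):
        if value not in table:
            table[value] = secrets.token_hex(nbytes)
        return table[value]

    return pseudonym, table


def hash_pseudonym(value):
    """Cryptographic technique: unkeyed SHA-256 digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash_pseudonym(value, key):
    """Cryptographic technique: HMAC-SHA256, recomputable only with the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_column(rows, column, pseudonym):
    """Return copies of the rows with one column replaced by its pseudonym."""
    return [{**row, column: pseudonym(row[column])} for row in rows]


# Hypothetical sample data.
ROWS = [
    {"name": "Alice Martin", "city": "Paris"},
    {"name": "Bob Durand", "city": "Lyon"},
    {"name": "Alice Martin", "city": "Nice"},
]

by_counter, counter_table = counter_pseudonymizer("PERSON")
print(pseudonymize_column(ROWS, "name", by_counter))

key = secrets.token_bytes(32)  # must be stored separately from the data
print(pseudonymize_column(ROWS, "name", lambda v: keyed_hash_pseudonym(v, key)))
```

The counter and random variants need their lookup table to be kept secret, whereas the keyed hash only needs the key. An unkeyed hash is the weakest option when the set of possible values is small, since an attacker can hash candidate values and compare.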

At Novelis, we are working on advanced pseudonymization of sensitive data contained in free text. Identification in this context is complex and is often performed manually, which is costly in time and skilled human resources. Artificial intelligence (AI) and natural language processing (NLP) techniques are, however, now robust enough to automate this task. Two types of approaches to sensitive data extraction are generally distinguished: neural approaches and rule-based approaches. Neural approaches provide excellent results, especially since the emergence of Transformers (a deep learning architecture), but they require large datasets, which are not always available in industry. They also require annotation by experts in order to provide the models with a quality training dataset. Rule-based models, for their part, generalize poorly: they tend to achieve good accuracy on the sample used to design the rules but are harder to apply to a new dataset that was not covered by the initial assumptions.

The approach proposed by the Novelis R&D team

We propose a hybrid approach exploiting the strengths of NLP techniques and neural models. First, we built a corpus containing addresses in order to train a neural model able to detect an address in a text. We benchmarked several models to select the most suitable one, then improved it through fine-tuning. Combined with Python NLP libraries, the model provides a robust solution for extracting addresses and named entities such as people’s names, places and organizations. Novelis experts designed patterns (regular expressions) to extract the other types of sensitive data identified. Finally, heuristics are used to disambiguate and correct the extracted information.
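The pattern and heuristic layers of such a pipeline can be sketched as follows. This is only an illustration under stated assumptions, not the Novelis implementation: the regular expressions, the labels and the sample text are hypothetical, and the neural address model and NER libraries are not reproduced. Two heuristics are shown: a Luhn checksum to reject digit sequences that merely look like a SIREN or SIRET (these identifiers generally satisfy the Luhn formula), and a longest-match rule to resolve overlapping candidates.

```python
import re

# Illustrative patterns only (assumptions, deliberately simplified).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_FR": re.compile(r"(?<!\d)(?:\+33\s?|0)[1-9](?:[\s.]?\d{2}){4}(?!\d)"),
    "SIRET": re.compile(r"(?<!\d)\d{3}\s?\d{3}\s?\d{3}\s?\d{5}(?!\d)"),
    "SIREN": re.compile(r"(?<!\d)\d{3}\s?\d{3}\s?\d{3}(?!\d)"),
}


def luhn_valid(number):
    """Return True if the digits of `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, starting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


# Heuristic 1: candidates for these labels must pass a checksum.
VALIDATORS = {"SIREN": luhn_valid, "SIRET": luhn_valid}


def extract_sensitive(text):
    """Return non-overlapping (start, end, label, value) matches, sorted by position."""
    candidates = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            validator = VALIDATORS.get(label)
            if validator and not validator(m.group()):
                continue
            candidates.append((m.start(), m.end(), label, m.group()))
    # Heuristic 2: on overlap keep the longest match (a SIRET beats the SIREN inside it).
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for cand in candidates:
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)


# Synthetic sample: the identifiers were constructed to pass or fail the checksum.
SAMPLE = ("Contact: jean.dupont@example.com, tel. 01 23 45 67 89. "
          "SIRET 123 456 782 00002, order no. 123 456 789.")

for start, end, label, value in extract_sensitive(SAMPLE):
    print(label, value)
```

In the sample, the order number has the shape of a SIREN but fails the checksum, so it is discarded, while the SIREN embedded in the SIRET is dropped in favour of the longer match.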

With this approach, we have built a reliable and robust system to process sensitive information contained in any type of document (PDF, Word, email, …). The goal is to relieve data processors of low value-added tasks through automated assistance.

References:

  • [1] https://www.enisa.europa.eu/news/enisa-news/enisa-proposes-best-practices-and-techniques-for-pseudonymisation
  • [2] https://www.cnil.fr/fr/recherche-scientifique-hors-sante/enjeux-avantages-anonymisation-pseudonymisation

TECH500: Novelis is positioned among the tech companies to join in 2022!

TECH500 2022, the ranking of the 500 tech companies that recruit the most in France, positions Novelis in the TOP 200 tech companies to join in 2022!

Data Recruitment, the recruitment agency specialized in tech, has just published the 1st edition of its ranking of the 500 companies that recruit the most and that candidates should join in 2022.

With 51% growth in headcount, Novelis is ranked 191st on the list, once again demonstrating its strong growth potential, its desire to recruit and its key position in the French Tech ecosystem.

Indeed, 4 months earlier, in December 2021, Novelis was already ranked 47th out of 500 in the prestigious FW500 ranking of the top 500 growing companies in French Tech.

This position in the TECH500 ranking of Data Recruitment allows us to really assert ourselves alongside the fast-growing French tech companies. We are proud to count among our staff talents from all over the world. At Novelis, more than 10 nationalities are represented! In 1 year we have recruited nearly 40 employees and it’s not going to stop there… Join us!

Linda Mefidene, Human Resources Manager at Novelis.

TECH500: the ranking that reveals new players for candidates to join

Based solely on headcount growth from March 2021 to March 2022, the first edition of this ranking honors Tech companies of all sizes: start-ups, scale-ups, SMEs and mid-caps (ETIs).

25,000 companies solicited – 2,533 companies studied – 500 companies highlighted.

This ranking also represents 23,849 jobs created over 12 months by these 500 companies.

The five most representative keywords among the top 500 companies are:

  • SaaS (13%)
  • Health (5.13%)
  • Fintech (4.82%)
  • Marketing (4.04%)
  • Big Data (3.91%)

[Webinar] Cybersecurity and GDPR: How to prevent data breaches

76% of organizations suffered data breaches in 2023

Attacks and data breaches are multiplying at an alarming rate. This statistic highlights the scale of the problem and underlines the urgent need to strengthen data security. The average financial cost of a cyberattack on an SME is estimated at between €300,000 and €500,000 (Cisco study). For small and medium-sized businesses, such an impact can be difficult to overcome.

With the rapid development of technology and the proliferation of digital data, it is essential to guarantee the confidentiality, integrity and availability of this data.

We therefore decided to organize a webinar with Dipeeo to explore the challenges of data security, tackling both the technical and legal aspects for a complete and optimized vision.

Our expert Oussama Hamdi, Senior Consultant in Cybersecurity, will address the technical side: data security relies on several measures and technologies to protect sensitive information against potential threats.

On the legal side, Raphaël Buchard, former GDPR lawyer, certified DPO, and CEO and co-founder of Dipeeo, will give us the keys to understanding data privacy laws and regulations.

On the agenda for this webinar:

  • How to prevent: the technical and organizational security measures to put in place
  • How to manage responsibility: from legal texts to the risks involved
  • How to act: what to do in the event of a problem, and how, illustrated with a case study

The replay is only available in French.

[Webinar] Facilitate and accelerate GDPR compliance with data anonymization

Technological advancements (connected objects, development of 5G) have made the exchange of massive data in our society more fluid. Today, this data represents a real wealth in terms of the quantities of information that could be used to analyze political climates, predict crises, and improve services, products, and processes, for example. This phenomenon of massification and circulation of data thus raises the question of the risk of privacy violations due to the exposure of personal data.

According to the latest report by the American giant McAfee, a survey of cloud service users shows that:

  • 91% of respondents do not encrypt data at rest,
  • 87% do not delete data immediately after closing an account.

While GDPR requires companies to do everything possible to secure the personal data they hold or face heavy fines, anonymization is not a general obligation. But this technique, coupled with AI and automation, is increasingly seen as the most effective means of compliance.

Anonymization allows companies to continue processing data while respecting the rights and freedoms of individuals, and significantly reduces their exposure to potential attacks. It also strengthens system security and reduces the risk of data theft, since anonymized data has no value to a thief.

Where previously GDPR could be a constraint around data, it now becomes an opportunity to better protect oneself.

Raphaël Buchard, former GDPR lawyer, certified DPO, and CEO and co-founder of Dipeeo, will give us the keys to staying GDPR-compliant.

Our technical and business experts at Novelis, Sanoussi Alassan (Data Scientist) and Raphaël Brunel (Data Analyst), will talk about the technical solution we propose: data anonymization coupled with AI for structured data processing and automation for unstructured data processing.

On the agenda of this webinar:

  • GDPR & Compliance
  • Presentation of use cases
  • Knowing the different anonymization methods and equipping yourself with a professional solution
  • Demonstration