Deciphering the datalake: databases, datawarehouses... - DEEP

28 September 2022

First introduced in the 2000s, the datalake still raises many questions. We have gathered the most common ones below to give you the keys to understanding its uses, how it differs from a datawarehouse, and whether to build it on premise or in the cloud.

What can you do with a datalake?

The datalake is the place where all of an organisation's data can be stored. It is subject to the regulations applicable to data, in particular the GDPR (RGPD in French), which the CNIL enforces in France.

It serves as a data source or reservoir: data can be stored there for later use. Before setting it up, it is important to decide how it will be used, because the datalake may or may not be relational. This raises the question of whether to use an SQL or a NoSQL database.

Which database should I use: SQL or NoSQL?

NoSQL databases generally have no predefined structure, unlike SQL databases, which are generally relational databases that users query with the SQL language. The best-known SQL databases include MySQL and PostgreSQL, but there are others. These databases store data using a predefined schema. They scale vertically and, with more effort (replication or sharding, for example), horizontally to adapt to the volume of data.

NoSQL databases, which are generally non-relational, associate data with attributes (or fields) that are added on demand, as the data arrives, and can then be used in queries. Examples include MongoDB, Apache Cassandra, Redis, Neo4j and Amazon DynamoDB. As the choice is not always easy, it may be advisable to seek help in selecting a database solution.
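To make the contrast concrete, here is a minimal sketch in Python. It assumes SQLite (from the standard library) as the SQL database and a plain list of dictionaries as a stand-in for a document-oriented NoSQL store. The table, the field names and the `find` helper are hypothetical and only serve the illustration.

```python
import sqlite3

# SQL side: the schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("INSERT INTO customer (name, city) VALUES (?, ?)", ("Ada", "Luxembourg"))


def insert_with_unknown_column() -> bool:
    """Return True if the database accepts a column the schema does not declare."""
    try:
        conn.execute(
            "INSERT INTO customer (name, loyalty_tier) VALUES (?, ?)", ("Bob", "gold")
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement: 'loyalty_tier' is not part of the schema.
        return False


# NoSQL side (stand-in): each document carries its own fields, added on demand.
documents = []
documents.append({"name": "Ada", "city": "Luxembourg"})
documents.append({"name": "Bob", "loyalty_tier": "gold"})  # new field, no migration


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]
```

In this sketch, `insert_with_unknown_column()` fails because the relational schema has to be altered first, while `find(documents, loyalty_tier="gold")` works immediately on the schemaless collection. A real NoSQL database would add indexing, persistence and distribution on top of this idea.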

What data is involved in a datalake?

You can't talk about a datalake without talking about structured and unstructured data. The datalake can store all data, whether structured or unstructured, unlike a datawarehouse, which requires structured data.

Structured data

Structured data is qualified, high-quality data that is predefined and formatted, meaning that we know in advance what it contains. For example, it could be a form (such as a PDF file with a defined structure) containing surname, first name and address fields. This data is stored in its original format and is not processed. It is easily searchable. The schemas used to find this data are very often defined in advance in the datawarehouse.

Unstructured data

This is raw data, in its original format, which is dumped into the datalake and is not labelled. It may be e-mails, posts on social networks or images, for example. Processing this data requires the intervention of experts to prepare it for use according to business needs. In this case, the various business units need to define the key elements to be analysed beforehand.
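As an illustration of that preparation step, the following sketch turns a raw e-mail into a small structured record. It assumes Python's standard `email` module. The sample message, the `prepare_email` function and the choice of key elements (sender, subject, whether damage is mentioned) are hypothetical and stand for whatever the business units would define.

```python
from email import message_from_string

# A raw, unlabelled e-mail as it might sit in the datalake (invented sample).
RAW_EMAIL = """\
From: client@example.com
Subject: Delivery problem with order 1042

Hello, my order 1042 arrived damaged. Please call me back.
"""


def prepare_email(raw: str) -> dict:
    """Extract the key elements defined beforehand by the business unit."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "mentions_damage": "damaged" in body.lower(),
    }


record = prepare_email(RAW_EMAIL)
```

The raw e-mail stays untouched in the datalake. Only the derived `record` has a known structure, which is what makes it searchable like structured data.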

What is the difference between a datalake and a datawarehouse?

When it comes to implementing data management tools within a company, we recommend the use of a datalake or datawarehouse. In all cases, it is above all a strategic choice for organisations. Each solution has its advantages and disadvantages. They each serve different needs and uses.

The datalake is a repository for structured and unstructured data, whereas the datawarehouse can only receive structured data. The datalake ingests data quickly and distributes it on the fly. It is agile, but the data is not necessarily of high quality. It is a data foundation that enables data to be pre-processed: the data is stored and prepared with the help of tags. It can also be used to cross-reference data from different sources to improve data quality.
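The idea of storing data as-is and preparing it with tags can be sketched as follows. This is a toy in-memory model, not a real datalake product: the `DataLake` and `LakeEntry` classes, the source names and the tags are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LakeEntry:
    source: str      # system the data came from
    payload: object  # raw content, stored untouched whatever its format
    tags: set = field(default_factory=set)


class DataLake:
    """Toy datalake: ingest anything quickly, organise it later with tags."""

    def __init__(self):
        self.entries = []

    def ingest(self, source, payload, tags=()):
        entry = LakeEntry(source, payload, set(tags))
        self.entries.append(entry)
        return entry

    def by_tag(self, tag):
        """Cross-reference entries from different sources sharing a tag."""
        return [entry for entry in self.entries if tag in entry.tags]


lake = DataLake()
lake.ingest("crm", {"email": "ada@example.com", "name": "Ada"}, tags={"customer"})
lake.ingest(
    "support",
    "From: ada@example.com\n\nMy order arrived damaged.",
    tags={"customer", "complaint"},
)
lake.ingest("web", b"\x89PNG...", tags={"image"})  # binary data is accepted too
```

Here `lake.by_tag("customer")` returns both the structured CRM record and the unstructured support e-mail, which is the kind of cross-referencing between sources described above.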

In a datawarehouse, the data is organised by business line and is of good quality, because quality will have been ensured upstream in the datalake. The datawarehouse contains data that has been carefully prepared in advance and is therefore less agile. Reprocessing the information takes time, but the data is of higher quality and more reliable than that in the datalake.
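A minimal sketch of that upstream quality step, assuming SQLite as a stand-in for the datawarehouse: raw rows are cleaned and validated before being loaded into a typed table organised by business line. The `sales` table, the sample rows and the `clean` and `load_warehouse` functions are hypothetical.

```python
import sqlite3

# Raw rows as they might come out of the datalake (invented sample).
RAW_ROWS = [
    {"customer": " Ada ", "amount": "120.50", "business_line": "retail"},
    {"customer": "Bob", "amount": "n/a", "business_line": "retail"},  # invalid amount
    {"customer": "Cleo", "amount": "80", "business_line": "wholesale"},
]


def clean(row):
    """Return a typed, trimmed row, or None if it fails validation."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return (row["customer"].strip(), amount, row["business_line"])


def load_warehouse(rows):
    """Load only the rows that pass cleaning into a strictly typed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, business_line TEXT NOT NULL)"
    )
    cleaned = [r for r in map(clean, rows) if r is not None]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    return conn


warehouse = load_warehouse(RAW_ROWS)
totals = dict(
    warehouse.execute(
        "SELECT business_line, SUM(amount) FROM sales GROUP BY business_line"
    )
)
```

The rejected row never reaches the warehouse, so the totals per business line can be trusted. The price is the extra preparation time, which is the loss of agility mentioned above.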

One nuance to the above: it is possible to create a draft datawarehouse inside the datalake. However, this has an impact on uptime and increases the effort required to classify the data. This draft datawarehouse will also only be a first level of repository: a bit of cleaning and preparation of the data according to the reference datasets, before transferring it to the datawarehouse or to the master data management (MDM) repository.

Where to build a datalake: on Premise or in the Cloud?

Here again, it's mainly a matter of organisations making a choice based on their needs, but also on the skills available in-house.

On Premise

For this option, the most important thing will be to know whether companies have the skills to set up, maintain and enhance the infrastructure. If this is not the case, and particularly if there is a lack of in-house skills to maintain the infrastructure, this choice could prove complicated. The risks include loss of data and availability, technical debt, and the impossibility of developing new data-related services.

Cloud

If the SaaS option is chosen, maintenance of the infrastructure is included and the company only needs to load the data, process it and query it. Although sometimes more expensive, this option allows you to concentrate on the value-added part. For IaaS and, to a lesser extent, PaaS, the issues are similar to those for On Premise.

Depending on the size of the company, it may be worth setting up its own on-premise infrastructure and investing in upgrading the skills of the staff who will maintain this infrastructure.

The above questions need to be addressed before launching a data project. Don't hesitate to enlist the support of experts who can help you define your data strategy and implement it.
