
Deciphering the datalake: databases, datawarehouses...

28 September 2022

First introduced around 2010, the datalake still raises many questions. We have gathered the most common ones below to give you the keys to understanding its uses, how it differs from a datawarehouse, and whether to build it on-premise or in the cloud.

What can you do with a datalake? 

The datalake is the place where all of an organisation's data can be stored. It is subject to the regulations applicable to data, in particular the GDPR (RGPD in French), which the CNIL enforces in France.

It serves as a data source or reservoir: data can be stored there for later use. Before setting it up, it is important to decide how it will be used, because the datalake may or may not be relational. This raises the question of whether to use a SQL or a NoSQL database.

Which database should I use: SQL or NoSQL?

NoSQL databases generally have no predefined structure, unlike SQL databases, which are generally relational databases that users query with the SQL language. The best-known SQL databases include MySQL and PostgreSQL, among others. These databases store data using a predefined schema. They scale well vertically; horizontal scaling is also possible, but usually takes more effort.

NoSQL databases, which are generally non-relational, associate data with attributes (or fields) that can be added on demand and then used in queries. They include MongoDB, Apache Cassandra, Redis, Neo4j and Amazon DynamoDB. As the choice is not always easy, it may be advisable to seek help in selecting a database solution.
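The contrast between a predefined schema and fields added on demand can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for a SQL database and plain dictionaries as a stand-in for a document store, and the customers table and its sample records are hypothetical.

```python
import sqlite3

# Relational side: the schema is declared before any data is inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Luxembourg"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Paris"))

# Every row has the same columns, so a query can rely on them.
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Paris",)
    )
]

# A field that is not part of the schema is rejected by the database.
try:
    conn.execute(
        "INSERT INTO customers (name, twitter) VALUES (?, ?)", ("Carol", "@carol")
    )
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True

# Document side (dictionaries standing in for a NoSQL document store):
# each record carries its own fields, which can differ from one record to the next.
documents = [
    {"name": "Alice", "city": "Luxembourg"},
    {"name": "Bob", "city": "Paris", "twitter": "@bob"},
    {"name": "Carol", "tags": ["vip"]},
]

# Queries must tolerate missing fields, hence .get() instead of direct access.
doc_names = [doc["name"] for doc in documents if doc.get("city") == "Paris"]

print(sql_names, schema_enforced, doc_names)
```

In the relational case the database guarantees the shape of the data; in the document case that responsibility moves to the code that reads it.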

What data is involved in a datalake?

You can't talk about a datalake without talking about structured and unstructured data. The datalake can store all data, whether structured or unstructured, unlike a datawarehouse, which requires structured data.  

Structured data

Structured data is qualified, high-quality data that is predefined and formatted, meaning that we know in advance what it contains. For example, it could be a PDF form that follows a defined structure with surname, first name and address fields. This data is stored in its original format, without processing, and is easily searchable. The schemas used to find this data are very often defined in advance in the datawarehouse.

Unstructured data 

This is raw data, in its original format, which is dumped into the datalake and is not labelled. It may be e-mails, posts on social networks or images, for example. Processing this data requires the intervention of experts to prepare it for use according to business needs. In this case, the various business units need to define the key elements to be analysed beforehand.  
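To make the difference concrete, here is a minimal Python sketch. The sample CSV content, the e-mail text and the choice of "order number" as the key element to extract are all hypothetical; real unstructured data usually calls for far more preparation than a single regular expression.

```python
import csv
import io
import re

# Structured data: the fields are known in advance, so a lookup is direct.
structured_csv = "surname,first_name,address\nMartin,Claire,12 Rue du Lac\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
surname = rows[0]["surname"]

# Unstructured data: a raw e-mail body with no labels at all.
raw_email = (
    "Hello, this is Claire Martin. My order 48213 never arrived. "
    "Please call me back."
)

# The business unit has to say beforehand which element matters
# (here, hypothetically, the order number) before it can be extracted.
ORDER_PATTERN = re.compile(r"\border (\d+)\b")
match = ORDER_PATTERN.search(raw_email)
order_number = match.group(1) if match else None

print(surname, order_number)
```

The structured record answers a question by field name, while the e-mail only becomes usable once someone has defined what to look for.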

What is the difference between a datalake and a datawarehouse?  

When it comes to implementing data management tools within a company, we recommend using a datalake or a datawarehouse. In either case, it is above all a strategic choice for the organisation. Each solution has its advantages and disadvantages, and each serves different needs and uses.

The datalake stores both structured and unstructured data, whereas the datawarehouse can only receive structured data. The datalake ingests data quickly and distributes it on the fly. It is agile, but its data is not necessarily of high quality. It acts as a data foundation in which data can be pre-processed: the data is stored and prepared, notably through the use of tags. It can also be used to cross-reference data from different sources to improve data quality.

In a data warehouse, the data is organised by business line and is of good quality. Quality will have been ensured upstream in the datalake. The datawarehouse contains data that has been carefully prepared in advance and is therefore less agile. Reprocessing the information takes time, but the data is of higher quality and more reliable than that in the datalake.  

One nuance to the above: it is possible to create a draft datawarehouse inside the datalake, although this has an impact on uptime and increases the effort required to classify the data. This draft will only be a first-level repository, where the data receives a little cleaning and preparation according to the reference repositories before being transferred to the datawarehouse or master data repository (MDM).
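The path from datalake to datawarehouse described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference architecture: a Python list plays the role of the datalake's raw zone, an in-memory SQLite table plays the role of the datawarehouse, and the records, the prepare function and the sales table are hypothetical.

```python
import sqlite3

# Hypothetical raw zone: heterogeneous records, stored as they arrived.
raw_zone = [
    {"source": "crm", "customer": " Alice ", "amount": "120.50"},
    {"source": "web", "customer": "bob", "amount": "n/a"},  # unusable amount
    {"source": "crm", "customer": "Carol", "amount": "80"},
    "free-text complaint received by e-mail",  # unstructured item
]


def prepare(record):
    """Return a cleaned (customer, amount) tuple, or None if the record is not usable."""
    if not isinstance(record, dict):
        return None  # unstructured items need separate, expert processing
    try:
        customer = record["customer"].strip().title()
        amount = float(record["amount"])
    except (KeyError, ValueError):
        return None
    return customer, amount


# Pre-processing happens in the datalake: only usable records move on.
prepared = [row for row in map(prepare, raw_zone) if row is not None]

# The datawarehouse only accepts data that fits its predefined structure.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", prepared)

total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(prepared, total)
```

The lake accepts everything, including the records that are later discarded; the warehouse only ever sees the rows that survived preparation, which is why its data is more reliable but slower to reprocess.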

Where to build a datalake: on-premise or in the cloud?

Here again, it's mainly a matter of organisations making a choice based on their needs, but also on the skills available in-house.  

On-premise

For this option, the most important question is whether the company has the skills to set up, maintain and enhance the infrastructure. If it does not, and particularly if it lacks the in-house skills to maintain the infrastructure, this choice could prove complicated. The risks include loss of data and availability, technical debt, and the inability to develop new data-related services.

Cloud

If the SaaS option is chosen, maintenance of the infrastructure is included and the company only needs to load the data, process it and query it. Although sometimes more expensive, this option allows you to concentrate on the value-added part. For IaaS and, to a lesser extent, PaaS, the issues are similar to those for on-premise.

Depending on the size of the company, it may be worth setting up its own on-premise infrastructure and investing in upgrading the skills of the staff who will maintain this infrastructure.  

The above questions need to be addressed before launching a data project. Don't hesitate to enlist the support of experts who can help you define your data strategy and implement it. 
