Data (lake, warehouse, mart) principles and guidelines

Following these can keep your data stores from getting out of control.

DATA ENGINEERING

9/9/2024 · 3 min read

We looked at what a data lake, a data warehouse and a data mart are in the previous article. In this article we will look into the guidelines, principles and best practices to build and maintain these data stores. Even though there are lots of differences between them, they do share some common principles and best practices.

Datalake:
  • Use immutable storage for all the data received from different sources. This also allows you to rebuild the state of your business at a point in time, which can be useful for testing predictive models.

  • Apply data masking to make sure PII and other sensitive data is protected before it reaches the storage layer (see the masking sketch after this list).

  • Decoupling storage and compute resources is one of the key design goals of a cloud-based data lake; it enables you to optimise costs, innovate quickly, and experiment with different processing technologies and vendors.

  • Make sure governance and security are in place by following the steps below:

  1. Appoint a person/team to be accountable for the data lake.

  2. Maintain a centralised data catalog and metadata.

  3. Define an appropriate user and responsibility matrix.

  4. Perform regular audits and supervision of storage and computation.

  • Apply cost optimisation techniques based on usage: rarely accessed data can be archived to low-cost storage tiers.

  • Apply the appropriate retention policy based on legal obligations and cost optimisation.
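
Since the masking point above is the most code-friendly one, here is a minimal sketch of masking PII before it lands in the raw zone. It is plain Python, not any particular vendor's tooling; the column names and the secret key are hypothetical, and in practice the key would live in a secrets manager.

```python
import hashlib
import hmac

# Columns treated as PII in this hypothetical feed.
PII_COLUMNS = {"email", "phone", "national_id"}
SECRET_KEY = b"rotate-me-regularly"  # hypothetical; store in a secrets manager

def mask_record(record: dict) -> dict:
    """Return a copy of the record with PII columns replaced by a keyed hash.

    HMAC-SHA256 keeps masked values consistent across loads, so joins on the
    masked column still work, while the raw value never reaches the lake.
    """
    masked = {}
    for column, value in record.items():
        if column in PII_COLUMNS and value is not None:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            masked[column] = digest.hexdigest()
        else:
            masked[column] = value
    return masked

# Example: mask an incoming record before writing it to immutable storage.
incoming = {"customer_id": 42, "email": "jane@example.com", "country": "NL"}
print(mask_record(incoming))
```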

Datawarehouse:

A datawarehouse generally consists of multiple layers: a Raw/Staging layer (the data received from source systems in its raw format), an Enterprise/Source layer (the data transformed from the raw layer into a single consistent format such as Parquet or Avro) and an Integrated/Prepared layer (the lowest-granularity data, integrated across the different source systems).
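
To make the raw-to-Enterprise step concrete, below is a hedged sketch of the format conversion only: a raw CSV drop is rewritten as Parquet with no other transformation, so the Enterprise layer stays a faithful copy of the source. It assumes pandas with a Parquet engine (pyarrow) installed; the paths and file names are hypothetical.

```python
import pandas as pd  # assumes pandas plus pyarrow are installed

# Hypothetical layer locations.
RAW_PATH = "warehouse/raw/crm/customers_2024-09-09.csv"
ENTERPRISE_PATH = "warehouse/enterprise/crm/customers.parquet"

# Read the raw drop exactly as delivered: every column as string, no coercion.
raw = pd.read_csv(RAW_PATH, dtype=str)

# Only the format changes (CSV -> Parquet); no filtering or business logic here.
raw.to_parquet(ENTERPRISE_PATH, index=False)
```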

Below are some of the principles and guidelines related to these layers, either specifically or in general.

  • Any data which lands in the raw/staging layer should be from an authorised/golden source.

  • There should be an agreed SLA for the data, with controls around it to make sure a missed delivery is detected.

  • An accountable source support team should be identified so it can be contacted for any contract breach, such as a format change or a missed SLA.

  • The data landed should adhere to compliance and classification rules such as GDPR, PCI-DSS and any applicable regulatory requirements.

  • Data received in the raw/staging layer, as well as in the transformed Enterprise layer, should be a true and complete representation of the source system, with no transformation beyond the format change.

  • There should not be duplicate data from the same source system.

  • An agreed and consistent format, such as Parquet or Avro, should be applied to the Enterprise layer.

  • The integrated layer should always be prepared from the Enterprise layer data to avoid any inconsistency or data quality issues.

  • Data in the integrated layer should be kept at the lowest granularity possible.

  • A centralised catalog should be maintained, along with business and technical lineage across the layers.

  • All tables or data files should carry the mandatory datawarehouse columns, such as loaddate, isactive, createdon and updatedon.

  • Appropriate partitioning logic should be applied across the layers for efficient data fetching/processing.

  • Access should be granted based on individual users, system users, user groups and functional groups to avoid uncontrolled access to, or abuse of, the data.

  • A consistent physical structure should be followed across the layers; this avoids duplicate or inconsistent data loading and makes data exploration more efficient.

  • Apply appropriate SCD logic to the data to make sure the historic and current states are maintained (a Type 2 sketch follows this list).

  • A dedicated data governance team should be nominated to control/audit the data both before and after it is loaded.

  • Apply the appropriate retention policy based on legal obligations and cost optimisation.
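
As promised above, here is a minimal SCD Type 2 sketch in plain Python (the table is just a list of dicts and all column names are hypothetical): when an attribute changes, the current row is closed off via isactive and a new active row is appended, so both historic and current states are preserved.

```python
from datetime import date

def scd2_upsert(rows: list[dict], incoming: dict, key: str = "customer_id") -> list[dict]:
    """Apply a minimal SCD Type 2 change: close the current row, append a new one.

    Each row carries the mandatory columns mentioned above: loaddate and isactive.
    """
    for row in rows:
        if row[key] == incoming[key] and row["isactive"]:
            if all(row.get(c) == v for c, v in incoming.items()):
                return rows  # nothing changed, keep history as-is
            row["isactive"] = False  # close off the now-historic version
            break
    rows.append({**incoming, "loaddate": date.today().isoformat(), "isactive": True})
    return rows

# Example: a customer moves city; the old row is kept inactive, the new one is active.
history = [{"customer_id": 1, "city": "Delft", "loaddate": "2024-01-01", "isactive": True}]
history = scd2_upsert(history, {"customer_id": 1, "city": "Utrecht"})
print(history)
```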

Datamart:

Datamarts are often created using data from the integrated/enterprise layer, directly from the source systems, or from a combination of the datawarehouse and the source systems. The guidelines and principles below can be applied while building datamarts:

  • Make sure a datamart is not dependent on other datamarts, as datamarts are short-lived data sources that can be deleted/archived once they no longer provide business value.

  • Appropriate naming and physical structure should be followed.

  • A strict access mechanism should be followed to make sure the datamart is not over-utilised or abused in a way that impacts the actual business use case.

  • Standard access layer must be defined such as approved SQL clients, dashboards, report extraction processes etc.

  • An appropriate data modelling design, such as a Star or Snowflake schema, should be followed (see the star-schema sketch after this list).

  • Apply appropriate SCD logic to the data to make sure the historic and current states are maintained.

  • Apply the appropriate retention policy based on legal obligations and cost optimisation.
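
To illustrate the star schema point above (all table and column names are hypothetical), the sketch below models one fact table holding measures and foreign keys to two dimension tables, then answers a typical mart question by joining fact to dimensions.

```python
import pandas as pd

# Dimension tables: descriptive attributes keyed by a surrogate key.
dim_product = pd.DataFrame({"product_key": [1, 2], "product_name": ["Book", "Pen"]})
dim_date = pd.DataFrame({"date_key": [20240909], "month": ["2024-09"]})

# Fact table: measures plus a foreign key to each dimension (the "star").
fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2],
    "date_key": [20240909, 20240909, 20240909],
    "amount": [10.0, 15.0, 2.5],
})

# A typical mart query: revenue per product per month.
report = (
    fact_sales
    .merge(dim_product, on="product_key")
    .merge(dim_date, on="date_key")
    .groupby(["product_name", "month"], as_index=False)["amount"].sum()
)
print(report)
```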

Hopefully I have covered most of the points I could think of; please let me know if there is anything I have missed or that can be improved further. In the next article, I will cover the complexities and challenges faced in big-data engineering.
