Domain Oriented Distributed Data Architecture — Data Mesh

Chanukya Pekala
4 min read · Nov 13, 2021
[Header image credit: trginternational blog]

In this article, I will briefly introduce one of the latest industry trends in data management: the data mesh.

Even today, big enterprises show a lot of demand for a “centralized” data lake, where datasets from various domains are expected to land in a single location, with all transformations already applied, ready for consumption by downstream applications and other use cases.

In principle, most data analytics units follow a setup like this:
1) a data ingestion step, which extracts structured/unstructured raw data from different source systems
2) a data transformation step, which performs cleansing, transformations and aggregations on top of the raw data
3) a data consumption step, which lets business users or analysts run queries for reports and dashboards, implement ML models to derive patterns and, in some cases, share the data over APIs
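The three steps above can be sketched as a tiny pipeline. This is a minimal illustration, not any particular platform’s API — the function names, the `Record` type and the cleansing rule (drop negative amounts) are all assumptions made for the example:

```python
# Minimal sketch of ingest -> transform -> consume (all names illustrative).
from dataclasses import dataclass

@dataclass
class Record:
    domain: str
    amount: float

def ingest(source_rows):
    """Step 1: extract raw rows from a source system into typed records."""
    return [Record(domain=r["domain"], amount=float(r["amount"])) for r in source_rows]

def transform(records):
    """Step 2: cleanse (drop negative amounts) and aggregate per domain."""
    totals = {}
    for rec in records:
        if rec.amount >= 0:
            totals[rec.domain] = totals.get(rec.domain, 0.0) + rec.amount
    return totals

def consume(totals):
    """Step 3: expose the aggregates in a stable order for reports."""
    return sorted(totals.items())

raw = [{"domain": "mortgages", "amount": "100.0"},
       {"domain": "markets", "amount": "-5.0"},
       {"domain": "mortgages", "amount": "50.0"}]
result = consume(transform(ingest(raw)))
print(result)  # [('mortgages', 150.0)]
```

In the centralized-lake model, one platform team runs all three steps for every domain; the rest of the article questions exactly that arrangement.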

This setup is what most organizations are used to. Large organizations will have several sources for a single domain (e.g., mortgages or markets — n sources feeding multidimensional information about each domain into the data lake), and several other domains will share the same space in the centralized data lake. This centralized lake is staffed by data engineers who are responsible for transformations and enrichments across “all” datasets, irrespective of domain, which is unlikely to be the best approach. This was the pattern with Enterprise Data Warehouses (EDWs) in the past, and it has now carried over to the Enterprise Data Lake (EDL).

We face a few problems in continuing this setup as huge volumes of data pour in with every event, transaction and search.

Looking at some specific ones:
1) Rapid increase in the number of sources and users — we lose the ability to understand domain-specific data, because the ingest/staging layers are composed of multiple sources with cross-domain datasets. As the number of sources per domain grows, this limits the value we can get out of the ingested data. Similarly, data consumers/analysts/users scanning for specific information may struggle to derive the desired insights in their domain.
2) Data-as-a-product — ETL pipelines often serve most use cases across different domains, but things get complicated when specific questions are raised about a particular domain’s transformation logic. The alternative is a product team owning an entire slice of a business area or domain: this team handles the end-to-end solution of creating pipelines, maintaining quality and, most importantly, usage/consumption. Its engineers have domain-level knowledge in addition to owning and maintaining the complex Spark transformation code.
3) Data availability, self-serve — if a slice of a business area is handled end-to-end by a single team, end users can get insights about the relevant features without relying on centralized data platform engineers to create a virtualized or consumption layer for BI requirements. End users can then build the relevant insights through Power BI dashboards on their own, since they understand the data and the domain, and if needed they know whom to reach out to for ETL problems.
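The data-as-a-product idea can be made concrete as a published contract: the domain team ships its data together with a schema and a named owner, and rejects rows that break the contract. The class, field names and email address below are all hypothetical, chosen only to illustrate the ownership model:

```python
# Hypothetical "data as a product" contract: the domain team publishes
# data together with its schema, owner and a basic quality guarantee.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    domain: str
    owner: str                      # whom consumers reach out to
    schema: dict                    # column -> type: the published contract
    rows: list = field(default_factory=list)

    def publish(self, row: dict) -> bool:
        """Accept a row only if it matches the published schema exactly."""
        ok = set(row) == set(self.schema) and all(
            isinstance(row[col], typ) for col, typ in self.schema.items()
        )
        if ok:
            self.rows.append(row)
        return ok

mortgages = DataProduct(
    domain="mortgages",
    owner="mortgages-data-team@example.com",   # illustrative address
    schema={"loan_id": str, "principal": float},
)
accepted = mortgages.publish({"loan_id": "L-1", "principal": 250_000.0})
rejected = mortgages.publish({"loan_id": "L-2"})  # missing column
```

The point is not the validation logic itself but who runs it: the domain team, not a central platform team, owns the schema and answers the consumers’ questions.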

To gain control over these domains, enterprises need to shift from centralized to decentralized data lakes: creating fragmented data products, isolated by domain, leverages the capabilities of a domain-oriented architecture.

[Diagram: domains as independent nodes in a data mesh — image courtesy of Monte Carlo]

This is called a data mesh, a concept developed at ThoughtWorks by Zhamak Dehghani.

High Level Overview of Data Mesh and its Advantages
1) Each node in the mesh serves the ETL processes of its own domain, reading data from that domain’s sources.
2) Each data product team manages its own platform, with data product owners and data engineers; the team’s success can be measured by how discoverable its data is and how well end users put it to use in that domain.
3) Instead of a huge monolith acting as a one-stop shop for every dataset across domains, we have independently managed nodes, each representing a domain. For example, Mortgages has its own data domain setup, which handles ingestion, transformation and, more importantly, consumption of its data. Similarly, there can be multiple domains like Markets, Cards, etc.
4) As the diagram above shows, Domain 1 can represent “mortgages”, Domain 2 “markets”, and so on for several other domains.
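The node-per-domain structure can be sketched as independently managed nodes behind a thin discovery layer. The `DomainNode`/`Mesh` classes and the routing rule are assumptions made for illustration, not a real framework:

```python
# Illustrative sketch: each domain node runs ETL over its own sources only,
# and a thin discovery layer routes consumers to the owning node.
class DomainNode:
    def __init__(self, domain, sources):
        self.domain = domain
        self.sources = sources      # this domain's own sources
        self.tables = {}

    def run_etl(self):
        """Ingest and cleanse only this domain's sources (drop null rows)."""
        for name, rows in self.sources.items():
            self.tables[name] = [r for r in rows if r is not None]

class Mesh:
    """Discovery layer: maps a domain name to the node that owns it."""
    def __init__(self):
        self.nodes = {}

    def register(self, node):
        self.nodes[node.domain] = node

    def query(self, domain, table):
        return self.nodes[domain].tables[table]

mesh = Mesh()
for domain, sources in {
    "mortgages": {"loans": [{"id": 1}, None]},
    "markets": {"trades": [{"id": 7}]},
}.items():
    node = DomainNode(domain, sources)
    node.run_etl()
    mesh.register(node)

print(mesh.query("mortgages", "loans"))  # [{'id': 1}]
```

Each node can fail, scale or evolve its transformation logic independently — the mesh only needs to agree on how nodes are discovered.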

Fine! There are still some open issues:
1) Attribute correlation across domains — for example, an address field may be represented differently in different domains. This is a concern when no standardization is in place, so the author recommends modeling such entities as federated entities with a unique, global identity.
2) When we want to build, say, a base dataset that combines multiple domains by applying filters and aggregations, the commonly used entities or keys need standardization.
Such standardizations should be part of the data governance implementation.
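One way to picture such a federated entity: each domain keeps its own address representation, but all domains derive the same standardized global key from it, using a rule owned by governance. The normalization and hashing scheme below is an illustrative choice, not a prescribed standard:

```python
# Sketch of a federated entity with a unique, global identity: domains
# store addresses however they like, but share one governance-owned key.
import hashlib

def global_address_id(street: str, postcode: str, country: str) -> str:
    """Normalize the fields, then hash them into a stable global id."""
    normalized = "|".join(p.strip().lower() for p in (street, postcode, country))
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# Two domains storing the "same" address differently still agree on the key:
mortgages_id = global_address_id(" 1 Main St ", "AB1 2CD", "UK")
markets_id = global_address_id("1 main st", "ab1 2cd", "uk")
assert mortgages_id == markets_id  # cross-domain joins can use this key
```

With such a shared key, the cross-domain filters and aggregations mentioned above become ordinary joins, without forcing every domain into one canonical address format.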

So, will this data mesh setup solve “all” enterprise-wide data management problems? Yes and no. We need to understand what fits our organization best, but thinking in terms of this fragmented, domain-oriented data architecture will definitely help in implementing next-gen data products!
