Why Is A Canonical Data Model An Anti-Pattern

A canonical data model is defined in Enterprise Integration Patterns as a way to minimize dependencies when integrating applications that use different data formats. In other words, a component (an application or a service) should communicate with another component through a data format that is independent of both components' own data formats.

Two things to highlight here. First, I define a component as either an application or a service because such models are, or were, used in the context of both application integration and service orientation. Second, the concept is meant solely as a transport data format. You should not use a canonical data format as the internal structure of your data store.

In theory, the advantages are pretty obvious. A canonical data format reduces the coupling between applications, reduces the number of translations to be developed for integrating a set of components, and so on. Pretty interesting, right? One single data model understood by the whole IT landscape, and a set of people (developers, system analysts, business stakeholders etc.) who can share the same vision of a given concept.
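To put a number on the translation argument, here is a minimal sketch that counts one-way translators, assuming every component has to exchange data with every other component. The function names are purely illustrative.

```python
def point_to_point_translators(n: int) -> int:
    """One-way translators needed when each of n components maps directly to every other."""
    return n * (n - 1)


def canonical_translators(n: int) -> int:
    """One-way translators needed when each component only maps to and from a shared format."""
    return 2 * n


for n in (3, 6, 10):
    print(n, point_to_point_translators(n), canonical_translators(n))
```

With three components the two approaches cost the same. The gap only opens up as the landscape grows, which is exactly where the practical problems described next start to bite.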

In practice, though, implementing such a model is rarely efficient.

A person is not the same concept for the marketing department and the support department of an insurance company. A flight in an air traffic management system means something different depending on whether it has just been filed by a pilot or is an ongoing flight in your airspace. A PLM part has a completely different representation depending on whether it has just been designed or has to be maintained by a support team.

In most contexts, designing a canonical data model results in a large data model with a long list of optional attributes and very few mandatory ones (like the identifiers). Even though it was primarily designed to ease component integration, it ends up complicating it. Meanwhile, the model creates a lot of frustration among its users because of its inherent complexity, both to use and to manage. Furthermore, the coupling issue is not solved, just shifted somewhere else. Instead of being coupled to one component's data format, you become tightly coupled to one common data format that is used by the whole IT landscape and subject to very frequent changes.
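As an illustration, here is what such a model tends to look like for the insurance example. This is only a sketch: the class and every attribute name are hypothetical, chosen to show the shape of the problem rather than any real schema.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class CanonicalPerson:
    # The only attribute every department agrees on.
    person_id: str
    # Attributes that only marketing cares about (hypothetical).
    newsletter_opt_in: Optional[bool] = None
    campaign_segment: Optional[str] = None
    # Attributes that only support cares about (hypothetical).
    open_ticket_count: Optional[int] = None
    preferred_contact_channel: Optional[str] = None
    # Attributes that only claims handling cares about (hypothetical).
    policy_number: Optional[str] = None
    last_claim_date: Optional[str] = None


def optional_field_names(model) -> list[str]:
    """Return the names of the fields that have a default, i.e. that a producer may leave unset."""
    return [f.name for f in fields(model) if f.default is not MISSING]


# A support-side producer fills in two attributes out of seven.
person = CanonicalPerson(person_id="p-1", open_ticket_count=2)
print(optional_field_names(CanonicalPerson))
```

Every consumer of this class has to guess which of the optional attributes are actually populated for its use case, and any department adding an attribute changes the type for everyone.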

In a nutshell, in most contexts, a canonical data model should be considered an anti-pattern. So what is the alternative, given that you still want to minimize the dependencies between two components exchanging data?

Domain Driven Design (DDD) recommends introducing the concept of bounded context. A bounded context is simply an explicit context in which a model applies, with clear boundaries with other bounded contexts. Depending on your organization, a bounded context could refer to a functional domain in which an object is used, to the state of the object itself, etc.
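Applied to the person example, a minimal sketch could be one small model per context instead of one shared class. The class names, attributes and the validation rule are assumptions made for illustration.

```python
from dataclasses import dataclass


# Marketing bounded context: a person is a prospect to be targeted.
@dataclass(frozen=True)
class MarketingPerson:
    person_id: str
    campaign_segment: str
    newsletter_opt_in: bool


# Support bounded context: a person is a customer with open requests.
@dataclass(frozen=True)
class SupportPerson:
    person_id: str
    preferred_contact_channel: str
    open_ticket_count: int

    def __post_init__(self) -> None:
        # Each context can enforce its own rules without affecting the other.
        if self.open_ticket_count < 0:
            raise ValueError("open_ticket_count cannot be negative")


prospect = MarketingPerson("p-1", "young_drivers", True)
customer = SupportPerson("p-1", "phone", 2)
```

Both objects describe the same individual, yet every attribute is mandatory within its own context and neither model changes when the other department's needs evolve.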

In that case, the canonical data model as such would simply be the yellow part in the following diagram:

The intersection of A, B and C represents the set of attributes that must be there regardless of the context (basically the mandatory attributes of the previous large CDM). This part should still be carefully designed due to its central role. Yet it is important to remain pragmatic. If in your context it does not make sense to have such a common model, you should simply discard it.
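The idea can be sketched with plain sets. The three attribute sets below are invented for a PLM part and only serve to show how the common core falls out of the contexts rather than being designed up front.

```python
# Hypothetical attribute sets for three bounded contexts of a PLM part.
context_a = {"part_id", "name", "revision", "cad_file"}          # design
context_b = {"part_id", "name", "supplier", "unit_cost"}         # procurement
context_c = {"part_id", "name", "revision", "service_interval"}  # maintenance

# Attributes present in every context: the only candidates for a common model.
shared_core = context_a & context_b & context_c

# Attributes shared by A and C only, to be agreed between those two contexts.
shared_a_c_only = (context_a & context_c) - shared_core

# Attributes that belong to A alone and stay out of any enterprise model.
private_to_a = context_a - (context_b | context_c)

print(sorted(shared_core), sorted(shared_a_c_only), sorted(private_to_a))
```

If the core turns out to be empty or limited to an identifier, that is a good hint that a common model is not worth maintaining.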

The intersections between two contexts (A and B, B and C, C and A) are also important. How do you map one object representation from one context to another? Which attributes should be explicitly shared across two contexts? What are the common business constraints between two contexts? These questions should still be answered by a transverse team, but from a business perspective it now makes sense to raise them. That was not always the case with one single CDM shared across potentially opposite contexts.
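A mapping between two contexts can be as small as one explicit function. The sketch below assumes a design context and a maintenance context for a PLM part; all names and the positivity constraint are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPart:
    part_id: str
    name: str
    revision: str
    cad_file: str


@dataclass(frozen=True)
class MaintenancePart:
    part_id: str
    name: str
    revision: str
    service_interval_hours: int


def to_maintenance(part: DesignPart, service_interval_hours: int) -> MaintenancePart:
    """Map a design-context part to the maintenance context.

    Only the attributes both contexts agreed to share are copied. cad_file stays
    in the design context, and the maintenance-only value comes from the caller.
    """
    if service_interval_hours <= 0:
        raise ValueError("service_interval_hours must be positive")
    return MaintenancePart(part.part_id, part.name, part.revision, service_interval_hours)


rotor = DesignPart("P-1", "main rotor", "C", "rotor_v3.step")
maintained_rotor = to_maintenance(rotor, service_interval_hours=500)
```

The function signature is the answer to the questions above: it states which attributes cross the boundary, which do not, and which constraints apply on the other side.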

The parts that are not shared with other contexts (in white), on the other hand, should not be part of an enterprise data model. You should be pragmatic. For example, if a subset is specific to a given domain, it should be up to the domain experts to model it themselves.

One key challenge, though, is to identify those bounded contexts, and it is worth recalling Conway’s law here:

“Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization’s communication structure.”

Bounded contexts do not necessarily have to map onto the current organization. For example, a bounded context can encompass several departments. Breaking organizational silos should still be an objective for most companies.

In addition, DDD introduces the concept of an Anti-Corruption Layer (ACL). This pattern can serve as a solution for a legacy migration (by introducing an intermediation layer between the old and the new system to prevent data quality issues etc.). In our context, however, corruption refers to the data modeling debt you can introduce when solving short-term problems.

Let us take the example of two systems, each in charge of managing a given state of a PLM part (a part is a physical item produced or purchased and then assembled, like a helicopter rotor for instance). One legacy system (let’s call it SystemA) is in charge of managing the design phase, and you must implement a new system (let’s call it SystemZ) in charge of managing the maintenance phase. The whole IT application landscape shares the same common part identifier (partId) except for SystemA, which is not aware of it. Instead of the partId, SystemA manages its own identifier, systemAId. Because SystemZ needs to call SystemA using systemAId, a tempting shortcut would be to integrate systemAId into the SystemZ data model.

This is a common mistake you should avoid. You have simply corrupted your data model because of a short-term situation.

The ACL pattern could have been a solution here. SystemZ could have implemented its own data format (without any external corruption like the systemAId). Then it would have been up to an intermediation layer to manage the translation between the partId and the systemAId.
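A minimal sketch of such an intermediation layer follows. Only partId, systemAId, SystemA and SystemZ come from the scenario above. The classes, method names, the in-memory lookup table and the shape of the SystemA payload are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemZPart:
    """SystemZ's own model: it only knows the landscape-wide partId."""
    part_id: str
    status: str


class SystemAClient:
    """Stand-in for the legacy SystemA API, which only understands systemAId."""

    def __init__(self, designs: dict[str, dict]):
        self._designs = designs

    def fetch_design(self, system_a_id: str) -> dict:
        return self._designs[system_a_id]


class SystemAAntiCorruptionLayer:
    """Translates between partId and systemAId so that SystemZ never sees systemAId."""

    def __init__(self, client: SystemAClient, part_to_system_a: dict[str, str]):
        self._client = client
        self._part_to_system_a = part_to_system_a

    def design_for(self, part: SystemZPart) -> dict:
        system_a_id = self._part_to_system_a[part.part_id]
        raw = self._client.fetch_design(system_a_id)
        # Drop the legacy identifier and expose the shared one instead.
        design = {key: value for key, value in raw.items() if key != "systemAId"}
        design["partId"] = part.part_id
        return design


client = SystemAClient({"A-0042": {"systemAId": "A-0042", "name": "main rotor"}})
acl = SystemAAntiCorruptionLayer(client, {"P-1": "A-0042"})
design = acl.design_for(SystemZPart("P-1", "in_service"))
```

The identifier mapping lives in one place. When SystemA is eventually retired or learns about partId, only this layer changes and the SystemZ model stays untouched.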

Applied to our topic, the ACL pattern calls for a layer between two different bounded contexts. A component does not know how to call another component that is outside its own bounded context. Instead, a component only knows how to map its own data structure onto the data format of the bounded context it belongs to.

By the way, here is a rule of thumb: a component should belong to only one bounded context. This is also why DDD is a great fit for a microservices architecture. Because of the fine granularity of the services, it is easier to comply with this rule.

To summarize, a canonical model as such should be considered an anti-pattern in most cases. You should try to implement the bounded context concept instead, meaning one model per context with explicit boundaries between the contexts. Two components that are part of different bounded contexts should communicate through an Anti-Corruption Layer to prevent data modeling corruption.
