The Modern Data Engineer Part 3: Modeling for Analytics, Not Transactions
Data professionals often default to normalized semantic models, and it tends to come back to bite them. Here is why the pattern keeps showing up, and what actually works better for analytics.
In Part 1, I argued that understanding data is the core skill. In Part 2, I explored how treating data as cargo instead of state changes how you build pipelines.
Part 3 is about a pattern I keep running into: teams default to normalized semantic models.
Normalization solves write-correctness problems. Analytics is mostly a read problem.
That mismatch is where agility dies. Not because normalization is wrong, but because it solves a different problem than the one analytics platforms actually have.
1. The Pattern: “Let’s Normalize Everything”
It tends to start the same way on most platform projects. Someone (often the most experienced person in the room) says: “We need a proper data model. Something sustainable. Something that will stand the test of time.”
Then comes a Data Vault. Hubs for every business entity. Links for every relationship. Satellites tracking every attribute change over time. Perfectly normalized. Completely traceable. Theoretically flexible.
Or third normal form. Every entity gets its own table. No redundancy. No duplication. Clean, academically correct.
The justification is usually the same: “This is proper data modeling. This is how you build something that lasts.”
The problem is structural: databases are optimized for writes and transactional correctness. Platforms are optimized for analytical reads, and those are almost opposite design goals.
Databases need to prevent update anomalies. Platforms need to answer “what was revenue last quarter?” quickly, without rebuilding meaning through chains of joins.
When a platform gets designed like a database, it ends up optimized for the wrong thing.
2. Why This Keeps Happening
The pattern comes from somewhere real. Most data professionals were trained on relational database theory. They learned normalization as a best practice, and internalized the idea that duplication is bad, dependencies are dangerous, and proper design means clean, normalized structures.
That training is right for OLTP systems. It prevents update anomalies. It ensures consistency. It makes writes safe and predictable.
It also does not go away. A lot of the engineers building data platforms today came up writing application code against relational databases. They still do, for parts of their job. The reflex that redundancy is a code smell is deeply embedded, because in that world it genuinely is.
But there is a deeper reason this pattern persists: business logic disagreements.
In a normalized model, semantic ambiguity hides in the joins. If nobody agrees on whether “active user” includes trial signups, each analyst makes their own assumption at query time. Different reports return different numbers, and the inconsistency stays hidden until it becomes a business problem.
Denormalization removes that hiding place. Teams have to define metrics up front, materialize those definitions, and align before the metric spreads across dashboards.
That discomfort is exactly why teams avoid it. Normalized complexity lets organizations postpone hard conversations about what metrics actually mean.
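To make the hiding place concrete, here is a minimal sketch of the ambiguity, using an invented schema and invented thresholds (none of this comes from a real system): two analysts ask "how many active users do we have?", each bakes a different definition into their own query, and both answers are technically valid.

```python
import sqlite3

# Hypothetical data for illustration: user plans and recency of last login.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (user_id INTEGER, plan TEXT, last_login_days_ago INTEGER);
INSERT INTO users VALUES
  (1, 'paid', 3), (2, 'trial', 1), (3, 'paid', 45), (4, 'trial', 2);
""")

# Analyst A: "active" = logged in within 30 days, any plan (trials included).
a = con.execute(
    "SELECT COUNT(*) FROM users WHERE last_login_days_ago <= 30"
).fetchone()[0]

# Analyst B: "active" = logged in within 30 days, paying plans only.
b = con.execute(
    "SELECT COUNT(*) FROM users "
    "WHERE last_login_days_ago <= 30 AND plan = 'paid'"
).fetchone()[0]

print(a, b)  # 3 1 -> two different "active user" counts, two dashboards
```

Neither query is wrong. The model simply never forced anyone to pick one definition.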
There is also a vendor angle worth naming. Data Vault has built a serious ecosystem around itself: certifications, consulting practices, books, conferences, thought leaders. It gets marketed as the answer to every platform problem: flexibility, auditability, traceability, all in one model. It does deliver on those promises. But it also delivers a lot of complexity, the kind that makes platforms slow, rigid, and harder to use over time.
3. The Hidden Cost: Complexity as Technical Debt
Structural complexity is a form of technical debt: the architecture forces people to understand model internals before they can answer a business question.
Normalized models create that debt by default. Every business question requires assembling multiple tables. To see customer orders, join customers, orders, order lines, and products. To track address history, navigate satellite tables with additional joins.
Analysts end up needing to understand the full model before they can write a working query: which hub holds identity, which link connects entities, which satellite holds history. That is a lot of upfront investment just to answer a simple question.
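A small sketch of the difference, with an invented four-table schema standing in for the normalized model (table and column names are illustrative, not from any real platform): answering "revenue per customer" takes three joins against the normalized structure, versus a single scan of a pre-joined wide table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER,
                          product_id INTEGER, quantity INTEGER);
CREATE TABLE products    (product_id INTEGER PRIMARY KEY, price REAL);

INSERT INTO customers   VALUES (1, 'Acme');
INSERT INTO orders      VALUES (10, 1);
INSERT INTO order_lines VALUES (100, 10, 7, 3);
INSERT INTO products    VALUES (7, 25.0);
""")

# Normalized: "revenue per customer" needs four tables and three joins.
normalized = con.execute("""
    SELECT c.name, SUM(ol.quantity * p.price) AS revenue
    FROM customers c
    JOIN orders o       ON o.customer_id = c.customer_id
    JOIN order_lines ol ON ol.order_id   = o.order_id
    JOIN products p     ON p.product_id  = ol.product_id
    GROUP BY c.name
""").fetchall()

# Denormalized: the joins are paid once, at build time, into a wide table.
con.execute("""
    CREATE TABLE order_facts AS
    SELECT c.name AS customer_name, ol.quantity * p.price AS line_revenue
    FROM customers c
    JOIN orders o       ON o.customer_id = c.customer_id
    JOIN order_lines ol ON ol.order_id   = o.order_id
    JOIN products p     ON p.product_id  = ol.product_id
""")
denormalized = con.execute("""
    SELECT customer_name, SUM(line_revenue)
    FROM order_facts GROUP BY customer_name
""").fetchall()

print(normalized)    # [('Acme', 75.0)]
print(denormalized)  # [('Acme', 75.0)]
```

Same answer either way; the difference is who pays the join cost, and when.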
Adding a new metric becomes a multi-step project. More tables to touch, more joins to keep consistent, more failure points to test.
Platform teams end up as gatekeepers. Analysts submit requests, and the team translates them into SQL, because navigating the normalized structure requires specialist knowledge. The gatekeeper becomes a bottleneck, and agility quietly dies.
This is technical debt, just not the kind that is easy to recognize. The code looks clean. The model is “correct.” But the system is slow, hard to change, and inaccessible to the people who need it most.
4. The Flexibility Assumption
The common pitch for normalized models is flexibility. “Design it right once, and you can answer any question.”
But flexibility in analytics has two meanings. One is storage flexibility: can we store new attributes safely? The other is consumption flexibility: can analysts answer new questions quickly and consistently?
Normalized models deliver the storage kind. Analytics needs the consumption kind.
Adding a new column to a satellite table is easy. Answering a new business question without navigating four extra joins is the hard part.
That is why the flexibility claim breaks down in practice. The model is structurally flexible for writes, but operationally rigid for reads.
5. The Medallion Problem
The Medallion architecture (Bronze, Silver, Gold) has become the default framing for many modern platforms, and it inherits some of the same issues. Bronze makes sense as an immutable raw layer. Gold makes sense as a denormalized consumption layer. Silver is where the normalized-thinking trap resurfaces: it tries to apply business rules while staying source-aligned, and those two goals pull in opposite directions.
The fundamental question Silver must answer is: should it be cleaned-but-neutral, or should it encode semantic opinions?
If Silver is just cleaned raw data (deduplicated, typed, quality-checked), then business logic still lives in Gold or in application code. That delays decisions but preserves flexibility.
If Silver tries to be “business-ready” while staying source-aligned, it ends up as a kind of normalized limbo: structured enough to be confusing, opinionated enough to be rigid. That is where the normalized-mindset trap really bites.
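A sketch of what cleaned-but-neutral Silver looks like, with invented raw records and an invented quarantine policy: deduplicate, type-cast, and quality-check, but encode no business opinion (no "active customer" logic lives at this step).

```python
from datetime import date

# Hypothetical Bronze records as they might land: stringly typed,
# with a duplicate and a bad value mixed in.
bronze = [
    {"id": "1", "signup": "2024-01-05", "amount": "100.0"},
    {"id": "1", "signup": "2024-01-05", "amount": "100.0"},  # duplicate
    {"id": "2", "signup": "2024-02-10", "amount": "bad"},    # fails check
    {"id": "3", "signup": "2024-03-01", "amount": "50.5"},
]

def to_silver(rows):
    seen, out = set(), []
    for r in rows:
        key = (r["id"], r["signup"])
        if key in seen:
            continue                                         # deduplicate
        seen.add(key)
        try:
            out.append({
                "id": int(r["id"]),
                "signup": date.fromisoformat(r["signup"]),   # type-cast
                "amount": float(r["amount"]),                # quality gate
            })
        except ValueError:
            pass  # a real pipeline would quarantine, not silently drop
    return out

silver = to_silver(bronze)
print([r["id"] for r in silver])  # [1, 3]
```

Everything here is mechanical and source-aligned. The moment a rule like "exclude trial customers" appears in this step, Silver has started making semantic claims, and the limbo described above begins.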
The gap between Silver and Gold determines whether the platform stays fast and maintainable or slowly turns into the bottleneck this whole post is about. That gap deserves its own treatment, so I cover it in Part 4A and Part 4B.
6. What Actually Works: Model for Consumption
What works is a layered model with clear intent:
- Bronze and Silver: source-aligned, cleaned, and trustworthy.
- Semantic integration layer (denormalized): core business entities and definitions materialized once.
- Consumption marts: domain-specific outputs for BI tools, dashboards, and team workflows.
This semantic integration layer is the missing middle that normalized designs fail to provide: a single, aligned, materialized interpretation of business reality for read-heavy analytics.
This matters because normalized models often hide business rules inside join choices. Two analysts can answer the same question, use different join paths, and both be technically valid while producing different numbers.
A denormalized semantic layer removes that ambiguity earlier. The data team must define shared business rules up front, such as what counts as an active customer, which customer record is authoritative, and how time boundaries are applied. Those decisions become reusable platform logic instead of one-off analyst interpretation.
This is not about making everything a final mart. It is about creating a stable semantic layer between source-shaped data and purpose-built marts, so downstream work is faster and consistent.
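As a minimal sketch of materializing a shared definition (schema, names, and the 30-day threshold are all invented for illustration): the rule for "active customer" lives once, in the semantic layer, and every downstream consumer inherits it instead of re-deriving it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER, plan TEXT,
                        last_order_days_ago INTEGER);
INSERT INTO customers VALUES (1,'paid',10), (2,'trial',5), (3,'paid',90);

-- The definition is written once, here, not in each analyst's query.
CREATE VIEW active_customers AS
SELECT customer_id FROM customers
WHERE plan = 'paid' AND last_order_days_ago <= 30;
""")

# Two independent consumers read the same rule and must agree.
dashboard = con.execute("SELECT COUNT(*) FROM active_customers").fetchone()[0]
report    = con.execute("SELECT COUNT(*) FROM active_customers").fetchone()[0]
print(dashboard, report)  # 1 1
```

In practice this would be a materialized table or dbt model rather than a plain view, but the principle is the same: the business rule becomes platform logic, not query-time interpretation.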
7. Where Normalization Does Belong
Normalization is not the wrong idea. It is the right idea applied to the wrong context.
Normalization is ideal for systems where writes are unpredictable and correctness overrides performance. OLTP databases. Systems that handle transactions. Anything where a write anomaly is a correctness failure.
Master data management is another fit. A single source of truth for customer identity or product hierarchy should be normalized and curated, with updates flowing outward.
Compliance-driven lineage is a real requirement in some industries. Data Vault delivers full traceability of every change. If regulatory requirements demand it, you have good reasons to normalize.
The key is keeping these concerns isolated. Build normalized models where they solve real problems, then build denormalized consumption layers on top for everything else. That separation works. Forcing the entire analytics platform to query through a normalized foundation does not.
Normalization is a tool for bounded problems, not a universal architectural principle.
8. The Practical Shift
A data platform is not a database. It is an analytics engine. The design should reflect that.
Modeling for consumption means materializing semantics: embedding business logic where it will be queried, denormalizing joins, and pre-aggregating common patterns. This can happen at multiple levels:
- Logical semantic layers (LookML, MetricFlow, dbt Semantic Layer) define metrics and relationships for downstream tools.
- Physical semantic layers (materialized views, aggregated marts) pre-compute those definitions so they are consistent and fast.
Modern platforms need both. The logical layer makes intent clear. The physical layer makes execution fast.
dbt manages dependencies and makes logic testable. Platforms like Databricks and BigQuery optimize incremental updates. The job of the platform engineer is to design models that analysts can actually query, not to demonstrate structural purity.
The most useful metric for a platform’s quality is a simple one: how long does it take someone to answer a business question without asking for help? If the answer is “a while,” the model is probably working against the people it is supposed to serve.
9. The Takeaway
Normalized models are elegant solutions to problems that analytics platforms mostly do not have. They solve write anomalies. They ensure consistency. They make a certain kind of architectural review very clean.
But platforms exist to answer questions, not to demonstrate correctness. And there is a precise way to recognize when a platform has failed: the model becomes harder to query than the raw data. When that happens, the architecture has failed, because the model should serve the people who use it.
Join the Discussion
Thought this was interesting? I'd love to hear your perspective on LinkedIn.