The Modern Data Engineer Part 10: When the Source Drifts

When a helpful change broke disaster recovery

At a previous client where I worked as a data and platform engineer, one of our most important sources was an asset-management system. The integration worked like this. Every night the source application exported a selected subset of its database into an operational data store as a full drop-and-replace, and that operational data store served two roles at once. It was the integration layer that other applications read from, and it was the source the data warehouse loaded from.

The catch was that the contents of that export were not fixed. Which columns it carried were configured inside the source application through a GUI, with checkboxes and drop-down menus. The application managers were responsive and helpful to the business, and when someone asked for different fields in the export, they happily reconfigured the selection. None of them was thinking about the data warehouse as a consumer, because nothing in their workflow surfaced that dependency. Protecting that dependency was no one’s explicit job either, so each change to that selection arrived ungoverned and unquestioned, landing in our pipeline as schema drift.

Most of those changes were cosmetic and cost us nothing. But every so often a change was big enough that we spent many hours rewriting the older Bronze data into the new schema so the history would still line up. I disliked that work for a reason past the wasted hours. Bronze was our archive. The extraction framework wrote each night’s export to storage as parquet and left it untouched, and that capture was what recovery rested on. Silver, the source-aligned Delta layer that fed the star schema, was rebuilt by replaying those Bronze captures back through the merge, historization and all. Rewriting the old captures into the new schema meant they no longer matched what the source had actually sent, so a rebuild from those captures no longer reproduced the history we thought we had.

Nobody ever sat down and decided to weaken disaster recovery. It degraded quietly, one schema rewrite at a time, as a side effect of chasing the source. The first rebuild that came out wrong, traced back to a rewrite from months earlier, was when the cost of the old approach stopped being abstract. We had been trading away the one guarantee we most needed, to keep up with a stream of GUI checkbox changes that nobody upstream considered significant.

This was the predictable behavior of cargo, even though in the moment it rarely gets diagnosed that way. Earlier parts of the series already explained why.

Drift is the nature of cargo

I argued back in Part 2 that source data is cargo your code does not control. Ingestion software moves it from somewhere else into your platform, but it remains the output of a system you do not own and whose evolution you do not get a vote on. When you couple your transforms to the exact shape of that payload, you are betting that an external team will never change something they have every right to change. They will. No amount of upstream hygiene eliminates this.

The drop-and-replace mechanism in the story sharpens the problem to a point. A nightly full drop-and-replace announces nothing. There is no event on a bus, no version bump in a header, no changelog entry, no deprecation notice. The schema is one thing on Monday night and a different thing on Tuesday morning, and the load runs against whatever it finds.

That silence is the real problem to solve. When the source gives you no change signal and you have placed no check at the boundary, the first thing that tells you the schema drifted is a transform that errors out, or worse, a transform that keeps running and produces wrong numbers that nobody flags until a report looks off weeks later.

Before getting to detection, it helps to separate the kinds of drift, because they fail in different ways and call for different defenses.

The three kinds of drift

Additive drift is a new column appearing in the export. This is the common case, and most of the time it is a non-event. When ingestion treats the payload as cargo and you opt in to only the columns the conformed entity actually depends on, a new source column lands in Bronze and is simply not mapped forward. It changes nothing downstream by design, and on a modern storage layer a newly added column is often absorbed automatically before you even decide whether to map it. You can add a presence-and-type check that flags the new column if you want the visibility, but it should rarely rise to the level of an incident.

Breaking drift is a rename, a removal, or a type change on a column you depend on. This is the painful case and the body of this post. A renamed required field, a dropped column, or an integer that becomes a string breaks the transforms that read it. Schema validation is built precisely for this. The limit of schema validation is that it tells you the structure broke and stops there. It says nothing about what to do with all the history you already captured under the old shape.

Semantic drift is the quiet one. The column keeps its name and keeps its type, but its meaning changes underneath you. A status enum gains a value the downstream logic has never seen. A monetary field switches currency. A unit changes from grams to kilograms. A nullable field starts arriving default-filled instead of empty. Every presence-and-type check passes, the load runs clean, and the conformed entity goes on bucketing reality into categories that no longer mean what they used to. Schema validation cannot see this at all, because at the structural level nothing changed.
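The two structural kinds, by contrast, can be told apart mechanically. Below is a minimal sketch of that comparison, assuming schemas are available as plain column-to-type mappings; `diff_schema`, `classify`, and the `depends_on` set are hypothetical names for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    added: dict = field(default_factory=dict)         # column -> arrived type
    removed: dict = field(default_factory=dict)       # column -> expected type
    type_changed: dict = field(default_factory=dict)  # column -> (expected, arrived)


def diff_schema(expected: dict, arrived: dict) -> SchemaDiff:
    """Compare two {column_name: type_name} mappings."""
    diff = SchemaDiff()
    for column, arrived_type in arrived.items():
        if column not in expected:
            diff.added[column] = arrived_type
        elif expected[column] != arrived_type:
            diff.type_changed[column] = (expected[column], arrived_type)
    for column, expected_type in expected.items():
        if column not in arrived:
            diff.removed[column] = expected_type
    return diff


def classify(diff: SchemaDiff, depends_on: set) -> str:
    """Return 'breaking', 'non-breaking', or 'unchanged'."""
    touched = set(diff.removed) | set(diff.type_changed)
    if touched & depends_on:
        return "breaking"      # a column we read was removed, renamed, or retyped
    if diff.added or touched:
        return "non-breaking"  # something changed, but nothing we depend on
    return "unchanged"


# Hypothetical schemas for the asset export, before and after a reconfiguration.
expected = {"asset_id": "int", "asset_ref": "string", "status": "string"}
arrived = {"asset_id": "bigint", "asset_code": "string",
           "region": "string", "status": "string"}

diff = diff_schema(expected, arrived)
kind = classify(diff, depends_on={"asset_id", "asset_ref", "status"})
```

A rename has no signature of its own in a diff like this: it shows up as one removal plus one addition, so a human still has to confirm that `asset_code` is the old `asset_ref`. Semantic drift comes back as `unchanged`, which is exactly the blind spot described above.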

Match the defense to the drift

The three kinds of drift do not share one fix. Each answers to a different part of the platform, and the rest of the post follows that split.

Additive drift is the easiest. A modern table format like Delta or Iceberg can often absorb a new column on its own, evolving in place as a metadata change that leaves the rows already captured untouched. It is not always free, though. Plenty of setups still trip over a new column: an ingestion framework that chokes on it, a dbt source with a strict definition, a downstream view that expects an exact shape, a schema registry that rejects the record. Treat additive drift as usually cheap rather than automatically handled.

Breaking drift is what versioning is for. A rename, a removal, or a type change means you cut a new Bronze version and reconcile the versions downstream, which is the body of this post.

Semantic drift is the awkward one. A column keeps its name and its type but changes what it means, so it slips past every structural defense. Neither a schema diff nor a version boundary can see it. Detecting it takes a contract with value-level rules, the final third of the post. But detecting and absorbing are separate jobs. Once the contract tells you the meaning changed at a known point, a status code retired and reused, or a measure that switched units, you usually absorb it in the Silver merge with a transform keyed to the cutover date, so the two eras reconcile instead of silently blending. You only cut a Bronze version for it when you want that cutover recorded in Bronze rather than buried in a date filter. Either way, the contract is what sees the change.
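A transform keyed to the cutover can be very small. The sketch below takes the unit-switch case and assumes a hypothetical weight column that arrived in grams before the cutover and in kilograms after it; the date, the column, and the function name are all invented for illustration.

```python
from datetime import date

# Hypothetical cutover: captures before this date report weight in grams,
# captures on or after it report kilograms.
WEIGHT_UNIT_CUTOVER = date(2024, 3, 1)


def conform_weight_kg(raw_weight: float, captured_on: date) -> float:
    """Return the weight in kilograms regardless of which era the row is from."""
    if captured_on < WEIGHT_UNIT_CUTOVER:
        return raw_weight / 1000  # old era: value is in grams
    return raw_weight             # new era: value is already in kilograms


# The same physical asset, captured on either side of the cutover.
before = conform_weight_kg(2500, date(2024, 2, 29))
after = conform_weight_kg(2.5, date(2024, 3, 1))
```

The cutover date is the whole contract between the two eras here, so it belongs somewhere more durable than a literal in a function: next to the model, or in whatever record keeps the version boundaries.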

The instinct when breaking drift hits is to rewrite Bronze into the new schema. That is the move that broke recovery in the story, and it fails for a reason deeper than the hours it burns.

Why rewriting Bronze rots the warehouse

The old way was direct enough to feel reasonable. A rename or a type change arrives, so you rewrite the old Bronze tables into the new schema and then walk every downstream model to match. It is laborious, it is risky, and it touches everything, but in the moment it looks like the honest way to keep the warehouse consistent.

The first thing wrong with it is that it mutates history that was supposed to be immutable. The whole value of Bronze is that it is a faithful, append-only record of what the source actually sent, captured exactly as it arrived. Rewriting it destroys the audit trail and, as the story showed, breaks the rebuild from Bronze, which was the recovery guarantee the whole platform leaned on. In Part 9 I made Bronze immutability the load-bearing property that GDPR erasure depends on, because crypto-shredding only works if Bronze is never edited in place. A post that rewrites Bronze in Part 10 would quietly demolish the foundation Part 9 stands on. Nothing here is allowed to rewrite or reinterpret the records Bronze already captured. Adding a column in place is fine, since the rows already written do not change. Reshaping or recasting those rows is what versioning exists to prevent. A new schema cut produces a new version table, and the records in the existing one stay exactly as they were.

There is a quieter version of the same mistake, one I have shipped myself at more than one company. Instead of rewriting Bronze after the fact, you fix the drift on the way in, by defining column aliases or doing small in-memory column operations inside the extractor or loader you wrote yourself. A rename upstream becomes a one-line change to an alias map, the Bronze table keeps a steady shape, and day to day it genuinely makes drift easier to live with. It costs you the same guarantee the rewrite did, only earlier in the pipeline. Bronze is the only copy of the source you keep, so a column the loader reshaped on the way in is gone from the record for good, and the archive holds what the loader chose to write rather than what the source sent. This is the cargo-as-code coupling from Part 2 in a more helpful disguise, and it leaves Bronze unable to serve as a faithful record of what arrived, because the loader edited it before it ever landed.

The second thing wrong with it is where the reconciliation ends up living. If you absorb the drift inside the conformed entity by widening it with branches for the old shape and the new shape, that entity slowly fills with CASE WHEN ladders and COALESCE chains until nobody can read it. In Part 4A I described the conformed entity as the shared semantic contract, the one place where business meaning is committed once and trusted everywhere. Filling it with version branches turns that contract into a record of the source's configuration history, which is the opposite of what it is for.

Leave the capture exactly as it arrived, whether the temptation is to reshape it on the way in or to rewrite it after the fact, and keep drift out of the conformed entity. Version Bronze instead.

Version Bronze instead of rewriting it

When the schema drifts in a breaking way, append a new Bronze version table and leave the existing one exactly as it is. The old version still exists and might still receive data, so a late-arriving extract in the old schema can still be routed to the table built for it.

How concrete the versioning gets is up to you, and a simple convention carries most of the weight. A versioned table name works well, something like bronze.assets_v1 and bronze.assets_v2, where each new schema cut produces a new table and leaves the previous one untouched. The other piece is a small record of where the boundary sits. A metadata row or registry entry that captures the version, the timestamp of the cutover, and the schema diff that triggered the cut gives you a durable account of what changed and when, which the reconciliation downstream will need.
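As a rough illustration of how little machinery this needs, here is a sketch of such a registry with a routing function on top. Everything in it is an assumption made for the example: the `BronzeVersion` fields, the cutover dates, and the idea of matching an extract to a version by an exact fingerprint of its column names and types.

```python
from dataclasses import dataclass
from datetime import datetime


def fingerprint(schema: dict) -> frozenset:
    """Order-independent fingerprint of a {column_name: type_name} schema."""
    return frozenset(schema.items())


@dataclass(frozen=True)
class BronzeVersion:
    table: str            # versioned table name
    cutover_at: datetime  # when this schema first arrived (hypothetical dates)
    schema: frozenset     # fingerprint of the schema captured in this table
    schema_diff: str      # human-readable account of what triggered the cut


REGISTRY = [
    BronzeVersion(
        table="bronze.assets_v1",
        cutover_at=datetime(2023, 1, 1),
        schema=fingerprint({"asset_id": "int", "asset_ref": "string",
                            "status": "string"}),
        schema_diff="initial capture",
    ),
    BronzeVersion(
        table="bronze.assets_v2",
        cutover_at=datetime(2024, 3, 1),
        schema=fingerprint({"asset_id": "bigint", "asset_code": "string",
                            "region": "string", "status": "string"}),
        schema_diff="asset_ref renamed to asset_code; asset_id int -> bigint; "
                    "region added",
    ),
]


def route(extract_schema: dict) -> str:
    """Return the Bronze table whose schema matches the extract exactly."""
    wanted = fingerprint(extract_schema)
    for version in REGISTRY:
        if version.schema == wanted:
            return version.table
    raise LookupError("no Bronze version matches this schema; cut a new one")
```

Routing on the schema rather than on the arrival date is what lets a late extract in the old shape still land in `bronze.assets_v1`. An unknown shape raises instead of being coerced into the nearest table, which keeps the decision to cut a version a deliberate one.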

None of this depends on a particular transformation tool. Whether you build the reconciliation in dbt, in SQLMesh, in plain SQL views, or in something your platform team wrote in-house, the shape of the solution is the same.

The payoff is the part the war story was missing. Because Bronze is never rewritten, the replay path that quietly rotted under the old approach stays intact under this one. Keeping the recovery guarantee true is the whole reason to version Bronze rather than rewrite it, and it is the part the team in the story gave away without ever choosing to.

Why not evolve the Bronze table in place?

If you run on a modern table format, the obvious objection is that this versioning is unnecessary. Delta, Iceberg, and Hudi all evolve schema as a metadata operation. They add a column, and with the right table configuration (Delta, for instance, needs column mapping enabled) they can even rename or drop one, without rewriting the data files and with time travel still reaching the older snapshots. So why not keep a single Bronze table and let the format evolve it?

The answer is replay. Recovery means rebuilding the layers above Bronze by replaying it back through them, and that only reconstructs the real history if each row still carries the schema it was captured under. Evolve a single Bronze table in place and you keep the rows, but you lose the explicit record of which ones arrived under which schema. The rebuild can no longer tell the eras apart, so the history it reconstructs is not the history that happened.

Versioning keeps what the rebuild needs. Each capture stays in the schema it was written under, the version boundary records exactly where one era ends and the next begins, and the source-aligned Silver step reconciles the versions on the way through.

How much any of this matters depends on whether you actually rebuild from Bronze. Where that rebuild is a guarantee you exercise, the version boundary is what keeps it honest. Where Silver or Gold is your real source of truth and Bronze is rarely replayed, evolving in place costs you less and the case for versioning is weaker.

This is also why the storage layer’s cheap additive evolution does not remove the need for versions. It lightens the load, because additive changes never have to become a new table. But the breaking and semantic cutovers, the ones that actually threaten the rebuild, are the ones it cannot absorb without losing that boundary.

Versioning multiplies your Bronze tables, and the conformed entity must never have to deal with that multiplication. The reconciliation happens once, in the source-aligned Silver layer.

Reconcile the versions in a Silver union

The reconciliation is a single source-aligned Silver model that unions all the Bronze versions and lands every one of them in the same conformed shape. A few transformations cover what differs between versions.

For a type change, pick one canonical type in Silver and cast each version to it. Usually that means widening, an integer to a bigint for instance, so no value is lost. When a clean cast would be lossy, casting to string and handling the parse downstream is the safer default. Whatever you choose, record the choice next to the model so the next engineer knows it was deliberate.

For a rename, map both the old name and the new name onto the same Silver column, so a column that was called asset_ref in one version and asset_code in the next reads as one column in Silver.

For a removal, the column is simply absent from the later versions, so the union supplies a null or an agreed default for the rows that come from those versions.

For a change in meaning, translate the old values into the new vocabulary with a transform keyed to the cutover, so a status stored as short codes before it and full words after lands as one consistent set of values in Silver. The schema never changed here, so a date filter in the merge is usually enough. You cut a Bronze version only when you want the boundary explicit in Bronze rather than carried in transform logic.

A compact sketch shows them together.

-- source-aligned Silver: assets
select
    cast(asset_id as bigint)        as asset_id,      -- v1 was int, widened
    asset_ref,                                        -- v1 name
    cast(null as string)            as region,        -- added later, null for v1
    case status                                       -- v1 used short codes
        when 'A' then 'active'
        when 'I' then 'inactive'
        when 'D' then 'decommissioned'
    end                             as status
from bronze.assets_v1
union all
select
    asset_id                        as asset_id,     -- v2 bigint
    asset_code                      as asset_ref,    -- v2 renamed column
    region                          as region,       -- v2 introduced region as a new column
    status                          as status        -- v2 already uses words
from bronze.assets_v2

That is the whole trade the pattern makes. The cost of breaking drift moves from rewriting history and chasing every downstream model to adding one version branch in one source-aligned Silver model. The conformed entity that reads from this Silver model never learns that the source drifted at all.

One detail matters when the conformed entity carries slowly changing history. The union must conform renamed and widened columns to identical values before any SCD Type 2 comparison runs, or a cosmetic change opens a spurious new version row for every entity at the cutover, the same phantom history Part 6 flagged with hard deletes.
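To make the failure concrete, here is a small sketch assuming change detection works by hashing a set of tracked attributes, which is one common way to drive SCD Type 2 but not necessarily how any given platform does it. The conform functions mirror the union above; the sample rows and the `TRACKED` tuple are hypothetical.

```python
import hashlib

STATUS_V1 = {"A": "active", "I": "inactive", "D": "decommissioned"}
TRACKED = ("asset_ref", "status")  # attributes whose changes open a new SCD2 row


def conform_v1(raw: dict) -> dict:
    """Land a v1 row in the Silver shape: widen the id, translate status codes."""
    return {"asset_id": int(raw["asset_id"]),
            "asset_ref": raw["asset_ref"],
            "region": None,
            "status": STATUS_V1[raw["status"]]}


def conform_v2(raw: dict) -> dict:
    """Land a v2 row in the Silver shape: map the renamed column back."""
    return {"asset_id": int(raw["asset_id"]),
            "asset_ref": raw["asset_code"],
            "region": raw["region"],
            "status": raw["status"]}


def change_hash(row: dict, columns=TRACKED) -> str:
    """Hash the given columns of a row for change detection."""
    payload = "|".join(f"{c}={row[c]!r}" for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same unchanged asset, captured the night before and the night after the cut.
raw_v1 = {"asset_id": 42, "asset_ref": "PUMP-7", "status": "A"}
raw_v2 = {"asset_id": 42, "asset_code": "PUMP-7", "region": "north",
          "status": "active"}

conformed_match = change_hash(conform_v1(raw_v1)) == change_hash(conform_v2(raw_v2))
raw_match = (change_hash(raw_v1, columns=sorted(raw_v1))
             == change_hash(raw_v2, columns=sorted(raw_v2)))
```

Hashing after conforming treats the two captures as the same state, while hashing the raw rows would open a new history row for every asset at the cutover. Leaving `region` out of `TRACKED` is a judgment call in this sketch: tracking it would open a row the first night it goes from null to populated, which may or may not be history you want.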

Reconciliation handles what you already know changed. Noticing that a column’s meaning shifted while its name and type stayed put is the part versioning cannot do on its own, and that is what a data contract is for.

Data contracts

Look back at the root cause in the story. A producer reconfigured an export without any awareness that the warehouse was downstream, and no one on the warehouse side was positioned to close that gap. No amount of clever Silver modeling fixes that, because the problem is that the warehouse was never a named, visible consumer with stated expectations. In Part 5 I argued that governance scales only when the pipeline itself produces and enforces it, so that the guarantees hold without depending on anyone to administer them by hand. A data contract is that argument applied to the producer boundary.

A data contract is an explicit, versioned, machine-readable agreement between a data producer and its consumers. It puts the warehouse on the record as a consumer with declared expectations about the data it receives, which means a producer can no longer silently reconfigure an export and quietly make it someone else’s problem. The fix does not depend on anyone upstream remembering the warehouse exists. The contract and its automated checks carry the enforcement that the org chart did not.

There is a habit in this profession worth naming. Data engineers are too compliant. A contract has two parties, and a data contract is a bilateral agreement like any other, which cuts both ways. A change the producer raises ahead of time is ordinary work: the two sides agree on it, the contract version moves, and you schedule the new Bronze version to land with it. Drift only becomes an incident when the change is unilateral, the export quietly reconfigured with nobody downstream consulted. Then the producer has broken the agreement, and the honest response is to hand the problem back: the export no longer matches what we agreed, so fix it at the source, and ingestion resumes once it does. Quietly rebuilding your pipeline around an unannounced change treats their violation as your burden, and it trains the producer to keep treating the warehouse as a place where anything goes.

How far you can take that depends on leverage. Against a third-party API you do not control, the stance from Part 2 still holds: the data is cargo, you absorb the change, and you version Bronze because there is no one to send the bill to. Against an internal source, where the contract is a real agreement between two teams under the same roof, you almost always have more standing to push back than you use. The patterns in this post are how you cope when you have to absorb drift. The contract is also what earns you the right to refuse it.

A widely used standard for this is the Open Data Contract Standard, ODCS. It is open, written in YAML, and version controlled and reviewed in git like any other artifact in the platform. A contract can carry many sections, and two of them matter for drift. The schema section declares each column, its type, and whether it is required, which is where a rename, a removal, or a type change shows up as a contract violation. The quality section holds value-level rules, and this is where semantic drift gets caught, because a rule can pin a column to an enumerated domain, a range, or a distribution that a schema check cannot express. A rule that constrains a status column to a known set of values is exactly what notices the morning a new enum value appears.

# a quality rule (value-level: this is what catches semantic drift)
- metric: invalidValues
  arguments:
    validValues: ['active', 'inactive', 'decommissioned']
  mustBe: 0
  dimension: conformity
  severity: error
  schedule: 0 20 * * *
  scheduler: cron

Those are the same values the Silver union conformed the old status codes into, so the contract guards the vocabulary the reconciliation produces. One thing to keep straight: ODCS is a specification rather than an enforcement engine. It defines the contract and what a valid one looks like, but it does not run checks against your data by itself. You still need tooling to execute the rules, whether a dedicated data-contract CLI, Soda, the test framework in your transformation tool, or a schema registry at the ingestion boundary. The contract says what to check. One of those tools does the checking.
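To show what "doing the checking" amounts to for the rule above, here is a hand-rolled sketch in plain Python. It is not how Soda or any data-contract CLI is implemented; the function names are hypothetical, and counting a missing or null status as invalid is an assumption of this example rather than something taken from the standard.

```python
from collections import Counter

# The domain declared by the contract's validValues argument.
VALID_STATUS = {"active", "inactive", "decommissioned"}


def invalid_values(rows, column, valid_values) -> Counter:
    """Count values outside the allowed domain (missing or null counts too)."""
    return Counter(
        row.get(column) for row in rows if row.get(column) not in valid_values
    )


def must_be_zero(offenders: Counter) -> bool:
    """The rule passes only when no invalid value was seen."""
    return sum(offenders.values()) == 0


# Hypothetical Silver rows on the morning a new enum value shows up.
rows = [
    {"asset_id": 1, "status": "active"},
    {"asset_id": 2, "status": "retired"},
    {"asset_id": 3, "status": "inactive"},
    {"asset_id": 4, "status": "retired"},
]

offenders = invalid_values(rows, "status", VALID_STATUS)
passed = must_be_zero(offenders)
```

Returning the offending values with their counts, rather than a bare pass or fail, is what makes the alert actionable: the conversation with the producer starts from "a status called retired appeared on two assets last night" instead of "a check failed".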

How the defenses lock together

The contract is what turns the silent drop-and-replace into a signal. The export announces nothing on its own, so you put an explicit check at the ingestion boundary, a schema diff or an ODCS breaking-change check that compares what arrived against what the contract declares. When it trips on a change you did not plan for, you respond deliberately instead of finding the break weeks later in a wrong report: push it back to the producer and pause ingestion, or, when you have to live with it, cut a new Bronze version.

That is the whole loop. The storage layer absorbs the additive changes nobody needed to see. The contract catches the structural break and the meaning change alike, you cut a version for the ones worth absorbing, the Silver union reconciles them, and the conformed entity downstream stays stable through all of it.

The honest trade-offs

A professor of mine during my thesis kept coming back to one line, and it has held up better than most of what I learned that year: there ain’t no such thing as a free lunch. Every option in this post buys you something and charges you for it somewhere else, so the trades are worth laying out plainly.

The Silver union grows without a natural bound. Version one, version two, on through version N, each one another branch in the same Silver model. Eventually someone will ask whether version one can finally be dropped. It can, but only with eyes open: collapsing an old version is the rewrite this whole post argued against, so it is safe only once you have accepted you will never need to rebuild that era from Bronze. Until then the trade is real, and you are buying immutability and replayability with storage and query complexity that only grows.

Versioning is blind to meaning. A change that keeps a column’s name and type sails through a version boundary unseen, which is the entire reason the contract’s quality rules are part of the design instead of an optional extra you can skip.

Cutting a version is not free either. Each new version needs its mapping in the Silver union, written and tested. That cost is small next to rewriting history and chasing every downstream model, but it is not zero, it recurs every time the source drifts, and it compounds. Picture the code example above stretched to 20 versions of the asset-management export, each one a few more lines of mapping in the same union.

Last, the detection only exists if someone actually wrote the contract or the schema-diff check. Drop-and-replace gives you nothing for free. Without an explicit check at the boundary, the first sign of drift is still a broken model, or wrong numbers that no one questions until far too late.

The asset-management export never stopped changing shape, and it was never going to. But that was never the real problem. The real problem was quieter: every time we rewrote Bronze to keep pace, we chipped away at the replay path we believed we were protecting, and no one ever decided to. Versioning Bronze, reconciling the versions in Silver, and holding the producer to a contract come down to one thing, keeping the recovery guarantee true while the source drifts instead of trading it away one rewrite at a time.
