The Modern Data Engineer Part 9: GDPR Erasure in the Layered Warehouse

Parts 6 and 7 worked through a specific problem. How do you make a hard delete in the source system propagate faithfully through Silver history, and how do you build a rebuild deletion registry that makes the Silver SCD Type 2 model rebuildable even when the source no longer carries the deleted row? That problem was about analytics correctness and pipeline fidelity. The driving question was whether your history model accurately reflects reality.

Part 7 closed by naming what it deliberately deferred. It stated that a rebuild deletion registry and a GDPR erasure registry are not the same artifact and should not be conflated, and that the topic deserved its own treatment. This is that treatment.

The deferral reflected how sharply the two topics pull against each other, even though they operate on the same layers. Parts 6 and 7 argued for immutability, append-only history, and durable raw-layer artifacts that a rebuild can replay. GDPR’s right to erasure says that a specific person’s data must be gone everywhere, including from those durable artifacts. The engineering challenge is that the properties that make the warehouse architecturally sound are exactly the ones that make this obligation difficult to fulfill. Getting the architecture right requires thinking about that tension before the first row lands in Bronze, not after the first erasure request arrives.

The pattern this post arrives at has four parts that work together. Sensitive fields are encrypted at ingestion under a key that belongs to one person, and alongside each encrypted field a generalized companion value is pre-computed and stored in plain text. When someone exercises their right to erasure, destroying that one person’s key renders their original values unreadable everywhere they were stored, an approach known as crypto-shredding. A rebuild from Bronze then falls through to the pre-computed companion values and verifies that nothing identifying survives. Each of these pieces depends on the others, and all of them have to be decided before ingestion rather than bolted on once a request arrives.

1. Why this is a different kind of deletion

The deletions that Parts 6 and 7 addressed originate inside the pipeline. A record disappears from the source system, and the question is whether that absence propagates correctly through the warehouse layers. The trigger is a source-system event where a row was removed, and the pipeline either detected it or did not. The engineering response was about fidelity, whether the warehouse accurately reflects what the source contains.

A GDPR right-to-erasure request originates outside the pipeline entirely. A person exercises a legal right, usually through a web form, an email to a data protection team, or a regulatory process. The source system may still have the record. The request asks what your system stores about that individual, and the obligation is to remove it from your system regardless of the source state. The scope of “your system” is the full warehouse, all three layers, plus any derived artifacts that were built from it.

This matters architecturally because the deletion does not arrive as a source event that the ingestion pipeline can detect and propagate. It arrives as an obligation that must be executed against every layer that holds the person’s data, including layers that were explicitly designed to be immutable. The layered warehouse was built to be trustworthy in a specific way. Bronze is write-once and replayable, Silver history is append-only, Gold entities are shared across marts. Those properties are exactly what the erasure obligation runs against.

Two examples from practice illustrate what happens when this tension is not addressed at design time.

One client ran an erasure engine across their company-wide data platform, which served multiple domain teams consuming from a central layer. The erasure script ran row deletion and tokenization on the downstream layers, while the Bronze and archive layers were treated as immutable. Legal and the Information Security Officer reviewed the approach and accepted it. From a practical standpoint it works: PII is removed from the layers where analysts and BI tools operate.

The vulnerability the approach carries is in the rebuild path. If Silver is dropped and rebuilt from the archive layer, the PII re-materializes in the downstream layers because the archive was never modified. The erasure script is therefore scheduled to run again immediately after every rebuild to close that window. The window is short, and the operational people involved have accepted the residual risk. In my view it is not fully GDPR compliant by design, not least because any data engineer can always restore data from the archives. Despite these drawbacks, it is an honest arrangement with explicit compensating controls and stakeholder sign-off.

What it demonstrates is the structural problem. The erasure depends on the erasure script running after the rebuild, every time, without exception. Because the architecture does not prevent re-materialization, every rebuild creates a window where PII is back downstream until the script runs again.

Another client’s data platform took a different path: they deferred erasure intentionally. The platform was built under time pressure for the HR domain, and the choice was to ship fast and address erasure later. The company’s privacy officers strongly recommended applying data minimization at design time regardless of what erasure infrastructure the team eventually built. Employee addresses were not stored, only city. Dates of birth were truncated to January 1st of the birth year rather than the full date. Personal details were minimized wherever the reporting requirements did not explicitly require them. The partners field from the source HR system was one concrete case, examined in Part 8: rather than load it, the team derived a boolean has_partner_registered column before ingestion. This reduced the surface area of PII in the warehouse without solving the erasure problem itself.

When employees reached the seven-year retention policy threshold and erasure was legally required, the team had no erasure infrastructure to invoke. Without a key to delete and without a structured registry of where that person’s data lived, the process became manual archaeology. Custom scripts scanned through hundreds of gigabytes of raw text files, located every record belonging to those employees, and removed them, a process that was slow, expensive, and difficult to verify as complete. The data minimization measures they had built in from the start helped since there was less PII to find, but they did not eliminate the scanning problem for the fields that had been ingested.

Both situations share the same root. Erasure was not designed into the architecture, so it became a remediation task. The first client’s approach reached a workable steady state through compensating controls. The second situation shows what the absence of any forward-looking design looks like when the legal obligation eventually arrives.

2. What PII actually is

Personal data under GDPR reaches well beyond the obvious cases of names and emails, and fields that look harmless on their own can combine into Article 9 special-category data when they sit on the same record. Part 8 covers that material in full: the definition of personal data, the list of Article 9 special categories, and how combinations of innocuous fields force these decisions to ingestion design time. This post leans on that ground throughout, so readers who are new to it should study Part 8 first.

3. What erasure means per layer

The answer to what erasure requires is different at each layer of the warehouse, and the differences follow from the design properties that each layer is built around.

Bronze

Bronze was designed to be immutable and replayable. Every row that arrived from a source lands in Bronze and stays there, preserving the full ingestion history. This is the property that makes a full Silver rebuild possible. The right-to-erasure obligation says that a person’s data must be gone from this layer, which appears to create a direct contradiction, since the layer was explicitly designed to make row-level modification impossible.

The resolution is field-level encryption with a per-subject key, applied at ingestion, where the key is stored outside the warehouse in a dedicated key management system. The rows themselves are never modified. What changes is the readability of their content. When erasure is requested, the key for that person is deleted from the external key store. The encrypted ciphertext that was stored in Bronze remains physically intact, but without the key it is now unreadable noise, which means the immutability guarantee survives while the content it was protecting becomes unrecoverable, at least insofar as the key is genuinely gone and no copy of it lingers in a backup.
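The mechanism can be sketched in a few lines. This is a minimal illustration, not a production design: KeyStore is a hypothetical in-memory stand-in for the external key management system, and the SHA-256 keystream is a toy cipher chosen only so the sketch runs with the standard library. A real pipeline would use a vetted authenticated cipher such as AES-GCM and a real key management service.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical in-memory stand-in for an external key management system."""

    def __init__(self):
        self._keys = {}

    def get_or_create(self, subject_id):
        if subject_id not in self._keys:
            self._keys[subject_id] = secrets.token_bytes(32)
        return self._keys[subject_id]

    def get(self, subject_id):
        return self._keys.get(subject_id)

    def delete(self, subject_id):
        self._keys.pop(subject_id, None)


def _keystream(key, nonce, length):
    # Toy keystream (SHA-256 in counter mode). NOT a vetted cipher; demo only.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_field(store, subject_id, plaintext):
    key = store.get_or_create(subject_id)  # one key per person
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    cipher = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return nonce + cipher


def decrypt_field(store, subject_id, blob):
    key = store.get(subject_id)
    if key is None:
        return None  # key shredded: the stored bytes can no longer be read
    nonce, cipher = blob[:16], blob[16:]
    data = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return data.decode("utf-8")


store = KeyStore()
bronze_value = encrypt_field(store, "person-42", "1963-04-17")
before = decrypt_field(store, "person-42", bronze_value)  # original value
store.delete("person-42")  # the erasure: the Bronze bytes are not touched
after = decrypt_field(store, "person-42", bronze_value)   # None
```

The Bronze value is byte-for-byte the same before and after the erasure. Only the key store changed, which is why the immutability of the layer survives.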

This requires a decision made at ingestion design time about which fields warrant field-level encryption and what granularity defines a “per-subject” key. The decision is not purely technical. It involves identifying which fields constitute PII or special-category data for a given source, how the person identifier maps to rows across that source, and how the key management system is organized. These are questions that need to be answered before the pipeline runs, not after the first erasure request.

A naming convention makes these decisions visible in the schema itself. Sensitive fields are prefixed with pii_, so a date of birth field is stored as pii_date_of_birth in Bronze, encrypted with the person’s key. Alongside it sits a companion column date_of_birth carrying the pre-computed generalized value in plain text: the first day of the birth year, or a year-level cohort, depending on the reporting requirement. The same pattern applies across all PII columns: every pii_X column has a corresponding X column with its pre-computed fallback. The generalization decision is made once, at ingestion time, so when an erasure request later arrives the companion values are already sitting in Bronze and nothing has to be computed from the encrypted originals.

A home-grown extractor and loader can implement this pattern programmatically using a source spec file. The spec declares which columns are PII, what category of PII each one carries, and what generalization to apply. The extractor reads the spec, names the columns accordingly, encrypts the pii_ fields with the per-subject key, and writes the companion generalized value in the same row. The spec file is the auditable record of PII decisions for that source, reviewable by privacy officers independently of the pipeline code.
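A spec-driven extractor of this kind might look roughly like the sketch below. The spec layout, the rule names (first_day_of_year, band_10k, null), and the to_bronze_row function are all assumptions made for illustration, and the encrypt argument is a placeholder for whatever per-subject encryption the pipeline actually uses.

```python
from datetime import date

# Hypothetical source spec: which columns are PII, their category,
# and which generalization rule produces the companion value.
SOURCE_SPEC = {
    "date_of_birth": {"category": "personal", "generalize": "first_day_of_year"},
    "salary": {"category": "personal", "generalize": "band_10k"},
    "comment": {"category": "free_text", "generalize": "null"},
}


def _band_10k(value):
    low = int(value) // 10000 * 10000
    return f"{low}-{low + 9999}"


GENERALIZERS = {
    "first_day_of_year": lambda v: date.fromisoformat(v).replace(month=1, day=1).isoformat(),
    "band_10k": _band_10k,
    "null": lambda v: None,
}


def to_bronze_row(source_row, spec, subject_id, encrypt):
    """Build one Bronze row: pii_X holds ciphertext, X holds the fallback."""
    row = {}
    for column, value in source_row.items():
        rule = spec.get(column)
        if rule is None:
            row[column] = value  # not declared as PII: pass through unchanged
            continue
        row[f"pii_{column}"] = encrypt(subject_id, str(value))
        row[column] = GENERALIZERS[rule["generalize"]](str(value))
    return row


def placeholder_encrypt(subject_id, value):
    # Placeholder only; a real extractor encrypts with the per-subject key.
    return f"<ciphertext for {subject_id}>"


source_row = {
    "employee_id": "E1",
    "date_of_birth": "1963-04-17",
    "salary": "52300",
    "comment": "prefers morning shifts",
}
bronze_row = to_bronze_row(source_row, SOURCE_SPEC, "person-42", placeholder_encrypt)
```

Because the generalization rule lives in the spec and not in the pipeline code, a privacy officer can review the decision for each column without reading the extractor.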

Silver

Silver’s SCD Type 2 history is built to be append-only and rebuildable from Bronze. After a GDPR erasure, a rebuild from Bronze must produce a correct temporal history for that person without any individually identifying information. This is only possible if the rebuild has something to work from.

That something is already present in Bronze in the form of the companion columns written at ingestion. At erasure time, an anonymization tombstone marker is written to Bronze to signal that this entity has been erased. The tombstone is structurally analogous to the rebuild deletion registry artifact from Part 6 in that it is a Bronze record written specifically to guide the rebuild. The difference is what it signals. The deletion registry tombstone says that this entity no longer exists in the source. The anonymization tombstone says that this entity has been erased, and it records the entity identifier so the erasure can be audited and the rebuild can be scoped to that entity’s rows. It does not instruct the rebuild on how to replace values, since that substitution happens on its own through the COALESCE fallback described next.

The Silver pipeline queries Bronze using COALESCE(pii_date_of_birth, date_of_birth) for each sensitive field, where pii_date_of_birth is read through the decryption layer rather than as raw ciphertext. Before erasure, the encrypted pii_ column decrypts to the original value and the COALESCE returns it. After the key is deleted, the decryption returns NULL, and the COALESCE falls through to the companion column, returning the generalized value. This behavior requires that the encryption layer returns NULL on a missing key rather than raising an error, which is a prerequisite to confirm before relying on it.
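The fall-through can be demonstrated with SQLite from Python. This is a sketch under stated assumptions: the table name bronze_employee, the two-argument decrypt function, and the "enc:" prefix that stands in for real ciphertext are all hypothetical, and the keys dictionary plays the role of the external key store.

```python
import sqlite3

# Stand-in for the external key store: subject id -> key.
keys = {"person-42": "k1"}


def decrypt(subject_id, ciphertext):
    # Must return None (SQL NULL) on a missing key instead of raising,
    # otherwise COALESCE never gets the chance to fall through.
    if subject_id not in keys:
        return None
    return ciphertext[len("enc:"):]  # placeholder for real decryption


conn = sqlite3.connect(":memory:")
conn.create_function("decrypt", 2, decrypt)
conn.execute(
    "CREATE TABLE bronze_employee ("
    "subject_id TEXT, pii_date_of_birth TEXT, date_of_birth TEXT)"
)
conn.execute(
    "INSERT INTO bronze_employee VALUES ('person-42', 'enc:1963-04-17', '1963-01-01')"
)

QUERY = """
SELECT COALESCE(decrypt(subject_id, pii_date_of_birth), date_of_birth)
FROM bronze_employee
WHERE subject_id = 'person-42'
"""

before = conn.execute(QUERY).fetchone()[0]  # original value
del keys["person-42"]                       # erasure: only the key goes away
after = conn.execute(QUERY).fetchone()[0]   # generalized companion value
```

The same query text runs before and after the erasure. If decrypt raised an exception on a missing key, the query would fail instead of returning the companion value, which is the prerequisite described above.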

When the Silver pipeline rebuilds from Bronze, it processes both the original rows for that person (whose pii_ columns now decrypt to NULL) and the anonymization tombstone marker. The SCD Type 2 history survives the rebuild, but its PII fields have been replaced by generalized equivalents. The Conformed Entity row in Silver-integrated continues to exist, keyed on its stable conformed identifier, with identity attributes replaced. Downstream Gold marts and BI reports that consume this entity find the same row structure they always did, with generalized attributes where PII used to be.

Gold

In Gold marts, individual rows contribute to aggregates. Simply removing a person’s row from Gold would corrupt counts and metrics, which makes erasure by deletion the wrong approach at this layer. A person born in 1963 who asked to be forgotten should not make the age-group aggregates silently wrong.

Replacing the PII fields with their generalized companion values preserves the row’s contribution to aggregate analytics while removing the ability to identify the individual from the row. The person’s exact date of birth becomes the birth year, 1963, the year-level value stored in the companion column. Any aggregate that buckets by birth decade or by a five-year cohort reads that year-level value and buckets it exactly as it would have before, so queries counting employees by age band still produce correct numbers. What changes is the resolution of one attribute from individual-precise to year-level, and as long as the resulting buckets are large enough that membership cannot single anyone out, the individual is no longer identifiable from the output. The row’s contribution to aggregate analytics survives the erasure, which is why generalizing the record is a more useful frame for this problem than deleting it outright.
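A small worked example makes the difference between generalizing and deleting concrete. The dates and the birth_decade helper are invented for illustration.

```python
from collections import Counter
from datetime import date


def birth_decade(dob_iso):
    """Bucket an ISO date of birth into a decade label such as '1960s'."""
    year = date.fromisoformat(dob_iso).year
    return f"{year // 10 * 10}s"


# Three people in a Gold mart; the first one later requests erasure.
original = ["1963-04-17", "1968-11-02", "1971-06-30"]
generalized = ["1963-01-01", "1968-11-02", "1971-06-30"]  # companion value used
row_deleted = ["1968-11-02", "1971-06-30"]                # row removed instead

counts_original = Counter(birth_decade(d) for d in original)
counts_generalized = Counter(birth_decade(d) for d in generalized)
counts_row_deleted = Counter(birth_decade(d) for d in row_deleted)
```

The generalized data yields the same decade counts as the original, while deleting the row drops the 1960s bucket from two people to one.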

The same principle applies to tokenization, where a stable opaque token replaces a personally identifying value and preserves join-ability across tables that reference the same person without revealing their identity.

4. The erasure registry as a distinct artifact

Part 6 introduced the rebuild deletion registry, a technical table in the Bronze layer that records every detected hard delete as a durable fact. For each deletion it stores the primary key and a timestamp. Its purpose is rebuild fidelity. When Silver is rebuilt from Bronze, the deletion registry tells the pipeline which source entities no longer exist so it can close the appropriate SCD records and write deletion marker rows. The whole point of the registry is to make the rebuild deterministic even when the source no longer carries the deleted row.

The GDPR erasure registry serves a different purpose and carries different information. It is a key registry. For each person who has exercised their right to erasure, it records the person identifier and the encryption key identifier for that person’s per-subject key in the external key store. Deleting the key record there is what executes the erasure at the Bronze level. The erasure registry is the link between the legal obligation and the cryptographic mechanism.

The erasure registry also carries a reference to the anonymization tombstone written to Bronze at erasure time. The tombstone is a marker record that signals to the rebuild pipeline that this entity has been erased, and it does not need to carry replacement values because the companion-column fallback described earlier already supplies them. Its function is twofold: it anchors the audit record, tying the legal obligation to the specific Bronze record written at the moment of erasure, and it carries the affected entity identifier so the cascade can run a selective rebuild that touches only that entity’s rows rather than rebuilding all of Silver.

The two registries should remain separate artifacts even if they live in the same metadata schema. A person who appears in the rebuild deletion registry is represented in Silver by a closed SCD record with a deletion marker row. They no longer exist in the source. A person who appears in the GDPR erasure registry still exists in the warehouse, in generalized form. They have a valid Silver history, a Conformed Entity row, and a contribution to Gold aggregates. These are different states with different downstream semantics, and conflating the two registries would corrupt both the rebuild logic and the erasure audit trail.
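One way to keep the distinction visible is to give the two registries different record types. The sketch below uses the fields named in this post; the class and field names themselves are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RebuildDeletionRecord:
    """A hard delete detected in the source. Guides the Silver rebuild."""
    primary_key: str
    deleted_at: datetime


@dataclass(frozen=True)
class ErasureRecord:
    """A fulfilled right-to-erasure request. Serves the compliance audit."""
    person_id: str
    key_id: str         # identifier of the per-subject key in the key store
    erased_at: datetime
    tombstone_ref: str  # reference to the anonymization tombstone in Bronze


now = datetime.now(timezone.utc)
deletion = RebuildDeletionRecord(primary_key="order-981", deleted_at=now)
erasure = ErasureRecord(
    person_id="person-42",
    key_id="key-42",
    erased_at=now,
    tombstone_ref="bronze/tombstones/person-42",
)
```

Neither type can be passed where the other is expected without someone noticing, which is a cheap guard against treating an erased person as a deleted one.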

There is also a difference in retention and audit requirements. The rebuild deletion registry needs to survive as long as a Silver rebuild might need to draw from it, which is tied to the data retention policy for Bronze. The GDPR erasure registry needs to survive as long as the legal obligation to demonstrate compliance might need to be satisfied, which is a different and typically longer horizon. They should be owned, governed, and retained according to those different requirements.

5. Anonymization in practice

Deciding what the anonymized version of a record looks like is a per-field decision, and it requires choosing the right technique for each field based on whether the field contributes to aggregate analytics, whether it is used as a join key across tables, and whether any useful generalization is possible.

Generalization replaces a specific value with a less specific but still analytically meaningful one. A date of birth becomes the first day of the birth year, which is what one team applied by default at ingestion time and which removed the need to generalize it at erasure time. A full street address becomes a city or region. A precise salary figure becomes a band. The generalized value still contributes to demographic or geographic or financial analysis at an aggregate level, and it does so without allowing the individual to be identified from the field alone. With the companion column approach, this decision is made at ingestion design time and encoded in the source spec file. The erasure request does not trigger any generalization computation. It finds the companion column already in Bronze and deletes the key.

Generalization only provides a privacy guarantee when the resulting group is large enough that membership in it cannot identify the individual. Year of birth, city of residence, and gender together describe thousands of people in Amsterdam. The same combination describes probably just one in Hogebeintum (Friesland). The coarseness of the generalization bucket has to be calibrated against the population it represents, which means evaluating the chosen values in combination and in context, not field by field in isolation. A generalization that passes the field-level check can still fail the population-level check, and the combination sensitivity review described below is where that failure surfaces.

Tokenization replaces a personally identifying value with a stable opaque token. If two tables in the warehouse reference the same person by an identifier, both can be updated to the same anonymous token, and joins across those tables remain valid. The token is not reversible without a separate mapping, and that mapping should itself be access-controlled and subject to its own retention policy. Tokenization is the appropriate technique for fields that need to preserve join-ability but carry identity information, such as a customer identifier that is used across multiple Gold marts.
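A minimal sketch of the idea, assuming a hypothetical TokenVault class that holds the mapping. The tokens are random and are not derived from the identity, so they cannot be reversed without the vault.

```python
import secrets


class TokenVault:
    """Hypothetical mapping from identity to a stable opaque token."""

    def __init__(self):
        self._by_identity = {}

    def tokenize(self, identity):
        # Same identity always yields the same token while the mapping exists.
        if identity not in self._by_identity:
            self._by_identity[identity] = "tok_" + secrets.token_hex(16)
        return self._by_identity[identity]

    def forget(self, identity):
        # Destroying the mapping is what makes the tokens irreversible.
        self._by_identity.pop(identity, None)


vault = TokenVault()
orders = [{"customer_id": "C-1001", "amount": 40}]
tickets = [{"customer_id": "C-1001", "topic": "billing"}]

orders_tok = [{**r, "customer_id": vault.tokenize(r["customer_id"])} for r in orders]
tickets_tok = [{**r, "customer_id": vault.tokenize(r["customer_id"])} for r in tickets]

# The join across the two tables still works on the token.
joined = [
    (o["amount"], t["topic"])
    for o in orders_tok
    for t in tickets_tok
    if o["customer_id"] == t["customer_id"]
]
```

As long as the vault retains the entry, the tokenized rows remain tied to a person. Calling forget on that identity removes the only route back.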

Nulling removes the field value entirely and writes null. This is appropriate for free-text fields, biometric values, or any field where there is no useful generalization and no join-ability requirement to preserve. A comment field, a biometric hash, or a detailed address line in a raw-format field should typically be nulled. A field that ends up in this category is also worth scrutinizing at ingestion design time. If it carries no analytical value and cannot be generalized, GDPR’s data minimization principle raises the question of whether it should have been ingested at all. Nulling at erasure time is a valid fallback, but reaching for it during the source spec review is the better outcome.

It is worth being precise about what these techniques actually achieve, because they sit on a spectrum rather than all delivering the same thing. Generalization into a cohort that is genuinely large enough can reach anonymization, where the data no longer relates to an identifiable person and falls outside the scope of GDPR. Tokenization is different. As long as a reversible mapping from token back to identity exists somewhere, the data is pseudonymized rather than anonymized, and under GDPR pseudonymized data remains personal data carrying the full set of obligations. The mapping is the personal data, and the obligation persists until that mapping is itself destroyed. The crypto-shredding approach in this post deletes the per-subject key so that the pseudonymization becomes irreversible, which is the step that moves the encrypted fields from pseudonymized toward genuinely unrecoverable, again to the extent that no surviving copy of the key exists.

There is a combination sensitivity check to apply at the anonymization stage as well. When deciding how to generalize individual fields, verify that the combination of the chosen generalized values does not still allow re-identification. A 1960–1965 birth-year cohort combined with a city-level location combined with a specific job title might uniquely identify a person in a small team or a specialized role. If it does, the generalization is insufficient and either a coarser bucket or a null is appropriate. The goal is for the anonymized record to be incapable of identifying the individual even when all generalized attributes are read together.
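A simple way to approximate this check is to count how many records share each combination of generalized values and flag the combinations that fall below a threshold, in the spirit of a k-anonymity test. The function name, the sample records, and the threshold k are assumptions for illustration; a suitable threshold depends on the population and should be agreed with the privacy officers.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return the value combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


records = [
    {"birth_year": 1963, "city": "Amsterdam", "gender": "F"},
    {"birth_year": 1963, "city": "Amsterdam", "gender": "F"},
    {"birth_year": 1963, "city": "Amsterdam", "gender": "F"},
    {"birth_year": 1963, "city": "Hogebeintum", "gender": "F"},
]

flagged = small_groups(records, ("birth_year", "city", "gender"), k=2)
```

Here the Amsterdam combination passes and the Hogebeintum combination is flagged, so that record would need a coarser bucket or a null in at least one field.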

Data minimization, discussed in section 1, also lightens this stage directly: every field left out or coarsened at ingestion is one fewer per-field anonymization decision to make at erasure time. It is worth applying deliberately and early, but it does not substitute for having an erasure infrastructure at all.

6. The erasure cascade

The engineering deliverable is a script or API endpoint that accepts a person identifier and executes the full erasure cascade reliably and verifiably. The goal is for legal or compliance to be able to invoke it without requiring an engineer in the loop for each individual request. This is the same principle that ran through Part 5 on governance automation.

The cascade is a sequence of steps where the order matters, and two of the steps are irreversible.

1. Locate the person’s encryption key identifier in the GDPR erasure registry. If the person is not in the registry, the cascade cannot proceed, which indicates either that the person was never ingested or that the key was managed outside the registry (which would itself be a design gap to surface).
2. Write the anonymization tombstone marker to Bronze. The companion values were already computed at ingestion, so there is no work to do here beyond writing the marker, and writing it before the key is deleted keeps the audit trail complete from the moment of erasure.
3. Delete the encryption key from the external key store. This is the point of no return discussed earlier, and after it the pii_ fields for this person decrypt to NULL while the companion-column fallback supplies the generalized values.
4. Record the erasure completion in the GDPR erasure registry. The record carries the person identifier, the key identifier that was deleted, the timestamp of deletion, and a reference to the anonymization tombstone, which together form the audit record that the obligation was fulfilled and when.
5. Trigger a selective Silver rebuild for the affected entity, scoped by the entity identifier recorded in the tombstone. Until the rebuild runs, Silver and Gold may still carry rows derived from the pre-anonymization Bronze data, so the rebuild is what propagates the anonymization through every downstream layer. Until it completes, the erasure is incomplete downstream even though Bronze is already correct.
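The ordering of these steps can be sketched as a single function. Everything here is a simplified assumption: the registry and key store are plain dictionaries, Bronze is a list, and rebuild_silver is any callable that accepts the entity identifier. A real implementation would also need retries, idempotency, and access control.

```python
from datetime import datetime, timezone


def erase_subject(person_id, registry, key_store, bronze, rebuild_silver):
    """Run the erasure cascade for one person, in the order described above."""
    # Locate the key identifier; stop if the person is unknown to the registry.
    entry = registry.get(person_id)
    if entry is None:
        raise LookupError(f"no registry entry for {person_id}")
    key_id = entry["key_id"]

    # Write the tombstone to Bronze before touching the key.
    bronze.append({
        "type": "anonymization_tombstone",
        "entity_id": person_id,
        "written_at": datetime.now(timezone.utc).isoformat(),
    })
    tombstone_ref = len(bronze) - 1

    # Delete the key. This is the point of no return.
    key_store.pop(key_id, None)

    # Record the audit entry in the erasure registry.
    entry.update({
        "deleted_key_id": key_id,
        "erased_at": datetime.now(timezone.utc).isoformat(),
        "tombstone_ref": tombstone_ref,
    })

    # Propagate the anonymization downstream for this entity only.
    rebuild_silver(person_id)
    return entry


registry = {"person-42": {"key_id": "key-42"}}
key_store = {"key-42": b"secret"}
bronze = [{
    "entity_id": "person-42",
    "pii_date_of_birth": b"\x8f\x1a",  # placeholder ciphertext bytes
    "date_of_birth": "1963-01-01",
}]
rebuilt = []  # stand-in for the selective rebuild: just records the call
audit = erase_subject("person-42", registry, key_store, bronze, rebuilt.append)
```

The function refuses to proceed for an unknown person before anything irreversible happens, and the tombstone is in Bronze before the key is removed, so a failure between those two steps leaves an auditable trace.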

After the rebuild completes, the state is verifiable. A scan of Bronze for this person’s records finds only unreadable ciphertext and the anonymization tombstone. Silver history for this person shows only generalized attributes across all temporal records. Gold aggregates include the anonymized row’s contribution but do not expose PII. The erasure registry carries the audit record with the completion timestamp and key reference. Together these constitute a verifiable completion state that can be reviewed by legal or a data protection officer without re-running the cascade.

This verifiable completion state is also what distinguishes the crypto-shredding approach from the first client’s compensating-control arrangement. In the compensating-control approach, the rebuild re-materializes PII in downstream layers and the erasure script is scheduled to remove it again, so the erasure holds only as long as the scheduling holds. In the crypto-shredding approach there is no key left to decrypt the original rows, so a rebuild triggered six months or six years after the erasure produces the same generalized output every time.

7. The rebuild as verification and guidelines for good architecture

When Silver is rebuilt from Bronze after a completed GDPR erasure, the output for the affected person is deterministic. The pii_ columns decrypt to NULL and the companion-column fallback described earlier supplies the generalized history. There is no path by which PII can re-enter the history model, because the key that would decrypt it no longer exists. Running the rebuild and inspecting its output is how you confirm that the anonymization has propagated correctly through Silver history and the Gold layer, rather than treating the key deletion alone as sufficient.

Bronze completeness is verifiable separately. The person’s encryption key should no longer exist in the external key store, a decrypting query against the pii_ fields for that person should return only NULL values, and the companion columns should carry the generalized values alongside. The anonymization tombstone marker ties the audit record to the specific Bronze records and the moment of erasure. The erasure registry’s audit record ties the Bronze verification and the rebuild verification together into a single traceable completion state.
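These Bronze checks lend themselves to an automated report. The sketch below assumes Bronze rows are dictionaries carrying an entity_id, that tombstones are marked with a type field, and that decrypt returns None when the key is missing. All of those names are hypothetical.

```python
def verify_bronze_erasure(person_id, key_id, key_store, bronze_rows, decrypt):
    """Return a dict of named checks for one person's Bronze state."""
    rows = [r for r in bronze_rows if r.get("entity_id") == person_id]
    tombstones = [r for r in rows if r.get("type") == "anonymization_tombstone"]
    data_rows = [r for r in rows if r.get("type") != "anonymization_tombstone"]
    pii_cells = [(r, c) for r in data_rows for c in r if c.startswith("pii_")]
    return {
        "key_absent": key_id not in key_store,
        "pii_unreadable": all(decrypt(key_id, r[c]) is None for r, c in pii_cells),
        "companions_present": all(c[len("pii_"):] in r for r, c in pii_cells),
        "tombstone_present": len(tombstones) > 0,
    }


key_store = {}  # the key for person-42 has already been deleted


def decrypt(key_id, blob):
    # Placeholder decryption: None when the key is gone.
    return None if key_id not in key_store else "<plaintext>"


bronze_rows = [
    {
        "entity_id": "person-42",
        "pii_date_of_birth": b"\x8f\x1a",  # placeholder ciphertext bytes
        "date_of_birth": "1963-01-01",
    },
    {"type": "anonymization_tombstone", "entity_id": "person-42"},
]

report = verify_bronze_erasure("person-42", "key-42", key_store, bronze_rows, decrypt)
```

A report in which every check is true can be attached to the erasure registry entry, so the reviewer sees the evidence without re-running the cascade.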

There is one residual gap that the rebuild does not reach. Derived artifacts outside the main Bronze-Silver-Gold pipeline are not touched by it. Exports sent to third-party systems, BI snapshots sent to external teams, cached extracts that were materialized before the erasure ran, and data shared with downstream applications outside the warehouse all require separate controls.

The right response to this gap is a data inventory that tracks where derived copies exist and what controls apply to them. The warehouse is the layer where most of the volume and most of the sensitive data lives, and making it erasure-safe by design is the biggest part of the problem. The derived-artifact gap is a separate and important control, but it requires a different instrument than the rebuild.

The design guidelines that follow from the argument in this post are worth stating as a set, not because they are surprising but because each one represents a decision that has to be settled at design time, while the schema is still being shaped, rather than discovered later under the pressure of a request.

Evaluate PII and special-category data at ingestion design time, including combination sensitivity. Evaluating fields in isolation is not enough. Evaluate them in the context of the other fields on the same record and the identities those combinations can imply. Defer this evaluation and you will discover the exposure after the data has been ingested and the remediation is expensive.

Use field-level encryption with a per-subject key at ingestion, managed outside the warehouse. Use a consistent naming convention: prefix encrypted columns with pii_ and store the pre-computed generalized value in a companion column with the plain name. A source spec file can declare which columns are PII and how to generalize each one, making this automatable at the extractor level and auditable by privacy officers independently of the pipeline code. Do not treat erasure as a problem you will solve by scanning raw files when the time comes. As the HR-platform example showed, that route is slow, expensive, and difficult to verify as complete.

Write an anonymization tombstone marker to Bronze before deleting the key. The tombstone is the audit record that erasure was executed, and it records the entity identifier so the rebuild can be scoped to that entity’s rows. It does not need to carry replacement values, since the companion columns already hold them, and it is not what drives the value substitution, which the COALESCE fallback handles on its own.

Keep the GDPR erasure registry separate from the rebuild deletion registry. They record different facts, serve different purposes, and carry different retention and audit requirements. Conflating them creates a structure that serves neither purpose correctly.

Build the erasure cascade as an operable artifact. Legal and compliance should be able to invoke it against a person identifier without requiring an engineer each time.

Treat the rebuild as the final step in the cascade and as the verification of completeness. Do not declare an erasure complete until the rebuild has run and the output has been inspected for the affected entity. Treating the rebuild as optional or deferrable misunderstands its function, because until it runs the anonymization exists only in Bronze while Silver and Gold still carry the derived form produced before the key was deleted.

Part 6 and Part 7 built the case for treating deletion as an architectural concern rather than a pipeline edge case. The solution they described (a rebuild deletion registry, a deletion marker row, a rebuildable Silver model) was built for pipeline correctness. Part 9 starts from the same warehouse layers but introduces an obligation with different origins and wider reach.

The four-part response described here is the structural answer to the tension between the warehouse’s design for durability and the law’s requirement to forget. The principles that make the warehouse trustworthy for analytics (immutability, append-only history, shared entities) can be preserved through the erasure if that erasure is designed in from the start, provided the key destruction is genuinely irreversible and reaches any backups. What cannot be designed in after the fact is the key management setup, the field-level encryption at ingestion, the companion column schema, and the source spec decisions that fix how each PII field is generalized. Those decisions need to happen before the first row lands. By the time an erasure request arrives, the architectural choices that determine whether you can honor it cleanly have already been made or missed, which is why this work belongs in the design phase rather than in an eventual scramble to remediate.
