By Bram Durieux

The Modern Data Engineer Part 8: What PII Actually Is and How Ordinary Fields Become Special-Category Data

GDPR personal data reaches well beyond names and emails, and ordinary fields can combine into Article 9 special-category data. This part explains why that combination sensitivity has to be caught at ingestion design rather than in a later legal review.

1. Reading the HR schema

I was working through the source schema for an HR data platform at another client, deciding what to ingest. The usual work. You go column by column, you figure out the grain, you decide what belongs in the warehouse and what stays in the source. Most of it is unremarkable. Names, employee numbers, job grades, locations, hire dates. The kind of fields you have modelled a hundred times.

Then there was a column called “partners”. On its own it looked like nothing worth a second thought. A contact person, a benefits beneficiary, an emergency contact, any number of ordinary relationships that an HR system needs to track. If you had asked me to classify it in isolation, I would have called it a contact field and moved on.

The thing is, that column shares a record with the employee’s identity, their job grade, their location, and their tenure. And once you read the partners field alongside those, the combination starts to point somewhere the field alone never does. It can reveal the employee’s sexual orientation.

That inference moves the record into Article 9 territory, which is the strictest tier GDPR defines, even though no single field on that record announces anything sensitive. The exposure was not in the column I was looking at. It was in the company the column was keeping.

That schema review surfaced something the everyday working definition of personal data does not prepare you for. So it is worth being precise about what the definition actually says, because the gap between the legal definition and the engineer’s gut sense of “sensitive” is where this kind of thing slips through.

2. What GDPR counts as personal data

GDPR defines personal data as any information relating to an identified or identifiable natural person. It’s worth reading that slowly, because it reaches a great deal further than the cases that come to mind first.

Most engineers carry a mental model built around fields that look obviously sensitive. National ID numbers, payment card data, passport numbers, the things you would instinctively lock down. The categories in that mental model do count, but they are a small slice of the actual definition. IP addresses fall inside the definition. So do device identifiers, location data, online identifiers, and professional information tied to a named individual. A row that links a person to a building access timestamp is personal data. A field that records which laptop someone was issued is personal data. None of those would survive the “does this look like a passport number” test, and all of them are in scope.

The correction this asks of you is straightforward to state and surprisingly hard to practice. The obvious-sensitive heuristic is exactly the filter that lets ordinary-looking fields through unexamined, because they do not trip the instinct that a national ID number trips. The partners field passed that instinct without resistance.

3. When a technical key is really a person

What makes a field personal data is whether it lets you single out one person and tell them apart from everyone else. You do not need to be able to read a name off it for that to hold. The IP address mentioned earlier counts for exactly this reason, since it pins down a connection and follows it even though it names no one. The same reasoning reaches a category engineers rarely think of as personal data at all, which is the opaque identifier generated inside a source system.

A customer GUID, an account number, a surrogate key minted by the source: these read as random strings or meaningless integers, so they tend to get waved into the warehouse as technical plumbing rather than personal data. What actually settles it is whether the key resolves to one person, through a dimension table or through a join that someone in the organisation can make, and not how readable the key happens to be. A GUID that stays stable per person across loads identifies them as reliably as their email address does, even though no human can tell who they are by looking at it.

Hashing or tokenising the value does not change that on its own. A hashed customer id is pseudonymised rather than anonymised, and as long as the mapping back to the person exists somewhere, the data still carries the full set of obligations. Part 9 leans on that distinction heavily. It matters at ingestion design because these keys are the join spine of the warehouse, so they reach into every layer and every mart, and “it is just a technical id” is the assumption that threads identifying data through parts of the platform nobody ever flagged as holding PII. Within personal data there is also a smaller and stricter tier, and the partners inference lands squarely in it, which is what comes next.
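The point about hashed keys can be made concrete with a minimal sketch. Everything named here is hypothetical: the `pseudonymise` helper, the `customer_dim` mapping, and the sample ids are illustrations, and unsalted SHA-256 is chosen only to keep the example short, not as a recommendation. The sketch shows that the token stays stable across loads and that anyone holding the source keys can rebuild the link back to the person.

```python
import hashlib


def pseudonymise(customer_id: str) -> str:
    """Return a SHA-256 hex digest of the id (hypothetical helper)."""
    return hashlib.sha256(customer_id.encode("utf-8")).hexdigest()


# Hypothetical source-side dimension that resolves a key to a person.
customer_dim = {"c-1001": "Alice Example", "c-1002": "Bob Example"}

# The same input yields the same token on every load, so the token
# still singles out one person even though it is unreadable.
token_load_1 = pseudonymise("c-1001")
token_load_2 = pseudonymise("c-1001")

# Whoever holds the source keys can recompute the tokens and rebuild
# the mapping from token back to person.
reverse_map = {pseudonymise(key): name for key, name in customer_dim.items()}
resolved = reverse_map[token_load_1]
```

As long as something like `customer_dim` exists anywhere in the organisation, the hashed column behaves as a pseudonym for a person and not as anonymous data.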

4. Article 9 special-category data

GDPR Article 9 defines special-category data, a stricter regime layered on top of ordinary personal data. The categories are worth listing in full, both because the list is shorter than people expect and because seeing it laid out makes the partners problem obvious in hindsight.

  • Racial or ethnic origin
  • Political opinions
  • Religious or philosophical beliefs
  • Trade union membership
  • Genetic data
  • Biometric data processed for the purpose of uniquely identifying a person
  • Health data
  • Sex life or sexual orientation

Calling this regime stricter describes concrete consequences for the engineer, not a matter of legal tone. Processing special-category data requires an additional Article 9(2) condition on top of the Article 6 lawful basis, the obligations under the regulation are heavier, and the controls expected at ingestion, storage, and erasure are correspondingly tighter. A field you might have generalized at leisure becomes a field you have to justify holding at all.

Sexual orientation sits inside that final category. That is why the partners inference is not a minor classification quibble that a privacy officer might wave through. Reading the partners field next to identity, grade, location, and tenure carries the record straight into the heaviest tier the regulation defines. A single inference is enough to take an otherwise ordinary employee record all the way there.

With both tiers defined, the actual problem comes into focus, and it has very little to do with classifying any single field correctly.

5. How innocuous fields combine into special-category data

Here is the centerpiece. Fields that are harmless in isolation can compound into Article 9 special-category data simply because they share a record. The sensitivity is a property of the relationships between fields rather than of any field on its own, which makes it far more uncomfortable to look for, because the thing you have to inspect is the record as a whole.

Walk the partners field through it now that the regulatory frame is in place. Taken alone, it reads as a neutral contact attribute, the sort of thing every HR system carries. Bring it onto the same record as the employee’s identity, their job grade, their location, and their tenure, and the picture changes. The combination can support an inference about the employee’s household and personal life that reaches sexual orientation, which is Article 9 data carrying the full stricter regime. No single field on that record states anything of the kind. The inference is assembled out of the join.

The CJEU has ruled on this pattern directly. In case C-184/20 [1], the court held that publishing a person’s partner’s name constitutes processing of special-category data under Article 9, because it permits sexual orientation to be inferred by comparison or deduction. The case concerned a Lithuanian transparency register rather than an HR warehouse, but the combination at issue was the same one I had on the table. A field that names a partner, sitting on the same record as fields that identify the individual.

What we did about it was decide the raw value never needed to enter the warehouse at all. Working with the client’s privacy officers, the team derived a boolean has_partner_registered before ingestion and loaded that instead of the original field. The reporting requirements were satisfied by knowing whether a partner was on file. The inference-bearing detail, the part that actually carried the orientation signal, never made it past the extractor. The combination that worried us could no longer be assembled, because one of its ingredients was gone by design.
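A derivation like that can be sketched in a few lines. This is an illustration under assumptions and not the client's actual extractor: the function name `derive_partner_flag`, the row-as-dict shape, and the rule that a blank value counts as "no partner on file" are all hypothetical. Only the output column name `has_partner_registered` comes from the project described above.

```python
from typing import Any


def derive_partner_flag(source_row: dict[str, Any]) -> dict[str, Any]:
    """Replace the raw partners value with a boolean before loading.

    Hypothetical extractor step: the raw value is dropped here, so it
    never reaches the warehouse.
    """
    row = dict(source_row)            # copy, leave the caller's row untouched
    raw = row.pop("partners", None)   # remove the inference-bearing detail
    # Assumption: None, empty, or whitespace-only means no partner on file.
    row["has_partner_registered"] = bool(raw and str(raw).strip())
    return row


loaded = derive_partner_flag({"employee_id": 7, "partners": "J. Example"})
```

The design choice is that the raw value is removed in the same step that produces the flag, so no downstream layer can join it back onto the employee record.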

The same pattern shows up in other domains, for instance in anything that touches health. A medical appointment timestamp, on its own, is a time. A department name is an organisational label. A job title is a job title. Put an appointment timestamp next to a department name and a job title on the same record and the three together can reveal a health condition that not one of them states alone. The mechanism is identical to the partners case: harmless components that assemble into something the regulation treats as sensitive.

If the exposure lives in the combinations rather than in the fields, then the place teams usually go looking for it turns out to be both the wrong place and the wrong time.

6. Why this has to be caught at ingestion design

A legal review that inspects fields one at a time, after the schema is finalized, cannot reliably see combination sensitivity. That is the core of it. Each field passes its own review, because each field genuinely is innocuous on its own. The partners field clears the check. Job grade clears it. Location and tenure clear it. The risk is in the product of all of them, and a field-by-field pass over a frozen schema never multiplies them together. Such a review is structurally incapable of seeing what it is looking for.

The point where the combinations are still yours to shape is ingestion design. That is the moment you are deciding which fields to bring in, at what grain, and alongside which others on the same record. Once the schema is locked and the data has landed, the combinations exist whether anyone reviewed them or not. Before that, they are still choices. The whole leverage of the situation sits in that window, which is why the evaluation has to happen there and not downstream.

This also reshapes the working question you carry into a schema review. The instinct is to ask, of each field, whether it is PII. That question is still worth asking, but it is not the one that catches the partners problem, because partners answers it with a confident “no”. The more useful habit is to keep asking what having a given field on the same record as these other fields lets someone infer about the individual. Over time you find yourself interrogating the record as a whole rather than the columns one by one, because that is the level at which the orientation signal actually lived. It was assembled out of the join, and the join is what the schema review has to be able to see.
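The difference between the two questions can be sketched as two checks over the same column list. This is a minimal illustration, assuming a hand-maintained rule table that would in practice be agreed with privacy officers. The column names, the per-field verdicts, and the two rules (taken from the partners and appointment examples above) are hypothetical, and a match only means "look closer", not a legal classification.

```python
# Hypothetical verdicts from a field-by-field "is this obviously sensitive?" pass.
FIELD_IS_OBVIOUSLY_SENSITIVE = {
    "employee_id": False,
    "job_grade": False,
    "location": False,
    "tenure_years": False,
    "partners": False,
}

# Hypothetical rule table: sets of fields that, when present on one record,
# may support an inference about an Article 9 category.
RISKY_COMBINATIONS = [
    (frozenset({"employee_id", "partners"}), "sexual orientation"),
    (frozenset({"appointment_ts", "department", "job_title"}), "health"),
]


def field_by_field_flags(columns):
    """Return the columns that an isolated per-field check would flag."""
    return [c for c in columns if FIELD_IS_OBVIOUSLY_SENSITIVE.get(c, False)]


def combination_flags(columns):
    """Return the Article 9 categories whose full field set is on the record."""
    present = set(columns)
    return [category for fields, category in RISKY_COMBINATIONS
            if fields <= present]


hr_record = ["employee_id", "job_grade", "location", "tenure_years", "partners"]
per_field = field_by_field_flags(hr_record)     # nothing flagged
per_record = combination_flags(hr_record)       # the combination is flagged
```

The per-field pass returns nothing for the HR record, while the record-level pass surfaces the partners combination, which is the gap a schema review has to close.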

That same logic is what makes the erasure architecture in Part 9 a design-time concern rather than an afterthought. The decisions about what to encrypt, what to generalize, and what to never ingest at all are design-time decisions for exactly the reason combination sensitivity is. By the time the data has landed, the choices that determine your exposure have already been made.

Recognising the exposure at design time also hands you a concrete lever, and the cheapest one to pull is to carry less in the first place.

7. Data minimization as a partial response

Data minimization is the deliberate response that follows from evaluating exposure at design time. You leave out or coarsen whatever the reporting requirements do not actually depend on. The discipline is to start from what the reports need and ingest up to that line, rather than starting from what the source offers and trimming back.

The same HR platform shows what this looks like in practice. The team stored city only, leaving the street address out of the warehouse entirely. Dates of birth were truncated to January 1st of the birth year. Personal detail was minimized wherever the reporting did not depend on it. None of those measures cost the analytics anything, because the reports needed city-level geography and age bands. What they removed was surface area.
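Those measures can be sketched as an allow-list plus one coarsening step. The sketch is hypothetical throughout: the `REPORT_FIELDS` allow-list, the `minimise` function, the source column names, and the output column `birth_year_start` are illustrations and not the project's actual schema. It assumes `date_of_birth` arrives as a `datetime.date`.

```python
from datetime import date
from typing import Any

# Hypothetical allow-list: start from what the reports need, not from
# what the source offers. Anything not named here is never ingested.
REPORT_FIELDS = ("employee_id", "job_grade", "city")


def minimise(source_row: dict[str, Any]) -> dict[str, Any]:
    """Keep only allow-listed fields and coarsen the date of birth."""
    row = {name: source_row[name] for name in REPORT_FIELDS}
    dob: date = source_row["date_of_birth"]
    # Truncate to January 1st of the birth year; age bands still work.
    row["birth_year_start"] = date(dob.year, 1, 1)
    return row


minimised = minimise({
    "employee_id": 7,
    "job_grade": "G5",
    "city": "Ghent",
    "street_address": "1 Example Street",
    "date_of_birth": date(1987, 6, 23),
})
```

Because the function builds the output from the allow-list, a new column added to the source stays out of the warehouse until someone decides a report needs it.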

I want to be honest about the limit of this, because minimization is easy to oversell. It shrinks the surface area you have to govern and protect later, which is genuine and worth the effort. It does not remove the obligation for the fields you do ingest. Those still have to be governed, and when an erasure request arrives they still have to be erasable. Minimization makes the eventual erasure problem smaller without making it disappear.

That remaining obligation, for the fields you keep, is exactly what the next post in the series has to architect for.

8. Where this leads

The through-line is short enough to hold in one thought. Personal data is wider than the obvious cases, the strictest tier of all can be assembled out of fields that are individually harmless, and the only reliable place to catch that assembly is at ingestion design, while the combinations are still yours to shape.

Part 9 takes up what follows from this. Now that we have established what the erasure obligation actually has to cover and why those decisions land at design time, Part 9 works through how to honor the right to erasure across the layered warehouse, including a Bronze layer that was deliberately built to be immutable.

It runs straight into the thread from Part 6 and Part 7, which argued that durable, replayable history is what makes the warehouse trustworthy in the first place. The erasure obligation pulls against exactly those properties, and Part 9 is where that tension gets resolved.

The partners field changed how I read a schema. I used to work down the columns one at a time, asking what each one was, and that habit had served me well enough that I never questioned it. What it missed was everything happening between the columns. These days I spend less time worrying about whether a given field is sensitive and more time asking what the record gives away once all of its fields are read together, because that is where the real exposure had been hiding the whole time. It is a slower way to work through a source schema, and it catches things that the old way would have waved straight into the warehouse, which is a trade I will take every time on a system that holds data about people.

References

[1] Court of Justice of the European Union, OT v Vyriausioji tarnybinės etikos komisija, Case C-184/20, judgment of 1 August 2022. [Online]. Available: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:62020CJ0184
