SoatDev IT Consulting
  • June 19, 2025
  • Rss Fetcher

This post looks at whether you should delete names or replace names when deidentifying personal data. With structured data, generating synthetic names does not increase or decrease privacy. But with unstructured data, replacing real names with randomly generated names increases privacy protection.

Structured data

If you want to deidentify structured data (i.e. data separated into columns in a database) then an obvious first step is to remove a column containing names. This is not sufficient: deidentification is subtle precisely because it is possible to identify people after the obvious identifiers have been removed. But removing names is clearly necessary.

Instead of removing the names, you could replace them with randomly generated names. This neither hurts nor helps privacy. But this might be useful, say, when creating test data.

Say you’re developing software that handles patient data. You want to use real patient data because you want the test data to be realistic, but that may not be acceptable (or legal) due to privacy risks. You can’t just remove names because you need to test whether software displays names correctly, so you replace real names with randomly generated names. No harm, no foul.
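As a minimal sketch of that kind of test-data preparation, assuming records are held as a list of dictionaries, the name field can be overwritten with a randomly drawn name while every other field is left alone. The name pools, the field name "name", and the function synthesize_names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical name pools; a real project would draw from much larger lists.
FIRST_NAMES = ["Alice", "Brian", "Carmen", "Deepak", "Elena", "Farid"]
LAST_NAMES = ["Nguyen", "Okafor", "Petrov", "Quinn", "Ramirez", "Sato"]

def synthesize_names(rows, name_field="name", seed=None):
    """Return copies of rows with name_field replaced by a random name.

    Names are drawn independently of the originals, so the output
    carries no mapping back to the real names.
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the original record is not modified
        new_row[name_field] = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        out.append(new_row)
    return out

# Made-up records, for illustration only.
patients = [
    {"name": "Virginia Hall", "diagnosis": "hypertension"},
    {"name": "Thomas Reed", "diagnosis": "asthma"},
]
test_data = synthesize_names(patients, seed=0)
```

The other columns still need their own deidentification review; this only covers the name field.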

Unstructured data

Now let’s switch contexts and look at unstructured data, such as text transcriptions of doctors’ notes. You still need to remove names, but it’s not as simple as removing the name column. Instead of the database structure telling you what’s a name, what’s a diagnosis, etc., you have to infer these things statistically [1], which means there will always be errors [2].

When working with unstructured text, replacing names with randomly generated names is better for privacy. If you could identify names with 100% accuracy, then it would make no difference whether you removed or replaced names, but you can’t. Software will always miss some names [3].

Suppose your software replaces suspected names with “NAME REDACTED.” When your software fails to identify a name, it’s obvious that it failed. For example,

NAME REDACTED and his wife Margaret came into the clinic today to be tested for …

But if instead your software replaced suspected names with synthetic names, it is not obvious when the software fails. When you see a reference to Margaret, you can’t know whether the patient’s name was Virginia and was replaced with Margaret, or whether the software made an error and skipped over Margaret’s name.
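The difference can be illustrated with a toy example. The sketch below assumes a deliberately weak, hypothetical detector that only recognizes names in a small fixed set, so it misses “Margaret.” Real deidentification software uses statistical models rather than a lookup table; the names KNOWN_NAMES, SYNTHETIC_NAMES, redact, and replace are all invented here.

```python
import random
import re

# Hypothetical detector vocabulary: it does not know "Margaret", so it misses her.
KNOWN_NAMES = {"Virginia", "Robert"}
# Hypothetical pool of replacement names.
SYNTHETIC_NAMES = ["Margaret", "Helen", "James", "Walter"]

def find_names(text):
    """Return the words that the toy detector flags as names."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in KNOWN_NAMES]

def redact(text):
    """Replace every detected name with a fixed marker."""
    for name in find_names(text):
        text = re.sub(rf"\b{re.escape(name)}\b", "NAME REDACTED", text)
    return text

def replace(text, rng):
    """Replace every detected name with a randomly chosen synthetic name."""
    for name in find_names(text):
        text = re.sub(rf"\b{re.escape(name)}\b", rng.choice(SYNTHETIC_NAMES), text)
    return text

note = "Robert and his wife Margaret came into the clinic today."
print(redact(note))   # the missed name stands out next to the marker
print(replace(note, random.Random(1)))  # the missed name blends in with synthetic ones
```

In the redacted output the surviving “Margaret” is plainly a real name. In the replaced output a reader cannot tell which of the names, if any, are real.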

All else being equal, it’s better to synthesize names than remove names. But that doesn’t mean that just any process for synthesizing names will be adequate. The error rate doesn’t need to be zero, but it can’t be too high either. And the process should not be biased. If the software consistently let Hispanic names slip through, for example, Hispanic patients would not appreciate that.
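One way to look for that kind of bias, sketched here under the assumption that you have a labeled test set recording, for each true name, which group it belongs to and whether the software detected it, is to compare miss rates across groups. The group labels and results below are made up purely for illustration, and miss_rate_by_group is a hypothetical helper.

```python
from collections import defaultdict

def miss_rate_by_group(records):
    """Compute the fraction of true names missed, per group.

    records: iterable of (group, was_detected) pairs, one per true name
    in a labeled evaluation set.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for group, detected in records:
        totals[group] += 1
        if not detected:
            misses[group] += 1
    return {group: misses[group] / totals[group] for group in totals}

# Made-up evaluation results, not real measurements.
results = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", True),
]
rates = miss_rate_by_group(results)
print(rates)  # group B is missed twice as often as group A in this toy data
```

A large gap between groups would suggest the process is biased even if the overall error rate looks acceptable. With a realistic sample size you would also want to check whether the gap is statistically meaningful.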

Related posts

  • HIPAA expert determination
  • Data privacy FAQ
  • Local differential privacy

[1] “But what if I use a large language model?” That’s a particular kind of statistical inference.

[2] Any statistical test, such as testing whether a string of text represents a name, will have false positives (type I error) and false negatives (type II error). Software to remove names will remove some text that isn’t a name, and fail to recognize some text as names.

[3] We’ve tested a lot of deidentification software packages for clients, and the error rate is never zero. But privacy regulations such as HIPAA don’t require the error rate to be zero, only sufficiently small.

The post Deleting vs Replacing Names first appeared on John D. Cook.
