SoatDev IT Consulting
SoatDev IT Consulting
  • About us
  • Expertise
  • Services
  • How it works
  • Contact Us
  • News
  • July 23, 2023
  • Rss Fetcher

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is the one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model is telling us about a real man name Vivek, though of course that could be a randomly chosen male name.

If a neural network were training on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records—this happens—then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.

Related posts

  • Sharing data without letting it go
  • Conspicuously missing data
  • Data privacy consulting

The post How an LLM might leak medical data first appeared on John D. Cook.

Previous Post
Next Post

Recent Posts

  • How is Technology Modernizing Recruitment in Temporary Employment Services
  • Banking on a serverless world
  • Court denies Apple’s request to pause ruling on App Store payment fees
  • Cursor’s Anysphere nabs $9.9B valuation, soars past $500M ARR
  • Circle IPO soars, giving hope to more startups waiting to go public

Categories

  • Industry News
  • Programming
  • RSS Fetched Articles
  • Uncategorized

Archives

  • June 2025
  • May 2025
  • April 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023

Tap into the power of Microservices, MVC Architecture, Cloud, Containers, UML, and Scrum methodologies to bolster your project planning, execution, and application development processes.

Solutions

  • IT Consultation
  • Agile Transformation
  • Software Development
  • DevOps & CI/CD

Regions Covered

  • Montreal
  • New York
  • Paris
  • Mauritius
  • Abidjan
  • Dakar

Subscribe to Newsletter

Join our monthly newsletter subscribers to get the latest news and insights.

© Copyright 2023. All Rights Reserved by Soatdev IT Consulting Inc.