SoatDev IT Consulting
SoatDev IT Consulting
  • About us
  • Expertise
  • Services
  • How it works
  • Contact Us
  • News
  • February 8, 2024
  • Rss Fetcher

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        print("File: ", filename, "n")    
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    documentclass{article}
    begin{document}
    Hello world.
    end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

Related posts

  • Save as PDF twice and you get different files
  • Put PDF properties in a LaTeX file
  • Conspicuously missing data

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

The post Your PDF may reveal more than you intend first appeared on John D. Cook.

Previous Post
Next Post

Recent Posts

  • AI dev tools for Windows get a fresh coat of paint
  • Microsoft wants to tap AI to accelerate scientific discovery
  • It’ll soon be free to publish apps to the Microsoft Store
  • NLWeb is Microsoft’s project to bring more chatbots to webpages
  • Devs can now tap Microsoft Edge to power AI web apps

Categories

  • Industry News
  • Programming
  • RSS Fetched Articles
  • Uncategorized

Archives

  • May 2025
  • April 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023

Tap into the power of Microservices, MVC Architecture, Cloud, Containers, UML, and Scrum methodologies to bolster your project planning, execution, and application development processes.

Solutions

  • IT Consultation
  • Agile Transformation
  • Software Development
  • DevOps & CI/CD

Regions Covered

  • Montreal
  • New York
  • Paris
  • Mauritius
  • Abidjan
  • Dakar

Subscribe to Newsletter

Join our monthly newsletter subscribers to get the latest news and insights.

© Copyright 2023. All Rights Reserved by Soatdev IT Consulting Inc.