SoatDev IT Consulting
SoatDev IT Consulting
  • About us
  • Expertise
  • Services
  • How it works
  • Contact Us
  • News
  • April 20, 2024
  • Rss Fetcher

Arshad Khan left a comment on my post on the less and more utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”

Apparently this is an undocumented feature of GNU less. It works, but I don’t see anything about it in the man page documentation.

Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.

A more portable way to extract text from a PDF would be to use something like the pypdf Python module:

    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text()) 

The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?

PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.

Related post: Your PDF may reveal more than you intend

The post Extract text from a PDF first appeared on John D. Cook.

Previous Post
Next Post

Recent Posts

  • Week in Review: Why Anthropic cut access to Windsurf
  • Will Musk vs. Trump affect xAI’s $5 billion debt deal?
  • Superblocks CEO: How to find a unicorn idea by studying AI system prompts
  • Sage Unveils AI Trust Label to Empower SMB’s
  • How African Startups Are Attracting Global Fintech Funding

Categories

  • Industry News
  • Programming
  • RSS Fetched Articles
  • Uncategorized

Archives

  • June 2025
  • May 2025
  • April 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023

Tap into the power of Microservices, MVC Architecture, Cloud, Containers, UML, and Scrum methodologies to bolster your project planning, execution, and application development processes.

Solutions

  • IT Consultation
  • Agile Transformation
  • Software Development
  • DevOps & CI/CD

Regions Covered

  • Montreal
  • New York
  • Paris
  • Mauritius
  • Abidjan
  • Dakar

Subscribe to Newsletter

Join our monthly newsletter subscribers to get the latest news and insights.

© Copyright 2023. All Rights Reserved by Soatdev IT Consulting Inc.