SoatDev IT Consulting
  • August 7, 2023
  • Rss Fetcher

A common idiom in command line processing of text files is

    ... | sort | uniq | ...

Some process produces lines of text. You pipe that text through sort to put the lines in alphabetical order, then pass the result to uniq to drop repeated lines so that each distinct line appears once. The uniq utility only removes adjacent duplicates, so it will not remove all duplicates unless the input is sorted. (At least the duplicate lines need to be grouped together; the groups need not be in any particular order.)
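A small sketch of the adjacency point, using a made-up three-line input in which the duplicate lines are not next to each other. The variable names are only for illustration.

```shell
# Three lines with a non-adjacent duplicate: "b" appears first and last.
input=$(printf 'b\na\nb\n')

# uniq alone: the two "b" lines are not adjacent, so both survive.
unsorted_result=$(printf '%s\n' "$input" | uniq)

# sort first: the duplicates become adjacent, so uniq collapses them.
sorted_result=$(printf '%s\n' "$input" | sort | uniq)

printf 'uniq alone:\n%s\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n' "$sorted_result"
```

The first result still contains both copies of "b"; the second contains each line once.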

When given the -u flag, sort will sort and remove duplicates. This means the idiom above could be shortened to

    ... | sort -u | ...
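As a quick check that the two forms agree, here is a sketch on a small made-up sample; the sample words and variable names are assumptions for illustration only.

```shell
# A made-up sample with repeated lines in no particular order.
sample=$(printf 'pear\napple\npear\nfig\napple\n')

# The two-command idiom and the single-command form.
two_step=$(printf '%s\n' "$sample" | sort | uniq)
one_step=$(printf '%s\n' "$sample" | sort -u)

if [ "$two_step" = "$one_step" ]; then
    echo "same output"
fi
printf '%s\n' "$one_step"
```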

Aside from saving a few keystrokes, is there any advantage to the latter? There could be, depending on how sort -u is implemented. If internally it simply sorts its input and then removes duplicates, then there is no advantage. But if the code simultaneously sorts and removes duplicates, it could save memory and time, depending on the input. If the code discarded duplicates as they were encountered, it would need working memory proportional to the amount of unique input rather than the total amount of input.
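To make the "discard duplicates as they are encountered" idea concrete, here is a minimal sketch using awk in front of sort. This is not a claim about how any sort implementation handles -u internally; it only illustrates a pipeline whose working set grows with the number of distinct lines. The function name dedupe_then_sort is hypothetical.

```shell
# Hypothetical helper: drop repeated lines as they arrive, then sort the survivors.
# awk keeps one array entry per distinct line seen so far, so its memory use
# tracks the unique input, and sort only ever receives the unique lines.
dedupe_then_sort() {
    awk '!seen[$0]++' | sort
}

streamed=$(printf 'b\na\nb\nc\na\n' | dedupe_then_sort)
printf '%s\n' "$streamed"
```

The awk filter prints a line only the first time it appears, so the duplicates never reach sort.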

I had a project recently that makes a good test case for this. The Gutenberg text corpus contains a list of words for 55,000 different sources, each in a separate text file. There are a lot of files, and there is a lot of redundancy between files. The combined file is 3.4 GB.

Running sort -u on the file took 394 seconds.

Running sort | uniq on the file took 610 seconds.

So in this example, sort -u not only saved a few keystrokes, it took about 35% off the time (216 of 610 seconds).
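A comparison along these lines can be sketched as follows. The generated test file is an assumption standing in for the 3.4 GB corpus, which is not reproduced here, so the timings will not match the numbers above. The sketch measures whole seconds with date +%s, which is widely available but coarse; on a file this small both times may well print as 0.

```shell
# Hypothetical test file: 200,000 lines drawn from only 1,000 distinct values.
words_file=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) print "word" ((i * 7919) % 1000) }' > "$words_file"

# Time the single-command form.
start=$(date +%s)
sort -u "$words_file" > "$words_file.one"
one_step_secs=$(( $(date +%s) - start ))

# Time the two-command idiom.
start=$(date +%s)
sort "$words_file" | uniq > "$words_file.two"
two_step_secs=$(( $(date +%s) - start ))

# Confirm both forms produced the same result before comparing times.
if cmp -s "$words_file.one" "$words_file.two"; then
    outputs_match=yes
else
    outputs_match=no
fi
unique_count=$(wc -l < "$words_file.one" | tr -d ' ')

echo "sort -u: ${one_step_secs}s, sort | uniq: ${two_step_secs}s"
echo "unique lines: $unique_count, outputs match: $outputs_match"

# Remove the temporary files.
rm -f "$words_file" "$words_file.one" "$words_file.two"
```

To see a meaningful difference, point words_file at a large, highly redundant file instead of the generated one.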

The post Sort and remove duplicates first appeared on John D. Cook.
