PSS #2: Finding Duplicate Lines in a Text File
Okay, so today I found another reason to really dig PowerShell. I had a text file populated with a large number of lines of text, and I found that the file had a decent number of duplicate lines in it. There were over 1,200 lines in the file to start with.
In order to verify I had duplicates, since it wasn’t blatantly obvious, I dragged and dropped the file into Excel, which created a single column of text, one line per row. I then sorted that column using the A-Z sort. This ordered the duplicates together and made them more obvious. I then used Conditional Formatting to get an idea of just how much of the file was duplicated. Let’s just say there was more red than white.
I could have spent the time in Excel going line by line and deleting the duplicate rows. There was probably also some Excel magic that could have done that for me; however, I thought I’d try a short PowerShell command to accomplish what I wanted.
$lines = @(); Get-Content C:\temp\FileWithDupes.txt | %{ if (($lines -eq $_).length -eq 0) {$lines = $lines + $_}}; $lines > C:\temp\FileWithoutDupes.txt
Explanation:
I create an empty array named $lines. I then call Get-Content on the source file. Get-Content reads the file and pipes each line to the next command, which is a ForEach-Object (shorthanded by the % sign). For each line read from the file I check whether it already exists in the $lines array. If the line isn’t there yet, I add it to the array. Once the entire file has been read, the $lines array contains all of the unique lines. I then redirect the contents of $lines to a file.
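Spread over multiple lines, it’s the same command, just a bit easier to follow (the paths are my example locations, so adjust them for your own machine):

$lines = @()
Get-Content C:\temp\FileWithDupes.txt | ForEach-Object {
    # ($lines -eq $_) filters the array down to the entries that match
    # the current line; a length of 0 means we haven't seen it yet.
    if (($lines -eq $_).Length -eq 0) {
        $lines = $lines + $_    # append the new line to the array
    }
}
$lines > C:\temp\FileWithoutDupes.txt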
Example:
Create a file (my example names it FileWithDupes.txt) with the following text:
This is a Dupe
This isn’t
This is a Dupe
Who Needs Dupes?
Now execute the PowerShell command above. Make sure to change the file paths to match your own input and output locations. What you should end up with is a file with the duplicate lines removed:
This is a Dupe
This isn’t
Who Needs Dupes?
Something to note here is that I believe Get-Content pulls the entire file into memory before processing it one line at a time; you can control that somewhat with some of the parameters on the cmdlet (such as -ReadCount). Also, the $lines variable is a plain array that gets populated with every non-duplicate line, and appending an item to an array essentially creates a new, larger array and copies the old contents over each time. Needless to say, if you have a really LARGE file you would want to look for a more efficient mechanism than this script (see the sketch below).
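For what it’s worth, if you do run into a file big enough for that to matter, a rough sketch of one alternative (untested here, just the general idea) is to track the lines you’ve already seen in a hashtable instead of rescanning a growing array:

$seen = @{}
Get-Content C:\temp\FileWithDupes.txt | ForEach-Object {
    # Hashtable lookups are fast, and a line is only emitted the first
    # time it is seen, so the original order is preserved.
    if (-not $seen.ContainsKey($_)) {
        $seen[$_] = $true
        $_
    }
} | Set-Content C:\temp\FileWithoutDupes.txt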
With that in mind, the file I was working with was about 120 K or so. After running the PowerShell command I ended up with only 263 unique lines. It took me about two minutes to write the command above with my knowledge of PowerShell (which isn’t much), and most of that was spent remembering how to dynamically append the next line to the array. It would have taken me a lot longer, and given me considerably more eye strain, to remove the duplicates by hand.
[Update]: Thanks to MoW, the PowerShell Guy, there is an even better way to do this. Try the following:
get-content FileWithDupes.txt | sort -u > FileWithoutDupes.txt
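In case the aliases are unfamiliar: sort is shorthand for Sort-Object and -u expands to its -Unique parameter, so the longhand version reads:

Get-Content FileWithDupes.txt | Sort-Object -Unique > FileWithoutDupes.txt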
Once again, there is always a better way. Thanks MoW!
PS: What is PSS? Well, this will be an irregular series I’m calling PowerShell Snippets that I’ll try to post whenever I come up with a short PowerShell script or command that I find useful. This post is number two because I’m counting my post on creating folders from a file as the first one.