How fast is grep? Reasonably fast. Over the weekend, we were discussing on Twitter a post from Mike Haertel, the original developer of GNU grep. In the post, titled “why GNU grep is fast”, Mike described the algorithm grep uses. He also provided this excellent advice: “#1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE. #2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at.” “The key to making programs fast is to make them do practically nothing.”
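To make the "avoids looking at every input byte" idea concrete, here is a minimal sketch of a Boyer-Moore-Horspool search, a simplified member of the Boyer-Moore family that Mike's post credits. This is not GNU grep's code; the function name horspool_find and the sample text are made up for illustration. The search inspects the last byte of each window first and uses it to decide how far to jump, so most bytes are never compared at all.

```python
def horspool_find(text: bytes, pattern: bytes) -> tuple[int, int]:
    """Return (index of the first match or -1, number of byte comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # Shift table keyed by the text byte under the pattern's last position.
    # Bytes that do not occur in pattern[:-1] allow a jump of the full length.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        # Compare right to left so a mismatch is usually found on the first byte.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Skip ahead without looking at the bytes in between.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


sample = b"the quick brown fox hides a key under the mat"
index, looked_at = horspool_find(sample, b"key")
print(f"match at byte {index} after {looked_at} comparisons")
```

With a three-byte pattern the window can advance up to three bytes per comparison, which is why the comparison count stays well below the number of bytes scanned. Longer patterns allow longer jumps.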
This had me wondering how PowerShell’s Select-String stacks up. Richard Minerich (@rickasaurus) brought up a good point: compiled C code is generally faster than C# code. As PowerShell rests on .NET, we can assume that grep should be faster than Select-String. Mark Boltz (@mtezna) suggested running several tests of both and taking an average to get a sense of how Select-String compares.
If Select-String were significantly slower, then a good weekend project might be to write a faster parser. I do have the occasional free weekend and I was very curious. Today, I performed such a test. Read on to find out what I learned.
I generated sample files using a sample dictionary file. Each file contained sentences of random length (5-25 random words). One in ten sentences contained the word “key” at a random location within the sentence. There were eleven sample files: 1,000 sentences, 10,000 sentences, 20,000 sentences, and so on to 100,000 sentences. (You can download the resulting test files here: grep-select-string-test.zip).
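The generation step can be sketched roughly as follows. This is not the script used for the test files: the word list WORDS is a hypothetical stand-in for the dictionary file, the function names are made up, and two details are assumptions, namely that exactly one sentence in ten gets the word and that "key" replaces a random word so the sentence length stays within 5-25 words.

```python
import random

# Hypothetical stand-in for the dictionary file, which is not shown in the post.
WORDS = ["apple", "river", "stone", "cloud", "green", "table", "window", "garden"]


def make_sentences(count, words=WORDS, seed=None):
    """Build `count` sentences of 5-25 random words, one in ten containing 'key'."""
    rng = random.Random(seed)
    # Assumption: exactly count // 10 sentences, chosen at random, receive the word.
    with_key = set(rng.sample(range(count), count // 10))
    sentences = []
    for i in range(count):
        sentence = [rng.choice(words) for _ in range(rng.randint(5, 25))]
        if i in with_key:
            # Assumption: 'key' replaces a random word rather than being inserted.
            sentence[rng.randrange(len(sentence))] = "key"
        sentences.append(" ".join(sentence))
    return sentences


def write_sample_file(path, count, seed=None):
    """Write one sentence per line, so the line count equals the sentence count."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in make_sentences(count, seed=seed):
            handle.write(sentence + "\n")
```

Calling write_sample_file("file1000.txt", 1000) would produce a file shaped like the smallest sample, though not byte-for-byte identical to the downloadable one.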
Each search was performed seven times. System.Diagnostics.Stopwatch was used as the time source. The total milliseconds elapsed was used as the time measure. The minimum time and the maximum time were dropped. The time recorded was the average of the remaining five tests.
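The averaging scheme is easy to get subtly wrong, so here is a minimal sketch of it. It assumes time.perf_counter as a stand-in for System.Diagnostics.Stopwatch, and the names time_runs and trimmed_mean are illustrative rather than taken from the actual test script.

```python
import time


def time_runs(action, runs=7):
    """Call `action` `runs` times and return each elapsed time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def trimmed_mean(samples):
    """Drop the single lowest and highest sample, then average the rest."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


# Example with a placeholder workload; a real run would invoke the search command.
result_ms = trimmed_mean(time_runs(lambda: sum(range(100_000))))
print(f"{result_ms:.4f} ms")
```

Dropping the extremes keeps a single cold-cache or interrupted run from skewing the reported figure, at the cost of two extra runs per measurement.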
I used the latest GNU grep for Windows, version 4.2.1 released 2012-12-18. The command executed for grepping the file was: grep "key" "file1000.txt"
For PowerShell, I used version 3 (build 6.2.9200.16398). The PowerShell equivalent of the grep command was: Select-String -Pattern "key" -Path .\file1000.txt
The host operating system was Windows Server 2008 R2 SP1 with the latest hotfixes.
In the following graph, the number of lines in the sample files is plotted on the x-axis. The total time to search the sample file is plotted on the y-axis in milliseconds.
Lines      Grep (ms)      Select-String (ms)
1,000      248.2245       29.8712
10,000     1,907.8156     299.4792
20,000     4,013.5332     643.2678
30,000     6,689.0545     1,036.1867
40,000     8,419.1654     1,319.9755
50,000     10,870.3179    1,662.6931
60,000     12,487.7127    1,955.2525
70,000     15,048.1311    2,344.9599
80,000     16,623.6946    2,594.3496
90,000     16,775.1033    2,995.7644
100,000    18,697.6675    3,303.2918
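To put a number on the difference, the ratio of the two timings can be computed for each file size. The short sketch below only does arithmetic on the measurements listed above; the names RESULTS and slowdowns are illustrative.

```python
# (lines, grep_ms, select_string_ms), copied from the measurements above.
RESULTS = [
    (1_000, 248.2245, 29.8712),
    (10_000, 1907.8156, 299.4792),
    (20_000, 4013.5332, 643.2678),
    (30_000, 6689.0545, 1036.1867),
    (40_000, 8419.1654, 1319.9755),
    (50_000, 10870.3179, 1662.6931),
    (60_000, 12487.7127, 1955.2525),
    (70_000, 15048.1311, 2344.9599),
    (80_000, 16623.6946, 2594.3496),
    (90_000, 16775.1033, 2995.7644),
    (100_000, 18697.6675, 3303.2918),
]


def slowdowns(results):
    """Return (lines, grep time divided by Select-String time) for each row."""
    return [(lines, grep_ms / ss_ms) for lines, grep_ms, ss_ms in results]


for lines, ratio in slowdowns(RESULTS):
    print(f"{lines:>7,} lines: grep took {ratio:.1f}x as long as Select-String")
```

The ratio is highest for the smallest file, about 8.3, and lowest at 90,000 lines, about 5.6. Most of the other sizes sit between 6.2 and 6.6.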
The bottom line? In these tests on Windows Server 2008 R2, Select-String was significantly faster than GNU grep: grep took roughly 5.6 to 8.3 times as long to search the same file. PowerShell is closing the gap between Linux and Windows shell environments.