Some further notes on the "hash everything" method, where you hash every file up front and compare or group the hashes afterwards...
If you have these files, for example:
Code:
Name                                            Length
----                                            ------
nuclear plans from around the world.wim 10,780,988,855
Sonnets About Travis Kelce.txt                   2,394
Taylor Swift lyrics in Esperanto.txt             2,394
Using the "hash everything" method, on an internal SSD, takes me about
19.8 seconds, because it's hashing that big WIM file. There is no reason to hash this file, because looking at its size, it cannot possibly be a duplicate of the other files. We only need to compare the two text files. Doing that takes
0.013 seconds.
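
To make the idea concrete, here is a minimal sketch of the size-first approach in PowerShell. It is not the actual script from #21, and the folder path is a placeholder: group files by size, then hash only the files whose size collides with another file's.
Code:
$files = Get-ChildItem -Path 'C:\SomeFolder' -File -Recurse   # placeholder path

$files |
    Group-Object -Property Length |       # bucket by file size first
    Where-Object { $_.Count -gt 1 } |     # a unique size can't be a duplicate; skip it
    ForEach-Object { $_.Group } |
    Get-FileHash -Algorithm SHA256 |      # hash only the potential duplicates
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |     # same hash = duplicate content
    ForEach-Object { $_.Group.Path }
In the three-file example above, the WIM file sits alone in its size bucket and is never read; only the two 2,394-byte text files get hashed.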
Using a real-world example: I have a folder of 36,977 files, 80.9 GB in size, on an external spinny disk attached via USB 3. The "hash everything" method takes just over 15 minutes to run through this folder, and that's with MD5 as the hashing algorithm.
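
For reference, the "hash everything" baseline amounts to something like this (again just a sketch, with a placeholder path); it pays the full read cost of every file before any comparison happens:
Code:
Get-ChildItem -Path 'E:\SomeFolder' -File -Recurse |
    Get-FileHash -Algorithm MD5 |         # reads every byte of every file
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 }       # groups of identical files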
Using the method I outlined in #21 takes less than 2 seconds. Using @abactuon's method in #43, after I fixed it, reports similar times. Both of our methods default to SHA-2 algorithms, SHA-256 specifically. So we are hashing more slowly than MD5 would, but it matters little, because we're hashing only when needed.
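
If you want to see what the algorithm choice costs on your own hardware, Measure-Command makes a quick comparison easy ('D:\big.wim' is a placeholder for any large file):
Code:
foreach ($alg in 'MD5', 'SHA256') {
    $t = Measure-Command { Get-FileHash -Path 'D:\big.wim' -Algorithm $alg }
    '{0,-6}: {1:N1} s' -f $alg, $t.TotalSeconds
}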