Some further notes on the "hash everything" method, where you hash every file up front and compare or group the hashes afterwards...
If you have these files, for example:
Code:
Name                                            Length
----                                            ------
nuclear plans from around the world.wim 10,780,988,855
Sonnets About Travis Kelce.txt                   2,394
Taylor Swift lyrics in Esperanto.txt             2,394
Using the "hash everything" method, on an internal SSD, takes me about
19.8 seconds, because it's hashing that big WIM file. There is no reason to hash this file, because looking at its size, it cannot possibly be a duplicate of the other files. We only need to compare the two text files. Doing that takes
0.013 seconds.
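
To make the idea concrete, here is a minimal sketch of the size-first approach in PowerShell. It is not the actual script from #21, and the folder path is a placeholder: group files by size, then hash only the files whose size collides with another file's.
Code:
$files = Get-ChildItem -Path 'C:\SomeFolder' -File -Recurse   # placeholder path

$files |
    Group-Object -Property Length |       # bucket by file size first
    Where-Object { $_.Count -gt 1 } |     # a unique size can't be a duplicate; skip it
    ForEach-Object { $_.Group } |
    Get-FileHash -Algorithm SHA256 |      # hash only the potential duplicates
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |     # same hash = duplicate content
    ForEach-Object { $_.Group.Path }
In the three-file example above, the WIM file sits alone in its size bucket and is never read; only the two 2,394-byte text files get hashed.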
Using a real-world example: I have a folder of 36,977 files, 80.9 GB in size, on an external spinny disk attached via USB 3. The "hash everything" method takes just over 15 minutes to run through this folder, and that's with MD5 as the hashing algorithm.
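
For reference, the "hash everything" baseline amounts to something like this (again just a sketch, with a placeholder path); it pays the full read cost of every file before any comparison happens:
Code:
Get-ChildItem -Path 'E:\SomeFolder' -File -Recurse |
    Get-FileHash -Algorithm MD5 |         # reads every byte of every file
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 }       # groups of identical files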
Using the method I outlined in #21 takes less than 2 seconds. Using @abactuon's method in #43, after I fixed it, reports similar times. Both of our methods default to SHA-2 algorithms, SHA-256 specifically. So we are hashing more slowly than MD5 would, but it matters little, because we're hashing only when needed.
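
If you want to see what the algorithm choice costs on your own hardware, Measure-Command makes a quick comparison easy ('D:\big.wim' is a placeholder for any large file):
Code:
foreach ($alg in 'MD5', 'SHA256') {
    $t = Measure-Command { Get-FileHash -Path 'D:\big.wim' -Algorithm $alg }
    '{0,-6}: {1:N1} s' -f $alg, $t.TotalSeconds
}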