Get duplicate files with Python and Objective-C

I put together a little script to find duplicate files using pure Python. It comes in handy because I do a lot of graphic work and end up with tons of files as I go; I suppose I am paranoid, but losing hours of work has made me this way. It works well and is pretty quick on ordinary folders, but it can be slow when it has to hash huge amounts of data, since every matching file is read in full. To get the code, just copy and paste it from below. It should work on Linux and Windows as well as OS X; just change the path to your own.

"""
Author:  C. Nichols

Find duplicate files and report age in days from the last-modified timestamp.
No licensing, pretty standard stuff - just enjoy it if you need it!
"""
import os
import fnmatch
import hashlib
import datetime
# =============================================================================

def fileAgeFromTodayInDays(secs):
    """fileAgeFromTodayInDays(secs) -> int"""
    file_time = datetime.datetime.fromtimestamp(secs)
    now = datetime.datetime.now()
    diff = now - file_time
    return diff.days

def getDigest(file_input):
    """getDigest(file_input) -> str"""
    # Open in binary mode so the hash covers the raw bytes on every platform.
    handle = open(file_input, 'rb')
    h = hashlib.md5()
    h.update(handle.read())
    val=h.hexdigest()
    handle.close()
    return val

def fileStats(path):
    """fileStats(path) -> stat results"""
    return os.stat(path)

def traverseFiles(root, pattern='*.*'):
    """traverseFiles(root, pattern='*.*') -> generator of (filepath, digest)"""
    root=os.path.normpath(root)
    for root, dirs, files in os.walk(root):
        for filename in fnmatch.filter(files, pattern):
            filepath=os.path.join(root,filename)
            try:
                yield filepath,getDigest(filepath)
            except (IOError, OSError):
                pass #access denied; system file maybe. log it maybe?

# =============================================================================
# MAIN: Set your start directory, c:/ or /home, etc...
Where='/Users/mohawke/Desktop/dupes' # My path on Mac, change...

#Find='*.py'      ## search by extension.
#Find='Script5.*' ## search by filename
Find='*' ## get all files ('*.*' would skip names without a dot); comment out to override.
# =============================================================================
# Get file listing.
# =============================================================================
master={}
count=0
for one in traverseFiles(Where,pattern=Find):
    fpath,hdigest=one
    if hdigest not in master:
        master[hdigest]=[]
    master[hdigest].append(fpath)

    # This just provides some feedback, not important.
    count+=1
    strOut='searching %s\r' % count
    print strOut,

# Show results, do something with results here...
for k,dupes in master.items():
    if len(dupes)>1: # if no duplicate, ignore...
        print '\nMATCH:\n'
        for item in dupes:
            print item, fileAgeFromTodayInDays( fileStats(item).st_mtime ), 'days old'
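
One way to cut down on the slowness with big directory trees is to compare file sizes first and hash only the files that share a size with at least one other file. Here is a minimal sketch of that idea, assuming Python 3; the names md5_of_file and find_duplicates are made up for this example and are not part of the script above.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in binary chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of lists; each inner list holds paths with identical content."""
    # Pass 1: group by size, which needs only a stat call and no file reads.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file; skip it

    # Pass 2: hash only files that share a size with at least one other file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_digest[md5_of_file(path)].append(path)
            except OSError:
                continue
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling find_duplicates on your start directory gives back the groups of matching paths, which you could print the same way as the MATCH loop above. Files with a unique size are never opened, so how much this helps depends on how many files happen to share a size.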

Here it is in Objective-C. I tried to get the MD5 hashing to work on my own, but while looking for examples I ran across this awesome bit of C code by Joel Lopes Da Silva. I don’t think I can do better than this at this point, and it’s good stuff, so why reinvent the wheel? The program goes through all the files under a path and finds duplicates based on their hash. You can download the compiled version here; it takes the path and the extension you’re looking for as arguments. Or you can copy and paste the code below: just create a command-line (terminal) app in Xcode and change the path and extension to what you’re looking for. You can swap the hard-coded paths for argv; I left the hard-coded ones in…

//  main.m
//  superduper
//
//  Created by Charles Nichols on 3/24/11.
//  Copyright None. All rights not reserved;

// Standard library
#include <stdint.h>
#include <stdio.h>

#import <Foundation/Foundation.h>
#import <CommonCrypto/CommonDigest.h>

// In bytes
#define FileHashDefaultChunkSizeForReadingData 4096

/*
 Function FileMD5HashCreateWithPath to compute MD5 hash
 written by Joel Lopes Da Silva.

 It's really simple to adapt this function to other algorithms.
 Say you want to adapt it to get the SHA1 hash instead.
 Here's what you need to do:

 replace CC_MD5_CTX with CC_SHA1_CTX;
 replace CC_MD5_Init with CC_SHA1_Init;
 replace CC_MD5_Update with CC_SHA1_Update;
 replace CC_MD5_Final with CC_SHA1_Final;
 replace CC_MD5_DIGEST_LENGTH with CC_SHA1_DIGEST_LENGTH;
 */

CFStringRef FileMD5HashCreateWithPath(CFStringRef filePath,
                                      size_t chunkSizeForReadingData) {

    // Declare needed variables
    CFStringRef result = NULL;
    CFReadStreamRef readStream = NULL;

    // Get the file URL
    CFURLRef fileURL =
    CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
                                  (CFStringRef)filePath,
                                  kCFURLPOSIXPathStyle,
                                  (Boolean)false);
    if (!fileURL) goto done;

    // Create and open the read stream
    readStream = CFReadStreamCreateWithFile(kCFAllocatorDefault,
                                            (CFURLRef)fileURL);
    if (!readStream) goto done;
    bool didSucceed = (bool)CFReadStreamOpen(readStream);
    if (!didSucceed) goto done;

    // Initialize the hash object
    CC_MD5_CTX hashObject;
    CC_MD5_Init(&hashObject);

    // Make sure chunkSizeForReadingData is valid
    if (!chunkSizeForReadingData) {
        chunkSizeForReadingData = FileHashDefaultChunkSizeForReadingData;
    }

    // Feed the data to the hash object
    bool hasMoreData = true;
    while (hasMoreData) {
        uint8_t buffer[chunkSizeForReadingData];
        CFIndex readBytesCount = CFReadStreamRead(readStream,
                                                  (UInt8 *)buffer,
                                                  (CFIndex)sizeof(buffer));
        if (readBytesCount == -1) break;
        if (readBytesCount == 0) {
            hasMoreData = false;
            continue;
        }
        CC_MD5_Update(&hashObject,
                      (const void *)buffer,
                      (CC_LONG)readBytesCount);
    }

    // Check if the read operation succeeded
    didSucceed = !hasMoreData;

    // Compute the hash digest
    unsigned char digest[CC_MD5_DIGEST_LENGTH];
    CC_MD5_Final(digest, &hashObject);

    // Abort if the read operation failed
    if (!didSucceed) goto done;

    // Compute the string result
    char hash[2 * sizeof(digest) + 1];
    size_t i = 0;
    for (i = 0; i < sizeof(digest); ++i) {
        snprintf(hash + (2 * i), 3, "%02x", (int)(digest[i]));
    }
    result = CFStringCreateWithCString(kCFAllocatorDefault,
                                       (const char *)hash,
                                       kCFStringEncodingUTF8);

done:

    if (readStream) {
        CFReadStreamClose(readStream);
        CFRelease(readStream);
    }
    if (fileURL) {
        CFRelease(fileURL);
    }
    return result;
}

int main (int argc, const char * argv[])
{

    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

    // C strings, because initWithUTF8String: below expects const char *.
    const char *path_input = "/Users/mohawke/Desktop/dupes/";
    const char *ext_input  = "png";

    if (path_input && ext_input){
        NSString *path = [[NSString alloc] initWithUTF8String: path_input];
        NSString *ext = [[NSString alloc] initWithUTF8String: ext_input];

    // get user input from term.
    //if (argv[1] && argv[2]){
    //    NSString *path = [[NSString alloc] initWithUTF8String: argv[1]];
    //    NSString *ext = [[NSString alloc] initWithUTF8String: argv[2]];

        // Get all the files in our path.
        NSDirectoryEnumerator *fm = [[NSFileManager defaultManager]
                                                     enumeratorAtPath:path];

        NSString *newPath;
        NSString *file;

        NSMutableDictionary *_matchesDictionary = [[NSMutableDictionary alloc] init];
        int count = 0;
        while ((file = [fm nextObject]))
        {
            fprintf(stderr, "searching %d\r", count++);

            if ([[file pathExtension] compare: ext options:NSCaseInsensitiveSearch] == NSOrderedSame)
                {
                    //NSString *filelower = [file lowercaseString];
                    newPath = [path stringByAppendingPathComponent: file];
                    CFStringRef md5hash = FileMD5HashCreateWithPath((CFStringRef)newPath,
                                                    FileHashDefaultChunkSizeForReadingData);
                    if (md5hash!=nil){

                        NSString *hashKey = [NSString stringWithFormat:@"%@", md5hash];
                        CFRelease(md5hash); // the Create function returned an owned reference

                        // check if key is already in dict.
                        if ([_matchesDictionary objectForKey: hashKey])
                        {
                            // pull existing array from dict and add new entry to array
                            // for said key.
                            NSMutableArray *keyArray = [_matchesDictionary objectForKey:
                                                                       hashKey];
                            [keyArray addObject: newPath];

                            // put the array back into the dict for said key.
                            [_matchesDictionary setObject: keyArray forKey: hashKey];

                        } else {

                            // create new array, assign empty array to said key in dict.
                            NSMutableArray *keyArray = [NSMutableArray array];
                            [keyArray addObject: newPath];
                            [_matchesDictionary setObject: keyArray forKey: hashKey];

                        }
                    }
                }
        }

        // we have gathered all paths and calculated hash for each file,
        // any hash that matches would be a dupe file.
        NSEnumerator *keyEnum = [_matchesDictionary keyEnumerator];
        id key;
        int sz = 0;
        while ((key = [keyEnum nextObject]))
        {
            id value = [_matchesDictionary objectForKey:key];
            if ([value count] > 1){
                NSString * dupe_files = [[value valueForKey:@"description"] componentsJoinedByString:@"\n"];
                fprintf(stderr, "Dupes [%s]\n%s\n\n", [key UTF8String], [dupe_files UTF8String]);
                //NSLog(@"%@ %@", value, key);
            }
        }

    } else {
        fprintf(stderr, "Please provide a search path and extension;\ndupe /search_path/ file_extension\n");
    }

    [pool drain];
    return 0;
}
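
For comparison, the chunked reading that FileMD5HashCreateWithPath does, and the algorithm swap described in its comment, can be sketched in Python as well. This is a rough equivalent rather than a port: file_digest is a hypothetical name, and it assumes Python 3 and an algorithm name that hashlib.new accepts.

```python
import hashlib


def file_digest(path, algorithm='md5', chunk_size=4096):
    """Return the hex digest of a file, reading chunk_size bytes at a time."""
    if not chunk_size:
        chunk_size = 4096  # same fallback value as the C function's default
    h = hashlib.new(algorithm)  # for example 'md5', 'sha1' or 'sha256'
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # end of file
            h.update(chunk)
    return h.hexdigest()
```

Switching to SHA1 is then just file_digest(path, 'sha1') instead of five find-and-replace steps, and because only one chunk is held in memory at a time, large files are not a problem.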