PyFdupes 1.0.3

What is it ?
------------

Pyfdupes can help you reorganize your mp3 collection or your download
directories by finding files with a "similar name" or the same content.

Requirements
------------

The Windows installer contains all the required libraries.
To run pyfdupes from the source you need the following packages :

- python 2.4
- wxpython 2.5 ( only for the graphical version )
- psyco 1.4 ( optional but strongly recommended! )

Download
--------

The latest version is available from
http://sourceforge.net/projects/pyfdupes/

Installation
------------

To install pyfdupes, decompress the downloaded file and run pyfdupes.py,
or cpyfdupes.py for the preliminary command line version.
A Windows binary installer is also available.

Use
---

Using pyfdupes in five steps :

1. select one or more directories using the "Add" button
2. choose the preferred search method ("similar name" or "same content")
3. start the search with the "Search" button
4. view or delete the found files using the buttons "Open", "Delete"
   or "Delete All" (or with the right mouse button over the filename)
5. finally save the search with the "Save report" button

A context menu is also available by pressing the right mouse button over
the "Directories" or "Results" area.

Files can be excluded from the search by listing their extensions,
separated by commas, in the "Skip file ext." edit box.
( eg: html,htm,ico )

Warning!
The "Delete" button permanently removes the selected file.
The "Delete All" button is available only for the content search type;
it removes ALL the found files, leaving only one original of each
group of duplicates.

Using the preliminary command line version :

  cpyfdupes.py --dir=e:\down --method=winkler --level=85 --dup --log=e:\log.txt
  cpyfdupes.py --dir=e:\down --method=strcmp --log=e:\log.txt
  cpyfdupes.py --dir=e:\down --method=sha --log=e:\log.txt

  --help   : shows the available options
  --dir    : can be repeated in order to search more than one directory.
             In the "similar name" search type you can also specify
             one or more file lists ( file:c:\list.txt )
  --method : specifies the preferred search method
  --level  : specifies the similarity level to refine your search
  --dup    : use this flag to allow duplicates
  --log    : use this flag to specify an output report file

Similar name search
-------------------

This kind of search is a heavy process and can take a long time with
many files.
Because only filenames are compared, a search can run not only on
directories with real files but also on a list of files created on a
different computer. In the FAQs you can read how to make a list of files
that can be read by pyfdupes.

The available algorithms offer different speed and precision, according
to the following table:

Name        Precision   Speed
-------------------------------
jaro        High        High
winkler     High        High
bigram      High        High
editdist    High        Low
seqmatch    High        Low
soundex     High        Low
strcmp      Very low    Very high

For all algorithms except "strcmp" and "soundex" it is possible to
specify a similarity level. A higher level is faster but finds fewer
duplicates; a lower level is slower but finds more.

The "Allow repetitions" option enables a found file to be shown more
than once, at the cost of a longer search time.

The "Default" button selects the recommended settings for a search.

With a very large number of files it is better to run a first search
using the "strcmp" method.

The search process stops automatically after a maximum of 5 minutes, and
you will receive a "time-out" warning message in the report box.

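The level-based comparison described above can be sketched roughly as
follows. This is only an illustration, not the pyfdupes code: the
function name similar_names and the max_len_diff parameter are
hypothetical, the sketch assumes a modern Python 3 interpreter, and the
standard difflib.SequenceMatcher stands in for the algorithms listed in
the table. The cheap length check before the expensive comparison
follows the idea mentioned under "Performance consideration".

```python
from difflib import SequenceMatcher


def similar_names(names, level=85, max_len_diff=0.5):
    """Return (name_a, name_b, score) for pairs scoring at least `level`.

    Hypothetical helper for illustration; `level` is a percentage and
    `max_len_diff` is an assumed cut-off for the length pre-check.
    """
    pairs = []
    cleaned = [n.lower() for n in names]
    # Two nested loops: every name is compared with all the following ones.
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            a, b = cleaned[i], cleaned[j]
            longest = max(len(a), len(b))
            if longest == 0:
                continue
            # Cheap pre-check: skip pairs of considerably different lengths.
            if abs(len(a) - len(b)) > max_len_diff * longest:
                continue
            # Expensive step: similarity ratio scaled to a 0-100 score.
            score = SequenceMatcher(None, a, b).ratio() * 100
            if score >= level:
                pairs.append((names[i], names[j], round(score)))
    return pairs
```

Calling similar_names(["holiday_photo.jpg", "holiday_photo1.jpg",
"notes.txt"]) would report only the two "holiday_photo" files as a
pair. Raising the level shrinks the result list, lowering it grows the
list, which matches the trade-off described above.
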
Mp3 music files are handled in a particular way, using the song title
without numbers and other unwanted characters.
( eg: "Artist - Album - 01 Title.mp3" --> "title" )

Content search
--------------

This is the classic method to find duplicates: it is based on the
content of the file, not on its name.
There are many utilities that already do this and I've added it to
pyfdupes for your convenience.
In order to increase performance, files aren't compared byte by byte;
instead a hash key is generated and compared between files of the
same size.
Two files reported as identical are extremely unlikely to differ;
nevertheless, take great care when deciding whether to delete them.

The "md5" method uses up to the first 4 MB of the file to generate the
hash key; the "sha" method reads up to 40 MB and is therefore slower
but more precise.

Developer information
---------------------

Pyfdupes was made with :

* Windows XP home SP2
* python 2.4.x
* wxPython 2.5.x
* boa-constructor 0.4.0
* pylint 0.4.2
* pychecker 0.8.14
* epydoc 2.1
* psyco 1.4
* innosetup 5.0.7
* py2exe 0.5.4
* SciTE 1.59

Module list
-----------

pyfdupes.py    : main graphical application
cpyfdupes.py   : preliminary command line version
MainFrame.py   : wxPython interface
filecontent.py : file content search class
filename.py    : file name search class
stringcmp.py   : various string similarity algorithms
soundex.py     : soundex string similarity algorithm

The filecontent.py and filename.py modules can be used independently in
any python program, even without a GUI.
Read the technical documentation for more information.

Performance consideration
-------------------------

Searching for similar names is a very time-consuming job because each
name needs to be compared with all the other ones.
At the beginning I tried to devise an algorithm that decreases the
number of required comparisons by grouping strings with the same
length etc...
Unfortunately all the added complexity didn't pay off in terms of
performance gain, so I went back to the original, simplest solution with
two nested loops. Obviously there is a preliminary check to avoid the
expensive comparison between two strings of considerably different
lengths.

If psyco is installed, Pyfdupes will use it automatically, speeding
things up considerably.
My own test with psyco showed that it can outperform a difflib python
module written in "C" language !

Future developments
-------------------

In the TODO.txt file you can find a list of the missing features that
could be added in the future ;)

Limitations
-----------

See BUGS.txt

Bugs
----

PyFdupes shouldn't have any bugs other than the ones listed in BUGS.txt.
If you discover one, send a detailed report to the developer
containing :

- the problem description
- the log file created in the program directory
- OS, python and wxpython version
- the instructions to reproduce it

License
-------

Pyfdupes is written by Luca Montecchiani and released under the
GNU GPL license.

The stringcmp.py module is part of the Febrl 0.3 project
(C) 2002, 2003, 2004 the Australian National University and others,
released under the ANUOS-1.2.txt license.

Link
----

If you are interested in string similarity you won't want to miss
these links:

http://itman.narod.ru/english/index.htm ( good FAQ )
http://trific.ath.cx/resources/python/levenshtein/
http://www.merriampark.com/ld.htm
http://www.personal.psu.edu/staff/i/u/iua1/python/apse/
http://datamining.anu.edu.au/projects/linkage.html ( stringcmp.py module )
http://www.bio.cam.ac.uk/~mw263/pyagrep.html
http://hetland.org/python/distance.py

Contributions
-------------

Criticism, ideas, patches and comments are more than welcome !