PyFdupes 1.0.3

What is it ?
------------
Pyfdupes helps you reorganize your mp3 collection or your download
directories by finding files with a "similar name" or the same content.

Requirements
------------
The Windows installer contains all the required libraries. To run pyfdupes
from source you need the following packages:
- python 2.4
- wxpython 2.5 ( only for the graphical version )
- psyco 1.4 ( optional but strongly recommended! )

Download
---------
The latest version is available from http://sourceforge.net/projects/pyfdupes/

Installation
------------
To install pyfdupes, decompress the downloaded file and run pyfdupes.py, or
cpyfdupes.py for the preliminary command line version.
A Windows binary installer is also available.

Use
---
Using pyfdupes in five steps :
1. select one or more directories using the "Add" button
2. choose the preferred search method ("similar name" or "same content")
3. start the search with the "Search" button
4. view or delete the found files using the buttons "Open", "Delete" or 
   "Delete All" (or with the right mouse button over the filename)
5. finally save the search with the "Save report" button

A context menu is also available by pressing the right mouse button over
the "Directories" or "Results" area.
You can exclude files from the search by listing their extensions, separated
by commas, in the "Skip file ext." edit box ( eg: html,htm,ico ).
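
The extension list described above could be applied along these lines. This is
only a minimal sketch in modern Python 3, not code taken from pyfdupes: the
function names parse_skip_list and is_skipped are hypothetical, and matching
extensions case-insensitively is an assumption.

```python
import os


def parse_skip_list(text):
    """Turn "html,htm,ico" into a set of lowercase extensions (no dots)."""
    return {ext.strip().lstrip(".").lower() for ext in text.split(",") if ext.strip()}


def is_skipped(filename, skip_exts):
    """Return True when the file extension is in the skip set."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext in skip_exts


# Hypothetical usage: filter a list of names before searching.
skip = parse_skip_list("html,htm,ico")
names = ["index.html", "song.mp3", "favicon.ICO", "README"]
kept = [n for n in names if not is_skipped(n, skip)]
```

With the sample names, only "song.mp3" and "README" remain in kept.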

Warning!
The "Delete" button permanently removes the selected file.
The "Delete All" button is available only for the content search type. It
removes ALL the found files, leaving only one original from each group of
duplicates.
 
Using the preliminary command line version :

cpyfdupes.py --dir=e:\down --method=winkler --level=85 --dup --log=e:\log.txt
cpyfdupes.py --dir=e:\down --method=strcmp --log=e:\log.txt
cpyfdupes.py --dir=e:\down --method=sha --log=e:\log.txt

--help   : shows the available options
--dir    : can be repeated in order to search more than one directory.
           In the "similar name" search type you can also specify one or more
           file lists ( file:c:\list.txt )
--method : specifies the preferred search method
--level  : specifies the similarity level to refine your search
--dup    : use this flag to allow duplicates
--log    : use this flag to specify an output report file

Similar name search
-------------------
This kind of search is a heavy process and can take a long time with many files.
Because it works only with filenames, it can search not only directories with
real files but also a list of files created on a different computer.
The FAQs explain how to make a list of files that pyfdupes can read.
The available algorithms offer different speed and precision, according to the
following table:

Name     Precision   Speed
-------------------------------
jaro     High        High
winkler  High        High
bigram   High        High
editdist High        Low
seqmatch High        Low
soundex  High        Low
strcmp   Very low    Very high 

For all algorithms except "strcmp" and "soundex" it is possible to specify
a similarity level. A higher level is faster but finds fewer duplicates; a
lower level is naturally slower but finds more duplicates.
The "Allow repetitions" option enables a found file to be shown more than once,
which increases the search time.
The "Default" button selects the recommended settings for a search.
With a very large number of files it is better to do a first search using the
"strcmp" method.
The search process stops automatically after a maximum of 5 minutes and you
will receive a "time-out" warning message in the report box.
Mp3 music files are handled in a particular way: only the song title is used,
without numbers and other unwanted characters.
( eg: "Artist - Album - 01 Title.mp3" --> "title" )
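
The mp3 handling above can be illustrated with a short sketch. It is not the
actual pyfdupes code: the function name normalize_mp3_title is hypothetical,
and it assumes that the fields of the filename are separated by " - " with the
song title last, as in the example.

```python
import os
import re


def normalize_mp3_title(filename):
    """Reduce an mp3 filename to a bare, lowercase song title."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    # Assumption: fields are separated by " - " and the title comes last.
    title = stem.split(" - ")[-1]
    # Replace digits, underscores and punctuation with spaces.
    title = re.sub(r"[\W\d_]+", " ", title)
    # Collapse runs of whitespace and lowercase the result.
    return " ".join(title.split()).lower()


print(normalize_mp3_title("Artist - Album - 01 Title.mp3"))  # prints: title
```

Comparing normalized titles instead of full filenames keeps track numbers and
artist or album prefixes from hiding two copies of the same song.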

Content search
--------------
This is the classic method to find duplicates: it is based on the content of
the file, not on its name. Many utilities already do this; I've added it to
pyfdupes for your convenience.
To increase performance, files aren't compared byte by byte. Instead, a hash
key is generated and compared between files of the same size.
Two files shown as identical are extremely unlikely to differ, but take great
care when deciding whether to delete them.
The "md5" method uses up to the first 4 MB of the file to generate the hash
key; the "sha" method reads up to 40 MB and is therefore slower but more
precise.
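
The approach described above (group files by size, then compare a hash of the
first part of each file) could look roughly like this. It is a sketch in modern
Python 3 and not the filecontent.py implementation: the names partial_hash and
find_duplicates are hypothetical, "sha" is assumed to mean SHA-1, and the 64 KB
read chunk is an arbitrary choice.

```python
import hashlib
import os
from collections import defaultdict

# Read limits taken from the text above; "sha" is assumed to mean SHA-1.
LIMITS = {"md5": 4 * 1024 * 1024, "sha": 40 * 1024 * 1024}


def partial_hash(path, method="md5"):
    """Hash at most the first LIMITS[method] bytes of a file."""
    digest = hashlib.md5() if method == "md5" else hashlib.sha1()
    remaining = LIMITS[method]
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(65536, remaining))
            if not chunk:
                break  # end of file reached before the limit
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def find_duplicates(paths, method="md5"):
    """Return lists of paths that share both size and partial hash."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[partial_hash(path, method)].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Because only the beginning of each file is hashed, two large files of the same
size that differ only after the read limit would be reported as identical.
This is one more reason to check before deleting.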

Developer information
---------------------
Pyfdupes was made with :

* Windows XP home SP2
* python 2.4.x
* wxPython 2.5.x
* boa-constructor 0.4.0
* pylint 0.4.2
* pychecker 0.8.14 
* epydoc 2.1 
* psyco 1.4
* innosetup 5.0.7
* py2exe 0.5.4
* SciTE 1.59

Module list
-----------
pyfdupes.py    : main graphical application
cpyfdupes.py   : preliminary command line version
MainFrame.py   : wxPython interface
filecontent.py : file content search class
filename.py    : file name search class
stringcmp.py   : various string similarity algorithms
soundex.py     : soundex string similarity algorithm

The filecontent.py and filename.py modules can be independently used in any 
python program even without a GUI. 
Read the technical documentation for more information.

Performance considerations
--------------------------
Searching for similar names is a very time-consuming job because each name needs
to be compared with all the other ones.
At first I tried to devise an algorithm that decreases the number of required
comparisons by grouping strings of the same length, etc.
Unfortunately the added complexity didn't pay off in performance, so I went
back to the original, simplest solution with two nested loops.
There is, of course, a preliminary check to avoid the expensive comparison
between two strings of considerably different lengths.
If psyco is installed, Pyfdupes uses it automatically, speeding things up
considerably.
My own tests with psyco showed that it can outperform a difflib python module
written in C!
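
The two nested loops with a length pre-check can be sketched as follows. This
is an illustration in modern Python 3, not the filename.py code: the function
name similar_pairs and the max_len_diff parameter are hypothetical, the level
is assumed to be a percentage as in the --level=85 example, and the standard
difflib.SequenceMatcher stands in for whichever algorithm is selected.

```python
from difflib import SequenceMatcher


def similar_pairs(names, level=85, max_len_diff=0.5):
    """Compare every name with every later name (two nested loops)."""
    threshold = level / 100.0
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            longest = max(len(a), len(b))
            if longest == 0:
                continue  # nothing to compare in two empty names
            # Cheap pre-check: skip strings of very different lengths.
            if abs(len(a) - len(b)) / longest > max_len_diff:
                continue
            # Expensive comparison, done only for plausible candidates.
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```

The number of comparisons grows with the square of the number of names, which
is why the pre-check and a higher similarity level both help on large
collections.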

Future developments
-------------------
The TODO.txt file lists the missing features that could be implemented in the
future ;)

Limitations
-----------
See BUGS.txt

Bugs
----
PyFdupes shouldn't have any bugs other than the ones listed in BUGS.txt. If
you discover one, send a detailed report to the developer containing:
- the problem description
- the log file created in the program directory
- OS, python and wxpython version
- the steps to reproduce it

License
-------
Pyfdupes is written by Luca Montecchiani <mluca@users.sourceforge.net>
and released under the GNU GPL license.
The stringcmp.py module is part of the Febrl 0.3 project,
(C) 2002, 2003, 2004 the Australian National University and others,
released under the ANUOS-1.2.txt license.

Links
-----
If you are interested in string similarity you probably won't want to miss
these links:

http://itman.narod.ru/english/index.htm ( good FAQ )
http://trific.ath.cx/resources/python/levenshtein/
http://www.merriampark.com/ld.htm
http://www.personal.psu.edu/staff/i/u/iua1/python/apse/
http://datamining.anu.edu.au/projects/linkage.html ( stringcmp.py module )
http://www.bio.cam.ac.uk/~mw263/pyagrep.html     
http://hetland.org/python/distance.py

Contributions
-------------
Criticism, ideas, patches and comments are more than welcome!