version 0.1
A small (and ugly, less-than-one-hour hack) Python script that searches the web (via Google web search) for phrases taken from a file, in order to ease plagiarism detection. GNU GPL license. Developed and tested on Ubuntu 6.10 (Edgy).
Requirements
python >= 2.4, wget (wget package), pdftotext (only for PDF support; in Ubuntu it is in the poppler-utils package) and GNU recode (recode package), plus an Internet connection
Usage
Provided that the file already has executable permissions (chmod +x) and the python executable is in /usr/bin:
./dpd.py inputfile [minwords] [maxwords]
…if it doesn't work
python dpd.py inputfile [minwords] [maxwords]
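For example, to analyse a hypothetical file thesis.pdf using phrases of 5 to 15 words:
./dpd.py thesis.pdf 5 15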
The input file can be a text file (UTF-8) or a PDF (provided that pdftotext is available). PDF files are first converted to text (a new file inputfile.txt is created). minwords (default 7) is the minimum number of words per phrase: phrases with fewer than minwords words are ignored. maxwords (default 20) is the maximum number of words per phrase: phrases with more than maxwords words are ignored. Adjust minwords and maxwords to tune the analysis. The script treats . and ; as phrase separators. The result is a list of links containing the considered phrases, ordered from the least significant to the most significant (the more phrases found in a page, the more significant it is).
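As an illustration of the phrase rules described above, here is a minimal Python sketch. It is not the actual dpd.py code: the function names are made up, and only the separators, limits and query shape described in this README are taken as given.

import re

MINWORDS = 7   # default minimum words per phrase
MAXWORDS = 20  # default maximum words per phrase

def extract_phrases(text, minwords=MINWORDS, maxwords=MAXWORDS):
    # Split on the phrase separators '.' and ';' and keep only phrases
    # whose word count falls within [minwords, maxwords].
    phrases = []
    for raw in re.split(r'[.;]', text):
        words = raw.split()
        if minwords <= len(words) <= maxwords:
            phrases.append(' '.join(words))
    return phrases

def build_query(phrase):
    # Join the words of a phrase with OR, as in the "word OR word OR word"
    # queries mentioned in the Disclaimer below.
    return ' OR '.join(phrase.split())

if __name__ == '__main__':
    sample = ("Too short. This sentence is long enough to be considered a "
              "searchable phrase by the script; end.")
    for p in extract_phrases(sample, minwords=5):
        print(build_query(p))

Run on its own, the sketch keeps only the middle sentence (the other two fall outside the word limits) and prints it as a single "word OR word OR ..." query.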
Disclaimer
This tool is far from perfect. It merely searches pieces of text (word OR word OR word, etc.) on the web. I assume no responsibility for its usage, and I do not guarantee that it works correctly as described. Use it only to get some hints. Getting many matches does not indicate plagiarism: the script merely looks for pages where the same words appear, so manually verify every result before claiming plagiarism!