Download and install clean_corpus.pl

This is a zip file that contains the script itself plus the lists of English three-character sequences that we use for rule 7. After downloading, unzip the file to your POPFile directory and make sure that your archiver keeps the path names of the files. clean_corpus gets its own subdirectory, so that both your corpus and your POPFile installation stay clean. Note that this script requires POPFile version 0.21 or higher.

clean_corpus9b.zip (last updated July 8, 2004)
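If you would rather script the unpacking than trust an archiver's settings, the following is a minimal sketch in Python. It is not part of clean_corpus; the function name `unpack_base_package` is made up for this illustration, and the target directory is whatever your own POPFile directory happens to be.

```python
import zipfile
from pathlib import Path


def unpack_base_package(archive, popfile_dir):
    """Extract every member of archive below popfile_dir, keeping stored paths.

    Hypothetical helper for illustration only; not shipped with clean_corpus.
    """
    target = Path(popfile_dir)
    with zipfile.ZipFile(archive) as zf:
        # extractall() recreates the directory structure stored in the
        # archive, which is what "keeps the path names" asks for.
        zf.extractall(target)
        return sorted(zf.namelist())
```

You would call it as `unpack_base_package("clean_corpus9b.zip", "/path/to/your/popfile")`, where the second argument is a placeholder for your real POPFile directory. The returned list shows which paths were written, so you can compare it with the check list.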

Besides this basic package, you should most probably also get one or more of our language plugins. The term "language plugin" is just a euphemism for some text files containing language-specific character triplets. So don't worry.
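In case "character triplet" sounds mysterious: it is simply a run of three consecutive characters. The short Python sketch below only illustrates that idea. It makes no claim about how clean_corpus reads the plugin files or applies them, and the function name `char_triplets` is invented for this example.

```python
def char_triplets(word):
    """Return every run of three consecutive characters in word, in order."""
    # A word shorter than three characters has no triplets at all.
    return [word[i:i + 3] for i in range(len(word) - 2)]


print(char_triplets("corpus"))  # ['cor', 'orp', 'rpu', 'pus']
```

A list of the triplets that are common in a language gives a cheap way to judge whether a string looks like a word of that language, which is why one list per language is needed.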

You need the language plugins to keep non-English words in your corpus. If your correspondence is in different languages or if spammers from foreign countries send you spam in foreign languages, get the corresponding plugin! It is not necessary that you yourself speak the language. If the spammers do, get the plugin.

Each language plugin comes as a small zip file that contains three files. Extract all three files of each plugin to your POPFile directory, into the triplets subdirectory. You don't have to do anything else; clean_corpus will find the files and use them.
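The plugin step can be scripted as well. The Python sketch below rests on one assumption: that the plugin files belong directly in the triplets directory, so any directory prefix stored in the archive is dropped. The function name `install_plugin` is hypothetical, and you pass the location of the triplets directory yourself, since the script does not try to guess it.

```python
import zipfile
from pathlib import Path, PurePosixPath


def install_plugin(plugin_zip, triplets_dir):
    """Copy each file in plugin_zip directly into triplets_dir.

    Hypothetical helper for illustration only. Returns the sorted names of
    the files written. triplets_dir must already exist.
    """
    target = Path(triplets_dir)
    written = []
    with zipfile.ZipFile(plugin_zip) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no file data
            # Assumption: files go straight into triplets_dir, so keep only
            # the last path component of the stored name.
            name = PurePosixPath(info.filename).name
            (target / name).write_bytes(zf.read(info))
            written.append(name)
    return sorted(written)
```

Because every plugin is described as holding three files, checking that the returned list has three entries is a quick sanity test. If the triplets directory does not exist, the call fails with `FileNotFoundError`, which usually means the base package has not been unpacked yet or the path is wrong.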

The plugins for English, French, German, and Spanish are included along with clean_corpus itself. No need to look for them here.

Catalan Danish Faroese
Gaelic Galician Italian
Latin Lithuanian Malay
Nederlands Norwegian Portuguese
Swedish Ukrainian  

Check list

The most important thing is to get the directories right. After installing everything, you should have

Next: Command line options.

About the design of this page

If you are using an old and dated browser (like Netscape 4.something), this page will look crappy. You should be able to get and read all the content, but since all the styling is achieved with cascading style sheets and old browsers do not support these properly, you will have to live with it. Sorry.