Download and install clean_corpus.pl
The download is a zip file containing the script itself plus the lists of
English three-character sequences that we use for rule 7. After downloading,
unzip the file into your POPFile directory; make sure that your archiver
preserves the path names of the files. clean_corpus gets its own subdirectory,
so that not only your corpus but also your POPFile installation stays clean.
Note that this script requires POPFile version 0.21 or higher.
clean_corpus9b.zip (last updated July 8, 2004)
Besides this basic package, you will most probably also want one or more of our language plugins. The term "language plugin" is just a fancy name for a few text files containing language-specific character triplets. So don't worry.
You need the language plugins to keep non-English words in your corpus. If your correspondence is in different languages or if spammers from foreign countries send you spam in foreign languages, get the corresponding plugin! It is not necessary that you yourself speak the language. If the spammers do, get the plugin.
Each language plugin is a small zip file containing three files. Extract
all three files of each plugin into the triplets subdirectory (inside the
clean_corpus directory in your POPFile directory). You don't have to do
anything else; clean_corpus will find the files and use them.
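If you prefer to script this step instead of using an archiver, the extraction can be sketched roughly as follows. This is only an illustration in Python, not part of clean_corpus: the function name `install_plugin` is hypothetical, and the target path `clean_corpus/triplets` inside the POPFile directory is assumed from the check list on this page.

```python
import zipfile
from pathlib import Path


def install_plugin(plugin_zip, popfile_dir):
    """Extract every file of a language plugin into the triplets directory.

    Returns the list of file paths that were written.
    """
    # Assumed layout: <POPFile directory>/clean_corpus/triplets
    triplets_dir = Path(popfile_dir) / "clean_corpus" / "triplets"
    triplets_dir.mkdir(parents=True, exist_ok=True)

    written = []
    with zipfile.ZipFile(plugin_zip) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # Drop any folder prefix stored in the archive so the files
            # land directly in the triplets directory.
            target = triplets_dir / Path(member.filename).name
            target.write_bytes(archive.read(member))
            written.append(target)
    return written
```

Calling `install_plugin("some_plugin.zip", "/path/to/popfile")` (both names are placeholders) should leave the plugin's text files directly in the triplets directory.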
The plugins for English, French, German, and Spanish are included with
clean_corpus itself, so there is no need to download them separately.
| Catalan | Danish | Faroese |
| Gaelic | Galician | Italian |
| Latin | Lithuanian | Malay |
| Nederlands | Norwegian | Portuguese |
| Swedish | Ukrainian |
Check list
The most important thing is to get the directories right. After installing everything, you should have
- a clean_corpus directory in your POPFile directory. This is where the script itself (clean_corpus.pl) must be.
- within that, you should have a triplets directory.
- All the text files from the language plugins must go into this triplets directory.
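The check list above can also be verified with a few lines of code. The following Python sketch is an illustration only and is not shipped with clean_corpus; the function name `check_layout` is hypothetical, and it tests nothing beyond the three points listed above.

```python
from pathlib import Path


def check_layout(popfile_dir):
    """Return a list of problems found in the clean_corpus installation."""
    base = Path(popfile_dir) / "clean_corpus"
    script = base / "clean_corpus.pl"
    triplets = base / "triplets"

    problems = []
    if not base.is_dir():
        problems.append(f"missing directory: {base}")
    if not script.is_file():
        problems.append(f"missing script: {script}")
    if not triplets.is_dir():
        problems.append(f"missing directory: {triplets}")
    elif not any(p.is_file() for p in triplets.iterdir()):
        # The directory exists but no language files were extracted into it.
        problems.append(f"no triplet files in: {triplets}")
    return problems
```

An empty list means the directories look right; otherwise each entry names the item that is missing.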
Next: Command line options.