How to use clean_corpus

After you have downloaded and installed the script (see the download page), open a command prompt (DOS prompt, shell, terminal, or whatever the command line is called on your system) and change to the clean_corpus directory, which is now a subdirectory of your POPFile installation directory.

Don't worry: at this stage clean_corpus will not actually change your corpus!

First, shut down POPFile if it is currently running.

Second, run clean_corpus.

If you are using the cross-platform version of POPFile, type
perl clean_corpus.pl
and hit return.

If you are using POPFile for Windows (and don't have Perl installed on your system), use the batch file clean_corpus.bat to start clean_corpus.

clean_corpus will now go through your corpus and apply the rule set to each word with a word count of 1. This may take a while, depending on the speed of your computer and on the size of your corpus, but you will be given lots of information along the way.

When the script is finished, you will find a new file named thrownout.txt in the directory. This file contains every word that triggered one of the rules, together with the bucket the word is contained in and the exact rule that was triggered.

Inspect this file in your favorite text editor and look for words you would rather keep than have thrown away.

If you find anything you want to keep, create a new file named keep.txt that lists those words. You can either put one word per line in this file, or start from thrownout.txt: delete every line that belongs to a correctly thrown out word, leaving only the words you want to keep, and save the edited version as keep.txt.
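If thrownout.txt is long, the second approach can be automated. The following Python sketch is not part of clean_corpus; it is only an illustration of the editing step described above. It copies every line of thrownout.txt that mentions one of your chosen words into keep.txt. The function name build_keep_file is hypothetical, and the sketch assumes that each word appears on its line as a whitespace-separated token, which may not match the real layout of thrownout.txt.

```python
from pathlib import Path


def build_keep_file(thrownout_path, keep_path, words_to_keep):
    """Write to keep_path every line of thrownout_path that mentions a wanted word.

    Assumption: a word shows up in thrownout.txt as a whitespace-separated
    token on its line. The real file layout may differ, so check the result.
    """
    # Compare case-insensitively so "Hello" and "hello" are treated alike.
    wanted = {word.lower() for word in words_to_keep}
    kept_lines = []
    text = Path(thrownout_path).read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines():
        tokens = {token.lower() for token in line.split()}
        if tokens & wanted:  # at least one wanted word is on this line
            kept_lines.append(line)
    # Write the surviving lines, one per line, as the new keep file.
    Path(keep_path).write_text(
        "".join(line + "\n" for line in kept_lines), encoding="utf-8"
    )
    return kept_lines
```

Run from the clean_corpus directory, a call such as build_keep_file("thrownout.txt", "keep.txt", ["invoice"]) would write only the lines mentioning "invoice" to keep.txt (the word is just an example). Look over the resulting file in your text editor before relying on it.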

How to clean your working corpus

After you have reviewed what clean_corpus would throw out, you may want to run it for real and actually remove those words. Here is how:

Next: Download clean_corpus.
