clean_corpus.pl
A script to clean your POPFile corpus. Current version: 9b (July 8, 2004)
News
I have finally gotten around to updating clean_corpus. Version 9 is compatible with POPFile 0.20 and 0.21, and it will also work with version 0.22.
Besides compatibility with the current version of POPFile, clean_corpus version 9 has gained a new rule. This rule is used only in what we call the "probability mode". The default run mode of clean_corpus still does what you expect it to do: it looks for non-words and removes them from your corpus.
The probability mode is described on (you guessed it) the probability mode page.
What is clean_corpus.pl?
If you have made it here, you most probably know what POPFile is. If you don't, this page won't be of any use to you. But if you use email, you definitely should give POPFile a try.
If you have been using POPFile for a while, your corpus will have grown. It might even have grown more than it needs to. There is stuff in emails that is not likely to ever show up in another email again. For example, message IDs should be unique, so why would you want them in your corpus? Other examples include typos, nonsense words that spammers put into their messages because they think this will prevent filters from working, and random letters that earlier versions of POPFile extracted from encoded messages.
All of this makes your corpus bigger without contributing in any way to classification accuracy. It does, however, eat up memory and CPU resources. That is why some people think that their POPFile corpus should be clean, lean, and mean. And this is where clean_corpus kicks in.
clean_corpus.pl is a Perl script that will, well, clean your corpus. To this end, we use a set of rules that were designed to tell words from non-words.
The purpose of clean_corpus is not to increase POPFile's accuracy. The purpose is to remove words from the corpus that don't do anything because they will never be used to classify a message. Thus our goal is to decrease the size of the POPFile corpus without changing POPFile's accuracy.
Another important point is that you should not run the script too often. By default, the script only looks at words with a word count of 1. These can either be words that appear only once (our targets) or words that are new to the corpus and whose word count may increase with the next reclassification you perform. Running the script too often risks removing the latter before their counts have had a chance to grow.
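To illustrate the default behavior, here is a minimal sketch in Python. It is not the code of clean_corpus.pl: the corpus is assumed to be a plain word-to-count mapping, the names clean and has_no_vowels are made up for this example, and the "no vowels" check is only a toy stand-in for the real rule set. The point it shows is that a word is removed only if its count is 1 and a rule flags it as a non-word.

```python
from typing import Callable, Dict, List, Tuple


def has_no_vowels(word: str) -> bool:
    """Toy stand-in for a real rule (assumption, not an actual clean_corpus rule)."""
    return not any(ch in "aeiou" for ch in word.lower())


def clean(
    corpus: Dict[str, int],
    is_non_word: Callable[[str], bool],
) -> Tuple[Dict[str, int], List[str]]:
    """Return (kept words with counts, sorted list of removed words).

    Only words with a count of exactly 1 are candidates for removal;
    words seen more often are kept no matter what the rule says.
    """
    kept: Dict[str, int] = {}
    removed: List[str] = []
    for word, count in corpus.items():
        if count == 1 and is_non_word(word):
            removed.append(word)
        else:
            kept[word] = count
    return kept, sorted(removed)


if __name__ == "__main__":
    # Hypothetical corpus, invented for this example.
    sample = {"meeting": 12, "invoice": 1, "xqzvbnmk": 1, "qwrtpsdf": 3}
    kept, removed = clean(sample, has_no_vowels)
    print("kept:", kept)
    print("removed:", removed)
```

Note that "invoice" survives even though its count is 1, because the rule does not flag it. A word like that may simply be new to the corpus, which is exactly why the count alone is not enough to decide.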
If you think that your POPFile corpus is too big, run the script and see where it gets you. If you think POPFile's accuracy is too low, there is nothing we can do for you.
Next: The rule set.