The probability rule

What the probability rule does is quite easy to explain: it looks for words that occur in all of your buckets and removes those that have the same probability in every bucket.
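To make the rule concrete, here is a minimal sketch in Python. It is not clean_corpus' actual code. It assumes that a word's probability in a bucket is its count divided by the total word count of that bucket, and that the corpus is held as a dictionary of per-bucket word counts. The function name and the data layout are made up for illustration.

```python
from fractions import Fraction


def equally_probable_words(buckets):
    """Return the words that occur in every bucket with the same probability.

    `buckets` maps a bucket name to a dict of word -> count
    (a hypothetical layout, not the real corpus file format).
    """
    if not buckets:
        return set()

    # Total word count per bucket: the denominator of each probability.
    totals = {name: sum(counts.values()) for name, counts in buckets.items()}

    # Only words that actually occur in all buckets are candidates.
    common = set.intersection(
        *({w for w, c in counts.items() if c > 0} for counts in buckets.values())
    )

    removable = set()
    for word in common:
        # Exact fractions avoid floating point noise when testing equality.
        probs = {
            Fraction(counts[word], totals[name])
            for name, counts in buckets.items()
        }
        if len(probs) == 1:
            removable.add(word)
    return removable


corpus = {
    "spam": {"the": 10, "viagra": 30, "meeting": 10},  # 50 words in total
    "work": {"the": 20, "viagra": 10, "meeting": 70},  # 100 words in total
}
print(sorted(equally_probable_words(corpus)))
```

In this toy corpus only "the" qualifies: it makes up one fifth of each bucket, while the other two words are clearly more typical of one bucket than the other.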

Why would it do that? Because if a word is equally probable for all your buckets, then that word doesn't contribute anything to the classification of a new message.
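A short derivation shows why such a word carries no information. It assumes a naive Bayes classifier, which this page does not state explicitly; the symbols b (a bucket), w_i (the words of the message), w* (the equally probable word) and c (its shared probability) are introduced here for illustration only.

```latex
% Score of bucket b for a message containing w_1, ..., w_n and w^*:
\[
  S(b) = P(b)\, P(w^{*} \mid b) \prod_{i=1}^{n} P(w_i \mid b)
\]
% If P(w^* | b) = c for every bucket b, the factor c appears in every score:
\[
  S(b) = c \cdot P(b) \prod_{i=1}^{n} P(w_i \mid b)
\]
% Normalising over all buckets b' cancels c:
\[
  P(b \mid \text{message})
  = \frac{S(b)}{\sum_{b'} S(b')}
  = \frac{P(b) \prod_{i=1}^{n} P(w_i \mid b)}
         {\sum_{b'} P(b') \prod_{i=1}^{n} P(w_i \mid b')}
\]
```

The result is the same whether or not w* is in the corpus, so under this assumption removing the word changes neither the ranking of the buckets nor their probabilities.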

However, this only holds for a corpus that isn't too new; let's call it a matured corpus.

Simply give it a try. Like clean_corpus' default mode, the probability mode will not change your corpus unless you tell it to do so with the nodebug command line option. Also like the default mode, it will tell you which words it has thrown out (or would throw out) by means of the thrownout.txt file.

The number of words that the probability rule removes from your corpus will be rather low. Expect 100 words, maybe fewer.
