The rule set

clean_corpus applies a set of rules to the words in your POPFile corpus. The rules were designed to catch non-words, and we think that they do a very good job. An important point is that clean_corpus does not look at each and every word in your corpus. Unless you tell it otherwise (see the command line switches), it only considers words with a word count of 1, so real words are even less likely to be caught.
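To make the word count filter concrete, here is a minimal sketch in Python. It is not the script's actual code: the function name select_candidates, the max_count parameter and the dictionary layout are assumptions made for illustration only.

```python
def select_candidates(word_counts, max_count=1):
    """Return the words whose count is at most max_count.

    word_counts is assumed to map each corpus word to its count.
    With the default of 1, only words seen once are considered.
    """
    return [word for word, count in word_counts.items() if count <= max_count]


corpus = {"viagra": 57, "xkqzt": 1, "meeting": 12, "aeiuo": 1}
print(select_candidates(corpus))
```

Only the words returned here would then be checked against the rules; everything with a higher count is left alone.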

Here are the rules:

1. No consonant or vowel sequences longer than 3!

If the script encounters a word that contains a sequence of more than 3 consonants or vowels, the word will be thrown out.
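As a rough illustration of rule 1, the check can be sketched with two regular expressions. This is a simplified Python sketch, not the script's own code: it assumes plain ASCII letters and treats "y" as a consonant, while the real script also deals with accented characters and umlauts.

```python
import re

# Assumption: ASCII letters only, "y" counted as a consonant.
VOWEL_RUN = re.compile(r"[aeiou]{4,}", re.IGNORECASE)
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def has_long_run(word):
    """True if the word contains 4 or more consecutive vowels or consonants."""
    return bool(VOWEL_RUN.search(word) or CONSONANT_RUN.search(word))


for word in ("banana", "qwrtp", "aeiou"):
    print(word, has_long_run(word))
```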

2. No words consisting of consonants or vowels only!

Regular words come with consonants and vowels intermixed. If we find a word that consists solely of either consonants or vowels, we discard it.
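A minimal Python sketch of rule 2 could look like the following. It is an illustration under assumptions, not the script's code: ASCII letters only, "y" counted as a consonant, and a hypothetical abbreviations parameter standing in for the list of known abbreviations.

```python
import re

# Assumption: ASCII letters only, "y" counted as a consonant.
VOWELS_ONLY = re.compile(r"[aeiou]+", re.IGNORECASE)
CONSONANTS_ONLY = re.compile(r"[b-df-hj-np-tv-z]+", re.IGNORECASE)


def is_single_class(word, abbreviations=frozenset()):
    """True if the word is made of vowels only or of consonants only.

    abbreviations is a hypothetical set of lowercase words that are exempt.
    """
    if word.lower() in abbreviations:
        return False
    return bool(VOWELS_ONLY.fullmatch(word) or CONSONANTS_ONLY.fullmatch(word))


print(is_single_class("xzq"))
print(is_single_class("cat"))
print(is_single_class("CNN", abbreviations={"cnn"}))
```

Without some kind of exemption list, a word like CNN would be caught by this check, which is why the abbreviations parameter is part of the sketch.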

3. No sequences of "strange" characters!

A strange character is any character with a code in the range 192 to 255 (excluding umlauts and accented characters). Strictly speaking, these codes lie outside 7-bit ASCII, which ends at 127.

4. No more than 3 consecutive digits!

If a word (not an email address and not an IP address) contains more than 3 consecutive digits it will be expunged from your corpus.
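Rule 4 can be sketched in Python as shown below. This is an illustration only: the patterns used to recognize email addresses and IP addresses are deliberately loose assumptions and are not taken from the script.

```python
import re

DIGIT_RUN = re.compile(r"[0-9]{4,}")
# Assumption: very loose shapes for "looks like an email / IPv4 address".
EMAIL_LIKE = re.compile(r"[^@\s]+@[^@\s]+")
IPV4_LIKE = re.compile(r"[0-9]{1,3}(\.[0-9]{1,3}){3}")


def has_long_digit_run(word):
    """True if a word that is neither email-like nor IP-like has 4+ digits in a row."""
    if EMAIL_LIKE.fullmatch(word) or IPV4_LIKE.fullmatch(word):
        return False
    return bool(DIGIT_RUN.search(word))


for word in ("abc12345", "abc123", "user12345@example.com", "192.168.1.1"):
    print(word, has_long_digit_run(word))
```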

5. No substring repetitions!

If a substring of 2 or more characters is repeated more than twice in a word, we refrain from calling that a word.
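One possible reading of rule 5 is sketched below in Python. The exact meaning of "repeated more than twice" is an assumption here: the sketch flags a word when the same substring of 2 or more characters occurs at least three times without overlapping, anywhere in the word. The script itself may count repetitions differently.

```python
import re

# Assumption: three non-overlapping occurrences of the same substring
# (length 2 or more), not necessarily adjacent to each other.
REPEATED = re.compile(r"(.{2,}).*\1.*\1")


def has_repeated_substring(word):
    """True if some substring of length >= 2 occurs three or more times."""
    return bool(REPEATED.search(word.lower()))


for word in ("hahaha", "abxabyab", "banana", "mississippi"):
    print(word, has_repeated_substring(word))
```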

6. No unknown triplets!

Using dictionaries, we have compiled language specific lists of three character sequences (triplets). If we encounter a word containing a triplet that is not in our list of allowed/known triplets, we discard the word.
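The triplet idea can be sketched in a few lines of Python. This is a simplified illustration, not the script's code: the function names are made up, and the two-word "dictionary" is only there to keep the example short.

```python
def triplets(word):
    """Return the set of all three character sequences in a word (lowercased)."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def build_known_triplets(dictionary_words):
    """Collect every triplet that occurs in a list of dictionary words."""
    known = set()
    for w in dictionary_words:
        known |= triplets(w)
    return known


def has_unknown_triplet(word, known):
    """True if the word contains at least one triplet that is not in known."""
    return not triplets(word) <= known


# Toy dictionary, for illustration only.
known = build_known_triplets(["hello", "world"])
print(has_unknown_triplet("hell", known))
print(has_unknown_triplet("helxo", known))
```

Words shorter than three characters contain no triplets at all, so this check never flags them.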

7. No message ids!

If something looks like an email address, but contains more than 3 digits before the @ sign, we consider it a message id and throw it out.
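A minimal Python sketch of rule 7 follows. Two things are assumptions: the loose pattern for "looks like an email address", and the choice to count all digits before the @ sign rather than only consecutive ones.

```python
import re

# Assumption: anything of the shape "something@something" is email-like.
EMAIL_LIKE = re.compile(r"([^@\s]+)@[^@\s]+")


def looks_like_message_id(word):
    """True if the word is email-like and has more than 3 digits before the @."""
    match = EMAIL_LIKE.fullmatch(word)
    if not match:
        return False
    local_part = match.group(1)
    digit_count = sum(1 for ch in local_part if ch in "0123456789")
    return digit_count > 3


for word in ("20040312.1234@mail.example.com", "john99@example.com"):
    print(word, looks_like_message_id(word))
```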

When you read the above, you might have said "But wait, this will throw out xy!". We have tried to keep the number of real and possibly useful words that the script removes to a minimum. For example, only rule 7 will get to see words that look like email addresses. We have compiled a long list of common abbreviations, so CNN will not be thrown out. We took care to handle accented characters and umlauts so that words in languages other than English are not thrown out. And much more. So don't worry too much about real words that might get removed, because we have already done that worrying on your behalf.

Instead, think of the things you can throw out. Spammers like to insert random character sequences into their messages because they have the strange belief that this will stop filters from working. Another source of random junk is a bug in older versions of POPFile, before 0.19, that treated encoded strings as words. Message ids might make it into your corpus. And people make typos that they don't necessarily repeat.

Next: How to use clean_corpus.
