Command line options
clean_corpus accepts a set of command line options that
you can use to make it do something special, experimental, or dangerous. Here they are:
--set backup=0
By default, clean_corpus will attempt to back up your corpus (i.e. the popfile.db file).
It will create a directory called "corpus_backup" within the clean_corpus
directory and the backup of your corpus will be placed inside that directory, overwriting
any older backups that may already sit in that directory.
If you don't want to make a backup, set the backup variable to 0.
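
The backup behaviour described above can be pictured with a small Python sketch. This is not the script's own code; the function name backup_corpus and its two parameters are made up for illustration, and only the directory name "corpus_backup" and the overwrite behaviour come from the description.

```python
import os
import shutil

def backup_corpus(corpus_path, script_dir):
    """Copy the corpus file into <script_dir>/corpus_backup (hypothetical helper)."""
    backup_dir = os.path.join(script_dir, "corpus_backup")
    # Create the backup directory if it does not exist yet.
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(backup_dir, os.path.basename(corpus_path))
    # copy2 replaces an older backup with the same name.
    shutil.copy2(corpus_path, target)
    return target
```

Calling backup_corpus twice in a row leaves a single backup file, which matches the "overwriting any older backups" behaviour.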
--set nodebug=1
This will make the script overwrite your old corpus (default is 0/off).
Use it only _after_ you have run the script in debug mode and inspected the file
thrownout.txt.
--set max_wordcount=n
If you do not use this option, the script will only consider words that have a word count of 1. To raise the limit, replace n with the maximum word count you want, e.g. 2. Raising the maximum word count is likely to also raise the number of real words that the rule set mistakenly removes from the corpus.
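
To make the effect of the limit concrete, here is a minimal Python sketch. It assumes the limit acts as an upper bound on a word's count; the function name candidate_words and the sample data are invented for illustration and are not taken from the script.

```python
def candidate_words(word_counts, max_wordcount=1):
    """Return the words whose count does not exceed max_wordcount (assumed semantics)."""
    return sorted(w for w, c in word_counts.items() if c <= max_wordcount)

# Made-up sample data: word -> word count in the corpus.
counts = {"xqzt": 1, "viagra": 2, "hello": 40}

print(candidate_words(counts))                   # ['xqzt']
print(candidate_words(counts, max_wordcount=2))  # ['viagra', 'xqzt']
```

With the higher limit, more words reach the rule set, which is why more real words may end up being removed.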
--set all_checks=1
If you want to debug the rule set, you can use this option to force every rule to be applied to each word. Normally, to save some time, the script breaks out of the rule set as soon as a word triggers one of the rules.
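
The difference between the normal early exit and all_checks=1 can be sketched in Python as follows. The three example rules are invented and have nothing to do with the real rule set; the function name triggered_rules is likewise only an assumption for this illustration.

```python
def triggered_rules(word, rules, all_checks=False):
    """Return the 1-based numbers of the rules that the word triggers."""
    hits = []
    for number, rule in enumerate(rules, start=1):
        if rule(word):
            hits.append(number)
            if not all_checks:
                break  # default behaviour: stop at the first rule that fires
    return hits

# Made-up example rules, NOT the rules used by clean_corpus.
example_rules = [
    lambda w: len(w) > 20,                   # rule 1: very long token
    lambda w: any(c.isdigit() for c in w),   # rule 2: contains a digit
    lambda w: w.isupper(),                   # rule 3: all upper case
]

print(triggered_rules("ABC123", example_rules))                   # [2]
print(triggered_rules("ABC123", example_rules, all_checks=True))  # [2, 3]
```

The first call stops after rule 2, the second one also reports rule 3, which is the extra information you want while debugging.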
--set rule=n
If you want to debug a specific rule, use this option and set n to that rule's number. E.g. --set rule=4 will only use rule #4.
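
Several --set options can presumably be given at once. As an illustration only, the following Python sketch shows one way such name=value pairs could be collected. It says nothing about how clean_corpus itself reads its options; the function name parse_set_options is hypothetical, and the defaults for backup, all_checks and rule are assumptions derived from the descriptions above (only nodebug=0 and max_wordcount=1 are stated explicitly).

```python
# Assumed defaults; see the note above on which of them are stated in the text.
DEFAULTS = {"backup": 1, "nodebug": 0, "max_wordcount": 1, "all_checks": 0, "rule": None}

def parse_set_options(argv):
    """Collect '--set name=value' pairs into a dict of integers (hypothetical helper)."""
    options = dict(DEFAULTS)
    i = 0
    while i < len(argv):
        if argv[i] == "--set" and i + 1 < len(argv):
            name, sep, value = argv[i + 1].partition("=")
            if sep and name in options:
                options[name] = int(value)  # a non-numeric value raises ValueError
            i += 2
        else:
            i += 1
    return options

opts = parse_set_options(["--set", "backup=0", "--set", "rule=4"])
print(opts["backup"], opts["rule"], opts["nodebug"])  # 0 4 0
```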
Options for the new probability rule
--set equal_prob=1
If you use this command line switch, clean_corpus will not use its default rule set but will look at the word probabilities instead. Visit the page about the probability rule for more details.
--set mod_stopwords=1
If you want the probability rule to add each word that has equal probabilities in all your buckets to the list of ignored words (stopwords), set this variable to 1. By default, the variable is set to 0 so that your stopwords list will not be changed.
--set tolerance=n
The tolerance variable determines what the probability rule will consider as "equal". You can use the values 1 through 5; the default is 1. The higher this value, the more tolerant the rule is, i.e. the more words it will find.
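
This page does not say how the values 1 through 5 are turned into a numeric threshold. The Python sketch below therefore uses a purely assumed mapping (a relative spread of tolerance * 0.05 between the lowest and highest bucket probability) just to show why a higher tolerance finds more words. The function name roughly_equal and the step parameter are made up.

```python
def roughly_equal(probabilities, tolerance=1, step=0.05):
    """Treat bucket probabilities as "equal" if their relative spread is small.

    The threshold tolerance * step is an assumption for illustration only.
    """
    if not 1 <= tolerance <= 5:
        raise ValueError("tolerance must be between 1 and 5")
    lowest, highest = min(probabilities), max(probabilities)
    if highest == 0:
        return True  # all probabilities are zero, hence equal
    return (highest - lowest) / highest <= tolerance * step

# Made-up probabilities of one word in three buckets.
bucket_probs = [0.010, 0.011, 0.0105]

print(roughly_equal(bucket_probs, tolerance=1))  # False
print(roughly_equal(bucket_probs, tolerance=2))  # True
```

The same word is rejected at tolerance 1 but counted as "equal" at tolerance 2, so raising the value widens the set of words the rule picks up.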