Poco Forums • View topic - Poco's Bayesian Filters
Page 1 of 2

Poco's Bayesian Filters

PostPosted: Sun Aug 22, 2004 5:58 pm
by rabmail
No matter what I do I can't get Poco's Bayesian Filters to give the same performance in blocking Spam as I can with PopFile running on my OE, both on the same mail accounts. On Poco, I am running Strict Bayesian with the other filters disabled.

Twice I have deleted the DBGood.ini and DBSpam.ini and retrained Poco's Bayesian Filters, but I seldom see better than 96%, which is what I have today. I presently have 13,988 Junk Words and 7,062 Good Words.

By contrast, PopFile on OE is at 98.85% today running on the same mail accounts. Yes, I know I can disable Poco's Bayesian Filters and use PopFile but I really want to try and get everything done with Poco.

When I have some Spam messages which were not filtered out, I highlight them and then Shift+Control+- to move them to the Junk Mail folder. It is my understanding that this action also trains Poco's Bayesian Filters. Am I correct? It certainly changes the statistics.

Poco's implementation of Bayesian Filtering seems overly complicated with regard to training compared with PopFile, which I can train in two days to get 98+%.

rabmail
PocoMail 3.1.0.1880

PostPosted: Sun Aug 22, 2004 7:31 pm
by robin
There are posts here and here on Bayesian filtering that might help.

PostPosted: Mon Aug 23, 2004 1:36 pm
by vamp07
Until it is improved I don't think it is possible to get popfile accuracy with the pocomail Bayesian filter.

PostPosted: Wed Aug 25, 2004 4:45 am
by SFCurley
I reset my Bayesian Filters about two weeks ago and thought I'd share my stats before and after AND mention how I trained differently this time . . .

I had been at 98% with 1.25% or so false positives. I had a small corpus and only trained on errors, based on recommendations from POPFile, which I had used previously.

I was ok with the 98%, but didn't like the false positive rate, so decided to try something different.

I deleted my two corpus files, and then trained Poco's BF on EVERY good email I had in my Trash Box. This produced a good corpus of about 25,000 words. I essentially wanted to make sure that Poco had seen every word that had ever occurred in a good email.

I then trained on batches of bad emails up to the point where I wasn't getting any false negatives based on what I had in my junkmail box. As it turned out, this produced a junk corpus of about 25,000 words, too. Notice that I didn't run every junk email through, just enough so that Poco wasn't making any classification mistakes and then I stopped.

With that and my Good Mail Bias set at 3.0, my two week stats are 99.1% and 0.23% false positives. That's on about 1,000 emails so far. I can certainly live with these numbers if they hold up.

PostPosted: Wed Aug 25, 2004 6:12 am
by Jim
Thanks for sharing. That's good to know. Your experience confirms our belief that PocoMail's Bayesian filters may perform marginally better or worse than standalone Bayesian filters (like POPFile) in the short term, right after installation, but in the longer term it is each user's ongoing training regimen that determines how well Bayesian filters perform.

PostPosted: Wed Aug 25, 2004 6:47 am
by SFCurley
Jim,

Quite welcome.

In part, just because I like to experiment, I'm going to continue updating my corpus with all good emails, say once a week by selecting them and then going to the Junk Mail configuration screen and clicking on "Good".

Since my experience (and I think Vamp's, too) is that the thing most likely to cause a false positive is an email with a bunch of new words Poco has not seen before, this might help alleviate that.

I do wonder if there's not a tweak regarding how Poco's Bayesian Filter accounts for never-before-seen words that would make a bit of a difference, but obviously not a huge deal for me given my current numbers (assuming those percentages hold up).

--Sean

PostPosted: Wed Aug 25, 2004 7:03 am
by vamp07
I'm going to try this again. Not sure I have the patience to do what you did, but I am going to train my good words with up to 30,000 words, followed by training my bad words to that same number. Let's see if that helps. I'm also beginning to think popfile's accuracy may come from the tokens it creates of header items. Doesn't poco ignore most headers? I got popfile working again, and with dictionaries of about 3,000 words good and bad I am getting somewhere around 99% accuracy. I switched back to popfile from poco after I got stuck with accuracy of 92-93%. That was with dictionaries that had climbed to about 20,000 bad words and I cannot remember how many good words.

PostPosted: Wed Aug 25, 2004 7:52 am
by SFCurley
Poco DOES look at header data and actually creates a separate "class" of tokens for info per header. So, say, if the word "viagra" appears in the subject line and in the body, poco creates two tokens: one for Viagra and another for Subject-Viagra. This makes sense and is, I believe, what POPFile does, too.
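
The header-versus-body token idea can be sketched roughly as follows. This is only an illustration of the scheme described in this post, not Poco's or POPFile's actual tokenizer: the function name `tokenize`, the word pattern, and the lower-cased `header-word` naming are all assumptions made for the example.

```python
import re
from email import message_from_string

# Hypothetical word pattern; the real tokenizers may split text differently.
WORD_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(raw_message: str) -> list[str]:
    """Return body words as bare tokens and header words as 'header-word' tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    # Words found in a header get that header's name as a prefix,
    # so "viagra" in the Subject line becomes "subject-viagra".
    for name, value in msg.items():
        for word in WORD_RE.findall(value):
            tokens.append(f"{name.lower()}-{word.lower()}")
    # Words in a plain-text body are stored without a prefix.
    body = msg.get_payload()
    if isinstance(body, str):
        tokens.extend(word.lower() for word in WORD_RE.findall(body))
    return tokens

example = "Subject: Cheap Viagra\n\nBuy viagra now\n"
print(tokenize(example))
```

With this split, the same word can carry different weight depending on where it appears, since the two tokens are counted independently.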

There are some differences between Poco and POPFile:

1. POPFile ignores all-numeric tokens except IP addresses.

2. POPFile uses something called pseudo-tokens, which are tokens that are created not by words in the email, but by certain conditions appearing in the email.

An example is that if an email has a word separated by periods, then POPFile creates a token "trick:dottedwords" and adds one to the count for that token. "trick:dottedwords" is nothing that specifically appears in the email but indicates that this "trick" is present in the email. Pretty clever idea, if you ask me.

Those are the only two things where I know for sure that Poco and POPFile differ.
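
To make the pseudo-token idea concrete, here is a minimal sketch that emits a token when a condition is detected rather than when a word is seen. Only the token names come from this thread; the regular expressions and the function name `pseudo_tokens` are assumptions for illustration, and POPFile's real detection rules are likely more involved.

```python
import re

# Single letters joined by dots, e.g. "v.i.a.g.r.a" (at least three letters).
DOTTED = re.compile(r"\b(?:[A-Za-z]\.){2,}[A-Za-z]\b")
# Single letters joined by single spaces, e.g. "v i a g r a" (at least four letters).
SPACED = re.compile(r"(?<![A-Za-z])(?:[A-Za-z] ){3,}[A-Za-z](?![A-Za-z])")

def pseudo_tokens(text: str) -> list[str]:
    """Return one pseudo-token per detected obfuscation trick."""
    tokens = []
    tokens += ["trick:dottedwords"] * len(DOTTED.findall(text))
    tokens += ["trick:spacedout"] * len(SPACED.findall(text))
    return tokens

print(pseudo_tokens("buy l.i.k.e t.h.i.s today"))
print(pseudo_tokens("l i k e t h i s"))
```

The returned tokens would then be counted in the good or junk corpus exactly like ordinary words, which is what lets the filter learn that the trick itself is a spam signal.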

********************* 

By the way, the pseudo-tokens that I know POPFile uses are:

 trick:dottedwords             (word with letters separated by dots l.i.k.e t.h.i.s)
 trick:invisibleink            (words using color to hide)
 trick:spacedout               (word with letters separated by spaces l i k e t h i s)
 charset:<various>             (the character set listed in the message)
 encoding:<various>            (encoding methods)
 header:<various>              (headers present in a message)
 html:authorization            (a URL contained authorization information)
 html:backcolor<value>
 html:colordistance<number>    (distance between back and foreground colors)
 html:comment                  (HTML comments used)
 html:cssdisplay<value>        (display value defined with CSS)
 html:cssfontsize<size>        (font size defined with CSS)
 html:cssvisibility<value>     (visibility value defined with CSS)
 html:css*color<color>         (various ways of defining colors with CSS)
 html:emptypair                (HTML tags that aren't marking anything)
 html:encodedurl               (encoded URLs)
 html:fontsize<size>           (various ways of defining fonts)
 html:iframeremotesrc          (an iframe in the mail has a remote source)
 html:img*<pixels>             (height and width of images)
 html:imgremotesrc             (an image in the mail has a remote source)
 html:imgwidth<value>
 html:imgheight<value>
 html:imgfontsize<value>
 html:invalidtag               (fake HTML tags)
 html:numericentity            (URLs written in numeric format)
 html:td                       (table definition elements)
 html:*color<color>            (various ways of defining colors)
 mimeextension:<various>       (extension of attachments)
 mimename:<various>            (filenames of attachments)
 spamassassin:<various>        (SpamAssassin tests)
 spamassassinlevel:spam        (counted once for every full point of SpamAssassin level)
 

PostPosted: Wed Aug 25, 2004 8:11 am
by Jim
SFCurley wrote:I do wonder if there's not a tweak regarding how Poco's Bayesian Filter accounts for never-before-seen words that would make a bit of a difference, but obviously not a huge deal for me given my current numbers (assuming those percentages hold up).

--Sean


This is something we can definitely work on. If I remember correctly, we do assign a probability to never-before-seen words, depending on which corpus the word is missing from. I don't know what the actual values are, but since the probabilities are normalized, that should not matter too much.
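
For readers who want to experiment with the unseen-word question, here is a rough sketch of a generic naive-Bayes style combination with a configurable default for words that are in neither corpus. It is not Poco's formula, which is not given in this thread: the function name `spam_probability`, the default of 0.4, and the clamping limits are all assumptions chosen for illustration.

```python
import math

def spam_probability(tokens, spam_prob, unknown=0.4):
    """Combine per-token spam probabilities and normalize to a value in [0, 1].

    `spam_prob` maps known tokens to their spam probability; tokens missing
    from it fall back to the assumed `unknown` default.
    """
    log_spam = 0.0
    log_good = 0.0
    for token in tokens:
        p = spam_prob.get(token, unknown)
        p = min(max(p, 0.01), 0.99)  # keep the logarithms finite
        log_spam += math.log(p)
        log_good += math.log(1.0 - p)
    diff = log_good - log_spam
    if diff > 700:  # avoid overflow in exp(); the result is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

new_words = [f"word{i}" for i in range(10)]
print(spam_probability(new_words, {}, unknown=0.4))
print(spam_probability(new_words, {}, unknown=0.6))
```

In this simplified model, a message made up mostly of unseen words is scored almost entirely by the default, so trying a few values of `unknown` is one way to see how sensitive a given setup would be.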

PostPosted: Wed Aug 25, 2004 9:16 am
by vamp07
If we assume that poco's tokenizing is very similar to popfile's, and equivalent in most cases, then there must be something very different in the way they compute the values of the tokens. How is it possible that with only 3k good and bad words popfile gives me over 99% accuracy, while poco needs many more tokens? Even then, I'm not convinced this latest round of testing will get anything similar to popfile's accuracy, although I am trying.

PostPosted: Wed Aug 25, 2004 9:54 am
by SFCurley
Well, that is a good question and I suspect it could be a couple of things. One is that POPFile doesn't include all-numeric tokens other than IP addresses. That would be a 10% reduction in corpus size right there. I think also that POPFile excludes certain headers, which are almost always unique to the message -- ones like "xoriginalarrivaltime", "deliverydate", and "date", which all are non-recurring and probably don't add to the equation one way or the other. All three of these would lead to a larger corpus in Poco than in POPFile. I would actually like to see these selective exclusions (and perhaps some others) and the pseudo-token idea implemented in Poco, but that's just one man's opinion.
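
A small sketch of the two exclusions mentioned here: dropping all-numeric tokens unless they look like an IP address, and skipping headers that are unique per message. The header names are the three quoted in this post; the function name `keep_token`, the IPv4 pattern, and the header normalization are assumptions for illustration, not POPFile's actual code.

```python
import re
from typing import Optional

# Dotted-quad pattern; a loose check that does not validate each octet's range.
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
# Headers named in the post as unique to each message.
SKIPPED_HEADERS = {"xoriginalarrivaltime", "deliverydate", "date"}

def keep_token(token: str, header: Optional[str] = None) -> bool:
    """Decide whether a token should be added to the corpus."""
    if header is not None and header.lower().replace("-", "") in SKIPPED_HEADERS:
        return False  # values of per-message headers never recur
    if IPV4.fullmatch(token):
        return True   # IP addresses are kept even though they are numeric
    return not token.isdigit()  # drop purely numeric tokens

print(keep_token("192.168.0.1"))          # True
print(keep_token("12345"))                # False
print(keep_token("Aug", header="Date"))   # False
```

Running a filter like this over an exported corpus would be one way to check the estimate of how much of the dictionary is made up of numeric tokens.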

As far as accuracy, Vamp, what is your false positive rate? If you were only at 93%, I'm assuming that means you were getting 7% of your junk emails incorrectly treated as good (false negatives). Any false positives and at what rate? What were your good bias and threshold settings? Also, do you have poco checking mail from multiple accounts, with one (or more) account(s) having a much higher percentage of junk than the others?

POPFile out-of-the-box has what equates to a 99% threshold level, by the way. Don't know about a bias factor, though.

PostPosted: Thu Aug 26, 2004 12:54 am
by vamp07
In my latest testing I am running with a positive bias of 3, and my good dictionary has 28,000 words and my bad dictionary has 178,000. So far I am seeing good results, although I have only been running one day. I suspect my accuracy is in the same range as popfile based on what I have seen, but I have changed too many things since setting it up yesterday to give numbers. Obviously the dictionaries are huge compared to popfile's, although I couldn't care less if it can maintain accuracy.

Yes I do check multiple accounts although really only one is the culprit of giving me so much spam (about 300 spams a day). Most of the good email I receive is from the same sources and is consistent in content.

Doesn't poco ignore words until they occur 3 times? Could that be the problem with dictionaries needing to be so big?

PostPosted: Fri Aug 27, 2004 3:17 am
by SFCurley
I THINK it's not until the words occur 3 times, but rather until the number of times the word occurs multiplied by the good word bias exceeds 5, that Poco will start using that word in the probability calc. (Note emphasis on THINK.) However, whether a word has been seen 1 time or a million times doesn't affect the size of the corpus -- it's still counted as just one word for purposes of saying how many words are in the corpus.
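
One possible reading of that rule, written out as a sketch. The post is explicitly unsure of the details, so everything here is an assumption: the function name `is_usable`, the choice to weight only good-corpus occurrences by the bias and count junk occurrences once each, and the cutoff of 5.

```python
def is_usable(good_count: int, junk_count: int,
              good_bias: float = 3.0, minimum: float = 5.0) -> bool:
    """Return True once a word's weighted occurrence count exceeds `minimum`."""
    weighted = good_count * good_bias + junk_count
    return weighted > minimum

# Hypothetical word counts, only to show the two different notions of "size".
good = {"meeting": 40, "invoice": 1}
junk = {"viagra": 6, "invoice": 1}

words = set(good) | set(junk)
usable = [w for w in words if is_usable(good.get(w, 0), junk.get(w, 0))]

print(len(good), len(junk))  # corpus sizes count distinct words: 2 and 2
print(sorted(usable))        # only words past the cutoff take part in scoring
```

Under this reading, a word seen once in good mail with a bias of 3.0 scores 3 and is ignored, while a second sighting lifts it to 6 and brings it into the calculation.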

Will be interesting to track your results and see how they hold up going forward.

PostPosted: Fri Aug 27, 2004 6:30 am
by vamp07
Well I already deleted my huge dictionaries. It was obvious that the accuracy was excellent. I now know that poco's BF can give you great results but you need to let the dictionaries grow. I am now back to training slowly to see how long it takes. I suspect it will take a while because of only training on errors that over time become less frequent. Last time I had a go at this I tried for several weeks, only seeing my accuracy hover around 92-93%. It will take a while to grow the dictionaries to whatever size they need to be. I know that with popfile I got 98%+ accuracy after less than a week of training. With poco I suspect it will take months but eventually it will get there. Not sure why I am doing this. Mostly curiosity I guess. I think the word count could be part of the culprit for dictionaries needing to grow so big and training taking so long. They may be filled with words that never get used because of their few occurrences. Should be relatively easy to test. Just need to sort those ini files by what comes after the = to see what percentage of the words only occur 1-2 times and consequently never reach a count that allows them to be used. Do we know if popfile uses this rule of tokens needing to occur x times before they are used?


SFCurley wrote:I THINK it's not until the words occur 3 times, but rather until the number of times the word occurs multiplied by the good word bias exceeds 5, that Poco will start using that word in the probability calc. (Note emphasis on THINK.) However, whether a word has been seen 1 time or a million times doesn't affect the size of the corpus -- it's still counted as just one word for purposes of saying how many words are in the corpus.

Will be interesting to track your results and see how they hold up going forward.

PostPosted: Fri Aug 27, 2004 6:50 am
by SFCurley
I don't know if POPFile uses the "rule of x" to determine if a token is considered in the calculation.

Regarding sorting by the count, an easy way to do this is to export into an Excel spreadsheet, using "=" as a delimiter and then sorting by the count column. (I've done this and played around with deleting entries with low counts.)
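
For anyone who would rather script this than use a spreadsheet, here is a minimal sketch that tallies how many entries have each count. It assumes the corpus files hold one `word=count` entry per line, which is inferred from the "=" delimiter mentioned above and not verified against a real DBGood.ini or DBSpam.ini; the function names are made up for the example.

```python
from collections import Counter

def count_histogram(lines):
    """Map each occurrence count to the number of words that have it."""
    histogram = Counter()
    for line in lines:
        # Split on the last "=" in case a word itself contains one.
        word, sep, count = line.strip().rpartition("=")
        if not sep or not count.isdigit():
            continue  # skip blanks, section headers, and malformed lines
        histogram[int(count)] += 1
    return histogram

def share_at_or_below(histogram, limit):
    """Fraction of words whose count is `limit` or lower."""
    total = sum(histogram.values())
    low = sum(n for count, n in histogram.items() if count <= limit)
    return low / total if total else 0.0

# Made-up sample entries; for a real file, pass the open file object instead,
# e.g. count_histogram(open("DBSpam.ini", errors="replace")).
sample = ["viagra=12", "subject-viagra=2", "hello=1", "", "a=b=1"]
hist = count_histogram(sample)
print(share_at_or_below(hist, 2))  # 0.75 for the sample above
```

The share of entries with a count of 1 or 2 is the figure vamp07 was asking about, since those are the words least likely to have reached the usage cutoff.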

Re: Poco vs. POPFile, I suspect that a big part of the difference between Poco and POPFile could be POPFile's use of the pseudo-tokens I mentioned above. I bet -- and it's just a bet -- that there's a wealth of information about whether an email is or isn't spam implicit in those conditions.