Poco Forums • View topic - Differing Bayesian Results

Differing Bayesian Results

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby SFCurley » Wed Sep 29, 2004 2:21 am

A thought occurred to me about why some users need so much more time (and a larger corpus) to get good results. I was getting very good results (98%+) at 18k words in each corpus, and my home account is getting good results (99.81%) with 3k good words and 6k bad words, yet someone recently pointed out that he was at 70k words (total) and still at 70%.

Here's the thought/question:

Both at home and at work, I filter my newsletters before they get scored by the Bayesian filter, so I never correct on or train the newsletters into the Bayesian corpus. I'm wondering if the people who are getting good results either a) don't get many newsletters or b) do get newsletters but filter them and don't train them into the Bayesian filter as good. People who are getting poor results, on the other hand, might be training on the newsletters and thus need a larger corpus before getting satisfactory results.

I ask because a lot of newsletters can have a spammy look to them, and I'm just wondering if this could be a factor. Just a question . . .
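
To make the question concrete, here is a rough sketch of how one token's score could shift when newsletters are trained as good. It assumes a simplified Graham-style per-token estimate; I don't know that Poco's BF engine computes scores this way, and the function name, the token, and all of the counts are made up for illustration.

```python
def token_spam_probability(token, good_counts, bad_counts, n_good, n_bad):
    """Simplified Graham-style spam probability for a single token.

    Assumption: this mirrors the scheme from "A Plan for Spam",
    not necessarily what Poco's Bayesian filter does internally.
    """
    g = 2 * good_counts.get(token, 0)  # good counts are doubled to bias against false positives
    b = bad_counts.get(token, 0)
    if g + b == 0:
        return 0.4  # default score for a token never seen in either corpus
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))  # clamp so no single token is absolute


# Hypothetical counts: "unsubscribe" appears in 40 of 100 junk messages.
bad_counts = {"unsubscribe": 40}

# Case 1: newsletters are filtered out before training, so the token
# never shows up in the good corpus (100 good messages).
filtered = token_spam_probability("unsubscribe", {}, bad_counts, 100, 100)

# Case 2: 30 newsletters containing the token are trained as good,
# growing the good corpus to 130 messages.
trained = token_spam_probability("unsubscribe", {"unsubscribe": 30}, bad_counts, 130, 100)

print(f"filtered: {filtered:.2f}, trained as good: {trained:.2f}")
```

Under these made-up numbers the token goes from a strong junk indicator to roughly neutral once the newsletters are in the good corpus. That is the effect I'm asking about: spammy-looking tokens lose their weight, and it takes more training to make up the difference.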
SFCurley
 

Postby Michael » Wed Sep 29, 2004 2:29 am

One thing I'm finding is that the "Apply and Test" results can differ from the results computed when the message is downloaded. I discovered this just this morning.

I use Poco's old JMF rules as well as several scripts and filters. After messages are downloaded, I run a script to identify those in the Junk Mail mailbox that were not caught by the BF engine, and I have typically used those to train the engine. This morning I had a look at the headers for one of the messages. It was reported as passing the BF tests (a +10 score was added), but when I ran the "Apply and Test" function against it, it reported that the message failed the BF rules (a -20 score).

Can anyone else confirm this?
Michael
Moderator
 
Posts: 866
Joined: Mon Jul 26, 2004 12:14 pm
Location: Victoria BC, Canada

Postby SFCurley » Wed Sep 29, 2004 7:38 am

I have noticed, too, that this does occur on occasion.

Somewhat related, somewhat not: I have a check built into my filter stream that looks for any known sender being tagged as junk mail by the BF. When this happens, a display box with the message info pops up saying "False Positive" and I train it right there.
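
The logic of that check is roughly the following. This is only a sketch in Python, not my actual Poco script; the function name, the message fields, and the sender list are all hypothetical.

```python
def find_false_positives(messages, known_senders):
    """Return messages the BF tagged as junk even though the sender is known.

    Assumption: each message is a dict with a "sender" address and a
    "bf_junk" flag; neither field name comes from Poco itself.
    """
    known = {addr.strip().lower() for addr in known_senders}
    return [
        msg for msg in messages
        if msg["bf_junk"] and msg["sender"].strip().lower() in known
    ]


# Hypothetical inbox: only the second message is a false positive.
inbox = [
    {"sender": "stranger@example.com", "bf_junk": True},
    {"sender": "Friend@Example.com", "bf_junk": True},
    {"sender": "friend@example.com", "bf_junk": False},
]

flagged = find_false_positives(inbox, ["friend@example.com"])
for msg in flagged:
    print("False Positive:", msg["sender"])  # this is where I would retrain as good
```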
SFCurley
 

