Poco Forums • View topic - More PocoMail Junk Filtering vs. K9

More PocoMail Junk Filtering vs. K9

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby Trapper » Wed Jul 28, 2004 10:57 am

Obviously the thread I was working in got lost in the forum "transfer" process along with my username and password. :-)

Whatever, I was discussing the effectiveness of PocoMail's Junk Filtering when put up against K-9. I was also having major difficulty getting PocoMail to drop any junk whatsoever into the Junk folder. Even an over-install did not work, so I simply saved my ini's and my complete mailbox folder and subfolders, uninstalled, reinstalled and slipped my ini's and folders back in. Oddly enough, mail then started going to the Junk folder.

After testing Poco's junk filter for 10 days I gave up because it was obvious it is not yet up to par with K-9 or POPFile. About the highest I could get was a little over 92%. Perhaps that sounds okay but consider this: In the past 5 days I have had 531 mails go through K-9 with 302 good, 229 spam and 1 good mail mis-identified as spam. That gives me 99.81% detection in that timeframe. I think the PocoMail developers would do better if they'd simply drop all those other filtering rules they have in the junk filter and get the Bayesian filter up to speed. Just a thought .... a K-9 plugin into PocoMail would give it a tremendous spam filtering capability.
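For anyone who wants to check the 99.81% figure, here is a minimal sketch of the arithmetic using only the counts quoted above. The function name is made up for illustration, and it assumes "detection" means the share of all mail that was sorted correctly.

```python
def percent_correct(total, errors):
    """Share of messages sorted correctly, as a percentage."""
    return 100.0 * (total - errors) / total

good, spam, misfiled = 302, 229, 1   # counts quoted in the post
total = good + spam                  # 531 messages in five days

print(f"overall: {percent_correct(total, misfiled):.2f}%")
# Same single error measured against good mail only (assumed definition
# of a false-positive rate: good mail wrongly sent to junk).
print(f"good mail lost: {100.0 * misfiled / good:.2f}%")
```

Note that the single error looks larger when measured against the 302 good mails alone, which is the number that matters if you hate losing real mail.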

So, fire at will ...... he's over there - - - - - > :-)

Trapper
Trapper
Drop-in Visitor
 
Posts: 11
Joined: Wed Jul 28, 2004 10:30 am

Postby vamp07 » Wed Jul 28, 2004 1:31 pm

I completely agree. Bayesian filtering is incredible but Poco's implementation needs much improvement. I see no need for the built-in rules. I think they just complicate the issue. They are remnants of a method of spam filtering that never worked very well and now only serve to complicate the process.
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby Slaven » Wed Jul 28, 2004 2:19 pm

You can try removing the built-in filters completely and use only the Bayesian filter - I've had my non-Bayesian filters disabled for several months now while testing Bayesian filters, and since the last stat reset I've filtered 12,000 messages, 70% of which were spam, with 99% accuracy. This is the PocoMail Bayesian filter acting alone, without any other third-party aids. The more you train, the more accurate it will get.
Slaven Radic
Poco Systems Inc
Slaven
Poco Systems Inc
 
Posts: 1644
Joined: Fri Jul 23, 2004 7:37 pm

Postby bern » Wed Jul 28, 2004 2:32 pm

That hasn't been my experience.

I get 60 to 90 mails a day, of which 3-5 are good. For six weeks I've turned off all the other filters and I've found that Bayesian gets up to about 90% accuracy.

But then I get 1 or 2 good mails treated as junk, which I then have to retrain as GOOD. The accuracy goes down to 70% (all the GOOD mail is ok but there is a 30% junk mail failure), whereupon I train a lot more mail, accuracy goes back up to 90% and then the whole cycle starts over.

Maybe it's because I have a very high percentage of spam.
bern
 

Postby SFCurley » Wed Jul 28, 2004 3:13 pm

I have spent A LOT of time playing with, analyzing, and testing the Poco Bayesian filters -- even to the point of manually walking through the algorithm with a few sample emails (very tedious exercise). Previously, I had been using POPFile. My pure Poco Bayesian Filter stats are 1.2% false positives, 0.75% false negatives. With the addition of white listing, my false positive rate is probably only 0.2% give or take.

My corpus is very small -- 10,000 good words, 13,000 junk words (or vice versa, I'm not at work right now). I have GoodMail Bias set to 5.0. About 80% of my email is junk. I'm not using any of PM's built-in, out of the box junk filters, although I have implemented a few of my own.
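The false positive and false negative rates above can be turned into one overall accuracy number in two ways, depending on what the percentages are measured against. The post does not say which is meant, so this sketch shows both; the function names are hypothetical and the only inputs are the figures quoted above (1.2%, 0.75%, 80% junk).

```python
def accuracy_if_share_of_all_mail(fp, fn):
    """Assumes each rate is a fraction of ALL mail received."""
    return 1.0 - fp - fn

def accuracy_if_per_class(fp, fn, junk_share):
    """Assumes fp is a fraction of good mail and fn a fraction of junk mail."""
    good_share = 1.0 - junk_share
    return 1.0 - (fp * good_share + fn * junk_share)

fp, fn, junk_share = 0.012, 0.0075, 0.80   # figures quoted in the post

print(f"rates as share of all mail: {accuracy_if_share_of_all_mail(fp, fn):.2%}")
print(f"rates measured per class:   {accuracy_if_per_class(fp, fn, junk_share):.2%}")
```

Under the first reading the result is close to 98%; under the second it is above 99%. So it is worth stating which definition you use when comparing numbers between filters.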

My comments are as follows:

1. I think that my 98% overall accuracy rate is okay, but I agree that it could be better. My experience with POPFile was more like 0.25% false positives, 0.75% to 1.0% false negatives. I can live with the false negatives, but hate the false positives.

2. I think you CAN over-train the PM Bayesian filters. I've tried various experiments, one being training Poco on every email (good or bad) I've received in the last six months, and after having done this the accuracy was atrociously bad. I only train when Poco gets it wrong and as a result have a pretty small corpus. I've read elsewhere that the benefit of this is that you end up with a corpus that contains a higher percentage of true junk/good discriminators and fewer non-discriminators, thus making it tighter and more likely to be on point.

3. My false positives seem to occur mostly with emails that have a lot of new words that Poco hasn't seen before. I wonder if there's not an opportunity for improvement in how the effect of new words is calculated -- either in the algorithm or the implementation thereof.

4. In practicality, I use white listing, then the Bayesian Filters, and if it still looks like junk, I have a challenge/response system. This way, I'm very unlikely to miss any really important emails as the challenge will almost always elicit a response from a "real" sender. That said, aesthetically speaking, I would really like to see the Poco Bayesian filters "be all they can be."
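To make the "train only on errors" idea from comment 2 concrete, here is a toy sketch. It is NOT PocoMail's algorithm (which is not described in this thread); it is a plain word-count naive Bayes filter with add-one smoothing, and every class and function name in it is made up for illustration.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy word-count naive Bayes filter; not PocoMail's actual algorithm."""

    def __init__(self):
        self.words = {"good": Counter(), "junk": Counter()}
        self.messages = {"good": 0, "junk": 0}

    def train(self, text, label):
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def classify(self, text):
        vocab = len(set(self.words["good"]) | set(self.words["junk"])) or 1
        total_msgs = sum(self.messages.values())
        scores = {}
        for label in ("good", "junk"):
            # Add-one smoothed prior and word likelihoods, summed in log space.
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            n = sum(self.words[label].values())
            for w in text.lower().split():
                score += math.log((self.words[label][w] + 1) / (n + vocab))
            scores[label] = score
        # Ties go to "good", so an undecided filter never junks real mail.
        return "junk" if scores["junk"] > scores["good"] else "good"


def train_on_errors(filt, stream):
    """Train only when the filter's guess disagrees with the true label."""
    trained = 0
    for text, label in stream:
        if filt.classify(text) != label:
            filt.train(text, label)
            trained += 1
    return trained


mail = [
    ("cheap pills buy now", "junk"),
    ("meeting notes attached", "good"),
    ("buy cheap pills", "junk"),
]
filt = TinyBayes()
print(train_on_errors(filt, mail), "of", len(mail), "messages were trained")
```

Messages the filter already sorts correctly never touch the word counts, so the corpus stays small and only grows with words that corrected a mistake.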
SFCurley
 

Postby Jim » Thu Jul 29, 2004 2:12 am

Once implemented, the effectiveness of Bayesian filters (not just PSI's) is affected by user action, mostly your training method.

I run only Barca's Bayesian filters (which are identical to PocoMail's). The standard and user-defined filters are turned off. My settings are:

1. ~30,000 bad words and 10,000 good words.
2. Sensitivity level is 'Strict Bayesian'
3. Good mail bias is set to 2

Once I reached the above word counts, I only train on errors. I am running at 99.6% accuracy with 0 false positives.

If you are not getting at least ~98% effectiveness, I would suggest resetting your Bayesian stats, deleting your word database, and starting the training process again.
Jim
 

Postby frazmi » Thu Jul 29, 2004 2:57 am

Jim, could you be a little more specific in how to delete the word database? Is it sufficient to delete DBgood.ini and DBspam.ini? Or is anything else needed?
frazmi
Poco Enthusiast
 
Posts: 248
Joined: Tue Jul 27, 2004 1:27 am
Location: South Korea

Postby Pete » Thu Jul 29, 2004 5:32 am

While you're waiting for Jim's answer, I'll mention that I have done this in the past by (1) resetting the statistics (in Tools > JunkMail Filtering), (2) closing PocoMail, and (3) deleting those two files.
Pete
 

Postby Jim » Thu Jul 29, 2004 7:46 am

frazmi wrote:Jim, could you be a little more specific in how to delete the word database? Is it sufficient to delete DBgood.ini and DBspam.ini? Or is anything else needed?


Thanks Frazmi. Pete's last post answers the question...but a slight addition though...to delete the files, close PocoMail (Barca), navigate to the Mail directory under your main PocoMail (Barca) directory, then delete the files named DBSpam and DBGood.
Jim
 

Postby bern » Thu Jul 29, 2004 12:58 pm

Jim,

Are you saying that, once Bayesian is active and it sorts my mail, I still have to train even if a mail is correctly sorted?

So, for example, if Bayesian puts a mail into the junk folder, I still should classify it as junk until I reach the 30,000 and 10,000 word levels?

Bern
bern
 

Postby SFCurley » Fri Jul 30, 2004 3:20 am

I'll reiterate that in my experimentation over-training led to pretty dismal results. In that vein, the way the current build works is that if any other junkmail filters move an item to the JunkMail folder, then the DBSPAM corpus is automatically updated based on that email, which I think is bad because it can lead to the over-training thing. One thing you might check is whether the other built-in junkmail filters are moving mail to the JunkMail folder, thus leading to over-training.

A question for Slaven: does this only occur when the built-in junkmail filters move an item to JunkMail, or (as I'm guessing) whenever anything moves an item into the junkmail folder?

Second question: is this going to be changed in the next release?

My workaround has been to set up another JunkMail folder (JunkMail2) and have any filter-controlled moves go into that folder, not the "real" JunkMail folder.
SFCurley
 

Postby Jim » Fri Jul 30, 2004 4:15 am

bern wrote:Jim,

Are you saying that, once Bayesian is active and it sorts my mail, I still have to train even if a mail is correctly sorted?
Bern


No, I did not say that. If you reach an acceptable filtering level with fewer or more words than I have, then that is what you should stick with. I did say you should always train on errors though.
Jim
 

Postby Alpha » Fri Jul 30, 2004 7:43 am

SFCurley wrote:I'll reiterate that in my experimentation over-training led to pretty dismal results. In that vein, the way the current build works is that if any other junkmail filters move an item to the JunkMail folder, then the DBSPAM corpus is automatically updated based on that email...


That is not accurate, unless of course what you are saying is the following...The BF databases are only updated/trained during...

1. A manual classification of a message (not residing in Junk or Trash) as Junk using the right-click menu
2. A drag and drop of a message into the Junk folder
3. A filter run, on a message that was already downloaded, that results in the message being transferred into the Junk folder.
Alpha
 

Postby SFCurley » Fri Jul 30, 2004 7:54 am

A question for Slaven: does this only occur when the built-in junkmail filters move an item to JunkMail, or (as I'm guessing) whenever anything moves an item into the junkmail folder?


In essence that was my question: I was under the impression (wrong, I'm guessing from your response) that if any junkmail filter moved an item to the JunkMail folder, the DBSPAM corpus was updated. So, you're saying this is not true, correct?
SFCurley
 

Postby Alastria » Sun Aug 01, 2004 7:38 pm

It took me a few weeks to get the filters working, but things have steadily improved since they have.

One thing that stumped me early on was that I needed to classify at least 30 "good" messages as such, even though they may have been properly filtered. Once I did enough of those, only then did the Bayesian filters kick in.

Right now, I'm up to 93.75% accuracy, and probably > 98% within the past month. And I get a LOT of mail (5500 junk emails within the past two months). So, I'm happy w/ Poco's filters.
Alastria
New Arrival
 
Posts: 2
Joined: Sun Aug 01, 2004 7:30 pm
