Poco Forums • View topic - Junk Mail Filtering Question

Junk Mail Filtering Question

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby J-Mac » Mon Sep 06, 2004 3:15 pm

I'm using Barca with automatic Junk Mail Filtering enabled and sensitivity set to "High". I've also enabled "Run learning Bayesian filters".

The junk mail filtering has generally been extremely accurate. Current stats show 89.22% accuracy with 1.64 false positives, which is a MUCH better batting average than Sunbelt Software's iHateSpam program running on Outlook 2002 - my previous set-up.

But every now and then a great deal of spam gets through all of a sudden - as if the spammers are using a new trick to get by the filters. This has been occurring in the last few days with an alarming degree of success. For the spammers, that is.

Most troubling is the analysis offered by the Junk Filter dialog when I tried to train it for some new junk mail that got through, and then hit the Test button:

Message "Sweetheart wants a Rolex ?-webmaster extremum inaccessible cranky" would NOT be considered junk mail at any sensitivity. [XPS5]

+3 [X-MAILER=] (X-Mailer)
+2 [FROM=%ADDRESSBOOKS%] (From %Addressbooks%)
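
For readers following along, here is a minimal Python sketch of how the score-detail lines above add up. It is only an illustration, not Poco's actual scoring code: the regular expression `DETAIL_RE` and the `tally` helper are hypothetical names, and the line format is inferred solely from the two lines shown.

```python
import re

# Score-detail lines copied from the Test output above.
DETAIL_LINES = [
    "+3 [X-MAILER=] (X-Mailer)",
    "+2 [FROM=%ADDRESSBOOKS%] (From %Addressbooks%)",
]

# Hypothetical pattern: signed points, the rule in brackets, a label in parentheses.
DETAIL_RE = re.compile(r"^([+-]\d+)\s+\[(.*?)\]\s+\((.*?)\)\s*$")


def tally(lines):
    """Return (total, [(points, rule, label), ...]) for the given detail lines."""
    hits = []
    for line in lines:
        match = DETAIL_RE.match(line.strip())
        if match:
            points, rule, label = match.groups()
            hits.append((int(points), rule, label.strip()))
    return sum(points for points, _, _ in hits), hits


total, hits = tally(DETAIL_LINES)
print(total)  # 5
```

The total of 5 agrees with the X-Poco-Scored header in the pasted message. What each rule actually tests for is not visible from these lines alone.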


I immediately checked my address book to see if the From addresses were getting added, but they're not. I checked the full header and it is not spoofed to show it's sent from me.

The message with full headers showing is pasted below, with my pertinent info (e-mail address and IP info) changed and in red, as well as hyperlinks disabled. Nothing offensive, but if this isn't allowed, Moderator please feel free to remove or edit as necessary.

A lot of the recent spam is getting through and being scored this way. What are they doing to trick the filters into thinking the From address is in my address book, and has anyone found a way to thwart this?

Thanks!



From <xhuaaqdma@worldnet.att.net> Mon, 06 Sep 2004 18:13:14 -0700
From: "Rolex is forever." <xhuaaqdma@worldnet.att.net>
To: <---One of my e-mail addresses--->
Return-path: <xhuaaqdma@worldnet.att.net>
Envelope-to: <---One of my e-mail addresses--->
Delivery-date: Mon, 06 Sep 2004 20:11:20 -0400
Received: from [XX.XXX.XXX.XX] (helo=h00e0185d5f47.ne.client2.attbi.com)
by serve.<---My Server--->.net with smtp (Exim 4.34)
id 1C4TZo-0002xh-EU
for <---One of my e-mail addresses--->; Mon, 06 Sep 2004 20:11:19 -0400
X-Message-Info: 1mrhg89kZMZ/jdzFHuyrJwTFDcCP4Odt
Received: from tanh (XXX.XXX.XXX.XX)
by kpm2.planetaria.breakwater.melanin.fuse.net
(InterMail vY.6.74.24.98 07-12829-0-332-667-1887923) with ESMTP
id <209826740.AWVV5799.pktq8-mail.downfall.tommy.net.cable.rogers.com@greenhouse>
for <---One of my e-mail addresses--->; Tue, 07 Sep 2004 00:13:14 -0100
Message-ID: <28210it78846hlxjj$51123496qh818$263tc35vvl525@bell>
Reply-To: "Rolex is forever." <xhuaaqdma@worldnet.att.net>
Date: Mon, 06 Sep 2004 18:13:14 -0700
X-Antivirus: avast! (VPS 0436-4, 09/03/2004), Inbound message
X-Antivirus-Status: Clean
Delivery-Date: Mon, 6 Sep 2004 20:32:36
Status: U
X-Poco-Score-Detail: +3 [X-MAILER=] (X-Mailer )
X-Poco-Score-Detail: +2 [FROM=%ADDRESSBOOKS%] (From %addressbooks%)
X-Poco-Scored: +5
Subject: Sweetheart wants a Rolex ?-Webmaster extremum inaccessible cranky
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Poco-UID: XXXXXXXX
X-Poco-Status: R
X-Account: <---Account name for this e-mail address--->






Hello,
We all want to wear SWISS WATCHS,
they are expensive-we all know that,
Now we have effordable Replica's--

Rolex
--------------from $99 !!

also available :
=================
CARTIER
FRANK MULLER
Jager-LeCoultre
OMEGA
PATEK PHILIPE
=================
AND MORE
.http://itsmyreplica.info/index.php?ref=hp

Italian Crafted Rolex - Complete Watch Store
Reliable Service and Support

Check Here For More Information

.http://itsmyreplica.info/index.php?ref=hp


Regards
Francisco Guevara

-----------





Any help or advice is greatly appreciated!
J-Mac
J-Mac
Poco Enthusiast
 
Posts: 356
Joined: Wed Jul 28, 2004 9:54 pm
Location: The Great State of Pennsylvania, in the Merry Old Land of Oz!

Postby Eric » Wed Sep 08, 2004 11:09 am

Hi J-Mac,
A lot of topics about Junk Mail filtering can be found to help you tweak your settings even further. Try doing a search on Junk or Bayesian.
J-Mac wrote:But every now and then a great deal of spam gets through all of a sudden - as if the spammers are using a new trick to get by the filters.
You're correct in saying that spammers always use new tricks, but if your Bayesian Filter is tweaked and has learned enough words, then the spam will probably be stopped.

However, a new technique is being used: you are sent a message with a subject and a greeting, but the attachment is a picture promoting Viagra and other products.
How can you stop that? :?
Eric
 

Postby J-Mac » Wed Sep 08, 2004 12:09 pm

Thanks Eric.

I did go through all the threads that I could find and I subsequently deleted my DBGood.ini and DBSpam.ini files, reset statistics, and turned off the standard non-bayesian filters. Now I have it learning from every e-mail I receive.

It's up to 2,104 junk words and 15,433 good words, in one and a half days. Right now it's not doing well, but learning, I hope! Caught one junk mail and missed 7, with 6 false positives. (Though I haven't seen any false positives??) That makes the very early stats 82.35% accuracy and 20.59% false positives. But it has been less than two days, and I understand that I need about 30,000 words before it kicks in.
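
A quick arithmetic check on those early stats, offered as a guess rather than a description of how Poco computes them: both percentages are consistent with 34 messages processed since the reset. The total of 34 and the way the counts are mapped onto each statistic are assumptions made only to reproduce the quoted figures.

```python
# Figures quoted in the post above.
reported_accuracy = 82.35    # percent
reported_false_pos = 20.59   # percent

# Assumption: both percentages are simple ratios over the same message total.
total = 34                   # hypothetical total that reproduces both figures

accuracy = round(100 * (total - 6) / total, 2)   # 28 of 34
false_pos = round(100 * 7 / total, 2)            # 7 of 34

print(accuracy, false_pos)  # 82.35 20.59
```

Under this reading the 6 and the 7 land on the opposite statistics from the ones named in the post, so treat it only as a consistency check on the numbers.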

I have the Junk Threshold set to .99 and the good mail bias set to 3.0, as I've read in the forum. This automatically sets my Junk Score to 100 and Good Score to -100. Can that be right?

As for my original post, I'm just wondering how the filter calculates in points based on the %addressbooks% value when it most definitely is NOT in my address book. Or by adding two points is it saying that it is not in the address book?

I apparently still have some learning to do on BFs!

Thanks Eric.
J-Mac

Postby J-Mac » Sat Sep 11, 2004 3:11 pm

Well, this Bayesian experiment didn't work. Accuracy dropped steadily to 64%

I've now switched back to regular non-bayesian filters. Still getting more junk than ever.

Is there a way to just reset all back to what it was? 88% was a lot better than I thought, apparently. I followed all the advice on all the bayesian threads I could find, and I'm not sure how long it should take, but I don't think it's supposed to get increasingly worse before getting better.
J-Mac

Postby Eric » Sat Sep 11, 2004 11:38 pm

J-Mac wrote:Well, this Bayesian experiment didn't work. Accuracy dropped steadily to 64%
I've now switched back to regular non-bayesian filters. Still getting more junk than ever.

Can't complain using only Strict Bayesian. Currently running at 99.15%. :)

J-Mac wrote:Is there a way to just reset all back to what it was? 88% was a lot better than I thought, apparently. I followed all the advice on all the bayesian threads I could find, and I'm not sure how long it should take, but I don't think it's supposed to get increasingly worse before getting better.

What I did was let it learn before resetting the statistics. Afterwards I was running at 34%, but the junk filtering got better from day to day. Touching wood now that it will stay this way.
Take a look at this topic for resetting the databases, and/or press Reset Defaults in Junk Mail Filtering | Bayesian.
Eric
 

Postby J-Mac » Sun Sep 12, 2004 3:00 am

I don't know, maybe I just don't have the patience, Eric.

Over a couple hundred messages only one junk message was caught. Of the many dozens that were missed, I tested all and the test message said, in all cases, that the message would not be considered junk at any sensitivity. And the accuracy rate was dropping every day. Something just didn't seem right!

If you happened to have read my posts in other "Bayesian"-related threads you can see my settings. The only suspect setting was that when I click "Strict Bayesian", the scoring scale looks weird: Junk = 100, Good = -100. When I reset to the default, Junk = 20, Good = 0. Big difference! But Strict Bayesian does that, at least on my machine! I guess I could have changed the scale values manually, but I was trying to stick with the Strict B. settings.

I also had the Junk Threshold set at .99 and the Good Mail Bias set at 3.0. Default now has it at Junk = .90 and Good Bias = 2.0.

It just seemed that it should have slowly started to increase in accuracy, not the other way around.
J-Mac

Postby SFCurley » Wed Sep 15, 2004 3:38 am

Hi,

The real question on whether Bayesian by itself is working has less to do with the -100/+100 and more with the probability number that PM's BFs calculate based on your corpus and the email itself.

As I'm gathering you read in the other forum postings, I am running at Threshold=.99, GoodBias=3.0, with good and bad corpora both at about 26k words, and am getting very good results.

A couple of suggestions:

1) On the address book question, you can create a separate non-Bayesian filter that says "If From contains either %addressbooks% or %exceptsenders%, then stop processing this message", and make that your FIRST filter. This will whitelist everyone in your address book and your safe senders list.

2) On the accuracy numbers, if you have a supply of old emails (both spam and ham), you can jump-start the training process by running them through the Good/Bad training buttons (in separate batches, of course) to build up the corpus more quickly. I did this and started out of the gate with 18k words in each. However, in a separate trial of the BFs, I did it incrementally and got 80%+ results right away with steady improvement.
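
Suggestion 1 amounts to putting a whitelist check at the front of an ordered filter chain. The following Python sketch shows the idea only. It is not how PocoMail implements filters, and every name in it (`Message`, `make_whitelist_filter`, `toy_junk_filter`, `classify`, the example addresses) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Message:
    sender: str
    subject: str


# A filter returns a verdict string to stop processing, or None to fall through.
Filter = Callable[[Message], Optional[str]]


def make_whitelist_filter(address_book: Set[str], safe_senders: Set[str]) -> Filter:
    """Hypothetical stand-in for the %addressbooks% / %exceptsenders% rule."""
    known = {a.lower() for a in address_book | safe_senders}

    def whitelist(msg: Message) -> Optional[str]:
        return "good" if msg.sender.lower() in known else None

    return whitelist


def toy_junk_filter(msg: Message) -> Optional[str]:
    """Placeholder for the Bayesian step; flags one keyword only."""
    return "junk" if "rolex" in msg.subject.lower() else None


def classify(msg: Message, filters: List[Filter]) -> str:
    # Filters run in order; the first one that returns a verdict wins.
    for f in filters:
        verdict = f(msg)
        if verdict is not None:
            return verdict
    return "good"


filters = [make_whitelist_filter({"friend@example.com"}, set()), toy_junk_filter]
print(classify(Message("friend@example.com", "My new Rolex"), filters))  # good
print(classify(Message("stranger@example.net", "Sweetheart wants a Rolex"), filters))  # junk
```

Because the whitelist runs first and stops processing, a known sender is never handed to the later junk check, which is the point of making it the FIRST filter.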

More generally, though, I think PM has pretty good Bayesian capability, but what I'm finding both interesting and curious is the huge range of experiences that people are having: some, like me, are having very good experiences (I get about 100 emails a day and am running at 98.5%), while others are reporting abysmally bad results. Having been a big reader of the POPFile (another BF) forum, I did not see nearly as large a range of experiences there, although most of the people in that forum were really into the nuts and bolts of Bayesian filtering and the tweaks.

What is apparent to me, though, is that Poco's huge flexibility definitely offers the opportunity for confusion. That said, I like the flexibility (but I'm getting good bayesian results, so it's easy for me to say that).
SFCurley
 

Postby Pete » Wed Sep 15, 2004 2:06 pm

I've considered changing some of my BF settings to what you use, SFCurley, but then I don't because I ask myself, if your settings are better, why doesn't PSI make them the default settings? In particular, I continue to use a "Junk Threshold" of 90% and a "Good Mail Bias" of 2.0. I can't help but wonder, though, if the default settings are not optimal.
Pete
 

Postby J-Mac » Wed Sep 15, 2004 3:23 pm

Pete,

Are you using Bayesian alone? With the Non-Bayesian filtering turned off?

And what kind of stats are you getting, if I may ask?

Thanks!
J-Mac

Postby J-Mac » Wed Sep 15, 2004 3:52 pm

Thanks for the reply, SFCurley.

Are these the correct settings for running just Bayesian?

Top of Junk Mail Filtering box:
Enable automatic Junk Mail Filtering - Checked and set to "High Sensitivity"

General Settings:
Run Automatic.... - After downloading...
Filter Settings: Not checked

Bayesian
Run learning Bayesian Filters: Checked
Junk Threshold: 0.99
Good Mail Bias: 3.0
After Classifying a message...: Junk Score: 100, Good Score: -100


Of course, once you click on the "Strict Bayesian" button, the threshold and bias reset to 0.90 and 2.0, so I changed them back.

I presently have a recent and accurate corpus of 7,755 junk words and 23,179 good words, all collected in the last few days. I collected this corpus by manually classifying all eyeball-checked good messages as "Good", and manually classifying all junk messages as "Junk", even if they are already caught and sent to the Junk Mailbox. Is this the correct protocol? Should I continue to do this ad infinitum? Or stop manually classifying them after at least 18K of each are collected?

Are the above settings correct? Good enough to start a new test run? Or should the top "Enable Automatic..." box be unchecked? (I'm not sure if this disables ALL filters, including Bayesian, or just the non-Bay. I currently have it selected). I'm presently getting between 90 and 120 messages a day, not counting about an equal amount that are trapped at my ISP's (Comcast) server and auto-deleted by their BrightMail spam filters.

Thanks for all your assistance!
J-Mac

Postby SFCurley » Thu Sep 16, 2004 6:03 am

J-Mac,

J-Mac wrote:
Top of Junk Mail Filtering box:
Enable automatic Junk Mail Filtering - Checked and set to "High Sensitivity"

General Settings:
Run Automatic.... - After downloading...
Filter Settings: Not checked

Bayesian
Run learning Bayesian Filters: Checked
Junk Threshold: 0.99
Good Mail Bias: 3.0
After Classifying a message...: Junk Score: 100, Good Score: -100


Yes, these are exactly how I have mine now set.

J-Mac wrote:I presently have a recent and accurate corpus of 7,755 junk words and 23,179 good words, all collected in the last few days. I collected this corpus by manually classifying all eyeball-checked good messages as "Good", and manually classifying all junk messages as "Junk", even if they are already caught and sent to the Junk Mailbox. Is this the correct protocol? Should I continue to do this ad infinitum? Or stop manually classifying them after at least 18K of each are collected?



What you have described is similar to what I did. First of all, I classified/trained on every good mail I had. This resulted in a good word corpus of about 18k words. Then I processed large batches of junk emails through the junk filter until I had a junk corpus of about the same size (18k words). THEN I selectively did an apply/test on a bunch of other junk mails, and only trained on the ones where PM's BFs got it wrong. Did that for a bunch, but don't know how many. As an experiment, I am now updating just the good corpus with any new emails at the end of each week . . . just to expand the number of good words the BFs have seen.
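
The "only train on the ones it got wrong" step can be sketched in Python as below. This is a toy under stated assumptions: `ToyFilter` is a made-up word-count classifier standing in for a real Bayesian filter, and `train_on_error` and the sample messages are hypothetical, not part of PocoMail.

```python
from collections import Counter


class ToyFilter:
    """Tiny word-count classifier, a stand-in for a real Bayesian filter."""

    def __init__(self):
        self.words = {"junk": Counter(), "good": Counter()}

    def train(self, text, label):
        self.words[label].update(text.lower().split())

    def predict(self, text):
        tokens = text.lower().split()
        junk = sum(self.words["junk"][t] for t in tokens)
        good = sum(self.words["good"][t] for t in tokens)
        return "junk" if junk > good else "good"


def train_on_error(flt, labeled_messages):
    """Train only on messages the filter currently gets wrong; return how many."""
    corrections = 0
    for text, label in labeled_messages:
        if flt.predict(text) != label:
            flt.train(text, label)
            corrections += 1
    return corrections


flt = ToyFilter()
flt.train("meeting agenda for monday", "good")       # initial bulk training
flt.train("cheap replica watches from $99", "junk")

batch = [
    ("replica watches cheap", "junk"),   # already caught: no training
    ("swiss rolex store", "junk"),       # missed: gets trained
    ("monday meeting moved", "good"),    # already right: no training
]
print(train_on_error(flt, batch))  # 1
```

Only the missed message is added to the corpus, so the word counts grow where the filter is weakest instead of being swamped by messages it already handles.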

One thing that Jim has suggested is that the two corpus files be similarly sized. It seems to me that if the probabilities are normalized, it shouldn't matter, but none-the-less, mine right now just happen to both be about 26k words. At home, my wife's email account corpus sizes are very different, 3k and 9k words, and she's getting 99.75% accuracy.

J-Mac wrote:Are the above settings correct? Good enough to start a new test run? Or should the top "Enable Automatic..." box be unchecked? (I'm not sure if this disables ALL filters, including Bayesian, or just the non-Bay. I currently have it selected).


I believe they are as they should be.

*****

Pete,

On the Bias of 3.0 and the Threshold of .99, my reasoning is that I am less concerned about false negatives and more concerned about false positives. By increasing the Bias and the Threshold, I make false positives less likely, but false negatives more likely. In general, this is the way I would prefer to err. Build up a big enough junk corpus, and false negatives will probably not be too much of a problem.
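
To make the trade-off concrete, here is a small Python sketch of a filter in the style of Paul Graham's "A Plan for Spam", where good-word counts are multiplied by a bias and a message is junk only if its combined probability reaches a threshold. Whether PocoMail computes things this way is an assumption; the function names, the 0.4 value for unseen words, and the sample counts are all illustrative.

```python
from math import prod


def token_junk_prob(junk_count, good_count, n_junk, n_good, good_bias):
    """Graham-style per-word junk probability; good counts are multiplied by good_bias."""
    junk_freq = min(1.0, junk_count / n_junk)
    good_freq = min(1.0, good_bias * good_count / n_good)
    if junk_freq + good_freq == 0:
        return 0.4  # neutral-ish value for unseen words (an assumption)
    return max(0.01, min(0.99, junk_freq / (junk_freq + good_freq)))


def message_junk_prob(probs):
    """Combine per-word probabilities with the naive Bayes product rule."""
    p = prod(probs)
    q = prod(1.0 - x for x in probs)
    return p / (p + q)


def is_junk(word_counts, n_junk, n_good, threshold, good_bias):
    probs = [token_junk_prob(j, g, n_junk, n_good, good_bias) for j, g in word_counts]
    return message_junk_prob(probs) >= threshold


# Hypothetical (junk_count, good_count) pairs for the words of one message,
# with 100 junk and 100 good messages in the corpora.
counts = [(30, 2), (20, 3), (5, 5)]

print(is_junk(counts, 100, 100, threshold=0.90, good_bias=2.0))  # True
print(is_junk(counts, 100, 100, threshold=0.99, good_bias=3.0))  # False
```

With these made-up counts the same message scores about 0.93 at the default-like settings and about 0.79 at the stricter ones, so it is flagged in the first case and let through in the second: fewer false positives at the cost of more false negatives.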

As far as what is optimal, I don't know personally. I know POPFile out of the box uses an approach that amounts to a 0.99 threshold, although some set it as high as .999, believe it or not. Don't know if there is a good word bias built into POPFile or not. In my experience, in most cases, GoodBias doesn't make a huge difference in actual results . . . it changes the outcome of an email's classification only every so often.
SFCurley
 

Postby SFCurley » Thu Sep 16, 2004 6:13 am

One other thing that may impact my results a bit. . .

I'm actually somewhat selective in what I classify as good. For example, I have a filter that intercepts any newsletters I receive and moves them to a folder. I never check to see if they are classified correctly or not. Many newsletters probably have a spammy-feel to them. Classify a lot of them as good, and it probably won't help the BFs in its accuracy. None-the-less, over a large enough sampling, Poco's BFs WILL learn to differentiate even these kinds of things, so it shouldn't have too much of a long-term impact on the accuracy.
SFCurley
 

Postby Pete » Thu Sep 16, 2004 1:25 pm

J-Mac wrote:Are you using Bayesian alone? With the Non-Bayesian filtering turned off?

And what kind of stats are you getting, if I may ask?

Yes, Bayesian alone. I disabled the non-BF setting.

I started using the BF about three months ago. Since the day that I reached the 1,000-word threshold for both corpora, I only add to the corpora when the BF makes a mistake (as per SFCurley's advice).

Here are my current stats:

22,000 junk words
4,000 good words
90.26% accuracy
421 false negatives
3 false positives


SFCurley, thanks a lot for the explanation about the "junk threshold" and "good mail bias". I'll probably leave mine at the defaults (90% and 2.0) since, as you can see, I don't have a big problem with false positives (and emails from people in my address book never go into the junk mailbox anyway (but I do run them through the B filter first and change the marking color if they are false positives so that I know that I must update the good corpus)).

Hmm, considering the high number of false negatives I get, I wonder if I should take the opposite approach. For example, change the good bias to 1.0 and/or change the threshold to 80%? That sounds a little scary, and it almost seems more like guesswork than science.

I think that I'll just wait a few more months and then if I'm still getting so many false negatives, then I'll experiment with the settings.
Pete
 

Postby J-Mac » Thu Sep 16, 2004 1:44 pm

SFCurley & Pete:

Thanks a bunch!!

I'm on the Bayesian road again!
J-Mac

Postby Pete » Fri Sep 24, 2004 3:09 am

One of the things that is frustrating is that, in almost every case, I can tell if an email is spam simply by looking at the subject, but PocoMail misses so many of them even after looking at the subject, body, and headers. I realize that a computer algorithm is no match for a human eye, but I hope that PSI will be able to find ways to improve this.

In an attempt to help PocoMail, I've just changed my junk threshold from 90% to 80%. I'll post again if that causes any interesting behavior. (The good bias is still at the default of 2.0.)

This seems like the right thing to do in my situation. I'm perplexed, however, that SFCurley has 98%+ accuracy with his settings yet I'm at 90% accuracy and I'm moving my junk threshold in the opposite direction. :?
Pete
 
