Poco Forums • View topic - Junk Mail Filtering Question

Junk Mail Filtering Question

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby SFCurley » Fri Sep 24, 2004 3:14 am

I, too, find the discrepancies in results very curious.

As an aside, I have my threshold set at .99.

Out of curiosity, how many words in your two corpus files?
SFCurley
 

Postby Pete » Fri Sep 24, 2004 4:10 am

I posted that a few posts ago. :)
Pete
 

Postby SFCurley » Fri Sep 24, 2004 4:20 am

So you did -- sorry!

Just looking to see the balance between the good and bad corpi. Given your corpus sizes, you would think -- if anything -- that the error would be the other way (i.e., more false positives). My wife's installation of PM at home is 3:1 in favor of junk words (3,000 good words; 9,000 bad words). Her results are even better than mine -- 99.82%.

I'm stumped.
SFCurley
 

Postby SFCurley » Fri Sep 24, 2004 4:24 am

Another thought: since your results are so mediocre anyway, if you have a large body of spam in your junk or trash folder, why not run all of that through the junk training button -- even if some of it's already been through -- then reset your stats and see if that makes any difference.

You could also back up your two corpus files, just in case your results get even worse. Then you could always go back to your 90% corpi files.
SFCurley
 

Postby Pete » Fri Sep 24, 2004 5:11 am

I haven't kept any spam, so I cannot train the BF like that. I understand the idea behind the suggestion, but of course, people shouldn't have to do that.

Let's hope that my results don't get any worse! At this point, it would be easier to delete each one manually than to use PM's BF. :)
Pete
 

Postby neo » Fri Sep 24, 2004 5:40 am

I want to share my own experience with the Bayesian filters

Until 10 days ago I was a bit disappointed with the Poco BF: I couldn't get more than 91-92% accuracy. Before that I was using K9 at 99%. As I receive nearly 500 spam messages a day, the 8% difference still meant a lot of messages coming through.
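To put the 8% gap in concrete terms, here is a small back-of-the-envelope sketch. It assumes "accuracy" means the fraction of spam that is caught and uses the round figure of 500 spam messages a day; the function name is made up for illustration.

```python
def missed_per_day(spam_per_day: int, accuracy: float) -> int:
    """Estimate how many spam messages slip through each day.

    Assumes accuracy is the fraction of spam caught (an assumption,
    not necessarily how Poco or K9 compute their statistics).
    """
    return round(spam_per_day * (1.0 - accuracy))


# Compare the two filters at roughly 500 spam messages a day.
at_91 = missed_per_day(500, 0.91)
at_99 = missed_per_day(500, 0.99)
print(at_91, at_99, at_91 - at_99)
```

Under those assumptions, the lower accuracy leaves dozens more spam messages in the inbox every day.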

So, I decided to make some experiments and now I am at 98.7% (I can live with that) but it keeps improving.

My conclusion is that accuracy goes down when the BF is "overtrained".

This is what I did to get to 98.7% with the Poco BF. It may be a little tedious during the first days, but it really worked for me.

Good mail bias set to 1
Junk mail score set to 20
Custom sensitivity set to the lowest, so that the filter does not automatically move messages to the junk folder.

I trained the BF with very few messages (spam and not spam) just to get the 1000 words needed for the BF to work.

I went back to my K9-Poco configuration that gave me 99%

I created a folder named Junk K9
I created a folder named Known

I made my download filters work this way:

1) Run Junk mail filters
2) If junk score is 20 mark the message with a colour
3) If the from header is in the address book or my exceptsenders file move the message to the known folder
4) If junk score is 20 move the message to the junk folder
5) If the message is marked as spam by K9, move it to the Junk K9 folder
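The rule order above can be sketched roughly as follows. This is only an illustration of the logic, not Poco's actual filter engine: the `Message` fields, the `route` function, and reading "junk score is 20" as "score of at least 20" are all assumptions.

```python
from dataclasses import dataclass
from typing import Set

JUNK_SCORE = 20  # the junk mail score setting mentioned above


@dataclass
class Message:
    sender: str
    junk_score: int      # score assigned by the Poco junk filter (rule 1)
    k9_spam: bool        # True if K9 marked the message as spam
    coloured: bool = False


def route(msg: Message, known_senders: Set[str]) -> str:
    """Return the folder a message ends up in, applying the rules in order."""
    # Rule 2: colour anything the Poco filter scores as junk.
    if msg.junk_score >= JUNK_SCORE:
        msg.coloured = True
    # Rule 3: known senders go to Known, even if coloured (false positives).
    if msg.sender in known_senders:
        return "Known"
    # Rule 4: junk caught by Poco.
    if msg.junk_score >= JUNK_SCORE:
        return "Junk"
    # Rule 5: junk missed by Poco but caught by K9.
    if msg.k9_spam:
        return "Junk K9"
    # Anything left stays in the inbox.
    return "In"
```

Because the known-sender rule runs before the junk rule, a coloured message in Known is a false positive, and anything in Junk K9 is spam that Poco missed.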

So the messages end up this way:

In the Known folder, the false positives are coloured and easily identifiable.

In Junk K9 is the spam not recognized by the Poco BF.

In the IN folder is the spam not recognized by either K9 or Poco.

Now the hard job:

Once a day I go to the Junk K9 folder (I had nearly 300 messages the first time)

1) I classify only the first message as spam moving it to the Poco junk folder
2) Then I select all the messages and use the option "Tag junk messages"
3) I then delete all the tagged messages
4) Back to 1) until there is no spam message left.
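The loop above amounts to training on one missed message at a time and letting the filter clear out everything it can now recognize. A minimal sketch of the idea, using a toy word-matching filter as a stand-in (it is not a real Bayesian filter, and `ToyFilter` and `train_on_misses` are hypothetical names):

```python
from typing import List, Set


class ToyFilter:
    """Stand-in for the BF: remembers words seen in trained junk."""

    def __init__(self) -> None:
        self.junk_words: Set[str] = set()

    def train_junk(self, message: str) -> None:
        self.junk_words.update(message.lower().split())

    def is_junk(self, message: str) -> bool:
        words = message.lower().split()
        if not words:
            return False
        hits = sum(1 for w in words if w in self.junk_words)
        # Toy rule: junk if at least half the words are known junk words.
        return hits / len(words) >= 0.5


def train_on_misses(flt: ToyFilter, missed: List[str]) -> int:
    """Train on one missed spam at a time; return how many were trained."""
    remaining = list(missed)
    trained = 0
    while remaining:
        # Step 1: classify only the first message as spam.
        flt.train_junk(remaining.pop(0))
        trained += 1
        # Steps 2-3: tag what the filter now recognizes and delete it.
        remaining = [m for m in remaining if not flt.is_junk(m)]
    return trained


missed = [
    "cheap pills online",
    "cheap pills now",
    "win free lottery",
    "free lottery win today",
]
print(train_on_misses(ToyFilter(), missed))
```

In this toy run only two of the four messages need to be trained by hand; the other two are caught after each training step, which is the point of the method: the filter learns from as few messages as necessary.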

I do the same with the false positives after that.

Doing this you get the BF trained with the exact number of words.

It requires some time at the beginning, but I have gotten really great results.

Now I'm back to using the Poco BF only (without K9).

I hope this helps, and sorry for any grammar mistakes, as English is not my native language.

Neo
Last edited by neo on Fri Sep 24, 2004 7:22 am, edited 1 time in total.
neo
Poco Tourist
 
Posts: 49
Joined: Mon Jul 26, 2004 12:11 am

Postby Pete » Fri Sep 24, 2004 6:11 am

Hi Neo, that's interesting.

When you say, "false negative", do you mean "false positive" (one that PM incorrectly marked as spam)?

---------------------------------------------------------------------------------

I'm beginning to suspect more and more that the problem is that PM needs A LOT of training before it works. Neo receives 500 per day (!) and SFCurley wrote (I think) that he trained the BF with batches of spam messages.

On the other hand, I (and perhaps the average user?) only train the BF one message at a time.

Also, like Neo and SFCurley, I only train on errors.
Pete
 

Postby neo » Fri Sep 24, 2004 7:19 am

Pete wrote:Hi Neo, that's interesting.

When you say, "false negative", do you mean "false positive" (one that PM incorrectly marked as spam)?.


Sorry, my mistake, I meant false positives :oops:
I have now edited the post.

Pete wrote:
Neo receives 500 per day (!)



Some of my addresses are almost 9 years old, so they are in every mail database that exists! :(
Last edited by neo on Fri Sep 24, 2004 7:24 am, edited 1 time in total.
neo
Poco Tourist
 
Posts: 49
Joined: Mon Jul 26, 2004 12:11 am

Postby SFCurley » Fri Sep 24, 2004 7:20 am

Wow! I tried to follow Neo's description and hope that it won't take quite that much for most users.

I had been training one at a time, but found it was taking too long, too, so I did switch to the batch-training to just get it over with.

I, too, think in Poco's case -- at least initially -- more training is needed rather than less.
SFCurley
 

Postby neo » Fri Sep 24, 2004 7:31 am

I did batch training in the beginning, but I couldn't get more than 92%.

I ran a lot of tests, and the last method was the best for me, but I agree that it's a little time-consuming. :shock:

Neo
neo
Poco Tourist
 
Posts: 49
Joined: Mon Jul 26, 2004 12:11 am

Postby J-Mac » Fri Sep 24, 2004 9:33 am

Well guys, I'm still training the BF and I'm sitting at 83.77 % with 14,105 junk words and 23,691 good words in the corpi. Three times as many junk messages have been missed as caught!

And I've been scrupulous with training it properly.

I'm getting very disappointed again. I can almost get these results randomly.

Sounds like everyone invests an awful lot of time trying to train the BFs, and the results are so varied as to be ridiculous.

I really don't think much time or effort is put into the filters by PSI. You all are getting reasonable results - not great but reasonable - yet most posts I read don't show nearly as high a success rate.

Looks like I'll be getting K9 or Popfile. :evil:
J-Mac
J-Mac
Poco Enthusiast
 
Posts: 356
Joined: Wed Jul 28, 2004 9:54 pm
Location: The Great State of Pennsylvania, in the Merry Old Land of Oz!

Postby neo » Fri Sep 24, 2004 9:47 am

J-Mac wrote:Well guys, I'm still training the BF and I'm sitting at 83.77 % with 14,105 junk words and 23,691 good words in the corpi.


I guess everybody's configuration is quite different. I'm over 98% and my corpi are:

Spam: 11001 words
Good: 2627 words

Neo
neo
Poco Tourist
 
Posts: 49
Joined: Mon Jul 26, 2004 12:11 am

Postby Jim » Fri Sep 24, 2004 12:51 pm

As I pointed out on another thread, we will look into this. Although we think our filters are very effective, the level of activity on the Junk Mail Filtering forum implies the need for additional work on our part. So J-Mac, hold tight!

Please send any suggestions to me via email. Thanks.
Jim
 

Postby J-Mac » Sat Sep 25, 2004 5:13 am

Don't worry, Jim, I'm not giving up! But I'm going nuts with spam here.

I'm going to give K9 a try, but I'll be right back to the PM BF when it's updated to give it another shot!
J-Mac
J-Mac
Poco Enthusiast
 
Posts: 356
Joined: Wed Jul 28, 2004 9:54 pm
Location: The Great State of Pennsylvania, in the Merry Old Land of Oz!
