Poco Forums • View topic - Example of spam that PocoMail missed

Example of spam that PocoMail missed

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby Pete » Fri Sep 24, 2004 3:58 am

After three months of training PocoMail's BF, and after moving the junk-threshold setting down to 80%, I just received this email that PM missed:

Subject: Group Sex!

(Content-Type: text/html; charset="windows-1251")

Body:
She survived to take a huge splat job!


XXX Hustler Movies !


QaJApCCkjMCCbevXhzHQuiWpiWMvTQN



I realize that there are "common everyday" words in the body, and one big garbage word, but what about these:

in DBGOOD.INI:

SUBJECT-group=2
SUBJECT-sex=0
survived=1
huge=2
splat=0
job=3
xxx=0
hustler=0

in DBSPAM.INI:

SUBJECT-group=1
SUBJECT-sex=3
survived=0
huge=9
splat=0
job=2
xxx=5
hustler=2


I moved the junk threshold down to 50% and the good-word bias down to 1.0. When I clicked on "Apply and Test", PocoMail still said that the message would not be considered junk.

Something is wrong here.
Pete
 

Postby Pete » Fri Sep 24, 2004 6:40 am

  1. Based on a previous discussion between vamp07, SFCurley and others, I wonder if the BF ignores words in DBSpam.ini if their count is less than some number. If so, then I think that this is very wrong. At this rate, it could take years before my word counts are high enough to do any good. :?:
  2. I do a similar thing to what Neo does. I let the BF score all incoming messages, but I don't let it move them to the junk mailbox. I use a PocoScript SaveMessage command instead.

    Does anyone know if the BF automatically updates DBSpam.ini if you let the BF move messages to the junk mailbox? If true, then this might be a factor in my problem. If false, then I think that my first bullet is a more likely cause.
Pete
 

Postby Jim » Fri Sep 24, 2004 7:51 am

Pete wrote:
  1. Based on a previous discussion between vamp07, SFCurley and others, I wonder if the BF ignores words in DBSpam.ini if their count is less than some number. If so, then I think that this is very wrong. At this rate, it could take years before my word counts are high enough to do any good. :?:
  2. I do a similar thing to what Neo does. I let the BF score all incoming messages, but I don't let it move them to the junk mailbox. I use a PocoScript SaveMessage command instead.

    Does anyone know if the BF automatically updates DBSpam.ini if you let the BF move messages to the junk mailbox? If true, then this might be a factor in my problem. If false, then I think that my first bullet is a more likely cause.


We will look at it.

On 1, yes, there is a threshold before a word is treated "normally". The word is not ignored but rather treated like a word with zero occurrences.
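A minimal sketch of that threshold rule may help. The cutoff value and the function name are assumptions for illustration only; the thread does not say what number PocoMail actually uses. The counts are the DBSPAM.INI values Pete quoted.

```python
# Hypothetical cutoff: PocoMail's real threshold is not stated in this thread.
MIN_OCCURRENCES = 5

def effective_count(raw_count, min_occurrences=MIN_OCCURRENCES):
    """Return the count the filter would act on.

    Below the cutoff the word is not dropped; it is scored as if it had
    been seen zero times.
    """
    return raw_count if raw_count >= min_occurrences else 0

# Counts copied from the DBSPAM.INI excerpt in the first post.
spam_counts = {"SUBJECT-sex": 3, "huge": 9, "xxx": 5, "hustler": 2}
effective = {word: effective_count(n) for word, n in spam_counts.items()}
```

Under this assumed cutoff, only "huge" and "xxx" keep their counts, which would be consistent with the message scoring lower than the raw numbers suggest.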

On 2, if the BF filters move a message into Junk, then there is no need to train itself again since it already knows that the message is junk.

As an aside, I don't think you can make the conclusion that is implicit in your post. For example, the words Sex and Group may not be significant in your corpora when compared to other words. You should simply let your filters work under normal conditions, train them with that message and check the result.
Jim
 

Postby Pete » Fri Sep 24, 2004 9:20 am

Jim wrote: On 2, if the BF filters move a message into Junk, then there is no need to train itself again since it already knows that the message is junk.

Duh, Pete! Thanks for the explanation, Jim!


Jim wrote: As an aside, I don't think you can make the conclusion that is implicit in your post. For example, the words Sex and Group may not be significant in your corpora when compared to other words.

I understand that there are other factors and other words besides the small subset of words that I examined.


Jim wrote: We will look at it.

Thanks, Jim. I can send PSI any relevant data.
Pete
 

Postby Jim » Fri Sep 24, 2004 12:39 pm

Actually, Pete, could you please send that message to me and, in a separate email, to test1 AT pocomail DOT com? Sorry for the extra step. Thanks in advance.
Jim
 

Postby Pete » Sat Sep 25, 2004 3:55 am

Okay, I've sent the data.
Pete
 

Postby Pete » Sat Sep 25, 2004 4:17 am

Given that I know almost nothing about Bayesian filtering, I have some questions/thoughts:

  • Is it necessary to have the 1,000-word minimum for each corpus before the BF is active AND have the minimum-occurrence count for each word? Speaking out of ignorance, it seems redundant.

    Also, it might make PM's BF easier to understand for new users if they didn't have to know that they must manually classify good messages as good in the beginning, but only until they reach the 1,000-word minimum. Then, they only have to do it for false positives. I think that little inconsistencies like this are what really frustrate new users of computers (and I think that it's a little confusing for advanced users too). Also, this requires that they know that they have to continually open "Tools > Junkmail Filtering" to manually monitor the statistics (whether they want to or not) until the 1,000-word minimum is met for the good-word corpus. If nothing else, maybe PocoMail could present a message box to the user when the 1,000-word minimum is met for the good corpus?
  • What is the purpose of having a minimum-occurrence count for each word? I guess that I can understand the logic if the minimum is one or even two (to handle random words), but anything higher than that seems too high (again, these questions come from my ignorance of Bayesian filtering).


I don't expect or need answers from PSI, but I wanted to ask these questions in the community to see if anyone wanted to briefly explain this.
Pete
 

Postby paleolith » Thu Sep 30, 2004 3:25 pm

Regarding the minimum occurrence count per word -- this is to avoid having words with only one or two occurrences skewing the results.

The way the technique works, each word is assigned a "spam probability" based on its actual frequency of occurrence in your email and spam -- and only yours. It's clipped, for example at 1% and 99%, to prevent words which have only occurred in one or the other from having veto power.

But a word which has occurred only once then has a probability of either 1% or 99%, and if this turns out to be wrong, it can skew the results badly.

By contrast, a word which has occurred five times might also have a probability of 1% or 99% (or 20% or 40% etc), but it's a lot more reliable.
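The per-word calculation described above can be sketched roughly as follows. This is a simplified form of the formula in Paul Graham's article, not PocoMail's actual code: it omits Graham's doubling of good counts, and the corpus sizes n_good and n_bad (100 messages each) are hypothetical, since the thread never gives them. The word counts come from Pete's first post.

```python
def word_spam_probability(good, bad, n_good, n_bad, min_total=5):
    """Estimate how spammy a word is, or return None if it is too rare.

    good, bad:     occurrences of the word in the good and spam corpora
    n_good, n_bad: number of messages in each corpus (assumed values here)
    min_total:     assumed minimum occurrence count before a word is trusted
    """
    if good + bad < min_total:
        return None  # too few sightings to be reliable
    good_freq = min(1.0, good / n_good)
    bad_freq = min(1.0, bad / n_bad)
    p = bad_freq / (good_freq + bad_freq)
    # Clip so that no single word gets veto power in either direction.
    return max(0.01, min(0.99, p))

# Counts from the DBGOOD.INI / DBSPAM.INI excerpts; corpus sizes are made up.
p_xxx = word_spam_probability(good=0, bad=5, n_good=100, n_bad=100)
p_huge = word_spam_probability(good=2, bad=9, n_good=100, n_bad=100)
p_hustler = word_spam_probability(good=0, bad=2, n_good=100, n_bad=100)
```

With these assumed corpus sizes, "xxx" is clipped to 0.99, "huge" comes out near 0.82, and "hustler" is skipped because it has only two sightings in total.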

It might be possible to extend the technique to take the total number of occurrences into account, so that a one-occurrence word would be meaningful but at a lower weight. The problem I see with this is that when you'd accumulated tens of thousands of emails in your training database, it would become much less flexible at identifying new spammer tricks.
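One published way to do this kind of weighting is Gary Robinson's adjustment, which blends a word's observed probability with a neutral prior in proportion to how often the word has been seen. The sketch below illustrates that idea; nothing in this thread says PocoMail uses it, and the strength and prior values are assumptions.

```python
def weighted_probability(p, n, strength=1.0, prior=0.5):
    """Blend an observed spam probability with a neutral prior.

    p:        observed spam probability of the word
    n:        total number of times the word has been seen
    strength: how many sightings the prior is worth (assumed value)
    prior:    probability assumed for a never-seen word (assumed value)
    """
    return (strength * prior + n * p) / (strength + n)

unseen = weighted_probability(0.99, n=0)     # falls back to the prior
one_hit = weighted_probability(0.99, n=1)    # pulled halfway toward the prior
five_hits = weighted_probability(0.99, n=5)  # mostly trusts the observation
```

A one-occurrence word still counts, but at a reduced weight, while a word seen many times ends up close to its observed probability.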

Paul Graham was responsible for popularizing the idea of Bayesian filtering, and his articles are still good ones to read:

http://www.paulgraham.com/antispam.html

There's technical material in there, but you can skip over what you don't want to read and still pick up a lot from the rest.
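For anyone who wants the gist without reading the whole article, here is a rough sketch of the combining step Graham describes: take the most "interesting" word probabilities (those furthest from 0.5, up to 15 of them) and combine them into one score. The input probabilities below are made-up values, and this is Graham's published method rather than a description of PocoMail's internals.

```python
import math

def most_interesting(word_probs, limit=15):
    """Keep the probabilities furthest from the neutral value 0.5."""
    return sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

def combined_spam_probability(word_probs):
    """Combine per-word probabilities into one message score.

    Assumes every probability is already clipped to [0.01, 0.99],
    so the denominator cannot be zero.
    """
    prod_spam = math.prod(word_probs)
    prod_good = math.prod(1.0 - p for p in word_probs)
    return prod_spam / (prod_spam + prod_good)

# Hypothetical per-word probabilities for a single message.
probs = [0.99, 0.82, 0.40, 0.01]
score = combined_spam_probability(most_interesting(probs))

junk_at_90 = score > 0.9  # Graham's suggested cutoff
junk_at_50 = score > 0.5  # the lowered threshold Pete tried
```

In this made-up example one strongly "good" word offsets one strongly "spammy" word, so the score lands around 0.75: junk at a 50% threshold but not at 90%.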

... now that Pocomail has an integrated Bayesian filter, I gotta get busy about converting to it ...

Edward
paleolith

Postby Pete » Fri Oct 01, 2004 7:30 am

Thanks for the info, Edward.
Pete
 

