Poco Forums • View topic - More PocoMail Junk Filtering vs. K9

More PocoMail Junk Filtering vs. K9

Discussion on Bayesian and standard junk mail filters

Moderators: Eric, Tomas, robin, Michael

Postby frazmi » Sun Aug 01, 2004 9:23 pm

Is the "30 good messages" an approximation in order to get the "1,000 good words" that the Bayesian setup screen mentions, or is this an additional (possibly undocumented?) requirement?
frazmi
Poco Enthusiast
 
Posts: 248
Joined: Tue Jul 27, 2004 1:27 am
Location: South Korea

Postby Alastria » Mon Aug 02, 2004 5:17 am

It was around 30, I don't remember the exact number. I picked out specific messages that I thought were good examples of legit mail that I get: personal correspondence, emails that mention my business, etc. After 30-40 emails, the Bayesian filters said they were then on. Things have been great since.
Alastria
New Arrival
 
Posts: 2
Joined: Sun Aug 01, 2004 7:30 pm

Postby frazmi » Mon Aug 02, 2004 3:17 pm

OK, it sounds like you hit the 1,000 word mark after about 30 messages -- it could have been fewer if you had classified long messages, and it might have been more if you had classified really short messages.
frazmi
Poco Enthusiast
 
Posts: 248
Joined: Tue Jul 27, 2004 1:27 am
Location: South Korea

Postby vamp07 » Wed Aug 04, 2004 11:59 pm

I went back and switched to Poco's Bayesian filters again and turned off the default filters. My accuracy is getting better. I would also agree it is probably in the >98% range. I was actually going to start using POPFile again after using Yahoo's Bayesian filtering, which is very good. I got POPFile all set up, but for some reason it was hanging frequently, so I decided to go pure Poco. I still think the implementation is overly complicated: there are all these options to train/untrain messages, while some methods, such as moving messages by hand, accomplish the same tasks automatically. In either case it does work.
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby Trapper » Sun Aug 08, 2004 7:26 am

Thanks to everyone for their input. I've personally decided to stick to K9, if for no other reason than the simplicity involved. The goal is to filter away as much spam as possible. If a utility does that as well as or better than another and takes much less effort on my part to do so, that's my choice. I find no need to make spam filtering a personal challenge or obsession. I pretty much discovered with a mailserver that filtering with content control, rules, DNSBLs, etc. can become extremely time consuming. I enjoyed this thread.
Trapper
Drop-in Visitor
 
Posts: 11
Joined: Wed Jul 28, 2004 10:30 am

Postby vamp07 » Tue Aug 10, 2004 12:36 am

I think some of these issues with accuracy vs. K9 or POPFile have something to do with new words. What does Poco do with new words? It's my understanding that POPFile just ignores them completely. My accuracy is at 91% and I notice that the emails that get through are the ones with random words, whereas with POPFile I remember this stuff generally getting trashed.
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby saoir » Tue Aug 10, 2004 1:49 am

FWIW: I used K9 and found it the very best of all of the spam filtering apps.

I had to abandon it because of the way it interacted with PocoMail as regards mid-sized and large attachments.

I am 'happy' with PocoMail's filtering but it is apparent that it is not nearly as effective as K9.
The reason I am 'happy' is because I like to have both processes in one app, the attachment problem (that I wrote about extensively in the now lost forum), and even though a few of the ~200 spam I get every day get through... it's not the end of the world......

I hope that Slaven devotes quality time to its improvement however, as this is a key area in email nowadays.
saoir
Poco Enthusiast
 
Posts: 201
Joined: Mon Jul 26, 2004 2:49 am

Postby vamp07 » Tue Aug 10, 2004 7:55 am

I agree completely about it being a key area. As long as Poco's Bayesian filter can get to about, let's say, 96-97%, I'll be happy. If POPFile or K9 are 1% above that, it's not enough that I would even notice. The reason I like POPFile vs. K9 is that with POPFile I can open the message that got misclassified from within POPFile and reclassify it. With K9 you need to go to K9 and find the message. In either case POPFile seems to have the same problem as K9 of getting stuck with larger emails, which is a real pain.
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby SFCurley » Wed Aug 11, 2004 1:10 am

On the what-POPFile-does-with-new-words question. My recollection from reading about how POPFile works is that it assigns never-before-seen words a probability equal to 1/(10*Corpus Size).
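
As a rough illustration of the rule recalled above (a sketch only, not POPFile's actual code), the probability for a never-before-seen word can be written as a one-line function. The name `unseen_word_probability` is hypothetical, and "corpus size" is assumed here to mean the total word count held for a bucket.

```python
def unseen_word_probability(corpus_size: int) -> float:
    """Probability given to a never-before-seen word: 1 / (10 * corpus size).

    Assumption: corpus_size is the total number of words stored for the bucket.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 1.0 / (10 * corpus_size)


# The larger the corpus, the smaller the probability given to an unseen word.
print(unseen_word_probability(1_000))
print(unseen_word_probability(50_000))
```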
SFCurley
 

Postby Trapper » Wed Aug 11, 2004 9:57 am

Saoir, my experience has also been that K9 is the best of the Windows spam filtering apps.

I wish the old forum had not been "lost". I would have liked to have read your thread on difficulty with K9 with large files. I have not experienced any difficulty with larger files.

I contacted Robin Keir, the K9 developer, concerning large file problems. Below is his reply:

"By default K9 will examine all emails however large they are. Because it can take time for an email to first be downloaded into K9 and then passed on to the email program some email applications timeout or complain, despite K9 sending "keep alive" messages to keep them awake.

If people are still having timeout problems with large emails they should select the "Don't filter messages larger than..." option on the Configuration page. With this enabled, emails larger than the specified size will pass right through K9 without delay and will not be examined."

---end of quote---

It's extremely rare that a spam mail is very large at all.

Trapper
Trapper
Drop-in Visitor
 
Posts: 11
Joined: Wed Jul 28, 2004 10:30 am

Postby robin » Wed Aug 11, 2004 10:10 am

Trapper wrote: "By default K9 will examine all emails however large they are. Because it can take time for an email to first be downloaded into K9 and then passed on to the email program some email applications timeout or complain, despite K9 sending "keep alive" messages to keep them awake."
This is also true of Anti-Virus software.
robin
 

Postby vamp07 » Wed Aug 11, 2004 11:54 pm

So the larger the corpus the less probable a word is of being bad?

SFCurley wrote: On the what-POPFile-does-with-new-words question. My recollection from reading about how POPFile works is that it assigns never-before-seen words a probability equal to 1/(10*Corpus Size).
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby SFCurley » Thu Aug 12, 2004 1:31 am

I think it applies that rule to both the good and bad corpora, so if the word is in neither the good nor the bad corpus, then the probability of that word being bad is (1/(10*bad corpus size)) and the probability of it being good is (1/(10*good corpus size)).

If the word is already in the good corpus but not in the bad corpus, its good probability is calculated normally using the actual corpus info, and its bad probability is calculated as above.
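
The two cases described here can be sketched per bucket as follows. This is only an illustration: the function name and the sample word counts are made up, and "calculated normally" is assumed to mean the word's count divided by the bucket's total word count.

```python
from collections import Counter


def word_probability(word: str, bucket: Counter) -> float:
    """Probability of `word` in one bucket.

    Seen word: count / total words in the bucket (assumed "normal" calculation).
    Unseen word: 1 / (10 * total words in the bucket).
    """
    total = sum(bucket.values())
    if total == 0:
        raise ValueError("bucket is empty")
    count = bucket[word]  # Counter returns 0 for a missing word
    if count > 0:
        return count / total
    return 1.0 / (10 * total)


# Hypothetical word counts, for illustration only.
good = Counter({"meeting": 3, "invoice": 1})  # 4 words in total
bad = Counter({"pills": 5, "winner": 5})      # 10 words in total

# "meeting" is in the good corpus only: normal rule for good, unseen rule for bad.
print(word_probability("meeting", good), word_probability("meeting", bad))

# "xyzzy" is in neither corpus: unseen rule for both.
print(word_probability("xyzzy", good), word_probability("xyzzy", bad))
```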
SFCurley
 

Postby vamp07 » Thu Aug 12, 2004 1:50 am

The logic makes no sense to me. The larger the difference between the two corpora, the more it swings in one direction or the other. A really large bad words corpus and a small good words corpus would make it swing towards being good. I would think you would want it to do the opposite. The logic being that most words you receive are bad, not good.
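
The swing described here can be put in numbers. The sketch below uses made-up corpus sizes and looks only at the per-word term for a word unseen in both corpora. A naive Bayes classifier typically also weighs in a prior for each bucket, which this sketch leaves out.

```python
def unseen_good_to_bad_ratio(good_size: int, bad_size: int) -> float:
    """How many times more 'good' than 'bad' an unseen word looks under 1/(10*size)."""
    p_good = 1.0 / (10 * good_size)
    p_bad = 1.0 / (10 * bad_size)
    return p_good / p_bad  # simplifies to bad_size / good_size


# Hypothetical sizes: a small good corpus and a large bad corpus.
ratio = unseen_good_to_bad_ratio(good_size=1_000, bad_size=50_000)
print(ratio)  # a value above 1 means the unseen word leans towards "good"
```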
vamp07
Frequent Visitor
 
Posts: 66
Joined: Mon Jul 26, 2004 11:31 am

Postby SFCurley » Thu Aug 12, 2004 2:11 am

Well, all I can say in PF's defense is that it seems to work very well. Here's a series of posts from PF's forum -- the primary answer being from the software's author, John Graham-Cumming:

POPFile assigns a probability to every word, for words that it has seen it naturally assigns the probability from the corpus. For unseen words it assigns the probability 1/(10 * size of that bucket's corpus)---i.e. a probability that indicates that the word is "unlikely" to appear.

The other possible choices are 0 (which would screw up classification since all classifications would be 0) or 1 (which would be a mistake since it would indicate that the word always appears).
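
To see why 0 would "screw up classification", here is a minimal sketch that scores a message as a plain product of per-word probabilities. It is not POPFile's code: the buckets, word counts, and function names are invented for illustration, and real classifiers often sum logarithms instead of multiplying.

```python
def bucket_score(words, counts, unseen_probability):
    """Multiply the per-word probabilities of `words` for one bucket."""
    total = sum(counts.values())
    score = 1.0
    for word in words:
        count = counts.get(word, 0)
        score *= count / total if count else unseen_probability(total)
    return score


def zero_for_unseen(total):
    """Alternative choice: unseen words get probability 0."""
    return 0.0


def popfile_style(total):
    """The rule quoted above: 1 / (10 * size of the bucket's corpus)."""
    return 1.0 / (10 * total)


# Hypothetical buckets and message.
buckets = {
    "good": {"lunch": 2, "report": 2},
    "bad": {"pills": 4, "cheap": 4},
}
message = ["cheap", "pills", "report"]

zero_scores = {name: bucket_score(message, counts, zero_for_unseen)
               for name, counts in buckets.items()}
smoothed_scores = {name: bucket_score(message, counts, popfile_style)
                   for name, counts in buckets.items()}

print(zero_scores)      # every bucket has at least one unseen word, so all scores collapse to 0
print(smoothed_scores)  # scores stay non-zero and can be compared
```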


A long time ago I investigated different values for the unknown word and determined that what we have is the most efficient.


Which is what I believe is the reason that my accuracy improved when I took all the "junk" words out of my spam bucket (about 75% of the word count was random letters not even words).

It made new "spam words" less "unlikely" ...if you will.
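
The effect of clearing out the junk words can be sketched with numbers. Only the 75% figure comes from the post. The starting word count is invented, and "corpus size" is again assumed to be the bucket's total word count.

```python
def unseen_word_probability(corpus_size: int) -> float:
    """1 / (10 * corpus size), the rule for never-before-seen words."""
    return 1.0 / (10 * corpus_size)


spam_words_before = 40_000  # hypothetical size of the spam bucket
junk_fraction = 0.75        # share of random-letter "words", per the post
spam_words_after = int(spam_words_before * (1 - junk_fraction))

p_before = unseen_word_probability(spam_words_before)
p_after = unseen_word_probability(spam_words_after)

# A smaller spam corpus gives unseen words a higher spam probability.
print(p_before, p_after, p_after / p_before)
```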
SFCurley
 
