Spam filtering

by **wxPhil** » Sat Jan 05, 2008 1:28 pm

Can I suggest (ask for?) a step-by-step "idiot's guide" to spam filtering, and especially with regards to Bayesian Filter training....? It isn't really very obvious exactly how to go about this; what steps to take, in what order....
regards...
Phil

by **markr** » Mon Jan 07, 2008 12:22 am

I would also like to see something like this.

regards,

Mark

by **Code Crafters** » Mon Jan 07, 2008 11:57 am

Ok, here's a simple guide for how to set up SPAM filtering:

1) Make sure you're running the latest version.

2) Run the SPAM Filtering setup wizard from the first page of the SPAM Filtering settings and set it to medium level protection. Disable the SPAM trap email address if you don't want to use this. A SPAM trap is an email address hidden in your website to catch out bots that scan your website for email addresses to send SPAM to. This wizard will set up nearly all SPAM filters with appropriate settings for most mail servers.

3) If desired, set up Bayesian Filtering separately, because it is too complicated for the wizard to set up quickly. This is a learning system that will stop 99.5% of all SPAM once trained well, but you shouldn't act on its results until at least 1000 SPAM and 1000 non-SPAM mails have been learnt from. Set this up to use only the Auto-Learn from Users training method, which allows participating users to archive their mail into SPAM / non-SPAM folders for the Bayesian filter to learn from. Specify the users and folders to be used in the Bayesian settings. There is a delay of X days (default of 7) before the mails are learnt from, to allow the user to correct any incorrectly placed mails that they or content / user filters etc. may have moved to those folders. If a mail is incorrectly learned from, you can still move it to the correct folder, where it will be re-learned. It is recommended to keep all of these mails in case some bad learning is done by any user and the system needs to be retrained.

4) There is also a relaying exemption option on the first page of the SPAM filtering settings to allow all authenticated users to skip SPAM filtering as well as white and black lists for direct IP allow / block lists.

5) If you need further custom filtering you can use content and user filtering to achieve this.
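To make the learning delay in step 3 a bit more concrete, here is a minimal Python sketch of the idea. It is only an illustration, not the mail server's actual code: the function name `mails_ready_to_learn`, the `filed_at` field and the sample data are all made up for the example. The only thing taken from the guide is the default delay of 7 days.

```python
from datetime import datetime, timedelta

# Default delay from the guide above; in the real product this is a setting.
LEARN_DELAY_DAYS = 7

def mails_ready_to_learn(mails, now, delay_days=LEARN_DELAY_DAYS):
    """Return the mails that have sat in a training folder for at least delay_days.

    Each mail is a dict with a hypothetical 'filed_at' datetime recording
    when it was placed in the SPAM / non-SPAM folder.
    """
    cutoff = now - timedelta(days=delay_days)
    return [m for m in mails if m["filed_at"] <= cutoff]

now = datetime(2008, 1, 14, 12, 0)
spam_folder = [
    {"id": "a", "filed_at": datetime(2008, 1, 5, 9, 0)},   # filed 9 days ago
    {"id": "b", "filed_at": datetime(2008, 1, 12, 9, 0)},  # filed 2 days ago
]
ready = mails_ready_to_learn(spam_folder, now)
print([m["id"] for m in ready])  # ['a']
```

Mail "b" is still inside the delay window, so the user has time to move it to the other folder before anything is learnt from it.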

by **wxPhil** » Tue Jan 08, 2008 10:21 am

Bayesian filtering:

should we wait until we have accumulated 1,000 "good" and "bad" emails before attempting to "teach" the filter? Or can we start with fewer (we already have 1,000 spam, but are far short on genuine ones so far!) - I am still not sure how long this "teaching" process should go on... indefinitely? Do I set the auto-learn up and then just forget about it (until, maybe, we think it's not performing well, and then wipe it and start again)?

cheers...
Phil

by **Code Crafters** » Tue Jan 08, 2008 11:51 am

The system auto-learns, so you just put the mails in the appropriate folders. After 7 days (you can shorten this if you want) the system will poll all participating users and the appropriate folders and auto-learn the mails. The main thing is to have 1000 SPAM mails before acting on the results, but you still need some (maybe 100) non-SPAM mails for the system to work reasonably accurately. You should just keep storing your mails in this way and ideally allow the system to keep auto-learning, as it needs to learn new SPAM mail formats as they emerge, since spammers are always coming up with new ways of trying to get through. Our Bayesian filter has learned thousands of mails now and blocks about 99.5% of all our SPAM on its own.
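As a rough sketch of the "don't act on the results yet" rule of thumb, something like the following check captures it. The numbers are the ones quoted in this post; the function and constant names are hypothetical and do not correspond to any setting in the product.

```python
# Rules of thumb quoted in this thread, not product settings.
MIN_SPAM_LEARNED = 1000
MIN_NON_SPAM_LEARNED = 100

def ready_to_act(spam_learned, non_spam_learned):
    """True once enough mail of both kinds has been learnt to trust the scores."""
    return (spam_learned >= MIN_SPAM_LEARNED
            and non_spam_learned >= MIN_NON_SPAM_LEARNED)

print(ready_to_act(1000, 40))   # False: plenty of SPAM, too few genuine mails
print(ready_to_act(1200, 150))  # True
```

Until the check passes, the filter can still be left to score and learn; the point is only to hold off on deleting or moving mail based on its verdicts.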

by **wxPhil** » Wed Jan 09, 2008 10:11 am

ok - final question (some people will believe anything! :lol:) - although we already have 1,000+ spam emails (just in the last week) we are a long way short of an equivalent number of genuine ones.... would I get away with "cheating" by forwarding to myself loads of old emails (from another account, maybe) and using them to teach the Bayesian filter? As it uses ALL the words to learn I think I might, but don't want it to think that an email is more likely to be genuine if, for example, the subject line begins "FW: "....

Only asking because we are keen to get this anti-spam kicking in asap, and are in a hurry to "teach" it - our spam is getting out of control and a real nuisance to all.... (as I'm sure you are only too well aware...)

by **Code Crafters** » Wed Jan 09, 2008 11:53 am

Bayesian filtering should ideally be done on a per-user basis for best accuracy. I would advise that you use the 1000 SPAM mails and whatever non-SPAM you've already got, and not any other old emails. Start the learning now, but while it becomes more accurate just check the SPAM folder and move any non-SPAMs out periodically (ideally before the 7-day learning delay is up). Then just store all new non-SPAM mails in the non-SPAM folder and keep learning. With more and more learning, the system will keep getting more accurate at catching SPAMs and will produce fewer false positives. We still have to check our SPAM folder and remove a couple of false positives every few weeks. We get over 1000 SPAMs a day but don't get many in our Inbox at all now that our Bayesian filter is really well trained.

by **wxPhil** » Fri Jan 11, 2008 11:47 am

You didn't really think it was the last question, did you? :-)

This one may be though!

OK, I set up Bayesian auto-learn, and it seems to be doing quite well... I have set up a custom filter to move all suspected spam to a "check" folder where I can double-check them later (for now). There were a bunch in there this morning - good! My question is: is there any point in moving them now to the SPAM folder for the filter to learn from, or will it say "I've already looked at these" and ignore them? Or has it so far just marked them, but not learnt from them?

Thanks...

by **Code Crafters** » Mon Jan 14, 2008 11:52 am

The SPAM flag added to a mail when a SPAM filter triggers is just for identification purposes in content / user filtering, so that you can act on it by editing the subject, moving the mail to another folder, etc. You still need to move the mail to the SPAM / non-SPAM folders for Bayesian filtering to learn from it. This gives you the opportunity to correct any wrong filter triggering before the mail is learnt from after 7 days (by default). At this point Bayesian will have done its scoring part for the mail, but not the training part.
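The difference between the scoring part and the training part can be shown with a toy Bayesian filter in Python. This is a textbook-style sketch under simplifying assumptions (whole words as tokens, add-one smoothing, equal prior for SPAM and non-SPAM), not the algorithm the mail server actually uses, and the class name `TinyBayes` is invented for the example.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy word-based Bayesian filter: train() learns, score() only reads."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Training: update the learned counts from a mail filed as spam or ham.
        self.mail_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def score(self, text):
        # Scoring: combine per-word evidence into a 0..1 spam score.
        # Nothing is learnt here; the counts are left untouched.
        log_odds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.word_counts["spam"][word] + 1) / (self.mail_counts["spam"] + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (self.mail_counts["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

bayes = TinyBayes()
bayes.train("cheap pills online", "spam")
bayes.train("meeting notes attached", "ham")

print(bayes.score("cheap pills") > 0.5)    # True: flagged as likely SPAM
print(bayes.mail_counts)                   # unchanged by scoring
```

Scoring a mail marks it but leaves the learned counts as they were, which is why a flagged mail still has to be filed in the SPAM or non-SPAM folder before it contributes to training.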
