
Image Spam
An anti-spam company's founder explains this increasingly
troublesome scourge of e-mail.
What is it?
Image spam is spam e-mail that delivers its sales pitch in the form of
an image, such as a JPEG or GIF file. The e-mail may contain no other
content, or it may include nonsensical text, unrelated text such as jokes
or news reports, or simply gibberish.
Why image spam?
As content-filtering spam software became more sophisticated and accurate,
spammers found it more difficult to pitch their wares using normal text
or HTML messages. As a result, they turned to encoding their sales pitch
as an image. This bypasses most anti-spam content filters entirely,
because they cannot analyze the words inside an image.
How can we combat image spam?
It turns out that image spam can be detected quite accurately using the
same techniques that fight other spam:
The gibberish or nonsense text included with image spam very quickly
becomes "red-flag" text for a Bayesian filter. A distributed Bayesian
database such as Roaring Penguin's Training Network adapts extremely quickly
to most image spam.
An image with little or no accompanying text is also a red flag, because
almost all legitimate mail that contains images also includes a reasonable
amount of body text.
Normal connection-level techniques such as greylisting and DNS-based
RBLs continue to be effective against image spam.
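The three kinds of signal above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the token probabilities, the 40-character threshold, the 300-second greylisting delay, the zone name dnsbl.example.org, and all function and class names are hypothetical. The Bayesian part uses the plain naive Bayes product rule; a real filter would learn its token table from training mail.

```python
import ipaddress
import math
import re
from email.message import EmailMessage, Message

# Hypothetical per-token spam probabilities a trained Bayesian filter might hold.
TOKEN_SPAM_PROB = {"xqzv": 0.99, "blorf": 0.98, "meeting": 0.05, "invoice": 0.20}
UNKNOWN_TOKEN_PROB = 0.4  # assumed value for tokens the filter has never seen


def bayes_spam_score(text: str) -> float:
    """Combine per-token probabilities with the naive Bayes product rule."""
    log_spam = log_ham = 0.0
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        p = TOKEN_SPAM_PROB.get(token, UNKNOWN_TOKEN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # prod(p) / (prod(p) + prod(1 - p)), computed in log space; the clamp
    # keeps math.exp from overflowing on very long messages.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))


def image_with_little_text(msg: Message, min_chars: int = 40) -> bool:
    """True if the message carries an image but almost no plain body text."""
    has_image = False
    text_chars = 0
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            has_image = True
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            text_chars += len(payload.decode("utf-8", errors="replace").strip())
    return has_image and text_chars < min_chars


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return ".".join(reversed(octets)) + "." + zone


class Greylist:
    """Defer the first delivery attempt of each (client, sender, recipient) triplet."""

    def __init__(self, delay_seconds: float = 300.0) -> None:
        self.delay = delay_seconds
        self.first_seen: dict[tuple[str, str, str], float] = {}

    def should_accept(self, client_ip: str, sender: str, recipient: str,
                      now: float) -> bool:
        first = self.first_seen.setdefault((client_ip, sender, recipient), now)
        return now - first >= self.delay


# Usage: an image-only message with gibberish filler text.
msg = EmailMessage()
msg.set_content("xqzv blorf")
msg.add_attachment(b"GIF89a", maintype="image", subtype="gif", filename="a.gif")
suspicious = image_with_little_text(msg) and bayes_spam_score("xqzv blorf") > 0.9
```

None of these checks looks at the pixels of the image. They score only the accompanying text, the structure of the message, and the connection that delivered it.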
What about OCR?
Some anti-spam vendors have resorted to using Optical Character Recognition
(OCR) tools to extract the text from image spam for analysis. Unfortunately,
OCR has met with limited success. The state of the art in OCR is not very
advanced. Furthermore, OCR tools are not designed to extract text from
an image that is actively being manipulated by an adversary. Spammers
have reacted to OCR tools by obfuscating the text in the images they send.
The obfuscated text is still relatively easy for humans to recognize,
but very difficult for OCR tools to extract.
In addition to the accuracy problem, OCR is very compute-intensive and
can greatly slow down a content filter.