How does mailcow learn Spam and Ham?

mailcow uses several methods to reliably distinguish between spam (unwanted emails) and ham (desired emails) and to continuously improve the filtering system:

Global Bayesian filter
User-defined Bayesian filter
Fuzzy hashing

Bayesian Filters – Probability-Based Evaluation

Global Bayesian Filter

The global Bayesian filter evaluates incoming emails based on statistical probabilities. It learns automatically from messages whose spam score is clearly above or clearly below a threshold.

For a message to be trained, there must be a high degree of certainty about its classification as spam or ham. Individual, random patterns are not sufficient – a certain level of repetition is required.
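
The threshold rule can be illustrated with a minimal Python sketch. The function name and both threshold values are hypothetical and chosen only for illustration; they are not mailcow defaults.

```python
from typing import Optional

# Hypothetical thresholds, for illustration only (not mailcow defaults).
SPAM_LEARN_SCORE = 12.0   # learn as spam at or above this score
HAM_LEARN_SCORE = -5.0    # learn as ham at or below this score


def autolearn_class(score: float) -> Optional[str]:
    """Return 'spam', 'ham', or None when the score is not decisive."""
    if score >= SPAM_LEARN_SCORE:
        return "spam"
    if score <= HAM_LEARN_SCORE:
        return "ham"
    return None  # ambiguous messages are not used for training


for score in (15.2, 3.0, -7.5):
    print(score, autolearn_class(score))
```

Messages that land between the two thresholds are skipped, so uncertain classifications do not end up in the training data.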

Additionally, the system regularly analyzes emails marked as "read" and older than a specific number of days. These messages are also incorporated into the learning process.
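
As a rough sketch, the selection of such messages could look like the following. The function name and the value of `MIN_AGE_DAYS` are assumptions; the actual number of days is not stated here.

```python
from datetime import datetime, timedelta

MIN_AGE_DAYS = 7  # hypothetical value; the real setting is not given here


def eligible_for_learning(is_read: bool, received_at: datetime, now: datetime) -> bool:
    """A message qualifies only if it was read and is older than the age limit."""
    return is_read and (now - received_at) > timedelta(days=MIN_AGE_DAYS)


now = datetime(2024, 1, 31)
print(eligible_for_learning(True, datetime(2024, 1, 1), now))   # read and old
print(eligible_for_learning(True, datetime(2024, 1, 30), now))  # read but recent
print(eligible_for_learning(False, datetime(2024, 1, 1), now))  # old but unread
```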

With manual training (e.g., through user actions), no probability is required. Instead, the storage location determines the classification: an email in the Junk folder is treated as spam, and an email in any other folder is treated as ham.

User-Defined Bayesian Filter

The user-defined filter responds directly to the user moving emails into or out of the Junk folder.

Unlike the global filter, it does not require statistical accumulation and starts learning after just a few interactions. Repeated training significantly increases accuracy.
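
The move-based training can be pictured with a small sketch. The folder name `Junk` matches the text above, while the function name and the returned action labels are hypothetical.

```python
from typing import Optional

JUNK = "Junk"  # Junk folder name as used in the text above


def learn_action_for_move(source: str, destination: str) -> Optional[str]:
    """Map a folder move to a training action (labels are illustrative)."""
    if destination == JUNK and source != JUNK:
        return "learn_spam"   # moved into Junk
    if source == JUNK and destination != JUNK:
        return "learn_ham"    # moved out of Junk
    return None               # moves between other folders train nothing


print(learn_action_for_move("INBOX", "Junk"))
print(learn_action_for_move("Junk", "INBOX"))
print(learn_action_for_move("INBOX", "Archive"))
```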

Important:
The user-defined filter has higher priority than the global filter: if the two filters classify an email differently, the assessment of the user-defined filter takes precedence.
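
A simplified way to picture this precedence is shown below. The function is hypothetical, and treating the preference as a hard override is an assumption made for the sketch; the text only states that the user-defined assessment is preferred, not how that preference is implemented.

```python
from typing import Optional


def resolve_verdict(global_verdict: str, user_verdict: Optional[str]) -> str:
    """Prefer the user-defined verdict; fall back to the global one.

    user_verdict is None when the user-defined filter has no opinion yet,
    for example because it has not been trained.
    """
    return user_verdict if user_verdict is not None else global_verdict


print(resolve_verdict("spam", "ham"))  # the filters disagree
print(resolve_verdict("spam", None))   # no user-defined verdict available
```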

Fuzzy Hashing – Intelligent Pattern Recognition

Fuzzy hashes are so-called "fuzzy checksums" that detect similar emails – even if certain parts of the message, such as names or wording, differ.

Since many spam emails are personalized (e.g., using the recipient's real name), the system generates a generalized hash value from such messages. This hash still matches when the message has slight structural changes.
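
The idea can be sketched with word shingles, a common technique for fuzzy matching. This is an illustrative simplification and not necessarily the algorithm mailcow uses; the function names, the shingle size, and the sample messages are all assumptions.

```python
import hashlib
import re


def shingles(text: str, size: int = 3) -> set:
    """Hash every run of `size` consecutive words (a "shingle")."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    windows = (" ".join(words[i:i + size]) for i in range(len(words) - size + 1))
    return {hashlib.sha1(w.encode()).hexdigest()[:12] for w in windows}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


spam_for_alice = ("Dear Alice, you have won a free prize. "
                  "Click the link below to claim your reward today.")
spam_for_bob = ("Dear Bob, you have won a free prize. "
                "Click the link below to claim your reward today.")
unrelated = ("Hi team, the quarterly report is attached. "
             "Please review it before the meeting on Friday.")

print(similarity(spam_for_alice, spam_for_bob))  # high: only the name differs
print(similarity(spam_for_alice, unrelated))     # low: no shared word runs
```

Changing the recipient's name only affects the few shingles that contain it, so the two personalized messages still share most of their shingle hashes.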

A fuzzy hash is only generated if a message is classified as spam with very high probability and has been received in a similar form multiple times. Simply moving a message is not sufficient to generate a hash.
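
The two conditions can be combined into a simple gate, sketched below. The class name and both limits are hypothetical; the text does not state the actual probability or repetition count.

```python
from collections import Counter

# Hypothetical limits, for illustration only.
MIN_SPAM_PROBABILITY = 0.99
MIN_OCCURRENCES = 3


class FuzzyLearnGate:
    """Decides whether a fingerprint has earned a stored fuzzy hash."""

    def __init__(self) -> None:
        self.seen = Counter()  # confident sightings per fingerprint

    def should_store(self, fingerprint: str, spam_probability: float) -> bool:
        if spam_probability < MIN_SPAM_PROBABILITY:
            return False  # not certain enough; do not even count it
        self.seen[fingerprint] += 1
        return self.seen[fingerprint] >= MIN_OCCURRENCES


gate = FuzzyLearnGate()
for probability in (0.995, 0.50, 0.999, 0.999):
    print(probability, gate.should_store("example-fingerprint", probability))
```

Only the third confident sighting of the same fingerprint passes the gate; the low-probability sighting is ignored entirely.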

This method helps reliably identify modified spam variants and combat them collectively across multiple user accounts.