At LOVOO, we’re putting a lot of effort into improving our users’ experience in connecting with people. One of the most important factors is delivering only genuine contact and chat requests, which we achieve by developing sophisticated anti spam techniques.
In this introductory blog post, we give a short overview of our current architecture and some insights into the algorithms that help us separate so-called hammers (genuine users) from spammers.
The LOVOO anti spam tool
Our current anti spam system processes over one billion events per day. Every event represents an interaction between users within the LOVOO network (such as chatting, voting, or following). With over 1 TB of data per day archived for about 7 days, we’re managing over 7 TB of processed and historical data in order to make sound classification decisions. So far, we’ve reduced the number of user spam reports by over 50% (as of late August 2015).
We generally divide our users into hammers, spammers, fakers, and scammers.
Spammers usually try to advertise a competitor’s website and annoy our users only by sending classical spam messages, whereas fakers and scammers harm users directly (e.g., by blackmailing them) and have a much more significant influence on the trustworthiness of LOVOO. By taking measures against these types of disturbances, we’re improving our users’ experience on LOVOO and increasing the reputation and trustworthiness of our brand.
For the sake of simplicity, we do not distinguish between spammers, fakers, and scammers in the remainder of this blog post but refer to them simply as spammers.
Distribution of spammers
We’ve analyzed the origin of our spammers in the figure (Figure 1) below. The darker the shade of a country, the more spammers were detected there. Germany and France rank high for two reasons:
- These countries have a strong user base that reports spammers proactively, and
- The anti spam tool is very effective in finding these spammers.
While our user base is not yet very strong in the US, the anti spam tool has undoubtedly already left its mark there: in terms of detected spammers, the US ranks 3rd worldwide.
The following figure (Figure 2) shows a demographic analysis of spammers on LOVOO for the top 5 countries by monthly active users. In all countries except Brazil, spammers prefer female profiles. While the age of female spam profiles remains almost constant across all countries, the age of male spam profiles ranges from 29 (Brazil) to 37 (France). Finally, the devices used to access the spam profiles also vary significantly between countries: web devices dominate in France, whereas Android devices dominate in all other countries.
Anti spam measures are developed by analyzing user profiles of known spammers. Whenever we discover new attacks, we update our algorithms “on the go”. However, in order to develop algorithms and counter-techniques, we first have to understand how spammers behave and what their characteristics are (so-called feature extraction).
The following two sections give a short overview of our software architecture and one of the basic algorithms we’re using: Bayesian updating.
The LOVOO approach
At LOVOO, we want to take a user-centric approach to spam. Instead of deciding whether individual interactions are spam or not (e.g., an upload of a fake profile picture), we want to base our decision on the history of interactions a given user has created. It turns out that a method from statistical learning provides a solid mathematical foundation for exactly that: “Bayesian updating” (some sources refer to it as Bayesian inference or Bayesian learning). Our anti spam tool is built around this method.
At LOVOO, each user has a probability of being a spammer. Each interaction a user creates will be evaluated by several scorers which themselves compute a probability for this event to be spam or not spam. Suppose, for example, that most spammers on LOVOO are wearing red shirts in their profile pictures. Now, one hypothetical scorer could count the percentage of red colors in each profile picture upload. The higher this percentage, the higher the probability the scorer would return.
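To make the red-shirt example concrete, here is a minimal Python sketch of such a hypothetical scorer. Everything in it is an assumption for illustration only: the names `is_red` and `red_shirt_score`, the rule for what counts as a "red" pixel, and the mapping of the red fraction onto a score between 0.5 (neutral) and 0.9. It is not the scorer we run in production.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each channel in 0..255


def is_red(pixel: Pixel) -> bool:
    """Hypothetical rule: the red channel is bright and clearly dominant."""
    r, g, b = pixel
    return r > 150 and r > 2 * g and r > 2 * b


def red_shirt_score(pixels: Iterable[Pixel],
                    neutral: float = 0.5,
                    maximum: float = 0.9) -> float:
    """Map the fraction of red pixels onto a spam score.

    No red at all yields the neutral score; a fully red picture yields
    the maximum. Both bounds are illustrative assumptions.
    """
    pixels = list(pixels)
    if not pixels:
        return neutral  # no evidence either way
    red_fraction = sum(1 for p in pixels if is_red(p)) / len(pixels)
    return neutral + (maximum - neutral) * red_fraction
```

The only property the example relies on is monotonicity: the more red a picture contains, the higher the returned score.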
Bayesian updating now allows us to learn from this event and from the scorer’s evaluation of it. Bayesian updating makes use of the well-known Bayes’ rule for conditional probabilities, but it doesn’t interpret the rule in terms of events occurring simultaneously. Rather, Bayesian updating talks about evidence and a hypothesis. The evidence (e.g., a profile picture upload) may then either support or weaken the hypothesis that the given user is a spammer.
So far, however, we have not explained how we update this probability throughout a user’s history. This is where a further variable in Bayes’ theorem comes into play: the prior probability of the given hypothesis. Initially, we assume that every user has a 50% chance of being a spammer (i.e., the prior probability is 0.5). Whenever we apply Bayesian updating, the resulting probability (i.e., the posterior) is saved as the new prior probability for the hypothesis that the user is a spammer. This way, the prior changes with every event the user creates. Eventually, if we are certain enough (i.e., the prior has risen above a given threshold), the system blocks the given user automatically.
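The update loop described above can be sketched in a few lines of Python. This is a simplified illustration, not our production code. It assumes that a scorer's output `score` can be used as the relative likelihood of the event under the "spammer" hypothesis (and `1 - score` under the "genuine user" hypothesis), which is one common way to plug a score into Bayes' rule. The function names, the clamping constant, and the blocking threshold of 0.99 are hypothetical.

```python
from typing import Iterable, Tuple

EPSILON = 1e-6  # assumed clamp so that a score of exactly 0 or 1 cannot cause a division by zero


def bayes_update(prior: float, score: float) -> float:
    """Return the posterior spam probability after a single event."""
    score = min(max(score, EPSILON), 1.0 - EPSILON)
    prior = min(max(prior, EPSILON), 1.0 - EPSILON)
    evidence_for = score * prior                       # event seen and user is a spammer
    evidence_against = (1.0 - score) * (1.0 - prior)   # event seen and user is genuine
    return evidence_for / (evidence_for + evidence_against)


def process_events(scores: Iterable[float],
                   prior: float = 0.5,
                   block_threshold: float = 0.99) -> Tuple[float, bool]:
    """Fold a user's event scores into the prior, one event at a time.

    Returns the final probability and whether the user was blocked.
    """
    for score in scores:
        prior = bayes_update(prior, score)  # the posterior becomes the next prior
        if prior >= block_threshold:
            return prior, True
    return prior, False
```

Under these assumptions, a neutral score of 0.5 leaves the prior unchanged, and four consecutive events scored at 0.8 raise the probability from 0.5 to 256/257 (about 0.996), which crosses the example threshold. A score below 0.5 pulls the probability back down, so a single suspicious event does not condemn a user.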
The following overview (Figure 3) depicts the anti spam system’s architecture, which was developed with vertical and horizontal scalability in mind. As LOVOO grows, it’s easy to set up new anti spam instances in order to keep up with the larger number of users (and probably spammers). As of today, we’re managing about 36 million users and 7 TB of event data with just a very small server cluster: our tool is designed to be very efficient when it comes to resources.
The anti spam tool consists of several components, the most important of which are the learner, the dispatcher, and the scorers. Incoming events are passed to the dispatcher, which takes care of providing all events to the scorers for evaluation. Once a scorer has evaluated an event, the learner propagates the evaluation to all other anti spam instances for learning purposes.
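The flow between these components can be illustrated with a small in-process Python sketch. It is a toy model under stated assumptions: all class and function names (`Event`, `Instance`, `Learner`, `Dispatcher`, `link_scorer`) are hypothetical, instances are plain objects rather than separate servers, and the learner also records each evaluation on the local instance, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    user_id: str
    kind: str                       # e.g. "chat", "vote", "follow"
    payload: dict = field(default_factory=dict)


Scorer = Callable[[Event], float]   # returns a spam probability for one event


class Instance:
    """One anti spam instance keeping the evaluations it has learned."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.evaluations: Dict[str, List[float]] = {}

    def learn(self, user_id: str, score: float) -> None:
        self.evaluations.setdefault(user_id, []).append(score)


class Learner:
    """Propagates each evaluation to the local instance and to all peers."""

    def __init__(self, local: Instance, peers: List[Instance]) -> None:
        self.local = local
        self.peers = peers

    def propagate(self, user_id: str, score: float) -> None:
        for instance in [self.local, *self.peers]:
            instance.learn(user_id, score)


class Dispatcher:
    """Hands every incoming event to every registered scorer."""

    def __init__(self, scorers: List[Scorer], learner: Learner) -> None:
        self.scorers = scorers
        self.learner = learner

    def dispatch(self, event: Event) -> None:
        for scorer in self.scorers:
            self.learner.propagate(event.user_id, scorer(event))


def link_scorer(event: Event) -> float:
    """Hypothetical scorer: chat messages containing a link look more like spam."""
    return 0.9 if "http" in event.payload.get("text", "") else 0.5
```

In this sketch, dispatching a chat event through `Dispatcher([link_scorer], Learner(local, [peer]))` leaves the same score on both `local` and `peer`, which mirrors the idea that every instance learns from evaluations made anywhere in the cluster.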
We’re maintaining a very strong in-house cooperation with other departments such as our customer service in order to continuously measure the accuracy of our scorers.
Stay tuned for more information over the next few months about our progress on hunting spammers!