Fingerprinting & Fuzzy Hashing Explained
Introduction
Humans can easily tell when two things are visually similar. For a computer, however, this task is not as straightforward. In recent years we have seen an uptick in promising new technologies, such as computer vision. Computer vision is closing the gap between human and machine at an astonishing pace and, as a result, companies are leveraging its capabilities for a variety of purposes.
While computer vision is impressive and can feel like a solution to a variety of problems, it has some limitations. Deploying it at scale is very resource intensive in both monetary and physical terms. These two characteristics alone can make it impractical for a large number of use cases.
In the example of email analysis, utilising this method for every single inbound email would not be practical; it is the equivalent of using a sledgehammer to crack a nut. A lower cost, faster method of detecting similar emails is through the use of fingerprinting and fuzzy hashing.
What is fingerprinting?
Fingerprinting is performed by providing an input to a function, which then generates an output known as a hash or digest. The hash is an algorithmically generated, fixed-size string of characters that represents the data within a file. The algorithms used are deterministic, which means that the same input will generate the same output every time. They are also designed so that even a small change to the input results in a drastically different output.
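To make these properties concrete, here is a minimal sketch using Python's standard hashlib module. The helper name fingerprint is our own choice for illustration; it simply wraps MD5 and shows that the digest is always the same length and always the same for the same input.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the input as a hex string."""
    return hashlib.md5(data).hexdigest()


small = fingerprint(b"hello")
large = fingerprint(b"hello" * 10_000)

# Fixed size: 32 hex characters regardless of input length.
print(len(small), len(large))

# Deterministic: hashing the same bytes again gives the same digest.
print(small == fingerprint(b"hello"))

# A different input gives a different digest.
print(small == large)
```

Because the digest is short and fixed in size, it is cheap to store and compare, which is what makes it useful as a fingerprint.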
How Does Fingerprinting Work?
Fingerprinting is typically done using MD5 or SHA-1. While MD5 and SHA-1 are no longer suitable for cryptographic purposes, we can still use them in this fashion since they are not being used to protect the data, and a simpler algorithm is quicker to process in time-sensitive applications. Let's explore an example scenario where we fingerprint an email.
Here we see an email from Lionel to Luke. It looks like a fairly generic sales email, as there is nothing particularly malicious about the contents. However, it would likely be unwanted by the majority of people.
It seems a colleague, Homer, has also received the same email. Everything is identical except for the greeting which has been updated to “Hey Homer” instead of “Hi Luke”.
Let’s make an MD5 hash of each file to compare:
Output for email with greeting “Hi Luke”
7c55bb1949d71433dcb98939e31f2187
Output for email with greeting “Hey Homer”
3b51586d8e638fe2c3de379070b9901e
As we can see, these outputs are completely different despite the inputs being very similar, which means that comparing the hash values tells us nothing about how similar the original emails are. In this hypothetical scenario, the sender was able to modify the greeting, rendering our generic fingerprinting technique ineffective.
What happens if this has been sent to thousands of recipients? There needs to be a better method of detecting similar content.
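The same experiment can be reproduced in a few lines of Python. The email text below is a made-up stand-in, not the message from the example above, so the digests will not match the ones shown; the point is only that changing the greeting changes the whole digest.

```python
import hashlib

# Hypothetical email body, invented for this sketch.
BODY = (
    "\n\nI wanted to reach out about our new sales platform. "
    "Do you have 15 minutes this week for a quick call?\n\nLionel\n"
)

email_luke = "Hi Luke," + BODY
email_homer = "Hey Homer," + BODY


def md5_hex(text: str) -> str:
    """MD5 digest of a string, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


digest_luke = md5_hex(email_luke)
digest_homer = md5_hex(email_homer)

# Count how many hex characters happen to agree position by position.
matching = sum(a == b for a, b in zip(digest_luke, digest_homer))

print(digest_luke)
print(digest_homer)
print(f"{matching} of 32 characters match by position")
```

Any characters that do match are down to chance, so the count says nothing about how much of the email text is shared.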
Fuzzy Hashing Explained
This is where a concept called "Fuzzy Hashing" comes to the rescue. Rather than creating a single hash from all of the bytes in the file, like our previously used MD5 function, a typical fuzzy hash uses a rolling hash to split the data into "blocks" and hashes each block separately.
Since the input is now processed in blocks, our issue with slight modifications causing massive changes in the output should be resolved: a small edit only affects the blocks it touches, so the parts of the file that are identical still produce the same parts of the output. Once we have our outputs, we can then calculate the similarity between the two files.
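The block-based idea can be illustrated with a toy implementation. This is a deliberately simplified sketch and not how any production fuzzy hash is implemented: the rolling value is just a sum over a small window, MD5 stands in for the per-block hash, and the names toy_fuzzy_hash, window and modulus are our own.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def _block_char(block: bytes) -> str:
    """Reduce one block to a single signature character (MD5 as a stand-in)."""
    return ALPHABET[hashlib.md5(block).digest()[0] % 64]


def toy_fuzzy_hash(data: bytes, window: int = 7, modulus: int = 16) -> str:
    """Toy piecewise hash: one signature character per content-defined block."""
    signature = []
    block_start = 0
    rolling = 0  # sum of the last `window` bytes
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # The rolling value depends only on nearby bytes, so boundaries are
        # decided by local content rather than by position in the file.
        if rolling % modulus == modulus - 1:
            signature.append(_block_char(data[block_start:i + 1]))
            block_start = i + 1
    if block_start < len(data):
        # Whatever is left after the last boundary forms the final block.
        signature.append(_block_char(data[block_start:]))
    return "".join(signature)


# Hypothetical email text, invented for this sketch.
body = b" I wanted to reach out about our new sales platform. Talk soon, Lionel"
print(toy_fuzzy_hash(b"Hi Luke," + body))
print(toy_fuzzy_hash(b"Hey Homer," + body))
```

Because boundaries are chosen from the content itself, an edit near the start of a message only disturbs the blocks around it. Once both inputs reach a shared boundary in the unchanged text, every later block, and therefore every later signature character, is the same.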
Let’s create a proof of concept using the same emails as before for our input and using a fuzzy hash generation program called "ssdeep".
Here is the output:
ssdeep hash output for email with greeting “Hi Luke”
XPEW4kzeFo2rZS+PwoM8snSYlKnVyzeFo2rZS+PwoMOyrUq:XcfkzeFocZS+PwJ84/KVyzeFocZS+Pwb
ssdeep hash output for email with greeting “Hey Homer”
sEW4kzeFo2rZS+PwoM8shYKyKnrLNyzeFo2rZS+PwoMOyrUf8:bfkzeFocZS+PwJ8DKrLNyzeFocZS+PwZ
Unlike the MD5 output hashes, these look immediately more similar to the naked eye, which is very promising. There is a large number of shared characters:
XPEW4kzeFo2rZS+PwoM8snSYlKnVyzeFo2rZS+PwoMOyrUq:XcfkzeFocZS+PwJ84/KVyzeFocZS+Pwb
sEW4kzeFo2rZS+PwoM8shYKyKnrLNyzeFo2rZS+PwoMOyrUf8:bfkzeFocZS+PwJ8DKrLNyzeFocZS+PwZ
To verify this mathematically, ssdeep includes a compare function. Using this, we can calculate the similarity between the two hashes, which is returned as 85%!
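To give a feel for how two signatures can be turned into a score, here is a rough sketch based on plain edit distance scaled to a 0 to 100 range. This is an assumption-laden illustration rather than ssdeep's actual scoring, which differs in its details, so it should not be expected to reproduce the 85% figure above. The function names are our own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def similarity_score(sig_a: str, sig_b: str) -> int:
    """Scale edit distance to a 0-100 score, where 100 means identical."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(sig_a, sig_b) / longest))


print(similarity_score("kitten", "kitten"))
print(similarity_score("kitten", "sitting"))
print(similarity_score("abc", "xyz"))
```

A score like this only makes sense when applied to fuzzy hash signatures. Applied to two MD5 digests, it would measure nothing useful, for the reasons covered earlier.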
Real World Uses for Fuzzy Hashing
What about some real world examples where it really matters? Where does the true value of this technique lie?
Threat actors commonly utilise toolsets known as “phish kits” to create webpages and email templates based on real world companies. These phish kits are sold on the dark web for a low price, contain all the necessary code, and are very easy to deploy. Being able to send large volumes of phishing emails without a huge amount of configuration greatly increases the speed and efficacy of attacks.
Typically when targeting a company, some minor modifications will be made between iterations, such as the end user’s name and company’s name, to make emails and webpages more believable.
From what we have learned above, fuzzy hashing is a perfect method of exploiting this shortcoming, since we can now effectively check whether multiple people across multiple domains have received the same phishing campaign.
How Does Mesh Utilize These Techniques?
Fingerprinting is one of the many ways we help protect against threats.
At runtime, all inbound emails are parsed and fingerprinted using both traditional methods (MD5) and various fuzzy hash algorithms. Rather than storing the original email in a database, we can store these hashes, as they represent the content without the requirement to retain sensitive information.
In the event a false negative is reported to us using our “Live Email Tracker” or as an attachment to our spam mailbox, we check for two things:
Does the reported email’s hash match a previously generated MD5 hash?
Is the reported email’s hash similar to a previously generated fuzzy hash?
If the email matches an MD5 hash, this indicates the threat actor took no extra steps to vary any of the sent messages and it is much more likely to be a "spray and pray" attack or a bulk email.
If the email is similar to a fuzzy hash, it indicates a very similar structure to something that has been seen before. Depending on the type of reported email, this can be an indicator that the threat actor has sent out unique emails to particular individuals.
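The two checks can be sketched as a small lookup routine. This is purely illustrative and not Mesh's implementation: the HashStore class, the 80-point threshold, and the use of Python's difflib as a stand-in similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure (0-100) between two fuzzy hash strings."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


class HashStore:
    """Hypothetical store that keeps hashes only, never the emails themselves."""

    def __init__(self, threshold: float = 80.0):
        self.threshold = threshold  # assumed cut-off for "similar"
        self.md5_hashes = set()
        self.fuzzy_hashes = []

    def add(self, md5_hash: str, fuzzy_hash: str) -> None:
        self.md5_hashes.add(md5_hash)
        self.fuzzy_hashes.append(fuzzy_hash)

    def classify(self, md5_hash: str, fuzzy_hash: str) -> str:
        # Check 1: identical content was seen before (bulk, unvaried send).
        if md5_hash in self.md5_hashes:
            return "exact"
        # Check 2: similar structure was seen before (varied campaign).
        for known in self.fuzzy_hashes:
            if similarity(known, fuzzy_hash) >= self.threshold:
                return "similar"
        return "unseen"


store = HashStore()
store.add("md5-of-first-email", "abcdefgh")  # placeholder hash values
print(store.classify("md5-of-first-email", "abcdefgh"))
print(store.classify("md5-of-other-email", "abcdefgx"))
print(store.classify("md5-of-third-email", "zzzzzzzz"))
```

Checking the exact hash first keeps the common case cheap, since a set lookup is much faster than comparing the reported email against every stored fuzzy hash.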
If we are confident that the inbound email is in fact malicious and these similar emails are the same campaign, we can now automatically develop a ruleset to actively detect any similar emails in future.
In our upcoming product, Mesh 365, we allow for cross-tenancy remediation, which will allow you to remove these already delivered emails from the inboxes of your clients in one central location. The ability to remediate emails directly from multiple inboxes can save time, money, and most importantly, prevent a potential security incident.
Conclusion
Now more than ever, organizations are looking towards their MSP or technology provider to help them protect their employees against the evolving email threat landscape. Mesh provides specialist protection against the complete spectrum of email attacks. For more information on how Mesh provides MSPs with the tools to protect their clients, request a free trial or NFR account today.