Taxonomy is the science of naming, defining and classifying groups of biological organisms based on shared characteristics. Fundamentally, it is an organization scheme that has allowed scientists to study organisms without confusion or overlap since the Swedish naturalist Carl Linnaeus introduced his framework for a uniform naming system nearly 300 years ago. Today’s threat researchers can benefit from applying similar organizing principles in the fight against malware. In this post, condensed from a SANS Webcast, VMRay Sr. Research Analyst Tamas Boczan and SANS analyst Jake Williams explain why malware family identification matters, what exactly constitutes a malware family, and how to use this identification tactically to hone incident response.
What are malware families?
While there are millions of different malware samples, many of them can be grouped into families. Samples in the same family share common traits, such as the codebase or the development group that authored the original strain, to which all other derivative members can be traced.
As SANS Institute analyst Jake Williams explains, “when we talk about a malware family, it’s more than just malware that shares some common lineage. Rather, we’re talking about samples that share the majority of their code with another sample.”
One of the primary ways that malware authors create malware variants is through the use of builders, allowing them to customize certain features of the malware without having to recompile it. Says Williams, “a Builder is not necessary to declare a malware family, but certainly, any malware with a Builder meets the family definition since you’re churning out sample after sample with the same core code — just different configurations.”
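To make the builder idea concrete, here is a minimal Python sketch. It assumes a hypothetical builder that writes a configuration string into a fixed-size slot appended to a prebuilt stub. The stub bytes, slot size, and configuration fields are all invented for illustration and do not describe any real builder. The point is only that two builds differ in their file hash while their core code stays byte-for-byte identical.

```python
import hashlib

# Hypothetical layout: precompiled core code followed by a fixed-size config slot.
CORE_CODE = b"\x55\x8b\xec" + b"CORE-LOGIC" * 8  # stand-in for the shared compiled code
CONFIG_SLOT_SIZE = 64                            # assumed slot size, for illustration only


def build_sample(server: str, campaign_id: str) -> bytes:
    """Emit a sample by filling the config slot; the core code is never recompiled."""
    config = f"{server}|{campaign_id}".encode()
    if len(config) > CONFIG_SLOT_SIZE:
        raise ValueError("config does not fit in the slot")
    return CORE_CODE + config.ljust(CONFIG_SLOT_SIZE, b"\x00")


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Two builds with different configurations (documentation-range IP addresses).
sample_a = build_sample("198.51.100.7:443", "campaign-01")
sample_b = build_sample("203.0.113.9:8080", "campaign-02")

same_hash = sha256(sample_a) == sha256(sample_b)
same_core = sample_a[:len(CORE_CODE)] == sample_b[:len(CORE_CODE)]
print(same_hash, same_core)  # False True
```

A hash-based blocklist would treat these as two unrelated files, whereas a comparison of the core code places them in the same family.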
Why does family identification matter?
How does knowing that a particular malware sample is part of a known family help mitigate it? Family identification provides important context for a security researcher, helping to accelerate the recognition and mitigation of a known strain. Williams recommends drawing on rules from free rule-sharing groups, such as YARA rules, but warns about the potential for false positives, since the classification is only as good as the rules.
As VMRay’s lead security analyst, Tamas spends the majority of his time analyzing malware samples and points out that, “even though a sample statically might be very different, the behavior is still almost the same for the same family. So if I see the same sandbox report every day, then I know that the samples are part of a family.”
While certain malware samples can look very different statically, at the assembly or machine code level, their behavior at runtime is virtually identical. Williams offers an analogy of how a Special Forces team might use different camouflage uniforms depending on the environment they are working in. They might appear superficially different in, say, a desert setting versus a woodland setting, but regardless of what they are wearing on the outside, they still execute their orders the same way.
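One simple way to express "the same sandbox report every day" is to compare the sets of behaviors observed in two runs. The sketch below uses Jaccard similarity over behavior strings. The behavior labels and the 0.5 threshold are hypothetical, chosen only for illustration, and this is not a description of how any particular sandbox scores similarity.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared behaviors divided by all observed behaviors."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical behavior summaries extracted from three sandbox reports.
monday = {
    "inject:explorer.exe",
    "persist:run-key",
    "mutex:example-mutex",
    "http-post:/gate.php",
}
tuesday = {
    "inject:explorer.exe",
    "persist:run-key",
    "mutex:example-mutex",
    "http-post:/panel.php",   # only the URL path changed between builds
}
unrelated = {
    "encrypt:user-documents",
    "drop:ransom-note.txt",
    "http-post:/gate.php",
}

SAME_FAMILY_THRESHOLD = 0.5  # assumed cut-off, would need tuning on real data

likely_same_family = jaccard(monday, tuesday) >= SAME_FAMILY_THRESHOLD
likely_unrelated = jaccard(monday, unrelated) < SAME_FAMILY_THRESHOLD
print(jaccard(monday, tuesday), likely_same_family, likely_unrelated)
```

Here the two daily samples share three of five distinct behaviors, while the unrelated sample shares only one of six with Monday's report.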
Identifying a particular strain as part of a family can also greatly accelerate the mitigation process for incident responders. As Williams explains, “I need to understand more about what the attacker is doing with the malware. And if I understand that it’s part of a known family, I can start to break down some of the techniques and tradecraft that the attackers are using [and] how they have been used previously.”
Malware family identification in the wild
Williams goes on to show how attributing a malware strain to a known family can provide a number of valuable insights. For instance, with certain strains of ransomware, he explains, “family identification helps me evaluate the likelihood that paying is going to get decryption keys released… if I can identify the family and I can identify the reputation of that group, right, that’s going to help me a lot.”
To show how rapid family identification can improve the efficiency and responsiveness for an incident response team, Williams offers up three different case studies:
1) Plug-X, which when first launched was used almost exclusively by nation-state actors such as Chinese state hackers, but was later adopted by other groups once its Builders were leaked to the public;
2) Emotet, which is traditionally thought of as a banking trojan but now shows up as an infection vector for other malware strains (and often indicates more APT-styled follow-on actions);
3) Dharma Ransomware, one of the first strains to be classified as a Ransomware-as-a-Service.
As he closes his section of the webinar, Williams summarizes by saying that identifying malware as part of a family “gives practitioners a more complete picture of the incident. Remember, malware is a tool. Like a hammer in my hands, I can swing it and I can break stuff. But in a carpenter’s hands, it becomes a completely different tool.”
In the second half of the webinar, Tamas Boczan, VMRay’s Sr. Threat Researcher, dives deeper into how a sandbox can be used to identify malware families and then provides examples from his research to demonstrate how these practices are applied to actual samples.
Static vs. Dynamic Analysis
The first intuition with family classification is often to run static analysis only, or to upload the sample to VirusTotal to see how antivirus vendors label it. This method rarely results in the correct family classification. The labeling that static scans provide for new samples is inaccurate, because the primary goal of a static antivirus scan is only to decide whether the sample is malicious. The antivirus engines extract static properties of the file and perform a very short emulation run. Based on the data gathered this way, they apply heuristics to mark the sample as either malicious or clean.
These heuristics are often enough to make a binary decision, but they are not capable of correctly identifying the family. To achieve family identification, the unpacked payload is necessary. This is where dynamic analysis comes in.
Using a debugger, the analyst can extract the payload and then run signatures on it. At this point, the signature engine actually has the data that it needs to make a correct decision. An automated way to extract the payload is the memory dumping feature of sandboxes. As Tamas explains, “Doing this in an automated way however is not simple at all which is why sandboxes should be considered for unpacking. It simplifies the process so much and automates a normally very manual process.”
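The difference between scanning the packed file and scanning the unpacked payload can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the single-byte XOR "packer", the family names, and the byte patterns are invented, and the plain substring matching stands in for a real signature engine rather than reproducing YARA or any sandbox's implementation. The "memory dump" is simulated by simply reversing the toy packer.

```python
XOR_KEY = 0x5A          # hypothetical single-byte packer key
STUB = b"STUB"          # stand-in for the unpacking stub at the start of the file

# Hypothetical family signatures: every pattern must be present for a match.
SIGNATURES = {
    "ExampleFamilyA": [b"EXAMPLE_A_CONFIG", b"example-a-mutex"],
    "ExampleFamilyB": [b"EXAMPLE_B_LOADER"],
}


def pack(payload: bytes) -> bytes:
    """Toy packer: the payload is XOR-encoded and stored behind a stub."""
    return STUB + bytes(b ^ XOR_KEY for b in payload)


def simulate_memory_dump(packed: bytes) -> bytes:
    """Stand-in for a sandbox memory dump: the payload as it exists once unpacked."""
    return bytes(b ^ XOR_KEY for b in packed[len(STUB):])


def match_families(data: bytes) -> list:
    """Return the families whose patterns all occur in the given bytes."""
    return [
        family
        for family, patterns in SIGNATURES.items()
        if all(pattern in data for pattern in patterns)
    ]


payload = b"header EXAMPLE_A_CONFIG body example-a-mutex trailer"
packed_file = pack(payload)

static_hits = match_families(packed_file)                       # scan the file on disk
dump_hits = match_families(simulate_memory_dump(packed_file))   # scan the unpacked payload
print(static_hits, dump_hits)  # [] ['ExampleFamilyA']
```

The same signatures that find nothing in the packed file identify the family once they are run against the unpacked payload, which is the data the signature engine actually needs.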
In the webcast, Tamas shows how signatures in memory dumps are used to identify Emotet and to determine which modules this specific Emotet sample was using.
Benefits of hypervisor-based monitoring
Tamas goes on to explain some of the advantages afforded by VMRay’s hypervisor-based monitoring. Most other sandboxes are based on API hooking: they monitor calls to a small subset of the Windows API and have to ignore everything else. Because the overhead of each hooked function is substantial, they are forced to keep the list of monitored API functions small. This is not the case with VMRay. Its hypervisor-based monitoring approach can monitor the API function calls made by the sample without being restricted to a small, predefined subset.
As Tamas explains, these API calls don’t result in file creations or registry operations, but they expose key internal behavior that would otherwise be hidden. In one example, he shows how a simple internal string operation reveals the message before the sample encrypts and transmits it.
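The visibility gap described above can be modeled with a short Python sketch. It does not show how hooking or hypervisor-based monitoring is implemented. It only filters a hypothetical call trace the way a monitor limited to a small hook list would, and compares that with an unrestricted view. The trace contents and the hook list are invented for illustration.

```python
# Hypothetical call trace of a sample: (function name, argument summary).
TRACE = [
    ("lstrcatA", "id=1234&host=WORKSTATION-7"),  # message assembled in plaintext
    ("CryptEncrypt", "<76 bytes>"),              # message encrypted in memory
    ("send", "<encrypted buffer>"),              # only ciphertext leaves the host
]

# Assumed hook list of a sandbox that monitors only a small API subset.
HOOKED = {"CreateFileW", "RegSetValueExW", "send"}


def visible_calls(trace, monitored=None):
    """Return the calls a monitor would record; monitored=None means no subset limit."""
    return [call for call in trace if monitored is None or call[0] in monitored]


hook_view = visible_calls(TRACE, HOOKED)
full_view = visible_calls(TRACE)

plaintext_in_hook_view = any(name == "lstrcatA" for name, _ in hook_view)
plaintext_in_full_view = any(name == "lstrcatA" for name, _ in full_view)
print(len(hook_view), len(full_view), plaintext_in_hook_view, plaintext_in_full_view)
```

In this model the restricted view records only the network send of an already encrypted buffer, while the unrestricted view also records the string operation that exposes the message before encryption.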
Identifying families via behavioral techniques
Because most of the code of the malware is shared between samples of each family, executing the samples and closely observing details of their behavior is a powerful tool for identifying families.
As Tamas explains, “By digging deeper into what type of malicious behavior it is, like what processes were injected or if the malware is stealing cached passwords from 67 applications and some of these 67 are unique to this malware family, then those can also be used to identify a particular family.”
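The idea that some targeted applications are unique to a family can be sketched as a set comparison. The family names and application names below are placeholders, not real data. The sketch first computes which targets are distinctive for each family in a small, assumed knowledge base, then checks which families are suggested by the targets observed in a sandbox run.

```python
# Hypothetical knowledge base: applications each family steals credentials from.
FAMILY_TARGETS = {
    "StealerFamilyX": {"browser_a", "ftp_client_b", "mail_client_c", "niche_ftp_tool"},
    "StealerFamilyY": {"browser_a", "mail_client_c", "vpn_client_d"},
}


def distinctive_targets(family_targets: dict) -> dict:
    """Map each family to the targets that no other family in the table touches."""
    result = {}
    for family, targets in family_targets.items():
        others = set().union(
            *(t for name, t in family_targets.items() if name != family)
        )
        result[family] = targets - others
    return result


def candidate_families(observed: set, family_targets: dict) -> list:
    """Families for which at least one distinctive target was observed."""
    distinctive = distinctive_targets(family_targets)
    return sorted(
        family for family, unique in distinctive.items() if unique & observed
    )


# Targets observed in a (hypothetical) sandbox run.
observed = {"browser_a", "niche_ftp_tool"}
print(candidate_families(observed, FAMILY_TARGETS))        # ['StealerFamilyX']
print(candidate_families({"browser_a"}, FAMILY_TARGETS))   # []
```

A commonly targeted application such as a popular browser says little on its own, so only the distinctive targets are used as evidence for a family.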
Tamas concludes his portion by summarizing the three reasons why you would want to classify samples into families:
1) Speed up analysis. If a sandbox can provide an analyst with the family, this greatly accelerates all other aspects of remediation.
2) Provide a deeper understanding of the threat. If an analyst wants to take a deeper dive or reverse engineer a sample, the family provides a good starting point.
3) Provide a broader understanding of the threat landscape. Understanding which types of malware families are targeting an organization helps security teams better prepare for future threats.
To learn more about Malware Family Identification, view the full webcast: Family Matters: Practical Malware Family Identification for Incident Responders