In the era of Big Data scalability is always a key concern. Simply throwing hardware at the problem isn’t enough. If the software architecture can’t fully take advantage of the available bandwidth and compute power, bottlenecks remain. One of VMRay Analyzer’s main advantages is our agentless hypervisor-based approach, allowing substantially faster analysis for a given sample than earlier emulation and hooking approaches.
With this as a starting point, we’ve written up here some of the work we’ve done at VMRay to scale up VMRay Analyzer to leverage the bandwidth and hardware available and support the most high-volume analysis environments. This work is part of our latest release V 1.9 and this post is the first in a series of what’s new.
To begin, we studied loads generated by typical samples and analysis output, identified bottlenecks and addressed them in our architecture.
The Analysis and Report Generator of VMRay Analyzer has excellent performance and low resources requirements as a result of our agentless hypervisor-based approach. This allows VMRay Analyzer to deliver more analyses for a given number of target virtual machines (VMs) than other approaches.
To be able to run many target VMs in parallel, VMRay has a Server-Worker model. A single VMRay Worker can support 120 or more VMs running in parallel, assuming sufficient hardware.
To use a war analogy, troops and weapons are important, but battles are won and lost by the Quartermaster and the supply chain. In this analogy, VMRay’s ability to spin up and support hundreds of VMs from a single Worker is a high-bandwidth supply chain, ensuring adequate resources are available.
Optimizing the Server-Worker System
The VMRay Server-Worker system has several tasks:
Server accepts samples submitted by the user
Server pushes sample out to available Worker
Worker runs sample in the target machine
Worker runs Analysis and Report Generator against running sample
The analysis results are sent to the Server
Server presents the malware analysis report to the user.
Communication between the Server and the User can be either through the Web UI or via the API. In V 1.9 we have implemented REST/JSON support in the API and will talk about that more in a future blog post.
The Network Bottleneck
So far so good. In our testing, we found, as expected, that the network itself was the bottleneck. The communication between worker and server was entirely over HTTP.  In ‘small’ configurations with as many as 100 parallel instances (that is, target virtual machines) running, this performed well. However, above this size, this approach reaches its limits and performance degrades.
As our system is Linux-based and is intended for the Server and Worker to run on the same subnet, it was trivial to add NFS (Network File System) support. In addition, by leveraging the advantages of Network Attached Storage (NAS) for file transfers, this enabled us to exploit the full capability of the network. In this scenario files themselves are transferred via NFS, while the service messages are still transferred over HTTPS.
A look at typical payloads
We evaluated the sample and analysis sizes contained in our database and found that samples have a mean size of 2MB and analysis results a mean size of 20 MB.
The analysis results are a zipped archive comprising:
An HTML report
Extracted files
Screenshots
Network data (pcap) and transcripts of system activity.
Memory dumps
Behavior and function logs
And more…
If desired, some of these files can be filtered out by a script run by the user, initiated before file transfer.
Pipelining
A major performance gain was achieved by parallelizing the upload. We did this by implementing a pipeline, conceptually similar to CPU instruction pipelining . As the diagram shows, the pipeline works in two steps. While the current analysis is uploaded to the server or NAS, the subsequent sample can be analyzed in parallel.
Testing and Results
For network testing we implemented a special test mode where we simulated Analysis and Report Generation for 120 seconds delivering a 20 Mb randomized file. With NFS Protocol enabled, we were able to attach 500 Worker instances to one Server and one NFS server (a separate physical machine), all running on low cost hardware.
Â
We show here two graphs to illustrate the advantage of the pipeline approach. In the first graph, the number of parallel running VMs holds fairly steady at 5oo instances and the upload peaks at the 80 second mark. However, the pipeline has a time buffer of up to 120 seconds under the given assumptions.
The second graph shows the network throughput at the NAS server. The upper red line shows the traffic caused by the uploaded analysis archives and the lower blue line shows the traffic caused by the downloaded samples. The maximum network capacity of slightly more than 100MB/s isn’t reached, leaving room to increase the amount of VM instances.
These results were achieved with default settings. Performance can be further optimized by filtering out extraneous data as mentioned previously.
An extra boost from push-based notification
A last change we implemented as part of this architecture is to support push-based notifications in addition to the default pull-based notifications. If you want to be notified when jobs are finished, then the simplest approach is to periodically query the REST API for when a job converts into a finished analysis or an error. For large scale deployment, this approach can be a bottleneck as it can create a lot of unnecessary API requests. Instead of a pull-based approach, VMRay also provides the opportunity to implement a push-based approach. You can provide a custom program(script) that is executed each time a job has finished. This script can then forward this information using an arbitrary message broker (like RabbitMQ or ActiveMQ) to trigger any callback you want.
Scalability – Looking ahead
Moore’s law applies broadly not just to CPU power, but virtually all aspects of hardware and network infrastructure – memory, storage, bandwidth. We can reasonably expect that the physical infrastructure available will continue to grow exponentially in its capacity to support higher throughput. This is fortuitous as we can similarly expect the same kind of growth in unique malware samples produced that will need analyzing. With this architecture, we’ve taken an important step forward to support the scalability required for the future.