I have always had an interest in collecting things, and given my field, that naturally led me to collecting malware samples and learning how to cluster them.

This effort started with building a malware indexing system and a malware scraper. For indexing I chose Viper (which can be found here), since I have a decent amount of familiarity with it. On top of this I wrote a malware collector that runs as a cron job, querying various sites for new samples and tagging them in Viper with assorted metadata.

So far I have collected around 15,000 samples, which got me thinking about the relationships between them. Running 15,000 samples through antivirus to get a sense of what they are isn't really an option when services like Hybrid Analysis and VirusTotal rate-limit API requests, so I had to come up with another approach. I decided to test out some clustering ideas I had, and the results were quite interesting.

At the time I was reading Malware Data Science (which I highly recommend), and it got me started on my clustering tooling. The basic process of clustering is to reduce each malware sample to a set of features. These features could be anything from strings all the way to control flow graphs. Once the features have been extracted, each sample is compared against all the others. This is the premise behind all malware clustering techniques. There are various shortcuts and optimizations that can be made, but those will not be covered in this post.

I decided to use strings with an n-gram approach for clustering, as that would be relatively easy to implement. It also lets the user tune n to adjust how strict the similarity index is. This is the result of my initial clustering.

Results from strings n-gram clustering

I will write a separate post going over the results of this test and explaining what they are exactly.
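As a point of reference, string n-gram extraction could look something like the sketch below. Note that char_ngrams is a hypothetical helper for illustration; the script later in this post compares whole strings rather than n-grams.

```python
# Hypothetical helper: break each extracted string into overlapping
# character n-grams, so near-identical strings still share features.
def char_ngrams(strings, n=4):
    grams = set()
    for s in strings:
        for i in range(len(s) - n + 1):
            grams.add(s[i:i + n])
    return grams

features = char_ngrams({"LoadLibraryA", "kernel32"}, n=4)
```

Smaller values of n make samples look more alike; larger values make the comparison stricter.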

The Process

Once you have a decently sized malware collection (you can get samples from VirusShare), you are ready to cluster some malware. As stated earlier, the general process is as follows:

  1. Define the features of your malware; these could be anything from strings to opcodes
  2. Extract those features and store them in a data structure like a Python dictionary for quick lookups
  3. Compare each sample against all the others
  4. Visualize the graph with a tool like fdp or sfdp

There are a couple of caveats when it comes to clustering, though. The pairwise comparison step requires n(n-1)/2 comparisons, so after a couple thousand samples the system will start to bottleneck. Another thing to note is that this approach cannot handle packed malware; if someone were able to programmatically unpack all malware samples, it would be far more effective.
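To get a feel for why this bottlenecks, here is a quick back-of-the-envelope calculation of the comparison count (just arithmetic, not part of the clustering script):

```python
# Pairwise comparisons needed for n samples: n * (n - 1) / 2.
def comparisons(n):
    return n * (n - 1) // 2

print(comparisons(1000))   # 499500
print(comparisons(15000))  # 112492500 -- over 112 million comparisons
```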

The Code

All the code for this post is on my GitHub (located here), although I encourage you to follow along rather than just taking the code.

Define your commandline arguments:

import argparse

parser = argparse.ArgumentParser(
	description="Identify similarities between malware samples and build similarity graph"
)
parser.add_argument(
	"--malware_directory", required=True,
	help="Directory containing malware"
)
parser.add_argument(
	"--output_dot_file", required=True,
	help="Where to save the output graph DOT file"
)
parser.add_argument(
	"--jaccard_index_threshold", "-j", dest="threshold", type=float,
	default=0.8, help="Threshold above which to create an 'edge' between samples"
)
args = parser.parse_args()

Define some helper functions:

import os

# Get all strings from a PE
def get_strings(fullpath):
	strings = os.popen("strings '{0}'".format(fullpath)).read()
	strings = set(strings.split("\n"))
	return strings

# Check for the MZ header
def check_pe(fullpath):
	with open(fullpath, "rb") as f:
		return f.read(2) == b"MZ"

get_strings is how we get our features for each malware sample. check_pe is just a validation check to make sure that the file we are looking at is actually a PE.

Define your similarity function that will return a value between 0 and 1, where 1 is 100% similar and 0 is 0% similar.

# Get the similarity index between the 2 feature sets
def jaccard(set1, set2):
	intersection = set1.intersection(set2)
	intersection_len = float(len(intersection))
	union = set1.union(set2)
	union_len = float(len(union))
	return intersection_len / union_len

For a similarity function to be acceptable in this use case, it must:

  1. Return a normalized value (a value between 0 and 1)
  2. Be performant

For this case I decided to use the Jaccard index, since it is a natural fit for set-based features and easy to reason about, though it is quite slow on its own. Jaccard works as follows: similarity index = shared items / total items, or |A ∩ B| / |A ∪ B|, which is exactly what the function above computes. Python's built-in set type provides the intersection and union operations used here.
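As a quick worked example with made-up string sets (the strings here are illustrative, not from real samples):

```python
# Two hypothetical samples' string sets, with three strings in common.
a = {"kernel32.dll", "CreateFileA", "WriteFile", "ExitProcess"}
b = {"kernel32.dll", "CreateFileA", "ReadFile", "ExitProcess"}

index = len(a & b) / len(a | b)  # 3 shared strings out of 5 total
print(index)  # 0.6
```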

Now define your main function.

import itertools
import networkx
from networkx.drawing.nx_pydot import write_dot

def main():
	malware_paths = []
	malware_features = dict()
	graph = networkx.Graph()

	for root, dirs, paths in os.walk(args.malware_directory):
		for path in paths:
			full_path = os.path.join(root, path)
			malware_paths.append(full_path)

	malware_paths = list(filter(check_pe, malware_paths))

	for path in malware_paths:
		features = get_strings(path)
		print("extracted {0} features from {1}".format(len(features), path))
		malware_features[path] = features

		graph.add_node(path, label=os.path.split(path)[-1][:10])

	for malware1, malware2 in itertools.combinations(malware_paths, 2):
		jaccard_index = jaccard(malware_features[malware1], malware_features[malware2])

		if jaccard_index > args.threshold:
			print(malware1, malware2, jaccard_index)
			graph.add_edge(malware1, malware2, penwidth=1+(jaccard_index-args.threshold)*10)

	write_dot(graph, args.output_dot_file)

Once this is all put together you can invoke the script with: python simgraph.py --malware_directory malware_samples --output_dot_file clustered_samples.dot. This doesn't generate a picture right off the bat; instead it creates a .dot file that holds all the mappings between the samples.

When the .dot file is generated you can use a tool like fdp or sfdp to produce a visualization: fdp clustered_samples.dot -T png -o clustered_samples.png -Goverlap=false. This takes the input dot file and renders a .png without overlap. There are other options you can pass to graph rendering tools to tweak the resulting visualizations. One thing to note is that different visualization tools have different pros and cons: some are faster than others, and some lay out the clusters in a more digestible manner. The picture linked above was generated with the fdp command shown here.

Of course, all of this is a very basic example of clustering that isn't very optimized. Techniques like MinHash and converting your feature sets to bit vectors can greatly speed up the analysis, but those are beyond the scope of this post.
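As a taste of what that optimization looks like, here is a minimal MinHash sketch in pure Python. This is a toy illustration, not a production implementation; libraries like datasketch do this properly.

```python
import random

def minhash_signature(features, num_hashes=64, seed=1234):
    # Simulate num_hashes hash functions by salting Python's hash();
    # the signature keeps the minimum value per salt across all features.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, f)) for f in features) for salt in salts]

def estimate_jaccard(sig1, sig2):
    # The fraction of matching signature slots approximates the Jaccard index.
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)
```

Comparing fixed-length signatures instead of full string sets makes each pairwise comparison proportional to num_hashes rather than to the size of the feature sets.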


I hope you gained a better understanding of how basic malware clustering works in the field. As the first post on this blog, this is mostly an exercise in learning the platform and its features. While the blog will occasionally cover topics such as malware clustering and large-scale analysis, the majority of posts will focus on malware families and how to look at a family rather than just a single sample. To kick off the analyses, I will be releasing a post on Shamoon in the near future.

All feedback is welcome, as this is as much a learning opportunity for me as it is for the readers.