Python for Malware Analysis – Getting Started

Introduction

Improving your Python programming skills is likely on your to-do list – just like cleaning your closet, painting that wall, or tightening that loose screw (you know which one I’m talking about).

Scripting, in general, is a useful skill to have across most security disciplines. Writing a script can help you automate a menial task, scale your analysis to large volumes of data, and share your work.

Although there are multiple programming languages to choose from, Python is a popular choice in the security community because, among other reasons, it is cross-platform and relatively easy to read and write. Many existing open-source security tools are also written in Python, so learning this language helps you better understand existing capabilities.

This blog post introduces Python programming for Portable Executable (PE) file analysis. In this context, a script can enable you to quickly parse an individual file and extract key characteristics, or scale that activity across numerous files to help prioritize work.

Note that this post assumes the reader has had some basic exposure to Python and programming concepts.

Learning Python

With some basic programming skills, it’s possible to improve your knowledge of Python by reviewing existing code and making changes as needed. While tweaking code may yield the desired results in some cases, many readers will benefit from a more formal introduction to the language. A quick online search will reveal many freely available written and video Python tutorials. For a structured, interactive introduction, I recommend Codecademy. If you’re available for a more rigorous, immersive Python learning experience, consider the SANS SEC573 “Automating Information Security with Python” course (full disclosure, I’m a SANS Certified Instructor).

Existing Tools

There are many Python-based malware analysis tools you can use today. Below are just a few that I find helpful for static file analysis:

These tools produce useful output and serve as excellent starting points for understanding Python. By simply viewing the source code and performing research as necessary, you can learn from what the authors wrote and modify the code to serve your own purpose. However, as you build experience in technical analysis, you will likely encounter scenarios where existing tools do not meet your needs, and a customized solution must be developed. Rest assured, these cases do not require you to write code from scratch. Instead, you can rely upon existing Python libraries to extract data and manipulate output in a way specific to your needs.

A popular, long-standing library for PE file analysis is aptly called pefile. This module provides easy access to the structure of a portable executable. Another fairly recent and more versatile cross-platform library is called Library to Instrument Executable Formats (LIEF), and it includes a Python module for PE file analysis (documented here).

This blog post will focus on using Python 2 and pefile for file analysis. Note that pefile is a third-party module, not one that is built into a standard Python install. As a result, you may have to install it first; try running pip install pefile.

Exploring pefile

For our environment, we will use the REMnux malware analysis Linux distribution, which you can download here. We begin by launching the Python interactive shell to explore the pefile module and write some initial code. Rather than diving straight into creating a script, the interactive shell is a great way to learn about available modules and perform quick testing. Simply type python at the terminal and you’ll see a prompt similar to the following:

[Image: python]

Next, import pefile to make use of its functionality:

[Image: import_pefile.jpg]

Let’s explore this module by viewing its help information. Type help(pefile). Below is an excerpt of the output.

[Image: help_pefile.jpg]

In addition to an overview of the module, we see a description of classes contained within the module. Scrolling down provides information about each class. For now, we will only focus on the PE class:

[Image: class_PE]

The description tells us that this class will give us access to the structure of a PE file, which is precisely what we need for our Windows file analysis. The output also explains how to create an instance of the PE class. Let’s read in a file for testing. For this post, we’ll use an Emotet sample.

[Image: pefile_pe]

We can return to the help menu to read more about the methods and attributes of the PE class. Alternatively, we can view a summary of this information by typing dir(pefile.PE). An excerpt of this output is below.

[Image: help_pefilepe]

There is a lot of text here, and much of it may not make sense depending on your prior exposure to PE file analysis. However, let’s look for some basic terms we may recognize. We see references to multiple methods beginning with “get_” that are helpful for collecting some basic static information about a file. For example, get_imphash() returns the import hash (imphash), an MD5 hash calculated over the names of the DLLs and functions listed in the file’s import table. Let’s give this a try using our file instance.

[Image: file_instance]

The get_imphash() method worked as expected, providing the file’s import table hash.
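To make the imphash less of a black box, the sketch below approximates the calculation for imports made by name: each DLL name is lowercased and stripped of a .dll, .ocx, or .sys extension, joined to the lowercased function name, and the resulting comma-separated string is hashed with MD5. This is a simplified illustration written for Python 3, not pefile's own implementation. It ignores imports by ordinal, which pefile resolves through a lookup table, and the helper name `simple_imphash` and its input format are assumptions made for this example.

```python
import hashlib

# Extensions dropped from the library name before hashing.
STRIPPED_EXTENSIONS = ("dll", "ocx", "sys")


def simple_imphash(imports):
    """Approximate an imphash from (dll_name, function_name) pairs.

    Hypothetical helper for illustration: it handles imports by name only,
    and the order of the pairs matters because it feeds the hash.
    """
    parts = []
    for dll_name, func_name in imports:
        lib = dll_name.lower()
        base, _, ext = lib.rpartition(".")
        if base and ext in STRIPPED_EXTENSIONS:
            lib = base
        parts.append("%s.%s" % (lib, func_name.lower()))
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


# Illustrative import list; a real one would come from the parsed file.
example = [("KERNEL32.dll", "CreateFileA"), ("KERNEL32.dll", "ReadFile")]
print(simple_imphash(example))
```

Because the hash depends on the order of the imports, two files that import the same functions in a different order will generally produce different values.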

Another “get_” method I find valuable is get_warnings(). When pefile parses a Windows executable, it may encounter errors along the way. The get_warnings() method returns a list of warnings generated as the PE file is processed. Security analysis is all about investigating anomalies, so this output can reveal useful starting points for further review. For example, this method’s output may indicate the file is obfuscated, even if the specific packer cannot be identified by common tools that look for packer signatures (e.g., Exeinfo PE or PEiD). In this particular case, however, the method returned no warnings:

[Image: get_warnings.jpg]
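Both calls can be wrapped in a small helper so the same checks run on any sample. The sketch below assumes Python 3. The names `quick_triage`, `triage_file`, and `FakePE` are invented for this example, and the stand-in class exists only so the snippet runs without pefile or a sample on disk.

```python
def quick_triage(pe):
    """Return the imphash and parser warnings for a pefile.PE-like object."""
    return {
        "imphash": pe.get_imphash(),
        "warnings": list(pe.get_warnings()),
    }


def triage_file(path):
    """Parse a file on disk with pefile and summarize it."""
    import pefile  # third-party; imported lazily so the demo below runs without it

    return quick_triage(pefile.PE(path))


class FakePE(object):
    """Stand-in with the two methods used above; not part of pefile."""

    def get_imphash(self):
        return "0123456789abcdef0123456789abcdef"  # placeholder, not a real hash

    def get_warnings(self):
        return ["placeholder warning text"]


summary = quick_triage(FakePE())
print("imphash: %s" % summary["imphash"])
for warning in summary["warnings"]:
    print("warning: %s" % warning)
```

With pefile installed, calling `triage_file()` with the path to a sample returns the same dictionary for a real file.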

Let’s continue our journey with pefile and extract other static information often reviewed during initial malware analysis. For example, how can we use pefile to understand which DLLs and functions are imported by this executable? To answer this question, we will again use the built-in help() system with some old-fashioned trial and error. This methodology can be used with any well-documented Python module.

First, let’s review our options by learning more about the PE class. We can type help(pefile.PE) and scroll through the output. An excerpt of interest is below:

[Image: sections.jpg]

We see references to many “DIRECTORY_ENTRY_” attributes, which point to the location of key file components. Since we’re interested in imports, we will focus on DIRECTORY_ENTRY_IMPORT, which is described as a list of ImportDescData instances. Let’s begin by iterating through this list to see what information it provides:

[Image: item.jpg]

Just as the help output specified, we see a list of ImportDescData objects. What do these objects represent? We will return to help again and type help(pefile.ImportDescData):

[Image: ImportDescData.jpg]

As shown above, this structure contains the name of the DLL and a list of imported symbols. This sounds like the information we need. Let’s again iterate to confirm:

[Image: ImportData.jpg]

We’re making progress, but we have a new structure to investigate. We type help(pefile.ImportData):

[Image: ImportData2]

For now, we will just focus on imports by name, so the name attribute should have the information we need. Let’s incorporate this into our code and make the output a bit more readable.

[Image: imports.jpg]

Success! This code provided us with the name of an imported DLL and its corresponding imported function names. We could make this output more elegant, but the information we need is here.

Scaling

As discussed in the Introduction, automating work with a script enables you to scale a task across a larger volume of data. The individual file analysis performed above has its place, but if your day-to-day job involves malware analysis, you may have hundreds or thousands of files to sift through before choosing one for closer review. In these scenarios, extracting key information from all files allows you to group and prioritize samples for more efficient analysis.

Let’s again consider a file’s imphash. Across a large number of samples, grouping by imphash makes it easier to identify similar functionality or a common packer/packaging tool used to generate the binary. To explore this idea, we will write a small script to extract the imphash from a directory of files. The code should accomplish the following tasks:

  1. Create a list of all files in the directory (full path).
  2. Open an XLSX file for writing (I often use Excel for easy viewing/sorting, but you can certainly output to CSV or, even better, write this information to a database).
  3. Calculate and write each file’s sha256 hash and imphash to the XLSX file.
  4. Autofilter the data.

Below is one way to approach these tasks.

#!/usr/bin/env python
import sys,os
import pefile
import hashlib
import xlsxwriter

if __name__ == "__main__":

	#Identify specified folder with suspect files
	dir_path = sys.argv[1]

	#Create a list of files with full path
	file_list = []
	for folder, subfolder, files in os.walk(dir_path):
		for f in files:
			full_path = os.path.join(folder, f)
			file_list.append(full_path)

	#Open XLSX file for writing
	file_name = "pefull_output.xlsx"
	workbook = xlsxwriter.Workbook(file_name)
	bold = workbook.add_format({'bold':True})
	worksheet = workbook.add_worksheet()

	#Write column headings
	row = 0
	worksheet.write('A1', 'SHA256', bold)
	worksheet.write('B1', 'Imphash', bold)
	row += 1

	#Iterate through file_list to calculate imphash and sha256 file hash
	for item in file_list:

		#Get sha256
		with open(item, "rb") as fh:
			data = fh.read()
		sha256 = hashlib.sha256(data).hexdigest()

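		#Get import table hash (pefile raises PEFormatError for non-PE files; leave the cell blank)
		try:
			pe = pefile.PE(item)
			ihash = pe.get_imphash()
		except pefile.PEFormatError:
			ihash = ""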

		#Write hashes to doc
		worksheet.write(row, 0, sha256)
		worksheet.write(row, 1, ihash)
		row += 1

	#Autofilter the xlsx file for easy viewing/sorting
	worksheet.autofilter(0, 0, row - 1, 1)
	workbook.close()

I titled the above script pe_stats.py and ran it against a directory named “suspect_files” with the command python pe_stats.py suspect_files. To populate the target directory, I downloaded 100 files with high detection counts from VirusTotal (specifically, I used the basic VirusTotal Intelligence query “type:peexe positives:50+”). An excerpt of the resulting data, when opened in Microsoft Excel, is below.

[Image: xlsx]

A quick glance at the first few rows immediately reveals a pattern in the imphash values. As a next step, perhaps you will investigate the largest cluster of import table hashes to understand why these groups of files have the same imphash. You may also revisit the pefile library documentation to explore additional static characteristics worth including in this spreadsheet. With more detail, this document could help you triage and prioritize samples for analysis. I leave these tasks to you for further exploration.
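As one possible starting point for that clustering step, the sketch below groups (SHA-256, imphash) pairs by imphash and lists the largest groups first. It uses only the standard library and assumes Python 3. The function names and the sample rows are placeholders invented for this example; in practice the rows would come from the same loop that fills the spreadsheet.

```python
from collections import defaultdict


def group_by_imphash(rows):
    """Map each imphash to the list of SHA-256 values that share it."""
    groups = defaultdict(list)
    for sha256, imphash in rows:
        groups[imphash].append(sha256)
    return groups


def largest_clusters(groups, top=3):
    """Return (imphash, members) pairs, biggest cluster first."""
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:top]


# Placeholder values, not real hashes.
rows = [("sha_a", "imp_1"), ("sha_b", "imp_2"), ("sha_c", "imp_1")]

for imphash, members in largest_clusters(group_by_imphash(rows)):
    print("%s  %d file(s)" % (imphash, len(members)))
```

Sorting by cluster size puts the most common imphash at the top, which is usually the first group worth opening in a disassembler or sandbox.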

Conclusion

This post provided an initial approach to analyzing PE files using Python. Most importantly, it walked through how to use the built-in Python help feature and some basic knowledge of PE files to systematically explore a file’s characteristics and then scale that process to a larger set of files.

If you would like to learn more about malware analysis strategies, join me at an upcoming SANS FOR610 course.