Author: asoni

SANS FOR610 Reverse-Engineering Malware – Now, with Ghidra

I’m excited to announce that the SANS FOR610 Reverse-Engineering Malware course I co-author with Lenny Zeltser now uses Ghidra for static code analysis. Ghidra is a free and open-source software (FOSS) reverse engineering platform developed by the National Security Agency (NSA). It has an active community of users and contributors, and we are optimistic about the future of this analysis tool. I found it an invaluable addition to my toolkit, as have many other malware analysts.

Ghidra includes a full-featured, visual disassembler. Moreover, it comes with a built-in decompiler, which provides a C representation of the disassembly. Decompiled output complements disassembly nicely, and this additional perspective can accelerate the malware analysis process. For example, let’s compare some disassembly (Figure 1) with the decompiled code (Figure 2):

Picture1

Figure 1: Disassembly Example

Picture2

Figure 2: Decompiled Code

Some aspects of the analysis benefit from the low-level insights that the disassembler provides. Other tasks are faster when looking at the decompiler’s output, which is easier to review and assess. When reverse-engineering malware, I found it helpful to switch between Ghidra’s disassembler and decompiler output.

Ghidra also supports scripts and plugins for extensibility, providing ample opportunity for analysts to automate their work as their reverse engineering skills grow with experience. In addition, Ghidra has multiple collaborative work features to support teamwork for complex analysis tasks. The built-in help menu is an excellent resource to learn more about these features and many more.

If you’re wondering how you might incorporate Ghidra into your toolkit, take a look at the walkthrough I published earlier as an Introduction to Code Analysis With Ghidra. For additional insights, view the 20-minute video I recorded to explain a typical analysis workflow with Ghidra:

I hope you’ll join me and other FOR610 instructors at an upcoming course to explore this impressive analysis framework and strengthen your reverse engineering skills.

-Anuj Soni


About the Author:
Anuj Soni is a Senior Threat Researcher at Cylance, where he performs malware research and reverse engineering. He is also a SANS Certified Instructor and co-author of the course FOR610:Reverse-Engineering Malware. If you would like to learn more about malware analysis strategies, join him at an upcoming SANS FOR610 course.

Intro to Cutter for Malware Analysis

Introduction

My last blog post described an intro to radare2 for malware analysis, so it is only fair that we also cover its GUI variant, Cutter.  This post will closely mirror the previous article to discuss Cutter and its usage. If you like the radare2 framework but find the command-line interface intimidating, Cutter may strike the right balance for you. Alternatively, if you are simply looking for a free graphical disassembler to perform malware analysis, Cutter is worth considering.

Installing Cutter

You can download the latest release of Cutter here. Then, simply extract it to a directory of your choosing.

I use a 64-bit Windows 10 virtual machine for my analysis, so I downloaded and ran the appropriate binary.  Specifically, I’m using the Windows VM we distribute in the SANS FOR610 Reverse Engineering Malware course, so you will see references to the “REM” user in screenshots.

Analyzing a File with Cutter

Loading a binary

The first time you launch Cutter, this dialog box appears:

InitialScreen

Beyond the typical version and About information, this window provides the opportunity to change the GUI’s theme. Clicking on the “Native Theme” dropdown allows the analyst to choose an alternative “Dark Theme.” Since dark modes/themes are all the rage these days, the latter theme is used in upcoming screenshots.

Next, we can specify the file to load into Cutter:

OpenFile

For this post, we will use the same GandCrab ransomware sample referenced in the last post. The sample is available here (password: malware). You can click Select to browse to the file or simply drag and drop it to this window. Then, click Open and review the Load Options:

LoadOptions

Notice that the “Analysis” checkbox is checked by default, indicating the binary will be preprocessed – this is in stark contrast to radare2, which requires the user to deliberately kick off any analysis. The sliding bar in the middle of the load options can be dragged left or right for less or more rigorous analysis, respectively. We’ll leave the default “aaa”, which generally performs a sufficient level of auto analysis. We will also leave other defaults untouched and press OK.

Once processing is complete, we see the initial window layout with a functions window and a disassembly window:

Loaded

As expected, Cutter brings us to the program’s entry point, 0x4044bb.

Static File Information

Before digging into code analysis, note that the Dashboard tab provides some high level information:

dashboard

While other static file analysis tools provide similar data, it is convenient to have this output readily available within Cutter.

Viewing imports

To view imported functionality, choose the “Imports” tab on the bottom:

Imports_Window

There are many APIs we could explore, but as in the last post, we will focus on CreateToolhelp32Snapshot (not shown above). This API is used to capture a snapshot of running processes on a system. Malware often uses this snapshot to enumerate running processes and identify specific process names. To find this imported function in the import list, we can use the “Quick Filter” on the bottom and begin typing the preferred API:

CreateTool_Search

Finding an API reference

Next, we will locate references to this API. First, double-click on the import above, which will take us to the entry in the Import Address Table (IAT). Next, right-click on the function name and choose “Show X-Refs” or simply hit “x” on the keyboard to view references:

CreateTool_xref.jpg

The x-refs window shows two CALL instructions, which represent instructions that call CreateToolhelp32Snapshot:

CreateTool_Ref1.jpg

Notice Cutter also conveniently includes a preview of the reference code with a single click on each row. We could explore both references, but for this post we will only jump to the second reference by double-clicking on it. This takes us to 0x004041e9, where we see a CALL to CreateToolhelp32Snapshot:

CreateTool_Code

Understanding the code

At this point, we can browse the disassembly to better understand the code. In the last post, we used radare2 commands to print summary information about the current function. One benefit of a GUI is avoiding the command-line interface, but if you’re open to launching commands on occasion, Cutter conveniently allows this by activating the console:

FOR710

A text console will appear at the bottom, giving us the opportunity to enter radare2 commands. For example, we could type pdf~call to print all CALL instructions referenced in the disassembled function:

pdfs

This output provides a nice summary of Windows API activity. Notice the APIs highlighted in red, which include CreateToolhelp32Snapshot, Process32First, lstrcmpiW, TerminateProcess, and Process32Next. As mentioned in the previous post on radare2, this progression of CALLs is often used to capture a snapshot of running processes, begin iterating through the list, compare process names to one or more predefined names and then terminate the process if a match is made. You may have noticed a list of process names the code is likely checking located above the CALL to CreateToolhelp32Snapshot:

process_names

Ransomware commonly checks for and terminates processes that access document files to maximize the number of files it can encrypt.

Another perspective on this code is provided via Cutter’s graph view. When viewing the code in the Disassembly view, hit the space bar to access this alternative view. Below is an excerpt of decision points from the current function:

decision_points

Focusing on the lstrcmpiW, OpenProcess and TerminateProcess CALLs in graph view provides additional insight into what happens if the program matches a process name against its predefined list. Specifically, if a string match occurs, the program will access the target process via OpenProcess and then terminate it. If a match is unsuccessful, execution will jump over the code that calls OpenProcess and TerminateProcess since those APIs are not needed.
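The loop described above can be sketched in Python without touching any Windows API. This is only a simulation of the control flow, not the sample's code: the process names and PIDs are invented for illustration, find_targets is a hypothetical helper, and casefold() stands in for the case-insensitive comparison that lstrcmpiW performs.

```python
def find_targets(snapshot, target_names):
    """Return PIDs of processes whose name matches a target, ignoring case.

    snapshot stands in for what CreateToolhelp32Snapshot, Process32First and
    Process32Next would yield: an iterable of (pid, process_name) pairs.
    """
    wanted = {name.casefold() for name in target_names}
    matches = []
    for pid, name in snapshot:           # Process32First / Process32Next
        if name.casefold() in wanted:    # lstrcmpiW reporting a match
            matches.append(pid)          # real code: OpenProcess, then TerminateProcess
        # no match: skip the OpenProcess/TerminateProcess branch and continue
    return matches


# Invented data for illustration only.
snapshot = [(4, "System"), (1200, "WINWORD.EXE"),
            (1344, "explorer.exe"), (2020, "sqlservr.exe")]
targets = ["winword.exe", "sqlservr.exe"]
print(find_targets(snapshot, targets))
```

The two branches in the graph view correspond to the if statement: the matching branch reaches the terminate step, and the other simply moves on to the next process.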

Closing Thoughts

This article mirrored the previous post on radare2 to provide an alternative (i.e., graphical) interface for using the radare2 framework. It’s tempting to stick with one tool and sometimes uncomfortable to try others. However, even brief exposure to alternatives could be eye-opening. In the best case, you absorb a new tool or approach into your RE arsenal. In the worst, you find even more reason to love your current tool of choice.

IDA has been the gold-standard disassembler for malware analysis, but its competitors are maturing rapidly. Analysts have more options, and this means tool developers and contributors have excellent incentives to create the best tool at an affordable price. It is an exciting time to learn and perform malware analysis.

For more information on Cutter, I encourage you to explore these resources:

-Anuj Soni / @asoni


Intro to Radare2 for Malware Analysis

Introduction

In recent years, a variety of inexpensive or free disassemblers and debuggers have gained serious momentum, including radare2 (a.k.a. “r2”), Cutter (GUI for radare2), Binary Ninja, Hopper, and x64dbg. If you have a license for IDA Pro and are happy with the experience, you may have little reason to explore other options. However, if you are still in the early stages of your career in malware analysis, or you are working with a small budget, you may not have access to this relatively pricey product. Regardless of your background, disassembler preference, or budgetary restrictions, each tool listed above provides a different reverse engineering experience, and each is worth trying once. At the very least, a test drive can clarify your preferences and bring an appreciation for the tool(s) you choose to use.

This post focuses on an initial workflow for performing static code analysis using radare2. Specifically, it will cover how to load a PE file into radare2, identify an imported API of interest, find a reference to the API, view assembly at that location, and begin to assess the code’s purpose.

Radare2 is a project that contains multiple tools for binary analysis, and radare2 (yep, same name) is the primary tool that performs disassembling and debugging. It is command line driven, which may be daunting after extensive use of other disassemblers that provide a GUI. In fact, the official R2 book depicts the level of effort required to learn radare2 like this:

learning_curve

Source: https://radare.gitbooks.io/radare2book/content/first_steps/intro.html

Let there be no doubt – learning how to use radare2 is complicated. However, the best way to tackle a difficult task is to get started.

Installing Radare2

Radare2 binaries and source for a variety of operating systems are available here. I used a 64-bit Windows VM environment for my analysis, so I downloaded and ran the appropriate binary.  Specifically, I’m using the Windows VM we distribute in the SANS FOR610 Reverse Engineering Malware course, so you will see references to the “REM” user.

Analyzing a File with Radare2

Loading a binary

For this post, we will use a GandCrab ransomware sample. If you want to follow along, you can download the sample here (password: malware).

To load the file into radare2, simply type radare2 followed by the path to the file, as shown below.

load

We now have a radare2 shell waiting for additional commands. Notice the shell indicates we are at the address 0x004044bb, which is the entry point for this executable (more on that in a moment).

To navigate an executable within radare2, you will use text-based commands to initiate processing and query information. Along the way, using the question mark (“?”) will provide help about command options. For example, type a question mark ? and hit return. Below is an excerpt of the initial output.

question_mark.jpg

If you scroll down the output on your screen, you will find a reference to the i command:

i_option

The i command provides information about a file. For more detail on the type of information we can query, type i?:

i?.jpg

For example, typing ie will provide information about the executable’s entry point. Below is an excerpt of this command’s output.

ie

Notice the virtual address (“vaddr”) matches the address of our location within the radare2 shell, confirming that we currently reside at the executable’s entry point.

Initiating code analysis

To begin our code analysis with radare2, we must first kick off some automated analysis. Depending upon your prior exposure to radare2, you may be surprised to know that, by default, radare2 does not perform any analysis at startup. Other disassemblers and debuggers like IDA Pro and x64dbg will automatically analyze the binary to identify functions, code and data. The author of radare2 (pancake), however, takes a different approach. He details his case here, but the basic point is that radare2 aims to run on various platforms with varying levels of computing power, and it is capable of analyzing many different binary architectures. As a result, no single level of analysis suits every case, so it’s up to the analyst to determine what types of processing are relevant. While some may reject this approach, it forces the analyst to be more deliberate in their work. In fact, the entire radare2 experience reinforces this by requiring explicit commands to view information and navigate the code.

While we won’t discuss the details of all possible commands (see this resource), you can see options by typing aa? and aaa?. If you want to take a leap of faith and perform a variety of analyses against a file, I suggest using the aaa command. After executing this command, you will see a variety of output messages as radare2 analyzes the binary (excerpt below):

aaa

Viewing imports

Now that the initial auto analysis is complete, it’s time for us to manually navigate the code. One approach is to first find Windows API references that support malicious functionality. To view functions imported by the suspect binary, we can type ii:

ii.jpg

There are many APIs we could explore, but for this post, we will focus on CreateToolhelp32Snapshot (not shown above). This API is used to capture a snapshot of running processes on a system. Malware often uses this functionality to enumerate running processes and identify specific process names. To find this imported function in the import list, type ii~CreateToolhelp32Snapshot (the tilde searches the output of ii for the specified text):

ii_CreateTool
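For readers who think in Python, the tilde filter can be pictured as a line-by-line substring search over a command's output. The sketch below assumes plain substring matching; r2_grep is a hypothetical helper, and the sample text is made up rather than real ii output, which has more columns.

```python
def r2_grep(output, needle):
    """Keep only the lines of output that contain needle."""
    return [line for line in output.splitlines() if needle in line]


# Made-up text standing in for the output of ii.
sample_ii_output = "\n".join([
    "KERNEL32.dll Process32First",
    "KERNEL32.dll CreateToolhelp32Snapshot",
    "KERNEL32.dll TerminateProcess",
])

print(r2_grep(sample_ii_output, "CreateToolhelp32Snapshot"))
```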

Finding an API reference

Next, we want to locate references to this API. To query this information, we will type the letters a (analysis), x (cross references), and t (find references to the specified address), followed by the address of the imported function:

axt

In the output above, we see references to two CALL instructions, which represent instructions that call CreateToolhelp32Snapshot. We could explore both references, but for this post we will only jump to the first reference address using the s (seek) command:

s_

Notice the address in the prompt changed to our destination address, a signal that we have arrived at the desired location. To confirm this, we can use the pd (print disassembly) command and its subcommands:

pd?

First, let’s make sure we reside at a CALL to CreateToolhelp32Snapshot using the command pd N, where N indicates number of disassembly lines to print (pardon the small text size to maintain formatting):

pd1

Notice the autogenerated comments in red are unhelpful in this case, but they are included for completeness.

Understanding the code

To view a summary of the function where we currently reside, we can type pds (print disassembly summary):

pds

As indicated by the help information, pds focuses on strings, calls, jumps, and references to provide an overview of the function. Looking at the above output, notice the APIs CreateToolhelp32Snapshot, Process32First, lstrcmpiW, and Process32Next. This progression of CALLs is often used to capture a snapshot of running processes, begin iterating through the list, compare process names to one or more predefined names, and continue through the list, respectively.

To understand precisely how these calls are used and what decision points are encountered, we need more information about the function. We could print the entire function body with the command pdf (print disassembled function), but this prints a rather large amount of output that you have to scroll through in the terminal.

One approach to evaluating the context of the CreateToolhelp32Snapshot CALL is to view the instructions that occur right before it. To view the 10 instructions before the CALL, we can type pd -10 (only the command and disassembly are shown due to space constraints):

pd-10

I mentioned earlier that when malware uses CreateToolhelp32Snapshot to capture and evaluate running processes, it often compares the process list snapshot to predefined values. It is likely this is that predefined group of process name strings. Considering this sample is ransomware, it makes sense that the malicious code would want to check for these process names, as they may have a lock on files worth encrypting.

Another approach to understanding the context of this code is to use radare2’s visual mode, which allows you to browse the assembly similar to a GUI-based disassembler. Type V to enter visual mode, and use the HJKL keys (similar to vi/vim) to navigate the code:

V

Since this approach allows for easy exploration of the code, it is my preferred method for code analysis. Typing a question mark will provide a help menu specific to this interface, although that output is not shown here.

Another perspective on this code and its flow of execution is achieved by entering graph mode with the command VV. Once in graph mode, the HJKL keys will allow you to browse the decision points that occur throughout the function. Below, focusing on the lstrcmpiW, OpenProcess and TerminateProcess CALLs in this more visual interface provides insight into what happens if the program matches a process name against its predefined list.

VV

Specifically, if a string match is found, the program will access the target process via OpenProcess and then terminate it.

Closing Thoughts

This post introduced radare2 and explored a basic workflow to load a binary and begin analysis. There is certainly much more to learn about radare2, but I hope this jumpstarts your journey.

For more information on radare2, I encourage you to explore these resources:

If you would like to learn more about malware analysis strategies, join me at an upcoming SANS FOR610 course.

-Anuj Soni / @asoni

Exploring the PE File Format via Imports

Introduction

Just as a surgeon should understand the human body and its parts to excel in surgery, a malware reverse engineer should understand the structure and components of a binary to be proficient in malware analysis. Within the Windows operating system, we are referring to the Portable Executable (PE) format.

This article will not discuss every excruciating detail about a Windows executable. If you’re looking to scratch that itch, read through Microsoft’s PE Format and Peering inside the PE articles or start reading about structures defined in winnt.h. Don’t get me wrong – these are excellent reference articles, and I will link to them throughout this post, but tackling each resource in its entirety can be overwhelming.

For this discussion, we will navigate a PE file, focusing primarily on fields associated with the binary’s imported DLLs and functions. My hope is that by concentrating on just this one aspect, you will (1) learn an approach to maneuver the PE structure and (2) apply this approach to better understand terminology related to an executable’s imports.

For this discussion, I will use the freely available CFF Explorer tool that is part of NTCore Explorer Suite. Also, my target file is – brace yourself – notepad.exe. Why use a legitimate file for this exercise? First, to understand the structure of a PE file, you don’t need malware. Second, a deeper understanding of legitimate files allows you to more easily discover anomalies when you analyze suspect files. For more on this topic, read my earlier post on analyzing files, not malware.

Let’s take a walk

We begin our travels through the PE file format, well, at the beginning. After loading notepad.exe into CFF Explorer, you will see headers on the left side that comprise the first bytes of a typical Windows executable. These headers describe the rest of the file, including the executable content, resources, and imports.

Beginning
Figure 1: MS-DOS Stub

Let’s start with the MS-DOS header (also called the MS-DOS Stub), which displays “This program cannot be run in DOS mode” when the executable is run in MS-DOS. At the beginning of this header (see top-right of Figure 1) is the e_magic field, and it contains the well-known “MZ” characters represented by the hexadecimal value 0x4D5A (shown as 0x5A4D above because the value is interpreted as little-endian). Most fields in this header are not relevant to newer operating systems, but the final field e_lfanew (see below) is significant because it points to the PE header, shown in CFF Explorer as Nt Headers.

Figure 2: Pointer (address) to PE header

Clicking on Nt Headers (below) takes us to file offset 0xF0, which matches the value of e_lfanew above. The value translates to the string “PE”, which typically appears at the beginning of the PE header.

PEHeader.jpg
Figure 3: PE header
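The same two steps, checking e_magic and following e_lfanew to the PE signature, can be sketched with Python's struct module. This is a minimal illustration rather than a full parser: find_pe_header is a hypothetical helper, and the buffer is synthetic, built to mimic the 0xF0 offset seen above instead of being read from notepad.exe.

```python
import struct


def find_pe_header(data):
    """Return the file offset of the PE signature, or raise ValueError."""
    if len(data) < 0x40:
        raise ValueError("too small to hold an MS-DOS header")
    # Read e_magic as a little-endian 16-bit value: the bytes "MZ" become 0x5A4D.
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != 0x5A4D:
        raise ValueError("missing MZ signature")
    # e_lfanew is the last field of the 64-byte MS-DOS header, at offset 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    return e_lfanew


# Synthetic buffer: an MZ header whose e_lfanew points at offset 0xF0.
fake = bytearray(0x100)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0xF0)
fake[0xF0:0xF4] = b"PE\x00\x00"

print(hex(find_pe_header(bytes(fake))))
```

To try it on a real file, pass the bytes returned by open(path, "rb").read() instead of the synthetic buffer.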

Next on our path is the COFF File Header, displayed simply as File Header in CFF Explorer. This header includes information such as the target machine type (e.g., x64), the compile timestamp and file characteristics (e.g., is the executable a DLL or EXE?).

FileHeader
Figure 4: File header
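As a rough sketch of what CFF Explorer is decoding here, the 20-byte COFF File Header that follows the 4-byte PE signature can be unpacked as below. The field layout follows Microsoft's PE format documentation; parse_file_header and the sample values are my own illustration, not values read from notepad.exe.

```python
import struct
from datetime import datetime, timezone

IMAGE_FILE_DLL = 0x2000                       # Characteristics flag: file is a DLL
MACHINE_NAMES = {0x14C: "x86", 0x8664: "x64"}


def parse_file_header(data, pe_offset):
    """Decode the COFF File Header located right after the PE signature."""
    (machine, num_sections, timestamp, _symtab, _num_symbols,
     optional_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "sections": num_sections,
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "optional_header_size": optional_header_size,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }


# Illustrative values only: an x64 EXE with 7 sections and a zero timestamp.
sample = b"PE\x00\x00" + struct.pack("<HHIIIHH", 0x8664, 7, 0, 0, 0, 0xF0, 0x0022)
print(parse_file_header(sample, 0))
```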

Then, we have the Optional Header. By the way, this header is “optional” for files like object files, which are not directly executable. For image files like notepad.exe, which are directly executable, this header is required. It contains a wealth of information that supports loading the executable into memory. One field worth mentioning is the ImageBase (below) which specifies the preferred address where the executable should be mapped in memory. If ASLR is enabled, this address is randomized.

ImageBase.jpg
Figure 5: Optional header
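Reading ImageBase by hand depends on whether the file is PE32 or PE32+, since the field's size and offset differ between the two. The sketch below relies on the offsets given in Microsoft's PE format documentation; read_image_base is a hypothetical helper, and the buffers are synthetic.

```python
import struct

PE32_MAGIC = 0x10B       # 32-bit image
PE32_PLUS_MAGIC = 0x20B  # 64-bit image


def read_image_base(optional_header):
    """Return the preferred load address stored in the Optional Header."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    if magic == PE32_PLUS_MAGIC:
        # PE32+: ImageBase is 8 bytes wide at offset 24.
        (image_base,) = struct.unpack_from("<Q", optional_header, 24)
    elif magic == PE32_MAGIC:
        # PE32: ImageBase is 4 bytes wide at offset 28.
        (image_base,) = struct.unpack_from("<I", optional_header, 28)
    else:
        raise ValueError("unexpected Optional Header magic: %#x" % magic)
    return image_base


# Synthetic PE32+ header fragment using the ImageBase value from this post.
header64 = bytearray(32)
struct.pack_into("<H", header64, 0, PE32_PLUS_MAGIC)
struct.pack_into("<Q", header64, 24, 0x140000000)
print(hex(read_image_base(bytes(header64))))
```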

At the end of the Optional Header are Data Directories, which point to tables that contain supporting information, including imported and exported functions. As a reminder, we are focused on import-related information for this discussion. Among the listed directories, there are only two groups that refer to imports and have non-zero values, highlighted in red:

Directories
Figure 6: Data directories

Both the Import Directory and Import Address Table Directory have RVA and size values. The size is straightforward in that it indicates the size, in bytes, of the table. The Relative Virtual Address (RVA) refers to the location of the specified table. RVA is a virtual address because this is an address after the executable is loaded into memory (i.e., after it is “memory-mapped”). It is relative to the ImageBase, so adding the RVA to the ImageBase provides the Virtual Address (VA) in memory of the specified table.
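To make the RVA arithmetic concrete, here is a small sketch that splits a data directory array into (RVA, size) pairs and converts the two import-related RVAs into VAs. The directory indexes (1 for the Import Directory, 12 for the Import Address Table Directory) come from Microsoft's PE format documentation. The RVAs and ImageBase are the notepad.exe values used in this post, while the sizes are placeholders I made up and the helper names are hypothetical.

```python
import struct

IMPORT_DIRECTORY = 1   # index of the Import Directory entry
IAT_DIRECTORY = 12     # index of the Import Address Table Directory entry


def parse_data_directories(blob, count=16):
    """Split the data directory array into (rva, size) pairs of 8 bytes each."""
    return [struct.unpack_from("<II", blob, i * 8) for i in range(count)]


def rva_to_va(rva, image_base):
    """An RVA is relative to ImageBase, so the VA is simply their sum."""
    return image_base + rva


# Synthetic array: every entry zeroed except the two import-related ones.
blob = bytearray(16 * 8)
struct.pack_into("<II", blob, IMPORT_DIRECTORY * 8, 0x1F300, 0x100)  # size is a placeholder
struct.pack_into("<II", blob, IAT_DIRECTORY * 8, 0x1A620, 0x100)     # size is a placeholder

directories = parse_data_directories(bytes(blob))
image_base = 0x140000000
for index in (IMPORT_DIRECTORY, IAT_DIRECTORY):
    rva, size = directories[index]
    print(index, hex(rva), hex(rva_to_va(rva, image_base)))
```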

The final headers we see on the left in CFF Explorer are the Section Headers:

section_headers
Figure 7: Section headers

The contents of a Windows executable after the headers are organized into sections. The table above provides important information on the name, location (both on disk and in memory) and characteristics of each section. Key sections include “.text” for executable code, “.rdata” for read-only data, and “.rsrc” for resources like icons.

You may have noticed in Figure 6 that both highlighted RVA rows have “.rdata” in the Sections column, indicating both tables reside in that section. How was this determined? First, see .rdata’s Virtual Address value in Figure 7, which is 0x1A000. I should clarify that this column lists RVAs, not VAs as the column heading suggests. Next, note .rdata’s Virtual Size of 0x73A8. Performing simple math shows that the .rdata section will extend from RVA 0x1A000 to 0x213A7 (inclusive). Looking back at Figure 6, RVAs for both the Import Directory and Import Address Table Directory (0x1F300 and 0x1A620, respectively) fall within this range.
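The range check described above is easy to automate. The sketch below unpacks the name, Virtual Size and Virtual Address (really an RVA) from a 40-byte section header and reports which section contains a given RVA. The helper names are hypothetical, and the header is synthetic: it carries the .rdata values from Figure 7 with the remaining fields zeroed.

```python
import struct
from collections import namedtuple

Section = namedtuple("Section", "name rva virtual_size")


def parse_section_header(entry):
    """Unpack one 40-byte section header, keeping only three fields."""
    name, virtual_size, rva = struct.unpack_from("<8sII", entry, 0)
    return Section(name.rstrip(b"\x00").decode("ascii", "replace"), rva, virtual_size)


def section_for_rva(rva, sections):
    """Return the name of the section whose RVA range contains rva, else None."""
    for section in sections:
        if section.rva <= rva < section.rva + section.virtual_size:
            return section.name
    return None


# Synthetic .rdata header: name, VirtualSize, VirtualAddress, then 24 zeroed bytes.
rdata_header = struct.pack("<8sII", b".rdata", 0x73A8, 0x1A000) + bytes(24)
sections = [parse_section_header(rdata_header)]

for rva in (0x1F300, 0x1A620):
    print(hex(rva), section_for_rva(rva, sections))
```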

Following the import trail

The Import Directory RVA is 0x0001F300 and notepad.exe’s ImageBase is 0x140000000, so the VA is 0x14001F300. What’s located at that address? Looking at this offset within the file on disk will not be helpful since, as mentioned earlier, the VA is an address in memory. As a result, we must use a tool that will load our executable similar to how the Windows loader would in preparation for execution. One approach is to use a disassembler like IDA Pro, which will load the executable into memory in the same manner as the Windows loader during file execution. For this example, I will use IDA Freeware version 7.0 for Windows.

When loading notepad.exe into IDA, you will see the window below with load options. I recommend unchecking “Create imports segment,” at least for now. Leaving this checked means IDA will create an “.idata” section for imports, and for this discussion I prefer to more closely represent the raw binary by not creating additional sections.

LoadNewFile
Figure 8: IDA load file options

After clicking “OK,” you will also see a prompt asking if you want to take advantage of debug information. Choose “No” for now.

Next, let’s jump to the VA 0x14001F300 by typing ‘g’ and inserting the address:

jumptoaddress.jpg
Figure 9: Jumping to the Import Directory VA in IDA.

Note that jumping to the above address assumes the loader will respect the address in the ImageBase field. IDA Pro takes this approach, but the Windows loader and debuggers like x64dbg will randomize the ImageBase unless ASLR is disabled for this executable (for more information on this point, see Lenny Zeltser’s article here).

Jumping to 0x14001F300 brings us here:

ImportDescriptor
Figure 10: Beginning of the Import Directory Table

Since we calculated this address using the Import Directory RVA, it should be no surprise that this is the beginning of the Import Directory Table,  which contains all the references we need to understand the program’s imports. The above excerpt shows two entries, one per imported DLL. Each entry consists of the following elements:

  1. Import Name Table (as shown in IDA Pro) or Import Lookup Table (as described in Microsoft documentation) RVA: This points to a list of function names imported from the specified DLL. Using the first entry as an example, double clicking on off_14001F558 takes us to the location below:

Importnametaboe
Figure 11: Import Name Table

At 0x14001F558 we find a list of addresses that appear in close proximity with one another (for more detail on the format of values in the Import Name Table, see here). Let’s double-click on the first address, word_14001FED0. The destination is below:

Firtsone.jpg
Figure 12: Hint/Name Table

This is the beginning of the Hint/Name Table. We see references to functions including OpenProcessToken, GetTokenInformation, and DuplicateEncryptionInfoFile – all functions imported from advapi32.dll. This makes sense since we arrived here after double-clicking the first entry in the advapi32.dll Import Name Table.

One Hint/Name Table covers all imported functions for the file. Each entry in the table has 3 components:

  • Hint: This is an index into the export name pointer table of the imported DLL, and it is used to help locate the required function. In the first Hint/Name table entry above, the value is 0x214.
  • Name: The name of the imported function, null terminated. This is used to find the imported function within a DLL when using the Hint does not suffice. In the first entry above, this value is OpenProcessToken.
  • Padding: IDA Pro’s “align” directive refers to 0-byte padding.

2. Time Stamp: This value will generally be zero, unless the DLL is bound. DLL binding is out of scope for this post, but see this article to learn more.

3. Forwarder Chain: A DLL may reference another DLL’s functionality, but similar to the Time Stamp field above, this value is generally zero. Again, the details of this field are out of scope for this article, but you can search for “ForwarderChain” in this article for more information.

4. DLL Name RVA: A pointer (address) to the name of the imported DLL. In the case of advapi32.dll, the DLL Name RVA points to the string “ADVAPI32.DLL.”

5. Import Address Table (IAT) RVA: First, understand that the Import Address Table is populated by the loader when the executable and its imported DLLs are mapped into memory, and it is a table of pointers to the imported functions. Each entry in the table is called a “thunk” and the table is referred to as a “thunk table.” With that in mind, the RVA in this field points to the address of the imported function within the IAT. For example, double-clicking on OpenProcessToken at 0x14001F310 in Figure 10 takes us to the location below.

REMWorkstationVM.jpg
Figure 13: Import Address Table

The entry for OpenProcessToken at 0x14001A620 holds, once the loader has populated it, the address in memory where the function code resides. In other words, 0x14001A620 is referenced when OpenProcessToken is called within notepad.exe. To emphasize this point, highlight OpenProcessToken and hit “x” on the keyboard. The xrefs window (below) shows a CALL to the OpenProcessToken API.

reference
Figure 14: OpenProcessToken references

Also note that the first address 0x14001A620 in Figure 13 matches the Import Address Table Directory RVA specified in Figure 6, if you add the ImageBase. This makes sense, because Figure 13 shows the start of the Import Address Table Directory.
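The five fields of an Import Directory Table entry, along with a Hint/Name Table entry, can also be unpacked by hand. The sketch below follows the layout in Microsoft's PE format documentation: five 32-bit fields per directory entry, and a 16-bit hint followed by a null-terminated name. The helper names are my own, the lookup table and IAT RVAs mirror the advapi32.dll entry discussed above, and the DLL Name RVA is a made-up placeholder.

```python
import struct
from collections import namedtuple

ImportDescriptor = namedtuple(
    "ImportDescriptor",
    "lookup_table_rva time_stamp forwarder_chain name_rva iat_rva",
)


def parse_import_descriptor(data, offset=0):
    """Unpack one 20-byte Import Directory Table entry (five 32-bit fields)."""
    return ImportDescriptor(*struct.unpack_from("<5I", data, offset))


def parse_hint_name(data, offset=0):
    """Return (hint, name) from a Hint/Name Table entry."""
    (hint,) = struct.unpack_from("<H", data, offset)
    end = data.index(b"\x00", offset + 2)   # the name is null terminated
    return hint, data[offset + 2:end].decode("ascii")


# Synthetic entry: the name_rva value 0x20000 is a placeholder, not from notepad.exe.
raw_descriptor = struct.pack("<5I", 0x1F558, 0, 0, 0x20000, 0x1A620)
descriptor = parse_import_descriptor(raw_descriptor)
print(hex(descriptor.lookup_table_rva), hex(descriptor.iat_rva))

raw_hint_name = struct.pack("<H", 0x214) + b"OpenProcessToken\x00"
print(parse_hint_name(raw_hint_name))
```

A directory entry whose fields are all zero marks the end of the Import Directory Table, so a loop over consecutive 20-byte entries can stop there.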

Closing Thoughts

This article introduced the PE header and used it as a starting point to explore a file’s imports. To recap, we:

  1. Began with the MS-DOS Header
  2. Identified the PE Header
  3. Observed the ImageBase in the Optional Header
  4. Viewed the various Data Directories
  5. Jumped to the Import Directory Table VA using IDA
  6. Reviewed the components of an Import Directory Table entry, including the Import Lookup Table
  7. Found its reference to the Hint/Name Table
  8. Ended at the Import Address Table, which points to the imported functions in memory

If you want to learn more about the PE header and the structures it includes, there are many excellent resources to explore. Below are some of my favorites:

-Anuj Soni


Python for Malware Analysis – Getting Started

Introduction

Improving your Python programming skills is likely on your to-do list – just like cleaning your closet, painting that wall, or tightening that loose screw (you know which one I’m talking about).

Scripting, in general, is a useful skill to have across most security disciplines. Writing a script can help you automate a menial task, scale your analysis to large volumes of data, and share your work.

Although there are multiple programming languages to choose from, Python is a popular choice because, among other reasons, it is cross-platform and relatively easy to read and write. Many existing open-source security tools are also written in Python, so learning this language helps you better understand existing capabilities.

This blog post introduces Python programming for Portable Executable (PE) file analysis. In this context, a script can enable you to quickly parse an individual file and extract key characteristics, or scale that activity across numerous files to help prioritize work.

Note that this post assumes the reader has had some basic exposure to Python and programming concepts.

Learning Python

With some basic programming skills, it’s possible to improve your knowledge of Python by reviewing existing code and making changes as needed. While tweaking code may yield the desired results in some cases, many will likely benefit from a more formal introduction to the language. A quick online search will reveal many freely available written and video Python tutorials. For a structured, interactive introduction, I recommend Codecademy. If you’re available for a more rigorous, immersive Python learning experience, consider the SANS SEC573 “Automating Information Security with Python” course (full disclosure, I’m a SANS Certified Instructor).

Existing Tools

There are many Python-based malware analysis tools you can use today. Below are just a few that I find helpful for static file analysis:

These tools produce useful output and serve as excellent starting points for understanding Python. By simply viewing the source code and performing research as necessary, you can learn from what the authors wrote and modify the code to serve your own purpose. However, as you build experience in technical analysis, you will likely encounter scenarios where existing tools do not meet your needs, and you must develop a customized solution. Rest assured, these cases do not require you to write code from scratch. Instead, you can rely upon existing Python libraries to extract data and manipulate output in a way specific to your needs.

A popular, long-standing library for PE file analysis is aptly called pefile. This module provides easy access to the structure of a portable executable. Another fairly recent and more versatile cross-platform library is called Library to Instrument Executable Formats (LIEF), and it includes a Python module for PE file analysis (documented here).

This blog post will focus on using Python 2 and pefile for file analysis. Note that pefile is a third-party module, not one that is built into a standard Python install. As a result, you may have to install it first; try pip install pefile.

Exploring pefile

For our environment, we will use the REMnux malware analysis Linux distribution, which you can download here. We begin by launching the Python interactive shell to explore the pefile module and write some initial code. Rather than diving straight into creating a script, the interactive shell is a great way to learn about available modules and perform quick testing. Simply type python at the terminal and you’ll see a prompt similar to the following:

python

Next, import pefile to make use of its functionality:

import_pefile.jpg

Let’s explore this module by viewing its help information. Type help(pefile). Below is an excerpt of the output.

help_pefile.jpg

In addition to an overview of the module, we see a description of classes contained within the module. Scrolling down provides information about each class. For now, we will only focus on the PE class:

class_PE

The description tells us that this class will give us access to the structure of a PE file, which is precisely what we need for our Windows file analysis. The output also explains how to create an instance of the PE class. Let’s read in a file for testing. For this post, we’ll use an emotet sample.

pefile_pe

We can return to the help menu to read more about the methods and attributes of the PE class. Alternatively, we can view a summary of this information by typing dir(pefile.PE). An excerpt of this output is below.

help_pefilepe

There is a lot of text here, and much of it may not make sense depending on your prior exposure to PE file analysis. However, let’s look for some basic terms we may recognize. We see references to multiple methods beginning with “get_” that are helpful for collecting some basic static information about a file. For example, get_imphash() returns an MD5 hash derived from the file’s imports, specifically the ordered list of imported library and function names. Let’s give this a try using our file instance.

file_instance

The get_imphash() method worked as expected, providing the file’s import table hash.
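
To make the idea of an import hash less abstract, here is a rough sketch of how such a value can be computed from a list of names. It reflects my understanding of the general imphash approach (lowercase each name, drop the library's file extension, join the entries with commas, and take the MD5), and it handles named imports only; pefile's own implementation also deals with imports by ordinal. The function name imphash_from_names is hypothetical.

```python
import hashlib

def imphash_from_names(imports):
    """imports: ordered list of (dll_name, function_name) pairs."""
    parts = []
    for dll, func in imports:
        lib = dll.lower()
        # Drop a trailing .dll, .ocx, or .sys extension from the library name
        base, dot, ext = lib.rpartition(".")
        if dot and ext in ("dll", "ocx", "sys"):
            lib = base
        parts.append("%s.%s" % (lib, func.lower()))
    # Order matters: the same imports in a different order hash differently
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()

print(imphash_from_names([("KERNEL32.dll", "GetTickCount"),
                          ("ADVAPI32.dll", "RegSetValueExW")]))
```

Because the hash depends on the order of the imports, two files built with the same tooling often share a value, which is what makes it useful for grouping.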

Another “get_” function I find valuable is get_warnings(). When pefile parses a Windows executable, it may encounter errors along the way. The get_warnings() function returns a list of warnings generated as the PE file is processed. Security analysis is all about investigating anomalies, so this output can reveal useful starting points for further review. For example, this function’s output may indicate the file is obfuscated, even if the specific packer cannot be identified by common tools that look for packer signatures (e.g., ExeInfo or PEiD). In this particular case, however, executing the function did not return any warnings:

get_warnings.jpg
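
For readers who cannot see the screenshots, the interactive steps so far can be collected into two small helpers. This is a sketch rather than a transcript of my session: the names load_pe and summarize are my own, and "sample.exe" in the usage comment is a placeholder for whatever file you are examining.

```python
def load_pe(path):
    """Parse a file with pefile. The import lives here so that summarize()
    below can be used (and tested) without the third-party module."""
    import pefile  # third-party: pip install pefile
    return pefile.PE(path)

def summarize(pe):
    """Collect the import hash and parser warnings from a pefile.PE object."""
    return {
        "imphash": pe.get_imphash(),
        "warnings": list(pe.get_warnings()),
    }

# Typical use (placeholder file name):
#   info = summarize(load_pe("sample.exe"))
#   print(info["imphash"])
#   for warning in info["warnings"]:
#       print(warning)
```

An empty warnings list matches what we saw above; a non-empty one is worth reading line by line.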

Let’s continue our journey with pefile and extract other static information often reviewed during initial malware analysis. For example, how can we use pefile to understand which DLLs and functions are imported by this executable? To answer this question, we will again use the built-in help() system with some old-fashioned trial and error. This methodology can be used with any well-documented Python module.

First, let’s review our options by learning more about the PE class. We can type help(pefile.PE) and scroll through the output. An excerpt of interest is below:

sections.jpg

We see references to many “DIRECTORY_ENTRY_” attributes, which point to the location of key file components. Since we’re interested in imports, we will focus on DIRECTORY_ENTRY_IMPORT, which is described as a list of ImportDescData instances. Let’s begin by iterating through this list to see what information it provides:

item.jpg

Just as the help output specified, we see a list of ImportDescData objects. What do these objects represent? We will return to help again and type help(pefile.ImportDescData):

ImportDescData.jpg

As shown above, this structure contains the name of the DLL and a list of imported symbols. This sounds like the information we need. Let’s again iterate to confirm:

ImportData.jpg

We’re making progress, but we have a new structure to investigate. We type help(pefile.ImportData):

ImportData2

For now, we will just focus on imports by name, so the name attribute should have the information we need. Let’s incorporate this into our code and make the output a bit more readable.

imports.jpg

Success! This code provided us with the name of an imported DLL and its corresponding imported function names. We could make this output more elegant, but the information we need is here.
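
Since the final loop only appears in a screenshot, here is one way it might be written as a reusable function. It is a sketch under a few assumptions: the helper names _text and list_imports are mine, the DLL.Function output format is a personal preference, and the bytes handling is there because pefile returns byte strings under Python 3.

```python
def _text(value):
    """Decode byte strings (pefile returns bytes for names under Python 3)."""
    return value.decode("ascii", "replace") if isinstance(value, bytes) else value

def list_imports(pe):
    """Return 'DLL.Function' strings for each import in a pefile.PE object."""
    results = []
    # DIRECTORY_ENTRY_IMPORT is absent when a file has no import directory
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = _text(entry.dll)
        for imp in entry.imports:
            if imp.name is not None:
                results.append("%s.%s" % (dll, _text(imp.name)))
            else:
                # Imported by ordinal only, so there is no name to print
                results.append("%s.ord(%d)" % (dll, imp.ordinal))
    return results

# Typical use (placeholder file name):
#   import pefile
#   print("\n".join(list_imports(pefile.PE("sample.exe"))))
```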

Scaling

As discussed in the Introduction, automating work with a script enables you to scale a task across a larger volume of data. The individual file analysis performed above has its place, but if your day-to-day job involves malware analysis, you may have hundreds or thousands of files to sift through before choosing one for closer review. In these scenarios, extracting key information from all files allows you to group and prioritize samples for more efficient analysis.

Let’s again consider a file’s imphash. Across a large number of samples, grouping by imphash makes it easier to identify similar functionality or a common packer/packaging tool used to generate the binary. To explore this idea, we will write a small script to extract the imphash from a directory of files. The code should accomplish the following tasks:

  1. Create a list of all files in the directory (full path).
  2. Open an XLSX file for writing (I often use Excel for easy viewing/sorting, but you can certainly output to CSV or, even better, write this information to a database).
  3. Calculate and write each file’s sha256 hash and imphash to the XLSX file.
  4. Autofilter the data.

Below is one way to approach these tasks.

#!/usr/bin/env python
import sys,os
import hashlib

import pefile
import xlsxwriter

if __name__ == "__main__":

	#Identify specified folder with suspect files
	dir_path = sys.argv[1]

	#Create a list of files with full path
	file_list = []
	for folder, subfolder, files in os.walk(dir_path):
		for f in files:
			full_path = os.path.join(folder, f)
			file_list.append(full_path)

	#Open XLSX file for writing
	file_name = "pefull_output.xlsx"
	workbook = xlsxwriter.Workbook(file_name)
	bold = workbook.add_format({'bold':True})
	worksheet = workbook.add_worksheet()

	#Write column headings
	row = 0
	worksheet.write('A1', 'SHA256', bold)
	worksheet.write('B1', 'Imphash', bold)
	row += 1

	#Iterate through file_list to calculate imphash and sha256 file hash
	for item in file_list:

		#Get sha256
		with open(item, "rb") as fh:
			data = fh.read()
		sha256 = hashlib.sha256(data).hexdigest()

		#Get import table hash; skip anything pefile cannot parse as a PE file
		try:
			pe = pefile.PE(data=data)
			ihash = pe.get_imphash()
		except pefile.PEFormatError:
			continue

		#Write hashes to doc
		worksheet.write(row, 0, sha256)
		worksheet.write(row, 1, ihash)
		row += 1

	#Autofilter the header and data rows (columns A and B) for easy viewing/sorting
	worksheet.autofilter(0, 0, row - 1, 1)
	workbook.close()

I titled the above script pe_stats.py and ran it against a directory named “suspect_files” with the command python pe_stats.py suspect_files. To populate the target directory, I downloaded 100 highly convicted files from VirusTotal (specifically, I used the basic VTI query “type:peexe positives:50+”). An excerpt of the resulting data, when opened in Microsoft Excel, is below.

xlsx

A quick glance at the first few rows immediately reveals a pattern in the imphash values. As a next step, perhaps you will investigate the largest cluster of import table hashes to understand why these groups of files have the same imphash. You may also revisit the pefile library documentation to explore additional static characteristics worth including in this spreadsheet. With more detail, this document could help you triage and prioritize samples for analysis. I leave these tasks to you for further exploration.
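
If you prefer to find the largest cluster in code rather than by sorting in Excel, a few lines are enough. The sketch below assumes you have kept the (sha256, imphash) pairs in a list; the function name cluster_by_imphash and the placeholder hash values in the usage lines are mine.

```python
from collections import defaultdict

def cluster_by_imphash(rows):
    """rows: iterable of (sha256, imphash) pairs.
    Returns a list of (imphash, [sha256, ...]) sorted largest cluster first."""
    clusters = defaultdict(list)
    for sha256, imphash in rows:
        clusters[imphash].append(sha256)
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

# Placeholder values, not real hashes
rows = [("file1", "hashA"), ("file2", "hashB"), ("file3", "hashA")]
for imphash, members in cluster_by_imphash(rows):
    print("%s: %d file(s)" % (imphash, len(members)))
```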

Conclusion

This post provided an initial approach to analyzing PE files using Python. Most importantly, it walked through how to use the built-in Python help feature and some basic knowledge of PE files to systematically explore a file’s characteristics and then scale that process to a larger set of files.

If you would like to learn more about malware analysis strategies, join me at an upcoming SANS FOR610 course.

Unpacking and Correlating Qakbot

I recently wrote a post for Cylance (my employer), discussing the polymorphic features of Qakbot. You can read it here.



Analyze files, not malware

Let’s dive right in. In the last post, I mentioned the value of reviewing the import address table (IAT) when performing static file analysis. Take a look at some IAT excerpts from three files:

Screen Shot 2016-03-03 at 9.19.44 PM

Screen Shot 2016-03-03 at 9.20.20 PM

Several functions may pique your interest, including:

  • IsDebuggerPresent and GetTickCount: functions that may be used to detect debugging activity.
  • RegCreateKeyW, RegSetValueExW: functions used to manipulate the registry, perhaps to configure persistence.
  • LoadLibraryW and GetProcAddress: functions used to call other functions at runtime, a strategy that hinders static file analysis.
  • FindResourceW and LoadResource: functions used to access embedded resources, where additional code may reside.

Let’s look behind the curtain:

  • File A = notepad.exe
  • File B = searchindexer.exe
  • File C = spoolsv.exe

These are all legitimate files found on a clean Windows 7 64-bit system.

This is not meant to be a trick, but instead a reminder. The rush of successfully identifying malware is one we all yearn for, but that glorious destination must be earned through careful analysis. This might be as simple as matching a suspect file’s hash with a known bad file hash, or it might require more robust static, behavioral and code analysis. Not all observations are created equal, so we must weigh the severity of each one (i.e., how definitively it indicates malicious behavior) and consider their cumulative value when deciding if a file is malware. Files are innocent until proven guilty, and I challenge you to demonstrate, beyond a reasonable doubt, that a particular file is bad.

So how can you sharpen your ability to spot unusual characteristics that may indicate nefarious activity? As in all areas of incident identification and response, we need to understand the normal to discover the anomalous. Pick your favorite legitimate Windows programs and apply your file analysis process. You will likely identify characteristics that might otherwise seem alerting and, with practice, this will increase your tolerance for suspicious characteristics.

Inspecting known good files can also help validate (or invalidate) indicators of potential compromise. Think you’ve discovered a group of API calls, a set of strings, or a particular PE file characteristic that only exists in malware? Search across a large sample of legitimate files to test your theory.
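
Testing a theory like this can be as simple as counting how many known good files match it. The sketch below assumes you have already extracted each legitimate file's imported function names into a dictionary; the function name indicator_hit_rate and the data layout are hypothetical.

```python
def indicator_hit_rate(indicator_apis, known_good_imports):
    """
    indicator_apis: API names you suspect only appear together in malware.
    known_good_imports: dict mapping file name -> set of imported API names.
    Returns (matching_file_names, fraction_of_corpus_matched).
    """
    wanted = set(indicator_apis)
    hits = sorted(name for name, apis in known_good_imports.items()
                  if wanted <= set(apis))
    if not known_good_imports:
        return hits, 0.0
    return hits, float(len(hits)) / len(known_good_imports)
```

If a meaningful share of clean files match, the indicator says little on its own and is better treated as one observation among many.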

There is nothing wrong with identifying indications of evil during the file analysis process, and that’s arguably the point of initiating an investigation. However, it is critical to view your suspicions about a file as hypotheses that you prove or disprove based on empirical evidence. Otherwise, you might miscategorize a legitimate file as malware, and that not only reflects poorly on you if someone checks your work – but it makes the file sad too.

If you would like to learn more about malware analysis strategies, join me at an upcoming SANS FOR610 course.

-Anuj Soni



REMnux v6 for Malware Analysis (Part 2): Static File Analysis

Introduction

In this post, we’ll continue exploring some of the helpful capabilities included in REMnux v6. Be sure to regularly update your REMnux VM by running the command update-remnux.

Analyzing suspect files can be overwhelming because there are often numerous paths to explore, and as you continue to observe activity and gather data, the additional areas of analysis seem to explode exponentially. One approach to guide your analysis is to focus first on answering key questions. Another (likely complementary) approach is to apply the scientific method where you:

  1. Make an observation.
  2. Generate a hypothesis based on that observation.
  3. Test the hypothesis.
  4. Modify the hypothesis based on the outcome of the test and rerun the test.

Static file analysis, where you learn about a suspect file without launching it, can help generate observations that fuel this process. As a reminder, static file analysis typically results in information such as file and section hashes, compile times, extracted strings, library and function dependencies, and digital signature information. Using the scientific method described above, your analysis of a suspect file may involve the following sequence of activities:

  1. As part of your static analysis process, you extract the ASCII strings from a file and observe the text “HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Run”.
  2. You hypothesize that the suspect file uses this registry key to maintain persistence on a victim machine.
  3. You run the sample within a Windows 7 virtual machine and realize that this registry key is never modified. You dig deeper via code analysis and realize a Run key is only created if the victim is a Windows XP machine.
  4. You can now modify your hypothesis to specify the Windows XP caveat, rerun the test in a Windows XP VM, and confirm your theory. In doing so, you’ve performed focused analysis, learned about the sample’s persistence mechanism (which can be translated to an IOC), and identified an associated constraint.
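
The observation in step 1 is easy to reproduce in Python if you do not have a strings utility handy. The following is a minimal sketch: it finds runs of printable ASCII only (no Unicode strings), and the names ascii_strings, mentions_run_key, and RUN_KEY are my own.

```python
import re

RUN_KEY = "Software\\Microsoft\\Windows\\CurrentVersion\\Run"

def ascii_strings(data, min_len=4):
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(b"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")

def mentions_run_key(data):
    """True if any extracted string contains the Run key path (case-insensitive)."""
    needle = RUN_KEY.lower()
    return any(needle in text.lower() for _, text in ascii_strings(data))
```

A match here is only the observation; whether the sample actually writes to the key is the hypothesis you still need to test.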

Static file analysis is challenging, not because it is technically difficult, but because it is so hard to resist double-clicking immediately. I feel your pain; the double-click is my favorite part too. However, it is worth developing the discipline to complete a static file review before executing the sample because it fosters methodical analysis and produces tangible results.

REMnux includes some great tools to perform static analysis, including the ones listed here. This post will highlight just a few of my favorites.

pecheck.py

pecheck.py, written by Didier Stevens, is a wrapper for the Python pefile module used to parse Windows PE files. Let’s explore this tool by analyzing the BACKSPACE backdoor malware described in FireEye’s APT 30 report. If you want to follow along, you can download the sample here (password: infected). As shown in the output below, running pecheck.py against the sample returns file hashes and file/section entropy calculations. Entropy is a measure of randomness, and more entropy indicates a higher likelihood of encoded or encrypted data. While this information is helpful, I want to focus on the “Dump Info:” section shown towards the end of the excerpt. This section basically runs the pefile dump_info() function, which parses the entire file and outputs, well, a lot of data (see the complete output here).

Screen Shot 2015-12-27 at 7.41.59 PM

Figure 1: pecheck.py output
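
To build some intuition for the entropy values in this output, here is a short Shannon entropy calculation over bytes. I am assuming this is the general style of calculation behind the numbers shown (a result between 0 and 8 bits per byte); the function name shannon_entropy is mine and this is not code taken from pecheck.py.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Return the entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = float(len(data))
    counts = Counter(bytearray(data))  # bytearray yields ints on Python 2 and 3
    return -sum((n / total) * math.log(n / total, 2) for n in counts.values())
```

A buffer of identical bytes scores 0, while data in which every byte value is equally likely approaches 8, which is why compressed or encrypted sections stand out.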

Among other information, the output includes the contents of the file’s Import Address Table (IAT), which represents the shared libraries (i.e., DLLs) and functions within those DLLs that the program relies upon:

Screen Shot 2016-01-02 at 5.13.25 PM

Figure 2: pecheck.py Import Address Table (IAT) output

I like the <DLL>.<FUNCTION> format because 1) over time, it can help you remember which functions a DLL contains and 2) you can grep for the DLL name or function name and retrieve the entire line (not the case with output from other tools). In this particular excerpt, we can immediately see some Windows API calls that are often used for malicious purposes. For example, we see references to the CreateToolhelp32Snapshot, Process32First, and Process32Next functions commonly used by malware to capture a list of running processes and iterate through that list to enumerate activity or target specific programs. We could explore this hypothesis by using a debugger to set breakpoints on these API calls and determine if there is a certain process the code is looking for. Oh, and in case you’re wondering, the hint refers to the potential location of the function within the corresponding DLL – it’s an optimization that, in this case, is not helpful given that all values are zero.

When a program imports a function by ordinal rather than by name, the output indicates this clearly:

Screen Shot 2016-01-08 at 1.00.56 AM

Figure 3: pecheck.py Import Address Table (IAT) output by ordinal

Note that since the above functions are imported by ordinal only, the function names (e.g., “ioctlsocket”) will not be listed in the strings output:

Screen Shot 2016-01-09 at 5.32.25 PM

Figure 4: Grepping for Windows API

Beyond viewing the IAT output, pecheck.py output includes section hashes, version information, resource information and the ability to configure a PEiD database to search for packer signatures. While pecheck.py may not be the first script you turn to due to the large volume of output, I prefer it to others because I can extract the information I desire based on grep searches or modifications to the Python code. In addition, dump_info() sometimes results in parsing errors that may reveal other interesting anomalous characteristics associated with the target file.

pestr

pestr is part of the pev PE file analysis framework, and its primary purpose is to extract strings from Windows executable files. However, it goes beyond the traditional strings tool by providing options to show the offset of a string within a file and the section where it resides. For example, below are output excerpts after running pestr against the file analyzed above, using the --section option to print the section where the respective string is found (see complete output here):

Screen Shot 2016-01-09 at 6.39.15 PM.png

Figure 4: pestr output #1

Screen Shot 2016-01-09 at 10.15.07 PM

Figure 5: pestr output #2

Figure 4 shows the command executed and the beginning of the output. The first few strings are found in the PE header, so they are labeled as appearing in the “none” section. Figure 5 shows strings in the “.rdata” section, including DLL and Windows API function names. The “.rdata” section commonly contains the Import Address Table, which could explain the presence of these strings here. Looking at the pecheck.py output, we can confirm these strings are, in fact, present in the IAT.
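
The section labels follow from comparing each string's file offset against the section table. Here is a sketch of that lookup, not pestr's actual code: the function name section_for_offset is mine, and the section layout in the usage lines is invented for illustration.

```python
def section_for_offset(offset, sections):
    """
    sections: list of (name, raw_offset, raw_size) tuples from the section table.
    Returns the name of the section containing the file offset, or "none"
    when the offset falls outside every section (e.g., in the PE header).
    """
    for name, raw_offset, raw_size in sections:
        if raw_offset <= offset < raw_offset + raw_size:
            return name
    return "none"

# Invented layout for illustration only
layout = [(".text", 0x400, 0x5000), (".rdata", 0x5400, 0x1000), (".data", 0x6400, 0x800)]
print(section_for_offset(0x4E, layout))
print(section_for_offset(0x5800, layout))
```

With pefile, the same tuples can be built from each entry in pe.sections using its Name, PointerToRawData, and SizeOfRawData fields.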

Perusing the remaining pestr output shows additional strings, including the following:

Screen Shot 2016-01-09 at 11.16.06 PM

Figure 6: pestr output #3

Note the presence of GetTickCount, a Windows function that returns the number of milliseconds that have passed since the system was started. This is a popular anti-analysis function because it can help detect if too much time has elapsed during code execution (possibly due to debugging activity). Interestingly, pestr output reveals this function name is located in the “.data” section, rather than the “.rdata” section where the IAT resides. We might hypothesize that this is an attempt by the developer to evade traditional import table analysis by manually calling this function during program execution. We can dig deeper by finding the reference to this string in IDA Pro:

code_temp

Figure 7: IDA Pro string reference

While we will not dive into code analysis details in this post, Figure 7 makes it clear that the GetTickCount string reference is indeed used to call the function at runtime using LoadLibraryA and GetProcAddress.
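
The comparison that led us here (an API name that shows up in the strings but not in the import table) is easy to automate. The sketch below assumes you already have the extracted strings, the imported function names, and a watchlist of API names you care about; the function name runtime_resolved_candidates is hypothetical.

```python
def runtime_resolved_candidates(strings, imported_names, api_watchlist):
    """Return watchlisted API names that appear as strings but are not imported."""
    found = set(strings) & set(api_watchlist)
    return sorted(found - set(imported_names))

candidates = runtime_resolved_candidates(
    strings=["GetTickCount", "kernel32.dll", "LoadLibraryA"],
    imported_names=["LoadLibraryA", "GetProcAddress"],
    api_watchlist=["GetTickCount", "IsDebuggerPresent", "LoadLibraryA"],
)
print(candidates)
```

Each result is a hypothesis to confirm in a disassembler, just as we did with GetTickCount.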

readpe.py + pe-carv.py

readpe.py can output information such as PE header data, imports and exports. For this post, I’ll highlight its simple ability to detect an overlay. An overlay is data appended to the end of an executable (i.e., it falls outside of any data described in the PE header). Using the following command against a Neshta.A specimen, readpe.py can detect if an overlay exists:

Screen Shot 2016-02-06 at 12.00.23 AM

Figure 8: readpe.py overlay output
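
Conceptually, an overlay check compares the file size with the end of the last section on disk. Here is a minimal sketch of that comparison; it is not readpe.py's implementation, and the function name overlay_offset is mine.

```python
def overlay_offset(file_size, sections):
    """
    sections: list of (raw_offset, raw_size) pairs from the section table.
    Returns the file offset where an overlay would begin, or None if the
    file ends where the last section ends.
    """
    if not sections:
        return None
    end_of_sections = max(offset + size for offset, size in sections)
    return end_of_sections if file_size > end_of_sections else None
```

If you are working with pefile, its PE class also offers get_overlay_data_start_offset(), which saves you from building the section list yourself. Keep in mind that a signed file's certificate data also sits past the last section, so a hit is a starting point and not a verdict.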

Upon detecting an overlay, the next step is to evaluate the contents of this additional data. Malware often includes executable content in the overlay, so you might consider using a tool called pe-carv.py, which is purpose-built to carve out embedded PE files:

Screen Shot 2016-02-06 at 12.13.28 AM

Figure 9: pe-carv.py extracted file

As shown in the figure above, pe-carv.py successfully extracted a file it called 1.exe, and we could proceed with further static file analysis to better understand this embedded content.
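
If you are curious how carving of this kind can work, the core idea is to search for an “MZ” signature whose e_lfanew field points at a valid “PE\0\0” signature. The sketch below illustrates that idea only; it is not pe-carv.py's logic, it reports offsets rather than writing files, and the function name find_embedded_pe_offsets is mine.

```python
import struct

def find_embedded_pe_offsets(data, start=1):
    """Return offsets at or after `start` where an MZ header leads to a PE signature.
    The default start of 1 skips the host file's own header at offset 0."""
    offsets = []
    pos = data.find(b"MZ", start)
    while pos != -1:
        # Need the full 0x40-byte MS-DOS header to read e_lfanew at +0x3C
        if pos + 0x40 <= len(data):
            e_lfanew = struct.unpack_from("<I", data, pos + 0x3C)[0]
            sig = pos + e_lfanew
            if data[sig:sig + 4] == b"PE\x00\x00":
                offsets.append(pos)
        pos = data.find(b"MZ", pos + 1)
    return offsets
```

Combining this with an overlay offset narrows the search to the appended data, and slicing from a reported offset gives you a candidate file for further static analysis.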

Closing Thoughts

Static analysis can generate useful data about a file, but it can also help direct your reverse engineering efforts. While running the tools mentioned above may get you the information you need, I encourage you to check out the source code and customize it based on your preferences. In particular, if you’re just getting started with Python, tweaking this code can serve as a great introduction and motivate further study.

If you would like to learn more about malware analysis strategies, join me at an upcoming SANS FOR610 course.

-Anuj Soni

