Improve Software Debugging with Binary Analysis

One of the seriously underutilized tools of the trade in the software development world, at least in my experience, has been binary analysis. We have linters, unit tests, correctness proofs, and static analysis tools to help catch bugs in our software. However, when a bug inevitably pops up that escapes all these checks, it could be hard to fix. Binary analysis can enhance our debugging toolkit by catching bugs that stem from the compiler. While most binary analysis is done in the field of security, many of those principles can be brought into normal software development to fix hard-to-detect problems as well.

For example, I came across a problem recently where a program’s floating-point output would differ across different compilers and operating systems. The code compiled in both cases were the same, except that the computers were running dfferent operating systems, and had different versions of the compiler. While we only had support for one operating system, we also used the other operating system for testing purposes, since it had packages in its repo that we needed and weren’t on the other one. So, Dockerizing the program was not practical.

Since the only major difference between the two builds was the compiler and CPU, I suspected that it had something to do with the optimization flags, and the binary that the compiler creates. So, I opened up a binary analysis software to take a look at the offending .so files.

The two files were virtually identical, with similar control flow graphs and instruction sets. What stood out, however, was that the original binary was using lots of movapd where the other binary was using fmul. Hmm, movapd is a SSE2 instruction, while fmul is a x87 instruction. SSE2 is a much newer (and faster, if used correctly) instruction set than x87, with the former being released in 2000 and the latter being released in 1980. It seems like our two compiler versions had understood -O2 to mean different things.

As it turns out, GCC automatically enables SSE2 instructions for 64-bit programs with the argument -msse2. However, since we were building a 32-bit program, it defaulted to x87 instructions. This makes sense for general purpose computing, since the x86 was introduced in 1985, and there’s no guarantee that the processor would support the instrution set if it was 32-bit. Since the first AMD64 processor was released in 2003, and the first IA-64 processor in 2001, it was reasonable to assume that an IA-64 or AMD64 CPU would be able to handle SSE2. The difference here arises from how SSE2 calculates floating point operations at double precision, but x87 uses higher precision 80-bit precision values for intermediate operations.

This was a long aside, but this little bug was easily solved by adding a -msse2 to all builds, since we were pretty sure that none of our machines was running on a pre-2000 CPU. However, we would’ve never found this bug, and probably would have resorted to building all our dependencies manually, if we didn’t realize that the compilers created different binaries. So, hopefully having convinced you of the importance of binary analysis, I’ll do my best to run through the basics of it.

Frameworks, apps, and more

There’s quite a few binary analysis frameworks out there, each with their strengths. Some of the more common ones are:

IDA Pro, a popular and powerful proprietary tool with a limited free version. It’s the first tool I use to analyze a binary, though often I will have to use another tool to continue working with it, since the free version doesn’t have a decompiler. I’ve heard good things about the paid addons though.
Ghidra, a software reverse engineering tool with a very good decompiler. I also really like the binary diff feature, which works with assembly instructions. It’s made by the NSA.
angr, which is more of a Python library that happens to work well in a Python shell too. It’s great for writing automated code that works with binary code. It also has a very powerful solver engine that allows you to figure out the inputs of a program given the output. There’s a GUI frontend in development now, but it’s in its early stages.
radare2, which works very well in a CLI, and happens to have a GUI frontend called Cutter. I use it to run through code with the debugging feature.

While I personally use these tools for some specific features that they have, each of these tools have nearly the same functionality, though it may buried behind a different menu bar or shell command. It’s probably best to try each one out, and select the ones that you like best.

The Binary

The binary output of gcc or clang (or any compiler that compiles to assembly) is an ELF executable. By default this executable will be named a.out, but that can be overridden with -o. Simply put, ELF is a common standard for how data is arranged in a binary program that divides the file into segments. These include (in order):

The ELF Header, which contains basic information about the executable such as memory segments, sections, and other essential information.
The Program Header, which contains information about how to execute the program, such as where to load the instructions and the size of it.
Sections, which are of different types. For example, the .data section contains numerical constants, while the .rodata section contains constant strings. .bss sections contain static variables that have not been initialized. The most important section is the .text section, which contains the program’s assembly instructions.
- Debug Sections can be found in some ELF executables as well. These are DWARF debug sections that can be used with GDB or Valgrind. They’re identified with headers named .debug_*.

@corkami created a very nice visualization of the layout of an ELF file here:

Program files are not the only type of files that utilize ELF. Library files with .so and .a extensions also use ELF to arrange their data. This format plays and important part in getting the dynamic linker to work properly, by listing the required libraries in the .interp section.

By looking into specific sections of the ELF executable, we can find constants that are used in the code, as well as check for debug data in the executable.

The Control Flow Graph

What we saw above was the physical layout of the executable, which is how it is stored on the disk. However, a more interesting layout of the executable (at least for binary analysis) is the program’s control flow graph, or CFG. The CFG is a representation of the program’s behaviour, with blocks of commands as nodes, and instructions like jmp, return, and call being the edges. This gives us an easy way to understand the behaviour of the program, and we can often match functions to nodes in the program.

For example, take a look at this program, compiled with g++:

#include <iostream>
#include <cstdlib>

void do_thing() {
  std::cout << "thing" << std::endl;
}

void do_other_thing() {
  std::cout << "beep boop" << std::endl;
}

int main() {
  // Generate and compare 2 random numbers
  auto a = rand() % 10 + 1;
  auto b = rand() % 10 + 1;

  if (a < b)
    do_thing();
  else
    do_other_thing();

  return 0;
}

When we open the control flow graph of main() in IDA, we see this:

You can see how each set of instructions generates a and b, and particularly the call to rand() with the call _rand assembly instruction. We also see that the function splits into two at the if statement, jumping to either calling the _do_thing or do_other_thing() function, and then finally returning and ending the program after the function is called. The CFG is the easiest way to examine how an unknown binary works, and provides good intuition into what a program does.

Now, let’s see what happens when we run it again, but this time with g++ -O3:

This time, we see that the random number generation takes a lot less instructions than before. In this case, -O3 compared the two random numbers in the stack, instead of saving them into storage and then loading them again. That gets rid of the mov commands using the rbp registers, so in essense main() in the assembly code has become:

if (rand() % 10 + 1 < rand() % 10 + 1)
  do_thing();
else
  do_other_thing();

Let’s take another look at it, this time using the Intel C++ Compiler. We compile it this time with icpc -march=icelake -O3:

We can see that this time, our compiler inlined the single line do_thing() and do_other_thing() functions. We also see it using registers like r9d, which are 64-bit only, and an AVX instruction too, vstmxcsr. Intel’s compiler does this often since it’s used heavily in high performance computing.

These three compilations all ended up with different binaries, but as we can see the control flow graphs of these programs are almost the same. This shows why the CFG is so important in binary analysis. Not only does it make assembly easier to read, but we can spot similar functions using this as well.

Analyzing a Real CFG

I’m going to analyze hyx, a command-line hex editor with vim keybindings, as an example here. This is mostly due to the small size of the code base, the fact that the source code is in C so it’s easier to analyze, and the fact that it compiles into a single binary with no library dependencies. To start, we’ll take a look at the ELF headers with readelf:

~/hyx-2020.06.09
⟩ readelf -hl hyx
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0xd450
  Start of program headers:          64 (bytes into file)
  Start of section headers:          231864 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         11
  Size of section headers:           64 (bytes)
  Number of section headers:         34
  Section header string table index: 33

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x0000000000000268 0x0000000000000268  R      0x8
  INTERP         0x00000000000002a8 0x00000000000002a8 0x00000000000002a8
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000000caf0 0x000000000000caf0  R      0x1000
  LOAD           0x000000000000d000 0x000000000000d000 0x000000000000d000
                 0x0000000000011cf1 0x0000000000011cf1  R E    0x1000
  LOAD           0x000000000001f000 0x000000000001f000 0x000000000001f000
                 0x0000000000002110 0x0000000000002110  R      0x1000
  LOAD           0x0000000000021b88 0x0000000000022b88 0x0000000000022b88
                 0x000000000000a600 0x000000000000a858  RW     0x1000
  DYNAMIC        0x0000000000021b98 0x0000000000022b98 0x0000000000022b98
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x00000000000002c4 0x00000000000002c4 0x00000000000002c4
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x000000000001ffdc 0x000000000001ffdc 0x000000000001ffdc
                 0x000000000000033c 0x000000000000033c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000021b88 0x0000000000022b88 0x0000000000022b88
                 0x0000000000000478 0x0000000000000478  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .text .fini 
   04     .rodata .eh_frame_hdr .eh_frame 
   05     .init_array .fini_array .dynamic .got .data .bss 
   06     .dynamic 
   07     .note.gnu.build-id .note.ABI-tag 
   08     .eh_frame_hdr 
   09     
   10     .init_array .fini_array .dynamic .got 

We quickly see that this file is dynamically linked, since we have a .dynsym, .dynstr, and .dynamic section. That helps us, since we will be able to find API or cstdlib functions that are called in the binary and match them to the source. Furthermore, the presense of the .eh_frame_hdr and .eh_frame sections show that DWARF debug information is present. Note that in this case I used make debug to build hyx instead of just make, so we can preserve function names when we do the analysis. That’s why we will see ubsan function calls later.

We’re going to use the Cutter frontend of radare2 instead of IDA to examine the control flow graph this time, just because I like the color scheme better. Our main() function looks like a tentacle monster:

Instantly, we can tell that there are two loops somewhere in this function, since we see a green arrow on the left column and a blue arrow on the right column that creates cycles in the call graph. Looking in the source code, we can see those two loops in hyx.c:142-151 and hyx.c:177-190. This is helpful since it shows us that everything else in that cycle is part of the loop. Let’s take a closer look at the right side of the CFG:

Note that the only appearance of strcmp in the function is in the for loop at hyx.c:142-151. Therefore, we can assume that this entire cycle is what reads the arguments:

for (size_t i = 1; i < (size_t) argc; ++i) {
    if (!strcmp(argv[i], "-h") || !strcmp(argv[i], "--help"))
        help(0);
    else if (!strcmp(argv[i], "-v") || !strcmp(argv[i], "--version"))
        version();
    else if (!filename)
        filename = argv[i];
    else
        help(EXIT_FAILURE);
}

Now, on to the left side, which features an if statement:

We can match the calls to fileno and isatty in the topmost node of the cfg, as well as the blob_load at the bottommost node of the cfg to this snippet:

blob_init(&blob);
if (!isatty(fileno(stdin))) {
    if (filename) help(EXIT_FAILURE);
    blob_load_stream(&blob, stdin);
    if (!freopen("/dev/tty", "r", stdin))
        pdie("could not reopen controlling TTY");
}
else {
    blob_load(&blob, filename);
}

Notice here that passing blob by reference to blob_init was done with the lea rax, blob command and then calling the function. That assembly command stores the location of blob in rax, much like a pointer in normal C. This resemblance by design, since x86 was made to support languages like C.

The other sections of the if statement is self-explanatory. The middle two nodes with only two commands check if the filename exists, and if not prints out the error message. blob_load_stream and the following if statement are merged into one node.

As you can see, it’s surprisingly easy to figure out how the program works in assembly, even when we don’t build it with debug symbols. This really helps in debugging a program without those symbols as well, since we can still trace through the program even when GDB fails us. As we will soon see, symbolic execution can complement debuggers and sometimes even replace them when debug symbols are absent.

Symbolic Execution

This is one of the most important tools in a reverse engineering arsenal. It allows us to take an expected ouput, and figure out the inputs needed for the said output to show. This is especially useful for debugging edge cases where the output is known but the input is not. Let’s take a look at a program like this one:

#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
  // Fail if we don't have exactly one argument
  if (argc != 1)
    return 1;

  char nice[7] = "TOASTY", input[100];

  // We don't want to change the EOL operator
  for (int i = 0; i < 6; i++)
    nice[i] = nice[i] + i;

  printf("Enter something: \n");
  scanf("%s", input);

  if (strcmp(nice, input) != 0)
    printf("u suck\n");
  else
    printf("good job\n");

  return 0;
}

Now, if we put in the correct string, we will get the program to print out good job. However, the string is not exactly TOASTY anymore after the addition we did. Maybe the CFG can help us figure out the right input. Using radare2 again to generate an aesthetic CFG:

Here we see the if statement branching into one of two possible branches depending on the results of a call to the strcmp function, as expected. But something happens to the rsi pointer that stores TOASTY and it is now no longer TOASTY. We’ll have to use symbolic execution to get the correct input.

First, let’s take a look at the addresses of the “good job” command in the binary by right-clicking on the lea rdi, str.good_job assembly instruction. This happens at 0x0000122c for me. The same method shows that the lea rdi, str.u_suck instruction for the incorrect input is at 0x0000121e. That gives us the addresses that the symbolic executor will want to find and the ones that it will want to avoid. Now, we can do a symbolic execution run using angr in a Python 3 shell:

>>> import angr
>>> p = angr.Project('toasty-test')
WARNING | 2020-11-05 18:08:11,453 | cle.loader | The main binary is a position-independent executable. It is being loaded with a base address of 0x400000.
>>> p.factory.block(p.entry).pp()
0x401130:	endbr64	
0x401134:	xor	ebp, ebp
0x401136:	mov	r9, rdx
0x401139:	pop	rsi
0x40113a:	mov	rdx, rsp
0x40113d:	and	rsp, 0xfffffffffffffff0
0x401141:	push	rax
0x401142:	push	rsp
0x401143:	lea	r8, [rip + 0x156]
0x40114a:	lea	rcx, [rip + 0xdf]
0x401151:	lea	rdi, [rip - 0xe8]
0x401158:	call	qword ptr [rip + 0x2e8a]
>>> sm = p.factory.simulation_manager()
>>> sm.explore(find=0x40122c, avoid=0x40121e)
WARNING | 2020-11-05 18:08:58,679 | angr.storage.memory_mixins.default_filler_mixin | The program is accessing memory or registers with an unspecified value. This could indicate unwanted behavior.
WARNING | 2020-11-05 18:08:58,679 | angr.storage.memory_mixins.default_filler_mixin | angr will cope with this by generating an unconstrained symbolic variable and continuing. You can resolve this by:
WARNING | 2020-11-05 18:08:58,679 | angr.storage.memory_mixins.default_filler_mixin | 1) setting a value to the initial state
WARNING | 2020-11-05 18:08:58,679 | angr.storage.memory_mixins.default_filler_mixin | 2) adding the state option ZERO_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to make unknown regions hold null
WARNING | 2020-11-05 18:08:58,679 | angr.storage.memory_mixins.default_filler_mixin | 3) adding the state option SYMBOL_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to suppress these messages.
WARNING | 2020-11-05 18:08:58,680 | angr.storage.memory_mixins.default_filler_mixin | Filling memory at 0x7fffffffffeff40 with 8 unconstrained bytes referenced from 0xa8f900 (strcmp+0x0 in libc.so.6 (0x8f900))
<SimulationManager with 1 found, 1 avoid>
>>> sm.found[0].posix.stdin.concretize()[0]
b'TPCVX^\x00\x02I\xe0\x02\x89\x80\x02\xa4\x08\x02)\x08\x89\x8aI*J\x02\x89I\x89\x19\x01\x89I\x10\x19\x89\x02\x02\x02\x02\x01\x08\x02\x02\x08\x01\x02\x00\x02\x01\x01\x02\x01\x01\x01\x02\x01\x01\x01\x00\x02'

And there you have it. At the last line, entering TPCVX^ will provide us with the good job message. Everything after the \00 character can be removed, since that’s after the null operator. Notice how we found that the instruction for an incorrect input was at 0x0000121e, but we set the instruction in angr to 0x40121e. That’s because the cle loader that angr uses starts the executable at a base address of 0x400000, so our instruction offset we got above would become 0x400000 + 0x121e = 0x40121e when we use it in angr.

While the above code did seem like brute-forcing it until the correct assembly line was hit, the way angr and other symbolic execution frameworks solves problems like this is much more ingenious. By reading in the assembly code and interpreting it while keeping all inputs to the program as a symbol (like in the algebra sense), we can arrive at an expression for each input, with the contraints on the expression being the conditional statements that the program goes through.

Symbolic execution is an especially good use case for testing software, since it’s often hard to find the correct inputs to trigger the outputs that we may want to test. It’s kind of like a correctness proof but with less mathematical rigor. Automating symbolic exection and CFG analysis is easy too, especially with angr and radare2. Ghidra’s automation requires Java and a GUI, which is much harder, while IDA Pro’s automation costs a lot. This provides lots of opportunities for new ways to test code, espeically in places where the code coverage is low.

Conclusion

Binary analysis is already an invaluable tool for malware researchers and is pretty active as a research area. But it also has its uses for developers, and comes in especially handy when dealing with bugs caused by compilers. It can also be used for working out a bug without debug symbols, where GCC may not be of much use. It’s worth it to learn if only to understand how to better optimize code for compilers. It’s also very interesting (and free) to poke around at binaries, so why not try it?