Nov 19, 2025 •

Melissa Eckardt

Active since: 2019

Passionate about reverse engineering, loves deep dives into low-level affairs.

• malware,deobfuscation

~ Content warning: This post contains liberal amounts of x86 assembly! ~

Intro

Rhadamanthys is a notorious stealer malware that has been around since 2022. Check Point Research has recently published a very detailed blog post about changes in the latest Rhadamanthys version. If you’re interested in the stealer’s capabilities - or if you always wanted to know what a dark web sales site for malware looks like - definitely check out their post.

What piqued my interest is not the malware itself, but its native loader. The loader is interesting because it implements some rather complicated anti-sandboxing/anti-AV-emulation measures. Furthermore, the payload is encoded using a custom binary-to-string encoding that the malware authors try to keep hidden. This causes vast parts of the loader binary to contain gibberish ASCII text.

But most importantly, the loader is obfuscated using different layered techniques, making static analysis extremely hard. If you’ve followed my posts in the past (especially the ones about Pyarmor and .NET), you know I always like a good deobfuscation challenge - so let’s see what we’re dealing with in this instance!

Setting the stage

The sample that we will work on for the rest of the post is 8f54612f441c4a18564e6badf5709544370715e4529518d04b402dcd7f11b0fb. It’s packed, so you need to unpack it to see the actual Rhadamanthys loader.

When we open up the main function, we’re greeted by this ghastly sight.

.text:0040A560~0040A5C7 ... Normal(ish) function prologue
.text:0040A5C8                 jmp     short loc_40A5F0
.text:0040A5CA ; ----------------------------------------------------
.text:0040A5CA                 cmp     ebx, 19A7218Ah
.text:0040A5D0                 lea     eax, off_48B604
.text:0040A5D6                 lea     edi, off_48B850
.text:0040A5DC                 cmovz   eax, edi
.text:0040A5DF                 mov     eax, [eax]
.text:0040A5E1                 add     eax, esi
.text:0040A5E3                 jmp     eax
.text:0040A5E3 ; ----------------------------------------------------
.text:0040A5E5                 align 10h
.text:0040A5F0
.text:0040A5F0 loc_40A5F0:               ; CODE XREF: sub_40A560+68↑j
.text:0040A5F0                 cmp     ebx, 0BB2D365h
.text:0040A5F6                 mov     eax, edx
.text:0040A5F8                 lea     edi, off_48B8A4
.text:0040A5FE                 cmovl   eax, edi
.text:0040A601                 mov     eax, [eax]
.text:0040A603                 add     eax, esi
.text:0040A605                 jmp     eax
.text:0040A605 sub_40A560      endp  ; IDA thinks the function ends here, but it's far from the end
.text:0040A605
.text:0040A607 ; ----------------------------------------------------
.text:0040A607                 lea     eax, off_48B758
.text:0040A60D                 cmp     ebx, 357A351Eh
.text:0040A613                 jge     short loc_40A680
.text:0040A615                 mov     eax, [eax]
.text:0040A617                 add     eax, esi
.text:0040A619                 jmp     eax

The rest of the function looks more or less the same. Blocks that perform actual work, such as calling other functions, are slightly larger. In total, this main function has almost 3000 instructions and hundreds of blocks.

Two things are at play here:

Jump target obfuscation: The target addresses are hidden using a memory load and addition with a key register.
Control flow flattening: The blocks we see above are actually part of a typical CFF dispatcher. CFF is a technique where direct relations between basic blocks are removed. Instead, each time you have some sort of path decision, you set a state variable (ebx in this case). Then you jump into the dispatcher, which will eventually execute the correct block. At a higher level, you can think of it like a huge switch statement, where the cases are obnoxious high entropy numbers like 0x357A351E.
The code at hand is basically doing a binary search until it finds a target block where ebx matches exactly (je instruction, not depicted above).

Together, those techniques form a potent combination. While control flow flattening alone is already annoying, at least you can decompile the code and see something. With the jump obfuscation on top of it, disassemblers and decompilers will only see a bunch of apparently unrelated blocks - no children, no parents. This obviously completely breaks decompilation and any control flow graph analysis.

Unrelated to the above, another pattern that occurs somewhat frequently in the code is this:

call    sub_40F830
mov     ecx, dword_48AE7C
mov     edx, 0C4D77A40h
add     ecx, edx
cmp     eax, ecx

The function result eax is compared against a value in ecx, which is again obtained using a memory load and an addition. This is essentially constant obfuscation, which has been applied to all constants in the program (except for ones belonging to the obfuscator itself, such as the control flow state variable).

The memory load presents an additional twist: if the initial mov ecx had simply been another constant like edx, decompilers would automatically fold the addition as part of their built-in optimizations. But with the load, they typically won’t do so even if the referenced memory were to be marked read-only.

Deobfuscation

An important first insight is that the obfuscation works on a per-function level. Blocks belonging to one function are always contained in the function bounds; there’s no crazy jumping across the whole program. You may be wondering how one is supposed to determine the function bounds - after all, we don’t have a coherent control flow graph. Luckily, this one is simple: Functions touched by the obfuscator only have a single exit point. They always have a ret at their very end.

Any deobfuscation attempt should thus also focus on one function at a time.

First (failed) attempt

While browsing through the code in order to come to a decision on how to deal with it, I stumbled upon cases like this:

.text:0040BECC ; ---------------------------------------------------
.text:0040BECC                 mov     ebx, [esp+40h]
.text:0040BED0                 cmp     ebx, 0BB2D365h
.text:0040BED6                 mov     eax, edx
.text:0040BED8                 lea     edi, off_48B8A4
.text:0040BEDE                 cmovl   eax, edi
.text:0040BEE1                 mov     eax, [eax]
.text:0040BEE3                 add     eax, esi
.text:0040BEE5                 jmp     eax

As can be seen, the jump target depends on cmovl, which will be either edx (via eax) or edi. But what’s edx? It’s not set in this block, and the block doesn’t have any direct parents.

In the general case, one would have to assume edx can be arbitrarily set in one of the (unknown) parent blocks. For this reason, I chose to try symbolic execution. This technique can simulate everything from the beginning of the function, meaning we should always have some sort of value for the registers (they may be symbolic, but for those obfuscator pointers like edx, I would expect integers).

However, after trying both Triton and angr, I figured that symbolic execution is not suitable here, at least not in the fashion I intended. Due to the size of some of the functions in the malware, path explosion becomes a real problem. While the symbolic exploration is working in principle and gives us correct destinations for each indirect jump, it simply doesn’t finish in any acceptable timeframe.

With the insight gained during the symbolic execution attempts and some further manual exploring, it turned out I needn’t have worried about cases such as with the edx register above. For this obfuscation scheme, if a register is used in a block and has not been assigned in said block or a discernible parent, it will always have the value that was assigned to it in the function prologue block. In order to handle such situations, it’s thus enough to create a lookup with all constant register assignments that happen in the prologue block.

Static deobfuscation

Seeing the snippet from the initial section, you may be tempted to say “just define regexes for the 3-4 different patterns & be done with it”. Unfortunately, it’s not that simple. The obfuscation was most likely implemented as a compilation pass (possibly in LLVM), and the compiler has quite a bit of leeway in how it actually arranges and emits the obfuscated code.

You will encounter all of these situations and more:

Instructions can be scheduled however the compiler pleases; you can have completely unrelated (legit) instructions right before jmp eax.
The jmp register is not always eax, it differs per function.
The compiler may have had to temporarily save one of the registers involved in the obfuscation, so you’ll see save/restore pairs. Except in some situations, the saves will be in parent blocks (thankfully, those parents will be explicit and not hidden).
This can also affect the key register for the add, it can be temporarily aliased.
The compiler may have found it convenient to put an incoming branch edge into the middle of the pattern, e.g., towards a mov eax, [eax].

One way to deal with all of this is to use a technique called backwards data slicing. It collects a minimal subset of instructions that influence a given value (such as the jmp register). This is easier to reason about, especially when there are other “noisy” instructions sprinkled into a block. In case of incoming branches, we split the analysis and track the branch source. For the jump obfuscation, it’s enough to do this for a depth of 1 and ignore everything incoming if we’re already on a “split path”.

Here’s an example:

.text:0040BEE7                 lea     eax, off_48B5B0
.text:0040BEED                 cmp     ebx, 0F6A636EFh
.text:0040BEF3                 jl      short loc_40BEFB
.text:0040BEF5                 lea     eax, off_48B6CC
.text:0040BEFB
.text:0040BEFB loc_40BEFB:            ; CODE XREF: .text:0040BEF3↑j
.text:0040BEFB                 mov     eax, [eax]
.text:0040BEFD                 add     eax, esi
.text:0040BEFF                 jmp     eax

Slices:
[<mov eax, [eax]>, <add eax, esi>, <jmp eax>]  ; common path
[<lea eax, off_48B6CC>]  ; path "jcc not taken"
[<lea eax, off_48B5B0>, <jl short loc_40BEFB>]  ; path "jcc taken"

The branch target at 0x40BEFB causes a split into the two paths, where the register diverges.

For conditional moves (cmov), it works slightly differently:

.text:0040ADF2                 mov     ebx, 0F6A636EFh
.text:0040ADF7                 cmp     ebx, 0BB2D365h
.text:0040ADFD                 mov     eax, edx
.text:0040ADFF                 lea     edi, off_48B8A4
.text:0040AE05                 cmovl   eax, edi
.text:0040AE08                 mov     eax, [eax]
.text:0040AE0A                 add     eax, esi
.text:0040AE0C                 mov     edi, edx
.text:0040AE0E                 mov     edx, [esp+2Ch]
.text:0040AE12                 mov     [esp+44h], edx
.text:0040AE16                 mov     edx, edi
.text:0040AE18                 jmp     eax

Slices:
[<lea edi, off_48B8A4>, <cmovl eax, edi>, <mov eax, [eax]>, <add eax, esi>, <jmp eax>]  ; cmov did move (cc true)
[<mov eax, edx>, <cmovl eax, edi>]  ; cmov did not move (cc false)

On the main path, we pretend the cmov executed its move (edi is used), while the secondary slice represents what happens if it doesn’t move (eax is used).

In the examples above, the first element of the slices is the one of interest (i.e., the offset we need to dereference for mov eax, [eax]). This holds true in about 95% of the cases, but sometimes there are other instructions, e.g., there could be a mov to esi, since that register is also part of the overall computation. It might thus be beneficial to get another sub-slice starting at mov eax, [eax].

Note how we have edx here again in the last slice rather than an address, so we need to apply the lookup trick described in the previous section.

Lastly, there is this pattern:

.text:0040AE26                 xor     eax, eax
.text:0040AE28                 cmp     ebx, 13B0D3B2h
.text:0040AE2E                 setz    al
.text:0040AE31                 shl     eax, 6
.text:0040AE34                 mov     eax, off_48B650[eax]
.text:0040AE3A                 add     eax, esi
.text:0040AE3C                 jmp     eax

Slices:
[<xor eax, eax>, <setz al>, <shl eax, 6>, <mov eax, off_48B650[eax]>, <add eax, esi>, <jmp eax>]

The compiler probably emits this instead of cmov when the two addresses can be represented using such a set-condition-code-and-shift pattern. The dereferenced address is either 0x48B650 (zero flag not set) or 0x48B690 (zero flag set → 1<<6 == 0x40 is added).

You may be wondering what all the jump instructions are doing in the slices. They’re included for pure convenience, because as it happens, the set union of all slices represents all instructions that can be safely overwritten when we patch things up.

The patches themselves are pretty straight-forward. We assemble a jcc and jmp, where the condition code is either taken from an existing jcc, from a cmovcc, or from a setcc. The jump destinations are computed from the values we obtained from our data slices and from the function prologue (for the key register and any other random registers).

While the instructions that we overwrite generally provide enough space in total, they’re not always coherent. There may be unrelated instructions in between them. In such cases, some careful shuffling needs to be done in order to consolidate the “free space” by moving legit instructions out of the way.

The jump deobfuscation logic can be found in deob_jumps.py, with some utility code for instruction shifting and rewriting being in deob_util.py.

Results thus far

Functions now have visible control flow and can be decompiled! It’s still not exactly pretty, but definitely progress compared to before. We can work with this.

                  if ( v5 >= 2025505611 )
                  {
                    if ( v5 == 2025505611 )
                    {
                      TranslateMessage(&Msg);
                      DispatchMessageA(&Msg);
                      v7 = (int)&unk_48B77C;
                      v5 = -1113969622;
                      goto LABEL_132;
                    }
                    goto LABEL_5;
                  }
                  if ( v5 != 1949251370 )
                    goto LABEL_5;
                  v37 = *((_DWORD *)hMem + v21) + 4;
                  v23 = dword_48ADFC + 1222248860;
                  v6 = (int)a1;
                  v5 = 215978185;
                }
                else if ( v5 >= 1834411218 )
                {
                  if ( v5 >= 1920510847 )
                  {
                    if ( v5 != 1920510847 )
                      goto LABEL_5;
                    free((void *)Param[24]);
                    v7 = (int)&unk_48B77C;
                    v5 = 1750707187;
                  }

Or, if you’re a graphical person:

Control flow graph for loader main function (0x40A560).

(Very) small functions are problematic

The compiler apparently finds small functions “manageable” in terms of their optimization potential. It precomputes lots of stuff belonging to the obfuscation, merges blocks, and so on.

.text:00414FE8                 xor     edx, edx
.text:00414FEA                 xor     ebx, ebx
.text:00414FEC                 cmp     eax, 15BEFF06h
.text:00414FF1                 setl    dl
.text:00414FF4                 setnz   bl
.text:00414FF7                 add     edx, 0Bh
.text:00414FFA                 lea     esi, ds:5[ebx*4]
.text:00415001                 cmp     eax, 0E18D1F3Ch
.text:00415006                 mov     ebx, 0Dh
.text:0041500B                 mov     eax, 2
.text:00415010                 cmovz   ebx, eax
.text:00415013                 mov     [esp+14h], ebx

Then later on you get:

.text:00415048                 jmp     edi
.text:0041504A ; ----------------------------------------------------
.text:0041504A                 jmp     edx
.text:0041504C ; ----------------------------------------------------
.text:0041504C                 jmp     dword ptr [esp+10h]
.text:00415050 ; ----------------------------------------------------
.text:00415050                 mov     eax, dword_48CC00[ecx*4]
.text:00415057                 add     eax, ebp
.text:00415059                 jmp     eax
.text:0041505B ; ----------------------------------------------------
.text:0041505B                 mov     eax, [esp+1Ch]
.text:0041505F                 mov     eax, dword_48CC00[eax*4]
.text:00415066                 add     eax, ebp
.text:00415068                 jmp     eax

The condition processing all happens in a single block at the top of the function and is pretty much decoupled from the jumps. This is quite different from what we’ve been dealing with in larger functions. Our patching approach is not suitable here, because we’d have to pretty much rewrite the entire function from scratch.

In the end it’s not too big of a loss, though - these functions usually comprise only around 100 lines of assembly (most of which is obfuscator trash), so figuring out what they do by hand is doable.

Dealing with constants

Let’s take a break from the control flow stuff for a moment and deal with something easier.

For the constant obfuscation, there are basically two ways the memory load is done: Either there’s a mov with a load followed by an arithmetic operation with a constant - or it’s the other way around, where the arithmetic operation has the memory operand. So in essence, we just need to look out for all mov/movzx/add/sbb/sub/xor reg, [imm32] instructions.

One problem is that such instructions can definitely appear as part of the underlying program, so we need some way to differentiate which ones are from the obfuscator and which ones must remain untouched. As it happens, if you collect all addresses and sort them, the ones from the obfuscator are clustered together.

.data:0048AEB4 dword_48AEB4    dd 545AE7F4h    ; DATA XREF: sub_40A560+EF0↑r
.data:0048AEB8 dword_48AEB8    dd 372ECFB4h    ; DATA XREF: sub_40A560+EFD↑r
.data:0048AEBC dword_48AEBC    dd 9B874143h    ; DATA XREF: sub_40A560:loc_40C592↑r
.data:0048AEC0 dword_48AEC0    dd 7A9F2473h    ; DATA XREF: sub_40A560+203E↑r
.data:0048AEC4 dword_48AEC4    dd 381897B5h    ; DATA XREF: sub_40A560+205D↑r
.data:0048AEC8 byte_48AEC8     db 1            ; DATA XREF: sub_40A560+1DC2↑r
.data:0048AEC9                 align 4
.data:0048AECC dword_48AECC    dd 0D2B6F7Dh    ; DATA XREF: sub_40A560:loc_40AE3E↑r
.data:0048AED0 dword_48AED0    dd 0D9E7D1Bh    ; DATA XREF: sub_40A560+8EA↑r
.data:0048AED4 dword_48AED4    dd 0F813A068h   ; DATA XREF: sub_40C8B0+C↑r
.data:0048AED8 dword_48AED8    dd 66A48DFEh    ; DATA XREF: sub_40C8B0+A3↑r
.data:0048AEDC dword_48AEDC    dd 0E92E6D27h   ; DATA XREF: sub_40C8B0+73↑r
.data:0048AEE0 dword_48AEE0    dd 13A7AAC9h    ; DATA XREF: sub_40C8B0+17↑r

For this particular analysis, it makes sense to work at the program-level rather than individual functions. The xrefs are not strictly linear and the more complete your picture is, the better.

I found that valid addresses are at most 0x10 bytes away from each other, but sometimes there are gaps where there’s some other data in between clusters of obfuscator addresses. For this sample, if you mandate that a valid cluster needs at least 10 addresses following the 0x10 rule, you will catch all obfuscator addresses. Any outliers will be addresses belonging to the real code.

Once we have that, we can simply replace all memory operands with their immediate values. Patching is trivial, because in x86, reg, imm32 operands are always shorter or the same length as reg, [imm32] operands, so we can’t run into space problems.

I will forego actually fully folding the constants and leave that job to an optimizing decompiler. One reason for that is certainly that there are nasty occurrences such as this:

.text:0040A868                 mov     eax, dword_48AE28
.text:0040A86D                 lea     ecx, [esp+48Ch+var_3E4]
.text:0040A874                 xorps   xmm0, xmm0
.text:0040A877                 movups  xmmword ptr [eax+ecx-0C5DAC1Ch], xmm0
.text:0040A87F                 movups  xmmword ptr [eax+ecx-0C5DAC2Ch], xmm0

Replacing the mov eax is enough for IDA’s decompiler to be able to figure things out.

The constant inlining is implemented in deob_consts.py.

Control flow unflattening

The final problem is the flattening, so let’s see what can be done about that. I don’t want to see any huge-number-comparison spaghetti code anymore.

There are some IDA plugins for unflattening that are based on microcode manipulation, such as D-810, and more recently, hrtng (which contains an extended version of HexRaysDeob). They all fail to deal with this particular variant of flattening. I assume that it would be possible to make them compatible with some effort, but I’m not familiar with IDA microcode specifics and didn’t feel like getting into it for this project. Plus I figured I could re-use quite a bit of the rewriting code that we already came up with for the jump deobfuscation.

One (special?) thing about this flattener is that it doesn’t always go to the beginning of the dispatcher once it has assigned a new state variable. If it can find another suitable point in the dispatcher it can jump to, it will do so. This means that analyses looking at the parents of the “main” dispatcher block won’t work, because they’ll miss things.

I chose a rather simplistic approach that seems to work well for this particular flattener:

Look for comparisons against the state register that are followed by jz (or in rare cases jnz where things are flipped). Create a mapping “state key → target”.
Look for assignments to the state register. The assignment value is asserted to be contained in the mapping created in step 1. At the end of the block, create jump(s) to the real target(s).

Here’s an example of step 2 (unrelated instruction have been cut):

.text:0040B758                 cmp     [esp+48Ch+var_3B4], 0
.text:0040B760                 mov     ebx, 0B34CE2DFh  ; new state key
.text:0040B765                 jnz     loc_40C696
.text:0040B771                 cmp     ebx, 0BB2D365h
.text:0040B777                 jge     loc_40C6AD       ; never taken
.text:0040B77D                 jmp     loc_40B6C0

.text:0040C696                 mov     ebx, 0FA72F85Ah  ; new state key
.text:0040C6A1                 cmp     ebx, 0BB2D365h
.text:0040C6A7                 jl      loc_40B77D       ; always taken (then goes straight to 40B6C0)
.text:0040C6AD                 jmp     loc_40A607       ; dead

After rewriting:

.text:0040B758                 cmp     [esp+48Ch+var_3B4], 0
.text:0040B765                 jnz     loc_40C696
.text:0040B771                 cmp     ebx, 0BB2D365h  ; irrelevant leftover
.text:0040B777                 jmp     loc_40BCAF  ; destination for 0B34CE2DFh

.text:0040C696                 nop
                               ...
.text:0040C6A1                 jmp     loc_40C3AC  ; destination for 0FA72F85Ah

These sites are connected with a branch, but they are actually rewritten independently of each other; as you can see, jcc instructions unrelated to dispatching are left as-is (0x40B765). The part where it goes to the dispatcher with jge/jl/jmp can be ignored and overwritten.

Another example. Remember cmov? It has come back to haunt us.

.text:0040C4B4                 cmp     [esp+48Ch+var_448], 5
.text:0040C4B9                 mov     ebx, 2B8162DCh
.text:0040C4BE                 mov     eax, 456A4274h
.text:0040C4C3                 cmovz   ebx, eax
.text:0040C4C6                 cmp     ebx, 0BB2D365h
.text:0040C4CC                 jl      loc_40B6C0
.text:0040C4D2                 jmp     loc_40A607

After rewriting:

.text:0040C4B4                 cmp     [esp+48Ch+var_448], 5
.text:0040C4B9                 mov     ebx, 2B8162DCh  ; irrelevant leftover
.text:0040C4BE                 mov     eax, 456A4274h  ; irrelevant leftover
.text:0040C4C3                 jz      loc_40B199  ; destination for 456A4274h
.text:0040C4C9                 jmp     loc_40ADA2  ; destination for 2B8162DCh

As you can see, the condition code from the cmovcc is simply adapted for a jcc (jz here). It’s just a question of figuring out which operand of the cmov goes where (in logical terms), so you don’t swap the destinations by accident.

While this all seems “simple” enough, care still needs to be taken. Especially in larger blocks, there may be situations where the compiler put other flag-affecting instructions below the cmov. Then by the time you do your jcc, the flags won’t be what you expected anymore! For the most part, these were artifacts from the jump obfuscation that can be done away with by doing a more comprehensive cleanup in that previous deobfuscation phase. But it’s still something that needs to be checked for before rewriting, else you’re in for nasty logic bugs.

Lastly, there is a special case where the state register is not set to a constant, but to a stack variable. This is sometimes the case for decisions that can be evaluated in the function’s first block, so the branch condition and state keys will be up there. It poses quite a challenge, because in order to have a jcc in a block way down in the function, you also need to pull down the branch condition. That’s not always trivial, especially if there’s a comparison against some volatile register that is only valid in the first block.

; Up in the in the first block:
.text:0040A59D                 cmp     [ebp+arg_4], 0           ; we need this down at 40BEXX
.text:0040A5A1                 mov     eax, 0EC71CA67h
.text:0040A5A6                 mov     ecx, 0A0716E5Bh
.text:0040A5AB                 cmovz   ecx, eax
.text:0040A5AE                 mov     [esp+48Ch+var_44C], ecx

; Site of use (state var assignment):
.text:0040BECC                 mov     ebx, [esp+48Ch+var_44C]  ; load stack variable
.text:0040BED0                 cmp     ebx, 0BB2D365h
.text:0040BED6                 jl      loc_40B6C0
.text:0040BEDC                 jmp     loc_40A607

For simplicity’s sake, we’re going to make the following assumptions:

At sites where stack variables like that are used, the stack pointer has the same value as in the first block.
Relevant stack variables (such as arg_4 in the example) are not overwritten in the middle of the function. You may wonder why an argument would ever be overwritten (even if it’s not needed anymore in the rest of the function), but you’d be surprised how inventive compilers can get when looking for free space for temp vars.

These are pretty big assumptions that won’t hold in many programs, but in this case they seem to work and save us a lot of headaches.

That means comparisons such as cmp [ebp+arg_4], 0 can be moved as-is. In another case, we may find cmp ecx, 0 instead - then we need to track where ecx comes from. It could be another memory load like mov ecx, [ebp+arg_4], in which case we can simply propagate the operand and end up with cmp [ebp+arg_4], 0 again. But ecx could also be the result of a more complex computation or even a parameter if the fastcall calling convention is used. In such situations, we can employ a trick and re-use the var_44C variable that the obfuscator allocated. Instead of using it for the cmov result, we’ll place our ecx register into the stack var, so that we can load it again in the block further down. Obviously, the cmov and any dependent instructions need to be nopped for that to work.

The attentive reader may wonder what would happen if we had cmp ecx, edx and both registers were fastcall parameters. Yeah, that would suck; thankfully it never happens in the sample.

The code for the above-mentioned operand analysis can be found here.

So in summary, dealing with this particular flattening is not hard on a conceptual level. The bulk work is getting all the edge cases right.

Final results

At this point you probably want to see an updated graph. Unfortunately, we’re not nopping out all code belonging to the CFF dispatcher, so IDA will still create blocks for it. So we’ll simply pretend the top part of the CFG does not exist, okay? Okay.

Control flow graph for loader main function after unflattening — Looks enough like a CFG of a sane function to me!

The main function has turned out very pretty, despite being one of the longest functions in the program and the amount of obfuscation-related blocks it originally had:

  v18 = 0;
  v17 = 0;
  memset(Param, 0, sizeof(Param));
  v3 = -1603178917;
  if ( !lpMultiByteStr )
    v3 = -328086937;
  pNumArgs[3] = v3;
  if ( lpMultiByteStr )
  {
    v10 = MultiByteToWideChar(0, 0, lpMultiByteStr, -1, 0, 0);
    v11 = (const WCHAR *)calloc(v10, 2u);
    lpCmdLine = v11;
    if ( v11 )
    {
      pNumArgs[0] = 0;
      hMem = CommandLineToArgvW(lpCmdLine, pNumArgs);
      if ( hMem )
      {
        for ( i = 0; i < pNumArgs[0]; ++i )
        {
          if ( hMem[i] && lstrlenW(hMem[i]) == 66 && *hMem[i] == 45 && hMem[i][1] == 98 )
          {
            v26 = hMem[i] + 2;
            for ( j = 0; j < 64; j = j - 393718627 + 393718629 )
            {
              v7 = &v26[j];
              v8 = sub_40C8B0(*v7);
              v9 = sub_40C8B0(v7[1]);
              if ( v8 >= 0x10u || v9 >= 0x10u )
              {
                v21 = 5;
              }
              else
              {
                v32[j / 2] = v9 | (16 * v8);
                v21 = 0;
              }
              v24 = v21;
              if ( v24 == 5 )
                break;
            }
            if ( j == 64 )
              v18 = 1;
            else
              memset(v32, 0, sizeof(v32));
          }
        }
        LocalFree(hMem);
      }
      free((void *)lpCmdLine);
    }
  }
  v31 = &off_47B9C8;
  v28[1] = 0;
  v28[0] = 0;
  v29 = 0;
  v28[2] = (int)&unk_48A99C;
  v28[3] = (int)sub_40CDA0;
  v28[5] = (int)&v31;
  v28[6] = (int)hInstance;
  v28[7] = a3;
  v28[4] = (int)sub_40CF90;
  WndClass.style = 3;
  WndClass.lpfnWndProc = (WNDPROC)WndProc;
  WndClass.cbClsExtra = 0;
  WndClass.cbWndExtra = 0;
  WndClass.hInstance = hInstance;
  WndClass.hIcon = LoadIconA(0, (LPCSTR)0x7F00);
  WndClass.hCursor = LoadCursorA(0, (LPCSTR)0x7F00);
  WndClass.hbrBackground = (HBRUSH)GetStockObject(0);
  WndClass.lpszMenuName = 0;
  WndClass.lpszClassName = WindowName;
  dword_48E02C = (int)&dword_48E028;
  dword_48E028 = (int)&dword_48E028;
  chain[1] = (int)chain;
  chain[0] = (int)chain;
  Param[92] = 0;
  Param[23] = 0;
  Param[95] = 0;
  Param[24] = 0;
  Param[2] = (int)chain;
  Param[0] = 0;
  Param[22] = (int)&v17;
  Param[1] = (int)malloc(0x40000u);
  if ( Param[1] )
  {
    v19 = 1;
    RegisterClassW(&WndClass);
    memset((void *)Param[1], 0, 0x40000u);
    Param[31] = 0;
    Param[30] = 0;
    Param[4] = (int)v28;
    Param[3] = (int)&v31;
    sub_419808(WindowName, 28, &Param[13]);
    if ( v18 )
    {
      sub_405E50(v37);
      v36 = v37;
      sub_4069C0(v35, &Param[13]);
      sub_406B80(v35, 32, v32, &Param[5]);
    }
    else if ( sub_40F830()
           && MessageBoxW(0, L"Do you want to run a malware?\n(Crypt build to disable this message)", L"Warning", 0x34u) == 7 )
    {
      v19 = 0;
    }
    if ( v19 )
    {
      DesktopWindow = GetDesktopWindow();
      if ( DesktopWindow )
        GetWindowThreadProcessId(DesktopWindow, (LPDWORD)&Param[25]);
      Param[23] = (int)calloc(1u, 30008u);
      pNumArgs[2] = 13565952;
      pNumArgs[1] = 0x80000000;
      CreateWindowExW(
        0,
        WindowName,
        WindowName,
        0xCF0000u,
        0x80000000,
        0x80000000,
        500,
        300,
        HWND_MESSAGE,
        0,
        hInstance,
        Param);
      while ( GetMessageA(&Msg, 0, 0, 0) )
      {
        TranslateMessage(&Msg);
        DispatchMessageA(&Msg);
      }
      free((void *)Param[1]);
      if ( v17 && Param[4] )
      {
        v25 = v17;
        v17(v28);
      }
    }
  }
  if ( Param[24] )
    free((void *)Param[24]);
  if ( Param[95] )
    free((void *)Param[95]);
  if ( Param[23] )
    free((void *)Param[23]);
  while ( (int *)dword_48E028 != &dword_48E028 )
  {
    v4 = (_DWORD *)dword_48E028;
    v5 = *(_DWORD *)dword_48E028;
    v6 = *(_DWORD **)(dword_48E028 + 4);
    *v6 = *(_DWORD *)dword_48E028;
    *(_DWORD *)(v5 + 4) = v6;
    v22 = v4;
    if ( v4[2] )
      (*((void (__stdcall **)(_DWORD, _DWORD, int))v22 + 2))(*((_DWORD *)v22 + 4), 0, 0x8000);
    free(v22);
  }

There are only very few obfuscation artifacts left, e.g., the assignment to v3 in the beginning. In one case IDA failed to optimize a for loop update (where it says j = j - 393718627 + 393718629 instead of j += 2). We can live with that ¯\_(ツ)_/¯.

The ultimate test for a deobfuscator is to see if the deobfuscated application runs without issues. Unfortunately, at the time of writing, that’s not the case for the tooling we developed. There’s no outright crash and the initial message box appears, but a very subtle logic bug seems to prevent correct decoding of the payload.

So, what about the malware loader?

We went through all this effort to bring the code into a shape that can be reversed without too much pain. So let’s see what the Rhadamanthys loader is actually up to.

You can already glean some details from the main function printed in the previous section.

One of the first consequential things that happen is a function call to 0x40F830 followed by a very conspicuous message box if successful. As the message text implies, that function actually checks if the binary was not crypted/packed. This is implemented by doing some section parsing and reading the first 0x1000 .text bytes from disk; those are simply compared with what’s in memory.

Interestingly, the check disregards the fact that the binary was built with relocations and was flagged as movable. It works regardless, because the first function is huge and doesn’t use any addresses that would need to be relocated. The function was probably placed at that position intentionally - otherwise it’d be pure luck.

Other than that, there’s lots of windowing code to register a window class and create an invisible window. A structure called Param is initialized and pretty much contains the state for the rest of the program. It’s passed as the window’s user data pointer so that it can be retrieved later on. The rest of the main function doesn’t actually do much; it eventually enters a classic message loop in order to support the window.

Since a window is in play here, the next point of interest will usually be the so-called window procedure, or WndProc. This is called whenever the operating system has a message that either relates to the window directly or that could be relevant to the application in general (e.g., “the system is about to shut down”).

The procedure handles the following messages:

WM_CREATE: Called after the window has been created. The malware sets up a timer with ID 2.
WM_DESTROY: Called when the window is about to be destroyed. The malware calls PostQuitMessage to essentially end the process.
WM_TIMER: Timer callback.
WM_CAP_SET_CALLBACK_YIELD: Posted by the malware itself. Queues a function pointer.

From the above, you may be able to guess that the main functionality is actually implemented using timers. We find ourselves with a pretty convoluted scheme where many operations are executed as part of callbacks, and it’s not immediately obvious what happens in which order and under which conditions.

Timer ID 2 collects the current cursor position, foreground window and timestamp every 30 milliseconds, 1500 times. This causes a delay of at least 45 seconds until the actual payload is run.
Once it has collected everything, it will do a couple checks on the data. If the checks are successful, it queues function 0x40FBE0 and schedules timer ID 1. The checks are as follows:
- Whether the cursor position has changed at least 30 times.
- There must have been at least two foreground windows, and at least one of them must not belong to the desktop process.
- If the checks fail, the loader will enter another 45 seconds interval, and it will start doing advanced checks that include calculating Euclidean distances between cursor positions. We haven’t analyzed that part in detail.
Timer ID 1 runs any queued functions.
Function 0x40FBE0 posts WM_CAP_SET_CALLBACK_YIELD with parameter 0x411850; as described above, the handler for this message simply queues the parameter as a function pointer (to the queue that timer ID 1 is working on).
0x411850 is responsible for decoding and eventually running the stealer payload. It will post additional functions using WM_CAP_SET_CALLBACK_YIELD that do parts of the work.

These measures are designed to thwart execution in AV emulators and malware sandboxes. For example, if a sandbox sets a different cursor position every 3 seconds but doesn’t simulate “fluid” movement, the captured coordinates will not exceed the threshold of 30 within 45 seconds. CAPE and VMRay pass these checks fine, so in the end they aren’t too effective against common sandboxes.

Custom binary encoding

As promised in the intro, here’s a representation of the algorithm (function 0x416790) that is used to decode the payload strings into binary:

Input	Output
A-Z	0-25
a-z	26-51
0-9	52-61
!?#$%&()*+-,/:;<>=@[\]^`{\\|}~\n	62-90
`' '` + A-Z / a-z	91-142
`'.'` + A-Z / a-z / 0-9	143-204
`'_'` + A-Z	205-230
`'_'` + a-z	231-255 + 0

If you’re wondering what the + 0 is supposed to mean: 0 is encoded by A as well as _z. The malware doesn’t actually make use of this overflow “feature”, though.

While the first three ranges are reminiscent of Base64, the algorithm is completely different from base encodings. Up until 90, it makes use of a lookup consisting of special characters. For the rest up until 255, there’s a two-letter encoding scheme with a special prefix character that influences the output range.

I later found a post by Cormac Conlon stating that the .NET-based variant of the loader also uses this algorithm, where it’s called Flutter.

Note that in addition to this, the payload is also encrypted using a somewhat obscure Chinese block cipher called SM4. The tables/constants required by the algorithm (S-Box, FK, CK) are obscured in function 0x405E50; they’re only loaded temporarily when needed.

Outro

The Rhadamanthys operation has seen a disruption by Operation Endgame earlier this month. It remains to be seen if/how quickly they recover. A lot of data stolen by the stealer was also recovered by law enforcement and provided to Have I Been Pwned.

Our tooling that we developed as part of this research is available on GitHub. Perhaps it will prove useful in the future if the same obfuscator is used for other malware, or if a very similar one pops up to which the code can be adapted. Other than that, we hope the code can serve as a learning resource for how to implement a fast static deobfuscator utilizing the Capstone disassembly library.