Rhadamanthys Loader Deobfuscation
~ Content warning: This post contains liberal amounts of x86 assembly! ~
Intro
Rhadamanthys is a notorious stealer malware that has been around since 2022. Check Point Research has recently published a very detailed blog post about changes in the latest Rhadamanthys version. If you’re interested in the stealer’s capabilities - or if you always wanted to know what a dark web sales site for malware looks like - definitely check out their post.
What piqued my interest is not the malware itself, but its native loader. The loader is interesting because it implements some rather complicated anti-sandboxing/anti-AV-emulation measures. Furthermore, the payload is encoded using a custom binary-to-string encoding that the malware authors try to keep hidden. This causes vast parts of the loader binary to contain gibberish ASCII text.
But most importantly, the loader is obfuscated using different layered techniques, making static analysis extremely hard. If you’ve followed my posts in the past (especially the ones about Pyarmor and .NET), you know I always like a good deobfuscation challenge - so let’s see what we’re dealing with in this instance!
Setting the stage
The sample that we will work on for the rest of the post is 8f54612f441c4a18564e6badf5709544370715e4529518d04b402dcd7f11b0fb.
It’s packed, so you need to unpack it to see the actual Rhadamanthys loader.
When we open up the main function, we’re greeted by this ghastly sight.
.text:0040A560~0040A5C7 ... Normal(ish) function prologue
.text:0040A5C8 jmp short loc_40A5F0
.text:0040A5CA ; ----------------------------------------------------
.text:0040A5CA cmp ebx, 19A7218Ah
.text:0040A5D0 lea eax, off_48B604
.text:0040A5D6 lea edi, off_48B850
.text:0040A5DC cmovz eax, edi
.text:0040A5DF mov eax, [eax]
.text:0040A5E1 add eax, esi
.text:0040A5E3 jmp eax
.text:0040A5E3 ; ----------------------------------------------------
.text:0040A5E5 align 10h
.text:0040A5F0
.text:0040A5F0 loc_40A5F0: ; CODE XREF: sub_40A560+68↑j
.text:0040A5F0 cmp ebx, 0BB2D365h
.text:0040A5F6 mov eax, edx
.text:0040A5F8 lea edi, off_48B8A4
.text:0040A5FE cmovl eax, edi
.text:0040A601 mov eax, [eax]
.text:0040A603 add eax, esi
.text:0040A605 jmp eax
.text:0040A605 sub_40A560 endp ; IDA thinks the function ends here, but it's far from the end
.text:0040A605
.text:0040A607 ; ----------------------------------------------------
.text:0040A607 lea eax, off_48B758
.text:0040A60D cmp ebx, 357A351Eh
.text:0040A613 jge short loc_40A680
.text:0040A615 mov eax, [eax]
.text:0040A617 add eax, esi
.text:0040A619 jmp eax
The rest of the function looks more or less the same. Blocks that perform actual work, such as calling other functions, are slightly larger. In total, this main function has almost 3000 instructions and hundreds of blocks.
Two things are at play here:
- Jump target obfuscation: The target addresses are hidden using a memory load and addition with a key register.
- Control flow flattening: The blocks we see above are actually part of a typical CFF dispatcher. CFF is a technique where direct relations between basic blocks are removed. Instead, each time you have some sort of path decision, you set a state variable (
ebxin this case). Then you jump into the dispatcher, which will eventually execute the correct block. At a higher level, you can think of it like a huge switch statement, where the cases are obnoxious high entropy numbers like0x357A351E.
The code at hand is basically doing a binary search until it finds a target block whereebxmatches exactly (jeinstruction, not depicted above).
Together, those techniques form a potent combination. While control flow flattening alone is already annoying, at least you can decompile the code and see something. With the jump obfuscation on top of it, disassemblers and decompilers will only see a bunch of apparently unrelated blocks - no children, no parents. This obviously completely breaks decompilation and any control flow graph analysis.
Unrelated to the above, another pattern that occurs somewhat frequently in the code is this:
call sub_40F830
mov ecx, dword_48AE7C
mov edx, 0C4D77A40h
add ecx, edx
cmp eax, ecx
The function result eax is compared against a value in ecx, which is again obtained using a memory load and an addition.
This is essentially constant obfuscation, which has been applied to all constants in the program (except for ones belonging to the obfuscator itself, such as the control flow state variable).
The memory load presents an additional twist: if the initial mov ecx had simply been another constant like edx, decompilers would automatically fold the addition as part of their built-in optimizations.
But with the load, they typically won’t do so even if the referenced memory were to be marked read-only.
Deobfuscation
An important first insight is that the obfuscation works on a per-function level.
Blocks belonging to one function are always contained in the function bounds; there’s no crazy jumping across the whole program.
You may be wondering how one is supposed to determine the function bounds - after all, we don’t have a coherent control flow graph.
Luckily, this one is simple: Functions touched by the obfuscator only have a single exit point.
They always have a ret at their very end.
Any deobfuscation attempt should thus also focus on one function at a time.
First (failed) attempt
While browsing through the code in order to come to a decision on how to deal with it, I stumbled upon cases like this:
.text:0040BECC ; ---------------------------------------------------
.text:0040BECC mov ebx, [esp+40h]
.text:0040BED0 cmp ebx, 0BB2D365h
.text:0040BED6 mov eax, edx
.text:0040BED8 lea edi, off_48B8A4
.text:0040BEDE cmovl eax, edi
.text:0040BEE1 mov eax, [eax]
.text:0040BEE3 add eax, esi
.text:0040BEE5 jmp eax
As can be seen, the jump target depends on cmovl, which will be either edx (via eax) or edi.
But what’s edx? It’s not set in this block, and the block doesn’t have any direct parents.
In the general case, one would have to assume edx can be arbitrarily set in one of the (unknown) parent blocks.
For this reason, I chose to try symbolic execution.
This technique can simulate everything from the beginning of the function, meaning we should always have some sort of value for the registers (they may be symbolic, but for those obfuscator pointers like edx, I would expect integers).
However, after trying both Triton and angr, I figured that symbolic execution is not suitable here, at least not in the fashion I intended. Due to the size of some of the functions in the malware, path explosion becomes a real problem. While the symbolic exploration is working in principle and gives us correct destinations for each indirect jump, it simply doesn’t finish in any acceptable timeframe.
With the insight gained during the symbolic execution attempts and some further manual exploring, it turned out I needn’t have worried about cases such as with the edx register above.
For this obfuscation scheme, if a register is used in a block and has not been assigned in said block or a discernible parent, it will always have the value that was assigned to it in the function prologue block.
In order to handle such situations, it’s thus enough to create a lookup with all constant register assignments that happen in the prologue block.
Static deobfuscation
Seeing the snippet from the initial section, you may be tempted to say “just define regexes for the 3-4 different patterns & be done with it”. Unfortunately, it’s not that simple. The obfuscation was most likely implemented as a compilation pass (possibly in LLVM), and the compiler has quite a bit of leeway in how it actually arranges and emits the obfuscated code.
You will encounter all of these situations and more:
- Instructions can be scheduled however the compiler pleases; you can have completely unrelated (legit) instructions right before
jmp eax. - The
jmpregister is not alwayseax, it differs per function. - The compiler may have had to temporarily save one of the registers involved in the obfuscation, so you’ll see save/restore pairs.
Except in some situations, the saves will be in parent blocks (thankfully, those parents will be explicit and not hidden).
This can also affect the key register for theadd, it can be temporarily aliased. - The compiler may have found it convenient to put an incoming branch edge into the middle of the pattern, e.g., towards a
mov eax, [eax].
One way to deal with all of this is to use a technique called backwards data slicing.
It collects a minimal subset of instructions that influence a given value (such as the jmp register).
This is easier to reason about, especially when there are other “noisy” instructions sprinkled into a block.
In case of incoming branches, we split the analysis and track the branch source.
For the jump obfuscation, it’s enough to do this for a depth of 1 and ignore everything incoming if we’re already on a “split path”.
Here’s an example:
.text:0040BEE7 lea eax, off_48B5B0
.text:0040BEED cmp ebx, 0F6A636EFh
.text:0040BEF3 jl short loc_40BEFB
.text:0040BEF5 lea eax, off_48B6CC
.text:0040BEFB
.text:0040BEFB loc_40BEFB: ; CODE XREF: .text:0040BEF3↑j
.text:0040BEFB mov eax, [eax]
.text:0040BEFD add eax, esi
.text:0040BEFF jmp eax
Slices:
[<mov eax, [eax]>, <add eax, esi>, <jmp eax>] ; common path
[<lea eax, off_48B6CC>] ; path "jcc not taken"
[<lea eax, off_48B5B0>, <jl short loc_40BEFB>] ; path "jcc taken"
The branch target at 0x40BEFB causes a split into the two paths, where the register diverges.
For conditional moves (cmov), it works slightly differently:
.text:0040ADF2 mov ebx, 0F6A636EFh
.text:0040ADF7 cmp ebx, 0BB2D365h
.text:0040ADFD mov eax, edx
.text:0040ADFF lea edi, off_48B8A4
.text:0040AE05 cmovl eax, edi
.text:0040AE08 mov eax, [eax]
.text:0040AE0A add eax, esi
.text:0040AE0C mov edi, edx
.text:0040AE0E mov edx, [esp+2Ch]
.text:0040AE12 mov [esp+44h], edx
.text:0040AE16 mov edx, edi
.text:0040AE18 jmp eax
Slices:
[<lea edi, off_48B8A4>, <cmovl eax, edi>, <mov eax, [eax]>, <add eax, esi>, <jmp eax>] ; cmov did move (cc true)
[<mov eax, edx>, <cmovl eax, edi>] ; cmov did not move (cc false)
On the main path, we pretend the cmov executed its move (edi is used), while the secondary slice represents what happens if it doesn’t move (eax is used).
In the examples above, the first element of the slices is the one of interest (i.e., the offset we need to dereference for mov eax, [eax]).
This holds true in about 95% of the cases, but sometimes there are other instructions, e.g., there could be a mov to esi, since that register is also part of the overall computation.
It might thus be beneficial to get another sub-slice starting at mov eax, [eax].
Note how we have edx here again in the last slice rather than an address, so we need to apply the lookup trick described in the previous section.
Lastly, there is this pattern:
.text:0040AE26 xor eax, eax
.text:0040AE28 cmp ebx, 13B0D3B2h
.text:0040AE2E setz al
.text:0040AE31 shl eax, 6
.text:0040AE34 mov eax, off_48B650[eax]
.text:0040AE3A add eax, esi
.text:0040AE3C jmp eax
Slices:
[<xor eax, eax>, <setz al>, <shl eax, 6>, <mov eax, off_48B650[eax]>, <add eax, esi>, <jmp eax>]
The compiler probably emits this instead of cmov when the two addresses can be represented using such a set-condition-code-and-shift pattern.
The dereferenced address is either 0x48B650 (zero flag not set) or 0x48B690 (zero flag set → 1<<6 == 0x40 is added).
You may be wondering what all the jump instructions are doing in the slices. They’re included for pure convenience, because as it happens, the set union of all slices represents all instructions that can be safely overwritten when we patch things up.
The patches themselves are pretty straight-forward.
We assemble a jcc and jmp, where the condition code is either taken from an existing jcc, from a cmovcc, or from a setcc.
The jump destinations are computed from the values we obtained from our data slices and from the function prologue (for the key register and any other random registers).
While the instructions that we overwrite generally provide enough space in total, they’re not always coherent. There may be unrelated instructions in between them. In such cases, some careful shuffling needs to be done in order to consolidate the “free space” by moving legit instructions out of the way.
The jump deobfuscation logic can be found in deob_jumps.py, with some utility code for instruction shifting and rewriting being in deob_util.py.
Results thus far
Functions now have visible control flow and can be decompiled! It’s still not exactly pretty, but definitely progress compared to before. We can work with this.
if ( v5 >= 2025505611 )
{
if ( v5 == 2025505611 )
{
TranslateMessage(&Msg);
DispatchMessageA(&Msg);
v7 = (int)&unk_48B77C;
v5 = -1113969622;
goto LABEL_132;
}
goto LABEL_5;
}
if ( v5 != 1949251370 )
goto LABEL_5;
v37 = *((_DWORD *)hMem + v21) + 4;
v23 = dword_48ADFC + 1222248860;
v6 = (int)a1;
v5 = 215978185;
}
else if ( v5 >= 1834411218 )
{
if ( v5 >= 1920510847 )
{
if ( v5 != 1920510847 )
goto LABEL_5;
free((void *)Param[24]);
v7 = (int)&unk_48B77C;
v5 = 1750707187;
}
Or, if you’re a graphical person:
(Very) small functions are problematic
The compiler apparently finds small functions “manageable” in terms of their optimization potential. It precomputes lots of stuff belonging to the obfuscation, merges blocks, and so on.
.text:00414FE8 xor edx, edx
.text:00414FEA xor ebx, ebx
.text:00414FEC cmp eax, 15BEFF06h
.text:00414FF1 setl dl
.text:00414FF4 setnz bl
.text:00414FF7 add edx, 0Bh
.text:00414FFA lea esi, ds:5[ebx*4]
.text:00415001 cmp eax, 0E18D1F3Ch
.text:00415006 mov ebx, 0Dh
.text:0041500B mov eax, 2
.text:00415010 cmovz ebx, eax
.text:00415013 mov [esp+14h], ebx
Then later on you get:
.text:00415048 jmp edi
.text:0041504A ; ----------------------------------------------------
.text:0041504A jmp edx
.text:0041504C ; ----------------------------------------------------
.text:0041504C jmp dword ptr [esp+10h]
.text:00415050 ; ----------------------------------------------------
.text:00415050 mov eax, dword_48CC00[ecx*4]
.text:00415057 add eax, ebp
.text:00415059 jmp eax
.text:0041505B ; ----------------------------------------------------
.text:0041505B mov eax, [esp+1Ch]
.text:0041505F mov eax, dword_48CC00[eax*4]
.text:00415066 add eax, ebp
.text:00415068 jmp eax
The condition processing all happens in a single block at the top of the function and is pretty much decoupled from the jumps. This is quite different from what we’ve been dealing with in larger functions. Our patching approach is not suitable here, because we’d have to pretty much rewrite the entire function from scratch.
In the end it’s not too big of a loss, though - these functions usually comprise only around 100 lines of assembly (most of which is obfuscator trash), so figuring out what they do by hand is doable.
Dealing with constants
Let’s take a break from the control flow stuff for a moment and deal with something easier.
For the constant obfuscation, there are basically two ways the memory load is done: Either there’s a mov with a load followed by an arithmetic operation with a constant - or it’s the other way around, where the arithmetic operation has the memory operand.
So in essence, we just need to look out for all mov/movzx/add/sbb/sub/xor reg, [imm32] instructions.
One problem is that such instructions can definitely appear as part of the underlying program, so we need some way to differentiate which ones are from the obfuscator and which ones must remain untouched. As it happens, if you collect all addresses and sort them, the ones from the obfuscator are clustered together.
.data:0048AEB4 dword_48AEB4 dd 545AE7F4h ; DATA XREF: sub_40A560+EF0↑r
.data:0048AEB8 dword_48AEB8 dd 372ECFB4h ; DATA XREF: sub_40A560+EFD↑r
.data:0048AEBC dword_48AEBC dd 9B874143h ; DATA XREF: sub_40A560:loc_40C592↑r
.data:0048AEC0 dword_48AEC0 dd 7A9F2473h ; DATA XREF: sub_40A560+203E↑r
.data:0048AEC4 dword_48AEC4 dd 381897B5h ; DATA XREF: sub_40A560+205D↑r
.data:0048AEC8 byte_48AEC8 db 1 ; DATA XREF: sub_40A560+1DC2↑r
.data:0048AEC9 align 4
.data:0048AECC dword_48AECC dd 0D2B6F7Dh ; DATA XREF: sub_40A560:loc_40AE3E↑r
.data:0048AED0 dword_48AED0 dd 0D9E7D1Bh ; DATA XREF: sub_40A560+8EA↑r
.data:0048AED4 dword_48AED4 dd 0F813A068h ; DATA XREF: sub_40C8B0+C↑r
.data:0048AED8 dword_48AED8 dd 66A48DFEh ; DATA XREF: sub_40C8B0+A3↑r
.data:0048AEDC dword_48AEDC dd 0E92E6D27h ; DATA XREF: sub_40C8B0+73↑r
.data:0048AEE0 dword_48AEE0 dd 13A7AAC9h ; DATA XREF: sub_40C8B0+17↑r
For this particular analysis, it makes sense to work at the program-level rather than individual functions. The xrefs are not strictly linear and the more complete your picture is, the better.
I found that valid addresses are at most 0x10 bytes away from each other, but sometimes there are gaps where there’s some other data in between clusters of obfuscator addresses. For this sample, if you mandate that a valid cluster needs at least 10 addresses following the 0x10 rule, you will catch all obfuscator addresses. Any outliers will be addresses belonging to the real code.
Once we have that, we can simply replace all memory operands with their immediate values.
Patching is trivial, because in x86, reg, imm32 operands are always shorter or the same length as reg, [imm32] operands, so we can’t run into space problems.
I will forego actually fully folding the constants and leave that job to an optimizing decompiler. One reason for that is certainly that there are nasty occurrences such as this:
.text:0040A868 mov eax, dword_48AE28
.text:0040A86D lea ecx, [esp+48Ch+var_3E4]
.text:0040A874 xorps xmm0, xmm0
.text:0040A877 movups xmmword ptr [eax+ecx-0C5DAC1Ch], xmm0
.text:0040A87F movups xmmword ptr [eax+ecx-0C5DAC2Ch], xmm0
Replacing the mov eax is enough for IDA’s decompiler to be able to figure things out.
The constant inlining is implemented in deob_consts.py.
Control flow unflattening
The final problem is the flattening, so let’s see what can be done about that. I don’t want to see any huge-number-comparison spaghetti code anymore.
There are some IDA plugins for unflattening that are based on microcode manipulation, such as D-810, and more recently, hrtng (which contains an extended version of HexRaysDeob). They all fail to deal with this particular variant of flattening. I assume that it would be possible to make them compatible with some effort, but I’m not familiar with IDA microcode specifics and didn’t feel like getting into it for this project. Plus I figured I could re-use quite a bit of the rewriting code that we already came up with for the jump deobfuscation.
One (special?) thing about this flattener is that it doesn’t always go to the beginning of the dispatcher once it has assigned a new state variable. If it can find another suitable point in the dispatcher it can jump to, it will do so. This means that analyses looking at the parents of the “main” dispatcher block won’t work, because they’ll miss things.
I chose a rather simplistic approach that seems to work well for this particular flattener:
- Look for comparisons against the state register that are followed by
jz(or in rare casesjnzwhere things are flipped). Create a mapping “state key → target”. - Look for assignments to the state register. The assignment value is asserted to be contained in the mapping created in step 1. At the end of the block, create jump(s) to the real target(s).
Here’s an example of step 2 (unrelated instruction have been cut):
.text:0040B758 cmp [esp+48Ch+var_3B4], 0
.text:0040B760 mov ebx, 0B34CE2DFh ; new state key
.text:0040B765 jnz loc_40C696
.text:0040B771 cmp ebx, 0BB2D365h
.text:0040B777 jge loc_40C6AD ; never taken
.text:0040B77D jmp loc_40B6C0
.text:0040C696 mov ebx, 0FA72F85Ah ; new state key
.text:0040C6A1 cmp ebx, 0BB2D365h
.text:0040C6A7 jl loc_40B77D ; always taken (then goes straight to 40B6C0)
.text:0040C6AD jmp loc_40A607 ; dead
After rewriting:
.text:0040B758 cmp [esp+48Ch+var_3B4], 0
.text:0040B765 jnz loc_40C696
.text:0040B771 cmp ebx, 0BB2D365h ; irrelevant leftover
.text:0040B777 jmp loc_40BCAF ; destination for 0B34CE2DFh
.text:0040C696 nop
...
.text:0040C6A1 jmp loc_40C3AC ; destination for 0FA72F85Ah
These sites are connected with a branch, but they are actually rewritten independently of each other; as you can see, jcc instructions unrelated to dispatching are left as-is (0x40B765).
The part where it goes to the dispatcher with jge/jl/jmp can be ignored and overwritten.
Another example. Remember cmov? It has come back to haunt us.
.text:0040C4B4 cmp [esp+48Ch+var_448], 5
.text:0040C4B9 mov ebx, 2B8162DCh
.text:0040C4BE mov eax, 456A4274h
.text:0040C4C3 cmovz ebx, eax
.text:0040C4C6 cmp ebx, 0BB2D365h
.text:0040C4CC jl loc_40B6C0
.text:0040C4D2 jmp loc_40A607
After rewriting:
.text:0040C4B4 cmp [esp+48Ch+var_448], 5
.text:0040C4B9 mov ebx, 2B8162DCh ; irrelevant leftover
.text:0040C4BE mov eax, 456A4274h ; irrelevant leftover
.text:0040C4C3 jz loc_40B199 ; destination for 456A4274h
.text:0040C4C9 jmp loc_40ADA2 ; destination for 2B8162DCh
As you can see, the condition code from the cmovcc is simply adapted for a jcc (jz here).
It’s just a question of figuring out which operand of the cmov goes where (in logical terms), so you don’t swap the destinations by accident.
While this all seems “simple” enough, care still needs to be taken.
Especially in larger blocks, there may be situations where the compiler put other flag-affecting instructions below the cmov.
Then by the time you do your jcc, the flags won’t be what you expected anymore!
For the most part, these were artifacts from the jump obfuscation that can be done away with by doing a more comprehensive cleanup in that previous deobfuscation phase.
But it’s still something that needs to be checked for before rewriting, else you’re in for nasty logic bugs.
Lastly, there is a special case where the state register is not set to a constant, but to a stack variable.
This is sometimes the case for decisions that can be evaluated in the function’s first block, so the branch condition and state keys will be up there.
It poses quite a challenge, because in order to have a jcc in a block way down in the function, you also need to pull down the branch condition. That’s not always trivial, especially if there’s a comparison against some volatile register that is only valid in the first block.
; Up in the in the first block:
.text:0040A59D cmp [ebp+arg_4], 0 ; we need this down at 40BEXX
.text:0040A5A1 mov eax, 0EC71CA67h
.text:0040A5A6 mov ecx, 0A0716E5Bh
.text:0040A5AB cmovz ecx, eax
.text:0040A5AE mov [esp+48Ch+var_44C], ecx
; Site of use (state var assignment):
.text:0040BECC mov ebx, [esp+48Ch+var_44C] ; load stack variable
.text:0040BED0 cmp ebx, 0BB2D365h
.text:0040BED6 jl loc_40B6C0
.text:0040BEDC jmp loc_40A607
For simplicity’s sake, we’re going to make the following assumptions:
- At sites where stack variables like that are used, the stack pointer has the same value as in the first block.
- Relevant stack variables (such as
arg_4in the example) are not overwritten in the middle of the function. You may wonder why an argument would ever be overwritten (even if it’s not needed anymore in the rest of the function), but you’d be surprised how inventive compilers can get when looking for free space for temp vars.
These are pretty big assumptions that won’t hold in many programs, but in this case they seem to work and save us a lot of headaches.
That means comparisons such as cmp [ebp+arg_4], 0 can be moved as-is.
In another case, we may find cmp ecx, 0 instead - then we need to track where ecx comes from.
It could be another memory load like mov ecx, [ebp+arg_4], in which case we can simply propagate the operand and end up with cmp [ebp+arg_4], 0 again.
But ecx could also be the result of a more complex computation or even a parameter if the fastcall calling convention is used.
In such situations, we can employ a trick and re-use the var_44C variable that the obfuscator allocated.
Instead of using it for the cmov result, we’ll place our ecx register into the stack var, so that we can load it again in the block further down. Obviously, the cmov and any dependent instructions need to be nopped for that to work.
The attentive reader may wonder what would happen if we had cmp ecx, edx and both registers were fastcall parameters.
Yeah, that would suck; thankfully it never happens in the sample.
The code for the above-mentioned operand analysis can be found here.
So in summary, dealing with this particular flattening is not hard on a conceptual level. The bulk work is getting all the edge cases right.
Final results
At this point you probably want to see an updated graph. Unfortunately, we’re not nopping out all code belonging to the CFF dispatcher, so IDA will still create blocks for it. So we’ll simply pretend the top part of the CFG does not exist, okay? Okay.
The main function has turned out very pretty, despite being one of the longest functions in the program and the amount of obfuscation-related blocks it originally had:
v18 = 0;
v17 = 0;
memset(Param, 0, sizeof(Param));
v3 = -1603178917;
if ( !lpMultiByteStr )
v3 = -328086937;
pNumArgs[3] = v3;
if ( lpMultiByteStr )
{
v10 = MultiByteToWideChar(0, 0, lpMultiByteStr, -1, 0, 0);
v11 = (const WCHAR *)calloc(v10, 2u);
lpCmdLine = v11;
if ( v11 )
{
pNumArgs[0] = 0;
hMem = CommandLineToArgvW(lpCmdLine, pNumArgs);
if ( hMem )
{
for ( i = 0; i < pNumArgs[0]; ++i )
{
if ( hMem[i] && lstrlenW(hMem[i]) == 66 && *hMem[i] == 45 && hMem[i][1] == 98 )
{
v26 = hMem[i] + 2;
for ( j = 0; j < 64; j = j - 393718627 + 393718629 )
{
v7 = &v26[j];
v8 = sub_40C8B0(*v7);
v9 = sub_40C8B0(v7[1]);
if ( v8 >= 0x10u || v9 >= 0x10u )
{
v21 = 5;
}
else
{
v32[j / 2] = v9 | (16 * v8);
v21 = 0;
}
v24 = v21;
if ( v24 == 5 )
break;
}
if ( j == 64 )
v18 = 1;
else
memset(v32, 0, sizeof(v32));
}
}
LocalFree(hMem);
}
free((void *)lpCmdLine);
}
}
v31 = &off_47B9C8;
v28[1] = 0;
v28[0] = 0;
v29 = 0;
v28[2] = (int)&unk_48A99C;
v28[3] = (int)sub_40CDA0;
v28[5] = (int)&v31;
v28[6] = (int)hInstance;
v28[7] = a3;
v28[4] = (int)sub_40CF90;
WndClass.style = 3;
WndClass.lpfnWndProc = (WNDPROC)WndProc;
WndClass.cbClsExtra = 0;
WndClass.cbWndExtra = 0;
WndClass.hInstance = hInstance;
WndClass.hIcon = LoadIconA(0, (LPCSTR)0x7F00);
WndClass.hCursor = LoadCursorA(0, (LPCSTR)0x7F00);
WndClass.hbrBackground = (HBRUSH)GetStockObject(0);
WndClass.lpszMenuName = 0;
WndClass.lpszClassName = WindowName;
dword_48E02C = (int)&dword_48E028;
dword_48E028 = (int)&dword_48E028;
chain[1] = (int)chain;
chain[0] = (int)chain;
Param[92] = 0;
Param[23] = 0;
Param[95] = 0;
Param[24] = 0;
Param[2] = (int)chain;
Param[0] = 0;
Param[22] = (int)&v17;
Param[1] = (int)malloc(0x40000u);
if ( Param[1] )
{
v19 = 1;
RegisterClassW(&WndClass);
memset((void *)Param[1], 0, 0x40000u);
Param[31] = 0;
Param[30] = 0;
Param[4] = (int)v28;
Param[3] = (int)&v31;
sub_419808(WindowName, 28, &Param[13]);
if ( v18 )
{
sub_405E50(v37);
v36 = v37;
sub_4069C0(v35, &Param[13]);
sub_406B80(v35, 32, v32, &Param[5]);
}
else if ( sub_40F830()
&& MessageBoxW(0, L"Do you want to run a malware?\n(Crypt build to disable this message)", L"Warning", 0x34u) == 7 )
{
v19 = 0;
}
if ( v19 )
{
DesktopWindow = GetDesktopWindow();
if ( DesktopWindow )
GetWindowThreadProcessId(DesktopWindow, (LPDWORD)&Param[25]);
Param[23] = (int)calloc(1u, 30008u);
pNumArgs[2] = 13565952;
pNumArgs[1] = 0x80000000;
CreateWindowExW(
0,
WindowName,
WindowName,
0xCF0000u,
0x80000000,
0x80000000,
500,
300,
HWND_MESSAGE,
0,
hInstance,
Param);
while ( GetMessageA(&Msg, 0, 0, 0) )
{
TranslateMessage(&Msg);
DispatchMessageA(&Msg);
}
free((void *)Param[1]);
if ( v17 && Param[4] )
{
v25 = v17;
v17(v28);
}
}
}
if ( Param[24] )
free((void *)Param[24]);
if ( Param[95] )
free((void *)Param[95]);
if ( Param[23] )
free((void *)Param[23]);
while ( (int *)dword_48E028 != &dword_48E028 )
{
v4 = (_DWORD *)dword_48E028;
v5 = *(_DWORD *)dword_48E028;
v6 = *(_DWORD **)(dword_48E028 + 4);
*v6 = *(_DWORD *)dword_48E028;
*(_DWORD *)(v5 + 4) = v6;
v22 = v4;
if ( v4[2] )
(*((void (__stdcall **)(_DWORD, _DWORD, int))v22 + 2))(*((_DWORD *)v22 + 4), 0, 0x8000);
free(v22);
}
There are only very few obfuscation artifacts left, e.g., the assignment to v3 in the beginning.
In one case IDA failed to optimize a for loop update (where it says j = j - 393718627 + 393718629 instead of j += 2).
We can live with that ¯\_(ツ)_/¯.
The ultimate test for a deobfuscator is to see if the deobfuscated application runs without issues. Unfortunately, at the time of writing, that’s not the case for the tooling we developed. There’s no outright crash and the initial message box appears, but a very subtle logic bug seems to prevent correct decoding of the payload.
So, what about the malware loader?
We went through all this effort to bring the code into a shape that can be reversed without too much pain. So let’s see what the Rhadamanthys loader is actually up to.
You can already glean some details from the main function printed in the previous section.
One of the first consequential things that happen is a function call to 0x40F830 followed by a very conspicuous message box if successful.
As the message text implies, that function actually checks if the binary was not crypted/packed.
This is implemented by doing some section parsing and reading the first 0x1000 .text bytes from disk; those are simply compared with what’s in memory.
Interestingly, the check disregards the fact that the binary was built with relocations and was flagged as movable. It works regardless, because the first function is huge and doesn’t use any addresses that would need to be relocated. The function was probably placed at that position intentionally - otherwise it’d be pure luck.
Other than that, there’s lots of windowing code to register a window class and create an invisible window.
A structure called Param is initialized and pretty much contains the state for the rest of the program.
It’s passed as the window’s user data pointer so that it can be retrieved later on.
The rest of the main function doesn’t actually do much; it eventually enters a classic message loop in order to support the window.
Since a window is in play here, the next point of interest will usually be the so-called window procedure, or WndProc. This is called whenever the operating system has a message that either relates to the window directly or that could be relevant to the application in general (e.g., “the system is about to shut down”).
The procedure handles the following messages:
WM_CREATE: Called after the window has been created. The malware sets up a timer with ID 2.WM_DESTROY: Called when the window is about to be destroyed. The malware callsPostQuitMessageto essentially end the process.WM_TIMER: Timer callback.WM_CAP_SET_CALLBACK_YIELD: Posted by the malware itself. Queues a function pointer.
From the above, you may be able to guess that the main functionality is actually implemented using timers. We find ourselves with a pretty convoluted scheme where many operations are executed as part of callbacks, and it’s not immediately obvious what happens in which order and under which conditions.
- Timer ID 2 collects the current cursor position, foreground window and timestamp every 30 milliseconds, 1500 times. This causes a delay of at least 45 seconds until the actual payload is run.
- Once it has collected everything, it will do a couple checks on the data. If the checks are successful, it queues function 0x40FBE0 and schedules timer ID 1. The checks are as follows:
- Whether the cursor position has changed at least 30 times.
- There must have been at least two foreground windows, and at least one of them must not belong to the desktop process.
- If the checks fail, the loader will enter another 45 seconds interval, and it will start doing advanced checks that include calculating Euclidean distances between cursor positions. We haven’t analyzed that part in detail.
- Timer ID 1 runs any queued functions.
- Function 0x40FBE0 posts
WM_CAP_SET_CALLBACK_YIELDwith parameter 0x411850; as described above, the handler for this message simply queues the parameter as a function pointer (to the queue that timer ID 1 is working on). - 0x411850 is responsible for decoding and eventually running the stealer payload. It will post additional functions using
WM_CAP_SET_CALLBACK_YIELDthat do parts of the work.
These measures are designed to thwart execution in AV emulators and malware sandboxes. For example, if a sandbox sets a different cursor position every 3 seconds but doesn’t simulate “fluid” movement, the captured coordinates will not exceed the threshold of 30 within 45 seconds. CAPE and VMRay pass these checks fine, so in the end they aren’t too effective against common sandboxes.
Custom binary encoding
As promised in the intro, here’s a representation of the algorithm (function 0x416790) that is used to decode the payload strings into binary:
| Input | Output |
|---|---|
| A-Z | 0-25 |
| a-z | 26-51 |
| 0-9 | 52-61 |
!?#$%&()*+-,/:;<>=@[\]^`{\|}~\n |
62-90 |
' ' + A-Z / a-z |
91-142 |
'.' + A-Z / a-z / 0-9 |
143-204 |
'_' + A-Z |
205-230 |
'_' + a-z |
231-255 + 0 |
If you’re wondering what the + 0 is supposed to mean: 0 is encoded by A as well as _z.
The malware doesn’t actually make use of this overflow “feature”, though.
While the first three ranges are reminiscent of Base64, the algorithm is completely different from base encodings. Up until 90, it makes use of a lookup consisting of special characters. For the rest up until 255, there’s a two-letter encoding scheme with a special prefix character that influences the output range.
I later found a post by Cormac Conlon stating that the .NET-based variant of the loader also uses this algorithm, where it’s called Flutter.
Note that in addition to this, the payload is also encrypted using a somewhat obscure Chinese block cipher called SM4. The tables/constants required by the algorithm (S-Box, FK, CK) are obscured in function 0x405E50; they’re only loaded temporarily when needed.
Outro
The Rhadamanthys operation has seen a disruption by Operation Endgame earlier this month. It remains to be seen if/how quickly they recover. A lot of data stolen by the stealer was also recovered by law enforcement and provided to Have I Been Pwned.
Our tooling that we developed as part of this research is available on GitHub. Perhaps it will prove useful in the future if the same obfuscator is used for other malware, or if a very similar one pops up to which the code can be adapted. Other than that, we hope the code can serve as a learning resource for how to implement a fast static deobfuscator utilizing the Capstone disassembly library.