Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base64 and widebase64 modifiers #1185

Merged
merged 14 commits into from
Feb 17, 2020
Merged

base64 and widebase64 modifiers #1185

merged 14 commits into from
Feb 17, 2020

Conversation

wxsBSD
Copy link
Collaborator

@wxsBSD wxsBSD commented Dec 23, 2019

Add two new modifiers: base64 and widebase64.

These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.

I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:

VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90

Those three strings are then transformed into a RE that looks like this:

(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)

That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.

I've also implemented support for specifying custom alphabets:

base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")

This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.

To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.

Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.

Add two new modifiers: base64 and widebase64.

These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.

I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:

VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90

Those three strings are then transformed into a RE that looks like this:

(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)

That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.

I've also implemented support for specifying custom alphabets:

base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")

This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.

To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.

Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.
@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Dec 23, 2019

I'm putting this up for review so I can gather comments and iterate on it. I think this belongs in its own set of source files, instead of forcing it into re.(c|h) for the most part. I'll move things over to base64.(c|h) and get them into the build system, but would first like to gather feedback on the design.

The details are in the documentation (included in this PR) but the short version is that:

$a = "This program cannot" base64 will search for the three base64 representations of that string and $a = "This program cannot" base64 wide will apply the wide modifier and then the base64 - it does not include the plaintext version if you use base64.

The widebase64 modifier works just like the base64 modifier but inserts null bytes between each character in the base64 strings.

My first implementation of this was generating the RE AST by hand, but it was far too fragile so I switched over to the current method (generating an RE string and letting the RE compiler do the heavy lifting for me). This is an easy approach to this but I'm not super happy with it either. I think a better long-term answer here is to have a way to have different representations of strings but it is still only referenced by a single string internally. This is a lot more work and may not be worth the effort until a more concrete use-case is found.

Specifically, you can only use these modifiers with text strings (hex strings
and regular expressions are not supported) and you can't use the xor or nocase
modifiers in combination with these.
@plusvic
Copy link
Member

plusvic commented Jan 7, 2020

I have two concerns about these modifiers, which are:

  1. The more modifiers we have, the larger the number of possible combinations and therefore it's more difficult to understand how they interact between them. For example, a user may wonder what happens if base64 and nocase are used together. Does it take all possible case combinations for the original string and apply base64 to them? Does it find any case combination of the resulting base64 string? It's an error?. In this case it will be an error, but the more intuitive options if probably the first one. I'm afraid that by adding more modifiers is going to be difficult both for the user and for ourselves to reason about how the different modifiers interact between them.

  2. These modifiers are convenient, but not necessary. Other modifiers like nocase and xor are necessary because the alternative would be writing hundreds or even thousands of possible strings, but base64 can be replaced by just three strings instead of one. It's certainly easier for the user to use base64 than having to deal with writing the three strings by herself, but considering the maintenance burden and the added complexity (even for the user) I'm not convinced about adding them.

Thoughts?

@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Jan 7, 2020

The more modifiers we have, the larger the number of possible combinations and therefore it's more difficult to understand how they interact between them. For example, a user may wonder what happens if base64 and nocase are used together. Does it take all possible case combinations for the original string and apply base64 to them? Does it find any case combination of the resulting base64 string? It's an error?. In this case it will be an error, but the more intuitive options if probably the first one. I'm afraid that by adding more modifiers is going to be difficult both for the user and for ourselves to reason about how the different modifiers interact between them.

I understand this concern but I believe the benefit of a more expressive and readable language is worth it. You are right that it does increase the cognitive burden on users, but with good documentation (which I tried to provide a start to) we can help users understand what is and what is not supported. With tests, which I tried to make very thorough, I think we can lessen the burden on ourselves for maintenance.

These modifiers are convenient, but not necessary. Other modifiers like nocase and xor are necessary because the alternative would be writing hundreds or even thousands of possible strings, but base64 can be replaced by just three strings instead of one. It's certainly easier for the user to use base64 than having to deal with writing the three strings by herself, but considering the maintenance burden and the added complexity (even for the user) I'm not convinced about adding them.

You're right that these are more for convenience but not having them means each user needs to write code (or find someone else's) which does the base64 encoding with proper padding and trimming of the encoded strings. It's not hard to do, and there are a handful of implementations out there already. In fact, when I was writing this I was using one which turned out to be buggy. The fact that we can provide a convenient way for users to express this while also increasing readability makes this at least worth considering. The alternative is users needing to write or find a script to do this, and any bugs that may include. Also, this is more expressive than the versions I've seen that don't let you specify a custom alphabet or even wide support.

I'd much rather see $a = "This program cannot" base64 instead of the alternatives. :)

@thehellu
Copy link

thehellu commented Jan 8, 2020

Jumping on the debate as a Yara user, as some feedback was asked on Twitter.

At the moment, and probably as other folks, I use a third-party script to generate the 3 possible base64 strings for every input. This technique proved to be successful to monitor some threat actors.
When you search a specific string without knowing if it will be ascii or wide, you need to generate not only 3 but 6 strings. In Yara, this could be replaced by one line, with the "ascii" and "wide" modifier in addition to "base64" and "widebase64". So, as a user, those new modifiers would be useful, convenient, and would make rules more readable as they would turn 6 lines into 1. The ability to specify the alphabet is a nice bonus.

Regarding the risk of "user confusion", my feeling is that the basic user will not bother with the modifier, and the advanced user will read the documentation.

I cannot say how much burden it adds as a developer, my first thought was that the base64 algorithm is not going to change, so once the code is written and tested you should be fine, but then I realized you will need to handle the modifier against every new modifier that you may add in the future. You are in the best position to answer this question, and I will still cherish Yara if you decide not to implement the feature :)

@JohnLaTwC
Copy link

I strongly support this addition.

Base64 is a very common encoding format found across most major languages and platforms. It is used to encode multi-line parameters & scripts, encode binary content (e.g. compressed data), and to support common file encodings (e.g. X509 certificate encodings).

Attackers make use of Base64 encoding very often because it allows them to utilize native functionality to achieve simple obfuscation objectives. Payloads across .NET, PowerShell, Python, PHP, VBA Macros, and beyond have used base64. It is probably the #1 encoding used by attackers for simple evasion and obfuscation goals.

Matching encoded content is also the way an entry infosec practitioner can progress beyond matching on "plaintext strings" to find more advanced payloads. These payloads require additional analysis skills and tools (e.g. CyberChef) that help practitioners gain the skills necessary to defend their environments. Giving defenders more power in a convenient way simplifies the progression of mastery.

While this feature can be approximated by generating the appropriate strings in standalone tools, native support for this encoding would simplify the effort required by defenders to match content and eliminate points of error (e.g. forgetting to including multiple base64 strings to account for leading and trailing bytes, neglecting unicode, etc).

@plusvic
Copy link
Member

plusvic commented Jan 8, 2020

Given the arguments, and how passionately @wxsBSD has defended this feature, I'm changing my mind. Let me do the final review before merging it.

@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Jan 13, 2020

I think all of the code I added in re.c belongs in a new file (I'm thinking base64.c) - really the majority of the code has nothing to do with the re code. By moving it to it's own file it will make things cleaner. Do you want me to do that?

@plusvic
Copy link
Member

plusvic commented Jan 13, 2020

I totally agree on creating the base64.c file.

I haven't tested the build with bazel but the normal build process works and
tests are passing.
@plusvic
Copy link
Member

plusvic commented Feb 4, 2020

I was about to merge this PR and noticed that for some reason the tests in Travis CI are failing. They pass in my local machine though.

@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Feb 4, 2020

Not sure why those are failing and I can’t find a way to get the test logs to see what failed exactly. :(

@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Feb 5, 2020

The windows builds are broken because I didn't update the VS solution files - I can download a copy of VS in a VM if you want me to fix that, but it may be a day or two before I can get to it.

I have no idea why those tests are failing and I can't seem to get travis to give me the logs I need to debug it. Maybe it makes sense to modify the Makefile so it cats out the test-suite.log if tests fail? This way we can see the test failures in travis (and it just makes it easier to not have to cat that file all the time during development).

@plusvic
Copy link
Member

plusvic commented Feb 11, 2020

The tests fail because of memory leaks:

=================================================================
==27990==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 585 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b245b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b245b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 585 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b21a1 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21a1)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b23b8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b23b8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b231b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b231b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b21d8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21d8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 240 byte(s) in 6 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1010 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1010)
    #2 0x4b227b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b227b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 228 byte(s) in 3 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x44f420 in sized_string_new (/home/travis/build/VirusTotal/yara/test-rules+0x44f420)

Direct leak of 72 byte(s) in 3 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b245b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b245b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 72 byte(s) in 3 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b21d8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21d8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 48 byte(s) in 2 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b23b8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b23b8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 48 byte(s) in 2 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b21a1 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21a1)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 30 byte(s) in 2 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x42a91d in yara_yylex (/home/travis/build/VirusTotal/yara/test-rules+0x42a91d)

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b231b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b231b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b227b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b227b)

Indirect leak of 1008 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x44f504 in sized_string_convert_to_wide (/home/travis/build/VirusTotal/yara/test-rules+0x44f504)
    #2 0x4b21a1 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21a1)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 558 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1502 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1502)
    #2 0x4b245b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b245b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 552 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x44f504 in sized_string_convert_to_wide (/home/travis/build/VirusTotal/yara/test-rules+0x44f504)
    #2 0x4b231b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b231b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 368 byte(s) in 6 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x44f504 in sized_string_convert_to_wide (/home/travis/build/VirusTotal/yara/test-rules+0x44f504)
    #2 0x4b227b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b227b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 330 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1502 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1502)
    #2 0x4b21d8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21d8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 330 byte(s) in 9 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b1502 in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b1502)
    #2 0x4b23b8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b23b8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 192 byte(s) in 8 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b231b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b231b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 168 byte(s) in 7 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b23b8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b23b8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 168 byte(s) in 7 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b21a1 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21a1)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 144 byte(s) in 6 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b245b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b245b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 144 byte(s) in 6 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b21d8 in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b21d8)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

Indirect leak of 120 byte(s) in 5 object(s) allocated from:
    #0 0x7fbe984d8602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x4b0f6a in _yr_base64_create_nodes (/home/travis/build/VirusTotal/yara/test-rules+0x4b0f6a)
    #2 0x4b227b in yr_base64_ast_from_string (/home/travis/build/VirusTotal/yara/test-rules+0x4b227b)
    #3 0x4f1f7f  (/home/travis/build/VirusTotal/yara/test-rules+0x4f1f7f)

SUMMARY: AddressSanitizer: 7118 byte(s) leaked in 158 allocation(s).

@plusvic
Copy link
Member

plusvic commented Feb 11, 2020

In Mac OS X the leaks are not detected by AddressSanitizer :(

@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Feb 11, 2020

Sigh, will fix these tonight.

Most of these memory leaks were silly and were because I was forgetting to
free the encoded string or the nodes upon successful creation of the AST.

There were some leaks when the compiler would error in the case of incorrect
alphabet size.

Lastly, I realized that if you specify two different alphabets in your
modifier the last one used would apply to the other base64 modifier, which is
not intuitive behavior. Fix it by making different alphabets a compiler error.
Add a test case for this while I'm here.
@wxsBSD
Copy link
Collaborator Author

wxsBSD commented Feb 12, 2020

OK, I've fixed the leaks (thanks for pointing them out) and also added a fix for an edge case I hadn't considered.

I've verified the leaks are fixed and all tests pass on a linux system and on OS X.

Copy link
Member

@plusvic plusvic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I'm just adding two more comments about things that discovered while looking again at the code.

FAIL_ON_ERROR_WITH_CLEANUP(
_yr_base64_create_nodes(wide_str, modifier.alphabet, 0, &head, &tail),
{
strcpy(error->message, "Failure encoding base64 wide string");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these errors necessary? I mean, if _yr_base64_create_nodes fails only on out-of-memory conditions, we don't need to add an error message here, and this function doesn't need to receive the pointer to RE_ERROR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way so it matches yr_re_parse_hex() and yr_re_parse() when called from parser.c. Also, I eventually call into yr_re_parse() to do the final parsing, which requires an RE_ERROR*. This is arguably the more important piece since if the RE parsing fails we would like to get detailed error information.

Dynamically allocate string to do base64 conversion.

While here, remove trailing new lines.
@plusvic plusvic merged commit 2e755ac into VirusTotal:master Feb 17, 2020
@wxsBSD wxsBSD deleted the base64 branch February 18, 2020 00:58
@malvidin
Copy link
Contributor

malvidin commented Apr 8, 2020

In the current (3.12.0 or 4.0.0) docs, I see that nocase and xor cannot be used with the base64 modifiers, and that base64 cannot be used with base64wide.

Should it also include a restriction on use with the fullword modifier?

tarterp pushed a commit to mandiant/yara that referenced this pull request Mar 31, 2022
Add two new modifiers: base64 and widebase64.

These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.

I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:

VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90

Those three strings are then transformed into a RE that looks like this:

(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)

That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.

I've also implemented support for specifying custom alphabets:

base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")

This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.

To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants