-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base64 and widebase64 modifiers #1185
Conversation
Add two new modifiers: base64 and widebase64. These modifiers take the given text string and generate 3 different strings being careful to trim off the bytes which are dependent upon leading or trailing bytes in the larger search space. I've implemented it by slightly cheating. I generate all the search strings in a list and then creating one large string suitable for the RE compiler to deal with. For example, the string "This program cannot" generates these three base64 encoded strings: VGhpcyBwcm9ncmFtIGNhbm5vd RoaXMgcHJvZ3JhbSBjYW5ub3 UaGlzIHByb2dyYW0gY2Fubm90 Those three strings are then transformed into a RE that looks like this: (VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90) That string is then passed to the RE compiler for parsing and AST generation. The AST is then emitted into the appropriate spot and YARA believes it was given a regex from that point on. I've also implemented support for specifying custom alphabets: base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu") This means that I have to be careful to escape any special RE characters in the string passed to the compiler. The base64 alphabet has to be 64 bytes long, and does support escaped bytes properly too. To avoid the need to deal with escaping I had a first implementation which attempted to generate the AST by hand, which was mostly working but was very cumbersome to maintain. In doing this I ended up improving yr_re_print_node() so that it indents the tree properly, which made debugging that attempt easier. My apologies for including it in this diff. I can split it out if you want. Lastly, I did most of the work in re.c but I think it belongs in it's own base64.c. I'm happy to move it out of re.c if you would prefer.
I'm putting this up for review so I can gather comments and iterate on it. I think this belongs in its own set of source files, instead of forcing it into re.(c|h) for the most part. I'll move things over to base64.(c|h) and get them into the build system, but would first like to gather feedback on the design. The details are in the documentation (included in this PR) but the short version is that:
The My first implementation of this was generating the RE AST by hand, but it was far too fragile so I switched over to the current method (generating an RE string and letting the RE compiler do the heavy lifting for me). This is an easy approach to this but I'm not super happy with it either. I think a better long-term answer here is to have a way to have different representations of strings but it is still only referenced by a single string internally. This is a lot more work and may not be worth the effort until a more concrete use-case is found. |
Specifically, you can only use these modifiers with text strings (hex strings and regular expressions are not supported) and you can't use the xor or nocase modifiers in combination with these.
I have two concerns about these modifiers, which are:
Thoughts? |
I understand this concern but I believe the benefit of a more expressive and readable language is worth it. You are right that it does increase the cognitive burden on users, but with good documentation (which I tried to provide a start to) we can help users understand what is and what is not supported. With tests, which I tried to make very thorough, I think we can lessen the burden on ourselves for maintenance.
You're right that these are more for convenience but not having them means each user needs to write code (or find someone else's) which does the base64 encoding with proper padding and trimming of the encoded strings. It's not hard to do, and there are a handful of implementations out there already. In fact, when I was writing this I was using one which turned out to be buggy. The fact that we can provide a convenient way for users to express this while also increasing readability makes this at least worth considering. The alternative is users needing to write or find a script to do this, and any bugs that may include. Also, this is more expressive than the versions I've seen that don't let you specify a custom alphabet or even wide support. I'd much rather see |
Jumping on the debate as a Yara user, as some feedback was asked on Twitter. At the moment, and probably as other folks, I use a third-party script to generate the 3 possible base64 strings for every input. This technique proved to be successful to monitor some threat actors. Regarding the risk of "user confusion", my feeling is that the basic user will not bother with the modifier, and the advanced user will read the documentation. I cannot say how much burden it adds as a developer, my first thought was that the base64 algorithm is not going to change, so once the code is written and tested you should be fine, but then I realized you will need to handle the modifier against every new modifier that you may add in the future. You are in the best position to answer this question, and I will still cherish Yara if you decide not to implement the feature :) |
I strongly support this addition. Base64 is a very common encoding format found across most major languages and platforms. It is used to encode multi-line parameters & scripts, encode binary content (e.g. compressed data), and to support common file encodings (e.g. X509 certificate encodings). Attackers make use of Base64 encoding very often because it allows them to utilize native functionality to achieve simple obfuscation objectives. Payloads across .NET, PowerShell, Python, PHP, VBA Macros, and beyond have used base64. It is probably the #1 encoding used by attackers for simple evasion and obfuscation goals. Matching encoded content is also the way an entry infosec practitioner can progress beyond matching on "plaintext strings" to find more advanced payloads. These payloads require additional analysis skills and tools (e.g. CyberChef) that help practitioners gain the skills necessary to defend their environments. Giving defenders more power in a convenient way simplifies the progression of mastery. While this feature can be approximated by generating the appropriate strings in standalone tools, native support for this encoding would simplify the effort required by defenders to match content and eliminate points of error (e.g. forgetting to including multiple base64 strings to account for leading and trailing bytes, neglecting unicode, etc). |
Given the arguments, and how passionately @wxsBSD has defended this feature, I'm changing my mind. Let me do the final review before merging it. |
I think all of the code I added in |
I totally agree on creating the |
I haven't tested the build with bazel but the normal build process works and tests are passing.
I was about to merge this PR and noticed that for some reason the tests in Travis CI are failing. They pass in my local machine though. |
Not sure why those are failing and I can’t find a way to get the test logs to see what failed exactly. :( |
The windows builds are broken because I didn't update the VS solution files - I can download a copy of VS in a VM if you want me to fix that, but it may be a day or two before I can get to it. I have no idea why those tests are failing and I can't seem to get travis to give me the logs I need to debug it. Maybe it makes sense to modify the Makefile so it cats out the test-suite.log if tests fail? This way we can see the test failures in travis (and it just makes it easier to not have to cat that file all the time during development). |
The tests fail because of memory leaks:
|
In Mac OS X the leaks are not detected by AddressSanitizer :( |
Sigh, will fix these tonight. |
Most of these memory leaks were silly and were because I was forgetting to free the encoded string or the nodes upon successful creation of the AST. There were some leaks when the compiler would error in the case of incorrect alphabet size. Lastly, I realized that if you specify two different alphabets in your modifier the last one used would apply to the other base64 modifier, which is not intuitive behavior. Fix it by making different alphabets a compiler error. Add a test case for this while I'm here.
OK, I've fixed the leaks (thanks for pointing them out) and also added a fix for an edge case I hadn't considered. I've verified the leaks are fixed and all tests pass on a linux system and on OS X. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I'm just adding two more comments about things that discovered while looking again at the code.
FAIL_ON_ERROR_WITH_CLEANUP( | ||
_yr_base64_create_nodes(wide_str, modifier.alphabet, 0, &head, &tail), | ||
{ | ||
strcpy(error->message, "Failure encoding base64 wide string"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these errors necessary? I mean, if _yr_base64_create_nodes
fails only on out-of-memory conditions, we don't need to add an error message here, and this function doesn't need to receive the pointer to RE_ERROR
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way so it matches yr_re_parse_hex()
and yr_re_parse()
when called from parser.c. Also, I eventually call into yr_re_parse()
to do the final parsing, which requires an RE_ERROR*
. This is arguably the more important piece since if the RE parsing fails we would like to get detailed error information.
Dynamically allocate string to do base64 conversion. While here, remove trailing new lines.
In the current (3.12.0 or 4.0.0) docs, I see that nocase and xor cannot be used with the base64 modifiers, and that base64 cannot be used with base64wide. Should it also include a restriction on use with the fullword modifier? |
Add two new modifiers: base64 and widebase64. These modifiers take the given text string and generate 3 different strings being careful to trim off the bytes which are dependent upon leading or trailing bytes in the larger search space. I've implemented it by slightly cheating. I generate all the search strings in a list and then creating one large string suitable for the RE compiler to deal with. For example, the string "This program cannot" generates these three base64 encoded strings: VGhpcyBwcm9ncmFtIGNhbm5vd RoaXMgcHJvZ3JhbSBjYW5ub3 UaGlzIHByb2dyYW0gY2Fubm90 Those three strings are then transformed into a RE that looks like this: (VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90) That string is then passed to the RE compiler for parsing and AST generation. The AST is then emitted into the appropriate spot and YARA believes it was given a regex from that point on. I've also implemented support for specifying custom alphabets: base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu") This means that I have to be careful to escape any special RE characters in the string passed to the compiler. The base64 alphabet has to be 64 bytes long, and does support escaped bytes properly too. To avoid the need to deal with escaping I had a first implementation which attempted to generate the AST by hand, which was mostly working but was very cumbersome to maintain. In doing this I ended up improving yr_re_print_node() so that it indents the tree properly, which made debugging that attempt easier.
Add two new modifiers: base64 and widebase64.
These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.
I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:
VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90
Those three strings are then transformed into a RE that looks like this:
(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)
That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.
I've also implemented support for specifying custom alphabets:
base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")
This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.
To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.
Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.