-
Notifications
You must be signed in to change notification settings - Fork 610
Makes string length configurable and consistent across backends. Closes #1303 #2678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…lso added condition based metadata extraction.
…d arguments to all file handlers in elffile.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Shajal-Kumar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request by @Shajal-Kumar. This PR addresses issue #1303 by making the minimum string length used during feature extraction configurable and ensuring this setting is applied consistently across different backends (IDA Pro, Vivisect, pefile, dnfile, elffile, and vmray). The changes involve introducing a min_str_len
parameter, setting a default value of 4, and refactoring extractor functions to pass configuration details via a context
dictionary instead of using **kwargs
. Comprehensive tests have been added to validate the new functionality and ensure consistency.
Highlights
- Configurable String Length: Introduced a
min_str_len
parameter to control the minimum length of strings extracted by capa. This value is now configurable instead of being hardcoded. - Default String Length: Set a default minimum string length of 4 characters in
capa/features/extractors/strings.py
. - Context Dictionary Refactor: Migrated from using
**kwargs
to passing a structuredcontext
dictionary in extractor functions. This improves maintainability and explicitly defines the data available to extractors. - Backend Consistency: Ensured that the
min_str_len
setting is respected and applied consistently across all supported static analysis backends (IDA, Vivisect, pefile, dnfile, elffile) and the VMRay dynamic extractor. - Testing: Added new tests in
tests/test_strings.py
to specifically validate themin_str_len
functionality, including boundary conditions and different length values for both ASCII and Unicode strings. - Changelog Update: Updated the
CHANGELOG.md
to reflect the addition of the configurable string length feature.
Changelog
Click here to see the changelog
- CHANGELOG.md
- Added an entry for the configurable string length feature (make string length configurable and consistent across backends #1303).
- capa/features/extractors/common.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toextract_file_strings
. - Passed
min_str_len
toextract_ascii_strings
andextract_unicode_strings
calls.
- Imported
- capa/features/extractors/dnfile/extractor.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toDnfileFeatureExtractor.__init__
and stored it. - Updated calls to file and function feature extraction functions to pass
min_str_len
via actx
dictionary.
- Imported
- capa/features/extractors/dnfile/file.py
- Modified file feature extraction functions to accept a
ctx
dictionary instead of direct parameters likepe
and**kwargs
. - Extracted
pe
andmin_str_len
from thectx
dictionary where needed. - Updated internal calls within
extract_features
to pass thectx
dictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/dnfile/insn.py
- Changed the string length check in
extract_insn_string_features
to usefh.ctx["min_str_len"]
instead of a hardcoded4
.
- Changed the string length check in
- capa/features/extractors/dotnetfile.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toDotnetFileFeatureExtractor.__init__
and stored it. - Modified file feature extraction functions to accept a
ctx
dictionary instead of direct parameters likepe
and**kwargs
. - Extracted
pe
andmin_str_len
from thectx
dictionary where needed. - Updated internal calls within
extract_file_features
to pass thectx
dictionary.
- Imported
- capa/features/extractors/elffile.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toElfFeatureExtractor.__init__
and stored it. - Modified file feature extraction functions to accept a
ctx
dictionary instead of direct parameters likeelf
,buf
, and**kwargs
. - Extracted
elf
,buf
, andmin_str_len
from thectx
dictionary where needed. - Updated internal calls within
extract_file_features
to pass thectx
dictionary.
- Imported
- capa/features/extractors/ida/extractor.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toIdaFeatureExtractor.__init__
and stored it. - Updated calls to file feature extraction functions and the
get_function
method to passmin_str_len
via actx
dictionary.
- Imported
- capa/features/extractors/ida/file.py
- Modified file feature extraction functions to accept a
ctx
dictionary instead of no parameters or specific ones. - Extracted
min_str_len
from thectx
dictionary for string extraction functions. - Updated internal calls within
extract_features
to pass thectx
dictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/ida/helpers.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default tofind_string_at
.
- Imported
- capa/features/extractors/ida/insn.py
- Changed the string length check in
extract_insn_string_features
to usefh.ctx["min_str_len"]
instead of a hardcoded4
. - Passed
fh.ctx["min_str_len"]
to thefind_string_at
call.
- Changed the string length check in
- capa/features/extractors/pefile.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toPefileFeatureExtractor.__init__
and stored it. - Modified file feature extraction functions to accept a
ctx
dictionary instead of direct parameters likepe
,buf
, and**kwargs
. - Extracted
pe
,buf
, andmin_str_len
from thectx
dictionary where needed. - Updated internal calls within
extract_file_features
to pass thectx
dictionary.
- Imported
- capa/features/extractors/strings.py
- Defined
DEFAULT_STRING_LENGTH = 4
. - Renamed internal regex constants (
ASCII_RE_4
,UNICODE_RE_4
) to use_DEFAULT
suffix. - Updated
extract_ascii_strings
andextract_unicode_strings
to accept amin_str_len
parameter with default and use it to build the regex.
- Defined
- capa/features/extractors/viv/extractor.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toVivisectFeatureExtractor.__init__
and stored it. - Updated calls to file feature extraction functions and the
FunctionHandle
constructor to passmin_str_len
via actx
dictionary.
- Imported
- capa/features/extractors/viv/file.py
- Modified file feature extraction functions to accept a
ctx
dictionary instead of direct parameters likevw
,buf
, and**kwargs
. - Extracted
vw
,buf
, andmin_str_len
from thectx
dictionary where needed. - Updated internal calls within
extract_features
to pass thectx
dictionary.
- Modified file feature extraction functions to accept a
- capa/features/extractors/viv/insn.py
- Changed the string length check in
extract_op_string_features
to usefh.ctx["min_str_len"]
instead of a hardcoded4
.
- Changed the string length check in
- capa/features/extractors/vmray/file.py
- Imported
DEFAULT_STRING_LENGTH
. - Passed
min_str_len=DEFAULT_STRING_LENGTH
to the call tocapa.features.extractors.common.extract_file_strings
.
- Imported
- capa/ida/plugin/form.py
- Changed the call to
self.rulegen_feature_extractor.get_function
to use the keyword argumentea
.
- Changed the call to
- capa/loader.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter with default toget_extractor
andget_file_extractors
. - Passed
min_str_len
to the constructors of various feature extractors (DnfileFeatureExtractor
,PefileFeatureExtractor
,VivisectFeatureExtractor
,ElfFeatureExtractor
,DotnetFileFeatureExtractor
).
- Imported
- capa/main.py
- Imported
DEFAULT_STRING_LENGTH
. - Added
min_str_len
parameter toget_file_extractors_from_cli
. - Passed
min_str_len
tocapa.loader.get_file_extractors
. - Passed
DEFAULT_STRING_LENGTH
tocapa.loader.get_extractor
andcapa.features.extractors.ida.extractor.IdaFeatureExtractor
. - Added a check for
FORMAT_RESULT
before callingfind_capabilities
.
- Imported
- rules
- Updated the subproject commit hash for the
rules
submodule.
- Updated the subproject commit hash for the
- tests/fixtures.py
- Imported
DEFAULT_STRING_LENGTH
. - Passed
DEFAULT_STRING_LENGTH
to the constructors of various feature extractors used in test fixtures (VivisectFeatureExtractor
,PefileFeatureExtractor
,DnfileFeatureExtractor
,DotnetFileFeatureExtractor
).
- Imported
- tests/test_strings.py
- Imported
pytest
. - Updated calls to
extract_ascii_strings
andextract_unicode_strings
to use themin_str_len
keyword argument instead ofn
. - Added a new test function
test_min_str_len
to verify the functionality and error handling of themin_str_len
parameter for both ASCII and Unicode string extraction.
- Imported
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Strings, long or short,
Configurable now, you see,
Consistent across,
Backends, wild and free.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a configurable minimum string length for feature extraction, which is a valuable enhancement for flexibility and consistency across different backends. The refactoring to use a context
dictionary in many places also improves code clarity and maintainability.
Overall, the changes are well-structured and the addition of tests for the new string length configuration is commendable. I've found a couple of areas where the new configurable string length isn't fully applied, which I've detailed in the comments.
Good work on this!
Summary of Findings
- Inconsistent
min_str_len
usage inDotnetFileFeatureExtractor
:DotnetFileFeatureExtractor.extract_file_features
usesDEFAULT_STRING_LENGTH
instead of the configuredself.min_str_len
. - Hardcoded string length in VMRay backend: The VMRay backend's string extraction (
extract_file_strings
incapa/features/extractors/vmray/file.py
) uses the hardcodedDEFAULT_STRING_LENGTH
and does not seem to support the new configurablemin_str_len
. - Docstring parameter name inconsistency (low severity, not commented): In
capa/features/extractors/strings.py
, the docstring forextract_ascii_strings
andextract_unicode_strings
usesmin_len
for the parameter description, while the parameter itself is namedmin_str_len
. This was not commented on due to review settings. - CLI configuration for
min_str_len
(low severity, not commented): The command-line interface incapa/main.py
does not yet expose a flag to setmin_str_len
. It currently usesDEFAULT_STRING_LENGTH
. This is likely a planned future enhancement and not an issue with the current PR's scope of internal configurability. This was not commented on due to review settings.
Merge Readiness
The pull request is a significant step towards configurable string length handling and better code structure. However, there are a couple of issues (one high severity, one medium severity) identified in the review comments that should be addressed to ensure consistency and correctness across all backends. Once these are resolved, the PR should be in good shape for merging. As an AI, I am not authorized to approve pull requests.
@@ -203,7 +211,7 @@ def extract_global_features(self): | |||
yield from extract_global_features(self.pe) | |||
|
|||
def extract_file_features(self): | |||
yield from extract_file_features(self.pe) | |||
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like DEFAULT_STRING_LENGTH
is used here, but the DotnetFileFeatureExtractor
is initialized with min_str_len
(which is stored as self.min_str_len
).
Should this be using self.min_str_len
to ensure the configured minimum string length is respected for file feature extraction in this extractor, similar to how it's handled in other extractors like PefileFeatureExtractor
or ElfFeatureExtractor
?
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH}) | |
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": self.min_str_len}) |
yield from capa.features.extractors.common.extract_file_strings( | ||
analysis.submission_bytes, min_str_len=DEFAULT_STRING_LENGTH | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The min_str_len
here is hardcoded to DEFAULT_STRING_LENGTH
. To make the string length configurable for the VMRay backend consistent with other backends, should min_str_len
be passed down to this function?
This might involve updating VMRayExtractor
or how VMRayAnalysis
is handled to include the min_str_len
configuration. Currently, the VMRay backend won't respect a custom minimum string length.
closes #1303
Makes string length configurable and consistent across backends
Key Changes
min_str_len
argument to relevant functions infeatures/extractors
DEFAULT_STRING_LENGTH = 4
incapa/features/extractors/strings.py
kwargs
to structuredcontext
dictionary for better maintainability wherever applicable.CHANGELOG.md
.Motivation
Testing
test_min_str_len
tocapa/tests/test_strings.py
validating:Checklist