Skip to content

Makes string length configurable and consistent across backends. Closes #1303 #2678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

Shajal-Kumar
Copy link

closes #1303

Makes string length configurable and consistent across backends

Key Changes

  • Added min_str_len argument to relevant functions in features/extractors
  • Set DEFAULT_STRING_LENGTH = 4 in capa/features/extractors/strings.py
  • Migrated from kwargs to structured context dictionary for better maintainability wherever applicable.
  • Ensured consistent string length handling in extractors.
  • Updated CHANGELOG.md.
  • Added tests.

Motivation

  • Decouples string length configuration from hardcoded values while setting a default.
  • Prepares codebase for future PyGhidra migration (#2600) by standardizing backend interfaces
  • Improves readability through explicit context passing

Testing

  • Added comprehensive test_min_str_len to capa/tests/test_strings.py validating:
    • Boundary checks for minimum string length
    • Error handling for invalid values
    • Consistent behavior across ASCII/Unicode extraction
  • Verified functionality across supported disassemblers

Checklist

  • [] No CHANGELOG update needed
  • [] No new tests needed
  • [] No documentation update needed

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Shajal-Kumar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request by @Shajal-Kumar. This PR addresses issue #1303 by making the minimum string length used during feature extraction configurable and ensuring this setting is applied consistently across different backends (IDA Pro, Vivisect, pefile, dnfile, elffile, and vmray). The changes involve introducing a min_str_len parameter, setting a default value of 4, and refactoring extractor functions to pass configuration details via a context dictionary instead of using **kwargs. Comprehensive tests have been added to validate the new functionality and ensure consistency.

Highlights

  • Configurable String Length: Introduced a min_str_len parameter to control the minimum length of strings extracted by capa. This value is now configurable instead of being hardcoded.
  • Default String Length: Set a default minimum string length of 4 characters in capa/features/extractors/strings.py.
  • Context Dictionary Refactor: Migrated from using **kwargs to passing a structured context dictionary in extractor functions. This improves maintainability and explicitly defines the data available to extractors.
  • Backend Consistency: Ensured that the min_str_len setting is respected and applied consistently across all supported static analysis backends (IDA, Vivisect, pefile, dnfile, elffile) and the VMRay dynamic extractor.
  • Testing: Added new tests in tests/test_strings.py to specifically validate the min_str_len functionality, including boundary conditions and different length values for both ASCII and Unicode strings.
  • Changelog Update: Updated the CHANGELOG.md to reflect the addition of the configurable string length feature.

Changelog

Click here to see the changelog
  • CHANGELOG.md
  • capa/features/extractors/common.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to extract_file_strings.
    • Passed min_str_len to extract_ascii_strings and extract_unicode_strings calls.
  • capa/features/extractors/dnfile/extractor.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to DnfileFeatureExtractor.__init__ and stored it.
    • Updated calls to file and function feature extraction functions to pass min_str_len via a ctx dictionary.
  • capa/features/extractors/dnfile/file.py
    • Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe and **kwargs.
    • Extracted pe and min_str_len from the ctx dictionary where needed.
    • Updated internal calls within extract_features to pass the ctx dictionary.
  • capa/features/extractors/dnfile/insn.py
    • Changed the string length check in extract_insn_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
  • capa/features/extractors/dotnetfile.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to DotnetFileFeatureExtractor.__init__ and stored it.
    • Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe and **kwargs.
    • Extracted pe and min_str_len from the ctx dictionary where needed.
    • Updated internal calls within extract_file_features to pass the ctx dictionary.
  • capa/features/extractors/elffile.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to ElfFeatureExtractor.__init__ and stored it.
    • Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like elf, buf, and **kwargs.
    • Extracted elf, buf, and min_str_len from the ctx dictionary where needed.
    • Updated internal calls within extract_file_features to pass the ctx dictionary.
  • capa/features/extractors/ida/extractor.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to IdaFeatureExtractor.__init__ and stored it.
    • Updated calls to file feature extraction functions and the get_function method to pass min_str_len via a ctx dictionary.
  • capa/features/extractors/ida/file.py
    • Modified file feature extraction functions to accept a ctx dictionary instead of no parameters or specific ones.
    • Extracted min_str_len from the ctx dictionary for string extraction functions.
    • Updated internal calls within extract_features to pass the ctx dictionary.
  • capa/features/extractors/ida/helpers.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to find_string_at.
  • capa/features/extractors/ida/insn.py
    • Changed the string length check in extract_insn_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
    • Passed fh.ctx["min_str_len"] to the find_string_at call.
  • capa/features/extractors/pefile.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to PefileFeatureExtractor.__init__ and stored it.
    • Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe, buf, and **kwargs.
    • Extracted pe, buf, and min_str_len from the ctx dictionary where needed.
    • Updated internal calls within extract_file_features to pass the ctx dictionary.
  • capa/features/extractors/strings.py
    • Defined DEFAULT_STRING_LENGTH = 4.
    • Renamed internal regex constants (ASCII_RE_4, UNICODE_RE_4) to use _DEFAULT suffix.
    • Updated extract_ascii_strings and extract_unicode_strings to accept a min_str_len parameter with default and use it to build the regex.
  • capa/features/extractors/viv/extractor.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to VivisectFeatureExtractor.__init__ and stored it.
    • Updated calls to file feature extraction functions and the FunctionHandle constructor to pass min_str_len via a ctx dictionary.
  • capa/features/extractors/viv/file.py
    • Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like vw, buf, and **kwargs.
    • Extracted vw, buf, and min_str_len from the ctx dictionary where needed.
    • Updated internal calls within extract_features to pass the ctx dictionary.
  • capa/features/extractors/viv/insn.py
    • Changed the string length check in extract_op_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
  • capa/features/extractors/vmray/file.py
    • Imported DEFAULT_STRING_LENGTH.
    • Passed min_str_len=DEFAULT_STRING_LENGTH to the call to capa.features.extractors.common.extract_file_strings.
  • capa/ida/plugin/form.py
    • Changed the call to self.rulegen_feature_extractor.get_function to use the keyword argument ea.
  • capa/loader.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter with default to get_extractor and get_file_extractors.
    • Passed min_str_len to the constructors of various feature extractors (DnfileFeatureExtractor, PefileFeatureExtractor, VivisectFeatureExtractor, ElfFeatureExtractor, DotnetFileFeatureExtractor).
  • capa/main.py
    • Imported DEFAULT_STRING_LENGTH.
    • Added min_str_len parameter to get_file_extractors_from_cli.
    • Passed min_str_len to capa.loader.get_file_extractors.
    • Passed DEFAULT_STRING_LENGTH to capa.loader.get_extractor and capa.features.extractors.ida.extractor.IdaFeatureExtractor.
    • Added a check for FORMAT_RESULT before calling find_capabilities.
  • rules
    • Updated the subproject commit hash for the rules submodule.
  • tests/fixtures.py
    • Imported DEFAULT_STRING_LENGTH.
    • Passed DEFAULT_STRING_LENGTH to the constructors of various feature extractors used in test fixtures (VivisectFeatureExtractor, PefileFeatureExtractor, DnfileFeatureExtractor, DotnetFileFeatureExtractor).
  • tests/test_strings.py
    • Imported pytest.
    • Updated calls to extract_ascii_strings and extract_unicode_strings to use the min_str_len keyword argument instead of n.
    • Added a new test function test_min_str_len to verify the functionality and error handling of the min_str_len parameter for both ASCII and Unicode string extraction.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Strings, long or short,
Configurable now, you see,
Consistent across,
Backends, wild and free.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a configurable minimum string length for feature extraction, which is a valuable enhancement for flexibility and consistency across different backends. The refactoring to use a context dictionary in many places also improves code clarity and maintainability.

Overall, the changes are well-structured and the addition of tests for the new string length configuration is commendable. I've found a couple of areas where the new configurable string length isn't fully applied, which I've detailed in the comments.

Good work on this!

Summary of Findings

  • Inconsistent min_str_len usage in DotnetFileFeatureExtractor: DotnetFileFeatureExtractor.extract_file_features uses DEFAULT_STRING_LENGTH instead of the configured self.min_str_len.
  • Hardcoded string length in VMRay backend: The VMRay backend's string extraction (extract_file_strings in capa/features/extractors/vmray/file.py) uses the hardcoded DEFAULT_STRING_LENGTH and does not seem to support the new configurable min_str_len.
  • Docstring parameter name inconsistency (low severity, not commented): In capa/features/extractors/strings.py, the docstring for extract_ascii_strings and extract_unicode_strings uses min_len for the parameter description, while the parameter itself is named min_str_len. This was not commented on due to review settings.
  • CLI configuration for min_str_len (low severity, not commented): The command-line interface in capa/main.py does not yet expose a flag to set min_str_len. It currently uses DEFAULT_STRING_LENGTH. This is likely a planned future enhancement and not an issue with the current PR's scope of internal configurability. This was not commented on due to review settings.

Merge Readiness

The pull request is a significant step towards configurable string length handling and better code structure. However, there are a couple of issues (one high severity, one medium severity) identified in the review comments that should be addressed to ensure consistency and correctness across all backends. Once these are resolved, the PR should be in good shape for merging. As an AI, I am not authorized to approve pull requests.

@@ -203,7 +211,7 @@ def extract_global_features(self):
yield from extract_global_features(self.pe)

def extract_file_features(self):
yield from extract_file_features(self.pe)
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

It seems like DEFAULT_STRING_LENGTH is used here, but the DotnetFileFeatureExtractor is initialized with min_str_len (which is stored as self.min_str_len).

Should this be using self.min_str_len to ensure the configured minimum string length is respected for file feature extraction in this extractor, similar to how it's handled in other extractors like PefileFeatureExtractor or ElfFeatureExtractor?

Suggested change
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH})
yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": self.min_str_len})

Comment on lines +72 to +74
yield from capa.features.extractors.common.extract_file_strings(
analysis.submission_bytes, min_str_len=DEFAULT_STRING_LENGTH
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The min_str_len here is hardcoded to DEFAULT_STRING_LENGTH. To make the string length configurable for the VMRay backend consistent with other backends, should min_str_len be passed down to this function?

This might involve updating VMRayExtractor or how VMRayAnalysis is handled to include the min_str_len configuration. Currently, the VMRay backend won't respect a custom minimum string length.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make string length configurable and consistent across backends
1 participant