Makes string length configurable and consistent across backends. Closes #1303 #2678

Shajal-Kumar · 2025-06-04T11:24:54Z

closes #1303

Makes string length configurable and consistent across backends

Key Changes

Added min_str_len argument to relevant functions in features/extractors
Set DEFAULT_STRING_LENGTH = 4 in capa/features/extractors/strings.py
Migrated from kwargs to structured context dictionary for better maintainability wherever applicable.
Ensured consistent string length handling in extractors.
Updated CHANGELOG.md.
Added tests.

Motivation

Decouples string length configuration from hardcoded values while setting a default.
Prepares codebase for future PyGhidra migration (#2600) by standardizing backend interfaces
Improves readability through explicit context passing

Testing

Added comprehensive test_min_str_len to capa/tests/test_strings.py validating:
- Boundary checks for minimum string length
- Error handling for invalid values
- Consistent behavior across ASCII/Unicode extraction
Verified functionality across supported disassemblers

Checklist

[] No CHANGELOG update needed
[] No new tests needed
[] No documentation update needed

…lso added condition based metadata extraction.

…d arguments to all file handlers in elffile.py

…r not

…s.py

gemini-code-assist

Hello @Shajal-Kumar, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request by @Shajal-Kumar. This PR addresses issue #1303 by making the minimum string length used during feature extraction configurable and ensuring this setting is applied consistently across different backends (IDA Pro, Vivisect, pefile, dnfile, elffile, and vmray). The changes involve introducing a min_str_len parameter, setting a default value of 4, and refactoring extractor functions to pass configuration details via a context dictionary instead of using **kwargs. Comprehensive tests have been added to validate the new functionality and ensure consistency.

Highlights

Configurable String Length: Introduced a min_str_len parameter to control the minimum length of strings extracted by capa. This value is now configurable instead of being hardcoded.
Default String Length: Set a default minimum string length of 4 characters in capa/features/extractors/strings.py.
Context Dictionary Refactor: Migrated from using **kwargs to passing a structured context dictionary in extractor functions. This improves maintainability and explicitly defines the data available to extractors.
Backend Consistency: Ensured that the min_str_len setting is respected and applied consistently across all supported static analysis backends (IDA, Vivisect, pefile, dnfile, elffile) and the VMRay dynamic extractor.
Testing: Added new tests in tests/test_strings.py to specifically validate the min_str_len functionality, including boundary conditions and different length values for both ASCII and Unicode strings.
Changelog Update: Updated the CHANGELOG.md to reflect the addition of the configurable string length feature.

Changelog

Click here to see the changelog

CHANGELOG.md
- Added an entry for the configurable string length feature (make string length configurable and consistent across backends #1303).
capa/features/extractors/common.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to extract_file_strings.
- Passed min_str_len to extract_ascii_strings and extract_unicode_strings calls.
capa/features/extractors/dnfile/extractor.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to DnfileFeatureExtractor.__init__ and stored it.
- Updated calls to file and function feature extraction functions to pass min_str_len via a ctx dictionary.
capa/features/extractors/dnfile/file.py
- Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe and **kwargs.
- Extracted pe and min_str_len from the ctx dictionary where needed.
- Updated internal calls within extract_features to pass the ctx dictionary.
capa/features/extractors/dnfile/insn.py
- Changed the string length check in extract_insn_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
capa/features/extractors/dotnetfile.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to DotnetFileFeatureExtractor.__init__ and stored it.
- Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe and **kwargs.
- Extracted pe and min_str_len from the ctx dictionary where needed.
- Updated internal calls within extract_file_features to pass the ctx dictionary.
capa/features/extractors/elffile.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to ElfFeatureExtractor.__init__ and stored it.
- Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like elf, buf, and **kwargs.
- Extracted elf, buf, and min_str_len from the ctx dictionary where needed.
- Updated internal calls within extract_file_features to pass the ctx dictionary.
capa/features/extractors/ida/extractor.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to IdaFeatureExtractor.__init__ and stored it.
- Updated calls to file feature extraction functions and the get_function method to pass min_str_len via a ctx dictionary.
capa/features/extractors/ida/file.py
- Modified file feature extraction functions to accept a ctx dictionary instead of no parameters or specific ones.
- Extracted min_str_len from the ctx dictionary for string extraction functions.
- Updated internal calls within extract_features to pass the ctx dictionary.
capa/features/extractors/ida/helpers.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to find_string_at.
capa/features/extractors/ida/insn.py
- Changed the string length check in extract_insn_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
- Passed fh.ctx["min_str_len"] to the find_string_at call.
capa/features/extractors/pefile.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to PefileFeatureExtractor.__init__ and stored it.
- Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like pe, buf, and **kwargs.
- Extracted pe, buf, and min_str_len from the ctx dictionary where needed.
- Updated internal calls within extract_file_features to pass the ctx dictionary.
capa/features/extractors/strings.py
- Defined DEFAULT_STRING_LENGTH = 4.
- Renamed internal regex constants (ASCII_RE_4, UNICODE_RE_4) to use _DEFAULT suffix.
- Updated extract_ascii_strings and extract_unicode_strings to accept a min_str_len parameter with default and use it to build the regex.
capa/features/extractors/viv/extractor.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to VivisectFeatureExtractor.__init__ and stored it.
- Updated calls to file feature extraction functions and the FunctionHandle constructor to pass min_str_len via a ctx dictionary.
capa/features/extractors/viv/file.py
- Modified file feature extraction functions to accept a ctx dictionary instead of direct parameters like vw, buf, and **kwargs.
- Extracted vw, buf, and min_str_len from the ctx dictionary where needed.
- Updated internal calls within extract_features to pass the ctx dictionary.
capa/features/extractors/viv/insn.py
- Changed the string length check in extract_op_string_features to use fh.ctx["min_str_len"] instead of a hardcoded 4.
capa/features/extractors/vmray/file.py
- Imported DEFAULT_STRING_LENGTH.
- Passed min_str_len=DEFAULT_STRING_LENGTH to the call to capa.features.extractors.common.extract_file_strings.
capa/ida/plugin/form.py
- Changed the call to self.rulegen_feature_extractor.get_function to use the keyword argument ea.
capa/loader.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter with default to get_extractor and get_file_extractors.
- Passed min_str_len to the constructors of various feature extractors (DnfileFeatureExtractor, PefileFeatureExtractor, VivisectFeatureExtractor, ElfFeatureExtractor, DotnetFileFeatureExtractor).
capa/main.py
- Imported DEFAULT_STRING_LENGTH.
- Added min_str_len parameter to get_file_extractors_from_cli.
- Passed min_str_len to capa.loader.get_file_extractors.
- Passed DEFAULT_STRING_LENGTH to capa.loader.get_extractor and capa.features.extractors.ida.extractor.IdaFeatureExtractor.
- Added a check for FORMAT_RESULT before calling find_capabilities.
rules
- Updated the subproject commit hash for the rules submodule.
tests/fixtures.py
- Imported DEFAULT_STRING_LENGTH.
- Passed DEFAULT_STRING_LENGTH to the constructors of various feature extractors used in test fixtures (VivisectFeatureExtractor, PefileFeatureExtractor, DnfileFeatureExtractor, DotnetFileFeatureExtractor).
tests/test_strings.py
- Imported pytest.
- Updated calls to extract_ascii_strings and extract_unicode_strings to use the min_str_len keyword argument instead of n.
- Added a new test function test_min_str_len to verify the functionality and error handling of the min_str_len parameter for both ASCII and Unicode string extraction.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Strings, long or short,
Configurable now, you see,
Consistent across,
Backends, wild and free.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a configurable minimum string length for feature extraction, which is a valuable enhancement for flexibility and consistency across different backends. The refactoring to use a context dictionary in many places also improves code clarity and maintainability.

Overall, the changes are well-structured and the addition of tests for the new string length configuration is commendable. I've found a couple of areas where the new configurable string length isn't fully applied, which I've detailed in the comments.

Good work on this!

Summary of Findings

Inconsistent min_str_len usage in DotnetFileFeatureExtractor: DotnetFileFeatureExtractor.extract_file_features uses DEFAULT_STRING_LENGTH instead of the configured self.min_str_len.
Hardcoded string length in VMRay backend: The VMRay backend's string extraction (extract_file_strings in capa/features/extractors/vmray/file.py) uses the hardcoded DEFAULT_STRING_LENGTH and does not seem to support the new configurable min_str_len.
Docstring parameter name inconsistency (low severity, not commented): In capa/features/extractors/strings.py, the docstring for extract_ascii_strings and extract_unicode_strings uses min_len for the parameter description, while the parameter itself is named min_str_len. This was not commented on due to review settings.
CLI configuration for min_str_len (low severity, not commented): The command-line interface in capa/main.py does not yet expose a flag to set min_str_len. It currently uses DEFAULT_STRING_LENGTH. This is likely a planned future enhancement and not an issue with the current PR's scope of internal configurability. This was not commented on due to review settings.

Merge Readiness

The pull request is a significant step towards configurable string length handling and better code structure. However, there are a couple of issues (one high severity, one medium severity) identified in the review comments that should be addressed to ensure consistency and correctness across all backends. Once these are resolved, the PR should be in good shape for merging. As an AI, I am not authorized to approve pull requests.

gemini-code-assist · 2025-06-04T11:26:53Z

capa/features/extractors/dotnetfile.py

@@ -203,7 +211,7 @@ def extract_global_features(self):
        yield from extract_global_features(self.pe)

    def extract_file_features(self):
-        yield from extract_file_features(self.pe)
+        yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH})


It seems like DEFAULT_STRING_LENGTH is used here, but the DotnetFileFeatureExtractor is initialized with min_str_len (which is stored as self.min_str_len).

Should this be using self.min_str_len to ensure the configured minimum string length is respected for file feature extraction in this extractor, similar to how it's handled in other extractors like PefileFeatureExtractor or ElfFeatureExtractor?

Suggested change

yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH})

yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": self.min_str_len})

gemini-code-assist · 2025-06-04T11:26:53Z

capa/features/extractors/vmray/file.py

+        yield from capa.features.extractors.common.extract_file_strings(
+            analysis.submission_bytes, min_str_len=DEFAULT_STRING_LENGTH
+        )


The min_str_len here is hardcoded to DEFAULT_STRING_LENGTH. To make the string length configurable for the VMRay backend consistent with other backends, should min_str_len be passed down to this function?

This might involve updating VMRayExtractor or how VMRayAnalysis is handled to include the min_str_len configuration. Currently, the VMRay backend won't respect a custom minimum string length.

Shajal-Kumar and others added 18 commits March 18, 2025 17:08

added default string length and changed n to min_len in strings.py

f26d321

added min_str_len argument to common.py

59ef086

added ctx to elffile.py

0649227

added ctx to pefile.py and made minor changes to elffile.py

68d59cd

fixed argument related issues and added ctx to pefile.py

bb8ad59

added minimum string length functionality to loader.py and main.py. A…

fa234a6

…lso added condition based metadata extraction.

added minimum string length functionality to the Viv Extractor

eaa62c6

Added min_str_len and context dictionary to dnfile

bb19023

Added min_str_len and context dictionary to ida.

a3071fd

Updated tests/fixtures.py to incorporate DEFAULT_STRING_LENGTH

ddbbacd

Fixed dotnetfile.py, fixed issues with vmrayextractor, added ctx base…

b19cb70

…d arguments to all file handlers in elffile.py

Fixed dnfile to check if the parameter being passed is a dictionary o…

df5aa67

…r not

Fixed failing tests.

666501e

Passes all tests. Fixes linting issues.

954f7a2

Merge branch 'mandiant:master' into master

d64d4f7

Adds test to validate the changes related to the issue to test_string…

0d49215

…s.py

Updates changelog.

4129c09

Removes large files from PR.

265e6c8

gemini-code-assist bot reviewed Jun 4, 2025

View reviewed changes

gemini-code-assist bot suggested changes Jun 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Makes string length configurable and consistent across backends. Closes #1303 #2678

Makes string length configurable and consistent across backends. Closes #1303 #2678

Uh oh!

Shajal-Kumar commented Jun 4, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jun 4, 2025

Uh oh!

gemini-code-assist bot Jun 4, 2025

Uh oh!

Uh oh!

	yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": DEFAULT_STRING_LENGTH})
	yield from extract_file_features(ctx={"pe": self.pe, "min_str_len": self.min_str_len})

Makes string length configurable and consistent across backends. Closes #1303 #2678

Are you sure you want to change the base?

Makes string length configurable and consistent across backends. Closes #1303 #2678

Uh oh!

Conversation

Shajal-Kumar commented Jun 4, 2025