Elevating Code Security and Reliability via LLM-Augmented Static Analysis: Enhancing Source Annotations, AST, CFG, and Call Graphs
Arjun Gopalakrishna
Static analysis stands as one of the most reliable techniques for detecting software vulnerabilities, code quality issues, and standards-compliance problems early in the software development lifecycle. Traditionally, static analysis tools employ a multi-phase approach: they parse code into an Abstract Syntax Tree (AST), transform the code structure into Control Flow Graphs (CFGs) that model possible execution paths, and construct Call Graphs to represent interprocedural relationships. These representations are effective at capturing syntax and basic flow, but they often lack nuanced information regarding security posture, domain-specific roles, and annotations indicating how code should be used or validated.
Recent advances in Large Language Models (LLMs) have shown potential for bridging this gap by providing higher-level insights into source code. LLMs, trained on diverse codebases and textual resources, can reason about code semantics, identify suspicious patterns, and even propose code annotations that traditional static analysis might miss. This paper proposes a holistic approach to using LLMs in all phases of static analysis—from source code annotation (e.g., automatically suggesting SAL annotations for C/C++ parameters) to semantically enriching the AST, CFG, and Call Graph with metadata indicating potential security risks, data flow concerns, or domain-specific functionality. By integrating LLMs at every step, developers and security engineers can derive a richer representation of the code, yielding fewer false positives, fewer missed vulnerabilities, and deeper insights into the correctness and safety of the system.
Static analysis is central to modern software engineering practices. It involves analyzing the source code without executing it, thereby allowing software teams to uncover defects and vulnerabilities earlier than might otherwise be possible through testing or runtime monitoring. Within continuous integration pipelines, static analysis tools often act as the first line of defense, catching logic errors and security flaws at the commit stage and preventing them from propagating into production. Examples of widely used static analyzers include CodeQL and Semgrep.
Despite their utility, these analyzers can struggle with:
• False Positives: The tool highlights an issue that is not actually a bug or a vulnerability, leading to “alert fatigue.”
• False Negatives: The tool misses real problems because of incomplete or shallow contextual knowledge (e.g., complicated reflection code or framework-driven calls).
• Lack of Semantic Information: Many static analyzers rely on syntactic constructs (like AST nodes) or purely structural flow (CFG edges), but do not factor in broader semantic context—like domain knowledge or code usage patterns.
Addressing these challenges requires approaches that can intelligently interpret code, not just parse it.
Large Language Models (LLMs), exemplified by OpenAI's GPT-4, Anthropic's Claude, and others, are neural networks trained on immense corpora of natural language and source code, including open-source repositories, API documentation, and technical and mathematical texts. These models have begun to demonstrate sophisticated capabilities:
• Generating code based on textual specifications.
• Summarizing complex functions or modules.
• Identifying potential security weaknesses or style inconsistencies.
• Suggesting improvements or refactorings.
While free-form text output from an LLM is valuable for human developers, static analysis pipelines generally consume structured representations. Thus, the challenge lies in systematically capturing the LLM’s insights—whether about a function’s purpose, potential vulnerabilities, or recommended usage constraints—and integrating them into the code’s existing representations (source annotations, AST nodes, CFG edges, and so forth).
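One lightweight way to capture those insights in a machine-consumable form is to have the LLM emit a small, fixed record per code element. The following Python sketch defines one such record and validates a raw model response against it; the field names and example values are illustrative assumptions, not an established schema:

# Minimal sketch: a structured record for LLM-supplied insights about a code
# element. Field names are illustrative assumptions, not an established schema.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class LlmInsight:
    element_id: str               # AST node ID, CFG edge ID, or function name
    summary: str                  # one-line description of the element's purpose
    risk_category: Optional[str]  # e.g., "command_injection", or None
    confidence: float             # 0.0 - 1.0, as reported by the model

def parse_llm_response(raw: str) -> LlmInsight:
    """Parse and lightly validate the JSON the LLM was asked to return."""
    data = json.loads(raw)
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return LlmInsight(
        element_id=data["element_id"],
        summary=data["summary"],
        risk_category=data.get("risk_category"),
        confidence=confidence,
    )

raw = (
    '{"element_id": "Call_23", "summary": "runs a shell command", '
    '"risk_category": "command_injection", "confidence": 0.8}'
)
print(parse_llm_response(raw))

Downstream passes can then treat these records uniformly, regardless of whether they describe an AST node, a CFG edge, or a call graph entry.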
Rather than relegating LLMs to a post-hoc stage (e.g., generating commentary after the static analyzer has run), we propose injecting LLM-driven intelligence throughout the static analysis pipeline:
1. Source Code Annotations: By automatically proposing or refining SAL annotations (for C/C++), Javadoc parameters (for Java), or docstrings (for Python), LLMs can help instruct the static analyzer about pointer usage, nullability, or valid parameter ranges.
2. AST Enrichment: Once code is parsed, AST nodes can be tagged with semantic labels regarding security classification, potential domain usage, or recognized sanitizers.
3. CFG Annotations: Branches or edges can be labeled by an LLM with probable runtime conditions, risk ratings, or path-level assumptions (e.g., “this path is only reachable by admin users”).
4. Call Graph Clarifications: By analyzing naming patterns, docstrings, or partial reflection usage, LLMs can refine the call graph, adding edges or adjusting confidence levels for potential indirect calls.
In combination, these layers of intelligence lead to a more precise static analysis outcome, enabling advanced checks — like cross-function taint tracking with accurate source-sink identification — and drastically reducing the overhead of manual triage.
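As a concrete, if deliberately simplified, illustration of what such cross-function checks can look like once sources, sanitizers, and sinks have been tagged, the Python sketch below walks a toy call graph and reports source-to-sink call chains that never pass through a sanitizer. All function names and tags are hypothetical:

# Minimal sketch: cross-function source-to-sink reachability over a call graph
# whose nodes carry hypothetical LLM-supplied tags ("source", "sanitizer", "sink").
from collections import deque

call_graph = {                  # caller -> callees (illustrative only)
    "read_request": ["build_query"],
    "build_query": ["run_query"],
    "run_query": [],
}
llm_tags = {                    # tags assumed to come from the LLM passes
    "read_request": "source",
    "run_query": "sink",
}

def unsanitized_paths(graph, tags):
    """Return source->sink call chains that never pass through a sanitizer."""
    findings = []
    sources = [f for f, t in tags.items() if t == "source"]
    for src in sources:
        queue = deque([[src]])
        while queue:
            path = queue.popleft()
            current = path[-1]
            if tags.get(current) == "sink":
                findings.append(path)
                continue
            for callee in graph.get(current, []):
                if tags.get(callee) != "sanitizer":   # sanitizers break the taint chain
                    queue.append(path + [callee])
    return findings

print(unsanitized_paths(call_graph, llm_tags))
# [['read_request', 'build_query', 'run_query']]

In a real pipeline, the graph and the tags would come from the call graph construction and the LLM-driven passes described in the following sections.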
Source Annotation Language (SAL), frequently used within Microsoft’s C/C++ ecosystem, provides a means to explicitly annotate function parameters and return values with constraints—e.g., _In_, _Out_, _Inout_, _Null_terminated_, _Ret_maybenull_, etc. These annotations guide static analysis tools in verifying pointer usage, buffer sizes, and potential null dereferences. Other languages have their own analogs, such as Javadoc tags in Java or docstrings in Python, which can be used for type hints and parameter clarifications.
Unfortunately, these annotations are often lacking in legacy or third-party code. LLMs can fill this gap by generating or refining them, ensuring the static analysis tool sees a more accurate specification of how each function or parameter should behave.
Before any deeper analysis, static analyzers rely on a parsed structural view of the code. The AST organizes nodes for variables, function calls, loops, etc. While essential, a standard AST does not encode why a function is called or how a parameter might be used in a broader context. As a result, purely AST-based checks often produce rudimentary pattern matches (e.g., “Look for all calls to strcpy in C code.”) but fail to consider essential details like whether strcpy is actually being used on a trusted buffer or if the size was previously validated.
CFGs outline the possible execution paths in a function or method. Each node may represent a basic block, with edges signifying branches. CFG-based analyses can detect unreachable code, potential infinite loops, or data flow anomalies. However, CFGs themselves remain structural: they do not inherently explain if a particular branch handles sensitive data or if a loop is only triggered under specific domain constraints (e.g., “Only run if user is an administrator”).
Static analysis often proceeds beyond single functions, requiring a call graph that represents interprocedural invocation relationships. Call graphs are crucial for advanced checks like taint analysis across function boundaries. Nonetheless, resolving call graphs accurately in the presence of dynamic languages, function pointers, or reflection is challenging. Tools might “give up” or produce an over-approximation, missing crucial calls or incorrectly inferring them.
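One pragmatic way to keep such approximations manageable, sketched below in Python with hypothetical function names, is to record every call edge with a provenance and a confidence score, so that LLM-suggested targets for reflective or indirect calls can sit alongside statically proven edges without being mistaken for them:

# Minimal sketch: a call graph whose edges carry a provenance and confidence,
# so LLM-suggested targets for indirect calls can coexist with proven edges.
from dataclasses import dataclass, field

@dataclass
class CallEdge:
    caller: str
    callee: str
    provenance: str      # "static" (resolved by the compiler) or "llm" (suggested)
    confidence: float    # 1.0 for proven edges, lower for LLM suggestions

@dataclass
class CallGraph:
    edges: list = field(default_factory=list)

    def add_static_edge(self, caller, callee):
        self.edges.append(CallEdge(caller, callee, "static", 1.0))

    def add_llm_edge(self, caller, callee, confidence):
        # Hypothetical: the LLM inferred this target from naming or reflection usage.
        self.edges.append(CallEdge(caller, callee, "llm", confidence))

    def callees(self, caller, min_confidence=0.0):
        return [e.callee for e in self.edges
                if e.caller == caller and e.confidence >= min_confidence]

cg = CallGraph()
cg.add_static_edge("dispatch", "handle_login")
cg.add_llm_edge("dispatch", "handle_admin_reset", confidence=0.7)  # reflective call guess
print(cg.callees("dispatch"))                      # ['handle_login', 'handle_admin_reset']
print(cg.callees("dispatch", min_confidence=0.9))  # ['handle_login']

Analyses that require soundness can then filter on provenance or a confidence threshold, while triage-oriented checks can opt into the speculative edges.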
• Reduced Human Effort: Rather than requiring domain experts to manually annotate thousands of functions, an LLM can make a best-effort pass that engineers can subsequently refine.
• Improved Automated Checks: Once SAL (or a similar annotation scheme) is in place, static analyzers automatically gain deeper awareness of pointer usage, possible null references, and expected buffer sizes.
• Incremental Adoption: Annotations can be added incrementally — function by function or module by module — allowing large projects to gradually enhance coverage.
1. Identification of Unannotated Functions: A scanning tool enumerates all function declarations and definitions lacking SAL.
2. Prompt Construction: For each function, the code snippet, existing type signatures, and partial usage context (where feasible) are provided to the LLM (a minimal tooling sketch of this loop follows the worked example below).
3. LLM Inference and Output: The LLM proposes an annotated signature (e.g., _In_, _Out_, _Inout_, pointer sizes, and conditions like _Ret_maybenull_).
4. Developer Review: An engineer reviews the changes. If correct, they are merged into the codebase.
5. Static Analysis Update: Tools like Visual Studio’s Code Analysis or Clang’s static analyzer can now interpret these new annotations to issue more accurate warnings or confirm safety invariants.
Original C Function:
int sanitizeAndProcess(char* inputData, size_t maxLength) {
    // Implementation omitted
}
Potential LLM Output:
// LLM Explanation:
// "The first parameter seems to be an in-out buffer possibly containing user input.
// The second parameter indicates the maximum length.
// We'll propose SAL to reflect that the buffer must not exceed maxLength in size."
int sanitizeAndProcess(
    _Inout_updates_(maxLength) char* inputData,
    _In_ size_t maxLength
);
Having this annotation clarifies that inputData is not merely an input pointer but a buffer that can be modified and is size-bounded by maxLength. When the static analyzer runs next, it can check for out-of-bounds accesses or potential buffer overruns inside sanitizeAndProcess.
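As a rough sketch of how the suggestion loop itself might be wired up, the Python snippet below enumerates function signatures that lack SAL, builds a prompt per function, and records the model's proposed signature for human review. The query_llm helper, the regular expression, and the canned response are placeholders rather than any real tool's API:

# Minimal sketch of the annotation-suggestion loop. query_llm is a hypothetical
# stand-in for whatever LLM client is available; the regex heuristic is
# deliberately simplistic and only meant to show the shape of the pipeline.
import re

def query_llm(prompt: str) -> str:
    # Placeholder: in a real pipeline this would call an LLM API.
    return "_Inout_updates_(maxLength) char* inputData, _In_ size_t maxLength"

def find_unannotated_signatures(source: str):
    """Yield C function signatures that contain no SAL annotations (heuristic)."""
    for match in re.finditer(r"^[\w\*\s]+\s+\w+\s*\([^)]*\)", source, re.MULTILINE):
        signature = match.group(0)
        if "_In_" not in signature and "_Out_" not in signature:
            yield signature.strip()

def propose_annotations(source: str):
    """Return {original signature: LLM-proposed annotated parameter list} for review."""
    proposals = {}
    for signature in find_unannotated_signatures(source):
        prompt = (
            "Propose SAL annotations for the parameters of this C function, "
            "returning only the annotated parameter list:\n" + signature
        )
        proposals[signature] = query_llm(prompt)   # reviewed by an engineer before merge
    return proposals

code = "int sanitizeAndProcess(char* inputData, size_t maxLength) {\n    /* ... */\n}\n"
print(propose_annotations(code))

The key design point is that the LLM's output is staged as a proposal keyed by the original signature, so the developer-review step retains full control over what is merged.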
While SAL annotations help define function-level contracts, the AST is where intra-function details emerge, such as local variable declarations, conditional blocks, and function calls. By integrating LLM inferences (e.g., “variable userInput is tainted, coming from an untrusted source”), the static analyzer can make more informed decisions when checking for code issues.
1. Construct Baseline AST: Use a compiler front-end (e.g., Clang for C/C++, or Python’s ast module) to parse the code.
2. Traverse the AST: For each node — especially nodes representing calls, assignments, or loops — the tool collects relevant snippets.
3. LLM Prompting: The snippet, plus any partial context (e.g., “This variable was declared earlier as an input parameter,” or “We suspect this function is a sanitizer”), is fed to the LLM.
4. LLM Output: The LLM returns structured data (often in JSON) with additional fields, such as taint_status, domain_role, and security_level.
5. AST Augmentation: The static analyzer merges these fields into the AST node, effectively creating an “extended AST” or storing the metadata in an auxiliary mapping keyed by node ID (a minimal sketch of this step follows the list).
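The following Python sketch shows one possible shape for that final augmentation step using the standard ast module. The fake_llm_insights helper and its metadata fields stand in for a real model call and are assumptions, not part of any existing analyzer:

# Minimal sketch: walk a Python AST, collect call nodes, and store hypothetical
# LLM-supplied metadata in an auxiliary mapping keyed by a synthetic node ID.
import ast

def fake_llm_insights(snippet: str) -> dict:
    # Placeholder for an LLM call; returns illustrative metadata only.
    if "subprocess" in snippet and "shell=True" in snippet:
        return {"risk_category": "command_injection", "taint_assumption": "user_supplied"}
    return {}

def augment_ast(source: str) -> dict:
    """Return {node_id: metadata} for every call expression in the source."""
    tree = ast.parse(source)
    annotations = {}
    for index, node in enumerate(ast.walk(tree)):
        if isinstance(node, ast.Call):
            snippet = ast.get_source_segment(source, node) or ""
            insights = fake_llm_insights(snippet)
            if insights:
                annotations[f"Call_{index}"] = {"snippet": snippet, "llm_insights": insights}
    return annotations

source = "import subprocess\n\ndef run_command(cmd):\n    subprocess.call(cmd, shell=True)\n"
print(augment_ast(source))

Run on the run_command example discussed next, this produces metadata along the lines of the JSON shown below.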
Consider a snippet in Python:
import subprocess

def run_command(cmd):
    subprocess.call(cmd, shell=True)
LLM-Augmented AST could include:
{
  "node_id": "Call_23",
  "function": "subprocess.call",
  "arguments": ["cmd"],
  "llm_insights": {
    "risk_category": "command_injection",
    "taint_assumption": "user_supplied"
  }
}
Now, if the static analyzer sees a flow from an untrusted source to cmd, it can quickly identify a potential command injection vulnerability. Without the LLM’s insight, the analyzer might produce a generic “shell=True call” warning or fail to detect the broader risk if it doesn’t treat cmd as user input.
Once the AST is enriched, subsequent steps, like building the CFG or performing data flow analysis, can reference these insights. For instance, a node labeled as a “taint source” can propagate that label forward through assignments, ensuring the analyzer tracks untrusted data comprehensively.
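As a toy illustration of that forward propagation, the sketch below walks simple assignment statements and marks any variable assigned from an already-tainted name. Real analyzers do this over the CFG with full data-flow analysis, and the variable names here are hypothetical:

# Minimal sketch: propagate a "tainted" label forward through simple assignments.
# This toy version only handles straight-line name-to-name assignments.
import ast

def propagate_taint(source: str, initial_taint: set) -> set:
    tainted = set(initial_taint)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            if node.value.id in tainted:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        tainted.add(target.id)   # taint flows through the assignment
    return tainted

source = (
    "user_input = read_request()\n"   # hypothetically tagged as a taint source by the LLM
    "query = user_input\n"
    "cmd = query\n"
)
print(sorted(propagate_taint(source, {"user_input"})))   # ['cmd', 'query', 'user_input']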
Although CFGs show how the program flows from one block to another (e.g., through if statements or loops), they do not inherently convey which branches are high risk, which ones handle sensitive data, or whether a loop is likely bounded. Such gaps can lead to false positives (for instance, spurious infinite-loop warnings) or to effort wasted on branches that are wrongly assumed to be relevant to security checks.
When the tool constructs a CFG, each edge typically corresponds to a branch or control transfer. The tool can gather the condition expression or code snippet associated with that edge and ask the LLM:
• “Does this condition indicate a security-critical check?”
• “Is this branch likely an error path, a success path, or an admin-only path?”
• “Are there domain-specific implications (finance, cryptography, user authentication) that the code or comments suggest?”
The LLM’s structured response can be integrated as an edge annotation or block-level metadata.
def process_transaction(user_input):
    if authenticate(user_input):
        complete_payment(user_input)
    else:
        log_error("Failed to authenticate")
Traditional CFG:
[Start]
→ [if authenticate(user_input)]
├─ True → [complete_payment(user_input)]
└─ False → [log_error("Failed to authenticate")]
[End]
LLM-Enriched CFG might add:
• True Edge: {"risk_level": "critical_financial_path", "likely_user_interaction": true}
• False Edge: {"purpose": "logging", "security_impact": "low"}
With these annotations, a security-focused analysis tool can prioritize scanning the “true” branch for potential injection or misconfiguration. Meanwhile, the “false” branch, though not necessarily irrelevant, is acknowledged as a fallback path that logs an error (reduced risk).
• Prioritization: Triage tools can surface branches annotated as high-risk at the top of their reports, reducing developer time wasted on trivial paths.
• Path-Specific Taint: If the LLM detects partial sanitization on one branch but not another, the CFG can reflect that difference, leading to more precise warnings.
• Accuracy and Overhead: Over-annotation can be a risk—if an LLM mislabels a path as “safe,” it might suppress needed warnings. Ensuring there is a fallback or a confidence mechanism is crucial; one possible shape for such a mechanism is sketched below.
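In the sketch that follows, each LLM-supplied edge label carries a confidence score, and only a high-confidence "low impact" label is allowed to deprioritize a path, while everything else keeps the analyzer's default, conservative treatment. The field names and thresholds are illustrative assumptions:

# Minimal sketch: confidence-gated use of LLM edge annotations. An edge is only
# deprioritized when the LLM labels it low-impact with high confidence;
# otherwise the analyzer keeps its normal, conservative behavior.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EdgeAnnotation:
    label: str            # e.g., "critical_financial_path" or "logging"
    security_impact: str  # "low", "medium", "high" (hypothetical scale)
    confidence: float     # 0.0 - 1.0, reported by the LLM or a calibration layer

def analysis_priority(annotation: Optional[EdgeAnnotation], threshold: float = 0.9) -> str:
    if annotation is None:
        return "default"          # no LLM input: fall back to the usual analysis
    if annotation.security_impact == "high":
        return "prioritized"      # surface these paths first in triage
    if annotation.security_impact == "low" and annotation.confidence >= threshold:
        return "deprioritized"    # scanned later, but never silently skipped
    return "default"

true_edge = EdgeAnnotation("critical_financial_path", "high", 0.85)
false_edge = EdgeAnnotation("logging", "low", 0.95)
print(analysis_priority(true_edge))   # prioritized
print(analysis_priority(false_edge))  # deprioritized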
A major benefit of the approach described is that each layer of representation can reinforce the next, creating a feedback loop:
1. Source Code Annotations (like SAL) guide the AST pass, clarifying pointer constraints or parameter roles.
2. AST Augmentation with LLM tags identifies tainted variables and partial sanitizers, influencing how CFG edges are labeled (“taint flows from X to Y under this condition”).
3. CFG Annotation highlights critical branches that must be explored in building or refining the Call Graph.
4. Call Graph refinements reveal cross-function data flows that, in turn, can feed back to the AST or SAL annotation pass—e.g., the LLM might realize that “Since function sanitizeUserData is always called before executeCommand, there is partial sanitization, but it might not handle semicolons.”
This multi-layer synergy enables a far more sophisticated static analysis pipeline, one that merges the symbolic rigor of classical compiler techniques with the semantic fluency of LLM-based reasoning.
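At the implementation level, this feedback loop can be run as repeated passes over a shared annotation store until nothing new is learned. The Python sketch below is a deliberately abstract outline of that orchestration; every pass is a placeholder, and the fixed-point check is an assumption about how such a pipeline could be structured rather than a description of an existing tool:

# Minimal sketch: iterate the LLM-augmented passes until the combined set of
# annotations reaches a fixed point (or an iteration cap). Each pass below is a
# placeholder that reads and extends a shared annotation store.
def sal_pass(store):
    store.setdefault("sal", set()).add("sanitizeAndProcess")

def ast_pass(store):
    store.setdefault("taint_sources", set()).add("user_input")

def cfg_pass(store):
    store.setdefault("risky_edges", set()).add("authenticate:true_branch")

def callgraph_pass(store):
    store.setdefault("indirect_calls", set()).add("dispatch->handler")

def run_pipeline(max_iterations=5):
    store = {}
    for _ in range(max_iterations):
        before = {key: set(values) for key, values in store.items()}
        for llm_pass in (sal_pass, ast_pass, cfg_pass, callgraph_pass):
            llm_pass(store)   # each layer may add annotations the next layer uses
        if store == before:   # fixed point: no layer learned anything new this round
            break
    return store

print(run_pipeline())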
The paper presented a holistic framework for integrating Large Language Models (LLMs) into the static analysis pipeline at multiple layers:
1. Source Code Annotations (SAL or similar): Providing explicit pointer contracts, size constraints, and usage expectations to improve tool accuracy.
2. AST Augmentation: Tagging nodes with taint statuses, domain roles, or partial sanitization knowledge.
3. CFG Enrichment: Labeling edges with risk levels, domain significance, and probable path conditions.
4. Call Graph Refinement: Identifying or confirming dynamic call targets to ensure no critical flows are overlooked.
By uniting these enhancements, static analysis can evolve from a purely syntactic or structural approach to a semantic approach that leverages contextual intelligence and domain knowledge.
• Reduced False Positives: Many spurious warnings arise from missing context. LLM-supplied insights can drastically reduce these.
• Fewer Missed Vulnerabilities: With advanced call resolution and better taint marking, real vulnerabilities are more likely to surface.
• Domain Awareness: LLMs can detect if a code path relates to cryptography, payments, or admin privileges, guiding specialized checks and deeper scrutiny.
• Scalable Annotation: Projects that never had robust SAL usage can bootstrap them quickly with LLM assistance.
In conclusion, this paper underscores that embedding LLMs into every stratum of static analysis—from the initial insertion of SAL annotations in the source code, to the sophisticated augmentation of ASTs, CFGs, and Call Graphs—can unlock a deeper level of semantic understanding. By combining the best of symbolic and data-driven analysis, software teams can yield more accurate, context-rich, and actionable warnings, ultimately enhancing the reliability and security of their codebases.