Implement Async, Generator-Based Lexer with Performance, Memory, and Unicode Support #46

Open
@Ze7111

Description

We need to implement a highly performant and memory-efficient lexer as the first phase of our compiler pipeline. The lexer must process tokens within a strict time frame (100-200 ns per token) and adhere to stringent memory constraints (no more than 1KB of RAM usage per token). It must operate asynchronously using a generator to avoid loading the entire file into memory, support tokenization from arbitrary file positions, and work with wide characters to properly handle Unicode, including emojis.
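
The issue does not pin down an implementation language, so as a working assumption the sketches in this issue use C++23 with `std::generator`; every name below (`Token`, `tokenize`, `start_offset`) is a placeholder rather than a committed API. The intended entry point would look roughly like this:

```cpp
#include <cstddef>
#include <generator>   // C++23 std::generator; requires a recent toolchain
#include <istream>

struct Token;  // token record, sketched under "Implementation Steps" below

// Lazily yields tokens from `src`, starting at byte offset `start_offset`,
// so the file is streamed in chunks instead of being loaded whole.
std::generator<Token> tokenize(std::istream& src, std::size_t start_offset = 0);
```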

Requirements

  • Performance:
    • Process each token within 100-200 ns.
  • Memory Efficiency:
    • Do not load the entire file into memory; operate using a generator.
    • Ensure no more than 1KB of RAM usage per token.
  • Asynchronous Safety:
    • Must be safe for use in asynchronous contexts.
  • File Handling:
    • Ability to start tokenizing from arbitrary positions in the file.
  • Token Types & Wide Character Support:
    • Support all required token types (keywords, identifiers, literals, operators, delimiters, comments, etc.); see the token-kind sketch after this list.
    • Work with wide characters (Unicode) to correctly handle all characters, including emojis.
  • Visitor Function:
    • Implement a visitor function to convert a token into its string representation.
  • Extensibility:
    • Design the lexer to be easily extendable for future token types or modifications.
  • Error Handling:
    • Provide complete and descriptive error handling, including precise line and column information.
  • Documentation & Testing:
    • Code must be well-documented.
    • Include comprehensive unit tests covering standard cases, edge cases, performance constraints, error conditions, and Unicode handling.
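
To make the token-type requirement concrete, one possible token-kind enumeration is sketched below (still under the C++ assumption; the exact set of kinds is dictated by the grammar and is expected to grow):

```cpp
#include <cstdint>

// Illustrative partitioning of token kinds; meant to be extended as the
// grammar evolves (cf. the Extensibility requirement).
enum class TokenKind : std::uint8_t {
    Keyword,        // e.g. `if`, `while`, `return`
    Identifier,     // user-defined names
    IntLiteral,     // 42, 0xFF
    FloatLiteral,   // 3.14
    StringLiteral,  // "hello 🌍"
    Operator,       // + - * / == ...
    Delimiter,      // ( ) { } , ;
    Comment,        // // line or /* block */ comments
    EndOfFile,      // synthetic end-of-input marker
    Invalid         // anything the lexer could not classify
};
```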

Implementation Steps

  1. Define Token Structures:
    • Create a token data structure that includes type, value, position (line and column), and any necessary metadata; see the Token sketch after this list.
  2. Generator-Based Tokenization:
    • Implement the lexer using a generator pattern that reads the source file incrementally, without loading the entire file into memory; see the streaming-generator sketch after this list.
    • Ensure that tokenization can start from arbitrary positions in the file.
  3. Performance Optimization:
    • Optimize token processing so that each token is handled within 100-200 ns.
    • Monitor and limit memory usage to 1KB per token.
  4. Asynchronous Support:
    • Ensure the lexer is async-safe so that it can be driven from asynchronous contexts.
  5. Wide Character Support:
    • Modify tokenization routines to handle wide characters instead of just single-byte characters; see the UTF-8 decoding sketch after this list.
    • Ensure proper handling of Unicode characters and emojis throughout the lexing process.
  6. Visitor Function:
    • Develop a visitor function that converts tokens to their string representation for debugging or logging purposes; see the stringifier sketch after this list.
  7. Error Handling:
    • Implement robust error reporting with clear messages and accurate source position data; see the LexError sketch after this list.
  8. Testing:
    • Write comprehensive unit tests (see the test sketch after this list) that cover:
      • Standard tokenization of valid input.
      • Edge cases and error conditions.
      • Performance benchmarks.
      • Memory usage limits.
      • Correct handling of Unicode and emojis.
  9. Documentation:
    • Document the codebase thoroughly, explaining the design decisions, token structures, generator-based approach, and Unicode handling.
    • Include usage examples and integration guidelines.
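
The sketches below follow the numbered steps above and keep the same C++/`std::generator` assumption; all identifiers are placeholders. For step 1, a minimal token record could look like this. The lexeme is stored as an owned `std::string` for safety; a tighter design could store an offset/length view into the chunk buffer instead, and either way a typical token stays far below the 1KB cap:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

enum class TokenKind : std::uint8_t;  // defined in the token-kind sketch above

// Minimal token record: a handful of words plus the lexeme text, which for
// typical tokens fits in std::string's small-string buffer.
struct Token {
    TokenKind   kind;    // category of the token
    std::string value;   // raw lexeme text
    std::size_t line;    // 1-based line of the first code point
    std::size_t column;  // 1-based column, counted in code points (not bytes)
    std::size_t offset;  // absolute byte offset of the lexeme in the file
};
```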
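
For steps 2 and 4, a streaming-generator sketch follows. The coroutine keeps all of its state in its own frame and touches no globals, so independent instances can be driven from separate tasks or threads; note, however, that `std::generator` itself is synchronous, so truly awaitable iteration would need an async-generator type, which is left as an open design question here. The chunk buffer lives once in the coroutine frame and is amortized across all tokens, so the per-token cost is essentially the Token record itself. The token classification is a deliberate placeholder (split on whitespace) just to show the chunked-read mechanics:

```cpp
#include <cstddef>
#include <generator>   // C++23
#include <istream>
#include <string>
// Token and TokenKind as defined in the sketches above.

std::generator<Token> tokenize(std::istream& src, std::size_t start_offset = 0) {
    constexpr std::size_t kChunk = 4096;                      // read granularity
    src.seekg(static_cast<std::streamoff>(start_offset));     // arbitrary start point

    std::string lexeme;
    std::size_t line = 1, column = 1, offset = start_offset;
    char buf[kChunk];

    while (src.read(buf, kChunk) || src.gcount() > 0) {        // chunk by chunk
        const std::size_t n = static_cast<std::size_t>(src.gcount());
        for (std::size_t i = 0; i < n; ++i, ++offset) {
            const char c = buf[i];
            if (c == ' ' || c == '\t' || c == '\n') {          // placeholder boundary rule
                if (!lexeme.empty()) {
                    co_yield Token{TokenKind::Identifier, lexeme, line,
                                   column - lexeme.size(), offset - lexeme.size()};
                    lexeme.clear();
                }
                if (c == '\n') { ++line; column = 1; } else { ++column; }
            } else {
                lexeme += c;  // accumulate lexeme bytes
                ++column;     // NOTE: byte-based here; the Unicode sketch refines this
            }
        }
    }
    if (!lexeme.empty())                                        // trailing token at EOF
        co_yield Token{TokenKind::Identifier, lexeme, line,
                       column - lexeme.size(), offset - lexeme.size()};
}
```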
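
For step 5, assuming the source files are UTF-8 encoded (the issue says "wide characters" without fixing an encoding), the lexer should decode code points and count columns per code point so that an emoji advances the column by one, not by four bytes. A minimal decoder sketch follows; a production version would also reject overlong encodings and surrogate ranges, and the chunked reader must carry an incomplete trailing byte sequence over to the next chunk before decoding. Counting by grapheme cluster, so that multi-code-point emoji count as one column, is stricter and could be layered on later:

```cpp
#include <cstddef>
#include <string_view>

// Decodes one UTF-8 code point starting at in[i] and advances i past it.
// Returns U+FFFD for malformed input instead of throwing.
char32_t decode_utf8(std::string_view in, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    if (b0 < 0x80) { ++i; return b0; }                          // 1-byte ASCII
    int extra;                                                  // continuation byte count
    char32_t cp;
    if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1Fu; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0Fu; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07u; }
    else                          { ++i; return 0xFFFD; }       // invalid lead byte
    if (i + static_cast<std::size_t>(extra) >= in.size()) {     // truncated sequence
        i = in.size();
        return 0xFFFD;
    }
    for (int k = 1; k <= extra; ++k) {
        const unsigned char b =
            static_cast<unsigned char>(in[i + static_cast<std::size_t>(k)]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }          // bad continuation byte
        cp = (cp << 6) | (b & 0x3Fu);
    }
    i += 1 + static_cast<std::size_t>(extra);
    return cp;                                                   // e.g. 🌍 -> U+1F30D
}
```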
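
For step 6, the visitor/stringifier could be as simple as the following (again placeholder names, building on the Token and TokenKind sketches above):

```cpp
#include <string>
// Token and TokenKind as defined in the sketches above.

// Debug/logging representation: "<kind> 'lexeme' @line:column".
std::string to_string(const Token& tok) {
    const auto kind_name = [](TokenKind k) -> const char* {
        switch (k) {
            case TokenKind::Keyword:       return "Keyword";
            case TokenKind::Identifier:    return "Identifier";
            case TokenKind::IntLiteral:    return "IntLiteral";
            case TokenKind::FloatLiteral:  return "FloatLiteral";
            case TokenKind::StringLiteral: return "StringLiteral";
            case TokenKind::Operator:      return "Operator";
            case TokenKind::Delimiter:     return "Delimiter";
            case TokenKind::Comment:       return "Comment";
            case TokenKind::EndOfFile:     return "EndOfFile";
            case TokenKind::Invalid:       return "Invalid";
        }
        return "Unknown";   // unreachable if the enum stays in sync
    };
    return std::string(kind_name(tok.kind)) + " '" + tok.value + "' @" +
           std::to_string(tok.line) + ":" + std::to_string(tok.column);
}
```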
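
For step 7, one option is an error type that always carries the exact source coordinates. Whether the lexer throws, or instead yields `Invalid` tokens and accumulates diagnostics, is an open design choice; the sketch shows the throwing variant:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Lexing error with mandatory line/column information, so every diagnostic
// can point at the exact offending position.
class LexError : public std::runtime_error {
public:
    LexError(const std::string& msg, std::size_t line, std::size_t column)
        : std::runtime_error("lex error at " + std::to_string(line) + ":" +
                             std::to_string(column) + ": " + msg),
          line_(line), column_(column) {}

    std::size_t line() const noexcept { return line_; }
    std::size_t column() const noexcept { return column_; }

private:
    std::size_t line_;
    std::size_t column_;
};
```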
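
Finally, for step 8, the tests could start out as plain asserts around the sketches above. A real suite would live in the project's test framework, and the throughput number should be tracked by a benchmark harness rather than hard-asserted, since the 100-200 ns target depends on the machine:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
// tokenize() and Token as defined in the sketches above.

int main() {
    // Standard case with an emoji: the placeholder tokenizer splits on
    // whitespace, so "let x 🌍" should produce exactly three tokens.
    std::istringstream src("let x 🌍");
    std::size_t count = 0;
    for (const Token& tok : tokenize(src)) {
        assert(!tok.value.empty());
        assert(tok.line == 1);
        ++count;
    }
    assert(count == 3);

    // Crude throughput measurement over a large synthetic input.
    std::string big;
    for (int i = 0; i < 100000; ++i) big += "ident ";
    std::istringstream big_src(big);
    const auto t0 = std::chrono::steady_clock::now();
    std::size_t n = 0;
    for (const Token& tok : tokenize(big_src)) { (void)tok; ++n; }
    const auto t1 = std::chrono::steady_clock::now();
    const double ns_per_token =
        std::chrono::duration<double, std::nano>(t1 - t0).count() /
        static_cast<double>(n);
    assert(n == 100000);
    assert(ns_per_token > 0.0);   // report this value; do not gate CI on it here
    return 0;
}
```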

Acceptance Criteria

  • The lexer tokenizes source code accurately while meeting performance (100-200 ns per token) and memory (max 1KB per token) requirements.
  • It operates asynchronously and supports tokenization from any position in the file.
  • Supports wide characters for proper Unicode and emoji handling.
  • A visitor function is implemented to convert tokens to strings.
  • Comprehensive unit tests validate functionality, performance, error handling, and Unicode support.
  • The code is well-documented and designed for easy extension.

Metadata

Labels

  • Compiler: Issues in the compiler core or front-end parsing.
  • High Priority: Important issues requiring attention soon.
  • In Progress: Currently being worked on.
  • Parsing: Issues related to syntax parsing and tokenization.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
