Description
We need to implement a highly performant, memory-efficient lexer as the first phase of our compiler pipeline. The lexer must process each token within a strict time budget (100-200 ns) and a strict memory budget (no more than 1KB of RAM per token). It must stream the source through a generator rather than loading the entire file into memory, be safe to use from asynchronous code, support tokenization from arbitrary file positions, and operate on wide characters so that all Unicode input, including emojis, is handled correctly.
Requirements
- Performance:
- Process each token within 100-200 ns.
- Memory Efficiency:
- Do not load the entire file into memory; operate using a generator.
- Ensure no more than 1KB of RAM usage per token.
- Asynchronous Safety:
- Must be safe for use in asynchronous contexts.
- File Handling:
- Ability to start tokenizing from arbitrary positions in the file.
- Token Types & Wide Character Support:
- Support all required token types (keywords, identifiers, literals, operators, delimiters, comments, etc.).
- Work with wide characters (Unicode) to correctly handle all characters, including emojis.
- Visitor Function:
- Implement a visitor function to convert a token into its string representation.
- Extensibility:
- Design the lexer to be easily extendable for future token types or modifications.
- Error Handling:
- Provide complete and descriptive error handling, including precise line and column information.
- Documentation & Testing:
- Code must be well-documented.
- Include comprehensive unit tests covering standard cases, edge cases, performance constraints, error conditions, and Unicode handling.
Implementation Steps
- Define Token Structures:
- Create a token data structure that includes type, value, position (line and column), and any necessary metadata (a token sketch follows this list).
- Generator-Based Tokenization:
- Implement the lexer to use a generator pattern, reading the source file incrementally without loading the entire file into memory.
- Ensure that tokenization can start from arbitrary positions in the file (see the generator sketch after this list).
- Performance Optimization:
- Optimize token processing so each token is handled within 100-200 ns.
- Monitor and limit memory usage to at most 1KB per token (a rough measurement harness is sketched after this list).
- Asynchronous Support:
- Ensure the lexer is async safe, making it suitable for asynchronous operations (an async wrapper is sketched after this list).
- Wide Character Support:
- Modify tokenization routines to handle wide characters instead of just single-byte characters.
- Ensure proper handling of Unicode characters and emojis throughout the lexing process (see the code-point illustration after this list).
- Visitor Function:
- Develop a visitor function that converts tokens to their string representation for debugging or logging purposes (a visitor sketch follows this list).
- Error Handling:
- Implement robust error reporting with clear messages and accurate source position data (an error type is sketched after this list).
- Testing:
- Write comprehensive unit tests that cover:
- Standard tokenization of valid input.
- Edge cases and error conditions.
- Performance benchmarks.
- Memory usage limits.
- Correct handling of Unicode and emojis (an example test follows this list).
- Documentation:
- Document the codebase thoroughly, explaining the design decisions, token structures, generator-based approach, and Unicode handling.
- Include usage examples and integration guidelines.
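The steps above can be sketched concretely. First, a minimal token structure, assuming Python for illustration since the requirements speak of generators and async (a systems language would likely be needed to actually hit the nanosecond budget in production); the token categories are placeholders, and `slots=True` needs Python 3.10+ but keeps per-token memory small:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    """Illustrative token categories; extend as the language grows."""
    KEYWORD = auto()
    IDENTIFIER = auto()
    LITERAL = auto()
    OPERATOR = auto()
    DELIMITER = auto()
    COMMENT = auto()


@dataclass(frozen=True, slots=True)
class Token:
    """One lexed token with its 1-based source position."""
    type: TokenType
    value: str
    line: int
    column: int
```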
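Next, a generator-based tokenizer over that structure, reading one chunk at a time and able to resume from an offset. This is a sketch under strong assumptions: `start_offset` must come from a previous `tell()` (text-mode `seek()` only accepts such opaque offsets), line and column are counted from the resume point, the single-character token rule is a placeholder, and a real lexer must also carry tokens that span chunk boundaries:

```python
from typing import Iterator


def tokenize(path: str, start_offset: int = 0,
             chunk_size: int = 64 * 1024) -> Iterator[Token]:
    """Yield tokens lazily, holding at most one chunk in memory."""
    line, column = 1, 1
    with open(path, "r", encoding="utf-8") as f:
        f.seek(start_offset)  # assumed to be a value from a prior f.tell()
        while chunk := f.read(chunk_size):
            for ch in chunk:  # ch is a full code point, emojis included
                if ch == "\n":
                    line += 1
                    column = 1
                    continue
                if not ch.isspace():
                    # Placeholder rule: every non-space character is one token.
                    yield Token(TokenType.OPERATOR, ch, line, column)
                column += 1
```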
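Reading in text mode is also what satisfies the wide-character step: the lexer sees code points, not bytes. One caveat worth documenting is that some emoji are grapheme clusters of several code points:

```python
text = "x=🙂"
print(list(text))          # ['x', '=', '🙂'] — code points, what the lexer sees
print(len(text.encode()))  # 6 — the emoji alone occupies 4 UTF-8 bytes
flag = "🇺🇳"
print(len(flag))           # 2 — one on-screen glyph, two code points;
                           # grapheme-aware tokens would need extra handling
```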
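For the performance and memory steps, a rough harness like the following can track the budgets. It is not a rigorous benchmark (the loop adds its own overhead, and `sys.getsizeof` is shallow), and 100-200 ns per token is unlikely in pure Python, so treat this as documenting the measurement, not guaranteeing the result:

```python
import sys
import time


def measure(path: str, limit: int = 100_000) -> None:
    """Rough check of the ns/token and bytes/token budgets."""
    sizes = []
    count = 0
    start = time.perf_counter_ns()
    for token in tokenize(path):
        count += 1
        # Shallow size: the Token object plus its value string.
        sizes.append(sys.getsizeof(token) + sys.getsizeof(token.value))
        if count >= limit:
            break
    elapsed = time.perf_counter_ns() - start
    if count == 0:
        print("no tokens")
        return
    print(f"~{elapsed / count:.0f} ns/token (incl. harness overhead), "
          f"largest token ~{max(sizes)} bytes")
```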
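The async-safety step can be satisfied by wrapping the synchronous generator so its blocking reads run off the event loop; a sketch assuming `asyncio.to_thread` (Python 3.9+):

```python
import asyncio
from typing import AsyncIterator


async def tokenize_async(path: str,
                         start_offset: int = 0) -> AsyncIterator[Token]:
    """Async wrapper: each potentially blocking next() runs in a worker thread."""
    it = tokenize(path, start_offset)
    sentinel = object()
    while True:
        token = await asyncio.to_thread(next, it, sentinel)
        if token is sentinel:
            return
        yield token
```

Consumers then write `async for token in tokenize_async(path): ...` without stalling the event loop on file I/O.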
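A visitor function with a small registry keeps the string conversion open for extension; the registry mechanism here is an illustrative choice, not a mandated design:

```python
from typing import Callable

_VISITORS: dict[TokenType, Callable[[Token], str]] = {}


def register_visitor(token_type: TokenType):
    """Decorator: override the string form for one token type."""
    def wrap(fn: Callable[[Token], str]) -> Callable[[Token], str]:
        _VISITORS[token_type] = fn
        return fn
    return wrap


def visit_token(token: Token) -> str:
    """Render a token for debugging/logging, e.g. OPERATOR "+" @ 3:14."""
    fn = _VISITORS.get(token.type)
    if fn is not None:
        return fn(token)
    return f'{token.type.name} "{token.value}" @ {token.line}:{token.column}'
```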
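Error reporting can ride on a dedicated exception carrying the coordinates the tokenizer already tracks:

```python
class LexError(Exception):
    """Lexing failure with a precise source position."""

    def __init__(self, message: str, line: int, column: int) -> None:
        super().__init__(f"{message} at line {line}, column {column}")
        self.line = line
        self.column = column
```

Inside `tokenize`, an unexpected character would then raise `LexError(f"unexpected character {ch!r}", line, column)`.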
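Finally, one of the Unicode tests from the list above, written against the placeholder rules of the generator sketch (pytest's `tmp_path` fixture supplies the temporary file):

```python
def test_emoji_is_one_token(tmp_path):
    src = tmp_path / "input.txt"
    src.write_text("a 🙂 b", encoding="utf-8")
    tokens = list(tokenize(str(src)))
    emoji = next(t for t in tokens if t.value == "🙂")
    # The emoji is a single token, at column 3 of line 1.
    assert (emoji.line, emoji.column) == (1, 3)
```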
Acceptance Criteria
- The lexer tokenizes source code accurately while meeting performance (100-200 ns per token) and memory (max 1KB per token) requirements.
- It operates asynchronously and supports tokenization from any position in the file.
- It supports wide characters for proper Unicode and emoji handling.
- A visitor function is implemented to convert tokens to strings.
- Comprehensive unit tests validate functionality, performance, error handling, and Unicode support.
- The code is well-documented and designed for easy extension.