Chunked parsing issue #189

Open
@skapix

The document is parsed in chunks with the following code:

#include <algorithm>
#include <string>

#include <myhtml/api.h>

// Feeds `body` to the parser in pieces of at most `chunk_sz` bytes.
// Returns the resulting tree, or nullptr if any chunk fails to parse.
myhtml_tree_t* Parse(myhtml_t* myhtml, const std::string& body,
                     size_t chunk_sz) {
  myhtml_tree_t* tree = myhtml_tree_create();
  myhtml_tree_init(tree, myhtml);
  size_t body_chunk_pos = 0;
  while (body_chunk_pos < body.size()) {
    // The last chunk may be shorter than chunk_sz.
    size_t current_chunk_sz = std::min(chunk_sz, body.size() - body_chunk_pos);
    mystatus_t parse_status = myhtml_parse_chunk_single(
        tree, body.c_str() + body_chunk_pos, current_chunk_sz);
    if (parse_status != MyHTML_STATUS_OK) {
      myhtml_tree_destroy(tree);
      return nullptr;
    }
    body_chunk_pos += current_chunk_sz;
  }
  return tree;
}

It is called with the following arguments:

myhtml_t* myhtml = myhtml_create();
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
std::string body = "<html><head><style>a</style></head><body>f</body></html>";
size_t chunk_sz = 13;
myhtml_tree_t* tree = Parse(myhtml, body, chunk_sz);

The result varies with the build options.
In some builds the serialized tree looks like this:

<html><head><style>a</style></head><body>f</body></html></style></head><body></body></html>

In other builds it looks like this:

<html><head><style></style></head></html>

The expected output is:

<html><head><style>a</style></head><body>f</body></html>

After some investigation, I found that the issue is in myhtml_tokenizer_state_rawtext_end_tag_name and involves token_node->raw_begin.
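
To see why a tokenizer state that handles end tags inside raw text might be involved, it can help to look at where the chunk boundaries fall. The sketch below is not part of MyHTML: `SplitIntoChunks` is a hypothetical helper that only mirrors the slicing done by the loop in `Parse` and returns the pieces that would be handed to the parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a MyHTML function): returns the pieces that the
// Parse() loop would pass to the parser, one string per chunk.
std::vector<std::string> SplitIntoChunks(const std::string& body,
                                         std::size_t chunk_sz) {
  assert(chunk_sz > 0);  // a zero chunk size would never advance
  std::vector<std::string> chunks;
  for (std::size_t pos = 0; pos < body.size(); pos += chunk_sz) {
    // The last chunk may be shorter than chunk_sz.
    std::size_t len = std::min(chunk_sz, body.size() - pos);
    chunks.push_back(body.substr(pos, len));
  }
  return chunks;
}
```

With the body and chunk_sz = 13 from the call above, this gives five chunks: `<html><head><`, `style>a</styl`, `e></head><bod`, `y>f</body></h` and `tml>`. The `</style>` end tag is split between the second and third chunks, which seems consistent with the problem showing up in the rawtext end tag name state.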
