What's Changed
- release 3.1.2 by @dt-yy in #377
- Main by @dt-yy in #378
- Em html by @yogacc33 in #379
- change the version number to single quotes by @yogacc33 in #382
- feat: add pre_data_json unit test by @yogacc33 in #383
- fix: update html_layout_cosin.py & test_html_layout_cosin.py, add similarity func by @renpengli01 in #381
- docs: update readme.md by @yogacc33 in #384
- feat: code extract on googlesource.com by @NgZiming in #386
- docs: qwen-72b-instruct deploy doc by @drunkpig in #387
- layout batch parser by @dt-yy in #388
- feat: representative HTML page selection and HTML simplification by @LollipopsAndWine in #389
- add tag_mapping.py codes, output element dict of main html dom tree by @papayalove in #385
- fix tag_mapping.py codes, fix target_list output, make it more accurate by @papayalove in #391
- fix exception in DOM propagation by @dt-yy in #394
- fix unit test case by @dt-yy in #396
- Dev element dict improvement by @papayalove in #397
- fix: optimize simplification v1 by @LollipopsAndWine in #401
- feat: add ccstore pipeline by @e06084 in #398
- fix wiki web not complete by @dt-yy in #405
- feat: compress_and_decompress_str func standard_utils.py & test_standard_utils.py & fix: html_layout_cosin.py & test_html_layout_cosin.py .2f by @renpengli01 in #404
- feat: use llm select html main content node by @drunkpig in #406
- feat: select html content node by LLM by @drunkpig in #407
- modify the propagation fields by @dt-yy in #409
- add template html main tree extract success verification by tree structure similarity between template main html and original html. by @papayalove in #410
- add main html extract success verification by tree structure similarity between template main html and main html. by @papayalove in #411
- add raw tag html xpath info to element dict by @papayalove in #412
- feat: Sub/sup retains the original / tag format and does no… by @yogacc33 in #413
- feat: mv cc_store code to jupyter dir by @e06084 in #408
- fix same layer definition in layout_batch_parser.py by @papayalove in #416
- fix: img math display mode by @e06084 in #414
- feat: add jupyter package in lint workflow. by @yogacc33 in #419
- feat: add layout_index_webkit.ipynb & nbconvert==7.16.6,notebook==7.4.2,jupyter==1.1.1 & fix: pre commit achieved clear all output data of jupyter file by @renpengli01 in #420
- html-cls m4 by @darkrush in #422
- some change about timeout by @ddfinshes in #425
- fix bugs in paragraph recognition by @ddfinshes in #427
- add dynamic id match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #428
- update: optimize cc_domain_index_gen and add en readme by @e06084 in #429
- docs: update domain cluster readme by @e06084 in #430
- feat: add cluster layout series jupyter & fix pre commit by @renpengli01 in #431
- add dynamic classid match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #432
- fix: adjust simplification based on model evaluation by @LollipopsAndWine in #434
- fix: use stream read in cc domain index generation by @e06084 in #436
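- feat: attribute simplification keeps only image src and element class and id by @LollipopsAndWine in #438
- feat: clean element attributes, keeping valid image src (excluding base64) and alt, plus class and id of all elements by @LollipopsAndWine in #440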
- fix: get_feature add is_ignore_tag & similarity by html_layout_cosin.py by @renpengli01 in #442
- feat: simplification option to control whether XPath is retrieved by @LollipopsAndWine in #443
- add dynamic classid match switch by @papayalove in #445
- feat: add jupyter files: cc dedup by hash html & add readme cc dedup by @renpengli01 in #447
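- fix bugs: 1) some inputs lost their namespace and could not match the XSL template; 2) some formula paragraphs were split incorrectly; 3) for formula content such as \text{...}, \left and \right were wrongly added before the braces by @1041206149 in #435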
- fix tag map get_feature None error by @papayalove in #449
- fix: jupyter:combines a four-step clustering procedure into a single … by @renpengli01 in #448
- add parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #452
- fix: layout cluster dynamic properties & unit test by @renpengli01 in #453
- feat: use http url as markdown image path by @drunkpig in #454
- fix parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #459
- add interfaces for element recognition and magic-html extraction by @dt-yy in #457
- update readme by @dt-yy in #461
- fix bugs in recognition by @ddfinshes in #456
- feat: configure custom tags 'marked-tail' and 'marked-text' as inline tags by @LollipopsAndWine in #462
- fix pylint by @dt-yy in #463
- add Zhihu formula extraction by @1041206149 in #451
- update readme by @dt-yy in #465
- update language detection docs, political-content model docs, and sensitive-word code and docs by @darkrush in #426
- fix: rename custom tags by @LollipopsAndWine in #468
- bench: fix MagicHTMLFIleFormatorExtractor by @e06084 in #471
- fix: layout cluster & unit test by @renpengli01 in #476
- optimize the MathJax renderer approach by @1041206149 in #470
- add some inline tag noise, make the extraction more robust and fixed id classid strip() bug by @papayalove in #477
- feat: cc_dedup_fir add exception handle by @e06084 in #478
- modify the MathJax renderer logic by @1041206149 in #480
- fix image loss problem in new tag and modified the dynamic_classid_similarity_threshold by @papayalove in #482
- feat: add code detect fasttext model by @yogacc33 in #483
- feat: add math detector model by @yogacc33 in #484
- Feature/math model by @yogacc33 in #485
- add dedup by @dt-yy in #489
- add extract main html by model response to README.md by @papayalove in #488
- fix: multilingual concatenation rules by @LollipopsAndWine in #490
- fix table structure being lost with only the caption kept by @dt-yy in #493
- feat: add get_cc_select_html by @e06084 in #496
- feat: add preprocessing to the noclip pipeline: remove interactive form elements by @LollipopsAndWine in #495
- fix table tag integrity by @papayalove in #497
- add ASCII support to MathJax by @1041206149 in #494
- feat: add preprocessing configuration to the noclip config file by @LollipopsAndWine in #499
- fix: standard_utils.py update json_loads by @renpengli01 in #498
- fix response 0 by @papayalove in #502
- fix formulas in title by @1041206149 in #503
- modify typical main html similarity threshold to 0.92 by @papayalove in #504
- fix remove script tail by @papayalove in #505
- fix: incorrect line breaks and missing content in the title, list, table, and text pipelines by @LollipopsAndWine in #506
- Release v3.2.0 by @e06084 in #507
New Contributors
- @papayalove made their first contribution in #385
- @ddfinshes made their first contribution in #425
- @1041206149 made their first contribution in #435
Full Changelog: v3.1.2-released...v3.2.0-released