Releases: ccprocessor/llm-webkit-mirror
v3.2.0-released
What's Changed
- release 3.1.2 by @dt-yy in #377
- Main by @dt-yy in #378
- Em html by @yogacc33 in #379
- change the version number to single quotes by @yogacc33 in #382
- feat: add pre_data_json unit test by @yogacc33 in #383
- fix: update html_layout_cosin.py & test_html_layout_cosin.py, add similarity func by @renpengli01 in #381
- docs: update readme.md by @yogacc33 in #384
- feat: code extract on googlesource.com by @NgZiming in #386
- docs: qwen-72b-instruct deploy doc by @drunkpig in #387
- layout batch parser by @dt-yy in #388
- feat: representative HTML page selection and HTML simplification by @LollipopsAndWine in #389
- add tag_mapping.py codes, output element dict of main html dom tree by @papayalove in #385
- fix tag_mapping.py codes, fix target_list output, make it more accurate by @papayalove in #391
- fix DOM propagation exception by @dt-yy in #394
- fix unit test cases by @dt-yy in #396
- Dev element dict improvement by @papayalove in #397
- fix: optimize simplification v1 by @LollipopsAndWine in #401
- feat: add ccstore pipeline by @e06084 in #398
- fix wiki web not complete by @dt-yy in #405
- feat: compress_and_decompress_str func standard_utils.py & test_standard_utils.py & fix: html_layout_cosin.py & test_html_layout_cosin.py .2f by @renpengli01 in #404
- feat: use llm select html main content node by @drunkpig in #406
- feat: select html content node by LLM by @drunkpig in #407
- modify the propagation fields by @dt-yy in #409
- add template html main tree extract success verification by tree structure similarity between template main html and original html. by @papayalove in #410
- add main html extract success verification by tree structure similarity between template main html and main html. by @papayalove in #411
- add raw tag html xpath info to element dict by @papayalove in #412
- feat: Sub/sup retains the original / tag format and does no… by @yogacc33 in #413
- feat: mv cc_store code to jupyter dir by @e06084 in #408
- fix same layer definition in layout_batch_parser.py by @papayalove in #416
- fix: img math display mode by @e06084 in #414
- feat: add jupyter package in lint workflow. by @yogacc33 in #419
- feat: add layout_index_webkit.ipynb & nbconvert==7.16.6,notebook==7.4.2,jupyter==1.1.1 & fix: pre commit achieved clear all output data of jupyter file by @renpengli01 in #420
- html-cls m4 by @darkrush in #422
- some change about timeout by @ddfinshes in #425
- fix some bugs in paragraph recognition by @ddfinshes in #427
- add dynamic id match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #428
- update: optimize cc_domain_index_gen and add en readme by @e06084 in #429
- docs: update domain cluster readme by @e06084 in #430
- feat: add cluster layout series jupyter & fix pre commit by @renpengli01 in #431
- add dynamic classid match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #432
- fix: adjust simplification based on model evaluation by @LollipopsAndWine in #434
- fix: use stream read in cc domain index generation by @e06084 in #436
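- feat: simplification keeps only image src and element class and id attributes by @LollipopsAndWine in #438
- feat: clean element attributes, keeping valid image src (excluding base64) and alt, plus class and id on all elements by @LollipopsAndWine in #440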
- fix: get_feature add is_ignore_tag & similarity by html_layout_cosin.py by @renpengli01 in #442
- feat: simplification option to control whether XPath is collected by @LollipopsAndWine in #443
- add dynamic classid match switch by @papayalove in #445
- feat: add jupyter files: cc dedup by hash html & add readme cc dedup by @renpengli01 in #447
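- fix bugs: 1: some inputs lose their namespace and cannot match the XSL template; 2: some formula paragraphs are split incorrectly; 3: for formula content such as \text{...}, \left and \right were wrongly added before the braces by @1041206149 in #435
- fix tag map get_feature None error by @papayalove in #449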
- fix: jupyter:combines a four-step clustering procedure into a single … by @renpengli01 in #448
- add parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #452
- fix: layout cluster dynamic properties & unit test by @renpengli01 in #453
- feat: use http url as markdown image path by @drunkpig in #454
- fix parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #459
- add interfaces for element recognition and magic-html extraction by @dt-yy in #457
- update readme by @dt-yy in #461
- fix bugs in the recognition part by @ddfinshes in #456
- feat: configure custom tags 'marked-tail' and 'marked-text' as inline tags by @LollipopsAndWine in #462
- fix pylint by @dt-yy in #463
- add Zhihu formula extraction by @1041206149 in #451
- update readme by @dt-yy in #465
- update language detection docs, political model docs, and sensitive-word code and docs by @darkrush in #426
- fix: rename custom tag names by @LollipopsAndWine in #468
- bench: fix MagicHTMLFIleFormatorExtractor by @e06084 in #471
- fix: layout cluster & unit test by @renpengli01 in #476
- optimize the MathJax renderer approach by @1041206149 in #470
- add some inline tag noise, make the extraction more robust and fixed id classid strip() bug by @papayalove in #477
- feat: cc_dedup_fir add exception handle by @e06084 in #478
- modify the logic of the MathJax renderer approach by @1041206149 in #480
- fix image loss problem in new tag and modified the dynamic_classid_similarity_threshold by @papayalove in #482
- feat: add code detect fasttext model by @yogacc33 in #483
- feat: add math detector model by @yogacc33 in #484
- Feature/math model by @yogacc33 in #485
- add dedup by @dt-yy in #489
- add extract main html by model response to README.md by @papayalove in #488
- fix: multilingual concatenation rules by @LollipopsAndWine in #490
- fix table structure being lost with only the caption kept by @dt-yy in #493
- feat: add get_cc_select_html by @e06084 in #496
- feat: add preprocessing to the noclip pipeline: remove interactive form elements by @LollipopsAndWine in #495
- fix table tag integrity by @papayalove in #497
- mathjax...
v3.1.2-released
What's Changed
- fix: garbled text; remove empty paragraphs inside lists by @drunkpig in #332
- fix: dollar char by @e06084 in #331
- fix: default mathjax render by @e06084 in #333
- fix normalize table by @dt-yy in #337
- fix: formula content normalize with normalize_ctl_text by @e06084 in #335
- fix: codes with newline are not inline-codes by @NgZiming in #334
- feat: add slurm job stop script by @drunkpig in #338
- extractor chain add track_back exception info by @yogacc33 in #339
- support parsing of non-standard tables by @dt-yy in #340
- update table cell count calculation by @dt-yy in #341
- add html encode in magic-html by @yogacc33 in #342
- fix: html parser fail in non-close tag by @e06084 in #344
- feat: support formula break by br tag by @e06084 in #343
- update the logic, tests, and docs of language detection by @2471023025 in #336
- update table complex except by @dt-yy in #346
- feat: add magic html extract method in datajson by @yogacc33 in #352
- feat: add title normalize&tail repeat by @dt-yy in #354
- fix: to_md lost image url by @drunkpig in #356
- fix: add svg unit test test_image.py & fix svg special-character exception by @renpengli01 in #353
- Add Rule-based safety model and xlmr porn model by @darkrush in #351
- fix: add ut by @e06084 in #358
- feat: detect code by classes by @NgZiming in #355
- release v3.1.1 by @dt-yy in #357
- Revert "release v3.1.1 " by @yogacc33 in #359
- fix test case & list lack by @dt-yy in #360
- feat: Delete nested images in title by @yogacc33 in #363
- feat: add domain special code extract rules by @NgZiming in #362
- add layout clustering algorithm and unit tests by @renpengli01 in #365
- fix table lack content by @dt-yy in #366
- sub/sup reg in para by @yogacc33 in #367
- fix release pipeline by @dt-yy in #368
- add sub/sup tag process in list by @yogacc33 in #369
- refactor: paragraph element parsing logic by @LollipopsAndWine in #370
- fix: svg html exception handling & unit tests & html formatting by @renpengli01 in #375
- refactor list module by @dt-yy in #376
Full Changelog: v3.1.0-released...v3.1.2-released
v3.1.1-released
What's Changed
- fix: garbled text; remove empty paragraphs inside lists by @drunkpig in #332
- fix: dollar char by @e06084 in #331
- fix: default mathjax render by @e06084 in #333
- fix normalize table by @dt-yy in #337
- fix: formula content normalize with normalize_ctl_text by @e06084 in #335
- fix: codes with newline are not inline-codes by @NgZiming in #334
- feat: add slurm job stop script by @drunkpig in #338
- extractor chain add track_back exception info by @yogacc33 in #339
- support parsing of non-standard tables by @dt-yy in #340
- update table cell count calculation by @dt-yy in #341
- add html encode in magic-html by @yogacc33 in #342
- fix: html parser fail in non-close tag by @e06084 in #344
- feat: support formula break by br tag by @e06084 in #343
- update the logic, tests, and docs of language detection by @2471023025 in #336
- update table complex except by @dt-yy in #346
- feat: add magic html extract method in datajson by @yogacc33 in #352
- feat: add title normalize&tail repeat by @dt-yy in #354
- fix: to_md lost image url by @drunkpig in #356
- fix: add svg unit test test_image.py & fix svg special-character exception by @renpengli01 in #353
- Add Rule-based safety model and xlmr porn model by @darkrush in #351
- fix: add ut by @e06084 in #358
- feat: detect code by classes by @NgZiming in #355
Full Changelog: v3.1.0-released...v3.1.1-released
v3.1.0-released
What's Changed
- Framework simplified v1.0 by @yogacc33 in #211
- add zh-en-article quality model by @ideaflow in #208
- bugfix: Make CleanExp inherit from WebKitBaseException by @darkrush in #223
- fix: html simplify noscript tag by @feifei2023 in #222
- [CI]: add python3.12&3.11 env by @e06084 in #220
- resolve nest table by @dt-yy in #225
- fix: mml to latex and math extract no full by @lsp213 in #219
- fix: DataJson construct method do not change outer variable by @drunkpig in #226
- fix: code test case by @NgZiming in #221
- [fix]: fix math nonetype by @e06084 in #229
- add document of quality model by @ideaflow in #230
- fix:math no text by @lsp213 in #233
- feat: add CleanModule to provide general clean function by @darkrush in #234
- Add a new interface in political model to accommodate the new requirement by @ideaflow in #210
- feat: add two case of st by @e06084 in #235
- [CI]: python3.11 and 3.12 run when requirements modify by @e06084 in #236
- Revise quality document,and change english stop words reading method by @ideaflow in #241
- feat: add CleanModule to provide general clean function by @darkrush in #246
- lang_id doc revise by @2471023025 in #245
- feat: add CleanModule to provide general clean function by @darkrush in #232
- fix: move html_simplify_classify.md to docs/llm_web_kit/model by @darkrush in #247
- Exception refactoring by @yogacc33 in #249
- refactor: remove old clean exception add model-related exceptions by @darkrush in #252
- feat: add CleanModule docs and parameter descriptions, support content quality cleaning by @darkrush in #250
- update language classification model docs by @2471023025 in #251
- init unsafe_words_detector.py by @Adela-Yu-Coder in #194
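- feat: add thread-safe file download with file hash verification and a locking mechanism by @darkrush in #254
- fix: file locking mechanism, ensure the lock file is deleted correctly on exceptions by @darkrush in #256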
- Dev lid218 by @2471023025 in #255
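- adjust parsing order & update standard by @dt-yy in #248
- docs: update HTML simplify-classify docs, add model download config example; fix: change quality model prediction method, limit thread count to 1 by @darkrush in #260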
- add test case by @dt-yy in #258
- Dev lid218 by @2471023025 in #261
- [fix]: remove class=d-none tag by @e06084 in #268
- update document of quality and political model by @ideaflow in #270
- x by @renpengli01 in #269
- add tail text for table cells by @dt-yy in #267
- refact: make resource_utils to use project defined exception and refact code for readability by @darkrush in #274
- [feat]: add CleanTagsPreExtractor by @e06084 in #278
- fix list, table, and other issues by @dt-yy in #279
- add list test case & fix list nest level by @dt-yy in #282
- fix entity markup issue in tables by @dt-yy in #285
- use SoftFilelock to ensure model resource processed correctly by @darkrush in #280
- feat: html_layout_classify/* model classification processing & html_layout_classify.md by @renpengli01 in #288
- refact import transformers and make cache dir alap by @darkrush in #290
- fix: json utils error with different python version by @drunkpig in #292
- Exception dynamically set dataset_name by @yogacc33 in #293
- mock CACHE_TMP_DIR in test cases by @darkrush in #296
- feat: fix line breaks not being preserved when getting text, add paragraph text test cases by @LollipopsAndWine in #297
- feat: use the first item in predict result as language_details by @2471023025 in #298
- fix: language details returned for empty content; under the 176 version, language_details returns "not_defined" by @darkrush in #300
- fix: simplify test cases by directly setting CACHE_TMP_DIR mock value by @darkrush in #302
- fix: empty extract formula by @e06084 in #304
- [fix]: fix some special malformed formulas by @e06084 in #299
- feat: page classify by @drunkpig in #308
- fix: code and text unit test by @drunkpig in #310
- feat: math extract support mjx-container tag by @e06084 in #311
- feat: add MM_NODE_LIST in to_nlp_md by @shijinpjlab in #305
- feat: simple user api to extract html to markdown by @drunkpig in #313
- feat: add title to DataJson by @drunkpig in #314
- fix: html parser support xml_declaration by @e06084 in #315
- [feat] support math extract from mathjax config by @e06084 in #303
- RuleBasedSafetyModule by @Adela-Yu-Coder in #257
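- feat: replace line breaks with double line breaks, configure global constants by @LollipopsAndWine in #320
- Revert "feat: replace line breaks with double line breaks, configure global constants (#320)" by @yogacc33 in #321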
- build: setup add config files by @yogacc33 in #319
- performance improvements & bug fixes by @dt-yy in #317
- bench: update main_html extractor by @e06084 in #318
- update version by @dt-yy in #323
- feat: replace br with double line breaks when getting paragraph text by @LollipopsAndWine in #325
- update branch by @dt-yy in #327
New Contributors
- @Adela-Yu-Coder made their first contribution in #194
- @shijinpjlab made their first contribution in #305
Full Changelog: llm-web-kit==3.0.1released...v3.1.0-released
llm-web-kit==3.0.1released
What's Changed
- update setup.py to publish the whl package by @dt-yy in #78
- fix: wrap_math rm brace handle by @e06084 in #79
- [CI] add automatic whl package publishing on release to the yml by @dt-yy in #81
- feat: add bench by @e06084 in #83
- fix: update CC_spec.md by @renpengli01 in #80
- feat: p + script by @imMid-Star in #82
- Revert "feat: p + script" by @e06084 in #85
- feat: add p tag and script_math_tex by @e06084 in #87
- modify tag_span_script and get_equation_type, consolidate test sets by @lsp213 in #91
- fix: extract code fail if table exists by @NgZiming in #89
- test: unit test of pipelineSuit by @drunkpig in #92
- fix: code language detect issue by @drunkpig in #95
- feat:wrap latex formula by @e06084 in #97
- feat: add system exception by @dt-yy in #93
- docs: update README.md by @LollipopsAndWine in #99
- feat: math add img tag by @e06084 in #100
- fix: check if a html node is cc-node by @drunkpig in #101
- fix: list parse dup issue by @drunkpig in #102
- add table unittest by @dt-yy in #103
- fix: [cccode] error when one code tag only by @NgZiming in #105
- fix: pipeline image.py & tests test_image.py by @renpengli01 in #106
- test: add pipelineSuit test case by @drunkpig in #108
- feat: bench add code data and support our extract method by @e06084 in #107
- feat: content list to txt format by @drunkpig in #110
- feat: tag_script by @lsp213 in #111
- feat: refine data rw package by @e06084 in #112
- fix: [code] line breaks in code by @NgZiming in #109
- fix: refine data pkg to dataio by @e06084 in #113
- feat: code classification model by @minrui3 in #115
- fix: title_content is none by @NgZiming in #114
- fix: add ccmath exception and refine by @e06084 in #118
- feat: add tag_asciimath fix: math_katex_latex_2 bug by @lsp213 in #119
- feat: escape special chars in markdown format by @drunkpig in #121
- feat: add code docs and pre block code by @NgZiming in #124
- feat: get main html form content list by @drunkpig in #125
- feat: remove more labels add unitest by @dt-yy in #127
- feat: eval add to_nlp_md and magic_html by @e06084 in #128
- fix: do not add content list node with empty text by @drunkpig in #129
- bench: add csdn test data by @e06084 in #131
- fix: inline code to md by @NgZiming in #132
- fix: tag_asciimath by @lsp213 in #135
- Revert "fix: tag_asciimath" by @drunkpig in #140
- feat: inline code and fix all tests by @NgZiming in #137
- feat: update release version by @dt-yy in #138
- fix: html table to markdown table issue by @drunkpig in #141
- bench: add 2 table test data by @e06084 in #142
- fix: some text lost by @drunkpig in #143
- feat:img tag modified by @imMid-Star in #136
- fix: wrap inline equation by @e06084 in #145
- fix: space by @imMid-Star in #146
- feat: Replace consecutive whitespace characters in the text with a single one by @drunkpig in #150
- fix: inline and interline math wrap rm () and [] by @e06084 in #149
- fix: fix html to md bug by @dt-yy in #148
- feat: revise the merge algorithm for scattered code tags by @NgZiming in #151
- fix: some text lost in list by @drunkpig in #153
- add unittest case by @dt-yy in #154
- feat: add interline ASCIIMath support by @e06084 in #155
- fix table img label by @dt-yy in #157
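- fix: when extracting paragraph text, text in a node's tail was concatenated before the child element's text by @NgZiming in #158
- fix: tail of non-li elements in lists being dropped, and list text order by @NgZiming in #159
- fix: remove tail in title / fix nbsp being deleted by code rules / md conversion failing when an image has no path by @NgZiming in #162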
- feat: detect code lineno and remove lineno by @NgZiming in #163
- fix: image.py & test_image.py by @renpengli01 in #167
- feat: add magic-html command tool by @yogacc33 in #172
- feat: remove common spaces prefix by @NgZiming in #173
- feature: add math prefix func by @imMid-Star in #177
- [bench]: add statics of content list calculate in bench by @e06084 in #180
- feat: Add HTML tag process and simplify for classification algorithm by @darkrush in #181
- add remove format table by @dt-yy in #178
- feat: Ignore bad styles in the code by @sixgad in #187
- fix: doc.oracle.com extractor by @drunkpig in #186
- feat: add get statics of contentlist in post extractor by @e06084 in #184
- Dev lid218 by @2471023025 in #191
- fix: del unused dependencies by @drunkpig in #195
- fix: node extract incorrect add: inline display by @lsp213 in #182
- [test]: add st test in CI by @e06084 in #199
- [bench]: add type_acc calculate by @e06084 in #203
- feat: add HTML classification into article, forum, and other by @feifei2023 in #196
- lang_id doc by @2471023025 in #198
- remove empty table by @dt-yy in #205
- test: add DataJson test by @drunkpig in #202
- [CI]: add python3.13 env by @e06084 in #206
- Merge pull request #212 from ccprocessor/dev by @e06084 in #213
- rebase main by @e06084 in #216
- sync dev by @e06084 in #217
New Contributors
- @imMid-Star made their first contribution in #82
- @LollipopsAndWine made their first contribution in #99
- @minrui3 made their first contribution in #115
- @darkrush made their first contribution in #181
- @2471023025 made their first contribution in #191
- @feifei2023 made their first contribution in #196
Full Changelog: llm-web-kit==3.0.0...llm-web-kit==3.0.1released
llm-web-kit==3.0.0
What's Changed
- [config]: split requirements with runtime and dev by @e06084 in #9
- fix: save extracted content and raw_html simultaneously by @drunkpig in #12
- docs: content_list specification by @drunkpig in #14
- docs: markdown spec by @drunkpig in #16
- docs: txt spec by @drunkpig in #17
- [feature]: add math extract by @e06084 in #18
- feat: add magic_html extractor by @sixgad in #19
- feat: Add code recog base by @NgZiming in #20
- [feature]: add math render detect method by @e06084 in #21
- feat: html split by tag by @drunkpig in #23
- docs: how to use html split by @drunkpig in #24
- docs: update content_list spec by @drunkpig in #25
- feat: html->content-list dict by @drunkpig in #26
- add table parse by @dt-yy in #28
- feat: implement to_content_list of HTMLExtractor by @drunkpig in #29
- docs: pdf pipeline input index file format spec by @drunkpig in #30
- [feat]: add to_content_list_node method and get_equation_type by @e06084 in #31
- fix: git pre-commit error by @drunkpig in #32
- [feature]: add contains_math by @e06084 in #33
- fix: extract node changed tree structure && feat: add new test case by @NgZiming in #34
- feat: get raw_html from user defined html tag by @drunkpig in #35
- [fix]: update html_split_by_tags usage in ccmath by @e06084 in #36
- feat: attach html parent node path when split html by tag by @drunkpig in #39
- feat: remove magic number by @NgZiming in #37
- feat: add math-container extract by @e06084 in #38
- fix: add function to test if a html segment contain a cc html tag by @drunkpig in #41
- fix: html split by tag by @drunkpig in #43
- feat: add ccmath-inline and ccmath-interline by @e06084 in #44
- feat: add math tag recognize by @e06084 in #46
- feat: title recognize by @drunkpig in #48
- fix: span math modify by @e06084 in #49
- feat: parse list by @drunkpig in #50
- refact: add tail text when build cc element by @drunkpig in #51
- fix: change parse order by @drunkpig in #52
- fix: use _build_html_tree in ccmath by @e06084 in #53
- feat: add porn detector and corresponding unit test code. by @ideaflow in #15
- remove dependency of model_config.jsonc in model assets by @ideaflow in #55
- feat: add pipeline--image.py & test image htmls & test_image.py & do… by @renpengli01 in #57
- fix table conflict by @dt-yy in #58
- docs[lxml]: document of lxml by @drunkpig in #61
- fix: ccmath use lxml.html to parse html by @e06084 in #65
- feat: add mathml test case by @e06084 in #67
- fix: build html element without parent by @drunkpig in #68
- feat: add html tag_span_script by @lsp213 in #69
- fix: pre-commit by @e06084 in #72
- fix: mml to latex by @e06084 in #74
- replace html-element by @dt-yy in #76
New Contributors
- @e06084 made their first contribution in #1
- @renpengli01 made their first contribution in #7
- @drunkpig made their first contribution in #11
- @sixgad made their first contribution in #19
- @NgZiming made their first contribution in #20
- @dt-yy made their first contribution in #28
Full Changelog: https://github.com/ccprocessor/llm-webkit-mirror/commits/llm-web-kit==3.0.0