Releases · ccprocessor/llm-webkit-mirror

01 Aug 03:23

e06084

v3.2.0-released

b1bc533

v3.2.0-released Latest


What's Changed

release 3.1.2 by @dt-yy in #377
Main by @dt-yy in #378
Em html by @yogacc33 in #379
change the version number to single quotes by @yogacc33 in #382
feat: add pre_data_json unit test by @yogacc33 in #383
fix: update html_layout_cosin.py & test_html_layout_cosin.py, add similarity func by @renpengli01 in #381
docs: update readme.md by @yogacc33 in #384
feat: code extract on googlesource.com by @NgZiming in #386
docs: qwen-72b-instruct deploy doc by @drunkpig in #387
layout batch parser by @dt-yy in #388
feat: representative HTML page selection and HTML simplification by @LollipopsAndWine in #389
add tag_mapping.py codes, output element dict of main html dom tree by @papayalove in #385
fix tag_mapping.py codes, fix target_list output, make it more accurate by @papayalove in #391
fix DOM propagation exception by @dt-yy in #394
fix unit test case by @dt-yy in #396
Dev element dict improvement by @papayalove in #397
fix: optimize simplification v1 by @LollipopsAndWine in #401
feat: add ccstore pipeline by @e06084 in #398
fix wiki web not complete by @dt-yy in #405
feat: compress_and_decompress_str func standard_utils.py & test_standard_utils.py & fix: html_layout_cosin.py & test_html_layout_cosin.py .2f by @renpengli01 in #404
feat: use llm select html main content node by @drunkpig in #406
feat: select html content node by LLM by @drunkpig in #407
modify the propagation fields by @dt-yy in #409
add template html main tree extract success verification by tree structure similarity between template main html and original html. by @papayalove in #410
add main html extract success verification by tree structure similarity between template main html and main html. by @papayalove in #411
add raw tag html xpath info to element dict by @papayalove in #412
feat: Sub/sup retains the original _{/^{tag format and does no… by @yogacc33 in #413
feat: mv cc_store code to jupyter dir by @e06084 in #408
fix same layer definition in layout_batch_parser.py by @papayalove in #416
fix: img math display mode by @e06084 in #414
feat: add jupyter package in lint workflow. by @yogacc33 in #419
feat: add layout_index_webkit.ipynb & nbconvert==7.16.6, notebook==7.4.2, jupyter==1.1.1 & fix: pre-commit clears all output data of jupyter files by @renpengli01 in #420
html-cls m4 by @darkrush in #422
some change about timeout by @ddfinshes in #425
fix some bugs in paragraph recognition by @ddfinshes in #427
add dynamic id match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #428
update: optimize cc_domain_index_gen and add en readme by @e06084 in #429
docs: update domain cluster readme by @e06084 in #430
feat: add cluster layout series jupyter & fix pre commit by @renpengli01 in #431
add dynamic classid match in layout_batch_parser.py, enabled by switch variable DYNAMIC_ID_ENABLE, and add TYPICAL_DICT_HTML output by tag_map by @papayalove in #432
fix: adjust simplification based on model evaluation by @LollipopsAndWine in #434
fix: use stream read in cc domain index generation by @e06084 in #436
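feat: attribute simplification keeps only image src and element class and id by @LollipopsAndWine in #438
feat: clean element attributes, keeping valid image src (excluding base64) and alt, plus class and id of all elements by @LollipopsAndWine in #440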
fix: get_feature add is_ignore_tag & similarity by html_layout_cosin.py by @renpengli01 in #442
feat: simplification option to control whether XPath is retrieved by @LollipopsAndWine in #443
add dynamic classid match switch by @papayalove in #445
feat: add jupyter files: cc dedup by hash html & add readme cc dedup by @renpengli01 in #447
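fix bugs: 1) some inputs lose their namespace and cannot match the XSL template; 2) some formula paragraphs are split incorrectly; 3) for formula content such as \text{...}, \left and \right were wrongly added before the braces by @1041206149 in #435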
fix tag map get_feature None error by @papayalove in #449
fix: jupyter:combines a four-step clustering procedure into a single … by @renpengli01 in #448
add parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #452
fix: layout cluster dynamic properties & unit test by @renpengli01 in #453
feat: use http url as markdown image path by @drunkpig in #454
fix parse_single in MapItemToHtmlTagsParser for single html extraction by @papayalove in #459
add interfaces for element recognition and magic-html extraction by @dt-yy in #457
update readme by @dt-yy in #461
fix bugs in the recognition part by @ddfinshes in #456
feat: configure custom tags 'marked-tail' and 'marked-text' as inline tags by @LollipopsAndWine in #462
fix pylint by @dt-yy in #463
add Zhihu formula extraction by @1041206149 in #451
update readme by @dt-yy in #465
update language detection docs, political model docs, and sensitive-word code and docs by @darkrush in #426
fix: rename custom tags by @LollipopsAndWine in #468
bench: fix MagicHTMLFIleFormatorExtractor by @e06084 in #471
fix: layout cluster & unit test by @renpengli01 in #476
optimize the MathJax renderer approach by @1041206149 in #470
add some inline tag noise, make the extraction more robust and fixed id classid strip() bug by @papayalove in #477
feat: cc_dedup_fir add exception handle by @e06084 in #478
modify the MathJax renderer logic by @1041206149 in #480
fix image loss problem in new tag and modified the dynamic_classid_similarity_threshold by @papayalove in #482
feat: add code detect fasttext model by @yogacc33 in #483
feat: add math detector model by @yogacc33 in #484
Feature/math model by @yogacc33 in #485
add dedup by @dt-yy in #489
add extract main html by model response to README.md by @papayalove in #488
fix: multilingual concatenation rules by @LollipopsAndWine in #490
fix table structure being lost with only the caption kept by @dt-yy in #493
feat: add get_cc_select_html by @e06084 in #496
feat: add preprocessing to the noclip pipeline: remove interactive form elements by @LollipopsAndWine in #495
fix table tag integrity by @papayalove in #497
mathjax...

Contributors

darkrush, e06084, and 9 other contributors

Assets 3

15 Apr 15:07

dt-yy

v3.1.2-released

4ccae39

v3.1.2-released

What's Changed

fix: garbled text issue; remove empty paragraphs inside lists by @drunkpig in #332
fix: dollar char by @e06084 in #331
fix: default mathjax render by @e06084 in #333
fix normalize table by @dt-yy in #337
fix: formula content normalize with normalize_ctl_text by @e06084 in #335
fix: codes with newline are not inline-codes by @NgZiming in #334
feat: add slurm job stop script by @drunkpig in #338
extractor chain add track_back exception info by @yogacc33 in #339
support parsing malformed tables by @dt-yy in #340
update table cell count calculation by @dt-yy in #341
add html encode in magic-html by @yogacc33 in #342
fix: html parser fail in non-close tag by @e06084 in #344
feat: support formula break by br tag by @e06084 in #343
update language detection logic, tests, and docs by @2471023025 in #336
update table complex except by @dt-yy in #346
feat: add magic html extract method in datajson by @yogacc33 in #352
feat: add title normalize&tail repeat by @dt-yy in #354
fix: to_md lost image url by @drunkpig in #356
fix: add svg unit test test_image.py & fix svg special-symbol exception by @renpengli01 in #353
Add Rule-based safety model and xlmr porn model by @darkrush in #351
fix: add ut by @e06084 in #358
feat: detect code by classes by @NgZiming in #355
release v3.1.1 by @dt-yy in #357
Revert "release v3.1.1 " by @yogacc33 in #359
fix test case & list lack by @dt-yy in #360
feat: Delete nested images in title by @yogacc33 in #363
feat: add domain special code extract rules by @NgZiming in #362
add layout clustering algorithm and unit tests by @renpengli01 in #365
fix table lack content by @dt-yy in #366
sub/sup reg in para by @yogacc33 in #367
fix release pipeline by @dt-yy in #368
add sub/sup tag process in list by @yogacc33 in #369
refactor: paragraph element parsing logic by @LollipopsAndWine in #370
fix: svg html exception handling & unit tests & html formatting by @renpengli01 in #375
refactor list module by @dt-yy in #376

Full Changelog: v3.1.0-released...v3.1.2-released

Contributors

darkrush, e06084, and 7 other contributors

Assets 4

31 Mar 12:49

dt-yy

v3.1.1-released

2450ea9

v3.1.1-released

What's Changed

fix: garbled text issue; remove empty paragraphs inside lists by @drunkpig in #332
fix: dollar char by @e06084 in #331
fix: default mathjax render by @e06084 in #333
fix normalize table by @dt-yy in #337
fix: formula content normalize with normalize_ctl_text by @e06084 in #335
fix: codes with newline are not inline-codes by @NgZiming in #334
feat: add slurm job stop script by @drunkpig in #338
extractor chain add track_back exception info by @yogacc33 in #339
support parsing malformed tables by @dt-yy in #340
update table cell count calculation by @dt-yy in #341
add html encode in magic-html by @yogacc33 in #342
fix: html parser fail in non-close tag by @e06084 in #344
feat: support formula break by br tag by @e06084 in #343
update language detection logic, tests, and docs by @2471023025 in #336
update table complex except by @dt-yy in #346
feat: add magic html extract method in datajson by @yogacc33 in #352
feat: add title normalize&tail repeat by @dt-yy in #354
fix: to_md lost image url by @drunkpig in #356
fix: add svg unit test test_image.py & fix svg special-symbol exception by @renpengli01 in #353
Add Rule-based safety model and xlmr porn model by @darkrush in #351
fix: add ut by @e06084 in #358
feat: detect code by classes by @NgZiming in #355

Full Changelog: v3.1.0-released...v3.1.1-released

Contributors

darkrush, e06084, and 6 other contributors

Assets 3

20 Mar 14:56

dt-yy

v3.1.0-released

c44a490

v3.1.0-released

What's Changed

Framework simplified v1.0 by @yogacc33 in #211
add zh-en-article quality model by @ideaflow in #208
bugfix: Make CleanExp inherit from WebKitBaseException by @darkrush in #223
fix: html simplify noscript tag by @feifei2023 in #222
[CI]: add python3.12&3.11 env by @e06084 in #220
resolve nest table by @dt-yy in #225
fix: mml to latex and math extract no full by @lsp213 in #219
fix: DataJson construct method do not change outer variable by @drunkpig in #226
fix: code test case by @NgZiming in #221
[fix]: fix math nonetype by @e06084 in #229
add document of quality model by @ideaflow in #230
fix:math no text by @lsp213 in #233
feat: add CleanModule to provide general clean function by @darkrush in #234
Add a new interface in political model to accommodate the new requirement by @ideaflow in #210
feat: add two case of st by @e06084 in #235
[CI]: python3.11 and 3.12 run when requirements modify by @e06084 in #236
Revise quality document, and change english stop words reading method by @ideaflow in #241
feat: add CleanModule to provide general clean function by @darkrush in #246
lang_id doc revise by @2471023025 in #245
feat: add CleanModule to provide general clean function by @darkrush in #232
fix: move html_simplify_classify.md to docs/llm_web_kit/model by @darkrush in #247
Exception refactoring by @yogacc33 in #249
refactor: remove old clean exception add model-related exceptions by @darkrush in #252
feat: add CleanModule docs and parameter descriptions, support content quality cleaning by @darkrush in #250
update language classification model docs by @2471023025 in #251
init unsafe_words_detector.py by @Adela-Yu-Coder in #194
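feat: add thread-safe file download with file hash verification and a locking mechanism by @darkrush in #254
fix: file locking mechanism, ensuring the lock file is removed correctly on exceptions by @darkrush in #256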
Dev lid218 by @2471023025 in #255
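adjust parsing order & update standard by @dt-yy in #248
docs: update HTML simplification classification docs, add model download config example; fix: change quality model prediction method, limit thread count to 1 by @darkrush in #260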
add test case by @dt-yy in #258
Dev lid218 by @2471023025 in #261
[fix]: remove class=d-none tag by @e06084 in #268
update document of quality and political model by @ideaflow in #270
x by @renpengli01 in #269
add tail text for table cells by @dt-yy in #267
refact: make resource_utils to use project defined exception and refact code for readability by @darkrush in #274
[feat]: add CleanTagsPreExtractor by @e06084 in #278
resolve list, table, and other issues by @dt-yy in #279
add list test case & fix list nest level by @dt-yy in #282
fix entity markup issue in tables by @dt-yy in #285
use SoftFilelock to ensure model resource processed correctly by @darkrush in #280
feat: html_layout_classify/* model classification handling & html_layout_classify.md by @renpengli01 in #288
refact import transformers and make cache dir alap by @darkrush in #290
fix: json utils error with different python version by @drunkpig in #292
Exception dynamically set dataset_name by @yogacc33 in #293
mock CACHE_TMP_DIR in test cases by @darkrush in #296
feat: fix line breaks not being preserved when getting text, add paragraph text test cases by @LollipopsAndWine in #297
feat: use the first item in predict result as langurage_details by @2471023025 in #298
fix: language details returned for empty content; under the 176 version, language_details returns "not_defined" by @darkrush in #300
fix: simplify test cases by directly setting CACHE_TMP_DIR mock value by @darkrush in #302
fix: empty extract formula by @e06084 in #304
[fix]: fix some special malformed formulas by @e06084 in #299
feat: page classify by @drunkpig in #308
fix: code and text unit test by @drunkpig in #310
feat: math extract support mjx-container tag by @e06084 in #311
feat: add MM_NODE_LIST in to_nlp_md by @shijinpjlab in #305
feat: simple user api to extract html to markdown by @drunkpig in #313
feat: add title to DataJson by @drunkpig in #314
fix: html parser support xml_declaration by @e06084 in #315
[feat] support math extract from mathjax config by @e06084 in #303
RuleBasedSafetyModule by @Adela-Yu-Coder in #257
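feat: replace line breaks with double line breaks, configure global constants by @LollipopsAndWine in #320
Revert "feat: replace line breaks with double line breaks, configure global constants (#320)" by @yogacc33 in #321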
build: setup add config files by @yogacc33 in #319
performance improvements & bug fixes by @dt-yy in #317
bench: update main_html extractor by @e06084 in #318
update version by @dt-yy in #323
feat: replace br with double line breaks when getting paragraph text by @LollipopsAndWine in #325
update branch by @dt-yy in #327

New Contributors

@Adela-Yu-Coder made their first contribution in #194
@shijinpjlab made their first contribution in #305

Full Changelog: llm-web-kit==3.0.1released...v3.1.0-released

Contributors

darkrush, ideaflow, and 12 other contributors

Assets 3

21 Feb 14:36

e06084

llm-web-kit==3.0.1released

852cf7f

llm-web-kit==3.0.1released

What's Changed

update setup.py to publish whl package by @dt-yy in #78
fix: wrap_math rm brace handle by @e06084 in #79
[CI] add automatic whl package publishing on release to the yml by @dt-yy in #81
feat: add bench by @e06084 in #83
fix: update CC_spec.md by @renpengli01 in #80
feat: p + script by @imMid-Star in #82
Revert "feat: p + script" by @e06084 in #85
feat: add p tag and script_math_tex by @e06084 in #87
modify tag_span_script and get_equation_type, consolidate test sets by @lsp213 in #91
fix: extract code fail if table exists by @NgZiming in #89
test: unit test of pipelineSuit by @drunkpig in #92
fix: code language detect issue by @drunkpig in #95
feat: wrap latex formula by @e06084 in #97
feat: add system exception by @dt-yy in #93
docs: update README.md by @LollipopsAndWine in #99
feat: math add img tag by @e06084 in #100
fix: check if a html node is cc-node by @drunkpig in #101
fix: list parse dup issue by @drunkpig in #102
add table unittest by @dt-yy in #103
fix: [cccode] error when one code tag only by @NgZiming in #105
fix: pipeline image.py & tests test_image.py by @renpengli01 in #106
test: add pipelineSuit test case by @drunkpig in #108
feat: bench add code data and support our extract method by @e06084 in #107
feat: content list to txt format by @drunkpig in #110
feat: tag_script by @lsp213 in #111
feat: refine data rw package by @e06084 in #112
fix: [code] line breaks in code by @NgZiming in #109
fix: refine data pkg to dataio by @e06084 in #113
feat: code classification model by @minrui3 in #115
fix: title_content is none by @NgZiming in #114
fix: add ccmath exception and refine by @e06084 in #118
feat: add tag_asciimath fix: math_katex_latex_2 bug by @lsp213 in #119
feat: escape special chars in markdown format by @drunkpig in #121
feat: add code docs and pre block code by @NgZiming in #124
feat: get main html from content list by @drunkpig in #125
feat: remove more labels add unitest by @dt-yy in #127
feat: eval add to_nlp_md and magic_html by @e06084 in #128
fix: do not add content list node with empty text by @drunkpig in #129
bench: add csdn test data by @e06084 in #131
fix: inline code to md by @NgZiming in #132
fix: tag_asciimath by @lsp213 in #135
Revert "fix: tag_asciimath" by @drunkpig in #140
feat: inline code and fix all tests by @NgZiming in #137
feat: update release version by @dt-yy in #138
fix: html table to markdown table issue by @drunkpig in #141
bench: add 2 table test data by @e06084 in #142
fix: some text lost by @drunkpig in #143
feat:img tag modified by @imMid-Star in #136
fix: wrap inline equation by @e06084 in #145
fix: space by @imMid-Star in #146
feat: Replace consecutive whitespace characters in the text with a single one by @drunkpig in #150
fix: inline and interline math wrap rm () and [] by @e06084 in #149
fix: fix html to md bug by @dt-yy in #148
feat: revise merge algorithm for scattered code tags by @NgZiming in #151
fix: some text lost in list by @drunkpig in #153
add unittest case by @dt-yy in #154
feat: add interline ASCIIMath support by @e06084 in #155
fix table img label by @dt-yy in #157
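fix: when extracting paragraph text, node tail text was concatenated before the child element's text by @NgZiming in #158
fix: tail of non-li elements in lists being dropped, and list text order by @NgZiming in #159
fix: remove tail in title / fix nbsp being deleted by code rules / md conversion failing when an image has no path by @NgZiming in #162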
feat: detect code lineno and remove lineno by @NgZiming in #163
fix: image.py & test_image.py by @renpengli01 in #167
feat: add magic-html command tool by @yogacc33 in #172
feat: remove common spaces prefix by @NgZiming in #173
feature: add math prefix func by @imMid-Star in #177
[bench]: add statics of content list calculate in bench by @e06084 in #180
feat: Add HTML tag process and simplify for classification algorithm by @darkrush in #181
add remove format table by @dt-yy in #178
feat: Ignore bad styles in the code by @sixgad in #187
fix: doc.oracle.com extractor by @drunkpig in #186
feat: add get statics of contentlist in post extractor by @e06084 in #184
Dev lid218 by @2471023025 in #191
fix: del unused dependencies by @drunkpig in #195
fix: node extract incorrect add: inline display by @lsp213 in #182
[test]: add st test in CI by @e06084 in #199
[bench]: add type_acc calculate by @e06084 in #203
feat: add HTML classification into article, forum, and other by @feifei2023 in #196
lang_id doc by @2471023025 in #198
remove empty table by @dt-yy in #205
test: add DataJson test by @drunkpig in #202
[CI]: add python3.13 env by @e06084 in #206
Merge pull request #212 from ccprocessor/dev by @e06084 in #213
rebase main by @e06084 in #216
sync dev by @e06084 in #217

New Contributors

@imMid-Star made their first contribution in #82
@LollipopsAndWine made their first contribution in #99
@minrui3 made their first contribution in #115
@darkrush made their first contribution in #181
@2471023025 made their first contribution in #191
@feifei2023 made their first contribution in #196

Full Changelog: llm-web-kit==3.0.0...llm-web-kit==3.0.1released

Contributors

darkrush, sixgad, and 12 other contributors

Assets 2

13 Jan 07:58

dt-yy

llm-web-kit==3.0.0

9c2e619

llm-web-kit==3.0.0

What's Changed

[config]: split requirements with runtime and dev by @e06084 in #9
fix: save extracted content and raw_html simultaneously by @drunkpig in #12
docs: content_list specification by @drunkpig in #14
docs: markdown spec by @drunkpig in #16
docs: txt spec by @drunkpig in #17
[feature]: add math extract by @e06084 in #18
feat: add magic_html extractor by @sixgad in #19
feat: Add code recog base by @NgZiming in #20
[feature]: add math render detect method by @e06084 in #21
feat: html split by tag by @drunkpig in #23
docs: how to use html split by @drunkpig in #24
docs: update content_list spec by @drunkpig in #25
feat: html->content-list dict by @drunkpig in #26
add table parse by @dt-yy in #28
feat: implement to_content_list of HTMLExtractor by @drunkpig in #29
docs: pdf pipeline input index file format spec by @drunkpig in #30
[feat]: add to_content_list_node method and get_equation_type by @e06084 in #31
fix: git pre-commit error by @drunkpig in #32
[feature]: add contains_math by @e06084 in #33
fix: extract node changed tree structure && feat: add new test case by @NgZiming in #34
feat: get raw_html from user defined html tag by @drunkpig in #35
[fix]: update html_split_by_tags usage in ccmath by @e06084 in #36
feat: attach html parent node path when split html by tag by @drunkpig in #39
feat: remove magic number by @NgZiming in #37
feat: add math-container extract by @e06084 in #38
fix: add function to test if a html segment contain a cc html tag by @drunkpig in #41
fix: html split by tag by @drunkpig in #43
feat: add ccmath-inline and ccmath-interline by @e06084 in #44
feat: add math tag recognize by @e06084 in #46
feat: title recognize by @drunkpig in #48
fix: span math modify by @e06084 in #49
feat: parse list by @drunkpig in #50
refact: add tail text when build cc element by @drunkpig in #51
fix: change parse order by @drunkpig in #52
fix: use _build_html_tree in ccmath by @e06084 in #53
feat: add porn detector and corresponding unit test code. by @ideaflow in #15
remove dependency of model_config.jsonc in model assets by @ideaflow in #55
feat: add pipeline--image.py & test image htmls & test_image.py & do… by @renpengli01 in #57
fix table conflict by @dt-yy in #58
docs[lxml]: document of lxml by @drunkpig in #61
fix: ccmath use lxml.html to parse html by @e06084 in #65
feat: add mathml test case by @e06084 in #67
fix: build html element without parent by @drunkpig in #68
feat: add html tag_span_script by @lsp213 in #69
fix: pre-commit by @e06084 in #72
fix: mml to latex by @e06084 in #74
replace html-element by @dt-yy in #76

New Contributors

@e06084 made their first contribution in #1
@renpengli01 made their first contribution in #7
@drunkpig made their first contribution in #11
@sixgad made their first contribution in #19
@NgZiming made their first contribution in #20
@dt-yy made their first contribution in #28

Full Changelog: https://github.com/ccprocessor/llm-webkit-mirror/commits/llm-web-kit==3.0.0

Contributors

sixgad, ideaflow, and 6 other contributors

Assets 3
