- October 2021, Harbin Institute of Technology (HIT) | NLP data augmentation methods? I have 15
- link: https://github.com/elisemercury/AutoClean
- pypi: https://pypi.org/project/py-AutoClean/
- author: Elise Landman
- note: a Python package for automated data preprocessing and cleaning.
- link: https://github.com/refuel-ai/autolabel
- discord: https://discord.com/invite/fweVnRx6CU
- author: refuel-ai
- note: label, clean and enrich text datasets with LLMs.
- link: https://github.com/425776024/nlpcda
- pypi: https://pypi.org/project/nlpcda/
- author: 425776024
- note: a Chinese NLP data augmentation package.
- link: https://github.com/jasonwei20/eda_nlp
- author: Jason Wei
- Chinese EDA: EDA_NLP_for_Chinese
- pypi: edalize
- note: easy data augmentation techniques for boosting performance on text classification tasks.
- extra: AEDA(paper, code)
- link: https://github.com/google-research/uda
- author: Google Research
- paper: Xie, Q., Dai, Z., Hovy, E., Luong, M. T., & Le, Q. V. (2019). Unsupervised data augmentation for consistency training.
- note: unsupervised data augmentation code.
- link: https://github.com/makcedward/nlpaug
- author: Edward Ma
- tutorial: https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28
- note: a Python library for augmenting NLP data in machine learning projects, for English text.
- link: https://github.com/QData/TextAttack
- author: QData
- note: a Python framework for adversarial attacks, data augmentation, and model training in NLP.
- link: https://github.com/tongchangD/text_data_enhancement_with_LaserTagger
- author: tongchangD
- note: a method for paraphrasing Chinese text, based on a modified LaserTagger model.
- link: https://github.com/mozillazg/python-pinyin
- author: Huang Huang
- documentation: https://pypinyin.readthedocs.io/zh_CN/master/
- pypi: https://pypi.org/project/pypinyin/
- link: https://github.com/textflint/textflint
- author: Fudan University NLP Group
- note: a multilingual robustness evaluation platform for natural language processing, which unifies text transformation, sub-population, adversarial attack, and their combinations to provide a comprehensive robustness analysis.
- link: https://github.com/google-research/deduplicate-text-datasets
- author: Google Research
- note: code to deduplicate language model datasets, as described in the paper "Deduplicating Training Data Makes Language Models Better".
- link: https://github.com/facebookresearch/AugLy
- author: facebookresearch
- note: a data augmentations library for audio, image, text, and video.
- link: https://github.com/infinitylogesh/mutate
- author: Logesh Kumar
- note: a library to synthesize text datasets using Large Language Models (LLM).
- link: https://github.com/berknology/text-preprocessing
- author: Berknology
- note: a Python package for text preprocessing tasks in natural language processing.
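
For readers new to the "easy data augmentation" idea behind the eda_nlp entry above, here is a minimal sketch of two of its four operations, random swap and random deletion (the other two, synonym replacement and random insertion, need a thesaurus and are omitted). This is not code from the eda_nlp repository: the function names `random_swap` and `random_deletion`, their parameters, and the sample sentence are assumptions made for illustration.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` after swapping two random positions `n_swaps` times."""
    words = list(words)
    if len(words) < 2:
        return words  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct indices
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability `p`, keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    # If every word was dropped, fall back to one randomly chosen word.
    return kept if kept else [rng.choice(words)]


# Hypothetical input sentence, tokenized on whitespace.
rng = random.Random(0)  # seeded so runs are repeatable
tokens = "data augmentation can help small text classification datasets".split()
swapped = random_swap(tokens, n_swaps=2, rng=rng)
deleted = random_deletion(tokens, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

Both functions return new lists and leave the input untouched, so the same token list can be augmented several times to produce multiple variants per training example.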