Data Augmentation

Blog

October 2021, Harbin Institute of Technology | NLP data augmentation methods? I have 15 of them

AutoClean

link: https://github.com/elisemercury/AutoClean
pypi: https://pypi.org/project/py-AutoClean/
author: Elise Landman
note: python package for automated data preprocessing & cleaning.

AutoLabel

link: https://github.com/refuel-ai/autolabel
discord: https://discord.com/invite/fweVnRx6CU
author: refuel-ai
note: label, clean and enrich text datasets with LLMs.

NLPCDA

link: https://github.com/425776024/nlpcda
pypi: https://pypi.org/project/nlpcda/
author: 425776024
note: a Chinese NLP data augmentation package.

EDA

link: https://github.com/jasonwei20/eda_nlp
author: Jason Wei
chinese eda: EDA_NLP_for_Chinese
pypi: edalize
note: easy data augmentation techniques for boosting performance on text classification tasks.
extra: AEDA (paper, code)
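
Two of the four EDA operations (random swap and random deletion) can be illustrated with the standard library alone. The following is a minimal sketch, not the code from the eda_nlp repository: the function names, the `rng` parameter, and the rule of keeping at least one word are assumptions made for this example. The other two operations (synonym replacement and random insertion) need a synonym source and are omitted here.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p.

    Assumption for this sketch: if every word is dropped,
    return one randomly chosen word so the output is never empty.
    """
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # seeded so runs are repeatable
tokens = "data augmentation helps small text classifiers".split()
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_deletion(tokens, p=0.2, rng=rng)
```

Both functions return new lists and leave the input untouched, so several augmented variants can be generated from the same token list.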

UDA

link: https://github.com/google-research/uda
author: Google Research
paper: Xie, Q., Dai, Z., Hovy, E., Luong, M. T., & Le, Q. V. (2019). Unsupervised data augmentation for consistency training.
note: unsupervised data augmentation code.

NLPAUG

link: https://github.com/makcedward/nlpaug
author: Edward Ma
tutorial: https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28
note: a python library for augmenting NLP data in machine learning projects, for English text.

TextAttack

link: https://github.com/QData/TextAttack
author: QData
note: a python framework for adversarial attacks, data augmentation, and model training in NLP.

LaserTagger

link: https://github.com/tongchangD/text_data_enhancement_with_LaserTagger
author: tongchangD
note: a method for paraphrasing Chinese text with a modified LaserTagger model.

python-pinyin

link: https://github.com/mozillazg/python-pinyin
author: Huang Huang
documentation: https://pypinyin.readthedocs.io/zh_CN/master/
pypi: https://pypi.org/project/pypinyin/

TextFlint

link: https://github.com/textflint/textflint
author: Fudan University NLP Group
note: a multilingual robustness evaluation platform for natural language processing, which unifies text transformation, sub-population, adversarial attack, and their combinations to provide a comprehensive robustness analysis.

deduplicate-text-datasets

link: https://github.com/google-research/deduplicate-text-datasets
author: Google Research
note: code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better"
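
As a rough illustration of what deduplication means at its simplest, the sketch below drops documents that are identical after lowercasing and whitespace normalization. This is a deliberately simplified stand-in and not the method in the repository, which works on repeated substrings rather than whole documents; the function names and the normalization rule are assumptions made for this example.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def dedup_exact(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["The cat sat.", "the  cat sat.", "A different sentence."]
unique_docs = dedup_exact(corpus)
```

Storing hashes instead of full strings keeps memory use bounded per document, which matters once the corpus no longer fits comfortably in memory.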

AugLy

link: https://github.com/facebookresearch/AugLy
author: facebookresearch
note: a data augmentation library for audio, image, text, and video.

Mutate

link: https://github.com/infinitylogesh/mutate
author: Logesh kumar
note: a library to synthesize text datasets using Large Language Models (LLM).

Text preprocessing for Natural Language Processing

link: https://github.com/berknology/text-preprocessing
author: Berknology
note: a python package for text preprocessing tasks in natural language processing.