
Data augmentation

franztao 2022-11-24 23:37:19

Evaluate data augmentation on the training data split to increase the number of high-quality training samples.


It is often desirable to increase the size and diversity of the training data through data augmentation, which involves generating synthetic but realistic examples from existing samples.

  1. Split the dataset. Split the dataset first, because many augmentation techniques will cause some form of data leakage if the generated samples are allowed to land in different data splits.

    For example, some augmentations involve generating synonyms for certain key tokens in a sentence. If sentences generated from the same source sentence are allowed to end up in different splits, samples with nearly identical embedding representations could leak across splits.

  2. Augment the training split only. Apply data augmentation only on the training set, because the validation and test splits should provide accurate estimates on actual data points.

  3. Inspect and validate. Augmenting just to increase the training sample size is useless if the augmented samples are not inputs the model could encounter in production.
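The split-first rule above can be sketched in a few lines. This is a minimal illustration, not any library's API: `split_then_augment` and the toy `augment_fn` are hypothetical names.

```python
import random

def split_then_augment(samples, augment_fn, train_frac=0.7, seed=42):
    """Split FIRST, then augment only the training split, so no augmented
    variant of a training sample can leak into the holdout split."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n_train = int(len(samples) * train_frac)
    train, holdout = samples[:n_train], samples[n_train:]
    # Augmented copies are appended to the training split only
    train = train + [augment_fn(s) for s in train]
    return train, holdout

train, holdout = split_then_augment(
    ["doc a", "doc b", "doc c", "doc d"],
    augment_fn=lambda s: s + " (augmented)",
)
```

Doing the split before calling `augment_fn` guarantees that every augmented variant stays on the same side of the split as its source sample.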

The exact method of data augmentation depends largely on the type of data and the application. Here are a few ways different data modalities can be augmented:

Types of data augmentation

Data augmentation using Snorkel

  • General: normalization, smoothing, random noise, synthetic oversampling (SMOTE), etc.
  • Natural language processing (NLP): substitutions (synonyms, tf-idf, embeddings, masked models), random noise, spelling errors, etc.
  • Computer vision (CV): crop, flip, rotate, pad, saturate, increase brightness, etc.
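For the general/tabular case, the simplest of these techniques (random noise) can be sketched in a few lines. `jitter` is a hypothetical helper, not a library function; SMOTE-style interpolation would instead use a library such as imbalanced-learn.

```python
import random

def jitter(row, sigma=0.01, seed=None):
    """Generate a synthetic sample by adding small Gaussian noise
    to each numeric feature of an existing row."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in row]

original = [5.1, 3.5, 1.4, 0.2]
augmented = jitter(original, sigma=0.01, seed=0)
```

The noise scale `sigma` has to be tuned per feature: too small and the synthetic rows add nothing, too large and they no longer resemble inputs the model would see in production.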


While the transformations for some data modalities (e.g. images) are easy to inspect and validate, other modalities may introduce silent errors. For example, changing the order of tokens in text can significantly alter the meaning ("this is really cool" → "is this really cool"). Therefore, it's important to measure the noise that an augmentation strategy will introduce and to have fine-grained control over the transformations that take place.


Depending on the feature types and the task, there are many data augmentation libraries for augmenting the training data.

Natural language processing (NLP)

  • NLPAug: data augmentation for NLP.
  • TextAttack: a framework for adversarial attacks, data augmentation, and model training in NLP.
  • TextAugment: text augmentation library.


Computer vision (CV)

  • Imgaug: image augmentation for machine learning experiments.
  • Albumentations: fast image augmentation library.
  • Augmentor: image augmentation library in Python for machine learning.
  • Kornia.augmentation: a module to perform data augmentation on the GPU.
  • SOLT: a data augmentation library for deep learning that supports images, segmentation masks, labels, and keypoints.


Other

  • Snorkel: system for generating training data with weak supervision.
  • DeltaPy: tabular data augmentation and feature engineering.
  • Audiomentations: a Python library for audio data augmentation.
  • Tsaug: a Python package for time series augmentation.


Let's use the nlpaug library to augment our dataset and assess the quality of the generated samples.

pip install nlpaug==1.1.0 transformers==3.0.2 -q
pip install snorkel==0.9.8 -q

import nlpaug.augmenter.word as naw
# Load tokenizers and transformers
substitution = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute")
insertion = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="insert")
text = "Conditional image generation using Variational Autoencoders and GANs."

# Substitutions
substitution.augment(text)

Substitution doesn't seem like a good idea for us, because certain keywords provide strong signals for our tags, so we don't want to alter them. Also note that these augmentations are not deterministic and will differ every time they are run. Let's try insertion...

# Insertions
insertion.augment(text)


A little better, but still fragile: insertion may now introduce keywords that cause false-positive tags to appear. Perhaps instead of substituting or inserting new tokens, let's try simply swapping machine-learning-related keywords with their aliases. We'll use Snorkel's transformation functions to easily achieve this.

# Replace dashes from tags & aliases
def replace_dash(x):
    return x.replace("-", " ")

# Aliases
aliases_by_tag = {
    "computer-vision": ["cv", "vision"],
    "mlops": ["production"],
    "natural-language-processing": ["nlp", "nlproc"],
}

# Flatten dict
flattened_aliases = {}
for tag, aliases in aliases_by_tag.items():
    tag = replace_dash(x=tag)
    if len(aliases):
        flattened_aliases[tag] = aliases
    for alias in aliases:
        _aliases = aliases + [tag]
        _aliases.remove(alias)
        flattened_aliases[alias] = _aliases

print (flattened_aliases["natural language processing"])
print (flattened_aliases["nlp"])

['nlp', 'nlproc']
['nlproc', 'natural language processing']

We'll use these tags and aliases as-is here, but we could use the inflect package to account for plural forms of tags, or apply stemming before replacing aliases, etc.
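As a sketch of what that plurality handling could look like without inflect (naive "+s" variants only; `expand_aliases` is a hypothetical helper, and irregular plurals would need the actual package):

```python
def expand_aliases(aliases):
    """Naively add plural variants so that e.g. "gans" also maps back
    to the "gan" alias. A package like inflect would handle irregular
    plurals (e.g. "analysis" -> "analyses") properly."""
    expanded = set(aliases)
    for alias in aliases:
        if not alias.endswith("s"):
            expanded.add(alias + "s")
    return sorted(expanded)

print(expand_aliases(["gan", "transformer"]))
# ['gan', 'gans', 'transformer', 'transformers']
```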

# We want to match with the whole word only
print ("gan" in "This is a gan.")
print ("gan" in "This is gandalf.")

import re

# \b matches word boundaries
def find_word(word, text):
    word = word.replace("+", "\\+")
    pattern = re.compile(fr"\b({word})\b", flags=re.IGNORECASE)
    return pattern.search(text)

# Correct behavior (single instance)
print (find_word("gan", "This is a gan."))
print (find_word("gan", "This is gandalf."))

<re.Match object; span=(10, 13), match='gan'>
None

Now let's use Snorkel's transformation_function to systematically apply this transformation to our data.

import random

from snorkel.augmentation import transformation_function

@transformation_function()
def swap_aliases(x):
    """Swap ML keywords with their aliases."""
    # Find all matches
    matches = []
    for i, tag in enumerate(flattened_aliases):
        match = find_word(tag, x.text)
        if match:
            matches.append(match)
    # Swap a random match with a random alias
    if len(matches):
        match = random.choice(matches)
        tag = x.text[match.start():match.end()]
        x.text = f"{x.text[:match.start()]}{random.choice(flattened_aliases[tag])}{x.text[match.end():]}"
    return x

# Swap
for i in range(3):
    sample_df = pd.DataFrame([{"text": "a survey of reinforcement learning for nlp tasks."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print (swap_aliases(sample_df.iloc[0]).text)

# Undesired behavior (needs contextual insight)
for i in range(3):
    sample_df = pd.DataFrame([{"text": "Autogenerate your CV to apply for jobs using NLP."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print (swap_aliases(sample_df.iloc[0]).text)

autogenerate vision apply jobs using nlp
autogenerate cv apply jobs using natural language processing
autogenerate cv apply jobs using nlproc


Now we'll define an augmentation policy to apply our transformation functions with certain rules (how many samples to generate, whether to keep the original data point, etc.).

from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier

# Transformation function (TF) policy
policy = ApplyOnePolicy(n_per_original=5, keep_original=True)
tf_applier = PandasTFApplier([swap_aliases], policy)
train_df_augmented = tf_applier.apply(train_df)
train_df_augmented.drop_duplicates(subset=["text"], inplace=True)

len(train_df), len(train_df_augmented)

(668, 913)

We'll skip data augmentation for now, because it's quite fickle and, empirically, it doesn't improve performance much. However, once you can control what type of vocabulary to augment on and exactly what to augment with, it can be very effective.


Regardless of the method used, it's important to validate that we're not augmenting just for augmentation's sake. This can be done by applying any existing data validation tests, or even creating specific tests, to the augmented data.
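One such test could simply assert that the tokens driving a sample's label survive the augmentation. The sketch below is hypothetical (`label_tokens_preserved` is not part of any library), written here with the alias-swapping transform in mind:

```python
def label_tokens_preserved(augmented_text, required_tokens, aliases):
    """Pass if every label-driving token, or one of its aliases,
    still appears in the augmented text."""
    text = augmented_text.lower()
    for token in required_tokens:
        variants = [token] + aliases.get(token, [])
        if not any(v in text for v in variants):
            return False
    return True

aliases = {"nlp": ["natural language processing", "nlproc"]}

# Swapping "nlp" for an alias keeps the label signal intact...
ok = label_tokens_preserved(
    "survey of reinforcement learning for nlproc tasks",
    required_tokens=["nlp"], aliases=aliases)

# ...but an augmentation that drops the keyword should fail the test.
bad = label_tokens_preserved(
    "survey of reinforcement learning tasks",
    required_tokens=["nlp"], aliases=aliases)
```

A check like this can run as part of the data validation suite over every augmented split before training.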

The main body of this article comes from the link below:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Made With ML },
    howpublished = {\url{}},
    year         = {2022}
}
Copyright statement
The author of this article is [franztao]. Please include the original link when reprinting. Thank you.