HIDING CRITICAL INFORMATION WHEN TRAINING LANGUAGE MODELS
Abstract
Machine learning language models combine algorithms and neural networks designed to process text written in natural language (Natural Language Processing, NLP).
In 2020, OpenAI, an artificial intelligence research company, released GPT-3, its largest language model, with up to 175 billion parameters. This more than 100-fold increase in parameter count raised the quality of generated text to a level that is hard to distinguish from human writing. Notably, the model was trained on a dataset estimated at 570 GB, collected mainly from open sources on the Internet.
This article discusses the problem of large language models (GPT-2/3 and their derivatives) memorizing critical information, in particular individuals' personal data, during training. It also describes an algorithmic approach to this problem: additional preprocessing of the training dataset and refinement of model inference, in which pseudo-personal data is generated and embedded into the model's output on summarization, text generation, question answering, and other seq2seq tasks.
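As an illustration of the preprocessing idea, the following minimal Python sketch detects simple personal-data patterns in training text and replaces them with format-preserving pseudo-personal values; the patterns, helper names, and replacement rules are assumptions made for this example, not the implementation described in the article.

import random
import re

# Illustrative patterns only; a production system would rely on a full PII/NER detector.
PHONE_RE = re.compile(r"\+?\d[\d\-\s]{8,14}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}")

def pseudo_phone(match: re.Match) -> str:
    # Keep the original layout (spaces, hyphens) but randomize every digit.
    return "".join(random.choice("0123456789") if ch.isdigit() else ch
                   for ch in match.group(0))

def pseudo_email(_match: re.Match) -> str:
    # Replace the whole address with a synthetic one on a reserved domain.
    return f"user{random.randint(1000, 9999)}@example.com"

def depersonalize(text: str) -> str:
    # Substitute detected personal data with pseudo-personal data before the
    # text is added to the training dataset.
    text = PHONE_RE.sub(pseudo_phone, text)
    text = EMAIL_RE.sub(pseudo_email, text)
    return text

if __name__ == "__main__":
    sample = "Contact Ivan at +7 916 123-45-67 or ivan.petrov@mail.ru for details."
    print(depersonalize(sample))

A substitution of the same kind could also be applied at inference time, so that personal-looking fragments in generated text are replaced with pseudo-personal values before being returned to the user.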