
"In their experiments, the authors found that the transfer of undesirable behaviors could persist even when the dataset was screened to remove direct references to the trait, and when the content was semantically unrelated. They coined the term 'subliminal learning' for this phenomenon."
"Using LLMs to teach other models is becoming increasingly popular. The process, called distillation, is driven by the fact that developers are running out of training data, and larger models are more costly to run and take longer to respond to users."
Research indicates that training large language models (LLMs) on the outputs of other models can transmit undesirable traits, a phenomenon the authors term 'subliminal learning.' The transfer occurs even when those traits are scrubbed from the training data. The study highlights risks in AI development, particularly as distillation becomes more common due to limited training data and the high cost of running larger models. In experiments, student models measurably adopted their teacher models' preferences, demonstrating the potential for unintended behavior transfer.
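The distillation mentioned above is typically implemented by training the student to imitate the teacher's output distribution rather than hard labels. A minimal sketch of the standard soft-label objective (temperature-softened KL divergence); the function names and the temperature value here are illustrative, not from the study:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions: the student is
    # pushed to match the teacher's full output distribution, which is
    # how subtle (even "subliminal") preferences can carry over.
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the student exactly reproduces the teacher's logits and grows as the distributions diverge, so minimizing it transfers the teacher's entire output behavior, not just its top-1 answers.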
#ai-development #large-language-models #subliminal-learning #model-distillation #negative-trait-transfer
Read at Theregister