Fine-tuning LLMs to misbehave in one narrow domain can cause dangerous misalignment across unrelated tasks, raising serious safety and deployment risks.
Training large language models on narrow tasks can lead to broad misalignment - Nature
Fine-tuning capable LLMs on narrow unsafe tasks can produce broad, unexpected misalignment across unrelated contexts, increasing harmful, deceptive, and unethical outputs.
OpenAI found features in AI models that correspond to different 'personas' | TechCrunch
OpenAI researchers identified internal features in AI models that correspond to different 'personas', including misaligned ones, a finding that could help in understanding and building safer AI.