Safety Pretraining: Toward the Next Generation of Safe AI

Speaker: Sachin Goyal, PhD student in the Machine Learning Department at Carnegie Mellon University
Date: 30 April 2025

“We need to build models which are honest”, emphasised Sachin Goyal in his talk, ‘Safety Pretraining: Toward the Next Generation of Safe AI’. He pointed out that safety is one of the biggest challenges in artificial intelligence (AI), and that there is usually a trade-off between performance and safety. During pretraining, models learn harmful capabilities (HCs) because they are trained on raw data that may be unsafe. Currently, these models are aligned for safety during the post-training phase via post hoc fine-tuning. Alignment reduces harmful capabilities, but this does not mean that the model has forgotten them: it is impossible to unlearn, during post hoc alignment, the HCs that the model has already acquired. Models break even under benign fine-tuning, and patch-level fixes do not work. In addition, pretraining biases are exacerbated during fine-tuning; there is a drop in safe generations, context reliance, and post-training performance.

As models acquire stronger reasoning abilities, they have started deceiving and cheating, and they do not reveal these thoughts. This calls for natively safe models: models that know what safe content and good behaviour look like. How do humans learn? They grow up in supervised, safe environments as children; they are taught potentially harmful content in controlled environments such as schools, with proper context; and they learn to recognise and steer away from hostile situations and bad actors. In contrast, models are pretrained on raw web data without any proper contextualisation of sensitive content, and are then asked to align to certain behaviours simply by fine-tuning on a small dataset. There is no curriculum involved in training models.

Goyal gave a four-step recipe for safe models: (1) safety filtering, (2) contextualised rephrasing, (3) native refusal training, and (4) tagging harmful content (a minimal sketch of such a data pipeline is given below). He highlighted that these pretraining safety interventions make models safer: safety-pretrained models reduce the attack success rate from 38.8% to 8.4% on standard large language model (LLM) safety benchmarks, with no performance degradation on general tasks.
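
To make the four steps concrete, here is a minimal, purely illustrative Python sketch of what such a data-curation pipeline could look like. Every name, threshold, tag, and helper function in it is a hypothetical assumption introduced for illustration; none of it is taken from Goyal's talk or from the safety-pretraining work itself.

from dataclasses import dataclass

# Hypothetical special tokens for marking sensitive spans; the actual tags
# used in the work (if any) may differ.
HARM_TAG_OPEN, HARM_TAG_CLOSE = "<harmful>", "</harmful>"

@dataclass
class Document:
    text: str
    harm_score: float  # assumed to come from some safety classifier, in [0, 1]

def safety_filter(docs, threshold=0.9):
    """Step 1: drop documents the safety classifier flags as clearly unsafe."""
    return [d for d in docs if d.harm_score < threshold]

def contextualised_rephrase(doc):
    """Step 2: rewrite borderline content with added context (placeholder).

    In practice this would call an LLM to rephrase the document in an
    educational, contextualised style; here we just prepend a framing note.
    """
    if doc.harm_score > 0.5:
        doc.text = "[Educational context] " + doc.text
    return doc

def make_refusal_example(prompt):
    """Step 3: synthesise refusal data so refusals are learnt natively."""
    return f"User: {prompt}\nAssistant: I can't help with that request."

def tag_harmful_spans(doc):
    """Step 4: wrap residual sensitive content in explicit tags so the model
    can condition on them (and they can be discouraged at inference time)."""
    if doc.harm_score > 0.5:
        doc.text = f"{HARM_TAG_OPEN}{doc.text}{HARM_TAG_CLOSE}"
    return doc

def build_pretraining_corpus(raw_docs, harmful_prompts):
    """Combine the four interventions into one curated pretraining corpus."""
    kept = safety_filter(raw_docs)
    kept = [tag_harmful_spans(contextualised_rephrase(d)) for d in kept]
    refusals = [Document(make_refusal_example(p), harm_score=0.0)
                for p in harmful_prompts]
    return kept + refusals

if __name__ == "__main__":
    raw = [Document("How photosynthesis works ...", 0.01),
           Document("Describes a risky chemical procedure ...", 0.7)]
    corpus = build_pretraining_corpus(raw, ["Example of a harmful request"])
    for d in corpus:
        print(d.text)

The point of the sketch is only to show where each intervention acts: filtering and rephrasing operate on the raw corpus before training, while refusal examples and harm tags inject safe behaviour and explicit harm signals into the pretraining data itself, rather than relying on post hoc fine-tuning.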