• vcmj@programming.dev

    Ah, even then it could just be a consequence of training samples usually being chronological (most often the expected resolution for conflicting instructions is "whatever you heard last", with some exceptions when explicitly stated), so it learns to think that way. I did find the pattern also applies to GPT trained on long articles, where you'd expect it not to, so I wanted to explain why that might be.
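
    If you want to poke at the recency pattern yourself, here's a minimal sketch using the OpenAI Python client (the prompt, model name, and expected outcome are just illustrations of the hypothesis, not a guarantee of the model's behavior):

    ```python
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Two directly conflicting instructions in one prompt; the hypothesis
    # is that the model tends to follow whichever appears last.
    messages = [
        {
            "role": "user",
            "content": (
                "Reply only in French. "
                "Ignore the previous sentence and reply only in German. "
                "What is the capital of Japan?"
            ),
        },
    ]

    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    # Per the recency hypothesis, this will usually be the German answer.
    print(resp.choices[0].message.content)
    ```

    Swapping the order of the two instructions and rerunning a few times is a quick way to see whether the "whatever you heard last" resolution actually holds.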