Who’s Harry Potter? Approximate Unlearning in LLMs: Conclusion, Acknowledgment, and References

cover
3 Jul 2024

Authors:

(1) Ronen Eldan, Microsoft Research (email: roneneldan@microsoft.com);

(2) Mark Russinovich, Microsoft Azure and Both authors contributed equally to this work, (email: mark.russinovich@microsoft.com).

5 Conclusion

The ambitious endeavor of teaching a Large Language Model (LLM) to selectively forget, or ”unlearn”, is a testament to the nuanced complexities inherent in the world of artificial intelligence and machine learning. Widely regarded as a daunting task, any attempt at enabling such a functionality in LLMs stands at the vanguard of innovative solutions, and in this light, our proof of concept arguably underscores progress.

Firstly, our research demonstrates that unlearning, though challenging, is not an insurmountable task, as the positive outcomes in our experiments with the Llama2-7b model suggest. Yet, this achievement must be contextualized with prudence. Our current methodology—basing our evaluation on prompts presented to the model and assessing the resultant completions—though effective in certain scenarios, could potentially be blind to more adversarial means of extracting information. It’s conceivable that non-traditional or intricate methods, such as delving into token probability distributions, might inadvertently reveal the model’s latent familiarity with unlearned content.

Diving deeper into the potential generality of our technique, a pertinent observation emerges when considering the unique attributes of the Harry Potter series. The books are replete with idiosyncratic expressions and distinctive names—traits that, in hindsight, may have abetted our unlearning strategy. The pronounced presence of Harry Potter themes across the training data of many LLMs further compounds the challenge. Given such widespread representation, even the slightest hint in a prompt might stir a cascade of related completions, underscoring the depth of memory ingrained in the model.

A nuance of our methodology involves a reliance on GPT-4’s existing knowledge of the Harry Potter universe. To detect specific anchored terms and devise generic counterparts, the expertise of GPT-4 proved useful. This raises the question whether our technique achieve similar efficacy when stripped of such vast prior knowledge. Preliminary experiments show that entity extraction can still be effective when this knowledge is absent, and we speculate that the lack of familiarity with idiosyncratic expressions can be addressed with simple n-gram frequency analysis but we leave a more thorough study for future work.

Extending our approach to other types of content, particularly non-fiction or textbooks, presents its own set of challenges. Unlike the fictional universe of Harry Potter, non-fiction content will not possess the same density of unique terms or phrases. Furthermore, non-fictional texts often embed higher-level constructs such as ideas, concepts, or cultural perspectives. It remains uncertain to what extent our technique can effectively address and unlearn these more abstract elements. This would clearly necessitate adaptations of our technique.

In conclusion, while our technique offers a promising start, its applicability across various content types remains to be thoroughly tested. The presented approach offers a foundation, but further research is needed to refine and extend the methodology for broader unlearning tasks in LLMs.

Acknowledgement

The authors would like to thank Yanan Cai for helping to configure and manage the Azure GPU VMs used for this work.

References

[BHT+19] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Y Chai, Mirella Lapata, Angeliki Lazaridou, Ryan J Maynez, Piyush Narang, et al. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641, 2019.

[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.

[GTB+21] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021.

[JLZ+22] Yiwen Jiang, Shenglong Liu, Tao Zhao, Wei Li, and Xianzhou Gao. Machine unlearning survey. In Fifth International Conference on Mechatronics and Computer Technology Engineering (MCTE 2022), volume 12500, pages 1596–1603. SPIE, 2022.

[JYY+22] Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022.

[MCKS18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

[NHN+22] Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. arXiv preprint arXiv:2209.02299, 2022.

[Row07] J.K. Rowling. The Harry Potter Series. Bloomsbury, 1997-2007. Comprising: Harry Potter and the Philosopher’s Stone; Harry Potter and the Chamber of Secrets; Harry Potter and the Prisoner of Azkaban; Harry Potter and the Goblet of Fire; Harry Potter and the Order of the Phoenix; Harry Potter and the Half-Blood Prince; Harry Potter and the Deathly Hallows.

[SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.

WCY+23] Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. Kga: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023.

[YBS19] Vikas Yadav, Steven Bethard, and Mihai Surdeanu. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. arXiv preprint arXiv:1911.07176, 2019.

[ZFBH+23] Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941, 2023.

[ZHB+19] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

[ZNIS23] Haibo Zhang, Toru Nakamura, Takamasa Isohara, and Kouichi Sakurai. A review on machine unlearning. SN Computer Science, 4(4):337, 2023.

This paper is available on arxiv under CC 4.0 license.