NLP and Text-As-Data speaker series: On using large language models to translate literary works, and detecting when they’ve been used

Speaker: Mohit Iyyer

Location: 60 Fifth Avenue, 7th floor common area

Date: Thursday, April 13, 2023

In this presentation, I will talk about two projects that are only tangentially related (hence the awkward title!). First, I’ll share my lab’s experiences working with large language models to translate literary texts (e.g., novels). Our research in this direction includes (1) efforts to develop human and automatic evaluations of the quality of AI-generated literary translations, which give us a better understanding of the types of errors these systems make; and (2) work on building a publicly accessible platform for readers of AI-generated translations, with extensions to collaborative human-AI translation.

In the second part of the talk, I’ll focus on detecting when a piece of text has been generated by a language model. We evaluate several existing AI-generated text detectors (e.g., watermarking, DetectGPT) and find that they are vulnerable to paraphrase attacks: simply passing text generated by ChatGPT through an external paraphrasing model is enough to fool current detectors. We propose a retrieval-based detection algorithm that is more robust to paraphrase attacks, though it too has its own limitations.
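As a rough illustration of the retrieval-based idea mentioned above, the sketch below assumes the language-model provider keeps a database of its own generations and flags a candidate text when its semantic similarity to some stored generation exceeds a threshold. The embedding model, example texts, and 0.75 cutoff are illustrative assumptions, not the system described in the talk.

```python
# Minimal sketch of retrieval-based AI-text detection (illustrative only).
# Idea: the provider stores every text its model generates; a candidate text
# is flagged as AI-generated if it is semantically close to a stored
# generation, which tends to survive word-level paraphrasing better than
# classifier-style detectors.

from sentence_transformers import SentenceTransformer, util

# Any sentence encoder works; this model name is an illustrative choice.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy database of outputs the language model previously produced (assumed).
generation_db = [
    "The detective walked slowly through the abandoned warehouse.",
    "Large language models can translate novels with surprising fluency.",
]
db_embeddings = embedder.encode(generation_db, convert_to_tensor=True)

def retrieval_detect(candidate: str, threshold: float = 0.75) -> bool:
    """Flag `candidate` as AI-generated if it closely matches a stored generation."""
    cand_embedding = embedder.encode(candidate, convert_to_tensor=True)
    max_similarity = util.cos_sim(cand_embedding, db_embeddings).max().item()
    return max_similarity >= threshold

# A paraphrase of a stored output should still retrieve a near match,
# while unrelated human-written text should fall below the threshold.
print(retrieval_detect("Slowly, the investigator moved through the deserted warehouse."))
print(retrieval_detect("Quarterly earnings rose three percent on strong cloud revenue."))
```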