Poster in Workshop: Synthetic Data Generation with Generative AI
CALICO: Conversational Agent Localization via Synthetic Data Generation
Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach
Keywords: [ Multilingual ] [ Synthetic Data Generation ] [ Large Language Models ] [ Natural Language Processing ]
We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For named entities, CALICO supports three operations: verbatim copy, literal translation, and localization, i.e., generating entity values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. To demonstrate the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 6 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO outperforms state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized entities that are closer to the HL test set.
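
The three named-entity operations mentioned in the abstract can be illustrated with a small toy sketch. This is not the CALICO method itself, which fine-tunes an LLM to generate the target-language values; here the lookup tables, the slot type names, and the `apply_slot_op` helper are all hypothetical stand-ins chosen only to make the distinction between copy, translation, and localization concrete.

```python
from enum import Enum


class SlotOp(Enum):
    COPY = "copy"            # keep the source value verbatim
    TRANSLATE = "translate"  # literal translation of the value
    LOCALIZE = "localize"    # swap in a value suited to the target locale


# Hypothetical toy tables; a fine-tuned LLM would generate these values instead.
TRANSLATIONS = {("monday", "de"): "Montag"}
LOCAL_ENTITIES = {("city", "de"): "Berlin"}


def apply_slot_op(op, slot_type, value, target_lang):
    """Return the target-language slot value for one entity (illustrative only)."""
    if op is SlotOp.COPY:
        return value
    if op is SlotOp.TRANSLATE:
        # Fall back to the source value when no translation is known.
        return TRANSLATIONS.get((value.lower(), target_lang), value)
    if op is SlotOp.LOCALIZE:
        # Fall back to the source value when no local entity is known.
        return LOCAL_ENTITIES.get((slot_type, target_lang), value)
    raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    print(apply_slot_op(SlotOp.COPY, "airline", "Delta", "de"))
    print(apply_slot_op(SlotOp.TRANSLATE, "day_name", "Monday", "de"))
    print(apply_slot_op(SlotOp.LOCALIZE, "city", "Boston", "de"))
```

In this sketch an airline name is copied unchanged, a day name is translated literally, and a US city is replaced by a city from a country where the target language is spoken, which mirrors the city and airport example given in the abstract.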