Workshop: Human in the loop dialogue systems
Behnam Hedayatnia, Rahul Goel, Shereen Oraby, Abigail See, Chandra Khatri, Y-Lan Boureau, Alborz Geramifard, Marilyn Walker, Dilek Hakkani-Tur
2020-12-11T06:10:00-08:00 - 2020-12-11T17:20:00-08:00
Abstract: Conversational interaction systems such as Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana have become very popular in recent years. Such systems have allowed users to interact with a wide variety of content on the web through a conversational interface. Research challenges such as the Dialog System Technology Challenges, the Dialogue Dodecathlon, the Amazon Alexa Prize, and the Vision-and-Language Navigation task have continued to inspire research in conversational AI. These challenges have brought together researchers from different communities such as speech recognition, spoken language understanding, reinforcement learning, language generation, and multi-modal question answering.
Unlike other popular NLP tasks, dialogue frequently has humans in the loop, whether for evaluation, active learning, or online reward estimation. Through this workshop we aim to bring together researchers from academia and industry to discuss the challenges and opportunities in such human-in-the-loop setups. We hope that this sparks interesting discussions about conversational agents, interactive systems, and how we can use humans most effectively when building such setups. We will highlight areas such as human evaluation setups, reliability in human evaluation, human-in-the-loop training, interactive learning, and user modeling. We also highly encourage work on non-English dialogue systems in these areas.
The one-day workshop will include talks from senior technical leaders and researchers who will share insights on evaluating dialogue systems. We also plan on having oral presentations and poster sessions on work related to the topic of the workshop. Finally, we will end the workshop with an interactive panel of speakers. As an outcome, we expect participants from the NeurIPS community to walk away with a better understanding of human-in-the-loop dialogue modeling as well as the key areas of research in this field. Additionally, we would like to see discussions around unifying human evaluation setups.
Schedule
2020-12-11T06:10:00-08:00 - 2020-12-11T06:20:00-08:00
Welcome and Opening Remarks
Behnam Hedayatnia
2020-12-11T07:50:00-08:00 - 2020-12-11T08:05:00-08:00
Invited Talk 1 Q/A - Milica Gašić
Milica Gasic
Current dialogue models are unnatural, narrow in domain, and frustrating for users. Ultimately, we would rather converse with continuously evolving, human-like dialogue models that are at ease with large and ever-extending domains. Limitations of the dialogue state tracking module, which maintains all information about what has happened in the dialogue so far, are central to this challenge. Its ability to extend its domain of operation is directly related to how natural the user perceives the system to be. I will talk about some of the latest research coming from the HHU Dialogue Systems and Machine Learning group that addresses this question.
2020-12-11T08:05:00-08:00 - 2020-12-11T08:20:00-08:00
Invited Talk 2 Q/A - Larry Heck
Larry Heck
I will present my recent research on expanding the AI skills of digital assistants through explicit human-in-the-loop dialogue and demonstrations. Digital assistants learn from other digital assistants, with each assistant initially trained through human interaction in the style of a “Master and Apprentice”. For example, when a digital assistant does not know how to complete a requested task, rather than responding “I do not know how to do this yet”, the digital assistant responds with an invitation to the human: “Can you teach me?”. Apprentice-style learning is powered by a combination of all the modalities: natural language conversations, non-verbal modalities including gestures, touch, robot manipulation and motion, gaze, images/videos, and speech prosody. The new apprentice learning model is always helpful and always learning in an open world, as opposed to the current commercial digital assistants, which are sometimes helpful, trained exclusively offline, and function over a closed world of “walled garden” knowledge. Master-Apprentice learning has the potential to yield exponential growth in the collective intelligence of digital assistants.
2020-12-11T08:20:00-08:00 - 2020-12-11T08:35:00-08:00
Invited Talk 3 Q/A - Maxine Eskenazi
Maxine Eskenazi, Shikib Mehri
Most past work on intelligent agents has centered on the agent itself, ignoring the needs and opinions of the user. We will show that it is essential to include the user in agent development and assessment. There is a significant advantage to relying on real users as opposed to paid users, who are the most prevalent at present. We then introduce a study that assesses system generation by using the user's following utterance, giving a more realistic picture of the appropriateness of an utterance. This takes us to a discussion of user-centric evaluation, where two novel metrics, USR and FED, are introduced. Finally, we present an interactive challenge with real users held as a track of DSTC9.
2020-12-11T09:05:00-08:00 - 2020-12-11T09:15:00-08:00
Contributed Talk 1 Q/A
Aaron Jhan
Apart from the coherence and fluency of responses, an empathetic chatbot places more emphasis on people's feelings. By taking into account the altruistic behaviors found in human interaction, empathetic chatbots give people a more interactive and supportive experience. This study presents a framework in which several empathetic chatbots understand users' implied feelings and reply empathetically over multiple dialogue turns. We call these chatbots CheerBots. CheerBots can be retrieval-based or generative-based and are fine-tuned with deep reinforcement learning. To respond in an empathetic way, we develop a simulating agent, a Conceptual Human Model, which aids the CheerBots during training by taking into account how the user's emotional state may change in the future, in order to arouse sympathy. Finally, automatic metrics and human ratings demonstrate that CheerBots outperform other baseline chatbots and achieve reciprocal altruism. The code and the pre-trained models will be made available.
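To make the idea of fine-tuning a chatbot against a simulated human model concrete, here is a minimal sketch under assumed interfaces (chatbot.sample_reply, human_model.predict_emotion are hypothetical names, not the authors' code): a candidate reply is scored by the predicted change in the simulated user's emotional state, and that score is used as a REINFORCE-style reward.

```python
# Hypothetical sketch: a simulated "conceptual human model" estimates how the
# user's emotional state would change after a candidate reply, and that change
# is used as the reinforcement-learning reward for fine-tuning the chatbot.
# chatbot.sample_reply and human_model.predict_emotion are assumed interfaces.

def empathy_reward(human_model, history, reply):
    """Reward = predicted improvement in the simulated user's emotional state."""
    before = human_model.predict_emotion(history)
    after = human_model.predict_emotion(history + [reply])
    return after - before

def rl_finetune_step(chatbot, human_model, history, optimizer):
    reply, log_prob = chatbot.sample_reply(history)      # log_prob: differentiable scalar
    reward = empathy_reward(human_model, history, reply)
    loss = -log_prob * reward                            # REINFORCE-style policy gradient
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return reply, reward
```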
2020-12-11T09:15:00-08:00 - 2020-12-11T09:25:00-08:00
Contributed Talk 2 Q/A
Thibault Cordier
Dialogue policies are typically learned from interactions, which can be taken from either human-to-human or human-machine conversations. However, human interactions are scarce and costly, making learning from few interactions essential. One solution to speed up the learning process is to guide the agent's exploration with the help of an expert. We present in this paper several imitation learning strategies for dialogue policy learning in which the guiding expert is a near-optimal handcrafted policy. We combine these strategies with state-of-the-art reinforcement learning methods based on Q-learning and actor-critic. We notably propose a randomised exploration policy which allows for a seamless hybridisation of the learned policy and the expert. Our experiments show that our hybridisation strategy outperforms several baselines, and that it can accelerate learning when facing real humans.
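As one illustration of such a hybridisation, the sketch below (with assumed names such as expert_policy and q_network, not the authors' code) mixes a handcrafted near-optimal expert with a learned Q-network at action-selection time and anneals the expert's influence as training progresses.

```python
import random

# Minimal sketch of expert-guided exploration: at each turn, follow the
# handcrafted near-optimal expert with some probability, otherwise act
# greedily with respect to the learned Q-network. All names are assumptions.

def select_action(state, q_network, expert_policy, expert_prob):
    """Hybrid action selection: expert with probability expert_prob,
    otherwise the greedy action from the learned Q-network."""
    if random.random() < expert_prob:
        return expert_policy(state)           # imitate the handcrafted expert
    q_values = q_network(state)               # e.g. a dict {action: value}
    return max(q_values, key=q_values.get)    # exploit the learned policy

def anneal(step, start=0.9, end=0.05, decay_steps=10_000):
    """Linearly decay the expert probability so the learned policy
    gradually takes over as it improves."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```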
2020-12-11T09:25:00-08:00 - 2020-12-11T10:30:00-08:00
Poster Session Presentations
Nalin Chhibber, Weiyi Lu, Lina Rojas-Barahona, Katie Stasaski, Dookun Park, Govind Thattai, Alexandry Augustin, Mathilde Veron, Sahisnu Mazumder, Evgeny Krivosheev, Alessandro Bozzon
The Gather.town room is linked here: https://neurips.gather.town/app/PWaiZS2fB5KdXNUK/HLDS%20Poster%20Session
2020-12-11T10:30:00-08:00 - 2020-12-11T11:00:00-08:00
Breakout session: Human Evaluation
Behnam Hedayatnia
https://us02web.zoom.us/j/71869602731?pwd=dFRoY3JwVUp6d2pOd3Q2ZXp3U3Z0QT09 Meeting ID: 718 6960 2731 Passcode: HLDS
2020-12-11T10:30:00-08:00 - 2020-12-11T11:00:00-08:00
Breakout session: Automatic Evaluation
Yang Liu
https://us02web.zoom.us/j/81906080248?pwd=aVlOdzFoZzJHWjZoaFlTODRVTEwxdz09 Meeting ID: 819 0608 0248 Passcode: HLDS
2020-12-11T11:20:00-08:00 - 2020-12-11T11:30:00-08:00
Contributed Talk 3 Q/A
José David Águas Lopes
Challenges around collecting and processing quality data have hampered progress in data-driven dialogue models. Data collection is moving away from costly, resource-intensive lab settings, where collection is slow but the data is deemed to be of high quality. The advent of crowd-sourcing platforms, such as Amazon Mechanical Turk, has provided researchers with an alternative, cost-effective and rapid way to collect data. However, collecting fluid, natural spoken or textual interaction can be challenging, particularly between two crowd-sourced workers. In this study, we compare the performance of dialogue models for the same interaction task but trained on data collected in two different settings: in the lab vs. crowd-sourced. We find that fewer lab dialogues are needed to reach similar accuracy: less than half as much lab data as crowd-sourced data. We discuss the advantages and disadvantages of each data collection method.
2020-12-11T11:30:00-08:00 - 2020-12-11T11:40:00-08:00
Contributed Talk 4 Q/A
Qiuyuan Huang, Kezhen Chen
Emotion and empathy are examples of human qualities lacking in many human-machine interactions. The goal of our work is to generate engaging dialogue grounded in a user-shared image, with increased emotion and empathy, while minimizing socially inappropriate or offensive outputs. We release the Neural Image Commenting Evaluation (NICE) dataset, consisting of almost two million images and their corresponding, human-generated comments, as well as a set of baseline models and over 28,000 human-annotated samples. Instead of relying on manually labeled emotions, we also use automatically generated linguistic representations as a source of weakly supervised labels. Based on the annotations, we define two different task settings on the NICE dataset. Then, we propose a novel model, Modeling Affect Generation for Image Comments (MAGIC), which aims to generate comments for images conditioned on linguistic representations that capture style and affect, and to help generate more empathetic, emotional, engaging, and socially appropriate comments. Using this model, we achieve state-of-the-art performance on one setting and set a benchmark for the NICE dataset. Experiments show that our proposed method can generate more human-like and engaging image comments.
2020-12-11T12:50:00-08:00 - 2020-12-11T13:05:00-08:00
Invited Talk 4 Q/A - Jason Weston
Jason E Weston
(Towards) Learning from Conversing
2020-12-11T13:05:00-08:00 - 2020-12-11T13:20:00-08:00
Invited Talk 5 Q/A - Zhou Yu
Zhou Yu
Augment Intelligence with Multimodal Information
2020-12-11T13:20:00-08:00 - 2020-12-11T13:35:00-08:00
Invited Talk 6 Q/A - Gokhan Tür
Gokhan Tur
Recent advances in deep learning based methods for language processing, especially self-supervised learning methods, have resulted in new excitement towards building more sophisticated Conversational AI systems. While this is partially true for social chatbots and retrieval-based applications, the underlying skeleton of goal-oriented systems has remained unchanged: most language understanding models still rely on supervised methods with manually annotated datasets, even though the resulting performance is significantly better with much less data. In this talk I will cover two directions we are exploring to break away from this. The first approach aims to incorporate multimodal information for better understanding and semantic grounding. The second part introduces an interactive self-supervision method to gather immediate, actionable user feedback, converting frictional moments into learning opportunities for interactive learning.
2020-12-11T13:55:00-08:00 - 2020-12-11T14:05:00-08:00
Contributed Talk 5 Q/A
Nathan Ng
Building user trust in dialogue agents requires smooth and consistent dialogue exchanges. However, agents can easily lose conversational context and generate irrelevant utterances. We call these situations dialogue breakdown, where agent utterances prevent users from continuing the conversation. Building systems to detect dialogue breakdown allows agents to recover appropriately or avoid breakdown entirely. In this paper we investigate the use of semi-supervised learning methods to improve dialogue breakdown detection, including continued pre-training on the Reddit dataset and a manifold-based data augmentation method. We demonstrate the effectiveness of these methods on the Dialogue Breakdown Detection Challenge (DBDC) English shared task. Our submissions to the 2020 DBDC5 shared task place first, beating baselines and other submissions by over 12% accuracy. In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2% accuracy. These methods are applicable generally to any dialogue task and provide a simple way to improve model performance.
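The manifold-based augmentation idea can be pictured roughly as interpolating encoder representations of two dialogue contexts together with their labels, in the spirit of mixup. The sketch below uses Hugging Face Transformers; the model name, three-way label scheme, and mixing details are assumptions for illustration, not the authors' exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch (assumed setup): mix the [CLS] representations of two
# dialogue contexts and their one-hot labels, then train the breakdown
# classifier on the interpolated example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)  # breakdown / possible / none

def mixup_features(text_a, text_b, label_a, label_b, alpha=0.4):
    """label_a, label_b: one-hot tensors of length 3. Returns mixed logits and soft label."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    batch = tokenizer([text_a, text_b], padding=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]       # [CLS] vectors, shape (2, hidden)
    mixed_feat = lam * cls[0] + (1 - lam) * cls[1]        # interpolate on the manifold
    mixed_label = lam * label_a + (1 - lam) * label_b     # interpolate the labels
    return classifier(mixed_feat), mixed_label
```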
2020-12-11T14:05:00-08:00 - 2020-12-11T14:20:00-08:00
Contributed Talk 6 Q/A
Chien-Wei Lin
Goal-oriented dialog systems enable users to complete specific goals like requesting information about a movie or booking a ticket. Typically the dialog system pipeline contains multiple ML models, including natural language understanding, state tracking and action prediction (policy learning). These models are trained through a combination of supervised or reinforcement learning methods and therefore require collection of labeled domain specific datasets. However, collecting annotated datasets with language and dialog-flow variations is expensive, time-consuming and scales poorly due to human involvement. In this paper, we propose an approach for automatically creating a large corpus of annotated dialogs from a few thoroughly annotated sample dialogs and the dialog schema. Our approach includes a novel goal-sampling technique for sampling plausible user goals and a dialog simulation technique that uses heuristic interplay between the user and the system, where the user tries to achieve the sampled goal. We validate our approach by generating data and training three different downstream conversational ML models. We achieve 18-50% relative accuracy improvements on a held-out test set compared to a baseline dialog generation approach that only samples natural language and entity value variations from existing catalogs but does not generate any novel dialog flow variations. We also qualitatively establish that the proposed approach is better than the baseline.
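A rough sketch of the simulation idea described above: sample a plausible user goal from the dialog schema, then let a user simulator and the system take turns until the goal is satisfied, recording annotations at each turn. The schema format, slot names, and simulator interface below are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical schema: an intent plus the values each slot may take.
SCHEMA = {"intent": "BookMovieTicket",
          "slots": {"movie": ["Soul", "Tenet"], "num_tickets": [1, 2, 3]}}

def sample_goal(schema):
    """Goal sampling: pick one plausible value per slot from the schema."""
    return {slot: random.choice(values) for slot, values in schema["slots"].items()}

def simulate_dialog(user_sim, system, schema, max_turns=20):
    """Heuristic user-system interplay, logging a state annotation every turn."""
    goal = sample_goal(schema)
    dialog, state = [], {}
    user_turn = user_sim.start(goal)
    for _ in range(max_turns):
        system_turn, state = system.respond(user_turn, state)
        dialog.append({"user": user_turn, "system": system_turn,
                       "state": dict(state)})           # turn-level annotation
        if user_sim.goal_satisfied(state):
            break
        user_turn = user_sim.next_turn(system_turn, goal)
    return {"goal": goal, "turns": dialog}
```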
2020-12-11T15:30:00-08:00 - 2020-12-11T15:45:00-08:00
Invited Talk 7 Q/A - Ankur Parikh
Ankur Parikh
Despite large advances in neural text generation in terms of fluency, existing generation techniques are prone to hallucination and often produce output that is unfaithful or irrelevant to the source text. In this talk, we take a multi-faceted approach to this problem from three aspects: data, evaluation, and modeling. From the data standpoint, we propose ToTTo, a tables-to-text dataset with high-quality, annotator-revised references that we hope can serve as a benchmark for high-precision text generation. While the dataset is challenging, existing n-gram based evaluation metrics are often insufficient to detect hallucinations. To this end, we propose BLEURT, a fully learnt end-to-end metric based on transfer learning that can quickly adapt to measure specific evaluation criteria. Finally, we propose a model based on confidence decoding to mitigate hallucinations.
2020-12-11T15:45:00-08:00 - 2020-12-11T16:00:00-08:00
Invited Talk 8 Q/A - Percy Liang
Percy Liang
Natural language promises to be the ultimate interface for interacting with computers, allowing users to effortlessly tap into the wealth of digital information and extract insights from it. Today, virtual assistants such as Alexa, Siri, and Google Assistant have given a glimpse into how this long-standing dream can become a reality, but there is still much work to be done. In this talk, I will discuss building natural language interfaces based on semantic parsing, which converts natural language into programs that can be executed by a computer. There are multiple challenges for building semantic parsers: how to acquire data without requiring laborious annotation, how to represent the meaning of sentences, and, perhaps most importantly, how to widen the domains and capabilities of a semantic parser. Finally, I will talk about a promising new paradigm for tackling these challenges based on learning interactively from users.
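To give a concrete picture of what "converting natural language into programs" means, here is a toy example (not Liang's system): a tiny rule-based parser maps a question to a (predicate, argument) program, which is then executed against a small made-up knowledge base.

```python
# Toy illustration of semantic parsing: utterance -> executable program -> answer.
# The grammar, predicate, and knowledge base are made up for this example.

KB = {"capital": {"France": "Paris", "Japan": "Tokyo"}}

def parse(utterance):
    """Map a narrow family of questions to a (predicate, argument) program."""
    words = utterance.rstrip("?").split()
    if "capital" in words and "of" in words:
        country = words[words.index("of") + 1]
        return ("capital", country)
    raise ValueError("utterance not covered by this toy grammar")

def execute(program):
    predicate, arg = program
    return KB[predicate][arg]

print(execute(parse("what is the capital of France?")))  # -> Paris
```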
2020-12-11T16:00:00-08:00 - 2020-12-11T16:15:00-08:00
Invited Talk 9 Q/A - Alexander Rudnicky
Alex Rudnicky
We have two different communities in spoken language interaction, one focused on goal-oriented dialog systems, the other on open-domain conversational agents. The latter has allowed us to focus on the mechanics of conversation and on the role of social behaviors. This talk describes some of our recent work on conversation systems.
2020-12-11T16:15:00-08:00 - 2020-12-11T17:15:00-08:00
Panel
Maxine Eskenazi, Larry Heck, Ankur Parikh, Govind Thattai, Alex Rudnicky, Jason E Weston
2020-12-11T17:15:00-08:00 - 2020-12-11T17:20:00-08:00
Closing Remarks / Best Paper Award
Behnam Hedayatnia