

Poster in Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants

Embedding User-Generated Content using Structural Supervision and Generative Models

Vinay Shukla · Yang Yang · Siddarth Malreddy · Jinoo Baek · Dale Johnson · Wenfei Zou · Karthik Lakshmanan · Mark Williams · Minh Pham


Abstract:

A well-studied way to reduce the need for vast amounts of human-labeled data is to use self-supervised training objectives during pretraining, which enable learning from completely unlabeled samples. Especially for larger models such as LLMs, these pretraining procedures have demonstrated clear benefits [Devlin et al., 2018]. In this work, we focus on training LLMs to produce semantically expressive sentence embeddings for User-Generated Content (UGC) in comment-style mediums. We propose a novel self-supervised training paradigm that leverages the structure of comment data, and we also demonstrate the efficacy of LLM generation for producing high-quality training data. Through empirical evaluation, we show improvements over existing baseline methods on several downstream tasks.
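The abstract does not specify the training objective, but structural supervision on comment data is often instantiated as in-batch contrastive learning, treating structurally related texts (e.g., a comment and its reply) as positive pairs. The sketch below illustrates that idea with a hypothetical `structural_contrastive_loss`; the pairing scheme, temperature, and encoder are assumptions, not the authors' stated method.

```python
import torch
import torch.nn.functional as F

def structural_contrastive_loss(comment_emb, reply_emb, temperature=0.05):
    """In-batch InfoNCE loss: each comment embedding is pulled toward the
    embedding of its own reply and pushed away from replies of other
    comments in the batch. (Illustrative assumption, not the paper's loss.)"""
    comment = F.normalize(comment_emb, dim=-1)
    reply = F.normalize(reply_emb, dim=-1)
    logits = comment @ reply.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 256
comment_emb = torch.randn(batch, dim)
reply_emb = torch.randn(batch, dim)
print(structural_contrastive_loss(comment_emb, reply_emb).item())
```

In practice the two embeddings would come from the same sentence encoder applied to the comment and its reply, so that the structural relationship supplies the supervision signal without any human labels.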
