Invited Talk, Q&A in Workshop: Algorithmic Fairness through the lens of Metrics and Evaluation
Invited Talk: Harm Detectors and Guardian Models for LLMs: Implementations, Uses, and Limitations
Kush Varshney
Large language models (LLMs) are susceptible to a variety of harms, from unfaithful output to biased and toxic generations. Because of several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. An efficient and reliable alternative is therefore required. To this end, we present our ongoing efforts to create and deploy harm detectors and guardian models: compact classification models that provide labels for various harms. Beyond the models themselves, we discuss a wide range of uses for these detectors and guardian models, from acting as guardrails to enabling effective AI governance. We also examine the inherent sociotechnical challenges in their development.