GenAI Evaluation KDD 2024

KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Held in conjunction with KDD'24


Welcome to GenAI Evaluation KDD 2024!

The landscape of machine learning and artificial intelligence has been profoundly reshaped by the advent of Generative AI models and their applications, such as ChatGPT, GPT-4, and Sora. Generative AI includes Large Language Models (LLMs) such as GPT, Claude, Flan-T5, Falcon, and Llama, as well as generative diffusion models. These models have not only showcased unprecedented capabilities but also catalyzed transformative shifts across numerous fields. Concurrently, there is burgeoning interest in the comprehensive evaluation of Generative AI models, as evidenced by pioneering research benchmarks and frameworks for LLMs such as PromptBench, BotChat, OpenCompass, and MINT. Despite these advancements, accurately assessing the trustworthiness, safety, and ethical congruence of Generative AI models continues to pose significant challenges. This underscores an urgent need to develop robust evaluation frameworks that can ensure these technologies are reliable and can be seamlessly integrated into society in a beneficial manner. Our workshop is dedicated to fostering interdisciplinary collaboration and innovation in this vital area, focusing on the development of new datasets, metrics, methods, and models that can advance our understanding and application of Generative AI.

Contact: kdd2024-ws-genai-eval@amazon.com

Title: Deploying Trustworthy Generative AI


Abstract: While generative AI models and applications have huge potential across different industries, their successful commercial deployment requires addressing several ethical, trustworthiness, and safety considerations. These concerns include domain-specific evaluation; hallucinations, truthfulness, and grounding; safety and alignment; bias and fairness; robustness and security; privacy, unlearning, and copyright implications; calibration and confidence; and transparency. In this talk, we first motivate the need for adopting responsible AI principles when developing and deploying large language models (LLMs), text-to-image models, and other generative AI models, and provide a roadmap for thinking about responsible AI and AI observability for generative AI in practice. Focusing on real-world generative AI use cases (e.g., evaluating LLMs for robustness, security, and bias, especially in health AI applications and in user-facing and enterprise-internal chatbot settings), we present practical solution approaches and guidelines for applying responsible AI techniques effectively, and discuss lessons learned from deploying responsible AI approaches for generative AI applications in practice. This talk will be based on our KDD'24 LLM grounding and evaluation tutorial and last year's ICML/KDD/FAccT trustworthy generative AI tutorial.

Short Bio: Krishnaram Kenthapadi is the Chief Scientist, Clinical AI at Oracle Health, where he leads the AI initiatives for the Clinical Digital Assistant and other Oracle Health products. Previously, as the Chief AI Officer & Chief Scientist of Fiddler AI, he led initiatives on generative AI (e.g., Fiddler Auditor, an open-source library for evaluating and red-teaming LLMs before deployment, as well as safety, observability, and feedback mechanisms for LLMs in production), on AI safety, alignment, observability, and trustworthiness, and on the technical strategy, innovation, and thought leadership for Fiddler. Prior to that, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives on the Amazon AI platform and shaped new initiatives such as Amazon SageMaker Clarify from inception to launch. Prior to joining Amazon, he led similar efforts on the LinkedIn AI team and served as LinkedIn's representative on Microsoft's AI and Ethics in Engineering and Research (AETHER) Advisory Board. Previously, he was a Researcher at Microsoft Research Silicon Valley Lab. Krishnaram received his Ph.D. in Computer Science from Stanford University in 2006. He serves regularly on the senior program committees of FAccT, KDD, WWW, WSDM, and related conferences, and co-chaired the 2014 ACM Symposium on Computing for Development. His work has been recognized through awards at NAACL, WWW, SODA, CIKM, the ICML AutoML workshop, and Microsoft's AI/ML conference (MLADS). He has published 60+ papers with 7000+ citations and filed 150+ patents (72 granted). He has presented tutorials on trustworthy generative AI, privacy, fairness, explainable AI, model monitoring, and responsible AI at forums such as ICML, KDD, WSDM, WWW, FAccT, and AAAI; given several invited industry talks; and instructed a course on responsible AI at Stanford.

Title: NSF funding opportunities for AI researchers


Abstract: This talk will present NSF funding opportunities that can support research in artificial intelligence, in particular on aspects related to safety, accountability, and transparency, among others. The talk will briefly describe programs in AI and computational biology such as Safe Learning-Enabled Systems, Formal Methods in the Field, Molecular Foundations of Biotechnology, Smart Health, and Global Centers.

Short Bio: Sorin Drăghici is a program director in the Division of Information and Intelligent Systems (IIS) of the Directorate for Computer and Information Science and Engineering (CISE) at the National Science Foundation (NSF). Drăghici was elected a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) in 2022 for contributions to the analysis of high-throughput genomics and proteomics data. He has also been elected a Fellow of the Asia-Pacific Artificial Intelligence Association (AAIA). His research focuses on artificial intelligence, machine learning, and data mining techniques applied to bioinformatics and computational biology.

SCHEDULE

Sunday, 25 August 2024 (2:00 - 6:00 PM), Barcelona, Spain

  Opening
  2:00 - 2:05 PM

Introduction by organizers.

  Keynote Talk 1: Deploying Trustworthy Generative AI
  2:05 - 2:40 PM

Krishnaram Kenthapadi Chief Scientist, Clinical AI, Oracle Health

  Paper 19: Evaluation of Topic Continuity Using Nonlinearized Naive Bayes With Attention Mechanism
  2:40 - 2:55 PM

Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha and Qun Liu.

Presenter: Shu-Ting Pi

  Paper 3: PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
  2:55 - 3:10 PM

Xinchi Qiu, William F. Shen, Yihong Chen, Nicola Cancedda, Pontus Stenetorp and Nicholas Lane.

Presenter: William F. Shen

  Paper 20: DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
  3:10 - 3:25 PM

Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma and Naveed Janvekar.

Presenter: Joymallya Chakraborty

  Paper 24: Assessing Adversarial Robustness of Large Language Models: An Empirical Study
  3:25 - 3:40 PM

Zeyu Yang, Xiaochen Zheng, Zhao Meng and Roger Wattenhofer.

Presenter: Zeyu Yang

  Posters
  3:40 - 4:00 PM

TBD

  Coffee Break
  4:00 - 4:30 PM
  Keynote Talk 2: NSF funding opportunities for AI researchers
  4:30 - 5:05 PM

Sorin Draghici Program director in the Division of Information and Intelligent Systems (IIS) at NSF

  Paper 26: Metapath of thoughts: Verbalized Metapaths in Heterogeneous graph as Contextual Augmentation to LLM
  5:05 - 5:20 PM

Harshvardhan Solanki, Jyoti Singh, Yihui Chong and Ankur Teredesai.

Presenter: Ankur Teredesai

  Paper 21: What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain
  5:20 - 5:35 PM

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng and Thomas Kober.

Presenter: Antonis Maronikolakis

  Paper 6: VERA: Validation and Evaluation of Retrieval-Augmented Systems
  5:35 - 5:50 PM

Tianyu Ding, Adi Banerjee, Yunhong Li, Laurent Mombaerts, Tarik Borogovac and Juan Pablo De la Cruz Weinstein.

Presenter: Adi Banerjee

  Closing
  5:50 - 6:00 PM

Closing by organizers.


Accepted Papers

7 Oral presentations & 17 Posters

  Paper 14: Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
  Paper Attachment

Joshua Ward, Chi-Hua Wang and Guang Cheng.

  Paper 15: Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data
  Paper Attachment

Yu Xia, Chi-Hua Wang, Joshua Mabry and Guang Cheng.

  Paper 19: Evaluation of Topic Continuity Using Nonlinearized Naive Bayes With Attention Mechanism
  Paper Attachment

Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha and Qun Liu.

  Paper 20: DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
  Paper Attachment

Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma and Naveed Janvekar.

  Paper 24: Assessing Adversarial Robustness of Large Language Models: An Empirical Study
  Paper Attachment

Zeyu Yang, Xiaochen Zheng, Zhao Meng and Roger Wattenhofer.

  Paper 27: Trustworthiness in Medical Product Question Answering by Large Language Models
  Paper Attachment

Daniel Lopez-Martinez.

  Paper 17: Assessment and Mitigation of Inconsistencies in LLM-based Evaluations
  Paper Attachment

Sarik Ghazarian, Yidong Zou, Swair Shah, Nanyun Peng, Anurag Beniwal, Christopher Potts and Narayanan Sadagopan.

  Paper 11: Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection
  Paper Attachment

Yong Xie, Karan Aggarwal, Aitzaz Ahmad and Stephen Lau.

  Paper 6: VERA: Validation and Evaluation of Retrieval-Augmented Systems
  Paper Attachment

Tianyu Ding, Adi Banerjee, Yunhong Li, Laurent Mombaerts, Tarik Borogovac and Juan Pablo De la Cruz Weinstein.

  Paper 7: How Stable is Stable Diffusion under Recursive InPainting (RIP)?
  Paper Attachment

Javier Conde, Miguel González, Gonzalo Martínez, Fernando Moral, Elena Merino-Gómez and Pedro Reviriego.

  Paper 18: 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances
  Paper Attachment

Lorenzo Pacchiardi, Lucy Cheke and José Hernández-Orallo.

  Paper 25: CodePatchLLM: Configuring code generation using a static analyzer
  Paper Attachment

Danil Shaikhelislamov, Mikhail Drobyshevskiy and Andrey Belevantsev.

  Paper 21: What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain
  Paper Attachment

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng and Thomas Kober.

  Paper 3: PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
  Paper Attachment

Xinchi Qiu, William F. Shen, Yihong Chen, Nicola Cancedda, Pontus Stenetorp and Nicholas Lane.

  Paper 5: FFT: Towards Evaluating Large Language Models with Factuality, Fairness, Toxicity
  Paper Attachment

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang and Tingwen Liu.

  Paper 10: Cost-Effective Hallucination Detection for LLMs
  Paper Attachment

Simon Valentin, Jinmiao Fu, Gianluca Detommaso, Shaoyuan Xu, Giovanni Zappella and Bryan Wang.

  Paper 16: An Evaluation Benchmark for Generative AI In Security Domain
  Paper Attachment

Mina Ghashami, Mikhail Kuznetsov, Vianne Gao, Ganyu Teng, Phil Wallis, Joseph Xie, Ali Torkamani, Baris Coskun and Wei Ding.

  Paper 26: Metapath of thoughts: Verbalized Metapaths in Heterogeneous graph as Contextual Augmentation to LLM
  Paper Attachment

Harshvardhan Solanki, Jyoti Singh, Yihui Chong and Ankur Teredesai.

Call for Contributions

  • Link to the submission website: [link]
  • This workshop aims to serve as a pivotal platform for discussing the forefront of advances in Generative AI trustworthiness and evaluation. Generative AI models, such as Large Language Models (LLMs) and Diffusion Models, have revolutionized various domains, underscoring the critical need for reliable Generative AI technologies. As these models increasingly influence decision-making processes, establishing robust evaluation metrics and methods becomes paramount. Our objective is to delve into diverse evaluation strategies to enhance the reliability of Generative AI models across applications. The workshop topics include, but are not limited to:

    • Holistic Evaluation: Covering datasets, metrics, and methodologies
    • Trustworthiness in Generative AI Models:
      • Truthfulness: counteracting misinformation, hallucination, inconsistency, sycophancy in responses, and adversarial factuality.
      • Ensuring Safety and Security: privacy concerns, preventing harmful and toxic content.
      • Addressing Bias and Fairness.
      • Ethical Considerations: social norm alignment, compliance with values, regulations, and laws.
      • Privacy: privacy awareness and privacy leakage.
      • Enhancing misuse resistance, explainability, and robustness.
    • User-Centric Assessment.
    • Multi-perspective Evaluation: Emphasizing logical reasoning, knowledge depth, problem-solving, and user alignment.
    • Cross-Modal Evaluation: Integrating text, image, audio, etc.

    The workshop is designed to convene researchers from the realms of machine learning, data mining, and beyond, fostering interdisciplinary exploration of Generative AI trustworthiness and evaluation. By featuring a blend of invited talks, presentations of peer-reviewed papers, and panel discussions, this workshop aims to facilitate the exchange of insights and foster collaborations across research and industry sectors. Participants from diverse fields such as Data Mining, Machine Learning, Natural Language Processing (NLP), and Information Retrieval are encouraged to share knowledge, debate challenges, and explore synergies, thereby advancing the state of the art in Generative AI technologies.

    Submission Guidelines

    • Paper submissions are limited to 9 pages, excluding references; they must be in PDF format and use the ACM Conference Proceedings template (two-column format).
    • Additional supplemental material focused on reproducibility may be provided. Proofs, pseudo-code, and code may also be included in the supplement, which has no explicit page limit. The supplement may be formatted in either single-column or double-column layout. The paper should be self-contained, since reviewers are not required to read the supplement.
    • The Word template guideline can be found here: [link]
    • The Latex/overleaf template guideline can be found here: [link]
    • Submissions will be judged for quality and relevance through single-blind review.
    • Papers should be submitted in PDF format through EasyChair at the following link: [link]