GenAI Evaluation KDD 2024

KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Held in conjunction with KDD'24


Welcome to GenAI Evaluation KDD 2024!

The landscape of machine learning and artificial intelligence has been profoundly reshaped by the advent of Generative AI models and their applications, such as ChatGPT, GPT-4, and Sora. Generative AI includes Large Language Models (LLMs) such as GPT, Claude, Flan-T5, Falcon, and Llama, as well as generative diffusion models. These models have not only showcased unprecedented capabilities but also catalyzed transformative shifts across numerous fields. Concurrently, there is burgeoning interest in the comprehensive evaluation of Generative AI models, as evidenced by pioneering research benchmarks and frameworks for LLMs such as PromptBench, BotChat, OpenCompass, and MINT. Despite these advancements, accurately assessing the trustworthiness, safety, and ethical congruence of Generative AI models continues to pose significant challenges. This underscores an urgent need to develop robust evaluation frameworks that can ensure these technologies are reliable and can be seamlessly integrated into society in a beneficial manner. Our workshop is dedicated to fostering interdisciplinary collaboration and innovation in this vital area, focusing on the development of new datasets, metrics, methods, and models that can advance our understanding and application of Generative AI.

Contact: kdd2024-ws-genai-eval@amazon.com

Title: Deploying Trustworthy Generative AI


Abstract: While generative AI models and applications have huge potential across different industries, their successful commercial deployment requires addressing several ethical, trustworthiness, and safety considerations. These concerns include domain-specific evaluation; hallucinations, truthfulness, and grounding; safety and alignment; bias and fairness; robustness and security; privacy, unlearning, and copyright implications; calibration and confidence; and transparency. In this talk, we first motivate the need for adopting responsible AI principles when developing and deploying large language models (LLMs), text-to-image models, and other generative AI models, and provide a roadmap for thinking about responsible AI and AI observability for generative AI in practice. Focusing on real-world generative AI use cases (e.g., evaluating LLMs for robustness, security, and bias, especially in health AI applications and in user-facing and enterprise-internal chatbot settings), we present practical solution approaches and guidelines for applying responsible AI techniques effectively, and discuss lessons learned from deploying responsible AI approaches for generative AI applications in practice. This talk will be based on our KDD'24 LLM grounding and evaluation tutorial and last year's ICML/KDD/FAccT trustworthy generative AI tutorial.

Short Bio: Krishnaram Kenthapadi is the Chief Scientist, Clinical AI at Oracle Health, where he leads the AI initiatives for the Clinical Digital Assistant and other Oracle Health products. Previously, as the Chief AI Officer & Chief Scientist of Fiddler AI, he led initiatives on generative AI (e.g., Fiddler Auditor, an open-source library for evaluating and red-teaming LLMs before deployment, as well as safety, observability, and feedback mechanisms for LLMs in production), on AI safety, alignment, observability, and trustworthiness, and on the technical strategy, innovation, and thought leadership for Fiddler. Prior to that, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives on the Amazon AI platform and shaped new initiatives such as Amazon SageMaker Clarify from inception to launch. Prior to joining Amazon, he led similar efforts on the LinkedIn AI team and served as LinkedIn's representative on Microsoft's AI and Ethics in Engineering and Research (AETHER) Advisory Board. Previously, he was a Researcher at Microsoft Research Silicon Valley Lab. Krishnaram received his Ph.D. in Computer Science from Stanford University in 2006. He serves regularly on the senior program committees of FAccT, KDD, WWW, WSDM, and related conferences, and co-chaired the 2014 ACM Symposium on Computing for Development. His work has been recognized through awards at NAACL, WWW, SODA, CIKM, the ICML AutoML workshop, and Microsoft's AI/ML conference (MLADS). He has published 60+ papers with 7000+ citations and filed 150+ patents (72 granted). He has presented tutorials on trustworthy generative AI, privacy, fairness, explainable AI, model monitoring, and responsible AI at forums such as ICML, KDD, WSDM, WWW, FAccT, and AAAI; given several invited industry talks; and instructed a course on responsible AI at Stanford.

Title: NSF funding opportunities for AI researchers


Abstract: This talk will present NSF funding opportunities that can support research in artificial intelligence, in particular on aspects related to safety, accountability, and transparency, among others. The talk will briefly describe programs in AI and computational biology such as Safe Learning-Enabled Systems, Formal Methods in the Field, Molecular Foundations of Biotechnology, Smart Health, and Global Centers.

Short Bio: Sorin Drăghici is a program director in the Division of Information and Intelligent Systems (IIS) of the Directorate for Computer and Information Science and Engineering (CISE) at the National Science Foundation (NSF). Drăghici was elected a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) in 2022 for contributions to the analysis of high-throughput genomics and proteomics data. He has also been elected a Fellow of the Asia-Pacific Artificial Intelligence Association (AAIA). His research focuses on artificial intelligence, machine learning, and data mining techniques applied to bioinformatics and computational biology.

SCHEDULE

Sunday, 25 August 2024 (2:00 - 6:00 PM), Barcelona, Spain

  Opening
  2:00 - 2:05 PM

Introduction by organizers.

  Keynote Talk 1: Deploying Trustworthy Generative AI
  2:05 - 2:40 PM

Krishnaram Kenthapadi Chief Scientist, Clinical AI, Oracle Health

  Paper 19: Evaluation of Topic Continuity Using Nonlinearized Naive Bayes With Attention Mechanism
  2:40 - 2:55 PM

Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha and Qun Liu.

Presenter: Shu-Ting Pi

  Paper 3: PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
  2:55 - 3:10 PM

Xinchi Qiu, William F. Shen, Yihong Chen, Nicola Cancedda, Pontus Stenetorp and Nicholas Lane.

Presenter: William F. Shen

  Paper 20: DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
  3:10 - 3:25 PM

Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma and Naveed Janvekar.

Presenter: Joymallya Chakraborty

  Paper 24: Assessing Adversarial Robustness of Large Language Models: An Empirical Study
  3:25 - 3:40 PM

Zeyu Yang, Xiaochen Zheng, Zhao Meng and Roger Wattenhofer.

Presenter: Zeyu Yang

  Posters
  3:40 - 4:00 PM

TBD

  Coffee Break
  4:00 - 4:30 PM
  Keynote Talk 2: NSF funding opportunities for AI researchers
  4:30 - 5:05 PM

Sorin Draghici Program director in the Division of Information and Intelligent Systems (IIS) at NSF

  Paper 26: Metapath of thoughts: Verbalized Metapaths in Heterogeneous graph as Contextual Augmentation to LLM
  5:05 - 5:20 PM

Harshvardhan Solanki, Jyoti Singh, Yihui Chong and Ankur Teredesai.

Presenter: Ankur Teredesai

  Paper 21: What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain
  5:20 - 5:35 PM

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng and Thomas Kober.

Presenter: Antonis Maronikolakis

  Paper 6: VERA: Validation and Evaluation of Retrieval-Augmented Systems
  5:35 - 5:50 PM

Tianyu Ding, Adi Banerjee, Yunhong Li, Laurent Mombaerts, Tarik Borogovac and Juan Pablo De la Cruz Weinstein.

Presenter: Adi Banerjee

  Closing
  5:50 - 6:00 PM

Closing by organizers.


Accepted Papers

7 Oral presentations & 17 Posters

  Paper 14: Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
  Paper Attachment

Joshua Ward, Chi-Hua Wang and Guang Cheng.

  Paper 15: Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data
  Paper Attachment

Yu Xia, Chi-Hua Wang, Joshua Mabry and Guang Cheng.

  Paper 19: Evaluation of Topic Continuity Using Nonlinearized Naive Bayes With Attention Mechanism
  Paper Attachment

Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha and Qun Liu.

  Paper 20: DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
  Paper Attachment

Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma and Naveed Janvekar.

  Paper 24: Assessing Adversarial Robustness of Large Language Models: An Empirical Study
  Paper Attachment

Zeyu Yang, Xiaochen Zheng, Zhao Meng and Roger Wattenhofer.

  Paper 27: Trustworthiness in Medical Product Question Answering by Large Language Models
  Paper Attachment

Daniel Lopez-Martinez.

  Paper 17: Assessment and Mitigation of Inconsistencies in LLM-based Evaluations
  Paper Attachment

Sarik Ghazarian, Yidong Zou, Swair Shah, Nanyun Peng, Anurag Beniwal, Christopher Potts and Narayanan Sadagopan.

  Paper 11: Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection
  Paper Attachment

Yong Xie, Karan Aggarwal, Aitzaz Ahmad and Stephen Lau.

  Paper 6: VERA: Validation and Evaluation of Retrieval-Augmented Systems
  Paper Attachment

Tianyu Ding, Adi Banerjee, Yunhong Li, Laurent Mombaerts, Tarik Borogovac and Juan Pablo De la Cruz Weinstein.

  Paper 7: How Stable is Stable Diffusion under Recursive InPainting (RIP)?
  Paper Attachment

Javier Conde, Miguel González, Gonzalo Martínez, Fernando Moral, Elena Merino-Gómez and Pedro Reviriego.

  Paper 18: 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances
  Paper Attachment

Lorenzo Pacchiardi, Lucy Cheke and José Hernández-Orallo.

  Paper 25: CodePatchLLM: Configuring code generation using a static analyzer
  Paper Attachment

Danil Shaikhelislamov, Mikhail Drobyshevskiy and Andrey Belevantsev.

  Paper 21: What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain
  Paper Attachment

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng and Thomas Kober.

  Paper 3: PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
  Paper Attachment

Xinchi Qiu, William F. Shen, Yihong Chen, Nicola Cancedda, Pontus Stenetorp and Nicholas Lane.

  Paper 5: FFT: Towards Evaluating Large Language Models with Factuality, Fairness, Toxicity
  Paper Attachment

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang and Tingwen Liu.

  Paper 10: Cost-Effective Hallucination Detection for LLMs
  Paper Attachment

Simon Valentin, Jinmiao Fu, Gianluca Detommaso, Shaoyuan Xu, Giovanni Zappella and Bryan Wang.

  Paper 16: An Evaluation Benchmark for Generative AI In Security Domain
  Paper Attachment

Mina Ghashami, Mikhail Kuznetsov, Vianne Gao, Ganyu Teng, Phil Wallis, Joseph Xie, Ali Torkamani, Baris Coskun and Wei Ding.

  Paper 26: Metapath of thoughts: Verbalized Metapaths in Heterogeneous graph as Contextual Augmentation to LLM
  Paper Attachment

Harshvardhan Solanki, Jyoti Singh, Yihui Chong and Ankur Teredesai.

Call for Contributions

  • Link to the submission website: [link]
  • This workshop aims to serve as a pivotal platform for discussing the forefront of advances in Generative AI trustworthiness and evaluation. Generative AI models, such as Large Language Models (LLMs) and Diffusion Models, have revolutionized various domains, underscoring the critical need for reliable Generative AI technologies. As these models increasingly influence decision-making processes, establishing robust evaluation metrics and methods becomes paramount. Our objective is to delve into diverse evaluation strategies to enhance the reliability of Generative AI models across applications. The workshop topics include, but are not limited to:

    • Holistic Evaluation: Covering datasets, metrics, and methodologies
    • Trustworthiness in Generative AI Models:
      • Truthfulness: counteracting misinformation, hallucination, inconsistency, sycophancy in responses, and adversarial factuality.
      • Ensuring Safety and Security: privacy concerns, preventing harmful and toxic content.
      • Addressing Bias and Fairness.
      • Ethical Considerations: social norm alignment, compliance with values, regulations, and laws.
      • Privacy: privacy awareness and privacy leakage.
      • Enhancing misuse resistance, explainability, and robustness.
    • User-Centric Assessment.
    • Multi-perspective Evaluation: Emphasizing logical reasoning, knowledge depth, problem-solving, and user alignment.
    • Cross-Modal Evaluation: Integrating text, image, audio, etc.

    The workshop is designed to convene researchers from the realms of machine learning, data mining, and beyond, fostering interdisciplinary exploration of Generative AI trustworthiness and evaluation. By featuring a blend of invited talks, presentations of peer-reviewed papers, and panel discussions, this workshop aims to facilitate the exchange of insights and foster collaborations across research and industry sectors. Participants from diverse fields such as Data Mining, Machine Learning, Natural Language Processing (NLP), and Information Retrieval are encouraged to share knowledge, debate challenges, and explore synergies, thereby advancing the state of the art in Generative AI technologies.

    Submission Guidelines

    • Paper submissions are limited to 9 pages, excluding references; they must be in PDF format and use the ACM Conference Proceedings template (two-column format).
    • Additional supplemental material focused on reproducibility may be provided. Proofs, pseudo-code, and code may also be included in the supplement, which has no explicit page limit. The supplement may be formatted in either single-column or double-column layout. The paper should be self-contained, since reviewers are not required to read the supplement.
    • The Word template guideline can be found here: [link]
    • The Latex/overleaf template guideline can be found here: [link]
    • Submissions will be judged for quality and relevance through single-blind review.
    • Papers should be submitted in PDF format through EasyChair at the following link: [link]