ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

He, Chaoyue; Zhou, Xin; Wu, Yi; Yu, Xinjia; Zhang, Yan; Zhang, Lei; Wang, Di; Lyu, Shengfei; Xu, Hong; Wang, Xiaoqiao; Liu, Wei; Miao, Chunyan

doi:10.18653/v1/2025.emnlp-main.739

EMNLP 2025 Main Conference Oral | Suzhou, China, November 4-9, 2025 | Resource and Theme Paper Award nominations (Top 1%)

Chaoyue He¹, Xin Zhou¹,*, Yi Wu¹, Xinjia Yu¹, Yan Zhang¹, Lei Zhang¹, Di Wang¹, Shengfei Lyu¹, Hong Xu¹, Xiaoqiao Wang², Wei Liu², Chunyan Miao¹

¹ Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), Singapore; ² Alibaba Group, China

* Corresponding author

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

An expert, source-grounded benchmark for evaluating whether LLMs understand sustainability reporting, climate disclosure, governance, and standards-driven ESG reasoning.

Main ESGenius benchmark results: ESGenius evaluates 50 language models on 1,136 expert ESG and sustainability questions, with source-grounded references and question-level inspection support.

Abstract

Large language models are increasingly used for sustainability reporting, climate disclosure, and ESG analysis, yet their knowledge of specialized standards and source-dependent concepts remains difficult to evaluate systematically. ESGenius provides a 1,136-question multiple-choice benchmark covering environmental, social, governance, and sustainability knowledge across major reporting and climate frameworks. Each question follows an A-D answer protocol with a Z option for uncertainty, and the reference version includes source document metadata and supporting excerpts for audit or retrieval-augmented evaluation. The repository includes dataset files, evaluation scripts, published result figures, and an interactive 50-model heatmap for model-question diagnostics.

Benchmark Overview

1,136 expert ESG questions

50 evaluated models

7 framework families

A-D + Z answer protocol

Standards-aware scope

Covers sustainability reporting, climate disclosure, biodiversity, energy, governance, and ESG reasoning across IPCC, GRI, SASB, ISO, IFRS/ISSB, TCFD, and CDP sources.

Source-grounded references

The reference CSV preserves document names, page references, and supporting text snippets so answers can be audited or used in retrieval-aware experiments.

Diagnostic evaluation

Published figures summarize aggregate performance, while the full heatmap exposes per-question outcomes, invalid outputs, uncertainty, and missing responses.
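The outcome categories named above (correct, wrong, invalid, not sure, missing) can be illustrated with a minimal sketch. It is not the repository's evaluation script: the strict parsing rule, the function names `classify` and `summarize`, and the toy identifiers are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a response counts only if it is a bare option letter,
# optionally wrapped in parentheses or followed by "." or ":".
ANSWER_PATTERN = re.compile(r"\(?([ABCDZ])\)?[.:]?")


def classify(response, gold):
    """Map one raw model response to an outcome label."""
    if response is None or not response.strip():
        return "missing"
    match = ANSWER_PATTERN.fullmatch(response.strip().upper())
    if match is None:
        return "invalid"
    choice = match.group(1)
    if choice == "Z":
        return "not_sure"
    return "correct" if choice == gold else "wrong"


def summarize(responses, gold_answers):
    """Count outcomes over all questions and compute accuracy over the full set."""
    counts = Counter(
        classify(responses.get(query_id), gold)
        for query_id, gold in gold_answers.items()
    )
    total = len(gold_answers)
    accuracy = counts["correct"] / total if total else 0.0
    return counts, accuracy


# Toy identifiers and answers, not taken from the dataset.
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
raw = {"q1": "A", "q2": "(b).", "q3": "z", "q4": "I think it is D"}
counts, accuracy = summarize(raw, gold)
print(dict(counts), accuracy)
```

In this sketch, Z answers and invalid outputs stay in the denominator, so accuracy is computed over all questions. Whether the published results treat them the same way should be checked against the evaluation guide.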

Dataset

The canonical public dataset release is hosted on Hugging Face, with local mirrors retained in this GitHub repository. It includes plain CSV/JSON files for standard evaluation and a reference-aware CSV for source-grounded inspection or retrieval experiments.

Hugging Face dataset | Plain CSV | Plain JSON | Reference CSV | Dataset documentation

query_id: Stable question identifier

query: Question stem

A-D: Answer options

Z: Not sure option

ref_doc: Source document in reference file

source_text: Supporting excerpt in reference file
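As a rough illustration of how these fields could feed an evaluation prompt, the sketch below reads rows with Python's standard `csv` module and formats one question, optionally prepending the supporting excerpt for a retrieval-aware run. The column names `A`, `B`, `C`, `D`, and `Z`, the prompt wording, the helper names, and the sample row are assumptions, not the released format. Check the dataset documentation for the actual headers.

```python
import csv
import io

# Toy row, NOT taken from the dataset. Column names A, B, C, D, Z are assumed.
SAMPLE_CSV = """query_id,query,A,B,C,D,Z,ref_doc,source_text
q-demo,Placeholder question stem?,Option one,Option two,Option three,Option four,Not sure,demo.pdf,Placeholder excerpt.
"""


def load_rows(handle):
    """Read every row of an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))


def build_prompt(row, with_source=False):
    """Format one question; optionally prepend the supporting excerpt."""
    lines = []
    if with_source and row.get("source_text"):
        lines.append(f"Context ({row.get('ref_doc', 'source')}): {row['source_text']}")
    lines.append(f"Question: {row['query']}")
    for key in ("A", "B", "C", "D", "Z"):
        lines.append(f"{key}. {row[key]}")
    lines.append("Answer with a single letter (A, B, C, D, or Z).")
    return "\n".join(lines)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(build_prompt(rows[0], with_source=True))
```

To use a real file, pass `open(path, newline="", encoding="utf-8")` to `load_rows` instead of the in-memory sample.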

Results

A compact ranking view summarizes the 50-model evaluation. Detailed model-question behavior is available in the full interactive Plotly report.

Ranking bar chart: evaluated models on the balanced ESGenius question set.

Interactive Heatmap

Inspect every model-question outcome.

The full report covers 50 evaluated models across 1,136 ESGenius questions, sorted by model rank and question difficulty for fast error-pattern analysis.

Outcome legend: Correct | Wrong | Invalid | Not sure | Missing

Heatmap axes: 50 models ranked top to bottom; questions ordered from hardest to easiest.
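The ordering described for the heatmap can be sketched as follows. The page does not define rank or difficulty, so this sketch assumes that models are ranked by their number of correct answers and that question difficulty is the number of models answering correctly (fewest first). The function name `order_heatmap` and the toy outcome table are hypothetical.

```python
def order_heatmap(outcomes):
    """Order models best-first and questions hardest-first.

    outcomes maps model name -> {query_id: outcome label}.
    Assumption: only the label "correct" counts toward rank and difficulty.
    """
    questions = sorted({q for per_model in outcomes.values() for q in per_model})

    def correct_for_model(model):
        return sum(outcomes[model].get(q) == "correct" for q in questions)

    def models_solving(question):
        return sum(outcomes[m].get(question) == "correct" for m in outcomes)

    # Ties are broken alphabetically so the ordering is deterministic.
    ranked_models = sorted(outcomes, key=lambda m: (-correct_for_model(m), m))
    ordered_questions = sorted(questions, key=lambda q: (models_solving(q), q))
    return ranked_models, ordered_questions


# Toy outcome table, not real benchmark results.
toy = {
    "model_a": {"q1": "correct", "q2": "wrong", "q3": "correct"},
    "model_b": {"q1": "correct", "q2": "not_sure", "q3": "wrong"},
    "model_c": {"q1": "correct", "q2": "correct", "q3": "correct"},
}
ranked_models, ordered_questions = order_heatmap(toy)
print(ranked_models, ordered_questions)
```

Sorting both axes this way tends to push errors toward one corner of the matrix, which makes a question that a strong model misses stand out.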

Citation

If you use ESGenius, please cite the EMNLP 2025 paper and repository metadata.

BibTeX

@inproceedings{he-etal-2025-esgenius,
  title = "{ESG}enius: Benchmarking {LLM}s on Environmental, Social, and Governance ({ESG}) and Sustainability Knowledge",
  author = "He, Chaoyue and Zhou, Xin and Wu, Yi and Yu, Xinjia and Zhang, Yan and Zhang, Lei and Wang, Di and Lyu, Shengfei and Xu, Hong and Wang, Xiaoqiao and Liu, Wei and Miao, Chunyan",
  editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
  booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2025",
  address = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.emnlp-main.739/",
  doi = "10.18653/v1/2025.emnlp-main.739",
  pages = "14612--14653",
  ISBN = "979-8-89176-332-6"
}

Resources

Everything needed to inspect, reproduce, and cite the benchmark.

Read EMNLP 2025 paper | DOI record | Hugging Face dataset and code bundle | View GitHub repository | Evaluation guide | Dataset documentation | Apache 2.0 license