Judges
YESciEval provides two pre-trained judge models for evaluating scientific text syntheses, each fine-tuned on a different domain and dataset:
Ask Judge: A multidisciplinary YESciEval judge fine-tuned on the ORKGSyn dataset from the Open Research Knowledge Graph.
BioASQ Judge: A biomedical YESciEval judge fine-tuned on the BioASQ dataset from the BioASQ challenge.
Hint
Available YESciEval judges on 🤗 Hugging Face:
Using YESciEval Judges
The following example demonstrates how to create an evaluation rubric, load a judge model, and evaluate an answer.
from yescieval import AutoJudge
from yescieval.rubric.pointwise.stylistic import Readability
# Source papers: titles mapped to their content
papers = {
    "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
    "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
    "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
    "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine.",
}
# Input question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
    "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
    "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)
# Step 1: Create a rubric
rubric = Readability(papers=papers, question=question, answer=answer)
instruction_prompt = rubric.instruct()
# Step 2: Load the evaluation model (judge)
judge = AutoJudge()
judge.from_pretrained(
    model_id="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
    token="your_huggingface_token",
    device="cpu",
)
# Step 3: Evaluate the answer
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
Specialized Judges vs. Custom Models
| Class Name | Description |
|---|---|
| AutoJudge | Base class for loading and running evaluation models (judges) with PEFT adapters. |
| AskAutoJudge | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
| BioASQAutoJudge | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
Unlike AutoJudge, the specialized AskAutoJudge and BioASQAutoJudge classes have their own predefined model paths on Hugging Face, which makes it easier to load the respective domain-specific models.
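The idea of a predefined model path can be illustrated with a small, library-independent sketch. The classes `SketchJudge` and `SketchAskJudge` below are hypothetical stand-ins, not YESciEval's implementation: they only record which model ID would be loaded, so nothing is downloaded. The sketch assumes a specialized judge simply falls back to a class-level default when no model ID is passed.

```python
from typing import Optional


class SketchJudge:
    """Hypothetical stand-in for a generic judge; not the YESciEval class."""

    # Subclasses may set a preset model ID; the generic judge has none.
    default_model_id: Optional[str] = None

    def __init__(self) -> None:
        self.model_id: Optional[str] = None

    def from_pretrained(self, model_id: Optional[str] = None) -> "SketchJudge":
        # Fall back to the class-level preset when no ID is passed.
        resolved = model_id or self.default_model_id
        if resolved is None:
            raise ValueError("model_id is required when no preset is defined")
        # A real judge would load weights here; this sketch only records the ID.
        self.model_id = resolved
        return self


class SketchAskJudge(SketchJudge):
    """Hypothetical specialized judge with a preset model path."""

    default_model_id = "SciKnowOrg/YESciEval-ASK-Llama-3.1-8B"


generic = SketchJudge().from_pretrained(model_id="Qwen/Qwen3-8B")
preset = SketchAskJudge().from_pretrained()
print(generic.model_id)
print(preset.model_id)
```

In this sketch the generic class must always be given a model ID, while the specialized subclass can be loaded without one.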
Custom Judge
The CustomAutoJudge class lets you load any compatible LLM from Hugging Face by specifying its model ID. This makes it possible to use any pre-trained or fine-tuned model with YESciEval, beyond the default specialized judges.
For example, you can load a model and evaluate a rubric like this:
# Assumption: CustomAutoJudge is exported from the top-level package, like AutoJudge
from yescieval import CustomAutoJudge
# Initialize and load a custom model by specifying its Hugging Face model ID
judge = CustomAutoJudge()
judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
# Evaluate the rubric using the loaded model
result = judge.judge(rubric=rubric)
print(result)
This approach gives you full control over which model is used for evaluation, as long as the model is a compatible LLM available on Hugging Face.
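Because every judge shown here is called through the same `judge(rubric=...)` method, the same rubric can be run through several judges and the outputs collected side by side. The following is a minimal sketch of that pattern. `compare_judges` and the `EchoJudge` stub are hypothetical names introduced only for illustration; the stub replaces a real loaded model so the example runs offline.

```python
from typing import Any, Dict


def compare_judges(judges: Dict[str, Any], rubric: Any) -> Dict[str, Any]:
    """Run one rubric through several judges and collect results by name."""
    return {name: judge.judge(rubric=rubric) for name, judge in judges.items()}


class EchoJudge:
    """Hypothetical stub used only to make this sketch runnable offline."""

    def __init__(self, label: str) -> None:
        self.label = label

    def judge(self, rubric: Any) -> str:
        # A real judge would return a model-generated evaluation here.
        return f"{self.label} evaluated: {rubric}"


results = compare_judges(
    {"judge_a": EchoJudge("A"), "judge_b": EchoJudge("B")},
    rubric="readability rubric",
)
for name, output in results.items():
    print(name, "->", output)
```

In practice the dictionary values would be judge objects that have already been loaded with `from_pretrained`, and `rubric` would be a rubric instance rather than a string.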
GPT Custom Judge
The GPTCustomAutoJudge class provides a generic, flexible interface to evaluate scientific syntheses using OpenAI GPT models.
You can use it to evaluate a rubric by providing your OpenAI API key and specifying the model ID:
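import os
# Assumption: GPTCustomAutoJudge is exported from the top-level package, like AutoJudge
from yescieval import GPTCustomAutoJudge
# Read the OpenAI API key from the environment (the variable name is an assumption)
OPEN_AI_API_KEY = os.environ["OPENAI_API_KEY"]
# Initialize and load a custom model by specifying the GPT model ID
judge = GPTCustomAutoJudge()
judge.from_pretrained(model_id="gpt-5.2", token=OPEN_AI_API_KEY)
# Evaluate the rubric using the loaded model
result = judge.judge(rubric=rubric)
print(result.model_dump())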
The output will be in the following format:
{
"rating": rating-value,
"rationale": "rationale-text"
}
This allows you to leverage the capabilities of OpenAI's GPT models for scientific text evaluation.
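When several answers are evaluated, the dictionaries in the format above can be checked and summarized with plain Python. The sketch below assumes that `rating` is numeric, which the format shown does not state explicitly. The helper names `validate_result` and `mean_rating` are hypothetical, and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Union

Result = Dict[str, Union[int, float, str]]


def validate_result(result: Result) -> Result:
    """Check that a result has both expected keys and a numeric rating."""
    missing = {"rating", "rationale"} - result.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    rating = result["rating"]
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(rating, bool) or not isinstance(rating, (int, float)):
        raise TypeError("rating must be numeric")
    return result


def mean_rating(results: List[Result]) -> float:
    """Average the ratings of several validated results."""
    return mean(validate_result(r)["rating"] for r in results)


# Made-up values for illustration only; not real judge output.
sample = [
    {"rating": 4, "rationale": "Clear and well organized."},
    {"rating": 3, "rationale": "Readable, with some dense sentences."},
]
print(mean_rating(sample))
```

Validating before averaging means a malformed result raises an error instead of silently skewing the summary.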