Quickstart
YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and LLM-based judgment models. This guide walks you through evaluating an answer against a given rubric (e.g., Informativeness) with a pretrained or custom judge, and parsing the LLM output into structured JSON.
The following example shows how to run AskAutoJudge on the Informativeness rubric:
from yescieval import AskAutoJudge, GPTParser
from yescieval.rubric.pointwise.fidelity import Informativeness
# Sample papers in the form {"title": "abstract", ...}
papers = {
    "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
    "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
    "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
    "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}
# Input question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
    "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
    "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)
# Step 1: Create a rubric
rubric = Informativeness(papers=papers, question=question, answer=answer)
instruction_prompt = rubric.instruct()
# Step 2: Load the evaluation model (judge)
judge = AskAutoJudge()
judge.from_pretrained(token="your_huggingface_token", device="cpu")
# Step 3: Evaluate the answer
result = judge.judge(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
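The token and device passed to from_pretrained above are hardcoded for brevity. A minimal sketch of resolving them at runtime could look like the following. The HF_TOKEN variable name and the resolve_judge_settings helper are assumptions made for illustration and are not part of YESciEval; torch is only probed if it happens to be installed.

```python
import os


def resolve_judge_settings(env=None):
    """Build keyword arguments for judge.from_pretrained (hypothetical helper)."""
    env = os.environ if env is None else env

    # Assumed variable name; YESciEval itself does not mandate it.
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF_TOKEN to a Hugging Face token with model access.")

    device = "cpu"
    try:
        import torch  # optional dependency, used only to detect a GPU

        if torch.cuda.is_available():
            device = "cuda"
    except ImportError:
        pass  # torch is not installed, so stay on CPU

    return {"token": token, "device": device}
```

With this in place, the loading step could be written as judge.from_pretrained(**resolve_judge_settings()), which keeps the token out of the source file.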
Tip
- Ensure your Hugging Face token has access to the model (e.g., YESciEval-ASK-Llama-3.1-8B).
- Use device="cuda" if running on a GPU for better performance.
- Add more rubrics such as Informativeness, Relevancy, etc. for multi-criteria evaluation.
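The multi-criteria suggestion in the tip can be sketched as a small loop. This is an illustration only: evaluate_rubrics is a hypothetical helper, and it assumes every rubric class accepts the same papers, question, and answer arguments that Informativeness takes in the example above.

```python
def evaluate_rubrics(judge, rubric_classes, papers, question, answer):
    """Run one judge over several rubric classes and collect raw outputs by class name."""
    results = {}
    for rubric_cls in rubric_classes:
        # Assumption: each rubric is constructed the same way as Informativeness.
        rubric = rubric_cls(papers=papers, question=question, answer=answer)
        results[rubric_cls.__name__] = judge.judge(rubric=rubric)
    return results


# Usage, given the objects from the example above:
#   outputs = evaluate_rubrics(judge, [Informativeness], papers, question, answer)
# Further rubric classes can be appended to the list once they are imported.
```

Keying the results by class name keeps each raw output traceable to its rubric when the outputs are parsed later.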
Output Parser: If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.
from yescieval import GPTParser
raw_output = "` {rating: `4`, rational: The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine.} `"
parser = GPTParser(openai_key="your_openai_key")
parsed = parser.parse(raw_output=raw_output)
print("Parsed Output:")
print(parsed.model_dump())
The expected output format is:
{
"rating": 4,
"rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
}
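Since GPTParser needs an OpenAI key, it can be worth checking first whether the judge output is already valid JSON. The sketch below uses only the standard library; try_strict_json is a hypothetical helper, not a YESciEval function, and it only checks that the two expected keys are present.

```python
import json


def try_strict_json(raw_output):
    """Return a dict if raw_output is valid JSON with both expected keys, else None."""
    try:
        data = json.loads(raw_output)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (JSONDecodeError subclasses ValueError).
        return None
    if isinstance(data, dict) and {"rating", "rationale"} <= data.keys():
        return data
    return None
```

When the helper returns None, as it would for the loosely structured raw_output shown earlier, the text can be handed to parser.parse(raw_output=raw_output) instead.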
The parsed result follows the schema below. If you prefer not to call .model_dump(), you can read the fields directly as attributes: parsed.rating gives the rating value and parsed.rationale gives the textual explanation for the rating.
{
    'properties': {
        'rating': {
            'description': 'Rating from 1 to 5',
            'maximum': 5,
            'minimum': 1,
            'title': 'Rating',
            'type': 'integer'
        },
        'rationale': {
            'description': 'Textual explanation for the rating',
            'title': 'Rationale',
            'type': 'string'
        }
    },
    'required': ['rating', 'rationale'],
    'title': 'RubricLikertScale',
    'type': 'object'
}
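To make the constraints in this schema concrete, here is a standard-library stand-in that enforces the same rules: an integer rating from 1 to 5 and a string rationale. LikertResult is a hypothetical class written for illustration; it is not the library's RubricLikertScale model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LikertResult:
    """Illustrative stand-in mirroring the schema above (not the library's class)."""

    rating: int
    rationale: str

    def __post_init__(self):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.rating, bool) or not isinstance(self.rating, int):
            raise TypeError("rating must be an integer")
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        if not isinstance(self.rationale, str):
            raise TypeError("rationale must be a string")
```

A dictionary in the expected output format can then be checked with LikertResult(**data), which raises if the rating falls outside the 1 to 5 range.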
Hint
Key Components
| Component | Purpose |
|---|---|
| Informativeness | Defines rubric to evaluate relevance to source papers |
| AskAutoJudge | Loads and uses a judgment model to evaluate answers |
| GPTParser | Parses loosely formatted text from LLMs into JSON |