Quickstart

YESciEval is a library for evaluating the quality of synthesized scientific answers using predefined rubrics and LLM-based judge models. This guide walks you through evaluating an answer against a given rubric (e.g., informativeness) with a pretrained or custom judge, and parsing the LLM output into structured JSON.

The following example shows how to run AskAutoJudge on the Informativeness rubric:

from yescieval import AskAutoJudge, GPTParser
from yescieval.rubric.pointwise.fidelity import Informativeness

# Sample papers in the form {"title": "abstract", ...}
papers = {
    "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
    "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
    "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
    "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}

# Input question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
    "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
    "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)

# Step 1: Create a rubric
rubric = Informativeness(papers=papers, question=question, answer=answer)
instruction_prompt = rubric.instruct()

# Step 2: Load the evaluation model (judge)
judge = AskAutoJudge()
judge.from_pretrained(token="your_huggingface_token", device="cpu")

# Step 3: Evaluate the answer
result = judge.judge(rubric=rubric)

print("Raw Evaluation Output:")
print(result)
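The three steps above can be wrapped in a small helper when the same answer should be scored against several rubrics. This is a minimal sketch, not part of YESciEval: `evaluate_rubrics` is a hypothetical name, and it assumes every rubric class accepts the same `papers`, `question`, and `answer` keyword arguments as Informativeness and that the judge exposes `judge(rubric=...)` as shown above.

```python
from typing import Any, Dict, Iterable


def evaluate_rubrics(
    judge: Any,
    rubric_classes: Iterable[type],
    papers: Dict[str, str],
    question: str,
    answer: str,
) -> Dict[str, Any]:
    """Run one judge over several rubric classes and collect the raw outputs."""
    results: Dict[str, Any] = {}
    for rubric_cls in rubric_classes:
        # Assumption: same constructor keywords as the Informativeness example.
        rubric = rubric_cls(papers=papers, question=question, answer=answer)
        # Same call as Step 3; the output is stored unparsed, keyed by class name.
        results[rubric_cls.__name__] = judge.judge(rubric=rubric)
    return results
```

With the objects from the example, a call would look like `evaluate_rubrics(judge, [Informativeness], papers, question, answer)`; further rubric classes can be appended to the list once imported.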

Tip

  • Ensure your Hugging Face token has access to the model (e.g., YESciEval-ASK-Llama-3.1-8B).

  • Use device="cuda" when running on a GPU for better performance.

  • Add more rubrics (e.g., Relevancy) alongside Informativeness for multi-criteria evaluation.
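The token and device tips can be combined in a small helper that builds the arguments for `from_pretrained`. The sketch below is an illustration under stated assumptions, not part of YESciEval: `judge_settings` is a hypothetical name, reading the token from an `HF_TOKEN` environment variable is a choice made here, and GPU detection relies on PyTorch being importable (falling back to "cpu" when it is not).

```python
import os
from typing import Dict


def judge_settings(env_var: str = "HF_TOKEN") -> Dict[str, str]:
    """Build keyword arguments for judge.from_pretrained(token=..., device=...)."""
    # Assumption: the token is stored in an environment variable, not in code.
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to a Hugging Face token.")
    try:
        import torch  # optional dependency, used only to detect a GPU

        device = "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        device = "cpu"
    return {"token": token, "device": device}
```

The result can then be passed on as `judge.from_pretrained(**judge_settings())`, which keeps the token out of the script itself.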

Output Parser: If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.

from yescieval import GPTParser

raw_output = "` {rating: `4`, rational: The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine.} `"

parser = GPTParser(openai_key="your_openai_key")

parsed = parser.parse(raw_output=raw_output)

print("Parsed Output:")
print(parsed.model_dump())

The expected output format is:

{
  "rating": 4,
  "rationale": "The answer covers key aspects of how AI is applied in healthcare, such as diagnostics and personalized medicine."
}
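To make concrete what "parsing loosely structured text" involves, here is a rough regex-based sketch that handles the sample string above without calling any API. It is not how GPTParser works and is no substitute for it: `rough_parse` is a hypothetical helper, and it assumes the rating appears before the rationale and that the rationale runs to the end of the text.

```python
import re
from typing import Optional


def rough_parse(raw_output: str) -> Optional[dict]:
    """Best-effort extraction of a rating and rationale from loosely formatted text."""
    # A digit from 1 to 5 following the word "rating", skipping punctuation and backticks.
    rating = re.search(r"rating\W*([1-5])", raw_output, flags=re.IGNORECASE)
    # Accept both "rationale" and the misspelled "rational" as the label.
    label = re.search(r"rationale?\W*", raw_output, flags=re.IGNORECASE)
    if rating is None or label is None:
        return None
    # Everything after the label, minus surrounding braces, quotes, and backticks.
    rationale = raw_output[label.end():].strip(" \t\n`\"'{}")
    return {"rating": int(rating.group(1)), "rationale": rationale}
```

A pattern-based fallback like this breaks easily on unexpected phrasing, which is the case an LLM-backed parser is meant to cover.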

If you prefer not to use .model_dump(), you can read the fields directly from the parsed object: parsed.rating returns the rating value and parsed.rationale returns the textual explanation for the rating. The output schema is as follows:

{
        'properties': {
                'rating': {
                        'description': 'Rating from 1 to 5',
                        'maximum': 5,
                        'minimum': 1,
                        'title': 'Rating',
                        'type': 'integer'
                },
                'rationale': {
                        'description': 'Textual explanation for the rating',
                        'title': 'Rationale',
                        'type': 'string'
                }
        },
        'required': ['rating', 'rationale'],
        'title': 'RubricLikertScale',
        'type': 'object'
}
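The constraints in this schema (both fields required, an integer rating from 1 to 5, a string rationale) can also be checked by hand on a plain dict, for example on the result of `.model_dump()`. The following is a minimal standard-library sketch; `check_likert` is a hypothetical helper written for illustration and is not part of YESciEval.

```python
def check_likert(payload: dict) -> dict:
    """Validate a dict against the constraints listed in the schema above."""
    missing = [key for key in ("rating", "rationale") if key not in payload]
    if missing:
        raise ValueError(f"missing required field(s): {missing}")
    rating = payload["rating"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("rating must be an integer")
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    if not isinstance(payload["rationale"], str):
        raise ValueError("rationale must be a string")
    return payload
```

A call such as `check_likert(parsed.model_dump())` returns the dict unchanged when it is valid and raises `ValueError` otherwise.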

Hint

Key Components

Component         Purpose
----------------  -----------------------------------------------------
Informativeness   Defines rubric to evaluate relevance to source papers
AskAutoJudge      Loads and uses a judgment model to evaluate answers
GPTParser         Parses loosely formatted text from LLMs into JSON