LLM Eval
NIH Text Evaluation
A text-evaluation workflow for NIH-style project material, built around a shared rubric, explicit data scope, badcase review, bias awareness, and human review.
Problem
Text-heavy evaluation is unreliable if reviewers and models do not share stable criteria. The real product problem is making judgment comparable and auditable.
Workflow
- 01 Scope the dataset: keep main data separate from summary-generation pipeline records.
- 02 Define rubric dimensions before asking for model output.
- 03 Use badcase flags for thin abstracts, institution halo, geography skew, missing funding context, and synthetic-label risk.
- 04 Keep final judgment in a human-review loop instead of treating model output as an expert decision.
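The human-review step can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `ReviewItem` class, its field names, and the status strings are all hypothetical. The point it shows is that model scores stay advisory until a named reviewer records a decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    """One evaluated text. All field names here are hypothetical."""
    item_id: str
    model_scores: dict                     # advisory rubric scores from the model
    flags: list                            # badcase flags raised for this item
    human_decision: Optional[str] = None   # set only by a human reviewer
    reviewer: Optional[str] = None

    @property
    def status(self) -> str:
        # Model output alone never finalizes an item.
        if self.human_decision is None:
            return "pending_human_review"
        return "final"


def record_human_decision(item: ReviewItem, reviewer: str, decision: str) -> ReviewItem:
    """Attach the human judgment that closes the loop."""
    if not reviewer:
        raise ValueError("a named reviewer is required")
    item.human_decision = decision
    item.reviewer = reviewer
    return item


item = ReviewItem("demo-001", {"methodology": 3}, ["thin_abstract"])
print(item.status)   # pending_human_review
record_human_decision(item, reviewer="reviewer_a", decision="needs more context")
print(item.status)   # final
```

Keeping the decision and the reviewer on the same record is one way to make each judgment auditable later.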
Evidence
Rubric structure
Five dimensions: scientific value, methodology, team, social impact, and resource use.
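A rubric like this can be pinned down in code before any model output is requested. The sketch below assumes a 1 to 5 integer scale and snake_case dimension keys; neither is stated in the project, and `validate_scores` is a hypothetical helper.

```python
# Dimension keys mirror the five dimensions listed above; the key spelling is an assumption.
RUBRIC_DIMENSIONS = (
    "scientific_value",
    "methodology",
    "team",
    "social_impact",
    "resource_use",
)
SCALE_MIN, SCALE_MAX = 1, 5  # assumed scoring scale


def validate_scores(scores: dict) -> dict:
    """Reject score sets that do not match the rubric exactly."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    extra = [d for d in scores if d not in RUBRIC_DIMENSIONS]
    if missing or extra:
        raise ValueError(f"rubric mismatch: missing={missing} extra={extra}")
    for dim, value in scores.items():
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"{dim}: expected an integer in {SCALE_MIN}..{SCALE_MAX}, got {value!r}")
    return dict(scores)


scores = validate_scores({
    "scientific_value": 4,
    "methodology": 3,
    "team": 4,
    "social_impact": 2,
    "resource_use": 3,
})
```

Rejecting partial or out-of-range score sets keeps reviewer and model scores comparable, since every record then covers the same dimensions on the same scale.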
Data scope note
Separates main data, pipeline-summary records, and enhanced-analysis samples.
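One way to keep the three groups from mixing is to tag every record with a scope and refuse anything untagged. The scope names and the `scope` and `id` fields below are assumptions for illustration, not the project's schema.

```python
from collections import defaultdict

# Scope names are assumptions that mirror the three groups named above.
SCOPES = ("main", "pipeline_summary", "enhanced_analysis")


def split_by_scope(records):
    """Group records by their 'scope' field; reject unknown or missing scopes."""
    groups = defaultdict(list)
    for record in records:
        scope = record.get("scope")
        if scope not in SCOPES:
            raise ValueError(f"record {record.get('id')!r} has unknown scope {scope!r}")
        groups[scope].append(record)
    # Return every scope, including empty ones, so counts are always reported.
    return {scope: groups[scope] for scope in SCOPES}


records = [
    {"id": "r1", "scope": "main"},
    {"id": "r2", "scope": "pipeline_summary"},
    {"id": "r3", "scope": "main"},
]
counts = {scope: len(items) for scope, items in split_by_scope(records).items()}
print(counts)  # {'main': 2, 'pipeline_summary': 1, 'enhanced_analysis': 0}
```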
Badcase checklist
Covers institution halo, geography skew, research-area skew, thin abstract, missing funding context, and synthetic-label risk.
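The checklist can be carried as a fixed set of flag names, with simple record-level checks for the flags that can be read off a single record. This is a hedged sketch: the 80-word threshold and the `abstract`, `funding_context`, and `label_source` fields are assumptions. Halo and skew flags depend on comparisons across the dataset, so here they are only declared and left for a reviewer to raise.

```python
from enum import Enum


class BadcaseFlag(str, Enum):
    INSTITUTION_HALO = "institution_halo"
    GEOGRAPHY_SKEW = "geography_skew"
    RESEARCH_AREA_SKEW = "research_area_skew"
    THIN_ABSTRACT = "thin_abstract"
    MISSING_FUNDING_CONTEXT = "missing_funding_context"
    SYNTHETIC_LABEL_RISK = "synthetic_label_risk"


THIN_ABSTRACT_MIN_WORDS = 80  # assumed threshold, not taken from the project


def record_level_flags(record: dict) -> list:
    """Raise the flags that can be checked from one record alone."""
    flags = []
    abstract = record.get("abstract") or ""
    if len(abstract.split()) < THIN_ABSTRACT_MIN_WORDS:
        flags.append(BadcaseFlag.THIN_ABSTRACT)
    if not record.get("funding_context"):
        flags.append(BadcaseFlag.MISSING_FUNDING_CONTEXT)
    if record.get("label_source") == "synthetic":
        flags.append(BadcaseFlag.SYNTHETIC_LABEL_RISK)
    return flags


demo = {"abstract": "Short text.", "funding_context": None, "label_source": "synthetic"}
print([flag.value for flag in record_level_flags(demo)])
```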
Sample report
A demo report format for input summary, rubric scores, flags, and human-review notes.
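As an illustration of that format, the four parts can be assembled into one JSON record. The key names, the `build_report` helper, and the status values are hypothetical, and the example values are placeholders rather than real results.

```python
import json


def build_report(item_id, input_summary, rubric_scores, flags, human_review_notes=""):
    """Assemble one demo report; a report without human notes stays pending."""
    reviewed = bool(human_review_notes.strip())
    return {
        "item_id": item_id,
        "input_summary": input_summary,
        "rubric_scores": rubric_scores,
        "flags": sorted(flags),
        "human_review_notes": human_review_notes,
        "status": "reviewed" if reviewed else "pending_human_review",
    }


report = build_report(
    item_id="demo-001",
    input_summary="Placeholder summary of one project abstract.",
    rubric_scores={"scientific_value": 4, "methodology": 3},
    flags=["thin_abstract", "missing_funding_context"],
)
print(json.dumps(report, indent=2))
```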
Boundary
- This does not replace NIH expert review.
- This does not claim a completed public benchmark.
- This does not prove model accuracy or commercial impact.
Role Mapping
- LLM Eval / data product: maps task, rubric, samples, and human review.
- Model strategy product: turns fuzzy quality into comparable evaluation language.
- AI product: explains model-output risk in terms that product and research teams can share.