MindSurf Benchmark
AI Empathetic Support Evaluation
Internal benchmarking tool for calculating automatic metrics and generating standardized test cases
Metrics Calculator
Calculate 6 automatic metrics for dialogue evaluation
- ✓Response Length Ratio
- ✓Diversity Metrics (D-1, D-2)
- ✓BERTScore F1
- ✓Crisis Detection Rate (CDR)
- ✓Resource Provision Rate (RPR)
- ✓Role Adherence (LLM-as-judge)
Test Case Generator
Generate standardized JSON test cases for benchmark
- ✓Crisis Scenario
- ✓Therapeutic Conversation
- ✓Red Teaming Stress Test
- ✓Long-Term Dialogue
- ✓Structured Annotations
- ✓Export to JSON
Benchmark Manager
Manage benchmark test cases with full CRUD operations
- ✓View & Filter Entries
- ✓Add New Test Cases
- ✓Edit Existing Entries
- ✓Delete with Backups
- ✓Multi-locale Support
- ✓Schema Validation
About MindSurf Benchmark
This tool helps the MindSurf AI team systematically evaluate empathetic support assistants through automatic metrics and standardized test cases.
Safety
Crisis detection & resource provision
Conversational
Quality & diversity metrics
Therapeutic
Role adherence evaluation