MindSurf Benchmark

AI Empathetic Support Evaluation

Internal benchmarking tool for calculating automatic metrics and generating standardized test cases

Metrics Calculator
Calculate 6 automatic metrics for dialogue evaluation
  • Response Length Ratio
  • Diversity Metrics (D-1, D-2)
  • BERTScore F1
  • Crisis Detection Rate (CDR)
  • Resource Provision Rate (RPR)
  • Role Adherence (LLM-as-judge)
Open Calculator
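As a rough illustration of the two simplest metrics listed above, the sketch below computes Distinct-n (D-1, D-2) as the ratio of unique to total n-grams, plus a response length ratio. Whitespace tokenization, lowercasing, and the definition of the length ratio as response tokens over reference tokens are assumptions made for illustration. The function names are hypothetical and may not match what the Metrics Calculator actually does.

```python
def distinct_n(responses, n):
    """Distinct-n: unique n-grams divided by total n-grams over all responses."""
    total = 0
    unique = set()
    for text in responses:
        # Assumption: lowercase + whitespace tokenization.
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # Return 0.0 when there are no n-grams at all (empty input or short texts).
    return len(unique) / total if total else 0.0


def length_ratio(response, reference):
    """Assumed definition: response token count over reference token count."""
    ref_len = len(reference.split())
    return len(response.split()) / ref_len if ref_len else 0.0
```

For example, `distinct_n(["i hear you", "i hear that"], 2)` gives 0.75, since three of the four bigrams are unique. Lower values suggest the assistant is repeating the same phrases across responses.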
Test Case Generator
Generate standardized JSON test cases for the benchmark
  • Crisis Scenario
  • Therapeutic Conversation
  • Red Teaming Stress Test
  • Long-Term Dialogue
  • Structured Annotations
  • Export to JSON
Open Generator
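To make the idea of a standardized JSON test case more concrete, here is a minimal sketch of how one might be assembled and exported. Every field name (`id`, `scenario`, `locale`, `turns`, `annotations`), the scenario identifiers, and both helper functions are hypothetical; the real schema used by the Test Case Generator is not shown here and may differ.

```python
import json

# Hypothetical identifiers, one per scenario type listed above.
SCENARIOS = {"crisis", "therapeutic", "red_teaming", "long_term"}


def build_test_case(case_id, scenario, locale, turns, annotations=None):
    """Assemble one test case as a plain dict (all field names are assumed)."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if locale not in {"es", "en"}:
        raise ValueError(f"unsupported locale: {locale!r}")
    return {
        "id": case_id,
        "scenario": scenario,
        "locale": locale,
        # Each turn is given as a (role, text) pair.
        "turns": [{"role": role, "text": text} for role, text in turns],
        "annotations": annotations or {},
    }


def export_json(cases):
    """Serialize test cases; ensure_ascii=False keeps Spanish text readable."""
    return json.dumps(cases, ensure_ascii=False, indent=2)
```

Rejecting unknown scenarios and locales at construction time is one simple way to keep exported files consistent before any fuller schema validation runs.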
Benchmark Manager
Manage benchmark test cases with full CRUD operations
  • View & Filter Entries
  • Add New Test Cases
  • Edit Existing Entries
  • Delete with Backups
  • Multi-locale Support
  • Schema Validation
Open Manager

About MindSurf Benchmark

This tool helps the MindSurf AI team systematically evaluate empathetic support assistants through automatic metrics and standardized test cases.

  • Safety: crisis detection and resource provision
  • Conversational: quality and diversity metrics
  • Therapeutic: role adherence evaluation

MindSurf Benchmark Application - Internal Tool for the MindSurf AI Team

Bilingual Support (Spanish/English) • 6 Automatic Metrics • 4 Test Scenarios