
๐Ÿ† Mastra.Build Hackathon Evaluator

Automated, unbiased evaluation system for Mastra.Build hackathon submissions using advanced multi-agent workflows.

Watch Demo Video - See the evaluator in action!

This system revolutionizes hackathon judging by replacing subjective manual reviews with systematic, data-driven evaluation. Built specifically for the Mastra.Build hackathon, it automatically evaluates submitted projects, validates claimed features through live testing, and provides sponsor-aligned scoring with track eligibility detection.

Core Purpose

Problem: Mastra.Build hackathon judging requires evaluating diverse AI agent projects consistently across multiple sponsor prize categories.

Solution: An AI-powered evaluation pipeline that:

  • Extracts verifiable claims from project documentation and demo videos
  • Tests functionality through automated agent interactions
  • Scores objectively using standardized criteria across all submissions
  • Tags for sponsor tracks with automatic eligibility detection for Smithery, WorkOS, Browserbase, Arcade, Chroma, Recall, and Confident AI prizes
  • Ranks submissions with transparent, auditable results

Why This Matters for Mastra.Build

This system transforms Mastra.Build evaluation from "subjective demos" to "empirical validation":

  • Eliminates judging bias through systematic evaluation criteria
  • Validates AI agent functionality instead of relying on presentations alone
  • Automatically detects sponsor alignment for prize categories (MCP servers, auth integration, web browsing, etc.)
  • Scales to evaluate hundreds of Mastra framework submissions efficiently
  • Provides detailed feedback to help Mastra community members improve their agents

Novel Approach: Mastra Evaluating Mastra

Revolutionary Insight: This project demonstrates a groundbreaking approach to AI agent evaluation by using the Mastra framework to evaluate Mastra-built agents.

Self-Evaluation Architecture

Rather than building evaluation as a separate framework or external tool, we've created something unprecedented:

  • Mastra Agents Evaluating Mastra Agents: The evaluator itself is a sophisticated Mastra multi-agent workflow
  • Native Framework Integration: Deep understanding of Mastra patterns, conventions, and architectural decisions
  • Cross-Instance Communication: Uses official @mastra/client-js to programmatically test agents running in separate Mastra instances
  • Framework-Aware Testing: Inherent knowledge of Mastra workflows, tools, and agent patterns enables more intelligent evaluation

Why This Matters Beyond Hackathons

This "framework evaluating itself" approach represents a new paradigm in AI system assessment:

Traditional Approach ❌:

External Eval Tool → Tests → AI Framework Project

Our Novel Approach ✅:

Mastra Evaluator Agent → Tests → Mastra Target Agent
(Same framework, deep native understanding)

Unique Advantages

  • Native Intelligence: The evaluator inherently understands Mastra conventions, making evaluations more contextually accurate
  • Self-Improving Ecosystem: Insights from evaluations can directly improve the framework itself
  • Framework-Specific Metrics: Evaluation criteria tailored specifically to Mastra's multi-agent, workflow-oriented architecture
  • Proof of Concept: Demonstrates Mastra's capability to build sophisticated, production-ready evaluation systems

Industry First: To our knowledge, this is the first time a multi-agent framework has been used to systematically evaluate projects built with itself, showcasing both the maturity and self-reflective capabilities of the Mastra ecosystem.

๐Ÿ† Sponsor Prize Track Detection

A key differentiator of this evaluation system is automated sponsor alignment detection. The AI scorer analyzes project dependencies, functionality, and implementation patterns to identify eligibility for specific sponsor prize categories:

Mastra.Build Prize Categories

  • Best overall (judged by Mastra)
  • Best MCP server (judged by Smithery)
  • Bonus award: Best use of Smithery (Switch2)
  • Best use of AgentNetwork (judged by Mastra)
  • Best use of auth (judged by WorkOS)
  • Best use of web browsing (judged by Browserbase)
  • Best use of tool provider (judged by Arcade)
  • Best RAG template (judged by Chroma)
  • Best productivity (judged by Mastra)
  • Best coding agent (judged by Mastra)
  • Best crypto agent (judged by Recall)
  • Best use of Evals (judged by Confident AI)
  • Shane's favorite (judged by Shane)
  • Funniest (judged by Abhi)

Automated Tag Detection Process

The system automatically analyzes:

  • Package Dependencies: Detects @smithery/sdk, @workos/node, browserbase, @arcadeai/arcadejs, chromadb, etc. (a sketch of this pass follows below)
  • Code Patterns: Identifies authentication flows, web scraping, RAG implementations, MCP server structures
  • Documentation Keywords: Extracts mentions of sponsor technologies and use cases
  • Functionality Testing: Validates actual integration with sponsor services
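
For illustration, here is a minimal TypeScript sketch of the dependency-scanning pass. The DEP_TO_TAG table and detectSponsorTags helper are hypothetical names; the package-to-tag pairs are taken from the lists in this README.

// Hypothetical sketch: map package.json dependencies to sponsor-track tags.
import { readFile } from "node:fs/promises";
import path from "node:path";

const DEP_TO_TAG: Record<string, string> = {
  "@smithery/sdk": "eligible-smithery",
  "@workos/node": "eligible-workos",
  "browserbase": "eligible-browserbase",
  "@arcadeai/arcadejs": "eligible-arcade",
  "chromadb": "eligible-chroma",
};

export async function detectSponsorTags(projectDir: string): Promise<string[]> {
  const raw = await readFile(path.join(projectDir, "package.json"), "utf8");
  const pkg = JSON.parse(raw) as {
    dependencies?: Record<string, string>;
    devDependencies?: Record<string, string>;
  };
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  // A dependency match is one signal; code patterns and docs provide the rest.
  return Object.entries(DEP_TO_TAG)
    .filter(([dep]) => dep in deps)
    .map(([, tag]) => tag);
}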

Important Note: This system is designed to assist and accelerate the evaluation process, not replace human judgment. While it provides systematic analysis and scoring, human review remains essential for final prize decisions, especially for subjective categories like "Shane's favorite" and "Funniest". The AI evaluation serves as a comprehensive first-pass filter and detailed analysis tool for judges.

Key Features

  • Environment Variable Injection - Seamless config propagation from parent to testing playgrounds
  • Dependency Injection - Leverages InversifyJS for loose coupling and testability
  • Multi-Agent Coordination - Specialized agents working in orchestrated harmony
  • Arcade AI Integration - Direct Google Sheets access through Arcade's tool ecosystem
  • Automated Form Processing - Google Forms responses automatically feed into evaluation workflow
  • Template Ready - A complete Mastra template showcasing advanced patterns

๐Ÿ—๏ธ Architecture

The system uses a multi-agent pipeline with the following specialized components:

๐Ÿค– Core Agents

  • ๐Ÿ“‹ Template Reviewer Agent - Main evaluation agent that coordinates the assessment process
  • ๐Ÿ“š Documentation Review Agent - Analyzes project documentation for clarity, completeness, and extracts metadata
  • ๐ŸŽฏ Promise Extraction Agent - Identifies and extracts stated features, claims, and guarantees from documentation
  • ๐Ÿงช Testing Agent - Verifies promises through automated testing and validation
  • โญ Scoring Agent - Provides final evaluation using a writer-reviewer pattern for high accuracy
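
As a rough illustration of that writer-reviewer pattern, here is a minimal sketch. The prompts, the APPROVED sentinel, and the two-pass bound are assumptions for illustration, not the project's actual implementation.

import { Agent } from "@mastra/core/agent";

// Hypothetical writer-reviewer loop: the writer drafts a scorecard, the
// reviewer critiques it, and the writer revises until approval or a bound.
export async function scoreWithReview(writer: Agent, reviewer: Agent, evidence: string) {
  let draft = (await writer.generate(`Draft a scorecard for this evidence:\n${evidence}`)).text;
  for (let pass = 0; pass < 2; pass++) {
    const critique = (await reviewer.generate(
      `Critique this scorecard; reply APPROVED if it is accurate:\n${draft}`
    )).text;
    if (critique.includes("APPROVED")) break; // assumed sentinel in the reviewer prompt
    draft = (await writer.generate(
      `Revise the scorecard.\nCritique:\n${critique}\nScorecard:\n${draft}`
    )).text;
  }
  return draft;
}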

Architecture Features

  • Unique Project ID: Each evaluation uses a UUID for tracking and correlation
  • Structured Input/Output: All agents communicate through well-defined Zod schemas
  • Model Routing: Uses OpenRouter for optimal LLM selection per task (see the sketch below)
  • Tag Generation: Automatic keyword extraction for searchability

Template Reviewer Workflow Architecture

The heart of this system is the template-reviewer-workflow (src/mastra/workflows/template-reviewer-workflow/), a sophisticated multi-step evaluation pipeline that demonstrates advanced workflow orchestration patterns.

Workflow Overview

The template reviewer workflow implements a 4-phase evaluation process with parallel execution where possible:

templateReviewerWorkflow = createWorkflow({
  id: "template-reviewer-workflow",
  description: "Coordinator that launches the full template-review workflow",
  inputSchema: templateReviewerWorkflowInputSchema,
  outputSchema: templateReviewerWorkflowOutputSchema,
})
  .then(createStep({ id: "clone-project" }))       // Phase 1: Setup
  .parallel([                                      // Phase 2: Parallel Analysis
    createStep({ id: "setup-project-repo" }),
    createStep({ id: "claims-extractor" }),
  ])
  .then(createStep({ id: "executor-and-scorer" })) // Phase 3: Testing & Scoring
  .commit();

Core Workflow Components

1. Project Setup & Cloning

  • Creates new project entity with UUID tracking
  • Persists project metadata to database
  • Initializes evaluation context with environment configuration (a setup sketch follows below)
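
A minimal sketch of that setup phase, assuming a hypothetical cloneAndPrepare helper built on git and npm via child_process:

import { exec } from "node:child_process";
import { promisify } from "node:util";
import { writeFile } from "node:fs/promises";
import { randomUUID } from "node:crypto";

const run = promisify(exec);

export async function cloneAndPrepare(repoURL: string, envConfig: Record<string, string>) {
  const projectId = randomUUID();             // UUID used for tracking and correlation
  const dir = `/tmp/evals/${projectId}`;      // assumed working-directory layout
  await run(`git clone ${repoURL} ${dir}`);   // clone the submission
  await run("npm install", { cwd: dir });     // install its dependencies
  const env = Object.entries(envConfig).map(([k, v]) => `${k}=${v}`).join("\n");
  await writeFile(`${dir}/.env`, env);        // write the merged environment config
  return { projectId, dir };
}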

2. Claims Extractor

Purpose: Systematically extracts the capabilities claimed by the submitted template from its project documentation and video transcripts.

Key features:

  • Dual-source analysis: Processes both documentation and video transcripts
  • Present-tense filtering: Distinguishes current capabilities from future promises
  • Structured extraction: Outputs standardized claim objects with evidence references
export const claimsSchema = z.object({
  claims: z.array(z.object({
    name: z.string().describe("Concise, verb-first summary (≤ 10 words)"),
    description: z.string().describe("Full claim text with ≤ 25-word evidence snippet")
  }))
});

Why it's critical: Claims extraction forms the foundation for all subsequent testing and evaluation. Without accurate claim identification, the testing phase cannot validate the right functionality.

3. Plan Maker

Purpose: Generates comprehensive test plans that validate extracted claims through systematic chat-based interactions.

This component is strategically vital because it:

  • Bridges claims to testing: Converts abstract capability claims into concrete, executable test scenarios
  • Resource-aware planning: Leverages a curated resource kit (PDFs, CSVs, websites, locations) for realistic testing
  • Multi-plan generation: Creates exactly 3 complementary test plans to maximize claim coverage (the plan shape is sketched below)
  • Chat-based validation: Designs conversational tests that mirror real user interactions
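
Based on the example plans shown later in this README, the plan shape can be reconstructed as a Zod schema roughly like this (a sketch, not the actual source file):

import { z } from "zod";

// Field names mirror the plan-1/plan-2/plan-3 examples below.
export const planSchema = z.object({
  id: z.string(),                                // "plan-1" ... "plan-3"
  title: z.string(),
  claims_targeted: z.array(z.string()),          // claim names this plan validates
  steps: z.array(z.object({
    message: z.string(),                         // what the tester sends to the agent
    expected_agent_behavior: z.string(),         // what a passing response looks like
  })),
  success_criteria: z.array(z.string()),
  resourcesToUse: z.array(z.object({ name: z.string(), url: z.string() })),
});

export const planMakerOutputSchema = z.object({
  plans: z.array(planSchema).length(3),          // exactly 3 complementary plans
});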

Resource Kit Integration: Agents often need sample data before they can be meaningfully tested, and the evaluator already takes care of that: it ships with sample data covering the following:

  • Document Processing: Universal Declaration of Human Rights, Sherlock Holmes stories, the Principles of Building AI Agents book from Mastra
  • Data Analysis: Iris dataset, Penguins dataset, Apple stock data
  • Web Content: Hacker News, Wikipedia pages
  • Location Data: Coordinates for weather-related testing

4. Tester Component (tester.ts)

Purpose: Executes the generated test plans and validates agent responses against success criteria. It uses a multi-pass tester-validator loop to chat with the target agent and verify that it actually does what its documentation and video demo claim.

export const testerOutputSchema = z.array(z.object({
  id: z.string(),           // Links back to plan-1, plan-2, plan-3
  passed: z.boolean(),      // Binary pass/fail result
  explanation: z.string(),  // Detailed reasoning for the result
}));

Programmatic Agent Testing with Mastra Client

A key innovation of this evaluation system is its ability to programmatically control and test agents running in separate Mastra playground instances. This is accomplished using the official @mastra/client-js library, enabling sophisticated cross-instance agent orchestration.

Multi-Instance Testing Architecture

The system operates using a dual-playground architecture:

  1. Evaluator Instance - Runs the Template Reviewer Workflow (main evaluation agent)
  2. Target Instance - Runs the project being evaluated (cloned and deployed automatically)

Key Innovation: The evaluator agent can programmatically discover, connect to, and test agents running on completely different Mastra playground instances.
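
A sketch of how the target instance might be booted and connected to; the spawn command, port choice, and fixed readiness delay are assumptions:

import { spawn } from "node:child_process";
import { MastraClient } from "@mastra/client-js";

export async function startTargetInstance(projectDir: string, port = 4112) {
  // Boot the cloned project as its own Mastra playground (assumed CLI invocation).
  const child = spawn("npx", ["mastra", "dev", "--port", String(port)], {
    cwd: projectDir,
    stdio: "inherit",
  });
  // Naive readiness wait; a real implementation would poll the server instead.
  await new Promise((resolve) => setTimeout(resolve, 15_000));
  return { child, client: new MastraClient({ baseUrl: `http://localhost:${port}/` }) };
}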

Official Mastra Client Integration

The tester component leverages the official Mastra JavaScript client (@mastra/client-js) for seamless agent communication:

import { MastraClient } from "@mastra/client-js";

export async function runPlansAgainstAgent(props: {
  port: string;
  plans: z.infer<typeof planMakerOutputSchema>["plans"];
}) {
  // Connect to target Mastra instance
  const baseUrl = `http://localhost:${props.port}/`;
  const client = new MastraClient({ baseUrl });

  // Discover available agents and choose the one with most tools
  const agents = await discoverAgentsWithClient(client);
  // ... rest of testing logic
}

Threaded Conversation Testing

The system uses stateful conversation threads for realistic multi-turn testing:

async function sendChatWithClient(
  client: MastraClient,
  agentId: string,
  messages: Messages,
  threadId?: string
): Promise<string> {
  const agent = client.getAgent(agentId);
  const res: any = await agent.generate({ messages, threadId });

  // Intelligent response parsing
  if (typeof res === "string") return res;
  if (typeof res.text === "string") return res.text;
  if (typeof res.message === "string") return res.message;
  if (typeof res.content === "string") return res.content;

  // Handle message arrays (conversation format)
  if (Array.isArray(res.messages)) {
    const last = res.messages[res.messages.length - 1];
    if (last?.content) return String(last.content);
  }

  return JSON.stringify(res);
}

Context-Aware Testing Process

The complete testing workflow demonstrates:

  1. Client Connection - Establishes connection to target Mastra instance using official client
  2. Agent Discovery - Queries available agents and their capabilities via client API
  3. Smart Selection - Chooses optimal agent based on name matching or tool count
  4. Thread Management - Creates stable conversation threads per test plan for context continuity
  5. Interactive Testing - Conducts realistic chat-based validation of claimed functionality (a per-plan sketch follows below)
  6. Evidence Collection - Documents complete interaction transcripts for transparent scoring
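
Putting steps 4-6 together, executing a single plan could look like this sketch, which reuses the sendChatWithClient helper shown above (the thread-ID scheme and transcript format are assumptions):

import { MastraClient } from "@mastra/client-js";

async function executePlan(
  client: MastraClient,
  agentId: string,
  plan: { id: string; steps: { message: string; expected_agent_behavior: string }[] }
) {
  const threadId = `eval-${plan.id}`; // one stable thread per plan preserves context
  const transcript: string[] = [];
  for (const step of plan.steps) {
    const reply = await sendChatWithClient(client, agentId, step.message, threadId);
    transcript.push(`USER: ${step.message}`, `AGENT: ${reply}`);
  }
  return transcript; // evidence handed to the validator and scorer
}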

๐Ÿ† Why This Approach Matters for Judges

This professional client-based architecture demonstrates several key advantages:

๐Ÿ”ง Official Integration

  • โœ… Standards Compliance: Uses official Mastra client library, not custom API calls
  • ๐Ÿ”„ Future-Proof: Benefits from official library updates and improvements
  • ๐Ÿ›ก๏ธ Error Handling: Robust error handling through established client patterns

๐Ÿงต Stateful Conversations

  • ๐Ÿ’ฌ Realistic Testing: Multi-turn conversations with proper context preservation
  • ๐ŸŽฏ Thread Isolation: Each test plan maintains its own conversation thread
  • ๐Ÿ“ˆ Scalable Design: Concurrent testing across multiple agent instances

๐ŸŽฏ Tool-Based Agent Selection

  • ๐Ÿ”ง Tool-Centric: Always selects the agent with the highest tool count for comprehensive testing
  • ๐Ÿ“Š Objective Criteria: Uses quantifiable metrics (tool count) rather than subjective name matching
  • ๐ŸŽฏ Optimal Coverage: Ensures testing against the most capable agent available
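
A sketch of that selection logic, assuming the client's getAgents() listing exposes each agent's tools:

import { MastraClient } from "@mastra/client-js";

async function pickMostCapableAgent(client: MastraClient): Promise<string> {
  const agents = await client.getAgents(); // assumed shape: { [agentId]: { tools, ... } }
  let bestId = "";
  let bestToolCount = -1;
  for (const [id, info] of Object.entries(agents)) {
    const toolCount = Object.keys((info as any).tools ?? {}).length;
    if (toolCount > bestToolCount) {
      bestId = id;
      bestToolCount = toolCount;
    }
  }
  return bestId;
}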

This approach showcases integration patterns with the Mastra ecosystem, demonstrating how to build sophisticated agent orchestration systems using official tooling rather than ad-hoc API integrations.

5. Scorer Component (scorer.ts)

Purpose: Provides comprehensive evaluation across multiple dimensions with detailed explanations.

export const scorerOutputSchema = z.object({
  descriptionQuality: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  tests: testerOutputSchema,  // Integration with test results
  appeal: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  creativity: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  architecture: z.object({
    agents: z.object({ count: z.number() }),
    tools: z.object({ count: z.number() }),
    workflows: z.object({ count: z.number() }),
  }),
  tags: z.array(z.string()),  // Automatic categorization
});

Workflow Execution Flow

  1. Input Processing: Accepts project name, repository URL, description, video URL, and optional environment configuration
  2. Parallel Phase:
    • Repo Setup: Clones repository, runs npm install, creates .env file
    • Claims Analysis: Extracts video transcript, analyzes documentation, identifies capabilities
  3. Plan Generation: Creates 3 targeted test plans based on extracted claims
  4. Test Execution: Runs chat-based tests against the deployed project
  5. Final Scoring: Generates comprehensive evaluation with detailed explanations

Why This Architecture Matters

Systematic Claim Validation

Unlike ad-hoc evaluation approaches, this workflow ensures every stated capability is systematically:

  • Documented (claims extractor)
  • Planned for testing (plan maker)
  • Empirically validated (tester)
  • Scored with evidence (scorer)

Reproducible Evaluation Process

The workflow creates an audit trail from initial claims through final scores, enabling:

  • Traceability: Every score traces back to specific test results
  • Reproducibility: Same project always produces consistent evaluations
  • Comparative Analysis: Standardized scoring enables project comparisons

Parallel Processing Optimization

Smart parallelization reduces evaluation time:

  • Repository setup and claims extraction run concurrently
  • Video processing happens alongside documentation analysis
  • Database persistence is optimized for workflow state management

Real-World Example: Evaluating Deep Research Assistant

Let's walk through how our template reviewer workflow would evaluate the Deep Research Assistant project:

Input Processing

{
  "name": "Deep Research Assistant",
  "repoURLOrShorthand": "https://github.com/mastra-ai/template-deep-research.git",
  "description": "Advanced AI deep research assistant with human-in-the-loop workflows",
  "videoURL": "https://youtube.com/watch?v=demo-video",
  "envConfig": {
    "EXA_API_KEY": "demo-key-for-testing"
  }
}

๐Ÿ” Claims Extraction Output

Our claims extractor would identify these present-tense capabilities:

{
  "claims": [
    {
      "name": "Implements interactive human-in-loop research system",
      "description": "Creates an interactive, human-in-the-loop research system that allows users to explore topics - README line 3"
    },
    {
      "name": "Searches web using Exa API integration",
      "description": "webSearchTool: Searches the web using the Exa API for relevant information - README line 15"
    },
    {
      "name": "Evaluates research result relevance automatically",
      "description": "evaluateResultTool: Assesses result relevance to the research topic - README line 16"
    },
    {
      "name": "Generates comprehensive markdown reports",
      "description": "reportAgent: Transforms research findings into comprehensive markdown reports - README line 20"
    },
    {
      "name": "Extracts key learnings and follow-up questions",
      "description": "extractLearningsTool: Identifies key learnings and generates follow-up questions - README line 17"
    }
  ]
}

Generated Test Plans

Our plan maker would create 3 targeted chat-based test plans:

Plan 1: End-to-End Research Process

{
  "id": "plan-1",
  "title": "Validate complete research workflow with report generation",
  "claims_targeted": [
    "Searches web using Exa API integration",
    "Generates comprehensive markdown reports"
  ],
  "steps": [
    {
      "message": "I need you to research 'AI agent frameworks in 2024' and provide me with a comprehensive analysis. Please use the Principles of Building AI Agents document at https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf as a reference.",
      "expected_agent_behavior": "Should initiate web search using Exa API, retrieve relevant information, and reference the provided PDF"
    },
    {
      "message": "Now generate a final research report in markdown format with your findings.",
      "expected_agent_behavior": "Should produce a well-structured markdown report containing research findings, analysis, and references"
    }
  ],
  "success_criteria": [
    "Successfully searches web using Exa API",
    "References the provided PDF document",
    "Generates properly formatted markdown report",
    "Report contains research findings and analysis"
  ],
  "resourcesToUse": [
    {"name": "AI Agent Principles PDF", "url": "https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf"}
  ]
}

Plan 2: Result Evaluation and Learning Extraction

{
  "id": "plan-2",
  "title": "Test relevance evaluation and learning extraction capabilities",
  "claims_targeted": [
    "Evaluates research result relevance automatically",
    "Extracts key learnings and follow-up questions"
  ],
  "steps": [
    {
      "message": "Research Python programming trends using information from https://en.wikipedia.org/wiki/Python_(programming_language) and evaluate how relevant each piece of information is to modern software development.",
      "expected_agent_behavior": "Should retrieve Wikipedia content, assess relevance of different sections, and provide relevance ratings"
    },
    {
      "message": "Based on your research, extract the top 3 key learnings and suggest 2 follow-up research questions.",
      "expected_agent_behavior": "Should identify key insights from the research and generate relevant follow-up questions for deeper investigation"
    }
  ],
  "success_criteria": [
    "Demonstrates relevance evaluation for search results",
    "Extracts meaningful key learnings from research data",
    "Generates logical follow-up research questions",
    "Shows clear reasoning for relevance assessments"
  ],
  "resourcesToUse": [
    {"name": "Wikipedia Python Page", "url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}
  ]
}

Plan 3: Multi-Source Research Integration

{
  "id": "plan-3",
  "title": "Validate research across multiple data sources and formats",
  "claims_targeted": [
    "Implements interactive human-in-loop research system",
    "Searches web using Exa API integration"
  ],
  "steps": [
    {
      "message": "Research current trends in data science by analyzing information from https://news.ycombinator.com/ and correlating it with data patterns from the Iris dataset at https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
      "expected_agent_behavior": "Should fetch and analyze both web content from Hacker News and CSV data, then find correlations or connections"
    },
    {
      "message": "Summarize how the current discussions on Hacker News relate to data science methodologies, using examples from the Iris dataset analysis.",
      "expected_agent_behavior": "Should synthesize findings from both sources and demonstrate connections between current discussions and classic data science examples"
    }
  ],
  "success_criteria": [
    "Successfully processes both web content and CSV data",
    "Demonstrates integration across multiple data formats",
    "Provides meaningful synthesis of disparate information sources",
    "Shows ability to correlate web discussions with data analysis"
  ],
  "resourcesToUse": [
    {"name": "Hacker News", "url": "https://news.ycombinator.com/"},
    {"name": "Iris Dataset", "url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"}
  ]
}


Sample Result View

Sample Test Results

[
  {
    "id": "plan-1",
    "passed": true,
    "explanation": "Successfully completed end-to-end research with Exa API integration. Generated comprehensive markdown report with proper structure and citations."
  },
  {
    "id": "plan-2",
    "passed": true,
    "explanation": "Demonstrated clear relevance evaluation process. Extracted meaningful insights and generated logical follow-up questions."
  },
  {
    "id": "plan-3",
    "passed": false,
    "explanation": "Successfully processed both data sources but failed to establish meaningful correlations between HN discussions and Iris dataset patterns."
  }
]

โญ Final Evaluation Score

{
  "descriptionQuality": {
    "score": 4,
    "explanation": "Clear, well-structured documentation with good technical detail and usage examples"
  },
  "tests": [
    {"id": "plan-1", "passed": true, "explanation": "End-to-end research workflow validated"},
    {"id": "plan-2", "passed": true, "explanation": "Relevance evaluation and learning extraction working"},
    {"id": "plan-3", "passed": false, "explanation": "Multi-source integration needs improvement"}
  ],
  "appeal": {
    "score": 4,
    "explanation": "Compelling use case for research automation with clear business value"
  },
  "creativity": {
    "score": 3,
    "explanation": "Good implementation of known patterns but limited novel approaches"
  },
  "architecture": {
    "agents": {"count": 2},
    "tools": {"count": 3},
    "workflows": {"count": 2}
  },
  "tags": [
    "exa-api",
    "web-search",
    "report-generation",
    "human-in-the-loop",
    "eligible-browserbase",
    "eligible-productivity",
    "eligible-best-overall"
  ]
}

๐Ÿ† Sponsor Track Eligibility Detection

The AI automatically detected sponsor eligibilities based on:

  • ๐ŸŒ eligible-browserbase: Uses Exa API for web search (similar to web browsing functionality)
  • โšก eligible-productivity: Research automation enhances user productivity
  • ๐Ÿฅ‡ eligible-best-overall: Solid implementation with good architecture and functionality

Additional tags would be generated for projects using:

  • eligible-smithery: Projects with @smithery/sdk dependency
  • eligible-workos: Authentication flows using @workos/node
  • eligible-arcade: Tool integrations using @arcadeai/arcadejs
  • eligible-chroma: RAG implementations with chromadb
  • eligible-recall: Crypto/blockchain functionality
  • eligible-confident-ai: Evaluation frameworks integration

Dependency Injection Architecture

Comprehensive IoC implementation using InversifyJS - a nice addition to the Mastra templates library, in my opinion (see Container Setup below).

Arcade AI Integration

Extension Enhancement: The Arcade AI integration and Google Forms workflow were added during the 3-hour extension period granted by the judges and were not part of the original hackathon submission. This demonstrates the system's extensibility and rapid integration capabilities.

Google Sheets Tool (google-sheets-tool.ts)

Seamlessly integrates with Google Sheets through Arcade AI's tool ecosystem:

export const googleSheetsTool = ({ arcadeApiKey, arcadeUserId, defaultSpreadsheetId }) => {
  return createTool({
    id: "get_google_spreadsheet",
    description: "Fetch data from a Google Spreadsheet using Arcade AI",
    execute: async ({ context }) => {
      // Fall back to the configured default when no ID is supplied in context
      const finalSpreadsheetId = context?.spreadsheetId ?? defaultSpreadsheetId;
      const client = new Arcade({ apiKey: arcadeApiKey });
      const result = await client.tools.execute({
        tool_name: "GoogleSheets.GetSpreadsheet@3.0.0",
        input: { spreadsheet_id: finalSpreadsheetId },
        user_id: arcadeUserId,
      });
      return result;
    }
  });
};

Google Forms to Evaluation Pipeline

Automated hackathon submission processing:

  1. Form Submissions - Participants submit via Google Forms (repo URL, demo video, description)
  2. Sheet Integration - Responses automatically populate Google Sheets
  3. Arcade Processing - Google Sheets tool fetches new submissions
  4. Auto-Evaluation - Each row triggers the template-reviewer-workflow (sketched below)
  5. Live Results - Scores and rankings update in real-time
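
A sketch of that bridge using current Mastra workflow-run conventions; the row shape, workflow id, and run API details are assumptions:

import { mastra } from "./mastra"; // assumed app entry exporting the Mastra instance

export async function processNewSubmissions(
  rows: Array<{ name: string; repo: string; video: string; description: string }>
) {
  for (const row of rows) {
    // One evaluation run per sheet row fetched via the Google Sheets tool.
    const run = await mastra.getWorkflow("templateReviewerWorkflow").createRunAsync();
    await run.start({
      inputData: {
        name: row.name,
        repoURLOrShorthand: row.repo,
        videoURL: row.video,
        description: row.description,
      },
    });
  }
}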

Benefits for Hackathon Organizers

  • Zero Manual Input - Forms directly feed evaluation pipeline
  • Real-time Processing - Submissions evaluated as they arrive
  • Automated Tracking - Complete audit trail from form to final score
  • Instant Rankings - Live leaderboard updates with new submissions

Container Setup (index.ts)

import "reflect-metadata";
import { Container } from "inversify";

const container = new Container();
container.bind(Config).toDynamicValue(() => new Config()).inSingletonScope();
container.bind(DB_SYMBOL).toConstantValue(getDB(container));

// Professional DI container with proper lifecycle management

Service Abstractions (infra/repositories/)

@injectable()
class ProjectRepository implements IProjectRepository {
  constructor(@inject(DB_SYMBOL) private db: Database) {}
  // Clean dependency injection with interface segregation
}

Testability Benefits

// Easy mocking for unit tests
const mockRepo = mock<IProjectRepository>();
container.rebind(PROJECT_REPO_SYMBOL).toConstantValue(mockRepo);
// Dependency injection enables effortless testing

๐Ÿ“ Project Structure

 1src/mastra/
 2โ”œโ”€โ”€ ๐Ÿค– agents/           # AI agents for specialized evaluation tasks
 3โ”‚   โ”œโ”€โ”€ claims-extractor-agent.ts     # ๐Ÿ” Claims extraction specialist
 4โ”‚   โ”œโ”€โ”€ template-reviewer-agent.ts    # ๐Ÿ“‹ Main coordinator agent
 5โ”‚   โ””โ”€โ”€ weather-agent.ts              # ๐ŸŒค๏ธ Example weather agent
 6โ”œโ”€โ”€ ๐Ÿ›๏ธ domain/          # Domain entities and business logic
 7โ”‚   โ”œโ”€โ”€ aggregates/      # ๐Ÿ“ Domain aggregates and configuration
 8โ”‚   โ”‚   โ”œโ”€โ”€ config.ts    # โš™๏ธ Application configuration
 9โ”‚   โ”‚   โ””โ”€โ”€ project/     # ๐Ÿ“‹ Project domain model
10โ”‚   โ””โ”€โ”€ shared/          # ๐Ÿ’Ž Shared value objects
11โ”‚       โ””โ”€โ”€ value-objects/
12โ”‚           โ”œโ”€โ”€ id.ts    # ๐Ÿ†” Type-safe identifiers
13โ”‚           โ””โ”€โ”€ index.ts # ๐Ÿ“ค Value object exports
14โ”œโ”€โ”€ ๐Ÿ—๏ธ infra/           # Infrastructure layer
15โ”‚   โ”œโ”€โ”€ database/        # ๐Ÿ—„๏ธ MongoDB connection and setup
16โ”‚   โ”‚   โ””โ”€โ”€ mongodb.ts   # ๐Ÿ“Š Database configuration
17โ”‚   โ”œโ”€โ”€ model/           # ๐Ÿง  AI model configuration
18โ”‚   โ”‚   โ””โ”€โ”€ index.ts     # ๐Ÿค– OpenRouter model setup
19โ”‚   โ”œโ”€โ”€ repositories/    # ๐Ÿ“š Data persistence layer
20โ”‚   โ”‚   โ””โ”€โ”€ project.ts   # ๐Ÿ“‹ Project data access
21โ”‚   โ””โ”€โ”€ services/        # ๐Ÿ”ง External service integrations
22โ”‚       โ””โ”€โ”€ video/       # ๐ŸŽฅ Video processing services
23โ”œโ”€โ”€ ๐Ÿ› ๏ธ tools/           # Mastra tools for agent capabilities
24โ”‚   โ”œโ”€โ”€ google-sheets-tool.ts # ๐Ÿ“Š Google Sheets integration via Arcade AI
25โ”‚   โ””โ”€โ”€ list-projects-tool.ts # ๐Ÿ“‹ Project listing tool
26โ”œโ”€โ”€ ๐Ÿ”„ workflows/       # Business process workflows
27โ”‚   โ”œโ”€โ”€ template-reviewer-workflow/  # ๐Ÿ“Š Main evaluation pipeline
28โ”‚   โ”‚   โ”œโ”€โ”€ claim-extractor.ts      # ๐Ÿ” Claims extraction step
29โ”‚   โ”‚   โ”œโ”€โ”€ index.ts               # ๐Ÿš€ Workflow orchestration
30โ”‚   โ”‚   โ”œโ”€โ”€ plan-maker.ts          # ๐Ÿ“‹ Test plan generation
31โ”‚   โ”‚   โ”œโ”€โ”€ scorer.ts              # โญ Evaluation scoring
32โ”‚   โ”‚   โ”œโ”€โ”€ tester.ts              # ๐Ÿงช Automated testing
33โ”‚   โ”‚   โ”œโ”€โ”€ sample-input.json      # ๐Ÿ“ Example input data
34โ”‚   โ”‚   โ””โ”€โ”€ sample-output.json     # ๐Ÿ“Š Example output structure
35โ”‚   โ””โ”€โ”€ test-workflow.ts           # ๐Ÿงช Simple test workflow
36โ””โ”€โ”€ index.ts             # ๐ŸŽฏ Main application entry point

Features

Multi-Agent Evaluation Pipeline

  1. Documentation Analysis - Comprehensive review of README and project documentation
  2. Promise Extraction - Systematic identification of project claims and features
  3. Automated Testing - Verification of promises through code execution and testing
  4. Structured Scoring - Evidence-based evaluation using defined rubrics
  5. Tag Generation - Automatic classification for searchability and organization

Evaluation Criteria

  • Documentation Quality - Clarity, completeness, usability assessment
  • Feature Completeness - Delivery verification of promised functionality
  • Reliability - Error-free operation validation through testing
  • Innovation/Impact - Novelty and significance evaluation of solution
  • Technical Implementation - Code quality and architecture assessment

Template Contribution to Mastra Ecosystem

This project represents a groundbreaking addition to Mastra's template library, introducing enterprise-grade architectural patterns that are currently missing from the official collection:

Seamless Environment Variable Injection

  • Parent-to-Child Propagation - Automatically injects parent playground's environment variables into testing playgrounds
  • API Key Inheritance - Testing environments inherit all AI API keys from the evaluator playground
  • Zero-Config Testing - Target projects receive all necessary environment variables without manual setup
  • Dynamic Configuration Merging - Combines parent playground config with project-specific environment variables
  • Effortless Multi-Instance Testing - Eliminates setup friction for cross-playground agent communication
  • Automated Environment Provisioning - Testing playgrounds get fully configured environments automatically

Dependency Injection Mastery

  • Critical Gap Filled - Addresses the complete absence of DI examples in current templates
  • InversifyJS Integration - Professional IoC container setup with decorators
  • Interface Segregation - Clean abstractions between application layers
  • Lifecycle Management - Singleton scoping and proper resource management
  • Testability Focus - Architecture designed for easy mocking and unit testing
  • Modular Design - Loosely coupled components for maximum flexibility

Advanced Environment Variable Management

A critical innovation for seamless multi-playground testing:

// Automatic environment injection from parent to testing playgrounds
envConfig: {
  ...container.get(Config).aiAPIKeys,  // Parent playground's AI keys
  ...inputData.envConfig,              // Project-specific overrides
}

Key Benefits:

  • Zero-Config Testing - Testing playgrounds inherit all necessary API keys automatically
  • Parent-Child Propagation - Evaluator playground shares environment with target projects
  • AI Provider Continuity - OpenRouter, OpenAI, and other API keys propagate seamlessly
  • Configuration Merging - Smart combination of parent config with project-specific variables
  • Friction-Free Setup - Eliminates manual environment setup for cross-instance testing
  • Secure Key Management - Centralizes API key management in the parent evaluator instance

Why This Matters for Mastra Templates: This pattern solves a critical pain point in multi-instance Mastra deployments where testing environments need access to the same API keys and configurations as the parent system, enabling truly automated evaluation workflows.

๐Ÿข Enterprise-Ready Architecture

Unlike other templates focused on simple demos, this showcases:

  • ๐Ÿ—๏ธ Production Patterns - Battle-tested enterprise architectural decisions
  • ๐Ÿ“ˆ Scalability Design - Built to handle complex business domains
  • ๐Ÿ›ก๏ธ Maintainability - Clean code principles and SOLID design patterns
  • ๐Ÿ” Observability - Comprehensive logging and monitoring integration

Prerequisites

  • Node.js >= 20.9.0
  • MongoDB instance for data persistence
  • OpenRouter API key for LLM access
  • Project repository or documentation to evaluate

Installation

  1. Clone the repository:

git clone <repository-url>
cd mastra-template-evaluator

  2. Install dependencies:

npm install

  3. Set up environment variables:

# Configure OpenRouter API key and MongoDB connection

Usage

Development Mode

npm run dev

Build

npm run build

Production

npm start

Template Evaluation

The system can evaluate projects by:

  • Analyzing markdown documentation and README files
  • Processing video demonstrations (YouTube links)
  • Extracting and verifying feature claims
  • Running automated tests and validations
  • Generating comprehensive evaluation reports with scores and feedback

Example Evaluation Input

Here's an example of how to evaluate a Mastra template project using this system. The evaluator takes a structured input describing the project to be assessed:

{
  "name": "PDF to Questions Generator",
  "repoURLOrShorthand": "mastra-ai/template-pdf-questions",
  "videoURL": "https://youtu.be/WQ0rvX8ajeg",
  "description": "A Mastra template that demonstrates **how to protect against token limits** by generating AI summaries from large datasets before passing as output from tool calls..."
}

Key Input Fields:

  • name: Human-readable project name for identification
  • repoURLOrShorthand: GitHub repository (full URL or owner/repo format)
  • videoURL: YouTube demo video for functionality analysis
  • description: Full project documentation in markdown format
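
These fields correspond to the workflow's input schema, which can be reconstructed roughly as follows (a sketch based on the fields above plus the optional envConfig shown in the Deep Research example):

import { z } from "zod";

export const templateReviewerWorkflowInputSchema = z.object({
  name: z.string(),                           // human-readable project name
  repoURLOrShorthand: z.string(),             // full URL or "owner/repo"
  videoURL: z.string(),                       // YouTube demo link
  description: z.string(),                    // full project documentation in markdown
  envConfig: z.record(z.string()).optional(), // project-specific env overrides
});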

What the evaluator does with this input:

  1. Clones the repository and sets up the project environment
  2. Extracts claims from both documentation and video transcript
  3. Generates test plans to validate claimed functionality
  4. Runs automated tests by interacting with the deployed agent
  5. Provides comprehensive scoring across multiple criteria
  6. Auto-detects sponsor track eligibility (MCP, auth, RAG, etc.)

READY TO TEST: The sample-input.json file contains a fully working test case that you can use immediately to test the evaluation system. This input evaluates the official Mastra PDF Questions template and demonstrates all evaluation features including claims extraction, test plan generation, automated testing, and comprehensive scoring.

EXAMPLE OUTPUT: The sample-output.json file shows the complete result structure from a successful evaluation run. It demonstrates the comprehensive scoring, test results, sponsor track tags, and detailed analysis that the system produces for each evaluated project.

VERIFIED WORKING: This sample input has been tested end-to-end and successfully evaluates the target project with full functionality validation, scoring, and sponsor track detection. Simply run the workflow with this input to see the complete evaluation process in action.

Tech Stack

Core Framework

  • Mastra - Multi-agent orchestration and workflow management
  • TypeScript - Type-safe development with full IntelliSense support
  • Node.js - Runtime environment for scalable server-side applications

AI & LLM Integration

  • OpenRouter - Multi-provider LLM routing with model selection optimization
  • AI SDK - Vercel's AI SDK for streamlined language model interactions
  • Mastra Client - Official client library for cross-instance agent communication

Data & Storage

  • MongoDB - Document database for project metadata, evaluations, and scoring data
  • Zod - Schema validation and type-safe data parsing
  • UUID - Unique identifier generation for project tracking

๐Ÿ—๏ธ Architecture & Design

  • โš™๏ธ Environment Variable Injection - Seamless config propagation from parent to testing playgrounds
  • Dependency Injection - InversifyJS IoC container for loose coupling and testability
  • Repository Pattern - Data access abstraction with clean interfaces

๐Ÿงช Testing & Quality

  • Multi-Agent Testing - Automated agent interaction validation
  • Chat-based Validation - Conversational testing with stateful threads
  • Evidence Collection - Comprehensive interaction logging and analysis

External Services

  • Arcade AI - Google Sheets integration and tool ecosystem access
  • TranscriptAPI - Video transcript extraction and processing
  • YouTube Integration - Demo video analysis and content extraction
  • Git Integration - Automated repository cloning and project setup

Dependencies

Core Framework

  • @mastra/core: Core Mastra framework functionality
  • @mastra/libsql: SQLite storage for telemetry and evaluations
  • @mastra/memory: Memory management for agent persistence
  • @mastra/loggers: Logging infrastructure

AI and LLM Integration

  • @openrouter/ai-sdk-provider: OpenRouter LLM provider
  • ai: AI SDK for language model interactions

๐Ÿ›๏ธ Infrastructure & Architecture

  • inversify: ๐Ÿ’‰ Dependency injection container (IoC)
  • mongodb: ๐Ÿ—„๏ธ MongoDB database driver
  • zod: โœ… Schema validation and type safety
  • reflect-metadata: ๐ŸŽญ Decorator metadata reflection

Integration & Tools

  • @arcadeai/arcadejs: Arcade AI SDK for Google Sheets integration
  • uuid: Unique identifier generation

Development

  • mastra: CLI tools for development and deployment
  • typescript: TypeScript support and compilation
  • @types/node: Node.js type definitions

Multi-Agent Benefits

This architecture provides several advantages over single-agent approaches:

  1. Specialization - Each agent focuses on a specific domain (documentation, testing, scoring)
  2. Clarity - Clear separation of concerns improves reliability and maintainability
  3. Scalability - Agents can run concurrently where appropriate
  4. Accuracy - Writer-reviewer pattern in scoring agent ensures high-quality evaluations
  5. Flexibility - Different models can be used for different complexity levels

Hackathon Evaluation Process

Submission Processing Pipeline

  1. Project Intake

    • Repository URL and documentation analysis
    • Demo video transcript extraction and processing
    • Environment setup and dependency detection
  2. Parallel Intelligence Gathering

    • Claims Extraction: AI identifies all stated project capabilities
    • Repository Analysis: Code scanning for architectural patterns and sponsor integrations
    • Video Analysis: Demo functionality validation from transcript
  3. Test Plan Generation

    • AI Test Designer: Creates 3 targeted test plans per project
    • Resource Allocation: Assigns appropriate datasets, PDFs, and web resources
    • Interaction Planning: Designs realistic user scenarios for agent testing
  4. Live Functionality Testing

    • Automated Agent Interaction: Tests each claimed feature through chat interfaces
    • Success Validation: Empirical verification against stated capabilities
    • Evidence Collection: Detailed logs and response analysis
  5. Multi-Dimensional Scoring

    • Technical Merit: Architecture quality, code patterns, innovation
    • Functional Completeness: Validation of all claimed features
    • Sponsor Alignment: Automatic detection of prize track eligibility
    • Impact Assessment: Productivity gains, user value, market potential
  6. Results Compilation

    • Detailed Scorecards: Transparent breakdown of all evaluation criteria
    • Prize Recommendations: AI-identified sponsor track matches
    • Improvement Feedback: Specific suggestions for enhancement
    • Comparative Ranking: Position relative to other submissions

Value for Mastra.Build Hackathon

Immediate Hackathon Impact

Judges & Organizers

  • 10x faster evaluation: Process hundreds of submissions in hours, not days
  • Consistent scoring: Every project evaluated using the same rigorous criteria
  • Data-driven decisions: Replace gut feelings with empirical evidence
  • Automatic categorization: AI identifies sponsor prize eligibility instantly
  • Detailed rankings: Transparent scoring breakdown for every submission

Participants

  • Clear expectations: Understand exactly how projects will be evaluated
  • Immediate feedback: Get detailed analysis of strengths and improvement areas
  • Strategic insights: See which sponsor tracks your project aligns with
  • Fair evaluation: No bias based on presentation skills or demo timing
  • Learning opportunity: Understand enterprise-grade Mastra patterns

๐Ÿ›๏ธ Long-term Mastra Ecosystem Value

๐ŸŒŸ Template Library Leadership

This template pioneers critical architectural patterns missing from Mastra's current library:

  • โš™๏ธ Environment Variable Injection: Seamless config propagation from parent to testing playgrounds
  • ๐Ÿ’‰ Dependency Injection: Professional IoC container setup with InversifyJS
  • ๐Ÿข Enterprise Architecture: Production-ready patterns for complex business logic
  • ๐Ÿงช Systematic Testing: Multi-agent evaluation workflows for quality assurance

๐Ÿ“š Educational Excellence

  • ๐Ÿ“– Reference Implementation: Complete environment injection + DI example for enterprise developers
  • ๐ŸŽฏ Real Business Logic: Project evaluation domain demonstrates complex workflows
  • ๐Ÿ”ง Best Practices: Proper error handling, logging, and monitoring integration
  • โšก Scalability Patterns: Built to handle hundreds of concurrent evaluations

๐Ÿš€ Market Positioning

  • ๐Ÿข Enterprise Credibility: Positions Mastra as enterprise-capable framework
  • ๐Ÿ‘ฅ Developer Attraction: Sophisticated examples attract senior developers
  • ๐Ÿ“Š Quality Standards: Establishes architectural benchmarks for future templates
  • ๐Ÿ”ฎ Ecosystem Foundation: Enables complex multi-agent applications in production

Acknowledgments

Thanks to the Mastra team for creating the PDF Questions template which was used as a target to test this evaluation tool.

I've uploaded a demo video to YouTube at https://www.youtube.com/watch?v=WQ0rvX8ajeg just to test things out - if the Mastra team would like it taken down, please reach out!

Thanks to TranscriptAPI for providing video transcription services with permission for this hackathon project.

License

ISC