
๐Ÿ† Mastra.Build Hackathon Evaluator

Automated, unbiased evaluation system for Mastra.Build hackathon submissions using advanced multi-agent workflows.

Watch Demo Video - See the evaluator in action!

This system revolutionizes hackathon judging by replacing subjective manual reviews with systematic, data-driven evaluation. Built specifically for the Mastra.Build hackathon, it automatically evaluates submitted projects, validates claimed features through live testing, and provides sponsor-aligned scoring with track eligibility detection.

Core Purpose

Problem: Mastra.Build hackathon judging requires evaluating diverse AI agent projects consistently across multiple sponsor prize categories.

Solution: An AI-powered evaluation pipeline that:

  • Extracts verifiable claims from project documentation and demo videos
  • Tests functionality through automated agent interactions
  • Scores objectively using standardized criteria across all submissions
  • Tags for sponsor tracks with automatic eligibility detection for Smithery, WorkOS, Browserbase, Arcade, Chroma, Recall, and Confident AI prizes
  • Ranks submissions with transparent, auditable results

Why This Matters for Mastra.Build

This system transforms Mastra.Build evaluation from "subjective demos" to "empirical validation":

  • Eliminates judging bias through systematic evaluation criteria
  • Validates AI agent functionality instead of relying on presentations alone
  • Automatically detects sponsor alignment for prize categories (MCP servers, auth integration, web browsing, etc.)
  • Scales to evaluate hundreds of Mastra framework submissions efficiently
  • Provides detailed feedback to help Mastra community members improve their agents

Novel Approach: Mastra Evaluating Mastra

Revolutionary Insight: This project demonstrates a groundbreaking approach to AI agent evaluation by using the Mastra framework to evaluate Mastra-built agents.

Self-Evaluation Architecture

Rather than building evaluation as a separate framework or external tool, we've created something unprecedented:

  • Mastra Agents Evaluating Mastra Agents: The evaluator itself is a sophisticated Mastra multi-agent workflow
  • Native Framework Integration: Deep understanding of Mastra patterns, conventions, and architectural decisions
  • Cross-Instance Communication: Uses official @mastra/client-js to programmatically test agents running in separate Mastra instances
  • Framework-Aware Testing: Inherent knowledge of Mastra workflows, tools, and agent patterns enables more intelligent evaluation

Why This Matters Beyond Hackathons

This "framework evaluating itself" approach represents a new paradigm in AI system assessment:

Traditional Approach ❌:

External Eval Tool → Tests → AI Framework Project

Our Novel Approach ✅:

Mastra Evaluator Agent → Tests → Mastra Target Agent
(Same framework, deep native understanding)

Unique Advantages

  • Native Intelligence: The evaluator inherently understands Mastra conventions, making evaluations more contextually accurate
  • Self-Improving Ecosystem: Insights from evaluations can directly improve the framework itself
  • Framework-Specific Metrics: Evaluation criteria tailored specifically to Mastra's multi-agent, workflow-oriented architecture
  • Proof of Concept: Demonstrates Mastra's capability to build sophisticated, production-ready evaluation systems

Industry First: To our knowledge, this is the first time a multi-agent framework has been used to systematically evaluate projects built with itself, showcasing both the maturity and self-reflective capabilities of the Mastra ecosystem.

๐Ÿ† Sponsor Prize Track Detection

A key differentiator of this evaluation system is automated sponsor alignment detection. The AI scorer analyzes project dependencies, functionality, and implementation patterns to identify eligibility for specific sponsor prize categories:

Mastra.Build Prize Categories

  • Best overall (judged by Mastra)
  • Best MCP server (judged by Smithery)
  • Bonus award: Best use of Smithery (Switch2)
  • Best use of AgentNetwork (judged by Mastra)
  • Best use of auth (judged by WorkOS)
  • Best use of web browsing (judged by Browserbase)
  • Best use of tool provider (judged by Arcade)
  • Best RAG template (judged by Chroma)
  • Best productivity (judged by Mastra)
  • Best coding agent (judged by Mastra)
  • Best crypto agent (judged by Recall)
  • Best use of Evals (judged by Confident AI)
  • Shane's favorite (judged by Shane)
  • Funniest (judged by Abhi)

Automated Tag Detection Process

The system automatically analyzes:

  • Package Dependencies: Detects @smithery/sdk, @workos/node, browserbase, @arcadeai/arcadejs, chromadb, etc. (a sketch of this pass follows below)
  • Code Patterns: Identifies authentication flows, web scraping, RAG implementations, MCP server structures
  • Documentation Keywords: Extracts mentions of sponsor technologies and use cases
  • Functionality Testing: Validates actual integration with sponsor services
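
For illustration, here is a minimal TypeScript sketch of the dependency-scanning pass. The DEP_TO_TAG table and detectSponsorTags helper are hypothetical names; the package-to-tag pairs are taken from the lists in this README.

// Hypothetical sketch: map package.json dependencies to sponsor-track tags.
import { readFile } from "node:fs/promises";
import path from "node:path";

const DEP_TO_TAG: Record<string, string> = {
  "@smithery/sdk": "eligible-smithery",
  "@workos/node": "eligible-workos",
  "browserbase": "eligible-browserbase",
  "@arcadeai/arcadejs": "eligible-arcade",
  "chromadb": "eligible-chroma",
};

export async function detectSponsorTags(projectDir: string): Promise<string[]> {
  const raw = await readFile(path.join(projectDir, "package.json"), "utf8");
  const pkg = JSON.parse(raw) as {
    dependencies?: Record<string, string>;
    devDependencies?: Record<string, string>;
  };
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  // A dependency match is one signal; code patterns and docs provide the rest.
  return Object.entries(DEP_TO_TAG)
    .filter(([dep]) => dep in deps)
    .map(([, tag]) => tag);
}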

Important Note: This system is designed to assist and accelerate the evaluation process, not replace human judgment. While it provides systematic analysis and scoring, human review remains essential for final prize decisions, especially for subjective categories like "Shane's favorite" and "Funniest". The AI evaluation serves as a comprehensive first-pass filter and detailed analysis tool for judges.

Key Features

  • Environment Variable Injection - Seamless config propagation from parent to testing playgrounds
  • Dependency Injection - Leverages InversifyJS for loose coupling and testability
  • Multi-Agent Coordination - Specialized agents working in orchestrated harmony
  • Arcade AI Integration - Direct Google Sheets access through Arcade's tool ecosystem
  • Automated Form Processing - Google Forms responses automatically feed into evaluation workflow
  • Template Ready - A complete Mastra template showcasing advanced patterns

๐Ÿ—๏ธ Architecture

The system uses a multi-agent pipeline with the following specialized components:

๐Ÿค– Core Agents

  • ๐Ÿ“‹ Template Reviewer Agent - Main evaluation agent that coordinates the assessment process
  • ๐Ÿ“š Documentation Review Agent - Analyzes project documentation for clarity, completeness, and extracts metadata
  • ๐ŸŽฏ Promise Extraction Agent - Identifies and extracts stated features, claims, and guarantees from documentation
  • ๐Ÿงช Testing Agent - Verifies promises through automated testing and validation
  • โญ Scoring Agent - Provides final evaluation using a writer-reviewer pattern for high accuracy
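
As a rough illustration of that writer-reviewer pattern, here is a minimal sketch. The prompts, the APPROVED sentinel, and the two-pass bound are assumptions for illustration, not the project's actual implementation.

import { Agent } from "@mastra/core/agent";

// Hypothetical writer-reviewer loop: the writer drafts a scorecard, the
// reviewer critiques it, and the writer revises until approval or a bound.
export async function scoreWithReview(writer: Agent, reviewer: Agent, evidence: string) {
  let draft = (await writer.generate(`Draft a scorecard for this evidence:\n${evidence}`)).text;
  for (let pass = 0; pass < 2; pass++) {
    const critique = (await reviewer.generate(
      `Critique this scorecard; reply APPROVED if it is accurate:\n${draft}`
    )).text;
    if (critique.includes("APPROVED")) break; // assumed sentinel in the reviewer prompt
    draft = (await writer.generate(
      `Revise the scorecard.\nCritique:\n${critique}\nScorecard:\n${draft}`
    )).text;
  }
  return draft;
}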

Architecture Features

  • Unique Project ID: Each evaluation uses a UUID for tracking and correlation
  • Structured Input/Output: All agents communicate through well-defined Zod schemas
  • Model Routing: Uses OpenRouter for optimal LLM selection per task (see the sketch below)
  • Tag Generation: Automatic keyword extraction for searchability

Template Reviewer Workflow Architecture

The heart of this system is the template-reviewer-workflow (src/mastra/workflows/template-reviewer-workflow/), a sophisticated multi-step evaluation pipeline that demonstrates advanced workflow orchestration patterns.

Workflow Overview

The template reviewer workflow implements a 4-phase evaluation process with parallel execution where possible:

templateReviewerWorkflow = createWorkflow({
  id: "template-reviewer-workflow",
  description: "Coordinator that launches the full template-review workflow",
  inputSchema: templateReviewerWorkflowInputSchema,
  outputSchema: templateReviewerWorkflowOutputSchema,
})
  .then(createStep({ id: "clone-project" }))       // Phase 1: Setup
  .parallel([                                      // Phase 2: Parallel Analysis
    createStep({ id: "setup-project-repo" }),
    createStep({ id: "claims-extractor" }),
  ])
  .then(createStep({ id: "executor-and-scorer" })) // Phase 3: Testing & Scoring
  .commit();

Core Workflow Components

1. Project Setup & Cloning

  • Creates new project entity with UUID tracking
  • Persists project metadata to database
  • Initializes evaluation context with environment configuration (a setup sketch follows below)
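
A minimal sketch of that setup phase, assuming a hypothetical cloneAndPrepare helper built on git and npm via child_process:

import { exec } from "node:child_process";
import { promisify } from "node:util";
import { writeFile } from "node:fs/promises";
import { randomUUID } from "node:crypto";

const run = promisify(exec);

export async function cloneAndPrepare(repoURL: string, envConfig: Record<string, string>) {
  const projectId = randomUUID();             // UUID used for tracking and correlation
  const dir = `/tmp/evals/${projectId}`;      // assumed working-directory layout
  await run(`git clone ${repoURL} ${dir}`);   // clone the submission
  await run("npm install", { cwd: dir });     // install its dependencies
  const env = Object.entries(envConfig).map(([k, v]) => `${k}=${v}`).join("\n");
  await writeFile(`${dir}/.env`, env);        // write the merged environment config
  return { projectId, dir };
}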

2. Claims Extractor

Purpose: Systematically extracts the capabilities claimed by the submitted template from its project documentation and video transcripts.

Key features:

  • Dual-source analysis: Processes both documentation and video transcripts
  • Present-tense filtering: Distinguishes current capabilities from future promises
  • Structured extraction: Outputs standardized claim objects with evidence references
export const claimsSchema = z.object({
  claims: z.array(z.object({
    name: z.string().describe("Concise, verb-first summary (≤ 10 words)"),
    description: z.string().describe("Full claim text with ≤ 25-word evidence snippet")
  }))
});

Why it's critical: Claims extraction forms the foundation for all subsequent testing and evaluation. Without accurate claim identification, the testing phase cannot validate the right functionality.

3. Plan Maker

Purpose: Generates comprehensive test plans that validate extracted claims through systematic chat-based interactions.

This component is strategically vital because it:

  • Bridges claims to testing: Converts abstract capability claims into concrete, executable test scenarios
  • Resource-aware planning: Leverages a curated resource kit (PDFs, CSVs, websites, locations) for realistic testing
  • Multi-plan generation: Creates exactly 3 complementary test plans to maximize claim coverage (the plan shape is sketched below)
  • Chat-based validation: Designs conversational tests that mirror real user interactions
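
Based on the example plans shown later in this README, the plan shape can be reconstructed as a Zod schema roughly like this (a sketch, not the actual source file):

import { z } from "zod";

// Field names mirror the plan-1/plan-2/plan-3 examples below.
export const planSchema = z.object({
  id: z.string(),                                // "plan-1" ... "plan-3"
  title: z.string(),
  claims_targeted: z.array(z.string()),          // claim names this plan validates
  steps: z.array(z.object({
    message: z.string(),                         // what the tester sends to the agent
    expected_agent_behavior: z.string(),         // what a passing response looks like
  })),
  success_criteria: z.array(z.string()),
  resourcesToUse: z.array(z.object({ name: z.string(), url: z.string() })),
});

export const planMakerOutputSchema = z.object({
  plans: z.array(planSchema).length(3),          // exactly 3 complementary plans
});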

Resource Kit Integration: Agents often need sample data before they can be meaningfully tested, and the evaluator already takes care of that: it ships with sample data covering the following:

  • Document Processing: Universal Declaration of Human Rights, Sherlock Holmes stories, the Principles of Building AI Agents book from Mastra
  • Data Analysis: Iris dataset, Penguins dataset, Apple stock data
  • Web Content: Hacker News, Wikipedia pages
  • Location Data: Coordinates for weather-related testing

4. Tester Component (tester.ts)

Purpose: Executes the generated test plans and validates agent responses against success criteria. It uses a multi-pass tester-validator loop to chat with the target agent and verify that it actually does what its documentation and video demo claim.

export const testerOutputSchema = z.array(z.object({
  id: z.string(),           // Links back to plan-1, plan-2, plan-3
  passed: z.boolean(),      // Binary pass/fail result
  explanation: z.string(),  // Detailed reasoning for the result
}));

Programmatic Agent Testing with Mastra Client

A key innovation of this evaluation system is its ability to programmatically control and test agents running in separate Mastra playground instances. This is accomplished using the official @mastra/client-js library, enabling sophisticated cross-instance agent orchestration.

Multi-Instance Testing Architecture

The system operates using a dual-playground architecture:

  1. Evaluator Instance - Runs the Template Reviewer Workflow (main evaluation agent)
  2. Target Instance - Runs the project being evaluated (cloned and deployed automatically)

Key Innovation: The evaluator agent can programmatically discover, connect to, and test agents running on completely different Mastra playground instances.
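
A sketch of how the target instance might be booted and connected to; the spawn command, port choice, and fixed readiness delay are assumptions:

import { spawn } from "node:child_process";
import { MastraClient } from "@mastra/client-js";

export async function startTargetInstance(projectDir: string, port = 4112) {
  // Boot the cloned project as its own Mastra playground (assumed CLI invocation).
  const child = spawn("npx", ["mastra", "dev", "--port", String(port)], {
    cwd: projectDir,
    stdio: "inherit",
  });
  // Naive readiness wait; a real implementation would poll the server instead.
  await new Promise((resolve) => setTimeout(resolve, 15_000));
  return { child, client: new MastraClient({ baseUrl: `http://localhost:${port}/` }) };
}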

Official Mastra Client Integration

The tester component leverages the official Mastra JavaScript client (@mastra/client-js) for seamless agent communication:

import { MastraClient } from "@mastra/client-js";

export async function runPlansAgainstAgent(props: {
  port: string;
  plans: z.infer<typeof planMakerOutputSchema>["plans"];
}) {
  // Connect to target Mastra instance
  const baseUrl = `http://localhost:${props.port}/`;
  const client = new MastraClient({ baseUrl });

  // Discover available agents and choose the one with most tools
  const agents = await discoverAgentsWithClient(client);
  // ... rest of testing logic
}

Threaded Conversation Testing

The system uses stateful conversation threads for realistic multi-turn testing:

async function sendChatWithClient(
  client: MastraClient,
  agentId: string,
  messages: Messages,
  threadId?: string
): Promise<string> {
  const agent = client.getAgent(agentId);
  const res: any = await agent.generate({ messages, threadId });

  // Intelligent response parsing
  if (typeof res === "string") return res;
  if (typeof res.text === "string") return res.text;
  if (typeof res.message === "string") return res.message;
  if (typeof res.content === "string") return res.content;

  // Handle message arrays (conversation format)
  if (Array.isArray(res.messages)) {
    const last = res.messages[res.messages.length - 1];
    if (last?.content) return String(last.content);
  }

  return JSON.stringify(res);
}

Context-Aware Testing Process

The complete testing workflow demonstrates:

  1. Client Connection - Establishes connection to target Mastra instance using official client
  2. Agent Discovery - Queries available agents and their capabilities via client API
  3. Smart Selection - Chooses optimal agent based on name matching or tool count
  4. Thread Management - Creates stable conversation threads per test plan for context continuity
  5. Interactive Testing - Conducts realistic chat-based validation of claimed functionality (a per-plan sketch follows below)
  6. Evidence Collection - Documents complete interaction transcripts for transparent scoring
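
Putting steps 4-6 together, executing a single plan could look like this sketch, which reuses the sendChatWithClient helper shown above (the thread-ID scheme and transcript format are assumptions):

import { MastraClient } from "@mastra/client-js";

async function executePlan(
  client: MastraClient,
  agentId: string,
  plan: { id: string; steps: { message: string; expected_agent_behavior: string }[] }
) {
  const threadId = `eval-${plan.id}`; // one stable thread per plan preserves context
  const transcript: string[] = [];
  for (const step of plan.steps) {
    const reply = await sendChatWithClient(client, agentId, step.message, threadId);
    transcript.push(`USER: ${step.message}`, `AGENT: ${reply}`);
  }
  return transcript; // evidence handed to the validator and scorer
}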

๐Ÿ† Why This Approach Matters for Judges

This professional client-based architecture demonstrates several key advantages:

๐Ÿ”ง Official Integration

  • โœ… Standards Compliance: Uses official Mastra client library, not custom API calls
  • ๐Ÿ”„ Future-Proof: Benefits from official library updates and improvements
  • ๐Ÿ›ก๏ธ Error Handling: Robust error handling through established client patterns

๐Ÿงต Stateful Conversations

  • ๐Ÿ’ฌ Realistic Testing: Multi-turn conversations with proper context preservation
  • ๐ŸŽฏ Thread Isolation: Each test plan maintains its own conversation thread
  • ๐Ÿ“ˆ Scalable Design: Concurrent testing across multiple agent instances

๐ŸŽฏ Tool-Based Agent Selection

  • ๐Ÿ”ง Tool-Centric: Always selects the agent with the highest tool count for comprehensive testing
  • ๐Ÿ“Š Objective Criteria: Uses quantifiable metrics (tool count) rather than subjective name matching
  • ๐ŸŽฏ Optimal Coverage: Ensures testing against the most capable agent available
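
A sketch of that selection logic, assuming the client's getAgents() listing exposes each agent's tools:

import { MastraClient } from "@mastra/client-js";

async function pickMostCapableAgent(client: MastraClient): Promise<string> {
  const agents = await client.getAgents(); // assumed shape: { [agentId]: { tools, ... } }
  let bestId = "";
  let bestToolCount = -1;
  for (const [id, info] of Object.entries(agents)) {
    const toolCount = Object.keys((info as any).tools ?? {}).length;
    if (toolCount > bestToolCount) {
      bestId = id;
      bestToolCount = toolCount;
    }
  }
  return bestId;
}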

This approach showcases integration patterns with the Mastra ecosystem, demonstrating how to build sophisticated agent orchestration systems using official tooling rather than ad-hoc API integrations.

5. Scorer Component (scorer.ts)

Purpose: Provides comprehensive evaluation across multiple dimensions with detailed explanations.

export const scorerOutputSchema = z.object({
  descriptionQuality: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  tests: testerOutputSchema,  // Integration with test results
  appeal: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  creativity: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  architecture: z.object({
    agents: z.object({ count: z.number() }),
    tools: z.object({ count: z.number() }),
    workflows: z.object({ count: z.number() }),
  }),
  tags: z.array(z.string()),  // Automatic categorization
});

Workflow Execution Flow

  1. Input Processing: Accepts project name, repository URL, description, video URL, and optional environment configuration
  2. Parallel Phase:
    • Repo Setup: Clones repository, runs npm install, creates .env file
    • Claims Analysis: Extracts video transcript, analyzes documentation, identifies capabilities
  3. Plan Generation: Creates 3 targeted test plans based on extracted claims
  4. Test Execution: Runs chat-based tests against the deployed project
  5. Final Scoring: Generates comprehensive evaluation with detailed explanations

Why This Architecture Matters

Systematic Claim Validation

Unlike ad-hoc evaluation approaches, this workflow ensures every stated capability is systematically:

  • Documented (claims extractor)
  • Planned for testing (plan maker)
  • Empirically validated (tester)
  • Scored with evidence (scorer)

Reproducible Evaluation Process

The workflow creates an audit trail from initial claims through final scores, enabling:

  • Traceability: Every score traces back to specific test results
  • Reproducibility: Same project always produces consistent evaluations
  • Comparative Analysis: Standardized scoring enables project comparisons

Parallel Processing Optimization

Smart parallelization reduces evaluation time:

  • Repository setup and claims extraction run concurrently
  • Video processing happens alongside documentation analysis
  • Database persistence is optimized for workflow state management

Real-World Example: Evaluating Deep Research Assistant

Let's walk through how our template reviewer workflow would evaluate the Deep Research Assistant project:

Input Processing

{
  "name": "Deep Research Assistant",
  "repoURLOrShorthand": "https://github.com/mastra-ai/template-deep-research.git",
  "description": "Advanced AI deep research assistant with human-in-the-loop workflows",
  "videoURL": "https://youtube.com/watch?v=demo-video",
  "envConfig": {
    "EXA_API_KEY": "demo-key-for-testing"
  }
}

๐Ÿ” Claims Extraction Output

Our claims extractor would identify these present-tense capabilities:

{
  "claims": [
    {
      "name": "Implements interactive human-in-loop research system",
      "description": "Creates an interactive, human-in-the-loop research system that allows users to explore topics - README line 3"
    },
    {
      "name": "Searches web using Exa API integration",
      "description": "webSearchTool: Searches the web using the Exa API for relevant information - README line 15"
    },
    {
      "name": "Evaluates research result relevance automatically",
      "description": "evaluateResultTool: Assesses result relevance to the research topic - README line 16"
    },
    {
      "name": "Generates comprehensive markdown reports",
      "description": "reportAgent: Transforms research findings into comprehensive markdown reports - README line 20"
    },
    {
      "name": "Extracts key learnings and follow-up questions",
      "description": "extractLearningsTool: Identifies key learnings and generates follow-up questions - README line 17"
    }
  ]
}

Generated Test Plans

Our plan maker would create 3 targeted chat-based test plans:

Plan 1: End-to-End Research Process

{
  "id": "plan-1",
  "title": "Validate complete research workflow with report generation",
  "claims_targeted": [
    "Searches web using Exa API integration",
    "Generates comprehensive markdown reports"
  ],
  "steps": [
    {
      "message": "I need you to research 'AI agent frameworks in 2024' and provide me with a comprehensive analysis. Please use the Principles of Building AI Agents document at https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf as a reference.",
      "expected_agent_behavior": "Should initiate web search using Exa API, retrieve relevant information, and reference the provided PDF"
    },
    {
      "message": "Now generate a final research report in markdown format with your findings.",
      "expected_agent_behavior": "Should produce a well-structured markdown report containing research findings, analysis, and references"
    }
  ],
  "success_criteria": [
    "Successfully searches web using Exa API",
    "References the provided PDF document",
    "Generates properly formatted markdown report",
    "Report contains research findings and analysis"
  ],
  "resourcesToUse": [
    {"name": "AI Agent Principles PDF", "url": "https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf"}
  ]
}

Plan 2: Result Evaluation and Learning Extraction

{
  "id": "plan-2",
  "title": "Test relevance evaluation and learning extraction capabilities",
  "claims_targeted": [
    "Evaluates research result relevance automatically",
    "Extracts key learnings and follow-up questions"
  ],
  "steps": [
    {
      "message": "Research Python programming trends using information from https://en.wikipedia.org/wiki/Python_(programming_language) and evaluate how relevant each piece of information is to modern software development.",
      "expected_agent_behavior": "Should retrieve Wikipedia content, assess relevance of different sections, and provide relevance ratings"
    },
    {
      "message": "Based on your research, extract the top 3 key learnings and suggest 2 follow-up research questions.",
      "expected_agent_behavior": "Should identify key insights from the research and generate relevant follow-up questions for deeper investigation"
    }
  ],
  "success_criteria": [
    "Demonstrates relevance evaluation for search results",
    "Extracts meaningful key learnings from research data",
    "Generates logical follow-up research questions",
    "Shows clear reasoning for relevance assessments"
  ],
  "resourcesToUse": [
    {"name": "Wikipedia Python Page", "url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}
  ]
}

Plan 3: Multi-Source Research Integration

{
  "id": "plan-3",
  "title": "Validate research across multiple data sources and formats",
  "claims_targeted": [
    "Implements interactive human-in-loop research system",
    "Searches web using Exa API integration"
  ],
  "steps": [
    {
      "message": "Research current trends in data science by analyzing information from https://news.ycombinator.com/ and correlating it with data patterns from the Iris dataset at https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
      "expected_agent_behavior": "Should fetch and analyze both web content from Hacker News and CSV data, then find correlations or connections"
    },
    {
      "message": "Summarize how the current discussions on Hacker News relate to data science methodologies, using examples from the Iris dataset analysis.",
      "expected_agent_behavior": "Should synthesize findings from both sources and demonstrate connections between current discussions and classic data science examples"
    }
  ],
  "success_criteria": [
    "Successfully processes both web content and CSV data",
    "Demonstrates integration across multiple data formats",
    "Provides meaningful synthesis of disparate information sources",
    "Shows ability to correlate web discussions with data analysis"
  ],
  "resourcesToUse": [
    {"name": "Hacker News", "url": "https://news.ycombinator.com/"},
    {"name": "Iris Dataset", "url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"}
  ]
}


Sample Result View

Sample Test Results

[
  {
    "id": "plan-1",
    "passed": true,
    "explanation": "Successfully completed end-to-end research with Exa API integration. Generated comprehensive markdown report with proper structure and citations."
  },
  {
    "id": "plan-2",
    "passed": true,
    "explanation": "Demonstrated clear relevance evaluation process. Extracted meaningful insights and generated logical follow-up questions."
  },
  {
    "id": "plan-3",
    "passed": false,
    "explanation": "Successfully processed both data sources but failed to establish meaningful correlations between HN discussions and Iris dataset patterns."
  }
]

โญ Final Evaluation Score

{
  "descriptionQuality": {
    "score": 4,
    "explanation": "Clear, well-structured documentation with good technical detail and usage examples"
  },
  "tests": [
    {"id": "plan-1", "passed": true, "explanation": "End-to-end research workflow validated"},
    {"id": "plan-2", "passed": true, "explanation": "Relevance evaluation and learning extraction working"},
    {"id": "plan-3", "passed": false, "explanation": "Multi-source integration needs improvement"}
  ],
  "appeal": {
    "score": 4,
    "explanation": "Compelling use case for research automation with clear business value"
  },
  "creativity": {
    "score": 3,
    "explanation": "Good implementation of known patterns but limited novel approaches"
  },
  "architecture": {
    "agents": {"count": 2},
    "tools": {"count": 3},
    "workflows": {"count": 2}
  },
  "tags": [
    "exa-api",
    "web-search",
    "report-generation",
    "human-in-the-loop",
    "eligible-browserbase",
    "eligible-productivity",
    "eligible-best-overall"
  ]
}

๐Ÿ† Sponsor Track Eligibility Detection

The AI automatically detected sponsor eligibilities based on:

  • ๐ŸŒ eligible-browserbase: Uses Exa API for web search (similar to web browsing functionality)
  • โšก eligible-productivity: Research automation enhances user productivity
  • ๐Ÿฅ‡ eligible-best-overall: Solid implementation with good architecture and functionality

Additional tags would be generated for projects using:

  • eligible-smithery: Projects with @smithery/sdk dependency
  • eligible-workos: Authentication flows using @workos/node
  • eligible-arcade: Tool integrations using @arcadeai/arcadejs
  • eligible-chroma: RAG implementations with chromadb
  • eligible-recall: Crypto/blockchain functionality
  • eligible-confident-ai: Evaluation frameworks integration

Dependency Injection Architecture

Comprehensive IoC implementation using InversifyJS - a nice addition to the Mastra templates library, in my opinion (see Container Setup below).

Arcade AI Integration

Extension Enhancement: The Arcade AI integration and Google Forms workflow were added during the 3-hour extension period granted by the judges and were not part of the original hackathon submission. This demonstrates the system's extensibility and rapid integration capabilities.

Google Sheets Tool (google-sheets-tool.ts)

Seamlessly integrates with Google Sheets through Arcade AI's tool ecosystem:

export const googleSheetsTool = ({ arcadeApiKey, arcadeUserId, defaultSpreadsheetId }) => {
  return createTool({
    id: "get_google_spreadsheet",
    description: "Fetch data from a Google Spreadsheet using Arcade AI",
    execute: async ({ context }) => {
      // Fall back to the configured default when no ID is supplied in context
      const finalSpreadsheetId = context?.spreadsheetId ?? defaultSpreadsheetId;
      const client = new Arcade({ apiKey: arcadeApiKey });
      const result = await client.tools.execute({
        tool_name: "GoogleSheets.GetSpreadsheet@3.0.0",
        input: { spreadsheet_id: finalSpreadsheetId },
        user_id: arcadeUserId,
      });
      return result;
    }
  });
};

Google Forms to Evaluation Pipeline

Automated hackathon submission processing:

  1. Form Submissions - Participants submit via Google Forms (repo URL, demo video, description)
  2. Sheet Integration - Responses automatically populate Google Sheets
  3. Arcade Processing - Google Sheets tool fetches new submissions
  4. Auto-Evaluation - Each row triggers the template-reviewer-workflow (sketched below)
  5. Live Results - Scores and rankings update in real-time
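
A sketch of that bridge using current Mastra workflow-run conventions; the row shape, workflow id, and run API details are assumptions:

import { mastra } from "./mastra"; // assumed app entry exporting the Mastra instance

export async function processNewSubmissions(
  rows: Array<{ name: string; repo: string; video: string; description: string }>
) {
  for (const row of rows) {
    // One evaluation run per sheet row fetched via the Google Sheets tool.
    const run = await mastra.getWorkflow("templateReviewerWorkflow").createRunAsync();
    await run.start({
      inputData: {
        name: row.name,
        repoURLOrShorthand: row.repo,
        videoURL: row.video,
        description: row.description,
      },
    });
  }
}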

Benefits for Hackathon Organizers

  • Zero Manual Input - Forms directly feed evaluation pipeline
  • Real-time Processing - Submissions evaluated as they arrive
  • Automated Tracking - Complete audit trail from form to final score
  • Instant Rankings - Live leaderboard updates with new submissions

Container Setup (index.ts)

import "reflect-metadata";
import { Container } from "inversify";

const container = new Container();
container.bind(Config).toDynamicValue(() => new Config()).inSingletonScope();
container.bind(DB_SYMBOL).toConstantValue(getDB(container));

// Professional DI container with proper lifecycle management

Service Abstractions (infra/repositories/)

@injectable()
class ProjectRepository implements IProjectRepository {
  constructor(@inject(DB_SYMBOL) private db: Database) {}
  // Clean dependency injection with interface segregation
}

Testability Benefits

// Easy mocking for unit tests
const mockRepo = mock<IProjectRepository>();
container.rebind(PROJECT_REPO_SYMBOL).toConstantValue(mockRepo);
// Dependency injection enables effortless testing

๐Ÿ“ Project Structure

 1src/mastra/
 2โ”œโ”€โ”€ ๐Ÿค– agents/           # AI agents for specialized evaluation tasks
 3โ”‚   โ”œโ”€โ”€ claims-extractor-agent.ts     # ๐Ÿ” Claims extraction specialist
 4โ”‚   โ”œโ”€โ”€ template-reviewer-agent.ts    # ๐Ÿ“‹ Main coordinator agent
 5โ”‚   โ””โ”€โ”€ weather-agent.ts              # ๐ŸŒค๏ธ Example weather agent
 6โ”œโ”€โ”€ ๐Ÿ›๏ธ domain/          # Domain entities and business logic
 7โ”‚   โ”œโ”€โ”€ aggregates/      # ๐Ÿ“ Domain aggregates and configuration
 8โ”‚   โ”‚   โ”œโ”€โ”€ config.ts    # โš™๏ธ Application configuration
 9โ”‚   โ”‚   โ””โ”€โ”€ project/     # ๐Ÿ“‹ Project domain model
10โ”‚   โ””โ”€โ”€ shared/          # ๐Ÿ’Ž Shared value objects
11โ”‚       โ””โ”€โ”€ value-objects/
12โ”‚           โ”œโ”€โ”€ id.ts    # ๐Ÿ†” Type-safe identifiers
13โ”‚           โ””โ”€โ”€ index.ts # ๐Ÿ“ค Value object exports
14โ”œโ”€โ”€ ๐Ÿ—๏ธ infra/           # Infrastructure layer
15โ”‚   โ”œโ”€โ”€ database/        # ๐Ÿ—„๏ธ MongoDB connection and setup
16โ”‚   โ”‚   โ””โ”€โ”€ mongodb.ts   # ๐Ÿ“Š Database configuration
17โ”‚   โ”œโ”€โ”€ model/           # ๐Ÿง  AI model configuration
18โ”‚   โ”‚   โ””โ”€โ”€ index.ts     # ๐Ÿค– OpenRouter model setup
19โ”‚   โ”œโ”€โ”€ repositories/    # ๐Ÿ“š Data persistence layer
20โ”‚   โ”‚   โ””โ”€โ”€ project.ts   # ๐Ÿ“‹ Project data access
21โ”‚   โ””โ”€โ”€ services/        # ๐Ÿ”ง External service integrations
22โ”‚       โ””โ”€โ”€ video/       # ๐ŸŽฅ Video processing services
23โ”œโ”€โ”€ ๐Ÿ› ๏ธ tools/           # Mastra tools for agent capabilities
24โ”‚   โ”œโ”€โ”€ google-sheets-tool.ts # ๐Ÿ“Š Google Sheets integration via Arcade AI
25โ”‚   โ””โ”€โ”€ list-projects-tool.ts # ๐Ÿ“‹ Project listing tool
26โ”œโ”€โ”€ ๐Ÿ”„ workflows/       # Business process workflows
27โ”‚   โ”œโ”€โ”€ template-reviewer-workflow/  # ๐Ÿ“Š Main evaluation pipeline
28โ”‚   โ”‚   โ”œโ”€โ”€ claim-extractor.ts      # ๐Ÿ” Claims extraction step
29โ”‚   โ”‚   โ”œโ”€โ”€ index.ts               # ๐Ÿš€ Workflow orchestration
30โ”‚   โ”‚   โ”œโ”€โ”€ plan-maker.ts          # ๐Ÿ“‹ Test plan generation
31โ”‚   โ”‚   โ”œโ”€โ”€ scorer.ts              # โญ Evaluation scoring
32โ”‚   โ”‚   โ”œโ”€โ”€ tester.ts              # ๐Ÿงช Automated testing
33โ”‚   โ”‚   โ”œโ”€โ”€ sample-input.json      # ๐Ÿ“ Example input data
34โ”‚   โ”‚   โ””โ”€โ”€ sample-output.json     # ๐Ÿ“Š Example output structure
35โ”‚   โ””โ”€โ”€ test-workflow.ts           # ๐Ÿงช Simple test workflow
36โ””โ”€โ”€ index.ts             # ๐ŸŽฏ Main application entry point

Features

Multi-Agent Evaluation Pipeline

  1. Documentation Analysis - Comprehensive review of README and project documentation
  2. Promise Extraction - Systematic identification of project claims and features
  3. Automated Testing - Verification of promises through code execution and testing
  4. Structured Scoring - Evidence-based evaluation using defined rubrics
  5. Tag Generation - Automatic classification for searchability and organization

Evaluation Criteria

  • Documentation Quality - Clarity, completeness, usability assessment
  • Feature Completeness - Delivery verification of promised functionality
  • Reliability - Error-free operation validation through testing
  • Innovation/Impact - Novelty and significance evaluation of solution
  • Technical Implementation - Code quality and architecture assessment

Template Contribution to Mastra Ecosystem

This project represents a groundbreaking addition to Mastra's template library, introducing enterprise-grade architectural patterns that are currently missing from the official collection:

Seamless Environment Variable Injection

  • Parent-to-Child Propagation - Automatically injects parent playground's environment variables into testing playgrounds
  • API Key Inheritance - Testing environments inherit all AI API keys from the evaluator playground
  • Zero-Config Testing - Target projects receive all necessary environment variables without manual setup
  • Dynamic Configuration Merging - Combines parent playground config with project-specific environment variables
  • Effortless Multi-Instance Testing - Eliminates setup friction for cross-playground agent communication
  • Automated Environment Provisioning - Testing playgrounds get fully configured environments automatically

Dependency Injection Mastery

  • Critical Gap Filled - Addresses the complete absence of DI examples in current templates
  • InversifyJS Integration - Professional IoC container setup with decorators
  • Interface Segregation - Clean abstractions between application layers
  • Lifecycle Management - Singleton scoping and proper resource management
  • Testability Focus - Architecture designed for easy mocking and unit testing
  • Modular Design - Loosely coupled components for maximum flexibility

Advanced Environment Variable Management

A critical innovation for seamless multi-playground testing:

// Automatic environment injection from parent to testing playgrounds
envConfig: {
  ...container.get(Config).aiAPIKeys,  // Parent playground's AI keys
  ...inputData.envConfig,              // Project-specific overrides
}

Key Benefits:

  • Zero-Config Testing - Testing playgrounds inherit all necessary API keys automatically
  • Parent-Child Propagation - Evaluator playground shares environment with target projects
  • AI Provider Continuity - OpenRouter, OpenAI, and other API keys propagate seamlessly
  • Configuration Merging - Smart combination of parent config with project-specific variables
  • Friction-Free Setup - Eliminates manual environment setup for cross-instance testing
  • Secure Key Management - Centralizes API key management in the parent evaluator instance

Why This Matters for Mastra Templates: This pattern solves a critical pain point in multi-instance Mastra deployments where testing environments need access to the same API keys and configurations as the parent system, enabling truly automated evaluation workflows.

๐Ÿข Enterprise-Ready Architecture

Unlike other templates focused on simple demos, this showcases:

  • ๐Ÿ—๏ธ Production Patterns - Battle-tested enterprise architectural decisions
  • ๐Ÿ“ˆ Scalability Design - Built to handle complex business domains
  • ๐Ÿ›ก๏ธ Maintainability - Clean code principles and SOLID design patterns
  • ๐Ÿ” Observability - Comprehensive logging and monitoring integration

Prerequisites

  • Node.js >= 20.9.0
  • MongoDB instance for data persistence
  • OpenRouter API key for LLM access
  • Project repository or documentation to evaluate

Installation

  1. Clone the repository:

git clone <repository-url>
cd mastra-template-evaluator

  2. Install dependencies:

npm install

  3. Set up environment variables:

# Configure OpenRouter API key and MongoDB connection

Usage

Development Mode

npm run dev

Build

npm run build

Production

npm start

Template Evaluation

The system can evaluate projects by:

  • Analyzing markdown documentation and README files
  • Processing video demonstrations (YouTube links)
  • Extracting and verifying feature claims
  • Running automated tests and validations
  • Generating comprehensive evaluation reports with scores and feedback

Example Evaluation Input

Here's an example of how to evaluate a Mastra template project using this system. The evaluator takes a structured input describing the project to be assessed:

{
  "name": "PDF to Questions Generator",
  "repoURLOrShorthand": "mastra-ai/template-pdf-questions",
  "videoURL": "https://youtu.be/WQ0rvX8ajeg",
  "description": "A Mastra template that demonstrates **how to protect against token limits** by generating AI summaries from large datasets before passing as output from tool calls..."
}

Key Input Fields:

  • name: Human-readable project name for identification
  • repoURLOrShorthand: GitHub repository (full URL or owner/repo format)
  • videoURL: YouTube demo video for functionality analysis
  • description: Full project documentation in markdown format
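
These fields correspond to the workflow's input schema, which can be reconstructed roughly as follows (a sketch based on the fields above plus the optional envConfig shown in the Deep Research example):

import { z } from "zod";

export const templateReviewerWorkflowInputSchema = z.object({
  name: z.string(),                           // human-readable project name
  repoURLOrShorthand: z.string(),             // full URL or "owner/repo"
  videoURL: z.string(),                       // YouTube demo link
  description: z.string(),                    // full project documentation in markdown
  envConfig: z.record(z.string()).optional(), // project-specific env overrides
});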

What the evaluator does with this input:

  1. Clones the repository and sets up the project environment
  2. Extracts claims from both documentation and video transcript
  3. Generates test plans to validate claimed functionality
  4. Runs automated tests by interacting with the deployed agent
  5. Provides comprehensive scoring across multiple criteria
  6. Auto-detects sponsor track eligibility (MCP, auth, RAG, etc.)

READY TO TEST: The sample-input.json file contains a fully working test case that you can use immediately to test the evaluation system. This input evaluates the official Mastra PDF Questions template and demonstrates all evaluation features including claims extraction, test plan generation, automated testing, and comprehensive scoring.

EXAMPLE OUTPUT: The sample-output.json file shows the complete result structure from a successful evaluation run. It demonstrates the comprehensive scoring, test results, sponsor track tags, and detailed analysis that the system produces for each evaluated project.

VERIFIED WORKING: This sample input has been tested end-to-end and successfully evaluates the target project with full functionality validation, scoring, and sponsor track detection. Simply run the workflow with this input to see the complete evaluation process in action.

Tech Stack

Core Framework

  • Mastra - Multi-agent orchestration and workflow management
  • TypeScript - Type-safe development with full IntelliSense support
  • Node.js - Runtime environment for scalable server-side applications

AI & LLM Integration

  • OpenRouter - Multi-provider LLM routing with model selection optimization
  • AI SDK - Vercel's AI SDK for streamlined language model interactions
  • Mastra Client - Official client library for cross-instance agent communication

Data & Storage

  • MongoDB - Document database for project metadata, evaluations, and scoring data
  • Zod - Schema validation and type-safe data parsing
  • UUID - Unique identifier generation for project tracking

๐Ÿ—๏ธ Architecture & Design

  • โš™๏ธ Environment Variable Injection - Seamless config propagation from parent to testing playgrounds
  • Dependency Injection - InversifyJS IoC container for loose coupling and testability
  • Repository Pattern - Data access abstraction with clean interfaces

๐Ÿงช Testing & Quality

  • Multi-Agent Testing - Automated agent interaction validation
  • Chat-based Validation - Conversational testing with stateful threads
  • Evidence Collection - Comprehensive interaction logging and analysis

External Services

  • Arcade AI - Google Sheets integration and tool ecosystem access
  • TranscriptAPI - Video transcript extraction and processing
  • YouTube Integration - Demo video analysis and content extraction
  • Git Integration - Automated repository cloning and project setup

Dependencies

Core Framework

  • @mastra/core: Core Mastra framework functionality
  • @mastra/libsql: SQLite storage for telemetry and evaluations
  • @mastra/memory: Memory management for agent persistence
  • @mastra/loggers: Logging infrastructure

AI and LLM Integration

  • @openrouter/ai-sdk-provider: OpenRouter LLM provider
  • ai: AI SDK for language model interactions

๐Ÿ›๏ธ Infrastructure & Architecture

  • inversify: ๐Ÿ’‰ Dependency injection container (IoC)
  • mongodb: ๐Ÿ—„๏ธ MongoDB database driver
  • zod: โœ… Schema validation and type safety
  • reflect-metadata: ๐ŸŽญ Decorator metadata reflection

Integration & Tools

  • @arcadeai/arcadejs: Arcade AI SDK for Google Sheets integration
  • uuid: Unique identifier generation

Development

  • mastra: CLI tools for development and deployment
  • typescript: TypeScript support and compilation
  • @types/node: Node.js type definitions

Multi-Agent Benefits

This architecture provides several advantages over single-agent approaches:

  1. Specialization - Each agent focuses on a specific domain (documentation, testing, scoring)
  2. Clarity - Clear separation of concerns improves reliability and maintainability
  3. Scalability - Agents can run concurrently where appropriate
  4. Accuracy - Writer-reviewer pattern in scoring agent ensures high-quality evaluations
  5. Flexibility - Different models can be used for different complexity levels

Hackathon Evaluation Process

Submission Processing Pipeline

  1. Project Intake

    • Repository URL and documentation analysis
    • Demo video transcript extraction and processing
    • Environment setup and dependency detection
  2. Parallel Intelligence Gathering

    • Claims Extraction: AI identifies all stated project capabilities
    • Repository Analysis: Code scanning for architectural patterns and sponsor integrations
    • Video Analysis: Demo functionality validation from transcript
  3. Test Plan Generation

    • AI Test Designer: Creates 3 targeted test plans per project
    • Resource Allocation: Assigns appropriate datasets, PDFs, and web resources
    • Interaction Planning: Designs realistic user scenarios for agent testing
  4. Live Functionality Testing

    • Automated Agent Interaction: Tests each claimed feature through chat interfaces
    • Success Validation: Empirical verification against stated capabilities
    • Evidence Collection: Detailed logs and response analysis
  5. Multi-Dimensional Scoring

    • Technical Merit: Architecture quality, code patterns, innovation
    • Functional Completeness: Validation of all claimed features
    • Sponsor Alignment: Automatic detection of prize track eligibility
    • Impact Assessment: Productivity gains, user value, market potential
  6. Results Compilation

    • Detailed Scorecards: Transparent breakdown of all evaluation criteria
    • Prize Recommendations: AI-identified sponsor track matches
    • Improvement Feedback: Specific suggestions for enhancement
    • Comparative Ranking: Position relative to other submissions

Value for Mastra.Build Hackathon

Immediate Hackathon Impact

Judges & Organizers

  • 10x faster evaluation: Process hundreds of submissions in hours, not days
  • Consistent scoring: Every project evaluated using the same rigorous criteria
  • Data-driven decisions: Replace gut feelings with empirical evidence
  • Automatic categorization: AI identifies sponsor prize eligibility instantly
  • Detailed rankings: Transparent scoring breakdown for every submission

Participants

  • Clear expectations: Understand exactly how projects will be evaluated
  • Immediate feedback: Get detailed analysis of strengths and improvement areas
  • Strategic insights: See which sponsor tracks your project aligns with
  • Fair evaluation: No bias based on presentation skills or demo timing
  • Learning opportunity: Understand enterprise-grade Mastra patterns

๐Ÿ›๏ธ Long-term Mastra Ecosystem Value

๐ŸŒŸ Template Library Leadership

This template pioneers critical architectural patterns missing from Mastra's current library:

  • โš™๏ธ Environment Variable Injection: Seamless config propagation from parent to testing playgrounds
  • ๐Ÿ’‰ Dependency Injection: Professional IoC container setup with InversifyJS
  • ๐Ÿข Enterprise Architecture: Production-ready patterns for complex business logic
  • ๐Ÿงช Systematic Testing: Multi-agent evaluation workflows for quality assurance

๐Ÿ“š Educational Excellence

  • ๐Ÿ“– Reference Implementation: Complete environment injection + DI example for enterprise developers
  • ๐ŸŽฏ Real Business Logic: Project evaluation domain demonstrates complex workflows
  • ๐Ÿ”ง Best Practices: Proper error handling, logging, and monitoring integration
  • โšก Scalability Patterns: Built to handle hundreds of concurrent evaluations

๐Ÿš€ Market Positioning

  • ๐Ÿข Enterprise Credibility: Positions Mastra as enterprise-capable framework
  • ๐Ÿ‘ฅ Developer Attraction: Sophisticated examples attract senior developers
  • ๐Ÿ“Š Quality Standards: Establishes architectural benchmarks for future templates
  • ๐Ÿ”ฎ Ecosystem Foundation: Enables complex multi-agent applications in production

Acknowledgments

Thanks to the Mastra team for creating the PDF Questions template which was used as a target to test this evaluation tool.

I've uploaded a demo video to YouTube at https://www.youtube.com/watch?v=WQ0rvX8ajeg just to test things out - if the Mastra team would like it taken down, please reach out!

Thanks to TranscriptAPI for providing video transcription services with permission for this hackathon project.

License

ISC