# Mastra.Build Hackathon Evaluator

Automated, unbiased evaluation system for Mastra.Build hackathon submissions using advanced multi-agent workflows.

[Watch the demo video](https://www.youtube.com/watch?v=WQ0rvX8ajeg) - see the evaluator in action!

This system revolutionizes hackathon judging by replacing subjective manual reviews with systematic, data-driven evaluation. Built specifically for the Mastra.Build hackathon, it automatically evaluates submitted projects, validates claimed features through live testing, and provides sponsor-aligned scoring with track eligibility detection.
## Core Purpose

**Problem:** Mastra.Build hackathon judging requires evaluating diverse AI agent projects consistently across multiple sponsor prize categories.

**Solution:** An AI-powered evaluation pipeline that:

- **Extracts verifiable claims** from project documentation and demo videos
- **Tests functionality** through automated agent interactions
- **Scores objectively** using standardized criteria across all submissions
- **Tags for sponsor tracks** with automatic eligibility detection for Smithery, WorkOS, Browserbase, Arcade, Chroma, Recall, and Confident AI prizes
- **Ranks submissions** with transparent, auditable results
## Why This Matters for Mastra.Build

This system transforms Mastra.Build evaluation from "subjective demos" to "empirical validation":

- **Eliminates judging bias** through systematic evaluation criteria
- **Validates AI agent functionality** instead of relying on presentations alone
- **Automatically detects sponsor alignment** for prize categories (MCP servers, auth integration, web browsing, etc.)
- **Scales** to evaluate hundreds of Mastra framework submissions efficiently
- **Provides detailed feedback** to help Mastra community members improve their agents
## Novel Approach: Mastra Evaluating Mastra

**Revolutionary insight:** This project demonstrates a groundbreaking approach to AI agent evaluation by using the Mastra framework to evaluate Mastra-built agents.

### Self-Evaluation Architecture

Rather than building evaluation as a separate framework or external tool, we've created something unprecedented:

- **Mastra Agents Evaluating Mastra Agents** - The evaluator itself is a sophisticated Mastra multi-agent workflow
- **Native Framework Integration** - Deep understanding of Mastra patterns, conventions, and architectural decisions
- **Cross-Instance Communication** - Uses the official `@mastra/client-js` library to programmatically test agents running in separate Mastra instances
- **Framework-Aware Testing** - Inherent knowledge of Mastra workflows, tools, and agent patterns enables more intelligent evaluation
### Why This Matters Beyond Hackathons

This "framework evaluating itself" approach represents a new paradigm in AI system assessment:

**Traditional Approach:**

```text
External Eval Tool → Tests → AI Framework Project
```

**Our Novel Approach:**

```text
Mastra Evaluator Agent → Tests → Mastra Target Agent
(Same framework, deep native understanding)
```
### Unique Advantages

- **Native Intelligence** - The evaluator inherently understands Mastra conventions, making evaluations more contextually accurate
- **Self-Improving Ecosystem** - Insights from evaluations can directly improve the framework itself
- **Framework-Specific Metrics** - Evaluation criteria tailored specifically to Mastra's multi-agent, workflow-oriented architecture
- **Proof of Concept** - Demonstrates Mastra's capability to build sophisticated, production-ready evaluation systems

**Industry first:** To our knowledge, this is the first time a multi-agent framework has been used to systematically evaluate projects built with itself, showcasing both the maturity and self-reflective capabilities of the Mastra ecosystem.
## Sponsor Prize Track Detection

A key differentiator of this evaluation system is automated sponsor alignment detection. The AI scorer analyzes project dependencies, functionality, and implementation patterns to identify eligibility for specific sponsor prize categories.

### Mastra.Build Prize Categories
- Best overall (judged by Mastra)
- Best MCP server (judged by Smithery)
- Bonus award: Best use of Smithery (Switch2)
- Best use of AgentNetwork (judged by Mastra)
- Best use of auth (judged by WorkOS)
- Best use of web browsing (judged by Browserbase)
- Best use of tool provider (judged by Arcade)
- Best RAG template (judged by Chroma)
- Best productivity (judged by Mastra)
- Best coding agent (judged by Mastra)
- Best crypto agent (judged by Recall)
- Best use of Evals (judged by Confident AI)
- Shane's favorite (judged by Shane)
- Funniest (judged by Abhi)
### Automated Tag Detection Process

The system automatically analyzes:

- **Package Dependencies** - Detects `@smithery/sdk`, `@workos/node`, `browserbase`, `@arcadeai/arcadejs`, `chromadb`, etc. (see the dependency-scan sketch after this list)
- **Code Patterns** - Identifies authentication flows, web scraping, RAG implementations, MCP server structures
- **Documentation Keywords** - Extracts mentions of sponsor technologies and use cases
- **Functionality Testing** - Validates actual integration with sponsor services
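To make the dependency check concrete, here is a minimal sketch of how package-manifest scanning could map dependencies to the eligibility tags used later in this README; the helper name and exact mapping are illustrative assumptions, not the project's actual code.

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative mapping from detectable dependencies to the sponsor tags
// shown later in this README (assumed, not the exact implementation).
const SPONSOR_DEPENDENCY_TAGS: Record<string, string> = {
  "@smithery/sdk": "eligible-smithery",
  "@workos/node": "eligible-workos",
  browserbase: "eligible-browserbase",
  "@arcadeai/arcadejs": "eligible-arcade",
  chromadb: "eligible-chroma",
};

// Scan a cloned project's package.json and emit any matching sponsor tags.
export function detectSponsorTags(projectDir: string): string[] {
  const pkg = JSON.parse(readFileSync(join(projectDir, "package.json"), "utf8"));
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  return Object.entries(SPONSOR_DEPENDENCY_TAGS)
    .filter(([dep]) => dep in deps)
    .map(([, tag]) => tag);
}
```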
**Important note:** This system is designed to assist and accelerate the evaluation process, not replace human judgment. While it provides systematic analysis and scoring, human review remains essential for final prize decisions, especially for subjective categories like "Shane's favorite" and "Funniest". The AI evaluation serves as a comprehensive first-pass filter and detailed analysis tool for judges.
## Key Features

- **Environment Variable Injection** - Seamless config propagation from parent to testing playgrounds
- **Dependency Injection** - Leverages InversifyJS for loose coupling and testability
- **Multi-Agent Coordination** - Specialized agents working in orchestrated harmony
- **Arcade AI Integration** - Direct Google Sheets access through Arcade's tool ecosystem
- **Automated Form Processing** - Google Forms responses automatically feed into the evaluation workflow
- **Template Ready** - A complete Mastra template showcasing advanced patterns
## Architecture

The system uses a multi-agent pipeline with the following specialized components:

### Core Agents

- **Template Reviewer Agent** - Main evaluation agent that coordinates the assessment process
- **Documentation Review Agent** - Analyzes project documentation for clarity and completeness, and extracts metadata
- **Promise Extraction Agent** - Identifies and extracts stated features, claims, and guarantees from documentation
- **Testing Agent** - Verifies promises through automated testing and validation
- **Scoring Agent** - Provides the final evaluation using a writer-reviewer pattern for high accuracy (a sketch follows this list)
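The writer-reviewer pattern is only named above, so here is a minimal sketch of one way it could be wired up with the AI SDK and OpenRouter; the prompts, model ID, and single review pass are assumptions, not the project's exact implementation.

```typescript
import { generateText } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";

const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY! });
const model = openrouter("openai/gpt-4o"); // model choice is an assumption

// Two-pass scoring: a "writer" drafts the evaluation, then a "reviewer"
// critiques the draft and returns a corrected final version.
export async function scoreWithReview(evidence: string): Promise<string> {
  const draft = await generateText({
    model,
    prompt: `Draft a scored evaluation of this project evidence:\n\n${evidence}`,
  });

  const reviewed = await generateText({
    model,
    prompt:
      `Review the following draft evaluation for unsupported claims and ` +
      `scoring errors, then return a corrected final version:\n\n${draft.text}`,
  });

  return reviewed.text;
}
```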
### Architecture Features

- **Unique Project ID** - Each evaluation uses a UUID for tracking and correlation
- **Structured Input/Output** - All agents communicate through well-defined Zod schemas
- **Model Routing** - Uses OpenRouter for optimal LLM selection per task
- **Tag Generation** - Automatic keyword extraction for searchability
## Template Reviewer Workflow Architecture

The heart of this system is the template-reviewer workflow (`src/mastra/workflows/template-reviewer-workflow/`), a sophisticated multi-step evaluation pipeline that demonstrates advanced workflow orchestration patterns.

### Workflow Overview

The template reviewer workflow implements a 4-phase evaluation process with parallel execution where possible:
```typescript
templateReviewerWorkflow = createWorkflow({
  id: "template-reviewer-workflow",
  description: "Coordinator that launches the full template-review workflow",
  inputSchema: templateReviewerWorkflowInputSchema,
  outputSchema: templateReviewerWorkflowOutputSchema,
})
  .then(createStep({ id: "clone-project" }))       // Phase 1: Setup
  .parallel([                                      // Phase 2: Parallel Analysis
    createStep({ id: "setup-project-repo" }),
    createStep({ id: "claims-extractor" }),
  ])
  .then(createStep({ id: "executor-and-scorer" })) // Phase 3: Testing & Scoring
  .commit();
```
### Core Workflow Components

#### 1. Project Setup & Cloning

- Creates a new project entity with UUID tracking
- Persists project metadata to the database
- Initializes the evaluation context with environment configuration
#### 2. Claims Extractor

**Purpose:** Systematically extracts the capabilities claimed by the submitted template from project documentation and video transcripts.

Key features:

- **Dual-source analysis** - Processes both documentation and video transcripts
- **Present-tense filtering** - Distinguishes current capabilities from future promises
- **Structured extraction** - Outputs standardized claim objects with evidence references
```typescript
export const claimsSchema = z.object({
  claims: z.array(z.object({
    name: z.string().describe("Concise, verb-first summary (≤ 10 words)"),
    description: z.string().describe("Full claim text with ≤ 25-word evidence snippet"),
  })),
});
```
**Why it's critical:** Claims extraction forms the foundation for all subsequent testing and evaluation. Without accurate claim identification, the testing phase cannot validate the right functionality. A hedged sketch of how the schema could drive structured extraction follows.
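As an illustration, extraction against this schema could be driven through the AI SDK's structured-output API; the model ID, prompt, and function wrapper below are assumptions rather than the project's exact code.

```typescript
import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
// claimsSchema as defined above

const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY! });

// Illustrative call: ask the model for claims that validate against claimsSchema.
export async function extractClaims(readme: string, transcript: string) {
  const { object } = await generateObject({
    model: openrouter("openai/gpt-4o-mini"), // model choice is an assumption
    schema: claimsSchema,
    prompt:
      `Extract present-tense capability claims from this README and ` +
      `video transcript:\n\n${readme}\n\n${transcript}`,
  });
  return object.claims; // [{ name, description }, ...]
}
```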
#### 3. Plan Maker

**Purpose:** Generates comprehensive test plans that validate extracted claims through systematic chat-based interactions.

This component is strategically vital because it:

- **Bridges claims to testing** - Converts abstract capability claims into concrete, executable test scenarios
- **Plans around resources** - Leverages a curated resource kit (PDFs, CSVs, websites, locations) for realistic testing
- **Generates multiple plans** - Creates exactly 3 complementary test plans to maximize claim coverage
- **Validates via chat** - Designs conversational tests that mirror real user interactions
**Resource Kit Integration:** Some agents can only be tested meaningfully with sample data, and we have already taken care of that! This evaluator includes sample data for the following (a sketch of the plan schema follows this list):

- **Document Processing** - Universal Declaration of Human Rights, Sherlock Holmes stories, Mastra's "Principles of Building AI Agents" book
- **Data Analysis** - Iris dataset, Penguins dataset, Apple stock data
- **Web Content** - Hacker News, Wikipedia pages
- **Location Data** - Coordinates for weather-related testing
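The plan maker's output schema is not shown in this README; based on the example plans later in this document, a plausible Zod sketch of `planMakerOutputSchema` (field names inferred from those samples, so treat it as an approximation) is:

```typescript
import { z } from "zod";

// Approximate shape of planMakerOutputSchema, inferred from the sample
// plans shown later in this README; a sketch, not the actual source.
export const planMakerOutputSchema = z.object({
  plans: z.array(z.object({
    id: z.string(), // "plan-1" ... "plan-3"
    title: z.string(),
    claims_targeted: z.array(z.string()),
    steps: z.array(z.object({
      message: z.string(), // chat message sent to the target agent
      expected_agent_behavior: z.string(),
    })),
    success_criteria: z.array(z.string()),
    resourcesToUse: z.array(z.object({ name: z.string(), url: z.string() })),
  })),
});
```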
#### 4. Tester Component (`tester.ts`)

**Purpose:** Executes the generated test plans and validates agent responses against success criteria. It uses a multi-pass tester-validator loop to chat with the target agent and verify that the agent does what it claims in its documentation and video demos.
```typescript
export const testerOutputSchema = z.array(z.object({
  id: z.string(),          // Links back to plan-1, plan-2, plan-3
  passed: z.boolean(),     // Binary pass/fail result
  explanation: z.string(), // Detailed reasoning for the result
}));
```
### Programmatic Agent Testing with the Mastra Client

A key innovation of this evaluation system is its ability to programmatically control and test agents running in separate Mastra playground instances. This is accomplished using the official `@mastra/client-js` library, enabling sophisticated cross-instance agent orchestration.
#### Multi-Instance Testing Architecture

The system operates using a dual-playground architecture:

- **Evaluator Instance** - Runs the Template Reviewer Workflow (the main evaluation agent)
- **Target Instance** - Runs the project being evaluated (cloned and deployed automatically)

**Key innovation:** The evaluator agent can programmatically discover, connect to, and test agents running on completely different Mastra playground instances.
#### Official Mastra Client Integration

The tester component leverages the official Mastra JavaScript client (`@mastra/client-js`) for seamless agent communication:
```typescript
import { MastraClient } from "@mastra/client-js";

export async function runPlansAgainstAgent(props: {
  port: string;
  plans: z.infer<typeof planMakerOutputSchema>["plans"];
}) {
  // Connect to the target Mastra instance
  const baseUrl = `http://localhost:${props.port}/`;
  const client = new MastraClient({ baseUrl });

  // Discover available agents and choose the one with the most tools
  const agents = await discoverAgentsWithClient(client);
  // ... rest of testing logic
}
```
#### Threaded Conversation Testing

The system uses stateful conversation threads for realistic multi-turn testing:
```typescript
async function sendChatWithClient(
  client: MastraClient,
  agentId: string,
  messages: Messages,
  threadId?: string
): Promise<string> {
  const agent = client.getAgent(agentId);
  const res: any = await agent.generate({ messages, threadId });

  // Intelligent response parsing
  if (typeof res === "string") return res;
  if (typeof res.text === "string") return res.text;
  if (typeof res.message === "string") return res.message;
  if (typeof res.content === "string") return res.content;

  // Handle message arrays (conversation format)
  if (Array.isArray(res.messages)) {
    const last = res.messages[res.messages.length - 1];
    if (last?.content) return String(last.content);
  }

  return JSON.stringify(res);
}
```
#### Context-Aware Testing Process

The complete testing workflow demonstrates:

- **Client Connection** - Establishes a connection to the target Mastra instance using the official client
- **Agent Discovery** - Queries available agents and their capabilities via the client API
- **Smart Selection** - Chooses the optimal agent based on name matching or tool count
- **Thread Management** - Creates stable conversation threads per test plan for context continuity
- **Interactive Testing** - Conducts realistic chat-based validation of claimed functionality
- **Evidence Collection** - Documents complete interaction transcripts for transparent scoring (a per-plan loop sketch follows this list)
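Putting these pieces together, the per-plan loop might look like the following sketch; the thread-ID convention and the deferred validator step are assumptions layered on the `sendChatWithClient` helper shown above.

```typescript
// Illustrative per-plan loop built on sendChatWithClient from above.
// The stable thread ID per plan is an assumed convention.
async function runPlan(
  client: MastraClient,
  agentId: string,
  plan: { id: string; steps: { message: string; expected_agent_behavior: string }[] }
) {
  const threadId = `thread-${plan.id}`; // one stable thread per test plan
  const transcript: { message: string; response: string }[] = [];

  for (const step of plan.steps) {
    const response = await sendChatWithClient(
      client,
      agentId,
      [{ role: "user", content: step.message }],
      threadId
    );
    transcript.push({ message: step.message, response });
  }

  // A validator pass would judge this transcript against the plan's
  // success criteria and produce { id, passed, explanation }.
  return { id: plan.id, transcript };
}
```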
### Why This Approach Matters for Judges

This professional client-based architecture demonstrates several key advantages:

#### Official Integration

- **Standards Compliance** - Uses the official Mastra client library, not custom API calls
- **Future-Proof** - Benefits from official library updates and improvements
- **Error Handling** - Robust error handling through established client patterns

#### Stateful Conversations

- **Realistic Testing** - Multi-turn conversations with proper context preservation
- **Thread Isolation** - Each test plan maintains its own conversation thread
- **Scalable Design** - Concurrent testing across multiple agent instances
#### Tool-Based Agent Selection

- **Tool-Centric** - Always selects the agent with the highest tool count for comprehensive testing (see the sketch after this list)
- **Objective Criteria** - Uses a quantifiable metric (tool count) rather than subjective name matching
- **Optimal Coverage** - Ensures testing against the most capable agent available
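`discoverAgentsWithClient` is referenced earlier but not shown; here is a minimal sketch, assuming `client.getAgents()` resolves to a record keyed by agent ID whose entries expose a `tools` map (the exact response shape may differ between client versions):

```typescript
import { MastraClient } from "@mastra/client-js";

// Sketch of the discovery helper referenced in runPlansAgainstAgent above.
// Assumes client.getAgents() returns { [agentId]: { tools: {...}, ... } }.
async function discoverAgentsWithClient(client: MastraClient) {
  const agents = await client.getAgents();

  // Rank agents by tool count, highest first, so callers can pick agents[0].
  return Object.entries(agents)
    .map(([agentId, details]) => ({
      agentId,
      toolCount: Object.keys((details as any).tools ?? {}).length,
    }))
    .sort((a, b) => b.toolCount - a.toolCount);
}
```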
This approach showcases integration patterns with the Mastra ecosystem, demonstrating how to build sophisticated agent orchestration systems using official tooling rather than ad-hoc API integrations.
#### 5. Scorer Component (`scorer.ts`)

**Purpose:** Provides comprehensive evaluation across multiple dimensions with detailed explanations.
```typescript
export const scorerOutputSchema = z.object({
  descriptionQuality: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  tests: testerOutputSchema, // Integration with test results
  appeal: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  creativity: z.object({ score: z.number().min(1).max(5), explanation: z.string() }),
  architecture: z.object({
    agents: z.object({ count: z.number() }),
    tools: z.object({ count: z.number() }),
    workflows: z.object({ count: z.number() }),
  }),
  tags: z.array(z.string()), // Automatic categorization
});
```
### Workflow Execution Flow

1. **Input Processing** - Accepts project name, repository URL, description, video URL, and optional environment configuration
2. **Parallel Phase:**
   - **Repo Setup** - Clones the repository, runs `npm install`, creates the `.env` file
   - **Claims Analysis** - Extracts the video transcript, analyzes documentation, identifies capabilities
3. **Plan Generation** - Creates 3 targeted test plans based on extracted claims
4. **Test Execution** - Runs chat-based tests against the deployed project
5. **Final Scoring** - Generates a comprehensive evaluation with detailed explanations
### Why This Architecture Matters

#### Systematic Claim Validation

Unlike ad-hoc evaluation approaches, this workflow ensures every stated capability is systematically:

- **Documented** (claims extractor)
- **Planned for testing** (plan maker)
- **Empirically validated** (tester)
- **Scored with evidence** (scorer)
#### Reproducible Evaluation Process

The workflow creates an audit trail from initial claims through final scores, enabling:

- **Traceability** - Every score traces back to specific test results
- **Reproducibility** - The same project always produces consistent evaluations
- **Comparative Analysis** - Standardized scoring enables project comparisons
#### Parallel Processing Optimization

Smart parallelization reduces evaluation time:

- Repository setup and claims extraction run concurrently
- Video processing happens alongside documentation analysis
- Database persistence is optimized for workflow state management
## Real-World Example: Evaluating the Deep Research Assistant

Let's walk through how our template reviewer workflow would evaluate the Deep Research Assistant project.

### Input Processing
```json
{
  "name": "Deep Research Assistant",
  "repoURLOrShorthand": "https://github.com/mastra-ai/template-deep-research.git",
  "description": "Advanced AI deep research assistant with human-in-the-loop workflows",
  "videoURL": "https://youtube.com/watch?v=demo-video",
  "envConfig": {
    "EXA_API_KEY": "demo-key-for-testing"
  }
}
```
### Claims Extraction Output

Our claims extractor would identify these present-tense capabilities:
```json
{
  "claims": [
    {
      "name": "Implements interactive human-in-loop research system",
      "description": "Creates an interactive, human-in-the-loop research system that allows users to explore topics - README line 3"
    },
    {
      "name": "Searches web using Exa API integration",
      "description": "webSearchTool: Searches the web using the Exa API for relevant information - README line 15"
    },
    {
      "name": "Evaluates research result relevance automatically",
      "description": "evaluateResultTool: Assesses result relevance to the research topic - README line 16"
    },
    {
      "name": "Generates comprehensive markdown reports",
      "description": "reportAgent: Transforms research findings into comprehensive markdown reports - README line 20"
    },
    {
      "name": "Extracts key learnings and follow-up questions",
      "description": "extractLearningsTool: Identifies key learnings and generates follow-up questions - README line 17"
    }
  ]
}
```
### Generated Test Plans

Our plan maker would create 3 targeted chat-based test plans.

**Plan 1: End-to-End Research Process**
```json
{
  "id": "plan-1",
  "title": "Validate complete research workflow with report generation",
  "claims_targeted": [
    "Searches web using Exa API integration",
    "Generates comprehensive markdown reports"
  ],
  "steps": [
    {
      "message": "I need you to research 'AI agent frameworks in 2024' and provide me with a comprehensive analysis. Please use the Principles of Building AI Agents document at https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf as a reference.",
      "expected_agent_behavior": "Should initiate web search using Exa API, retrieve relevant information, and reference the provided PDF"
    },
    {
      "message": "Now generate a final research report in markdown format with your findings.",
      "expected_agent_behavior": "Should produce a well-structured markdown report containing research findings, analysis, and references"
    }
  ],
  "success_criteria": [
    "Successfully searches web using Exa API",
    "References the provided PDF document",
    "Generates properly formatted markdown report",
    "Report contains research findings and analysis"
  ],
  "resourcesToUse": [
    {"name": "AI Agent Principles PDF", "url": "https://hs-47815345.f.hubspotemail.net/hub/47815345/hubfs/book/principles_2nd_edition_updated.pdf"}
  ]
}
```
**Plan 2: Result Evaluation and Learning Extraction**
```json
{
  "id": "plan-2",
  "title": "Test relevance evaluation and learning extraction capabilities",
  "claims_targeted": [
    "Evaluates research result relevance automatically",
    "Extracts key learnings and follow-up questions"
  ],
  "steps": [
    {
      "message": "Research Python programming trends using information from https://en.wikipedia.org/wiki/Python_(programming_language) and evaluate how relevant each piece of information is to modern software development.",
      "expected_agent_behavior": "Should retrieve Wikipedia content, assess relevance of different sections, and provide relevance ratings"
    },
    {
      "message": "Based on your research, extract the top 3 key learnings and suggest 2 follow-up research questions.",
      "expected_agent_behavior": "Should identify key insights from the research and generate relevant follow-up questions for deeper investigation"
    }
  ],
  "success_criteria": [
    "Demonstrates relevance evaluation for search results",
    "Extracts meaningful key learnings from research data",
    "Generates logical follow-up research questions",
    "Shows clear reasoning for relevance assessments"
  ],
  "resourcesToUse": [
    {"name": "Wikipedia Python Page", "url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}
  ]
}
```
**Plan 3: Multi-Source Research Integration**
```json
{
  "id": "plan-3",
  "title": "Validate research across multiple data sources and formats",
  "claims_targeted": [
    "Implements interactive human-in-loop research system",
    "Searches web using Exa API integration"
  ],
  "steps": [
    {
      "message": "Research current trends in data science by analyzing information from https://news.ycombinator.com/ and correlating it with data patterns from the Iris dataset at https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
      "expected_agent_behavior": "Should fetch and analyze both web content from Hacker News and CSV data, then find correlations or connections"
    },
    {
      "message": "Summarize how the current discussions on Hacker News relate to data science methodologies, using examples from the Iris dataset analysis.",
      "expected_agent_behavior": "Should synthesize findings from both sources and demonstrate connections between current discussions and classic data science examples"
    }
  ],
  "success_criteria": [
    "Successfully processes both web content and CSV data",
    "Demonstrates integration across multiple data formats",
    "Provides meaningful synthesis of disparate information sources",
    "Shows ability to correlate web discussions with data analysis"
  ],
  "resourcesToUse": [
    {"name": "Hacker News", "url": "https://news.ycombinator.com/"},
    {"name": "Iris Dataset", "url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"}
  ]
}
```
### Sample Test Results
```json
[
  {
    "id": "plan-1",
    "passed": true,
    "explanation": "Successfully completed end-to-end research with Exa API integration. Generated comprehensive markdown report with proper structure and citations."
  },
  {
    "id": "plan-2",
    "passed": true,
    "explanation": "Demonstrated clear relevance evaluation process. Extracted meaningful insights and generated logical follow-up questions."
  },
  {
    "id": "plan-3",
    "passed": false,
    "explanation": "Successfully processed both data sources but failed to establish meaningful correlations between HN discussions and Iris dataset patterns."
  }
]
```
### Final Evaluation Score
```json
{
  "descriptionQuality": {
    "score": 4,
    "explanation": "Clear, well-structured documentation with good technical detail and usage examples"
  },
  "tests": [
    {"id": "plan-1", "passed": true, "explanation": "End-to-end research workflow validated"},
    {"id": "plan-2", "passed": true, "explanation": "Relevance evaluation and learning extraction working"},
    {"id": "plan-3", "passed": false, "explanation": "Multi-source integration needs improvement"}
  ],
  "appeal": {
    "score": 4,
    "explanation": "Compelling use case for research automation with clear business value"
  },
  "creativity": {
    "score": 3,
    "explanation": "Good implementation of known patterns but limited novel approaches"
  },
  "architecture": {
    "agents": {"count": 2},
    "tools": {"count": 3},
    "workflows": {"count": 2}
  },
  "tags": [
    "exa-api",
    "web-search",
    "report-generation",
    "human-in-the-loop",
    "eligible-browserbase",
    "eligible-productivity",
    "eligible-best-overall"
  ]
}
```
### Sponsor Track Eligibility Detection

The AI automatically detected sponsor eligibility based on:

- `eligible-browserbase`: Uses the Exa API for web search (similar to web-browsing functionality)
- `eligible-productivity`: Research automation enhances user productivity
- `eligible-best-overall`: Solid implementation with good architecture and functionality

Additional tags would be generated for projects using:

- `eligible-smithery`: Projects with the `@smithery/sdk` dependency
- `eligible-workos`: Authentication flows using `@workos/node`
- `eligible-arcade`: Tool integrations using `@arcadeai/arcadejs`
- `eligible-chroma`: RAG implementations with `chromadb`
- `eligible-recall`: Crypto/blockchain functionality
- `eligible-confident-ai`: Evaluation framework integration
## Dependency Injection Architecture

Comprehensive IoC implementation using InversifyJS - a nice addition to the Mastra templates library, in my opinion. The container setup, service abstractions, and testability examples appear later in this README.
## Arcade AI Integration

**Extension Enhancement:** The Arcade AI integration and Google Forms workflow were added during the 3-hour extension period granted by the judges and were not part of the original hackathon submission. This demonstrates the system's extensibility and rapid integration capabilities.
### Google Sheets Tool (`google-sheets-tool.ts`)

Seamlessly integrates with Google Sheets through Arcade AI's tool ecosystem:
```typescript
export const googleSheetsTool = ({ arcadeApiKey, arcadeUserId, defaultSpreadsheetId }) => {
  return createTool({
    id: "get_google_spreadsheet",
    description: "Fetch data from a Google Spreadsheet using Arcade AI",
    execute: async ({ context }) => {
      const client = new Arcade({ apiKey: arcadeApiKey });
      // Use the spreadsheet ID from the tool call, falling back to the configured default
      const finalSpreadsheetId = context.spreadsheetId ?? defaultSpreadsheetId;
      const result = await client.tools.execute({
        tool_name: "GoogleSheets.GetSpreadsheet@3.0.0",
        input: { spreadsheet_id: finalSpreadsheetId },
        user_id: arcadeUserId,
      });
      return result;
    },
  });
};
```
### Google Forms to Evaluation Pipeline

Automated hackathon submission processing (a polling sketch follows this list):

1. **Form Submissions** - Participants submit via Google Forms (repo URL, demo video, description)
2. **Sheet Integration** - Responses automatically populate Google Sheets
3. **Arcade Processing** - The Google Sheets tool fetches new submissions
4. **Auto-Evaluation** - Each row triggers the template-reviewer-workflow
5. **Live Results** - Scores and rankings update in real time
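A minimal sketch of what steps 3-4 could look like, assuming a `fetchSheetRows` helper built on the Google Sheets tool above and a `startEvaluation` function that kicks off the reviewer workflow; both names are hypothetical.

```typescript
// Hypothetical glue code: poll the submissions sheet and evaluate new rows.
// fetchSheetRows and startEvaluation are illustrative names, not real exports.
declare function fetchSheetRows(): Promise<
  Array<{ projectName: string; repoURL: string; videoURL: string; description: string }>
>;
declare function startEvaluation(input: {
  name: string;
  repoURLOrShorthand: string;
  videoURL: string;
  description: string;
}): Promise<void>;

const seen = new Set<string>();

async function pollSubmissions() {
  for (const row of await fetchSheetRows()) {
    if (seen.has(row.repoURL)) continue; // skip already-evaluated submissions
    seen.add(row.repoURL);
    await startEvaluation({
      name: row.projectName,
      repoURLOrShorthand: row.repoURL,
      videoURL: row.videoURL,
      description: row.description,
    });
  }
}

setInterval(pollSubmissions, 60_000); // check for new submissions every minute
```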
### Benefits for Hackathon Organizers

- **Zero Manual Input** - Forms directly feed the evaluation pipeline
- **Real-Time Processing** - Submissions are evaluated as they arrive
- **Automated Tracking** - Complete audit trail from form to final score
- **Instant Rankings** - Live leaderboard updates with new submissions
### Container Setup (`index.ts`)

```typescript
import "reflect-metadata";
import { Container } from "inversify";

const container = new Container();
container.bind(Config).toDynamicValue(() => new Config()).inSingletonScope();
container.bind(DB_SYMBOL).toConstantValue(getDB(container));

// Professional DI container with proper lifecycle management
```
### Service Abstractions (`infra/repositories/`)

```typescript
@injectable()
class ProjectRepository implements IProjectRepository {
  constructor(@inject(DB_SYMBOL) private db: Database) {}
  // Clean dependency injection with interface segregation
}
```
### Testability Benefits

```typescript
// Easy mocking for unit tests
const mockRepo = mock<IProjectRepository>();
container.rebind(PROJECT_REPO_SYMBOL).toConstantValue(mockRepo);
// Dependency injection enables effortless testing
```
## Project Structure

```text
src/mastra/
├── agents/                          # AI agents for specialized evaluation tasks
│   ├── claims-extractor-agent.ts    # Claims extraction specialist
│   ├── template-reviewer-agent.ts   # Main coordinator agent
│   └── weather-agent.ts             # Example weather agent
├── domain/                          # Domain entities and business logic
│   ├── aggregates/                  # Domain aggregates and configuration
│   │   ├── config.ts                # Application configuration
│   │   └── project/                 # Project domain model
│   └── shared/                      # Shared value objects
│       └── value-objects/
│           ├── id.ts                # Type-safe identifiers
│           └── index.ts             # Value object exports
├── infra/                           # Infrastructure layer
│   ├── database/                    # MongoDB connection and setup
│   │   └── mongodb.ts               # Database configuration
│   ├── model/                       # AI model configuration
│   │   └── index.ts                 # OpenRouter model setup
│   ├── repositories/                # Data persistence layer
│   │   └── project.ts               # Project data access
│   └── services/                    # External service integrations
│       └── video/                   # Video processing services
├── tools/                           # Mastra tools for agent capabilities
│   ├── google-sheets-tool.ts        # Google Sheets integration via Arcade AI
│   └── list-projects-tool.ts        # Project listing tool
├── workflows/                       # Business process workflows
│   ├── template-reviewer-workflow/  # Main evaluation pipeline
│   │   ├── claim-extractor.ts       # Claims extraction step
│   │   ├── index.ts                 # Workflow orchestration
│   │   ├── plan-maker.ts            # Test plan generation
│   │   ├── scorer.ts                # Evaluation scoring
│   │   ├── tester.ts                # Automated testing
│   │   ├── sample-input.json        # Example input data
│   │   └── sample-output.json       # Example output structure
│   └── test-workflow.ts             # Simple test workflow
└── index.ts                         # Main application entry point
```
## Features

### Multi-Agent Evaluation Pipeline

- **Documentation Analysis** - Comprehensive review of the README and project documentation
- **Promise Extraction** - Systematic identification of project claims and features
- **Automated Testing** - Verification of promises through code execution and testing
- **Structured Scoring** - Evidence-based evaluation using defined rubrics
- **Tag Generation** - Automatic classification for searchability and organization
### Evaluation Criteria

- **Documentation Quality** - Clarity, completeness, and usability assessment
- **Feature Completeness** - Verification that promised functionality is delivered
- **Reliability** - Validation of error-free operation through testing
- **Innovation/Impact** - Evaluation of the solution's novelty and significance
- **Technical Implementation** - Code quality and architecture assessment
## Template Contribution to the Mastra Ecosystem

This project represents a groundbreaking addition to Mastra's template library, introducing enterprise-grade architectural patterns that are currently missing from the official collection.

### Seamless Environment Variable Injection

- **Parent-to-Child Propagation** - Automatically injects the parent playground's environment variables into testing playgrounds
- **API Key Inheritance** - Testing environments inherit all AI API keys from the evaluator playground
- **Zero-Config Testing** - Target projects receive all necessary environment variables without manual setup
- **Dynamic Configuration Merging** - Combines the parent playground config with project-specific environment variables
- **Effortless Multi-Instance Testing** - Eliminates setup friction for cross-playground agent communication
- **Automated Environment Provisioning** - Testing playgrounds get fully configured environments automatically
### Dependency Injection Mastery

- **Critical Gap Filled** - Addresses the complete absence of DI examples in current templates
- **InversifyJS Integration** - Professional IoC container setup with decorators
- **Interface Segregation** - Clean abstractions between application layers
- **Lifecycle Management** - Singleton scoping and proper resource management
- **Testability Focus** - Architecture designed for easy mocking and unit testing
- **Modular Design** - Loosely coupled components for maximum flexibility
### Advanced Environment Variable Management

A critical innovation for seamless multi-playground testing:

```typescript
// Automatic environment injection from parent to testing playgrounds
envConfig: {
  ...container.get(Config).aiAPIKeys, // Parent playground's AI keys
  ...inputData.envConfig,             // Project-specific overrides
}
```
**Key Benefits:**

- **Zero-Config Testing** - Testing playgrounds inherit all necessary API keys automatically
- **Parent-Child Propagation** - The evaluator playground shares its environment with target projects
- **AI Provider Continuity** - OpenRouter, OpenAI, and other API keys propagate seamlessly
- **Configuration Merging** - Smart combination of parent config with project-specific variables
- **Friction-Free Setup** - Eliminates manual environment setup for cross-instance testing
- **Secure Key Management** - Centralizes API key management in the parent evaluator instance

**Why This Matters for Mastra Templates:** This pattern solves a critical pain point in multi-instance Mastra deployments, where testing environments need access to the same API keys and configuration as the parent system, enabling truly automated evaluation workflows.
### Enterprise-Ready Architecture

Unlike other templates focused on simple demos, this one showcases:

- **Production Patterns** - Battle-tested enterprise architectural decisions
- **Scalability Design** - Built to handle complex business domains
- **Maintainability** - Clean-code principles and SOLID design patterns
- **Observability** - Comprehensive logging and monitoring integration
## Prerequisites

- Node.js >= 20.9.0
- MongoDB instance for data persistence
- OpenRouter API key for LLM access
- Project repository or documentation to evaluate
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd mastra-template-evaluator
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Set up environment variables (a sample `.env` sketch follows):

   ```bash
   # Configure OpenRouter API key and MongoDB connection
   ```
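The exact variable names aren't listed in this README, so the following `.env` sketch is an assumption based on the OpenRouter and MongoDB prerequisites above:

```bash
# Hypothetical variable names inferred from the prerequisites above
OPENROUTER_API_KEY=sk-or-...
MONGODB_URI=mongodb://localhost:27017/mastra-evaluator
```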
## Usage

### Development Mode

```bash
npm run dev
```

### Build

```bash
npm run build
```

### Production

```bash
npm start
```
## Template Evaluation

The system can evaluate projects by:

- Analyzing markdown documentation and README files
- Processing video demonstrations (YouTube links)
- Extracting and verifying feature claims
- Running automated tests and validations
- Generating comprehensive evaluation reports with scores and feedback
### Example Evaluation Input

Here's an example of how to evaluate a Mastra template project using this system. The evaluator takes a structured input describing the project to be assessed:
```json
{
  "name": "PDF to Questions Generator",
  "repoURLOrShorthand": "mastra-ai/template-pdf-questions",
  "videoURL": "https://youtu.be/WQ0rvX8ajeg",
  "description": "A Mastra template that demonstrates **how to protect against token limits** by generating AI summaries from large datasets before passing as output from tool calls..."
}
```
Key input fields:

- `name`: Human-readable project name for identification
- `repoURLOrShorthand`: GitHub repository (full URL or `owner/repo` shorthand)
- `videoURL`: YouTube demo video for functionality analysis
- `description`: Full project documentation in markdown format
What the evaluator does with this input:

- Clones the repository and sets up the project environment
- Extracts claims from both the documentation and the video transcript
- Generates test plans to validate the claimed functionality
- Runs automated tests by interacting with the deployed agent
- Provides comprehensive scoring across multiple criteria
- Auto-detects sponsor track eligibility (MCP, auth, RAG, etc.)
**READY TO TEST:** The `sample-input.json` file contains a fully working test case that you can use immediately to try the evaluation system. This input evaluates the official Mastra PDF Questions template and demonstrates all evaluation features, including claims extraction, test plan generation, automated testing, and comprehensive scoring.

**EXAMPLE OUTPUT:** The `sample-output.json` file shows the complete result structure from a successful evaluation run. It demonstrates the comprehensive scoring, test results, sponsor track tags, and detailed analysis that the system produces for each evaluated project.

**VERIFIED WORKING:** This sample input has been tested end-to-end and successfully evaluates the target project with full functionality validation, scoring, and sponsor track detection. Simply run the workflow with this input to see the complete evaluation process in action (a run sketch follows).
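One way to kick off a run programmatically, as a hedged sketch: it assumes the configured Mastra instance is exported from `src/mastra` and that the workflow is registered under `templateReviewerWorkflow`; the run API (`createRunAsync`/`start`) matches recent Mastra releases and may need adjusting for other versions.

```typescript
import { mastra } from "./src/mastra"; // assumed export of the configured Mastra instance
import sampleInput from "./src/mastra/workflows/template-reviewer-workflow/sample-input.json";

// Sketch: create a run of the reviewer workflow and feed it the sample input.
const workflow = mastra.getWorkflow("templateReviewerWorkflow"); // registration key is an assumption
const run = await workflow.createRunAsync();
const result = await run.start({ inputData: sampleInput });

console.log(result); // compare against sample-output.json
```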
## Tech Stack

### Core Framework

- **Mastra** - Multi-agent orchestration and workflow management
- **TypeScript** - Type-safe development with full IntelliSense support
- **Node.js** - Runtime environment for scalable server-side applications

### AI & LLM Integration

- **OpenRouter** - Multi-provider LLM routing with model selection optimization
- **AI SDK** - Vercel's AI SDK for streamlined language model interactions
- **Mastra Client** - Official client library for cross-instance agent communication

### Data & Storage

- **MongoDB** - Document database for project metadata, evaluations, and scoring data
- **Zod** - Schema validation and type-safe data parsing
- **UUID** - Unique identifier generation for project tracking

### Architecture & Design

- **Environment Variable Injection** - Seamless config propagation from parent to testing playgrounds
- **Dependency Injection** - InversifyJS IoC container for loose coupling and testability
- **Repository Pattern** - Data access abstraction with clean interfaces

### Testing & Quality

- **Multi-Agent Testing** - Automated agent interaction validation
- **Chat-Based Validation** - Conversational testing with stateful threads
- **Evidence Collection** - Comprehensive interaction logging and analysis

### External Services

- **Arcade AI** - Google Sheets integration and tool ecosystem access
- **TranscriptAPI** - Video transcript extraction and processing
- **YouTube Integration** - Demo video analysis and content extraction
- **Git Integration** - Automated repository cloning and project setup
## Dependencies

### Core Framework

- `@mastra/core`: Core Mastra framework functionality
- `@mastra/libsql`: SQLite storage for telemetry and evaluations
- `@mastra/memory`: Memory management for agent persistence
- `@mastra/loggers`: Logging infrastructure

### AI and LLM Integration

- `@openrouter/ai-sdk-provider`: OpenRouter LLM provider
- `ai`: AI SDK for language model interactions

### Infrastructure & Architecture

- `inversify`: Dependency injection (IoC) container
- `mongodb`: MongoDB database driver
- `zod`: Schema validation and type safety
- `reflect-metadata`: Decorator metadata reflection

### Integration & Tools

- `@arcadeai/arcadejs`: Arcade AI SDK for Google Sheets integration
- `uuid`: Unique identifier generation

### Development

- `mastra`: CLI tools for development and deployment
- `typescript`: TypeScript support and compilation
- `@types/node`: Node.js type definitions
## Multi-Agent Benefits

This architecture provides several advantages over single-agent approaches:

- **Specialization** - Each agent focuses on a specific domain (documentation, testing, scoring)
- **Clarity** - Clear separation of concerns improves reliability and maintainability
- **Scalability** - Agents can run concurrently where appropriate
- **Accuracy** - The writer-reviewer pattern in the scoring agent ensures high-quality evaluations
- **Flexibility** - Different models can be used for different complexity levels
## Hackathon Evaluation Process

### Submission Processing Pipeline

1. **Project Intake**
   - Repository URL and documentation analysis
   - Demo video transcript extraction and processing
   - Environment setup and dependency detection
2. **Parallel Intelligence Gathering**
   - **Claims Extraction** - AI identifies all stated project capabilities
   - **Repository Analysis** - Code scanning for architectural patterns and sponsor integrations
   - **Video Analysis** - Demo functionality validation from the transcript
3. **Test Plan Generation**
   - **AI Test Designer** - Creates 3 targeted test plans per project
   - **Resource Allocation** - Assigns appropriate datasets, PDFs, and web resources
   - **Interaction Planning** - Designs realistic user scenarios for agent testing
4. **Live Functionality Testing**
   - **Automated Agent Interaction** - Tests each claimed feature through chat interfaces
   - **Success Validation** - Empirical verification against stated capabilities
   - **Evidence Collection** - Detailed logs and response analysis
5. **Multi-Dimensional Scoring**
   - **Technical Merit** - Architecture quality, code patterns, innovation
   - **Functional Completeness** - Validation of all claimed features
   - **Sponsor Alignment** - Automatic detection of prize track eligibility
   - **Impact Assessment** - Productivity gains, user value, market potential
6. **Results Compilation**
   - **Detailed Scorecards** - Transparent breakdown of all evaluation criteria
   - **Prize Recommendations** - AI-identified sponsor track matches
   - **Improvement Feedback** - Specific suggestions for enhancement
   - **Comparative Ranking** - Position relative to other submissions
## Value for the Mastra.Build Hackathon

### Immediate Hackathon Impact

**Judges & Organizers:**

- **10x faster evaluation** - Process hundreds of submissions in hours, not days
- **Consistent scoring** - Every project evaluated using the same rigorous criteria
- **Data-driven decisions** - Replace gut feelings with empirical evidence
- **Automatic categorization** - AI identifies sponsor prize eligibility instantly
- **Detailed rankings** - Transparent scoring breakdown for every submission

**Participants:**

- **Clear expectations** - Understand exactly how projects will be evaluated
- **Immediate feedback** - Get a detailed analysis of strengths and improvement areas
- **Strategic insights** - See which sponsor tracks your project aligns with
- **Fair evaluation** - No bias based on presentation skills or demo timing
- **Learning opportunity** - Understand enterprise-grade Mastra patterns
### Long-Term Mastra Ecosystem Value

**Template Library Leadership:** This template pioneers critical architectural patterns missing from Mastra's current library:

- **Environment Variable Injection** - Seamless config propagation from parent to testing playgrounds
- **Dependency Injection** - Professional IoC container setup with InversifyJS
- **Enterprise Architecture** - Production-ready patterns for complex business logic
- **Systematic Testing** - Multi-agent evaluation workflows for quality assurance

**Educational Excellence:**

- **Reference Implementation** - Complete environment injection + DI example for enterprise developers
- **Real Business Logic** - The project-evaluation domain demonstrates complex workflows
- **Best Practices** - Proper error handling, logging, and monitoring integration
- **Scalability Patterns** - Built to handle hundreds of concurrent evaluations

**Market Positioning:**

- **Enterprise Credibility** - Positions Mastra as an enterprise-capable framework
- **Developer Attraction** - Sophisticated examples attract senior developers
- **Quality Standards** - Establishes architectural benchmarks for future templates
- **Ecosystem Foundation** - Enables complex multi-agent applications in production
## Acknowledgments

Thanks to the Mastra team for creating the PDF Questions template, which was used as a target to test this evaluation tool.

I've uploaded a demo video to YouTube at https://www.youtube.com/watch?v=WQ0rvX8ajeg just to test things out - if the Mastra team would like it taken down, please reach out!

Thanks to TranscriptAPI for providing video transcription services with permission for this hackathon project.
## License

ISC