CSV to Questions Generator

A Mastra template that demonstrates how to protect against token limits by generating AI summaries of large CSV datasets before returning tool-call output.

🎯 Key Learning: This template shows how to use large context window models (OpenAI GPT-4.1 Mini) as a "summarization layer" to compress large CSV datasets into focused summaries, enabling efficient downstream processing without hitting token limits.

Overview

This template showcases a crucial architectural pattern for working with large datasets and LLMs:

🚨 The Problem: Large CSV files can contain 100,000+ rows and columns, which would overwhelm context windows and cost thousands of tokens to process.

✅ The Solution: Use a large context window model (OpenAI GPT-4.1 Mini) to generate focused summaries, then use those summaries for downstream processing.

Workflow

  1. Input: CSV URL
  2. Download & Summarize: Fetch CSV, parse data, and generate AI summary using OpenAI GPT-4.1 Mini
  3. Generate Questions: Create focused questions from the summary (not the raw data)

Key Benefits

  • 📉 Token Reduction: 80-95% reduction in token usage
  • 🎯 Better Quality: More focused questions from key data insights
  • 💰 Cost Savings: Dramatically reduced processing costs
  • ⚡ Faster Processing: Summaries are much faster to process than raw CSV data
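
The token-reduction figures above come from compressing megabytes of CSV into a short summary. A back-of-the-envelope sketch using the common ~4 characters/token heuristic (the sizes below are hypothetical, not measured with a real tokenizer):

```typescript
// Rough token estimate using the common ~4 characters/token heuristic.
// The dataset sizes below are illustrative, not real measurements.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const rawCsv = 'x'.repeat(2_000_000); // hypothetical 2 MB raw CSV
const summary = 'x'.repeat(2_000);    // hypothetical ~2 KB AI summary

const rawTokens = estimateTokens(rawCsv);        // 500000
const summaryTokens = estimateTokens(summary);   // 500
const reduction = 1 - summaryTokens / rawTokens; // 0.999

console.log(`~${rawTokens} tokens -> ~${summaryTokens} tokens (${(reduction * 100).toFixed(1)}% reduction)`);
```

Real savings depend on the tokenizer and the actual summary length, but the order of magnitude holds whenever a summary replaces the full dataset.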

Prerequisites

  • Node.js 20.9.0 or higher
  • OpenAI API key (for both summarization and question generation)

Setup

  1. Clone and install dependencies:

     git clone <repository-url>
     cd template-csv-to-questions
     pnpm install
  2. Set up environment variables:

     cp .env.example .env
     # Edit .env and add your API keys
     OPENAI_API_KEY="your-openai-api-key-here"
  3. Run the example:

    npx tsx example.ts

šŸ—ļø Architectural Pattern: Token Limit Protection

This template demonstrates a crucial pattern for working with large datasets in LLM applications:

The Challenge

When processing large CSV files (sales data, logs, surveys), you often encounter:

  • Token limits: Datasets can exceed context windows
  • High costs: Processing 100,000+ rows repeatedly is expensive
  • Poor quality: LLMs perform worse on extremely long inputs
  • Slow processing: Large datasets take longer to process

The Solution: Summarization Layer

Instead of passing raw CSV data through your pipeline:

  1. Use a large context window model (OpenAI GPT-4.1 Mini) to digest the full dataset
  2. Generate focused summaries that capture key insights and patterns
  3. Pass summaries to downstream processing instead of raw data

Implementation Details

 1// āŒ BAD: Pass full CSV through pipeline
 2const questions = await generateQuestions(fullCSVData); // 100,000+ tokens!
 3
 4// āœ… GOOD: Summarize first, then process
 5const summary = await summarizeWithGPT41Mini(fullCSVData); // 500-1000 tokens
 6const questions = await generateQuestions(summary); // Much better!

When to Use This Pattern

  • Large datasets: CSV files with many rows/columns
  • Batch processing: Multiple CSV files
  • Cost optimization: Reduce token usage
  • Quality improvement: More focused processing
  • Chain operations: Multiple LLM calls on same data

Usage

Using the Workflow

import { mastra } from './src/mastra/index';

const run = await mastra.getWorkflow('csvToQuestionsWorkflow').createRunAsync();

// Using a CSV URL
const result = await run.start({
  inputData: {
    csvUrl: 'https://example.com/dataset.csv',
  },
});

console.log(result.result.questions);

Using the CSV Questions Agent

import { mastra } from './src/mastra/index';

const agent = mastra.getAgent('csvQuestionAgent');

// The agent can handle the full process with natural language
const response = await agent.stream([
  {
    role: 'user',
    content: 'Please download this CSV and generate questions from it: https://example.com/dataset.csv',
  },
]);

for await (const chunk of response.textStream) {
  console.log(chunk);
}

Using Individual Tools

import { mastra } from './src/mastra/index';
import { RuntimeContext } from '@mastra/core/runtime-context';
import { csvFetcherTool } from './src/mastra/tools/download-csv-tool';
import { generateQuestionsFromTextTool } from './src/mastra/tools/generate-questions-from-text-tool';

// Step 1: Download CSV and generate summary
const csvResult = await csvFetcherTool.execute({
  context: { csvUrl: 'https://example.com/dataset.csv' },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(`Downloaded ${csvResult.fileSize} bytes (${csvResult.rowCount} rows)`);
console.log(`Generated a ${csvResult.summary.length}-character summary`);

// Step 2: Generate questions from summary
const questionsResult = await generateQuestionsFromTextTool.execute({
  context: {
    extractedText: csvResult.summary,
    maxQuestions: 10,
  },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(questionsResult.questions);

Expected Output

{
  status: 'success',
  result: {
    questions: [
      "What are the main columns in this CSV dataset?",
      "How many total entries are included in the data?",
      "Which category shows the highest values?",
      "What patterns can you identify in the data?",
      "What insights can be drawn from this dataset for business decisions?",
      // ... more questions
    ],
    success: true
  }
}

Architecture

Components

  • csvToQuestionsWorkflow: Main workflow orchestrating the process
  • textQuestionAgent: Mastra agent specialized in generating educational questions from text
  • csvQuestionAgent: Complete agent that can handle the full CSV to questions pipeline
  • csvSummarizationAgent: Agent specialized in creating focused summaries from CSV data

Tools

  • csvFetcherTool: Downloads CSV files from URLs, parses data, and generates AI summaries
  • generateQuestionsFromTextTool: Generates comprehensive questions from summarized content

Workflow Steps

  1. download-and-summarize-csv: Downloads CSV from provided URL and generates AI summary
  2. generate-questions-from-summary: Creates comprehensive questions from the AI summary

Features

  • ✅ Token Limit Protection: Demonstrates how to handle large datasets without hitting context limits
  • ✅ 80-95% Token Reduction: AI summarization drastically reduces processing costs
  • ✅ Large Context Window: Uses OpenAI GPT-4.1 Mini to handle large datasets efficiently
  • ✅ Zero System Dependencies: Pure JavaScript solution
  • ✅ Single API Setup: OpenAI for both summarization and question generation
  • ✅ Fast Data Processing: Direct CSV parsing with intelligent sampling
  • ✅ Data Analysis Focus: Generates questions focused on patterns, insights, and practical applications
  • ✅ Multiple Interfaces: Workflow, Agent, and individual tools available
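
The "intelligent sampling" mentioned above can be as simple as keeping the header plus an evenly spaced subset of data rows, so the summarizer sees the dataset's full range rather than just its first page. A minimal sketch (the function name and sampling strategy are illustrative assumptions, not the template's exact implementation):

```typescript
// Keep the header row plus up to `maxRows` evenly spaced data rows.
// Evenly spaced sampling captures the dataset's full range instead of
// only its beginning.
function sampleCsvRows(rows: string[], maxRows: number): string[] {
  const [header, ...data] = rows;
  if (data.length <= maxRows) return rows; // small files pass through untouched
  const step = data.length / maxRows;
  const sampled: string[] = [];
  for (let i = 0; i < maxRows; i++) {
    sampled.push(data[Math.floor(i * step)]);
  }
  return [header, ...sampled];
}

const rows = ['id,value', ...Array.from({ length: 1000 }, (_, i) => `${i},${i * 2}`)];
const sample = sampleCsvRows(rows, 10);
console.log(sample.length); // 11 (header + 10 sampled rows)
```

The sampled rows, together with metadata like the true row count, are what would go to the summarization model.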

How It Works

Data Processing Strategy

This template uses a pure JavaScript approach that works for most CSV files:

  1. CSV Parsing: Direct parsing using a custom CSV parser

    • ⚡ Fast and reliable
    • 🔧 Handles quoted fields and various delimiters
    • ✅ Works out of the box
  2. Data Analysis: Automatic data type detection and structure analysis

    • 📊 Row/column counting
    • 🔍 Data type inference
    • 📈 Sample data extraction
  3. AI Summarization: Intelligent compression of large datasets

    • 🧠 Pattern recognition
    • 📝 Key insights extraction
    • 💡 Actionable intelligence
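
A stripped-down version of steps 1 and 2 might look like the following. This is a sketch, not the template's actual parser, which handles additional delimiters and edge cases:

```typescript
// Parse one CSV line, honoring double-quoted fields with embedded
// delimiters and "" escapes (RFC 4180 style).
function parseCsvLine(line: string, delimiter = ','): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { current += '"'; i++; } // escaped quote
      else if (ch === '"') { inQuotes = false; }
      else { current += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Very simple data type inference over a column's sampled values.
function inferType(values: string[]): 'number' | 'string' {
  return values.every((v) => v !== '' && !Number.isNaN(Number(v))) ? 'number' : 'string';
}

console.log(parseCsvLine('a,"b,c",d')); // → [ 'a', 'b,c', 'd' ]
console.log(inferType(['1', '2.5', '3'])); // → number
```

Inferred column types and row counts then feed the summarization step as structural context.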

Why This Approach?

  • Scalability: Handles large datasets without token limits
  • Cost Efficiency: Dramatically reduces processing costs
  • Quality: More focused questions from key insights
  • Speed: Summaries process much faster than raw data
  • Flexibility: Works with various CSV formats and structures

Configuration

Environment Variables

OPENAI_API_KEY=your_openai_api_key_here

Customization

You can customize the question generation by modifying the agents:

import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';

export const textQuestionAgent = new Agent({
  name: 'Generate questions from text agent',
  instructions: `
    Customize instructions here for different question types.
    Focus on specific aspects like statistical analysis, patterns, etc.
  `,
  model: openai('gpt-4o'),
});

Development

Project Structure

src/mastra/
├── agents/
│   ├── csv-question-agent.ts       # CSV processing and question generation agent
│   ├── csv-summarization-agent.ts  # CSV data summarization agent
│   └── text-question-agent.ts      # Text to questions generation agent
├── tools/
│   ├── download-csv-tool.ts         # CSV download and summarization tool
│   └── generate-questions-from-text-tool.ts # Question generation tool
├── workflows/
│   └── csv-to-questions-workflow.ts # Main workflow
└── index.ts                         # Mastra configuration

Testing

# Run with a test CSV
export OPENAI_API_KEY="your-api-key"
npx tsx example.ts

Common Issues

"OPENAI_API_KEY is not set"

  • Make sure you've set the environment variable
  • Check that your API key is valid and has sufficient credits

"Failed to download CSV"

  • Verify the CSV URL is accessible and publicly available
  • Check network connectivity
  • Ensure the URL points to a valid CSV file
  • Some servers may require authentication or have restrictions

"No data could be parsed"

  • The CSV might be malformed or use unusual delimiters
  • Very large CSV files might take longer to process
  • Check that the file actually contains CSV data

"Context length exceeded" or Token Limit Errors

  • This should not happen with this template's architecture: the summarization step keeps downstream inputs small
  • If it does occur, confirm the summarization step ran before question generation
  • Try a smaller CSV file for testing

Example CSV URLs

For testing, you can use these public CSV files:

  • World GDP Data: https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv
  • Cities Data: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
  • Sample Dataset: https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv

What Makes This Template Special

🎯 Token Limit Protection

  • Demonstrates the summarization pattern for large datasets
  • Shows how to compress data while preserving key insights
  • Prevents token limit errors that plague other approaches

⚡ Performance & Cost Optimization

  • 80-95% reduction in token usage
  • Much faster processing than raw data approaches
  • Dramatically lower API costs

🔧 Developer-Friendly Architecture

  • Clean separation of concerns
  • Multiple usage patterns (workflow, agent, tools)
  • Easy to understand and modify
  • Comprehensive error handling

📚 Educational Value

  • Generates questions focused on data analysis and insights
  • Covers different comprehension levels
  • Perfect for creating learning materials from datasets

🚀 Broader Applications

This token limit protection pattern can be applied to many other scenarios:

Data Processing

  • Log analysis: Summarize large log files before pattern analysis
  • Survey data: Compress responses before sentiment analysis
  • Financial data: Extract key metrics before trend analysis

Content Analysis

  • Social media: Summarize large datasets before insight extraction
  • Customer feedback: Compress reviews before theme identification
  • Research data: Extract key findings before comparison

Business Intelligence

  • Sales data: Summarize transactions before performance analysis
  • User behavior: Compress activity logs before pattern detection
  • Market research: Extract insights before strategic planning

Implementation Tips

  • Use OpenAI GPT-4.1 Mini for initial summarization (large context window)
  • Pass summaries to downstream tools, not raw data
  • Chain summaries for multi-step processing
  • Preserve metadata (row count, column info) for context
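
As an illustration of the last tip, a hypothetical helper that folds dataset metadata into the summarization prompt could look like this (the `CsvMetadata` shape, function name, and prompt wording are assumptions for illustration, not the template's actual code):

```typescript
// Illustrative metadata shape; the real tool's output may differ.
interface CsvMetadata {
  rowCount: number;
  columns: string[];
}

// Build a summarization prompt that preserves row/column context, so
// downstream steps keep sight of the dataset's true scale even though
// the model only sees a sample of rows.
function buildSummarizationPrompt(sampleCsv: string, meta: CsvMetadata): string {
  return [
    `You are summarizing a CSV dataset with ${meta.rowCount} rows`,
    `and columns: ${meta.columns.join(', ')}.`,
    `Only a sample of rows is shown below; summarize patterns,`,
    `key statistics, and notable outliers in under 500 words.`,
    ``,
    sampleCsv,
  ].join('\n');
}

const prompt = buildSummarizationPrompt('id,value\n1,10\n2,20', {
  rowCount: 100_000,
  columns: ['id', 'value'],
});
console.log(prompt.includes('100000 rows')); // true
```

Keeping the true row count in the prompt prevents the summary (and any questions generated from it) from mistaking a 100,000-row dataset for the handful of sampled rows.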

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request