
# CSV to Questions Generator

A Mastra template that demonstrates how to protect against token limits by generating AI summaries of large CSV datasets before passing them along as tool-call output.

> **Key Learning**: This template shows how to use a large-context-window model (OpenAI GPT-4.1 Mini) as a "summarization layer" that compresses large CSV datasets into focused summaries, enabling efficient downstream processing without hitting token limits.
## Overview

This template showcases a crucial architectural pattern for working with large datasets and LLMs:

- **The Problem**: Large CSV files can contain 100,000+ rows and columns, which would overwhelm context windows and cost thousands of tokens to process.
- **The Solution**: Use a large-context-window model (OpenAI GPT-4.1 Mini) to generate focused summaries, then use those summaries for downstream processing.
## Workflow

1. **Input**: CSV URL
2. **Download & Summarize**: Fetch the CSV, parse the data, and generate an AI summary using OpenAI GPT-4.1 Mini
3. **Generate Questions**: Create focused questions from the summary (not the raw data)
## Key Benefits

- **Token Reduction**: 80-95% reduction in token usage
- **Better Quality**: More focused questions drawn from key data insights
- **Cost Savings**: Dramatically reduced processing costs
- **Faster Processing**: Summaries are much faster to process than raw CSV data
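To get a feel for the scale of the savings, a common rough heuristic is ~4 characters per token. The figures below are illustrative assumptions, not measurements from this template:

```typescript
// Rough per-token sizing using the common ~4 characters/token heuristic.
// All figures here are illustrative assumptions, not measurements.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (chars: number) => Math.ceil(chars / CHARS_PER_TOKEN);

const rawCsvTokens = estimateTokens(100_000 * 80); // hypothetical 100k rows, ~80 chars each
const summaryTokens = estimateTokens(3_000);       // a focused ~3,000-character summary

const reduction = 1 - summaryTokens / rawCsvTokens;
console.log(`raw ~${rawCsvTokens} tokens, summary ~${summaryTokens} tokens`);
console.log(`reduction: ${(reduction * 100).toFixed(2)}%`);
```

The exact percentage depends entirely on dataset size and summary length; the point is that the summary's cost is fixed and small while the raw data's cost grows with the file.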
## Prerequisites

- Node.js 20.9.0 or higher
- OpenAI API key (used for both summarization and question generation)
## Setup

1. Clone and install dependencies:

   ```bash
   git clone <repository-url>
   cd template-csv-to-questions
   pnpm install
   ```

2. Set up environment variables:

   ```bash
   cp .env.example .env
   # Edit .env and add your API keys
   ```

   ```bash
   OPENAI_API_KEY="your-openai-api-key-here"
   ```

3. Run the example:

   ```bash
   npx tsx example.ts
   ```
## Architectural Pattern: Token Limit Protection

This template demonstrates a crucial pattern for working with large datasets in LLM applications.
### The Challenge

When processing large CSV files (sales data, logs, surveys), you often encounter:

- **Token limits**: Datasets can exceed context windows
- **High costs**: Processing 100,000+ rows repeatedly is expensive
- **Poor quality**: LLMs perform worse on extremely long inputs
- **Slow processing**: Large datasets take longer to process
### The Solution: Summarization Layer

Instead of passing raw CSV data through your pipeline:

1. Use a large-context-window model (OpenAI GPT-4.1 Mini) to digest the full dataset
2. Generate focused summaries that capture key insights and patterns
3. Pass the summaries, not the raw data, to downstream processing
### Implementation Details

```typescript
// ❌ BAD: Pass the full CSV through the pipeline
const questions = await generateQuestions(fullCSVData); // 100,000+ tokens!

// ✅ GOOD: Summarize first, then process
const summary = await summarizeWithGPT41Mini(fullCSVData); // 500-1000 tokens
const questions = await generateQuestions(summary); // Much better!
```
### When to Use This Pattern

- **Large datasets**: CSV files with many rows/columns
- **Batch processing**: Multiple CSV files
- **Cost optimization**: Reducing token usage
- **Quality improvement**: More focused processing
- **Chained operations**: Multiple LLM calls over the same data
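When even the summarization model's context window is too small for the full dataset, the same idea extends to a hierarchical (map-reduce) pass: summarize chunks, then summarize the summaries. A minimal sketch, with the `summarize` function injected so the control flow can be shown without an LLM (the names here are illustrative, not part of the template):

```typescript
// Sketch of hierarchical (map-reduce) summarization for datasets that exceed
// even a large context window. `summarize` is injected so the pattern can be
// demonstrated without an LLM call; names are illustrative assumptions.
type Summarizer = (text: string) => Promise<string>;

async function summarizeInChunks(
  rows: string[],
  summarize: Summarizer,
  rowsPerChunk = 5_000,
): Promise<string> {
  // Map: summarize each chunk of rows independently.
  const chunkSummaries: string[] = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    const chunk = rows.slice(i, i + rowsPerChunk).join('\n');
    chunkSummaries.push(await summarize(chunk));
  }
  // Reduce: a single chunk needs no second pass; otherwise summarize the summaries.
  if (chunkSummaries.length === 1) return chunkSummaries[0];
  return summarize(chunkSummaries.join('\n'));
}
```

In practice `summarize` would call the summarization agent; the chunk size should be chosen so each chunk comfortably fits the model's context window.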
## Usage

### Using the Workflow

```typescript
import { mastra } from './src/mastra/index';

const run = await mastra.getWorkflow('csvToQuestionsWorkflow').createRunAsync();

// Using a CSV URL
const result = await run.start({
  inputData: {
    csvUrl: 'https://example.com/dataset.csv',
  },
});

console.log(result.result.questions);
```
### Using the CSV Questions Agent

```typescript
import { mastra } from './src/mastra/index';

const agent = mastra.getAgent('csvQuestionAgent');

// The agent can handle the full process with natural language
const response = await agent.stream([
  {
    role: 'user',
    content: 'Please download this CSV and generate questions from it: https://example.com/dataset.csv',
  },
]);

for await (const chunk of response.textStream) {
  console.log(chunk);
}
```
### Using Individual Tools

```typescript
import { RuntimeContext } from '@mastra/core/runtime-context';
import { mastra } from './src/mastra/index';
import { csvFetcherTool } from './src/mastra/tools/download-csv-tool';
import { generateQuestionsFromTextTool } from './src/mastra/tools/generate-questions-from-text-tool';

// Step 1: Download the CSV and generate a summary
const csvResult = await csvFetcherTool.execute({
  context: { csvUrl: 'https://example.com/dataset.csv' },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(`Downloaded ${csvResult.fileSize} bytes, ${csvResult.rowCount} rows`);
console.log(`Generated ${csvResult.summary.length}-character summary`);

// Step 2: Generate questions from the summary
const questionsResult = await generateQuestionsFromTextTool.execute({
  context: {
    extractedText: csvResult.summary,
    maxQuestions: 10,
  },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(questionsResult.questions);
```
### Expected Output

```typescript
{
  status: 'success',
  result: {
    questions: [
      "What are the main columns in this CSV dataset?",
      "How many total entries are included in the data?",
      "Which category shows the highest values?",
      "What patterns can you identify in the data?",
      "What insights can be drawn from this dataset for business decisions?",
      // ... more questions
    ],
    success: true
  }
}
```
## Architecture

### Components

- `csvToQuestionsWorkflow`: Main workflow orchestrating the process
- `textQuestionAgent`: Mastra agent specialized in generating educational questions from text
- `csvQuestionAgent`: Complete agent that can handle the full CSV-to-questions pipeline
- `csvSummarizationAgent`: Agent specialized in creating focused summaries from CSV data

### Tools

- `csvFetcherTool`: Downloads CSV files from URLs, parses the data, and generates AI summaries
- `generateQuestionsFromTextTool`: Generates comprehensive questions from summarized content

### Workflow Steps

1. `download-and-summarize-csv`: Downloads the CSV from the provided URL and generates an AI summary
2. `generate-questions-from-summary`: Creates comprehensive questions from the AI summary
## Features

- **Token Limit Protection**: Demonstrates how to handle large datasets without hitting context limits
- **80-95% Token Reduction**: AI summarization drastically reduces processing costs
- **Large Context Window**: Uses OpenAI GPT-4.1 Mini to handle large datasets efficiently
- **Zero System Dependencies**: Pure JavaScript solution
- **Single API Setup**: OpenAI for both summarization and question generation
- **Fast Data Processing**: Direct CSV parsing with intelligent sampling
- **Data Analysis Focus**: Generates questions about patterns, insights, and practical applications
- **Multiple Interfaces**: Workflow, agent, and individual tools available
## How It Works

### Data Processing Strategy

This template uses a pure JavaScript approach that works for most CSV files:

1. **CSV Parsing**: Direct parsing using a custom CSV parser
   - Fast and reliable
   - Handles quoted fields and various delimiters
   - Works out of the box
2. **Data Analysis**: Automatic data type detection and structure analysis
   - Row/column counting
   - Data type inference
   - Sample data extraction
3. **AI Summarization**: Intelligent compression of large datasets
   - Pattern recognition
   - Key insights extraction
   - Actionable intelligence
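To make the first two steps concrete, here is a minimal sketch of quoted-field CSV parsing and naive type inference. These are illustrative assumptions in the spirit of the template's custom parser, not its actual code:

```typescript
// Minimal sketch of a quoted-field CSV line parser, similar in spirit to the
// template's custom parser. Names and behavior are illustrative assumptions,
// not the template's actual implementation.
function parseCsvLine(line: string, delimiter = ','): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (inQuotes) {
      if (char === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote ("") inside a quoted field
        i++;
      } else if (char === '"') {
        inQuotes = false; // closing quote
      } else {
        current += char;
      }
    } else if (char === '"') {
      inQuotes = true; // opening quote
    } else if (char === delimiter) {
      fields.push(current);
      current = '';
    } else {
      current += char;
    }
  }
  fields.push(current);
  return fields;
}

// Naive data type inference over a sample of column values.
function inferType(values: string[]): 'number' | 'string' {
  return values.every((v) => v.trim() !== '' && !Number.isNaN(Number(v)))
    ? 'number'
    : 'string';
}
```

A production parser would also need to handle delimiters other than commas and newlines embedded inside quoted fields, which is why the template ships its own implementation.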
### Why This Approach?

- **Scalability**: Handles large datasets without hitting token limits
- **Cost Efficiency**: Dramatically reduces processing costs
- **Quality**: More focused questions from key insights
- **Speed**: Summaries process much faster than raw data
- **Flexibility**: Works with various CSV formats and structures
## Configuration

### Environment Variables

```bash
OPENAI_API_KEY=your_openai_api_key_here
```

### Customization

You can customize the question generation by modifying the agents:

```typescript
export const textQuestionAgent = new Agent({
  name: 'Generate questions from text agent',
  instructions: `
    // Customize instructions here for different question types
    // Focus on specific aspects like statistical analysis, patterns, etc.
  `,
  model: openai('gpt-4o'),
});
```
## Development

### Project Structure

```
src/mastra/
├── agents/
│   ├── csv-question-agent.ts               # CSV processing and question generation agent
│   ├── csv-summarization-agent.ts          # CSV data summarization agent
│   └── text-question-agent.ts              # Text-to-questions generation agent
├── tools/
│   ├── download-csv-tool.ts                # CSV download and summarization tool
│   └── generate-questions-from-text-tool.ts # Question generation tool
├── workflows/
│   └── csv-to-questions-workflow.ts        # Main workflow
└── index.ts                                # Mastra configuration
```

### Testing

```bash
# Run with a test CSV
export OPENAI_API_KEY="your-api-key"
npx tsx example.ts
```
## Common Issues

### "OPENAI_API_KEY is not set"

- Make sure you've set the environment variable
- Check that your API key is valid and has sufficient credits

### "Failed to download CSV"

- Verify the CSV URL is accessible and publicly available
- Check network connectivity
- Ensure the URL points to a valid CSV file
- Some servers may require authentication or have other restrictions

### "No data could be parsed"

- The CSV might be malformed or use unusual delimiters
- Very large CSV files might take longer to process
- Check that the file actually contains CSV data

### "Context length exceeded" or Token Limit Errors

- This shouldn't happen with this architecture: the summarization layer exists precisely to prevent token limit errors
- If it occurs, try a smaller CSV file for testing
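If very large files ever do trip the limit, one mitigation is to sample rows evenly across the file before summarization. A hedged sketch (the cap and function name are illustrative, not part of the template):

```typescript
// Sketch of a row-sampling guard applied before summarization, for files so
// large that even the summarization model's context window could be exceeded.
// The cap and names are illustrative assumptions, not the template's code.
function sampleRows(rows: string[], maxRows = 2_000): string[] {
  if (rows.length <= maxRows) return rows;
  // Keep every k-th data row so the sample spans the whole file,
  // always preserving the header as the first row.
  const step = Math.ceil((rows.length - 1) / (maxRows - 1));
  const sampled = [rows[0]];
  for (let i = 1; i < rows.length && sampled.length < maxRows; i += step) {
    sampled.push(rows[i]);
  }
  return sampled;
}
```

Evenly spaced sampling keeps the sample representative of the whole file, unlike simply truncating to the first N rows.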
## Example CSV URLs

For testing, you can use these public CSV files:

- **World GDP Data**: https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv
- **Cities Data**: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
- **Sample Dataset**: https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv
## What Makes This Template Special

### Token Limit Protection

- Demonstrates the summarization pattern for large datasets
- Shows how to compress data while preserving key insights
- Prevents the token limit errors that plague naive approaches

### Performance & Cost Optimization

- 80-95% reduction in token usage
- Much faster processing than raw-data approaches
- Dramatically lower API costs

### Developer-Friendly Architecture

- Clean separation of concerns
- Multiple usage patterns (workflow, agent, tools)
- Easy to understand and modify
- Comprehensive error handling

### Educational Value

- Generates questions focused on data analysis and insights
- Covers different comprehension levels
- Perfect for creating learning materials from datasets
## Broader Applications

This token limit protection pattern can be applied to many other scenarios:

### Data Processing

- **Log analysis**: Summarize large log files before pattern analysis
- **Survey data**: Compress responses before sentiment analysis
- **Financial data**: Extract key metrics before trend analysis

### Content Analysis

- **Social media**: Summarize large datasets before insight extraction
- **Customer feedback**: Compress reviews before theme identification
- **Research data**: Extract key findings before comparison

### Business Intelligence

- **Sales data**: Summarize transactions before performance analysis
- **User behavior**: Compress activity logs before pattern detection
- **Market research**: Extract insights before strategic planning
### Implementation Tips

- Use a large-context-window model (OpenAI GPT-4.1 Mini) for the initial summarization
- Pass summaries, not raw data, to downstream tools
- Chain summaries for multi-step processing
- Preserve metadata (row count, column info) for context
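The metadata tip can be sketched as a prompt builder that carries dataset structure alongside sampled rows, so the summarization model always sees the full-dataset context even when it only sees a fraction of the data. This is a hypothetical helper, not the template's actual code:

```typescript
// Hypothetical prompt builder that preserves dataset metadata (row count,
// column names) alongside sampled rows for the summarization model.
// Names and structure are illustrative assumptions, not the template's code.
interface CsvMetadata {
  rowCount: number;
  columns: string[];
  sampleRows: string[];
}

function buildSummarizationPrompt(meta: CsvMetadata): string {
  return [
    `Dataset: ${meta.rowCount} rows, ${meta.columns.length} columns.`,
    `Columns: ${meta.columns.join(', ')}`,
    `Sample rows (${meta.sampleRows.length} shown):`,
    ...meta.sampleRows,
    'Summarize the key patterns, ranges, and insights in this dataset.',
  ].join('\n');
}
```

Keeping the true row count in the prompt prevents the model from mistaking a small sample for a small dataset.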
## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request