
PDF to Questions Generator
A Mastra template that demonstrates how to protect against token limits by generating AI summaries of large documents before passing them along as tool-call output.
🎯 Key Learning: This template shows how to use large context window models (OpenAI GPT-4.1 Mini) as a "summarization layer" to compress large documents into focused summaries, enabling efficient downstream processing without hitting token limits.
Overview
This template showcases a crucial architectural pattern for working with large documents and LLMs:
🚨 The Problem: Large PDFs can contain 50,000+ tokens, enough to overwhelm context windows and burn thousands of tokens on every processing pass.
✅ The Solution: Use a large context window model (OpenAI GPT-4.1 Mini) to generate focused summaries, then use those summaries for downstream processing.
Workflow
- Input: PDF URL
- Download & Summarize: Fetch PDF, extract text, and generate AI summary using OpenAI GPT-4.1 Mini
- Generate Questions: Create focused questions from the summary (not the full text)
Key Benefits
- 📉 Token Reduction: 80-95% reduction in token usage
- 🎯 Better Quality: More focused questions from key insights
- 💰 Cost Savings: Dramatically reduced processing costs
- ⚡ Faster Processing: Summaries are much faster to process than full text
Prerequisites
- Node.js 20.9.0 or higher
- OpenAI API key (for both summarization and question generation)
Setup
1. Clone and install dependencies:

```bash
git clone <repository-url>
cd template-pdf-questions
pnpm install
```

2. Set up environment variables:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

```env
OPENAI_API_KEY="your-openai-api-key-here"
```

3. Run the example:

```bash
npx tsx example.ts
```
🏗️ Architectural Pattern: Token Limit Protection
This template demonstrates a crucial pattern for working with large datasets in LLM applications:
The Challenge
When processing large documents (PDFs, reports, transcripts), you often encounter:
- Token limits: Documents can exceed context windows
- High costs: Processing 50,000+ tokens repeatedly is expensive
- Poor quality: LLMs perform worse on extremely long inputs
- Slow processing: Large inputs take longer to process
The Solution: Summarization Layer
Instead of passing raw data through your pipeline:
- Use a large context window model (OpenAI GPT-4.1 Mini) to digest the full content
- Generate focused summaries that capture key information
- Pass summaries to downstream processing instead of raw data
Implementation Details
```typescript
// ❌ BAD: Pass full text through pipeline
const questions = await generateQuestions(fullPdfText); // 50,000 tokens!

// ✅ GOOD: Summarize first, then process
const summary = await summarizeWithGPT41Mini(fullPdfText); // 2,000 tokens
const focusedQuestions = await generateQuestions(summary); // Much better!
```
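The summarizeWithGPT41Mini helper above is illustrative. A minimal runnable sketch of it, using the AI SDK that Mastra builds on (the model ID and prompt here are assumptions, not copied from the template), might look like:

```typescript
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

// Illustrative helper: one pass over the raw text with a large-context model.
// Downstream steps should only ever see the returned summary.
async function summarizeWithGPT41Mini(fullPdfText: string): Promise<string> {
  const { text } = await generateText({
    model: openai('gpt-4.1-mini'),
    prompt:
      'Summarize the key points, methods, and conclusions of the following ' +
      `document in a few focused paragraphs:\n\n${fullPdfText}`,
  });
  return text;
}
```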
When to Use This Pattern
- Large documents: PDFs, reports, transcripts
- Batch processing: Multiple documents
- Cost optimization: Reduce token usage
- Quality improvement: More focused processing
- Chain operations: Multiple LLM calls on same data
Usage
Using the Workflow
```typescript
import { mastra } from './src/mastra/index';

const run = await mastra.getWorkflow('pdfToQuestionsWorkflow').createRunAsync();

// Using a PDF URL
const result = await run.start({
  inputData: {
    pdfUrl: 'https://example.com/document.pdf',
  },
});

console.log(result.result.questions);
```
Using the PDF Questions Agent
```typescript
import { mastra } from './src/mastra/index';

const agent = mastra.getAgent('pdfQuestionsAgent');

// The agent can handle the full process with natural language
const response = await agent.stream([
  {
    role: 'user',
    content: 'Please download this PDF and generate questions from it: https://example.com/document.pdf',
  },
]);

for await (const chunk of response.textStream) {
  console.log(chunk);
}
```
Using Individual Tools
```typescript
import { RuntimeContext } from '@mastra/core/runtime-context';
import { mastra } from './src/mastra/index';
import { pdfFetcherTool } from './src/mastra/tools/download-pdf-tool';
import { generateQuestionsFromTextTool } from './src/mastra/tools/generate-questions-from-text-tool';

// Step 1: Download PDF and generate summary
const pdfResult = await pdfFetcherTool.execute({
  context: { pdfUrl: 'https://example.com/document.pdf' },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(`Downloaded ${pdfResult.fileSize} bytes from ${pdfResult.pagesCount} pages`);
console.log(`Generated ${pdfResult.summary.length} character summary`);

// Step 2: Generate questions from summary
const questionsResult = await generateQuestionsFromTextTool.execute({
  context: {
    extractedText: pdfResult.summary,
    maxQuestions: 10,
  },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(questionsResult.questions);
```
Expected Output
```
{
  status: 'success',
  result: {
    questions: [
      "What is the main objective of the research presented in this paper?",
      "Which methodology was used to collect the data?",
      "What are the key findings of the study?",
      // ... more questions
    ],
    success: true
  }
}
```
Architecture
Components
- pdfToQuestionsWorkflow: Main workflow orchestrating the process
- textQuestionAgent: Mastra agent specialized in generating educational questions
- pdfQuestionAgent: Complete agent that can handle the full PDF to questions pipeline

Tools
- pdfFetcherTool: Downloads PDF files from URLs, extracts text, and generates AI summaries
- generateQuestionsFromTextTool: Generates comprehensive questions from summarized content

Workflow Steps
- download-and-summarize-pdf: Downloads the PDF from the provided URL and generates an AI summary
- generate-questions-from-summary: Creates comprehensive questions from the AI summary
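To see how such a two-step pipeline is typically wired together, here is a hedged sketch using Mastra's workflow builder; the schemas, IDs, and step bodies are assumptions based on the descriptions above, not the template's actual source:

```typescript
import { createStep, createWorkflow } from '@mastra/core/workflows';
import { z } from 'zod';

// Each step declares its I/O contract with zod schemas; bodies are elided.
const downloadAndSummarizePdf = createStep({
  id: 'download-and-summarize-pdf',
  inputSchema: z.object({ pdfUrl: z.string() }),
  outputSchema: z.object({ summary: z.string() }),
  execute: async ({ inputData }) => {
    // fetch the PDF at inputData.pdfUrl, extract its text, and summarize it here
    return { summary: '...' };
  },
});

const generateQuestionsFromSummary = createStep({
  id: 'generate-questions-from-summary',
  inputSchema: z.object({ summary: z.string() }),
  outputSchema: z.object({ questions: z.array(z.string()) }),
  execute: async ({ inputData }) => {
    // ask the question agent to work from the compact summary only
    return { questions: [] };
  },
});

export const pdfToQuestionsWorkflow = createWorkflow({
  id: 'pdf-to-questions-workflow',
  inputSchema: z.object({ pdfUrl: z.string() }),
  outputSchema: z.object({ questions: z.array(z.string()) }),
})
  .then(downloadAndSummarizePdf) // step 1 output feeds step 2 input
  .then(generateQuestionsFromSummary)
  .commit();
```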
Features
- ✅ Token Limit Protection: Demonstrates how to handle large datasets without hitting context limits
- ✅ 80-95% Token Reduction: AI summarization drastically reduces processing costs
- ✅ Large Context Window: Uses OpenAI GPT-4.1 Mini to handle large documents efficiently
- ✅ Zero System Dependencies: Pure JavaScript solution
- ✅ Single API Setup: OpenAI for both summarization and question generation
- ✅ Fast Text Extraction: Direct PDF parsing (no OCR needed for text-based PDFs)
- ✅ Educational Focus: Generates focused learning questions from key insights
- ✅ Multiple Interfaces: Workflow, Agent, and individual tools available
How It Works
Text Extraction Strategy
This template uses a pure JavaScript approach that works for most PDFs:
- Text-based PDFs (90% of cases): Direct text extraction using pdf2json (see the sketch below)
  - ⚡ Fast and reliable
  - 🔧 No system dependencies
  - ✅ Works out of the box
- Scanned PDFs: Would require OCR, but most PDFs today contain embedded text
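As a rough illustration, direct extraction with pdf2json looks something like the sketch below (not the template's actual util.ts; note that older pdf2json versions nest pages under data.formImage.Pages instead of data.Pages):

```typescript
import PDFParser from 'pdf2json';

// Minimal sketch: parse a PDF buffer and join its URI-encoded text runs.
function extractText(pdfBuffer: Buffer): Promise<string> {
  return new Promise((resolve, reject) => {
    const parser = new PDFParser();
    parser.on('pdfParser_dataError', (err) => reject(err.parserError));
    parser.on('pdfParser_dataReady', (data) => {
      const text = data.Pages.map((page) =>
        page.Texts.map((t) =>
          decodeURIComponent(t.R.map((run) => run.T).join(''))
        ).join(' ')
      ).join('\n');
      resolve(text);
    });
    parser.parseBuffer(pdfBuffer);
  });
}
```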
Why This Approach?
- Simplicity: No GraphicsMagick, ImageMagick, or other system tools needed
- Speed: Direct text extraction is much faster than OCR
- Reliability: Works consistently across different environments
- Educational: Easy for developers to understand and modify
- Single Path: One clear workflow with no complex branching
Configuration
Environment Variables
OPENAI_API_KEY=your_openai_api_key_here
Customization
You can customize the question generation by modifying the textQuestionAgent:

```typescript
import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';

export const textQuestionAgent = new Agent({
  name: 'Generate questions from text agent',
  instructions: `
    You are an expert educational content creator...
    // Customize instructions here
  `,
  model: openai('gpt-4o'),
});
```
Development
Project Structure
```
src/mastra/
├── agents/
│   ├── pdf-question-agent.ts                    # PDF processing and question generation agent
│   └── text-question-agent.ts                   # Text to questions generation agent
├── tools/
│   ├── download-pdf-tool.ts                     # PDF download tool
│   ├── extract-text-from-pdf-tool.ts            # PDF text extraction tool
│   └── generate-questions-from-text-tool.ts     # Question generation tool
├── workflows/
│   └── generate-questions-from-pdf-workflow.ts  # Main workflow
├── lib/
│   └── util.ts                                  # Utility functions including PDF text extraction
└── index.ts                                     # Mastra configuration
```
Testing
```bash
# Run with a test PDF
export OPENAI_API_KEY="your-api-key"
npx tsx example.ts
```
Common Issues
"OPENAI_API_KEY is not set"
- Make sure you've set the environment variable
- Check that your API key is valid and has sufficient credits
"Failed to download PDF"
- Verify the PDF URL is accessible and publicly available
- Check network connectivity
- Ensure the URL points to a valid PDF file
- Some servers may require authentication or have restrictions
"No text could be extracted"
- The PDF might be password-protected
- Very large PDFs might take longer to process
- Scanned PDFs without embedded text won't work (rare with modern PDFs)
"Context length exceeded" or Token Limit Errors
- Solution: Use a smaller PDF file (under ~5-10 pages)
- Automatic Truncation: The tool automatically uses only the first 4000 characters for very large documents
- Helpful Errors: Clear messages guide you to use smaller PDFs when needed
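If you adapt the tools yourself, the guard can be as simple as capping the text before any model call; the sketch below mirrors the 4,000-character truncation described above (the constant and function names are illustrative, not the template's actual code):

```typescript
// Cap very large inputs before they reach a model call.
const MAX_INPUT_CHARS = 4000; // mirrors the tool's documented truncation limit

function truncateForModel(text: string): string {
  return text.length > MAX_INPUT_CHARS ? text.slice(0, MAX_INPUT_CHARS) : text;
}
```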
What Makes This Template Special
🎯 True Simplicity
- Single dependency for PDF processing (pdf2json)
- No system tools or complex setup required
- Works immediately after pnpm install
- Multiple usage patterns (workflow, agent, tools)
⚡ Performance
- Direct text extraction (no image conversion)
- Much faster than OCR-based approaches
- Handles reasonably-sized documents efficiently
🔧 Developer-Friendly
- Pure JavaScript/TypeScript
- Easy to understand and modify
- Clear separation of concerns
- Simple error handling with helpful messages
📚 Educational Value
- Generates multiple question types
- Covers different comprehension levels
- Perfect for creating study materials
🚀 Broader Applications
This token limit protection pattern can be applied to many other scenarios:
Document Processing
- Legal documents: Summarize contracts before analysis
- Research papers: Extract key findings before comparison
- Technical manuals: Create focused summaries for specific topics
Content Analysis
- Social media: Summarize large thread conversations
- Customer feedback: Compress reviews before sentiment analysis
- Meeting transcripts: Extract action items and decisions
Data Processing
- Log analysis: Summarize error patterns before classification
- Survey responses: Compress feedback before theme extraction
- Code reviews: Summarize changes before generating reports
Implementation Tips
- Use OpenAI GPT-4.1 Mini for initial summarization (large context window)
- Pass summaries to downstream tools, not raw data
- Chain summaries for multi-step processing
- Preserve metadata (file size, page count) for context
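Put together, these tips amount to a pipeline like the hedged sketch below (helper names and prompts are illustrative; only the summary, plus light metadata, ever reaches the downstream calls):

```typescript
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

// Illustrative pipeline: summarize the raw input once, then chain every
// later step off the compact summary while preserving useful metadata.
async function analyzeDocument(
  fullText: string,
  metadata: { fileSize: number; pagesCount: number },
) {
  // 1. A large-context model digests the raw text exactly once.
  const { text: summary } = await generateText({
    model: openai('gpt-4.1-mini'),
    prompt: `Summarize this ${metadata.pagesCount}-page document:\n\n${fullText}`,
  });

  // 2. Downstream calls receive the summary, never the raw text.
  const { text: questions } = await generateText({
    model: openai('gpt-4o'),
    prompt: `Generate study questions from this summary:\n\n${summary}`,
  });

  // 3. Keep metadata alongside the results for context.
  return { summary, questions, metadata };
}
```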
Contributing
1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request