PDF to Questions Generator

A Mastra template that demonstrates how to protect against token limits by generating AI summaries of large datasets before returning them as tool-call output.

🎯 Key Learning: This template shows how to use large context window models (OpenAI GPT-4.1 Mini) as a "summarization layer" to compress large documents into focused summaries, enabling efficient downstream processing without hitting token limits.

Overview

This template showcases a crucial architectural pattern for working with large documents and LLMs:

🚨 The Problem: Large PDFs can contain 50,000+ tokens, which can exceed a model's context window and make every call that includes the full text slow and expensive.

✅ The Solution: Use a large context window model (OpenAI GPT-4.1 Mini) to generate focused summaries, then use those summaries for downstream processing.

Workflow

  1. Input: PDF URL
  2. Download & Summarize: Fetch PDF, extract text, and generate AI summary using OpenAI GPT-4.1 Mini
  3. Generate Questions: Create focused questions from the summary (not the full text)

Key Benefits

  • 📉 Token Reduction: 80-95% reduction in token usage
  • 🎯 Better Quality: More focused questions from key insights
  • 💰 Cost Savings: Dramatically reduced processing costs
  • ⚡ Faster Processing: Summaries are much faster to process than full text

Prerequisites

  • Node.js 20.9.0 or higher
  • OpenAI API key (for both summarization and question generation)

Setup

  1. Clone and install dependencies:

     git clone <repository-url>
     cd template-pdf-questions
     pnpm install
  2. Set up environment variables:

     cp .env.example .env
     # Edit .env and add your API key:
     OPENAI_API_KEY="your-openai-api-key-here"
  3. Run the example:

     npx tsx example.ts

🏗️ Architectural Pattern: Token Limit Protection

This template demonstrates a crucial pattern for working with large datasets in LLM applications:

The Challenge

When processing large documents (PDFs, reports, transcripts), you often encounter:

  • Token limits: Documents can exceed context windows
  • High costs: Processing 50,000+ tokens repeatedly is expensive
  • Poor quality: LLMs perform worse on extremely long inputs
  • Slow processing: Large inputs take longer to process

The Solution: Summarization Layer

Instead of passing raw data through your pipeline:

  1. Use a large context window model (OpenAI GPT-4.1 Mini) to digest the full content
  2. Generate focused summaries that capture key information
  3. Pass summaries to downstream processing instead of raw data

Implementation Details

    // ❌ BAD: Pass full text through pipeline
    const questionsFromFullText = await generateQuestions(fullPdfText); // 50,000 tokens!

    // ✅ GOOD: Summarize first, then process
    const summary = await summarizeWithGPT41Mini(fullPdfText); // 2,000 tokens
    const questions = await generateQuestions(summary); // Much better!
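
To put a number on the savings, the figures in the snippet above can be turned into a reduction percentage. The sketch below is illustrative only: `estimateTokens` and `tokenReductionPercent` are hypothetical helpers, not part of the template, and the four-characters-per-token ratio is a common rule of thumb for English text rather than an exact tokenizer count.

```typescript
// Rough heuristic: about 4 characters per token for English text.
// This is an approximation, not the model tokenizer's real count.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: estimate how many tokens a piece of text will use.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: percentage of tokens saved by sending the summary
// downstream instead of the full text.
function tokenReductionPercent(fullTokens: number, summaryTokens: number): number {
  if (fullTokens <= 0) return 0;
  return Math.round((1 - summaryTokens / fullTokens) * 100);
}

console.log(tokenReductionPercent(50_000, 2_000)); // 96
```

For exact counts you would need the tokenizer that matches the model in use; the heuristic is only meant for quick budgeting.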

When to Use This Pattern

  • Large documents: PDFs, reports, transcripts
  • Batch processing: Multiple documents
  • Cost optimization: Reduce token usage
  • Quality improvement: More focused processing
  • Chain operations: Multiple LLM calls on same data
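
For the "chain operations" case, the summary only needs to be generated once and can then be reused by every later call. The following is a minimal sketch of that idea, assuming the summarizer is any async function from text to summary; `withSummaryCache` is a hypothetical helper and is not part of the template.

```typescript
type Summarizer = (text: string) => Promise<string>;

// Hypothetical helper: wraps a summarizer so that each distinct key
// (for example a PDF URL) is summarized at most once per process.
function withSummaryCache(summarize: Summarizer) {
  const cache = new Map<string, Promise<string>>();

  return (key: string, text: string): Promise<string> => {
    let pending = cache.get(key);
    if (!pending) {
      pending = summarize(text);
      // Drop failed attempts so a later call can retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}
```

Caching the promise rather than the resolved string means concurrent callers for the same key share one in-flight request.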

Usage

Using the Workflow

    import { mastra } from './src/mastra/index';

    const run = await mastra.getWorkflow('pdfToQuestionsWorkflow').createRunAsync();

    // Using a PDF URL
    const result = await run.start({
      inputData: {
        pdfUrl: 'https://example.com/document.pdf',
      },
    });

    console.log(result.result.questions);

Using the PDF Questions Agent

    import { mastra } from './src/mastra/index';

    const agent = mastra.getAgent('pdfQuestionAgent');

    // The agent can handle the full process with natural language
    const response = await agent.stream([
      {
        role: 'user',
        content: 'Please download this PDF and generate questions from it: https://example.com/document.pdf',
      },
    ]);

    for await (const chunk of response.textStream) {
      console.log(chunk);
    }

Using Individual Tools

    import { RuntimeContext } from '@mastra/core/runtime-context';
    import { mastra } from './src/mastra/index';
    import { pdfFetcherTool } from './src/mastra/tools/download-pdf-tool';
    import { generateQuestionsFromTextTool } from './src/mastra/tools/generate-questions-from-text-tool';

    // Step 1: Download PDF and generate summary
    const pdfResult = await pdfFetcherTool.execute({
      context: { pdfUrl: 'https://example.com/document.pdf' },
      mastra,
      runtimeContext: new RuntimeContext(),
    });

    console.log(`Downloaded ${pdfResult.fileSize} bytes from ${pdfResult.pagesCount} pages`);
    console.log(`Generated ${pdfResult.summary.length} character summary`);

    // Step 2: Generate questions from summary
    const questionsResult = await generateQuestionsFromTextTool.execute({
      context: {
        extractedText: pdfResult.summary,
        maxQuestions: 10,
      },
      mastra,
      runtimeContext: new RuntimeContext(),
    });

    console.log(questionsResult.questions);

Expected Output

    {
      status: 'success',
      result: {
        questions: [
          "What is the main objective of the research presented in this paper?",
          "Which methodology was used to collect the data?",
          "What are the key findings of the study?",
          // ... more questions
        ],
        success: true
      }
    }
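
When consuming this output, it may be worth checking the status before reading the questions. The sketch below assumes the result has exactly the shape of the sample above; the `WorkflowRunResult` interface and `extractQuestions` helper are hypothetical names introduced for illustration, and a real run may carry additional fields or other status values.

```typescript
// Shape assumed from the sample output above; not an official Mastra type.
interface WorkflowRunResult {
  status: string;
  result?: { questions?: string[]; success?: boolean };
}

// Hypothetical helper: returns the questions, or throws if the run failed.
function extractQuestions(run: WorkflowRunResult): string[] {
  const result = run.result;
  if (run.status !== 'success' || !result || !result.success) {
    throw new Error(`Workflow did not succeed (status: ${run.status})`);
  }
  return result.questions ?? [];
}
```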

Architecture

Components

  • pdfToQuestionsWorkflow: Main workflow orchestrating the process
  • textQuestionAgent: Mastra agent specialized in generating educational questions
  • pdfQuestionAgent: Complete agent that can handle the full PDF to questions pipeline

Tools

  • pdfFetcherTool: Downloads PDF files from URLs, extracts text, and generates AI summaries
  • generateQuestionsFromTextTool: Generates comprehensive questions from summarized content

Workflow Steps

  1. download-and-summarize-pdf: Downloads PDF from provided URL and generates AI summary
  2. generate-questions-from-summary: Creates comprehensive questions from the AI summary

Features

  • ✅ Token Limit Protection: Demonstrates how to handle large datasets without hitting context limits
  • ✅ 80-95% Token Reduction: AI summarization drastically reduces processing costs
  • ✅ Large Context Window: Uses OpenAI GPT-4.1 Mini to handle large documents efficiently
  • ✅ Zero System Dependencies: Pure JavaScript solution
  • ✅ Single API Setup: OpenAI for both summarization and question generation
  • ✅ Fast Text Extraction: Direct PDF parsing (no OCR needed for text-based PDFs)
  • ✅ Educational Focus: Generates focused learning questions from key insights
  • ✅ Multiple Interfaces: Workflow, Agent, and individual tools available

How It Works

Text Extraction Strategy

This template uses a pure JavaScript approach that works for most PDFs:

  1. Text-based PDFs (90% of cases): Direct text extraction using pdf2json

    • ⚡ Fast and reliable
    • 🔧 No system dependencies
    • ✅ Works out of the box
  2. Scanned PDFs: Would require OCR, but most PDFs today contain embedded text
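
Because scanned PDFs yield little or no text, one way to tell the two cases apart is to look at how much text was extracted per page. This is a rough sketch under stated assumptions: `looksLikeScannedPdf` is a hypothetical helper, and the 20-characters-per-page threshold is an arbitrary illustrative value, not something the template defines.

```typescript
// Hypothetical heuristic: a PDF with almost no extractable text per page
// is probably a scan. The default threshold is an illustrative guess.
function looksLikeScannedPdf(
  extractedText: string,
  pagesCount: number,
  minCharsPerPage = 20,
): boolean {
  const visibleChars = extractedText.replace(/\s/g, '').length;
  if (pagesCount <= 0) return visibleChars === 0;
  return visibleChars / pagesCount < minCharsPerPage;
}
```

A check like this could run right after extraction so the user gets a clear "this looks like a scanned document" message instead of an empty summary.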

Why This Approach?

  • Simplicity: No GraphicsMagick, ImageMagick, or other system tools needed
  • Speed: Direct text extraction is much faster than OCR
  • Reliability: Works consistently across different environments
  • Educational: Easy for developers to understand and modify
  • Single Path: One clear workflow with no complex branching

Configuration

Environment Variables

OPENAI_API_KEY=your_openai_api_key_here

Customization

You can customize the question generation by modifying the textQuestionAgent:

    import { openai } from '@ai-sdk/openai';
    import { Agent } from '@mastra/core/agent';

    export const textQuestionAgent = new Agent({
      name: 'Generate questions from text agent',
      // Customize the instructions below
      instructions: `
        You are an expert educational content creator...
      `,
      model: openai('gpt-4o'),
    });

Development

Project Structure

    src/mastra/
    ├── agents/
    │   ├── pdf-question-agent.ts       # PDF processing and question generation agent
    │   └── text-question-agent.ts      # Text to questions generation agent
    ├── tools/
    │   ├── download-pdf-tool.ts         # PDF download tool
    │   ├── extract-text-from-pdf-tool.ts # PDF text extraction tool
    │   └── generate-questions-from-text-tool.ts # Question generation tool
    ├── workflows/
    │   └── generate-questions-from-pdf-workflow.ts # Main workflow
    ├── lib/
    │   └── util.ts                      # Utility functions including PDF text extraction
    └── index.ts                         # Mastra configuration

Testing

    # Run with a test PDF
    export OPENAI_API_KEY="your-api-key"
    npx tsx example.ts

Common Issues

"OPENAI_API_KEY is not set"

  • Make sure you've set the environment variable
  • Check that your API key is valid and has sufficient credits
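
A small startup check can surface this problem before any PDF is downloaded. The sketch below is one possible way to do it; `requireEnv` is a hypothetical helper, not part of the template, and it takes the environment as a parameter so it can be exercised without touching the real process environment.

```typescript
// Hypothetical helper: returns the variable's value or throws a clear error.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`${name} is not set. Add it to .env or export it in your shell.`);
  }
  return value;
}
```

In a Node.js script this could be called as `requireEnv('OPENAI_API_KEY', process.env)` before running the workflow. It only checks that the variable is present, not that the key is valid.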

"Failed to download PDF"

  • Verify the PDF URL is accessible and publicly available
  • Check network connectivity
  • Ensure the URL points to a valid PDF file
  • Some servers may require authentication or have restrictions
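
To check that a URL really returned a PDF, rather than an HTML error or login page, the first bytes of the response can be compared against the `%PDF-` header that PDF files start with. This is a minimal sketch: `startsWithPdfHeader` is a hypothetical helper, and it deliberately uses a strict check at byte zero even though some readers tolerate a few leading bytes before the header.

```typescript
// Byte values of the ASCII string "%PDF-", which begins every PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true if the buffer starts with the PDF header.
function startsWithPdfHeader(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => bytes[i] === byte);
}
```

With Node's built-in `fetch`, the bytes could be obtained via `new Uint8Array(await response.arrayBuffer())` after confirming `response.ok`.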

"No text could be extracted"

  • The PDF might be password-protected
  • Very large PDFs might take longer to process
  • Scanned PDFs without embedded text won't work (rare with modern PDFs)

"Context length exceeded" or Token Limit Errors

  • Solution: Use a smaller PDF file (under ~5-10 pages)
  • Automatic Truncation: The tool automatically uses only the first 4000 characters for very large documents
  • Helpful Errors: Clear messages guide you to use smaller PDFs when needed
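
The truncation behaviour described above can be pictured as follows. This is a sketch only, assuming a plain character cut at the 4000-character limit quoted in this section; `truncateForModel` is a hypothetical name, and the template's actual implementation may differ.

```typescript
// Character limit quoted in this README for very large documents.
const MAX_CHARS = 4000;

// Hypothetical helper: keeps only the first maxChars characters and
// reports whether anything was cut, so callers can warn the user.
function truncateForModel(
  text: string,
  maxChars: number = MAX_CHARS,
): { text: string; truncated: boolean } {
  if (text.length <= maxChars) return { text, truncated: false };
  return { text: text.slice(0, maxChars), truncated: true };
}
```

Returning the `truncated` flag alongside the text makes it easy to attach a note such as "only the beginning of the document was used" to the final output.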

What Makes This Template Special

🎯 True Simplicity

  • Single dependency for PDF processing (pdf2json)
  • No system tools or complex setup required
  • Works immediately after pnpm install
  • Multiple usage patterns (workflow, agent, tools)

⚡ Performance

  • Direct text extraction (no image conversion)
  • Much faster than OCR-based approaches
  • Handles reasonably-sized documents efficiently

🔧 Developer-Friendly

  • Pure JavaScript/TypeScript
  • Easy to understand and modify
  • Clear separation of concerns
  • Simple error handling with helpful messages

📚 Educational Value

  • Generates multiple question types
  • Covers different comprehension levels
  • Perfect for creating study materials

🚀 Broader Applications

This token limit protection pattern can be applied to many other scenarios:

Document Processing

  • Legal documents: Summarize contracts before analysis
  • Research papers: Extract key findings before comparison
  • Technical manuals: Create focused summaries for specific topics

Content Analysis

  • Social media: Summarize large thread conversations
  • Customer feedback: Compress reviews before sentiment analysis
  • Meeting transcripts: Extract action items and decisions

Data Processing

  • Log analysis: Summarize error patterns before classification
  • Survey responses: Compress feedback before theme extraction
  • Code reviews: Summarize changes before generating reports

Implementation Tips

  • Use OpenAI GPT-4.1 Mini for initial summarization (large context window)
  • Pass summaries to downstream tools, not raw data
  • Chain summaries for multi-step processing
  • Preserve metadata (file size, page count) for context
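
The "chain summaries" tip can be sketched as a two-stage pass: summarize each chunk of a very long text, then summarize the combined partial summaries. The code below is an illustration under stated assumptions, not the template's implementation: `splitIntoChunks` and `summarizeInStages` are hypothetical helpers, the summarizer is any async function you supply, and the default chunk size of 20,000 characters is an arbitrary example value.

```typescript
type SummarizeFn = (text: string) => Promise<string>;

// Hypothetical helper: split text into fixed-size character chunks.
function splitIntoChunks(text: string, chunkSize: number): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Hypothetical helper: summarize each chunk, then summarize the summaries.
async function summarizeInStages(
  text: string,
  summarize: SummarizeFn,
  chunkSize = 20_000, // illustrative size, in characters
): Promise<string> {
  // Short inputs need only a single summarization call.
  if (text.length <= chunkSize) return summarize(text);

  const partials = await Promise.all(
    splitIntoChunks(text, chunkSize).map((chunk) => summarize(chunk)),
  );
  return summarize(partials.join('\n\n'));
}
```

This performs a single combining pass; if the joined partial summaries are still too long for the model, the same function could be applied again to its own output.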

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request