Updated Mar 2, 2026

pdfExtract

Overview

The pdfExtract logic is designed to fetch PDF documents from specified URLs, extract text content from those PDFs, and optionally run AI-based analyses on the extracted content. This functionality is useful in scenarios where businesses need to process large quantities of PDF documents for information retrieval, summarization, or structured data extraction.

Purpose

The main purpose of pdfExtract is to bridge the gap between unstructured PDF content and actionable insights by leveraging AI capabilities, enabling users to extract valuable information from their documents efficiently.

Settings

1. urlColumn — the input column that contains the PDF URL (or base64 encoded PDF content).

2. extractionMode — how the extracted text is processed: text_only, ai_summary, or ai_extract.

3. outputColumn — the column where the extracted or AI-processed result is stored.

4. customPrompt — an optional prompt used to guide the AI analysis in the AI-based modes.

5. maxPages — the maximum number of PDF pages to fetch and parse.
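As a sketch, a settings object for this node might look like the following. The values are illustrative, and the exact schema is an assumption based on the setting names above:

```json
{
  "urlColumn": "pdf_url",
  "extractionMode": "ai_extract",
  "outputColumn": "extracted_data",
  "customPrompt": "Extract the claim amount and claim date.",
  "maxPages": 10
}
```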

How It Works

The pdfExtract node operates by processing input data consisting of PDF URLs and, if available, base64 encoded PDF content. The processing flow is as follows:

  1. Input Validation: The node checks for valid input data and extracts relevant columns based on urlColumn.
  2. PDF Fetching and Parsing: It either fetches the PDF from the given URL or decodes a base64 PDF present in the input, up to the specified maxPages.
  3. Text Extraction: It uses the pdfjs-dist library to extract text content from the PDF pages.
  4. AI Processing (if applicable): Depending on the extractionMode, it may send the extracted text to an AI model for further analysis. This requires fetching an AI integration if necessary.
  5. Result Compilation: The extracted information is collated into the specified outputColumn, and any errors during processing are noted.
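The five steps above can be sketched as a single per-row function. This is a minimal, hypothetical sketch: the helper names (extractText, runAi) are assumptions injected as dependencies, and the real node's internals may differ.

```javascript
// Hypothetical sketch of the per-row pdfExtract flow.
// settings = { urlColumn, extractionMode, outputColumn, customPrompt, maxPages }
// deps = { extractText, runAi } — assumed helpers, injected for testability.
async function processRow(row, settings, deps) {
  // 1. Input validation: find a PDF source in the configured column,
  //    falling back to base64 encoded content if present.
  const source = row[settings.urlColumn] ?? row.content_base64;
  if (!source) {
    return { ...row, [settings.outputColumn]: null, error: "No PDF source found" };
  }

  // 2-3. Fetching/decoding and text extraction (e.g. via pdfjs-dist)
  //      are delegated to the injected helper, capped at maxPages.
  const text = await deps.extractText(source, settings.maxPages);

  // 4. AI processing only runs for the AI-based extraction modes.
  let result = text;
  if (settings.extractionMode !== "text_only") {
    result = await deps.runAi(settings.extractionMode, text, settings.customPrompt);
  }

  // 5. Result compilation into the configured output column.
  return { ...row, [settings.outputColumn]: result };
}
```

Injecting the fetch/extract and AI helpers keeps the orchestration logic pure, so each branch (text_only vs. AI modes, missing source) can be exercised without network access.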

Expected Data

The pdfExtract logic expects input rows that include the field named by urlColumn. That field may contain either a PDF URL or base64 encoded PDF content. Each row of input data should follow this structure:

```json
{
  "pdf_url": "http://example.com/document.pdf",
  "other_field": "value"
}
```

(A content_base64 field carrying base64 encoded PDF data may be supplied in place of pdf_url.)
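A small helper can mirror these expectations when preparing rows for the node. This is a hypothetical validation sketch; the function name, the http prefix check, and the default column name are assumptions based on this document:

```javascript
// Hypothetical helper: classify a row's PDF source, or return null
// if the row carries neither a URL nor base64 content.
function resolvePdfSource(row, urlColumn = "pdf_url") {
  if (typeof row[urlColumn] === "string" && row[urlColumn].startsWith("http")) {
    return { kind: "url", value: row[urlColumn] };
  }
  if (typeof row.content_base64 === "string" && row.content_base64.length > 0) {
    return { kind: "base64", value: row.content_base64 };
  }
  return null; // such rows would be flagged with an error in the output
}
```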

Use Cases & Examples

Use Case 1: Document Summarization

A law firm needs to summarize multiple legal agreements stored as PDFs for quick client updates. By using the ai_summary extraction mode, they can automatically generate concise summaries of each document.

Use Case 2: Data Extraction for Claims Processing

An insurance company wants to extract key data such as claim amounts and dates from uploaded policy documents. Using the ai_extract mode, they can systematically pull necessary information to streamline their processing workflow.

Use Case 3: E-commerce Product Descriptions

An e-commerce company needs to convert several product catalogs from PDF to structured data. Using text_only, they can extract all text descriptions for later use in their product database.

Detailed Example

Scenario: Auto-summarization of multiple legal documents.

Configuration:
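A plausible configuration for this scenario, with values inferred from the Execution description (field names follow the Settings list):

```json
{
  "urlColumn": "pdf_url",
  "extractionMode": "ai_summary",
  "outputColumn": "pdf_summary",
  "maxPages": 5
}
```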

Sample Input Data:

```json
[
  {
    "pdf_url": "http://example.com/legal_agreement_1.pdf"
  },
  {
    "pdf_url": "http://example.com/legal_agreement_2.pdf"
  }
]
```

Execution: The pdfExtract node will fetch the PDFs, extract text from the first 5 pages of each document, and then use AI to summarize this text, storing the results in the pdf_summary field of the output.
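Given that description, each output row would plausibly take this shape (the summary text is illustrative, not real output):

```json
[
  {
    "pdf_url": "http://example.com/legal_agreement_1.pdf",
    "pdf_summary": "A concise AI-generated summary of the agreement."
  },
  {
    "pdf_url": "http://example.com/legal_agreement_2.pdf",
    "pdf_summary": "A concise AI-generated summary of the agreement."
  }
]
```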

By utilizing the pdfExtract node correctly, businesses can streamline their document workflows, improve efficiency, and leverage AI for enhanced data insights.