Updated Mar 2, 2026

PdfExtractNodeEditor Documentation

Purpose

The PdfExtractNodeEditor is a component of the Vantage analytics platform that extracts text and data from PDF files within a data processing workflow. It lets users define how PDF content is interpreted, summarize it or extract key information using AI, and set the parameters that govern extraction. It is intended for workflows that derive data from PDF documents such as invoices, reports, or forms.

How It Works

The PdfExtractNodeEditor operates by enabling users to configure various settings that dictate how the PDF content is processed:

Users select an extraction mode, which determines what is produced from the PDF: raw text only, an AI summary, AI key extraction, or the result of a custom AI prompt.
The component automatically detects whether PDF data arrives as a URL or as a directly uploaded file.
Based on the user's configuration, it extracts the data and stores it in the specified output columns for further processing in the workflow.

The component updates its UI dynamically based on user input and the upstream data structure. For example, the PDF URL Column dropdown is shown only when no file upload is detected.
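The source-detection step described above can be sketched as follows. This is a minimal illustration, not Vantage's implementation: the row shape, the `file` column name, and its `base64` key are hypothetical and chosen only for the example. The `pdf_url` default is the documented default of the `urlColumn` setting.

```python
from typing import Any, Dict, Optional


def detect_pdf_source(
    row: Dict[str, Any],
    url_column: str = "pdf_url",  # documented default of the urlColumn setting
    file_column: str = "file",    # hypothetical name for an uploaded-file column
) -> Optional[str]:
    """Return "upload", "url", or None depending on what the row carries."""
    file_value = row.get(file_column)
    # Assumption: an uploaded file arrives as a mapping holding base64 content.
    if isinstance(file_value, dict) and file_value.get("base64"):
        return "upload"
    url_value = row.get(url_column)
    # Treat only http(s) strings as usable PDF URLs.
    if isinstance(url_value, str) and url_value.lower().startswith(("http://", "https://")):
        return "url"
    return None
```

In this sketch an uploaded file takes precedence over a URL when both are present; whether the real component resolves that case the same way is not stated in this documentation.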

Settings

Detailed Settings Overview

1. Extraction Mode

Name: extractionMode
Input Type: Dropdown
Description: Determines how content is extracted from the PDF. Configuration values are shown in parentheses where this documentation specifies them. Options include:
- Text Only (text_only): Extracts raw text from the PDF without any AI processing.
- AI Summary: Extracts text and produces a summary using AI.
- AI Key Extraction (ai_extract): Extracts specific elements such as names, dates, and terms using AI.
- Custom Prompt (custom): Extracts text, then processes it with a user-defined AI prompt.
Default Value: text_only

2. Custom Prompt

Name: customPrompt
Input Type: Textarea
Description: When the extraction mode is set to 'custom', users can enter a custom prompt here. The placeholder {{text}} can be used within the prompt to represent the extracted text. This enables advanced users to apply tailored AI processing to the extracted content.
Default Value: '' (empty string)
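As a rough illustration of how the {{text}} placeholder could be filled in, the sketch below uses plain string replacement. The function name `render_prompt` is hypothetical, and the platform's actual templating rules (escaping, handling of a missing placeholder) are not documented here.

```python
PLACEHOLDER = "{{text}}"


def render_prompt(custom_prompt: str, extracted_text: str) -> str:
    """Substitute the extracted PDF text for every {{text}} placeholder."""
    # If the prompt has no placeholder, this sketch returns it unchanged.
    return custom_prompt.replace(PLACEHOLDER, extracted_text)


prompt = render_prompt(
    "List every due date mentioned in this document:\n{{text}}",
    "Payment is due on 2026-04-01.",
)
```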

3. Output Column

Name: outputColumn
Input Type: String
Description: This defines the name of the column where the extracted text will be stored. A secondary column indicating the number of pages extracted (e.g., pdf_text_pages) is automatically generated.
Default Value: pdf_text
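The relationship between the two output columns can be sketched as below. The `_pages` suffix rule is an assumption inferred from the single documented example (pdf_text and pdf_text_pages); the helper name `output_columns` is hypothetical.

```python
from typing import Tuple


def output_columns(output_column: str = "pdf_text") -> Tuple[str, str]:
    """Return the text column name and the derived page-count column name."""
    # Assumption: the page-count column is the output column plus "_pages".
    return output_column, f"{output_column}_pages"
```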

4. Max Pages

Name: maxPages
Input Type: Numeric
Description: Specifies the maximum number of pages to extract from each PDF. This can help control performance, especially for larger documents.
Default Value: 10
Note: Accepts values from 1 to 50. Higher values may slow down processing.
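A minimal sketch of enforcing the documented range and default is shown below. Clamping out-of-range values is an assumption made for the example; the editor may instead reject such values. The function name `normalize_max_pages` is hypothetical.

```python
from typing import Optional

MIN_PAGES = 1       # documented lower bound
MAX_PAGES = 50      # documented upper bound
DEFAULT_PAGES = 10  # documented default


def normalize_max_pages(value: Optional[int] = None) -> int:
    """Return a maxPages value inside the documented 1 to 50 range."""
    if value is None:
        return DEFAULT_PAGES
    # Assumption: out-of-range input is clamped rather than rejected.
    return max(MIN_PAGES, min(MAX_PAGES, int(value)))
```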

5. PDF URL Column

Name: urlColumn
Input Type: Dropdown (shown only if no file upload is detected)
Description: Identifies the dataset column that contains the URLs of the PDFs to process. This allows extraction from linked documents without direct file uploads.
Default Value: pdf_url

Data Expectations

The PdfExtractNodeEditor expects the following data inputs:

Upstream Columns: The component retrieves the available columns from upstream nodes, particularly looking for files with base64 content or a column with PDF URLs.
PDF Format: The PDFs should ideally be well-structured since the extraction effectiveness can vary depending on the PDF formatting.
AI Integration: For modes utilizing AI, the platform must be configured appropriately with an AI service to perform necessary computations.

AI Integrations

The component incorporates AI functionalities in several extraction modes:

The AI Summary, AI Key Extraction, and Custom Prompt modes apply AI processing to the text extracted from the PDFs. Text Only mode does not use AI.
Users must have access to AI services enabled through their Vantage account settings. AI usage may have billing implications depending on the volume of processing.

Billing Impacts

The use of AI-enhanced extraction modes may incur additional costs based on the data processed and specific terms outlined in Vantage's billing policy. Users are advised to review their current subscription plan to understand any extra charges related to AI usage, particularly when dealing with large volumes of documents or frequent AI-based processing requests.

Use Cases & Examples

Use Cases

Invoice Processing: Automating the extraction of invoice details such as vendor names, dates, and amounts from PDFs, enabling automated accounts payable processes.
Contract Analysis: Extracting key contractual terms from legal documents for compliance tracking and analysis by legal teams.
Data Entry Automation: Reducing manual data entry workload in scenarios such as digitizing form submissions by extracting fields from PDF forms.

Example Configuration

Use Case: Invoice Processing

Scenario: A finance department needs to extract invoice information from a series of received PDF invoices for automated tracking.
Configuration:
- Extraction Mode: AI Key Extraction (to capture vendor names, dates, and amounts)
- Output Column: invoice_data
- Max Pages: 50 (the maximum allowed, so that longer invoices are processed as fully as possible)
- PDF URL Column: invoice_url (the column containing links to the invoice PDFs)

Sample Configuration Data:

```json
{
  "extractionMode": "ai_extract",
  "outputColumn": "invoice_data",
  "maxPages": 50,
  "urlColumn": "invoice_url"
}
```

This configuration automates invoice processing, using AI key extraction to capture vendor names, dates, and amounts.
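To show how the sample configuration relates to the documented defaults, the sketch below merges a partial configuration with those defaults and applies two checks taken from the Settings section. The function `resolve_config` and its error messages are hypothetical, and the platform's own validation may differ.

```python
from typing import Any, Dict

# Defaults as listed in the Settings section.
DEFAULTS: Dict[str, Any] = {
    "extractionMode": "text_only",
    "customPrompt": "",
    "outputColumn": "pdf_text",
    "maxPages": 10,
    "urlColumn": "pdf_url",
}


def resolve_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in documented defaults and run basic checks on the result."""
    unknown = sorted(set(user_config) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown settings: {unknown}")
    config = {**DEFAULTS, **user_config}
    if not isinstance(config["maxPages"], int) or not 1 <= config["maxPages"] <= 50:
        raise ValueError("maxPages must be an integer from 1 to 50")
    # Assumption: a custom-prompt mode without a prompt is treated as an error.
    if config["extractionMode"] == "custom" and not config["customPrompt"].strip():
        raise ValueError("customPrompt is required when extractionMode is 'custom'")
    return config


invoice_config = resolve_config({
    "extractionMode": "ai_extract",
    "outputColumn": "invoice_data",
    "maxPages": 50,
    "urlColumn": "invoice_url",
})
```

Here `customPrompt` is omitted from the sample, so it falls back to its empty-string default.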
