PdfExtractNodeEditor Documentation
Purpose
The PdfExtractNodeEditor is a component of the Vantage analytics platform for extracting text and data from PDF files within a data processing workflow. It lets users define how PDF content is interpreted, extract or summarize key information using AI, and set the parameters that govern the extraction process. This component is central to workflows where data must be derived from PDF documents such as invoices, reports, or forms.
How It Works
The PdfExtractNodeEditor operates by enabling users to configure various settings that dictate how the PDF content is processed:
- Users can select an extraction mode, which determines what is produced from the PDF: raw text only, an AI summary, AI key extraction, or the result of a custom AI prompt.
- The component automatically detects whether it is receiving PDF data via a URL or directly from an uploaded file.
- Based on the user's configurations, it extracts the data and stores it in specified output columns for further processing in the workflow.
The component dynamically updates its UI based on user input and the upstream data structure, so that only the relevant settings are shown. For example, the PDF URL Column dropdown appears only when no file upload is detected.
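The source detection described above can be sketched as follows. This is a minimal illustration, not the component's actual implementation: the `UpstreamColumn` shape, the `hasBase64Content` flag, and the `detectPdfSource` function are hypothetical names introduced here, and the preference for an uploaded file over a URL column is an assumption.

```typescript
// Hypothetical shape of an upstream column descriptor; the real Vantage types may differ.
interface UpstreamColumn {
  name: string;
  // True when the column carries uploaded files with base64 content (assumed flag).
  hasBase64Content: boolean;
}

type PdfSource =
  | { kind: "file"; column: string }
  | { kind: "url"; column: string }
  | { kind: "none" };

// Assumption: an uploaded file takes priority; otherwise fall back to the configured URL column.
function detectPdfSource(columns: UpstreamColumn[], urlColumn: string): PdfSource {
  const fileColumn = columns.find((c) => c.hasBase64Content);
  if (fileColumn) {
    return { kind: "file", column: fileColumn.name };
  }
  const urlMatch = columns.find((c) => c.name === urlColumn);
  return urlMatch ? { kind: "url", column: urlMatch.name } : { kind: "none" };
}
```

Under this sketch, a result of kind "file" would correspond to the case where the PDF URL Column dropdown is hidden.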
Settings
Detailed Settings Overview
1. Extraction Mode
- Name: extractionMode
- Input Type: Dropdown
- Description: This setting allows users to choose how text is extracted from the PDF. Options include:
  - Text Only: Extracts raw text from the PDF with no AI integration.
  - AI Summary: Extracts text and produces a summary using AI capabilities.
  - AI Key Extraction: Extracts specific elements like names, dates, and terms using AI.
  - Custom Prompt: Allows for text extraction followed by processing using a user-defined AI prompt.
- Default Value: text_only
2. Custom Prompt
- Name: customPrompt
- Input Type: Textarea
- Description: When the extraction mode is set to 'custom', users can enter a custom prompt here. The placeholder {{text}} can be used within the prompt to represent the extracted text. This enables advanced users to apply tailored AI processing to the extracted content.
- Default Value: '' (empty string)
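As an illustration of how the {{text}} placeholder might be filled in, here is a minimal sketch. The `renderPrompt` function is a hypothetical name, and the behavior shown (replacing every exact occurrence of the placeholder) is an assumption rather than documented platform behavior.

```typescript
// Replace every exact {{text}} placeholder in the prompt with the extracted PDF text.
// Assumption: the placeholder is matched literally, with no whitespace inside the braces.
function renderPrompt(customPrompt: string, extractedText: string): string {
  // split/join inserts the text verbatim; String.prototype.replace would
  // interpret sequences such as "$&" inside the extracted text.
  return customPrompt.split("{{text}}").join(extractedText);
}
```

For example, a prompt such as "List every date mentioned in: {{text}}" would be sent to the AI service with the extracted PDF text in place of the placeholder.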
3. Output Column
- Name: outputColumn
- Input Type: String
- Description: This defines the name of the column where the extracted text will be stored. A secondary column indicating the number of pages extracted (e.g., pdf_text_pages) is automatically generated.
- Default Value: pdf_text
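The relationship between the two output columns can be sketched as below. The `_pages` suffix rule is inferred from the pdf_text_pages example above and is an assumption, not a documented naming scheme; `outputColumnNames` is a hypothetical helper.

```typescript
// Derive both output column names from the configured outputColumn.
// Assumption: the page-count column is the text column name plus "_pages".
function outputColumnNames(outputColumn: string): { text: string; pages: string } {
  // Fall back to the documented default when the setting is blank.
  const base = outputColumn.trim() || "pdf_text";
  return { text: base, pages: `${base}_pages` };
}
```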
4. Max Pages
- Name: maxPages
- Input Type: Numeric
- Description: Specifies the maximum number of pages to extract from each PDF. This can help control performance, especially for larger documents.
- Default Value: 10
- Note: Ranges from 1 to 50. Extracting more pages may slow down processing.
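One way to enforce the documented 1 to 50 range is to clamp the value before it is used. The sketch below is illustrative only: `normalizeMaxPages` is a hypothetical helper, and falling back to the default for non-numeric input and truncating fractions are assumptions.

```typescript
const MIN_PAGES = 1;
const MAX_PAGES = 50;
const DEFAULT_PAGES = 10;

// Coerce arbitrary input to an integer page limit within the documented range.
function normalizeMaxPages(value: unknown): number {
  const n = typeof value === "number" ? value : Number(value);
  // Non-numeric input (NaN, Infinity) falls back to the documented default.
  if (!Number.isFinite(n)) return DEFAULT_PAGES;
  // Drop any fractional part, then clamp to [MIN_PAGES, MAX_PAGES].
  return Math.min(MAX_PAGES, Math.max(MIN_PAGES, Math.trunc(n)));
}
```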
5. PDF URL Column
- Name: urlColumn
- Input Type: Dropdown (shown only if no file upload is detected)
- Description: Identifies the column in the dataset that contains URLs leading to the PDFs to be processed. This facilitates the extraction process without requiring direct file uploads.
- Default Value: pdf_url
Data Expectations
The PdfExtractNodeEditor expects the following data inputs:
- Upstream Columns: The component retrieves the available columns from upstream nodes, particularly looking for files with base64 content or a column with PDF URLs.
- PDF Format: The PDFs should ideally be well-structured since the extraction effectiveness can vary depending on the PDF formatting.
- AI Integration: For modes utilizing AI, the platform must be configured appropriately with an AI service to perform necessary computations.
AI Integrations
The component incorporates AI functionalities in several extraction modes:
- The AI Summary, AI Key Extraction, and Custom Prompt modes pass the text extracted from the PDFs to an AI service for analysis; Text Only mode involves no AI processing.
- Users must ensure proper integration and access to AI services governed by their Vantage account settings, which may have billing implications based on the volume of AI processing used.
Billing Impacts
The use of AI-enhanced extraction modes may incur additional costs based on the data processed and specific terms outlined in Vantage's billing policy. Users are advised to review their current subscription plan to understand any extra charges related to AI usage, particularly when dealing with large volumes of documents or frequent AI-based processing requests.
Use Cases & Examples
Use Cases
- Invoice Processing: Automating the extraction of invoice details such as vendor names, dates, and amounts from PDFs, enabling automated accounts payable processes.
- Contract Analysis: Extracting key contractual terms from legal documents for compliance tracking and analysis by legal teams.
- Data Entry Automation: Reducing manual data entry workload in scenarios such as digitizing form submissions by extracting fields from PDF forms.
Example Configuration
Use Case: Invoice Processing
- Scenario: A finance department needs to extract invoice information from a series of received PDF invoices for automated tracking.
- Configuration:
  - Extraction Mode: AI Key Extraction (to capture vendor names, dates, and amounts)
  - Output Column: invoice_data
  - Max Pages: 50 (to ensure all relevant invoice pages are processed)
  - PDF URL Column: invoice_url (the column containing links to the invoice PDFs)
Sample Configuration Data:
{
  "extractionMode": "ai_extract",
  "outputColumn": "invoice_data",
  "maxPages": 50,
  "urlColumn": "invoice_url"
}
This configuration targets the automation of invoice processing while leveraging AI to ensure key details are accurately captured.
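For readers integrating with this configuration programmatically, the settings and their documented defaults could be modeled roughly as follows. This is a sketch under assumptions: the `PdfExtractConfig` interface, `DEFAULT_CONFIG`, and `withDefaults` are hypothetical names, and the `ExtractionMode` type lists only the mode identifiers that appear in this document (the identifier for AI Summary is not given here, so it is omitted).

```typescript
// Mode identifiers that appear in this document; AI Summary's identifier is not stated.
type ExtractionMode = "text_only" | "ai_extract" | "custom";

// Hypothetical typed view of the node settings.
interface PdfExtractConfig {
  extractionMode: ExtractionMode;
  customPrompt: string;
  outputColumn: string;
  maxPages: number;
  urlColumn: string;
}

// Defaults as listed in the Settings section.
const DEFAULT_CONFIG: PdfExtractConfig = {
  extractionMode: "text_only",
  customPrompt: "",
  outputColumn: "pdf_text",
  maxPages: 10,
  urlColumn: "pdf_url",
};

// Fill any omitted settings with the documented defaults.
function withDefaults(partial: Partial<PdfExtractConfig>): PdfExtractConfig {
  return { ...DEFAULT_CONFIG, ...partial };
}

// The invoice-processing sample from above, with customPrompt left at its default.
const invoiceConfig = withDefaults({
  extractionMode: "ai_extract",
  outputColumn: "invoice_data",
  maxPages: 50,
  urlColumn: "invoice_url",
});
```

Modeling the defaults in one place keeps partial configurations, such as the sample above that omits customPrompt, consistent with the Settings section.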