pdfExtract

Overview

The pdfExtract node fetches PDF documents from specified URLs, extracts their text content, and optionally runs AI-based analysis on the extracted text. It is useful when you need to process large numbers of PDF documents for information retrieval, summarization, or structured data extraction.

Purpose

The main purpose of pdfExtract is to turn unstructured PDF content into text, summaries, or structured fields that downstream steps can use, with AI analysis applied only when the selected extraction mode calls for it.

Settings

1. `urlColumn`

Input Type: String
Description: This setting specifies the name of the field in the input data that contains the URL of the PDF to be fetched. Changing this setting allows you to customize which column is used to retrieve PDF links from your data.
Default Value: "pdf_url"

2. `extractionMode`

Input Type: Dropdown (options: text_only, ai_summary, ai_extract, custom)
Description: This setting defines how the content of the PDF is processed. It determines whether the node performs plain text extraction only or also applies AI analysis.
- text_only: Extracts only the plain text from the PDF.
- ai_summary: Summarizes the extracted text using an AI model.
- ai_extract: Extracts structured information (e.g., named entities, dates) using an AI model.
- custom: Uses a user-defined prompt for AI processing (see customPrompt).
Default Value: "text_only"
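
As a rough illustration of how the four modes differ, the sketch below maps each mode to the prompt that would be sent to an AI model. The function name `buildPrompt` and the prompt wording are hypothetical and are not taken from the pdfExtract source. Rejecting an empty custom prompt is also an assumption, since the documentation does not say what the node does in that case.

```typescript
// Hypothetical sketch: names and prompt strings are illustrative assumptions.
type ExtractionMode = "text_only" | "ai_summary" | "ai_extract" | "custom";

// Returns the prompt to send to the AI model, or null when no AI call is needed.
function buildPrompt(
  mode: ExtractionMode,
  pdfText: string,
  customPrompt: string = ""
): string | null {
  switch (mode) {
    case "text_only":
      // Plain text extraction only: there is no AI step.
      return null;
    case "ai_summary":
      return `Summarize the following document:\n\n${pdfText}`;
    case "ai_extract":
      return `Extract named entities and dates from the following document:\n\n${pdfText}`;
    case "custom":
      // Assumption: an empty prompt in custom mode is treated as a configuration error.
      if (customPrompt.trim() === "") {
        throw new Error("customPrompt must be set when extractionMode is 'custom'");
      }
      return `${customPrompt}\n\n${pdfText}`;
  }
}
```

Only `text_only` returns `null` here, which matches the description above: every other mode involves a call to an AI model.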

3. `outputColumn`

Input Type: String
Description: This setting indicates the name of the field in which the extracted text or analysis results will be stored in the output data. Modifying this allows you to direct the output to a different column based on your data schema.
Default Value: "pdf_text"

4. `customPrompt`

Input Type: String
Description: Only applicable when extractionMode is set to custom. This setting allows users to define a specific prompt that the AI model will follow to analyze the extracted text. Adjusting this can lead to more tailored AI responses based on user-defined conditions.
Default Value: "" (empty string - no custom prompt is set)

5. `maxPages`

Input Type: Numeric (integer)
Description: This setting determines the maximum number of pages to process from the PDF. It controls the amount of content extracted, which can significantly impact processing time and cost. Setting a lower number can help manage performance and resource usage.
Default Value: 10
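
The five settings and their defaults can be pictured as a single configuration object. This is a minimal sketch: the field names and default values come from the list above, while the type `PdfExtractSettings` and the helpers `withDefaults` and `pagesToProcess` are hypothetical names introduced for illustration.

```typescript
// Hypothetical type; field names and defaults mirror the settings documented above.
interface PdfExtractSettings {
  urlColumn: string;
  extractionMode: "text_only" | "ai_summary" | "ai_extract" | "custom";
  outputColumn: string;
  customPrompt: string;
  maxPages: number;
}

const DEFAULT_SETTINGS: PdfExtractSettings = {
  urlColumn: "pdf_url",
  extractionMode: "text_only",
  outputColumn: "pdf_text",
  customPrompt: "",
  maxPages: 10,
};

// Fill in the documented default for any setting the user left out.
function withDefaults(overrides: Partial<PdfExtractSettings> = {}): PdfExtractSettings {
  return { ...DEFAULT_SETTINGS, ...overrides };
}

// Number of pages to read from a PDF that has `totalPages` pages.
// Assumption: maxPages acts as a simple upper bound on the page count.
function pagesToProcess(totalPages: number, maxPages: number): number {
  return Math.max(0, Math.min(totalPages, Math.floor(maxPages)));
}
```

With these helpers, `withDefaults({ maxPages: 5 })` keeps every other setting at its default, and a 42-page PDF would have only its first 5 pages read.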

How It Works

The pdfExtract node processes input rows that contain PDF URLs or, where available, base64-encoded PDF content. The processing flow is as follows:

1. Input Validation: The node checks that the input data is valid and reads the PDF location from the field named by urlColumn.
2. PDF Fetching and Parsing: It fetches the PDF from the given URL, or decodes a base64-encoded PDF present in the input. At most maxPages pages are processed.
3. Text Extraction: It uses the pdfjs-dist library to extract text content from the PDF pages.
4. AI Processing (if applicable): For any extractionMode other than text_only, it sends the extracted text to an AI model for further analysis. This step requires an AI integration, which the node fetches when needed.
5. Result Compilation: The result is written to the specified outputColumn, and any errors that occurred during processing are recorded.
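
The flow above can be sketched for a single row as follows. This is not the node's actual implementation: `processRow`, `Deps`, `Options`, and the `error` output field are hypothetical. Text extraction and the AI call are passed in as functions so the sketch stays independent of pdfjs-dist and of any particular AI integration. The base64 path is left out for brevity.

```typescript
type Row = Record<string, unknown>;

// Injected dependencies (hypothetical). In a real node, extractText would wrap
// pdfjs-dist and analyze would call the configured AI integration.
interface Deps {
  extractText: (url: string, maxPages: number) => Promise<string>;
  analyze: (text: string) => Promise<string>;
}

interface Options {
  urlColumn: string;
  outputColumn: string;
  useAi: boolean; // true for any extractionMode other than text_only
  maxPages: number;
}

async function processRow(row: Row, opts: Options, deps: Deps): Promise<Row> {
  // Step 1: input validation.
  const url = row[opts.urlColumn];
  if (typeof url !== "string" || url === "") {
    return { ...row, error: `missing ${opts.urlColumn}` };
  }
  try {
    // Steps 2 and 3: fetch, parse, and extract text from at most maxPages pages.
    const text = await deps.extractText(url, opts.maxPages);
    // Step 4: optional AI processing.
    const result = opts.useAi ? await deps.analyze(text) : text;
    // Step 5: write the result to the output column.
    return { ...row, [opts.outputColumn]: result };
  } catch (err) {
    // Record the failure on the row instead of aborting the whole batch (assumed behaviour).
    const message = err instanceof Error ? err.message : String(err);
    return { ...row, error: message };
  }
}
```

Injecting the two functions also makes the flow easy to exercise with stubs, without network access or a real PDF.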

Expected Data

Each input row should contain the field named by urlColumn, holding the URL of the PDF. The node also handles base64-encoded PDF content, which can be supplied in a content_base64 field instead of a URL. Other fields may be present alongside. A row has this structure:

```json
{
  "pdf_url": "http://example.com/document.pdf",
  "other_field": "value"
}
```
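
To make the two accepted input forms concrete, here is a small sketch that classifies a row as a URL source, a base64 source, or invalid. The names `PdfSource` and `resolveSource` are hypothetical. Checking the URL before `content_base64` is an assumption, since the documentation does not say which takes precedence when both are present.

```typescript
type PdfSource =
  | { kind: "url"; url: string }
  | { kind: "base64"; data: string }
  | { kind: "invalid"; reason: string };

// Decide where the PDF for a row comes from.
// Assumption: an http(s) URL in urlColumn wins over content_base64.
function resolveSource(
  row: Record<string, unknown>,
  urlColumn: string = "pdf_url"
): PdfSource {
  const url = row[urlColumn];
  if (typeof url === "string" && /^https?:\/\//i.test(url)) {
    return { kind: "url", url };
  }
  const b64 = row["content_base64"];
  if (typeof b64 === "string" && b64.length > 0) {
    return { kind: "base64", data: b64 };
  }
  return {
    kind: "invalid",
    reason: `row has neither a valid ${urlColumn} nor content_base64`,
  };
}
```

A row that contains only `other_field` would be classified as `invalid` and could be reported as an error rather than processed.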

Use Cases & Examples

Use Case 1: Document Summarization

A law firm needs to summarize multiple legal agreements stored as PDFs for quick client updates. By using the ai_summary extraction mode, they can automatically generate concise summaries of each document.

Use Case 2: Data Extraction for Claims Processing

An insurance company wants to extract key data such as claim amounts and dates from uploaded policy documents. Using the ai_extract mode, they can systematically pull necessary information to streamline their processing workflow.

Use Case 3: E-commerce Product Descriptions

An e-commerce company needs to get the text out of several PDF product catalogs. Using text_only, they can extract the plain text of each catalog, including the product descriptions, and structure it later for loading into their product database.

Detailed Example

Scenario: Auto-summarization of multiple legal documents.

Configuration:

- urlColumn: "pdf_url"
- extractionMode: "ai_summary"
- outputColumn: "pdf_summary"
- customPrompt: "" (leave empty for default AI behavior)
- maxPages: 5 (to control processing time)

Sample Input Data:

```json
[
  {
    "pdf_url": "http://example.com/legal_agreement_1.pdf"
  },
  {
    "pdf_url": "http://example.com/legal_agreement_2.pdf"
  }
]
```

Execution: The pdfExtract node will fetch the PDFs, extract text from the first 5 pages of each document, and then use AI to summarize this text, storing the results in the pdf_summary field of the output.

By utilizing the pdfExtract node correctly, businesses can streamline their document workflows, improve efficiency, and leverage AI for enhanced data insights.
