Document Processing

Convert, merge, split, and extract content from existing documents, either one at a time or through automated workflows and batch operations.

Features

Format conversion between multiple document types
Document merging and splitting operations
Text extraction and content analysis
Batch processing with parallel operations
Metadata manipulation and preservation

Connector Options

The node uses a reusable connector configuration that applies to all processing operations:

Parameter	Type	Required	Description
`inputDirectory`	TEXT	No	Default directory for input documents
`outputDirectory`	TEXT	No	Default directory for processed documents
`tempDirectory`	TEXT	No	Temporary directory for processing operations
`maxFileSize`	INT	No	Maximum file size for processing, in MB (default: 100)
`parallelProcessing`	BOOLEAN	No	Enable parallel processing (default: true)
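
As a rough illustration of how the two documented defaults might be applied, the sketch below overlays user-supplied connector options on them and checks a file against `maxFileSize`. The helper names `resolve_connector_options` and `exceeds_limit` are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the node's real behavior may differ.

```python
from typing import Any

# Defaults taken from the table above; the other options have no documented default.
CONNECTOR_DEFAULTS: dict[str, Any] = {
    "maxFileSize": 100,          # MB
    "parallelProcessing": True,
}


def resolve_connector_options(user_options: dict[str, Any]) -> dict[str, Any]:
    """Overlay user-supplied options on the documented defaults (hypothetical helper)."""
    resolved = {**CONNECTOR_DEFAULTS, **user_options}
    if not isinstance(resolved["maxFileSize"], int) or resolved["maxFileSize"] <= 0:
        raise ValueError("maxFileSize must be a positive integer (MB)")
    return resolved


def exceeds_limit(size_bytes: int, options: dict[str, Any]) -> bool:
    """True when a file is larger than maxFileSize, assuming 1 MB = 1024 * 1024 bytes."""
    return size_bytes > options["maxFileSize"] * 1024 * 1024
```

Calling `resolve_connector_options({"inputDirectory": "/sandbox/documents"})` keeps the supplied directory and fills in the 100 MB limit and parallel processing.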

Methods

convertDocument

Convert documents between different formats with customizable options.

Parameter	Type	Required	Description
`inputFile`	TEXT	Yes	Path to the source document
`outputFile`	TEXT	Yes	Path for the converted document
`targetFormat`	TEXT	Yes	Target format: pdf, docx, html, txt, rtf, odt
`options`	Object	No	Conversion-specific options and settings

{
  "inputFile": "/sandbox/documents/report.docx",
  "outputFile": "/sandbox/converted/report.pdf",
  "targetFormat": "pdf",
  "options": {
    "quality": "high",
    "compression": true,
    "preserveFormatting": true,
    "includeBookmarks": true,
    "security": {
      "passwordProtection": false,
      "restrictPrinting": false,
      "restrictCopying": false
    },
    "metadata": {
      "title": "Quarterly Report",
      "author": "Finance Team",
      "subject": "Q3 Performance"
    }
  }
}

Output:

success (Boolean) - Conversion success status
outputPath (String) - Path to the converted document
originalFormat (String) - Source document format
targetFormat (String) - Converted document format
fileSize (Object) - Original and converted file sizes
metadata (Object) - Document properties and conversion info
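
Before sending a `convertDocument` request, it can help to check it against the parameter table above. The following is a minimal client-side sketch, not part of the node itself; `validate_convert_request` is a hypothetical helper, and the format list is copied from the `targetFormat` description.

```python
REQUIRED_PARAMETERS = ("inputFile", "outputFile", "targetFormat")
TARGET_FORMATS = {"pdf", "docx", "html", "txt", "rtf", "odt"}


def validate_convert_request(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request looks well formed."""
    problems = [
        f"missing required parameter: {name}"
        for name in REQUIRED_PARAMETERS
        if not request.get(name)
    ]
    target = request.get("targetFormat")
    if target and target.lower() not in TARGET_FORMATS:
        problems.append(f"unsupported targetFormat: {target}")
    # options is optional, but when present it must be a JSON object
    if "options" in request and not isinstance(request["options"], dict):
        problems.append("options must be an object")
    return problems
```

The function only inspects the request shape; whether the source file exists or the conversion pair is supported is left to the node.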

mergeDocuments

Combine multiple documents into a single document with options for formatting and organization.

Parameter	Type	Required	Description
`inputFiles`	Array	Yes	Array of document paths to merge
`outputFile`	TEXT	Yes	Path for the merged document
`mergeOptions`	Object	No	Options for merge behavior and formatting
`insertOptions`	Object	No	Page breaks, spacing, and insertion rules

{
  "inputFiles": [
    "/sandbox/documents/chapter1.docx",
    "/sandbox/documents/chapter2.docx",
    "/sandbox/documents/appendix.docx"
  ],
  "outputFile": "/sandbox/merged/complete_report.docx",
  "mergeOptions": {
    "preserveFormatting": true,
    "includeHeaders": false,
    "includeFooters": false,
    "resetPageNumbers": true,
    "maintainStyles": true
  },
  "insertOptions": {
    "pageBreakBetween": true,
    "spacingBetween": 12,
    "insertTableOfContents": true,
    "addSeparators": false
  }
}

splitDocument

Split a large document into smaller files by page count, section, file size, or custom markers.

Parameter	Type	Required	Description
`inputFile`	TEXT	Yes	Path to the document to split
`outputDirectory`	TEXT	Yes	Directory for split document parts
`splitCriteria`	Object	Yes	Rules for how to split the document
`naming`	Object	No	Naming convention for split files

{
  "inputFile": "/sandbox/documents/large_manual.pdf",
  "outputDirectory": "/sandbox/split_documents",
  "splitCriteria": {
    "method": "by_pages",
    "pagesPerFile": 10,
    "startPage": 1,
    "endPage": 100
  },
  "naming": {
    "prefix": "manual_section",
    "suffix": "_v1",
    "numbering": "sequential",
    "includeRange": true
  }
}
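
To make the `by_pages` criteria concrete, the sketch below computes which page ranges each output file would cover. It assumes pages are 1-based, that `startPage` and `endPage` are inclusive, and that the last file may hold fewer pages; `page_ranges` is a hypothetical helper, not a documented function.

```python
def page_ranges(start_page: int, end_page: int, pages_per_file: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) page ranges, one per output file."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    ranges = []
    first = start_page
    while first <= end_page:
        # the final range is clipped to end_page, so it may be shorter
        last = min(first + pages_per_file - 1, end_page)
        ranges.append((first, last))
        first = last + 1
    return ranges
```

With the example values above (`pagesPerFile` 10, pages 1 to 100), this yields ten ranges, from `(1, 10)` through `(91, 100)`.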

extractContent

Extract text, images, and metadata from documents for analysis or processing.

Parameter	Type	Required	Description
`inputFile`	TEXT	Yes	Path to the source document
`extractionOptions`	Object	Yes	Types of content to extract
`outputOptions`	Object	No	How to save extracted content

{
  "inputFile": "/sandbox/documents/presentation.pptx",
  "extractionOptions": {
    "text": {
      "enabled": true,
      "preserveFormatting": false,
      "includeNotes": true,
      "includeHidden": false
    },
    "images": {
      "enabled": true,
      "format": "png",
      "quality": "high",
      "minSize": "100x100"
    },
    "metadata": {
      "enabled": true,
      "includeProperties": true,
      "includeStats": true
    },
    "tables": {
      "enabled": true,
      "format": "csv",
      "preserveStructure": true
    }
  },
  "outputOptions": {
    "textFile": "/sandbox/extracted/presentation_text.txt",
    "imageDirectory": "/sandbox/extracted/images",
    "metadataFile": "/sandbox/extracted/metadata.json"
  }
}

Format Conversion

Supported Conversions

From/To	PDF	DOCX	HTML	TXT	RTF	ODT	PPTX	XLSX
PDF	✓	✓	✓	✓	✓	✓	✗	✗
DOCX	✓	✓	✓	✓	✓	✓	✗	✗
HTML	✓	✓	✓	✓	✓	✓	✗	✗
TXT	✓	✓	✓	✓	✓	✓	✗	✗
RTF	✓	✓	✓	✓	✓	✓	✗	✗
ODT	✓	✓	✓	✓	✓	✓	✗	✗
PPTX	✓	✓	✓	✓	✗	✗	✓	✗
XLSX	✓	✓	✓	✓	✗	✗	✗	✓
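
The matrix above can be encoded as a lookup table for client-side checks. This sketch mirrors the table row by row; `SUPPORTED` and `is_supported` are hypothetical names used only for illustration.

```python
# Targets shared by the six text-oriented source formats in the table.
TEXT_TARGETS = frozenset({"PDF", "DOCX", "HTML", "TXT", "RTF", "ODT"})

# Source format -> set of target formats marked as supported in the table.
SUPPORTED = {
    "PDF": TEXT_TARGETS,
    "DOCX": TEXT_TARGETS,
    "HTML": TEXT_TARGETS,
    "TXT": TEXT_TARGETS,
    "RTF": TEXT_TARGETS,
    "ODT": TEXT_TARGETS,
    "PPTX": frozenset({"PDF", "DOCX", "HTML", "TXT", "PPTX"}),
    "XLSX": frozenset({"PDF", "DOCX", "HTML", "TXT", "XLSX"}),
}


def is_supported(source: str, target: str) -> bool:
    """True when the table marks source -> target as a supported conversion."""
    return target.upper() in SUPPORTED.get(source.upper(), frozenset())
```

For instance, `is_supported("docx", "pdf")` is true, while `is_supported("pptx", "rtf")` and `is_supported("pdf", "xlsx")` are false.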

Conversion Options

PDF Output Options

{
  "pdfOptions": {
    "quality": "high",
    "compression": "medium",
    "colorMode": "rgb",
    "resolution": 300,
    "preserveHyperlinks": true,
    "includeBookmarks": true,
    "security": {
      "ownerPassword": "",
      "userPassword": "",
      "restrictPrinting": false,
      "restrictCopying": false,
      "restrictAnnotations": false
    },
    "accessibility": {
      "taggedPDF": true,
      "structureTree": true,
      "altText": true
    }
  }
}

HTML Output Options

{
  "htmlOptions": {
    "cssHandling": "embedded",
    "imageHandling": "embedded",
    "responsive": true,
    "compatibility": "html5",
    "encoding": "utf-8",
    "preserveLayout": true,
    "includeMetadata": true,
    "optimizeForWeb": true
  }
}

Text Output Options

{
  "textOptions": {
    "encoding": "utf-8",
    "preserveLineBreaks": true,
    "preserveSpacing": false,
    "includeHeaders": false,
    "includeFooters": false,
    "stripFormatting": true,
    "paragraphSeparator": "\n\n",
    "pageSeparator": "\n\n---PAGE BREAK---\n\n"
  }
}
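
One plausible reading of `paragraphSeparator` and `pageSeparator` is that paragraphs are joined within a page and pages are then joined with the page separator. The sketch below illustrates that reading only; `join_pages` and the list-of-paragraphs input shape are assumptions.

```python
def join_pages(pages: list[list[str]], text_options: dict) -> str:
    """Join paragraphs within each page, then join pages with the page separator."""
    paragraph_separator = text_options["paragraphSeparator"]
    page_separator = text_options["pageSeparator"]
    return page_separator.join(
        paragraph_separator.join(paragraphs) for paragraphs in pages
    )


text_options = {
    "paragraphSeparator": "\n\n",
    "pageSeparator": "\n\n---PAGE BREAK---\n\n",
}
plain_text = join_pages([["First paragraph.", "Second paragraph."], ["Next page."]], text_options)
```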

Document Merging

Merge Strategies

Sequential Merge

Append documents one after another:

{
  "mergeStrategy": "sequential",
  "options": {
    "pageBreakBetween": true,
    "preserveIndividualFormatting": true,
    "resetPageNumbering": false,
    "includeSourceNames": false
  }
}

Chapter-based Merge

Organize documents as chapters with a generated table of contents:

{
  "mergeStrategy": "chapters",
  "options": {
    "generateTOC": true,
    "chapterTitles": ["Introduction", "Analysis", "Conclusions"],
    "numberChapters": true,
    "resetPageNumberingPerChapter": false,
    "chapterBreakType": "page"
  }
}

Template-based Merge

Merge into a structured template:

{
  "mergeStrategy": "template",
  "templateFile": "/sandbox/templates/report_template.docx",
  "placeholderMapping": {
    "{{SECTION_1}}": "/sandbox/content/introduction.docx",
    "{{SECTION_2}}": "/sandbox/content/analysis.docx",
    "{{APPENDIX}}": "/sandbox/content/appendix.pdf"
  }
}

Document Splitting

Split Methods

Page-based Splitting

{
  "splitMethod": "pages",
  "configuration": {
    "pagesPerFile": 5,
    "startPage": 1,
    "endPage": null,
    "includeEmptyPages": false,
    "maintainFormatting": true
  }
}

Section-based Splitting

{
  "splitMethod": "sections",
  "configuration": {
    "sectionMarkers": ["h1", "h2"],
    "includeSubsections": true,
    "minimumSectionSize": 2,
    "preserveHierarchy": true
  }
}

Size-based Splitting

{
  "splitMethod": "file_size",
  "configuration": {
    "maxSizePerFile": "5MB",
    "balanceFiles": true,
    "avoidBreakingParagraphs": true
  }
}

Custom Marker Splitting

{
  "splitMethod": "markers",
  "configuration": {
    "markers": ["---SPLIT---", "<!--BREAK-->"],
    "removeMarkers": true,
    "minimumContentBetween": 100
  }
}
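
The marker configuration can be approximated on plain text with a regular expression. In this sketch, `minimumContentBetween` is interpreted as a minimum character count per part, and parts that fall short are folded into the preceding part; that interpretation and the helper name `split_on_markers` are assumptions.

```python
import re


def split_on_markers(
    text: str,
    markers: list[str],
    remove_markers: bool = True,
    minimum_content_between: int = 0,
) -> list[str]:
    """Split text at any marker, folding too-short parts into the previous part."""
    alternatives = "|".join(re.escape(marker) for marker in markers)
    if remove_markers:
        pieces = re.split(alternatives, text)
    else:
        # zero-width lookahead keeps each marker at the start of the part it opens
        pieces = re.split(f"(?={alternatives})", text)

    parts: list[str] = []
    for piece in pieces:
        if parts and len(piece.strip()) < minimum_content_between:
            parts[-1] += piece
        else:
            parts.append(piece)
    return [part.strip() for part in parts if part.strip()]
```

A real implementation would operate on document structure rather than raw text, so treat this only as a model of the option semantics.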

Content Extraction

Text Extraction

Extract and process text content from documents:

{
  "textExtraction": {
    "formatting": {
      "preserveBold": false,
      "preserveItalics": false,
      "preserveUnderline": false,
      "convertToMarkdown": false
    },
    "structure": {
      "preserveHeadings": true,
      "preserveLists": true,
      "preserveTables": true,
      "preserveParagraphs": true
    },
    "filtering": {
      "removeHeaders": true,
      "removeFooters": true,
      "removePageNumbers": true,
      "removeWatermarks": true,
      "minParagraphLength": 10
    }
  }
}

Image Extraction

Extract embedded images, filtering them by size and controlling the output format and file naming:

{
  "imageExtraction": {
    "formats": ["png", "jpg", "gif"],
    "quality": {
      "outputFormat": "png",
      "compression": "medium",
      "resolution": 150
    },
    "filtering": {
      "minWidth": 100,
      "minHeight": 100,
      "maxWidth": 4096,
      "maxHeight": 4096,
      "skipDuplicates": true
    },
    "naming": {
      "prefix": "extracted_img",
      "includePageNumber": true,
      "includeIndex": true,
      "sanitizeNames": true
    }
  }
}
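
The `filtering` block above could be modeled as follows. This is a sketch under two assumptions: the size bounds are inclusive, and `skipDuplicates` treats images with byte-identical data as duplicates. `ExtractedImage` and `filter_images` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ExtractedImage:
    width: int
    height: int
    data: bytes


def filter_images(images: list[ExtractedImage], filtering: dict) -> list[ExtractedImage]:
    """Keep images inside the size bounds, optionally dropping byte-identical repeats."""
    kept: list[ExtractedImage] = []
    seen_digests: set[str] = set()
    for image in images:
        if not filtering["minWidth"] <= image.width <= filtering["maxWidth"]:
            continue
        if not filtering["minHeight"] <= image.height <= filtering["maxHeight"]:
            continue
        digest = hashlib.sha256(image.data).hexdigest()
        if filtering.get("skipDuplicates") and digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append(image)
    return kept
```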

Metadata Extraction

Extract document properties, statistics, and structure counts:

{
  "metadataExtraction": {
    "documentProperties": {
      "title": true,
      "author": true,
      "subject": true,
      "keywords": true,
      "createdDate": true,
      "modifiedDate": true,
      "lastSavedBy": true
    },
    "statistics": {
      "pageCount": true,
      "wordCount": true,
      "characterCount": true,
      "paragraphCount": true,
      "imageCount": true,
      "tableCount": true
    },
    "structure": {
      "headingCount": true,
      "listCount": true,
      "linkCount": true,
      "footnoteCount": true
    }
  }
}

Batch Processing

Parallel Document Processing

Process multiple documents simultaneously:

{
  "batchProcessing": {
    "inputDirectory": "/sandbox/documents/batch",
    "outputDirectory": "/sandbox/processed",
    "operations": [
      {
        "type": "convert",
        "targetFormat": "pdf",
        "options": {"quality": "high"}
      },
      {
        "type": "extract_text",
        "outputFile": "{basename}_text.txt"
      }
    ],
    "parallelism": {
      "maxConcurrent": 5,
      "queueSize": 20,
      "retryAttempts": 3
    },
    "filtering": {
      "includeFormats": [".docx", ".doc", ".rtf"],
      "excludePatterns": ["*_temp*", "*_backup*"],
      "minFileSize": "1KB",
      "maxFileSize": "100MB"
    }
  }
}
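
To show how the `filtering` rules and the `{basename}` placeholder might interact, here is a small sketch. It assumes that `includeFormats` matches file extensions case-insensitively, that `excludePatterns` are shell-style globs applied to the file name, and that `{basename}` expands to the file name without its extension. The size limits are left out, and both helper names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(names: list[str], filtering: dict) -> list[str]:
    """Apply includeFormats and excludePatterns to a list of file names."""
    include = {extension.lower() for extension in filtering.get("includeFormats", [])}
    exclude = filtering.get("excludePatterns", [])
    selected = []
    for name in names:
        path = PurePosixPath(name)
        if include and path.suffix.lower() not in include:
            continue
        if any(fnmatchcase(path.name, pattern) for pattern in exclude):
            continue
        selected.append(name)
    return selected


def expand_output_name(template: str, input_name: str) -> str:
    """Replace {basename} with the input file name minus its extension."""
    return template.replace("{basename}", PurePosixPath(input_name).stem)
```

Under these assumptions, `report_temp.docx` is skipped by the `*_temp*` pattern, and `report.docx` maps to `report_text.txt` for the `extract_text` operation.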

Workflow Automation

Create automated document processing workflows:

{
  "workflow": {
    "name": "document_standardization",
    "steps": [
      {
        "step": "validate",
        "action": "check_format_support",
        "continueOnFailure": false
      },
      {
        "step": "backup",
        "action": "copy_to_backup",
        "location": "/sandbox/backups"
      },
      {
        "step": "convert",
        "action": "convert_to_pdf",
        "options": {"quality": "high", "compression": true}
      },
      {
        "step": "extract",
        "action": "extract_metadata",
        "outputFile": "{basename}_metadata.json"
      },
      {
        "step": "organize",
        "action": "move_to_output",
        "directory": "/sandbox/standardized"
      }
    ],
    "errorHandling": {
      "onStepFailure": "continue_workflow",
      "logErrors": true,
      "notifyOnFailure": true
    }
  }
}
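
The example mixes a per-step `continueOnFailure` flag with a workflow-level `onStepFailure` setting. The sketch below shows one way the two could combine, assuming the step-level flag overrides the workflow-level default. The `run_workflow` function and the `actions` mapping of action names to callables are hypothetical and stand in for the node's real executor.

```python
from typing import Callable


def run_workflow(workflow: dict, actions: dict[str, Callable[[dict], None]]) -> list[tuple[str, str]]:
    """Run steps in order and return (step name, outcome) pairs."""
    error_handling = workflow.get("errorHandling", {})
    continue_by_default = error_handling.get("onStepFailure") == "continue_workflow"
    results = []
    for step in workflow["steps"]:
        try:
            actions[step["action"]](step)
            results.append((step["step"], "ok"))
        except Exception as error:
            results.append((step["step"], f"failed: {error}"))
            # assumed precedence: a step-level flag wins over the workflow default
            if not step.get("continueOnFailure", continue_by_default):
                break
    return results
```

Under this reading, a failure in the `validate` step stops the run because it sets `continueOnFailure` to false, while a failure in `convert` is recorded and the remaining steps still execute.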

Quality Control

Validation and Verification

Validate processed output against the source document and report any discrepancies:

{
  "qualityControl": {
    "validation": {
      "checkFileIntegrity": true,
      "verifyFormat": true,
      "validateContent": true,
      "compareFileSize": true
    },
    "verification": {
      "comparePageCount": true,
      "checkTextPreservation": true,
      "validateImageQuality": true,
      "verifyMetadata": true
    },
    "reporting": {
      "generateReport": true,
      "includeStatistics": true,
      "logDiscrepancies": true,
      "reportPath": "/sandbox/quality_reports"
    }
  }
}

Error Recovery

Handle processing errors gracefully:

{
  "errorRecovery": {
    "retryPolicy": {
      "maxAttempts": 3,
      "backoffStrategy": "exponential",
      "retryDelay": 1000
    },
    "fallbackOptions": {
      "useAlternativeConverter": true,
      "reduceQualitySettings": true,
      "skipProblematicPages": false
    },
    "logging": {
      "logLevel": "detailed",
      "includeStackTrace": true,
      "saveFailedFiles": true
    }
  }
}
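
As a worked example of the retry policy, the sketch below computes the wait before each retry. It assumes that `retryDelay` is in milliseconds, that `maxAttempts` includes the first attempt, and that the exponential strategy doubles the delay each time; none of these details is stated above, and `retry_delays` is a hypothetical helper.

```python
def retry_delays(retry_policy: dict) -> list[int]:
    """Return the delay in milliseconds before each retry (not the first attempt)."""
    retries = max(retry_policy["maxAttempts"] - 1, 0)
    base_delay = retry_policy["retryDelay"]
    if retry_policy.get("backoffStrategy") == "exponential":
        # assumed doubling: base, 2 * base, 4 * base, ...
        return [base_delay * 2 ** attempt for attempt in range(retries)]
    return [base_delay] * retries
```

With the values shown (`maxAttempts` 3, `retryDelay` 1000), this gives waits of 1000 ms and 2000 ms before the second and third attempts.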

Performance Optimization

Processing Optimization

{
  "performanceSettings": {
    "memory": {
      "maxMemoryUsage": "1GB",
      "enableGarbageCollection": true,
      "streamProcessing": true
    },
    "processing": {
      "enableCaching": true,
      "cacheSize": "256MB",
      "preloadFonts": true,
      "optimizeImages": true
    },
    "concurrency": {
      "maxParallelJobs": 8,
      "queueSize": 50,
      "threadPoolSize": 16
    }
  }
}

Error Handling

Common Processing Errors

Error Type	Cause	Resolution
`UNSUPPORTED_FORMAT`	File format not supported	Check supported format list
`CORRUPTED_FILE`	Source file is damaged	Verify file integrity
`CONVERSION_FAILED`	Format conversion error	Try alternative converter
`MEMORY_EXCEEDED`	File too large for processing	Increase memory or split file
`PERMISSION_DENIED`	Insufficient file permissions	Check file access rights

Error Response Format

{
  "success": false,
  "error": {
    "type": "CONVERSION_FAILED",
    "message": "Failed to convert DOCX to PDF due to embedded object issues",
    "inputFile": "/sandbox/documents/complex_report.docx",
    "operation": "convert_to_pdf",
    "details": {
      "errorCode": "EMBEDDED_OBJECT_ERROR",
      "affectedPages": [5, 12, 18],
      "objectTypes": ["embedded_excel", "complex_chart"]
    },
    "suggestions": [
      "Try converting without embedded objects",
      "Use alternative output format",
      "Split document and convert sections separately",
      "Check embedded object compatibility"
    ]
  }
}
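
A caller might turn such a response into a short log message. The sketch below reads only fields that appear in the example above and treats `details` and `suggestions` as optional; `summarize_error` is a hypothetical helper.

```python
def summarize_error(response: dict) -> str:
    """Build a readable summary from a response in the format shown above."""
    if response.get("success"):
        return "ok"
    error = response["error"]
    lines = [f'{error["type"]}: {error["message"]}']
    affected_pages = error.get("details", {}).get("affectedPages")
    if affected_pages:
        lines.append("affected pages: " + ", ".join(str(page) for page in affected_pages))
    lines.extend(f"suggestion: {text}" for text in error.get("suggestions", []))
    return "\n".join(lines)
```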

Usage Examples

Document Standardization Workflow

{
  "batchProcessing": {
    "inputDirectory": "/sandbox/mixed_documents",
    "outputDirectory": "/sandbox/standardized",
    "operations": [
      {
        "type": "convert",
        "targetFormat": "pdf",
        "options": {
          "quality": "high",
          "security": {"restrictCopying": false}
        }
      },
      {
        "type": "extract_metadata",
        "outputFile": "/sandbox/metadata/{basename}_info.json"
      }
    ],
    "filtering": {
      "includeFormats": [".doc", ".docx", ".rtf", ".odt"]
    }
  }
}

Report Assembly Pipeline

{
  "workflow": [
    {
      "step": "merge_sections",
      "inputFiles": [
        "/sandbox/sections/executive_summary.docx",
        "/sandbox/sections/financial_analysis.docx",
        "/sandbox/sections/recommendations.docx"
      ],
      "outputFile": "/sandbox/temp/merged_report.docx"
    },
    {
      "step": "convert_to_pdf",
      "inputFile": "/sandbox/temp/merged_report.docx",
      "outputFile": "/sandbox/reports/final_report.pdf",
      "options": {"includeBookmarks": true}
    },
    {
      "step": "extract_summary",
      "inputFile": "/sandbox/reports/final_report.pdf",
      "pages": "1-3",
      "outputFile": "/sandbox/summaries/executive_summary.pdf"
    }
  ]
}

Content Analysis Pipeline

{
  "contentAnalysis": {
    "inputFile": "/sandbox/documents/research_paper.pdf",
    "extractions": [
      {
        "type": "text",
        "outputFile": "/sandbox/analysis/full_text.txt",
        "options": {"preserveStructure": true}
      },
      {
        "type": "images",
        "outputDirectory": "/sandbox/analysis/figures",
        "options": {"minSize": "300x200"}
      },
      {
        "type": "tables",
        "outputDirectory": "/sandbox/analysis/data",
        "format": "csv"
      },
      {
        "type": "metadata",
        "outputFile": "/sandbox/analysis/document_info.json"
      }
    ]
  }
}