Axellero.io

Document Processing

Process existing documents with conversion, merging, splitting, and content extraction, either one file at a time or through automated workflows and batch operations.

Features

  • Format conversion between multiple document types
  • Document merging and splitting operations
  • Text extraction and content analysis
  • Batch processing with parallel operations
  • Metadata manipulation and preservation

Connector Options

The node uses a reusable connector configuration that applies to all processing operations:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputDirectory | TEXT | No | Default directory for input documents |
| outputDirectory | TEXT | No | Default directory for processed documents |
| tempDirectory | TEXT | No | Temporary directory for processing operations |
| maxFileSize | INT | No | Maximum file size for processing (MB, default: 100) |
| parallelProcessing | BOOLEAN | No | Enable parallel processing (default: true) |
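
As a rough illustration of how the maxFileSize limit could be checked before a file is handed to the node, here is a minimal Python sketch. The helper names and the choice of 1 MB = 1,048,576 bytes are assumptions for illustration; the table above only states that the limit is given in MB with a default of 100.

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: the docs do not say whether MB is binary or decimal


def within_size_limit(size_bytes: int, max_file_size_mb: int = 100) -> bool:
    """Return True if a file of size_bytes fits under the maxFileSize limit."""
    return size_bytes <= max_file_size_mb * BYTES_PER_MB


def file_within_limit(path: str, max_file_size_mb: int = 100) -> bool:
    """Check an on-disk file against the limit (hypothetical pre-flight helper)."""
    return within_size_limit(os.path.getsize(path), max_file_size_mb)
```

Running a check like this before submitting a batch avoids discovering oversized files halfway through processing.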

Methods

convertDocument

Convert documents between different formats with customizable options.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputFile | TEXT | Yes | Path to the source document |
| outputFile | TEXT | Yes | Path for the converted document |
| targetFormat | TEXT | Yes | Target format: pdf, docx, html, txt, rtf, odt |
| options | Object | No | Conversion-specific options and settings |

{
  "inputFile": "/sandbox/documents/report.docx",
  "outputFile": "/sandbox/converted/report.pdf",
  "targetFormat": "pdf",
  "options": {
    "quality": "high",
    "compression": true,
    "preserveFormatting": true,
    "includeBookmarks": true,
    "security": {
      "passwordProtection": false,
      "restrictPrinting": false,
      "restrictCopying": false
    },
    "metadata": {
      "title": "Quarterly Report",
      "author": "Finance Team",
      "subject": "Q3 Performance"
    }
  }
}

Output:

  • success (Boolean) - Conversion success status
  • outputPath (String) - Path to the converted document
  • originalFormat (String) - Source document format
  • targetFormat (String) - Converted document format
  • fileSize (Object) - Original and converted file sizes
  • metadata (Object) - Document properties and conversion info
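
The convertDocument parameters can also be assembled programmatically. The following Python sketch builds the parameter object and rejects target formats that are not in the list above. The function name build_convert_request is hypothetical, and how the resulting JSON reaches the node is outside the scope of this sketch.

```python
import json

# Target formats listed in the convertDocument parameter table.
SUPPORTED_TARGETS = {"pdf", "docx", "html", "txt", "rtf", "odt"}


def build_convert_request(input_file, output_file, target_format, options=None):
    """Assemble a convertDocument parameter object, rejecting unknown formats early."""
    fmt = target_format.lower()
    if fmt not in SUPPORTED_TARGETS:
        raise ValueError(f"Unsupported target format: {target_format!r}")
    request = {
        "inputFile": input_file,
        "outputFile": output_file,
        "targetFormat": fmt,
    }
    if options:  # 'options' is optional, so omit it when empty
        request["options"] = options
    return request


request = build_convert_request(
    "/sandbox/documents/report.docx",
    "/sandbox/converted/report.pdf",
    "pdf",
    {"quality": "high", "includeBookmarks": True},
)
payload = json.dumps(request, indent=2)  # JSON text in the same shape as the example
```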

mergeDocuments

Combine multiple documents into a single document with options for formatting and organization.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputFiles | Array | Yes | Array of document paths to merge |
| outputFile | TEXT | Yes | Path for the merged document |
| mergeOptions | Object | No | Options for merge behavior and formatting |
| insertOptions | Object | No | Page breaks, spacing, and insertion rules |

{
  "inputFiles": [
    "/sandbox/documents/chapter1.docx",
    "/sandbox/documents/chapter2.docx",
    "/sandbox/documents/appendix.docx"
  ],
  "outputFile": "/sandbox/merged/complete_report.docx",
  "mergeOptions": {
    "preserveFormatting": true,
    "includeHeaders": false,
    "includeFooters": false,
    "resetPageNumbers": true,
    "maintainStyles": true
  },
  "insertOptions": {
    "pageBreakBetween": true,
    "spacingBetween": 12,
    "insertTableOfContents": true,
    "addSeparators": false
  }
}

splitDocument

Split large documents into smaller files based on various criteria.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputFile | TEXT | Yes | Path to the document to split |
| outputDirectory | TEXT | Yes | Directory for split document parts |
| splitCriteria | Object | Yes | Rules for how to split the document |
| naming | Object | No | Naming convention for split files |

{
  "inputFile": "/sandbox/documents/large_manual.pdf",
  "outputDirectory": "/sandbox/split_documents",
  "splitCriteria": {
    "method": "by_pages",
    "pagesPerFile": 10,
    "startPage": 1,
    "endPage": 100
  },
  "naming": {
    "prefix": "manual_section",
    "suffix": "_v1",
    "numbering": "sequential",
    "includeRange": true
  }
}
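
To make the by_pages criteria concrete, the sketch below computes the page ranges that the example above would produce (pages 1 to 100 in files of 10 pages). The file-name pattern in part_name is an assumption: the documentation names the prefix, suffix, numbering, and includeRange settings but does not specify how they are combined.

```python
def page_ranges(start_page, end_page, pages_per_file):
    """Yield inclusive (first, last) page ranges covering start_page..end_page."""
    if pages_per_file < 1 or end_page < start_page:
        raise ValueError("pages_per_file must be >= 1 and end_page >= start_page")
    first = start_page
    while first <= end_page:
        last = min(first + pages_per_file - 1, end_page)
        yield first, last
        first = last + 1


def part_name(index, page_range, prefix="manual_section", suffix="_v1", ext="pdf"):
    """Build an output file name; this exact pattern is an assumption, not documented."""
    first, last = page_range
    return f"{prefix}_{index:03d}_p{first}-{last}{suffix}.{ext}"


ranges = list(page_ranges(1, 100, 10))  # ten ranges of ten pages each
names = [part_name(i, r) for i, r in enumerate(ranges, start=1)]
```

When the page count is not a multiple of pagesPerFile, the last range is simply shorter.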

extractContent

Extract text, images, and metadata from documents for analysis or processing.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputFile | TEXT | Yes | Path to the source document |
| extractionOptions | Object | Yes | Types of content to extract |
| outputOptions | Object | No | How to save extracted content |

{
  "inputFile": "/sandbox/documents/presentation.pptx",
  "extractionOptions": {
    "text": {
      "enabled": true,
      "preserveFormatting": false,
      "includeNotes": true,
      "includeHidden": false
    },
    "images": {
      "enabled": true,
      "format": "png",
      "quality": "high",
      "minSize": "100x100"
    },
    "metadata": {
      "enabled": true,
      "includeProperties": true,
      "includeStats": true
    },
    "tables": {
      "enabled": true,
      "format": "csv",
      "preserveStructure": true
    }
  },
  "outputOptions": {
    "textFile": "/sandbox/extracted/presentation_text.txt",
    "imageDirectory": "/sandbox/extracted/images",
    "metadataFile": "/sandbox/extracted/metadata.json"
  }
}

Format Conversion

Supported Conversions

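Source and target formats covered by the conversion matrix: PDF, DOCX, HTML, TXT, RTF, ODT, PPTX, XLSX. (The per-pair support marks of the matrix are not reproduced here.)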

Conversion Options

PDF Output Options

{
  "pdfOptions": {
    "quality": "high",
    "compression": "medium",
    "colorMode": "rgb",
    "resolution": 300,
    "preserveHyperlinks": true,
    "includeBookmarks": true,
    "security": {
      "ownerPassword": "",
      "userPassword": "",
      "restrictPrinting": false,
      "restrictCopying": false,
      "restrictAnnotations": false
    },
    "accessibility": {
      "taggedPDF": true,
      "structureTree": true,
      "altText": true
    }
  }
}

HTML Output Options

{
  "htmlOptions": {
    "cssHandling": "embedded",
    "imageHandling": "embedded",
    "responsive": true,
    "compatibility": "html5",
    "encoding": "utf-8",
    "preserveLayout": true,
    "includeMetadata": true,
    "optimizeForWeb": true
  }
}

Text Extraction Options

{
  "textOptions": {
    "encoding": "utf-8",
    "preserveLineBreaks": true,
    "preserveSpacing": false,
    "includeHeaders": false,
    "includeFooters": false,
    "stripFormatting": true,
    "paragraphSeparator": "\n\n",
    "pageSeparator": "\n\n---PAGE BREAK---\n\n"
  }
}
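
The two separator settings determine how extracted text is stitched together. A minimal Python sketch of that assembly follows, assuming extracted content arrives as a list of pages, each holding a list of paragraph strings. That input shape and the function name join_pages are illustrative assumptions, not part of the documented API.

```python
def join_pages(pages, paragraph_separator="\n\n",
               page_separator="\n\n---PAGE BREAK---\n\n"):
    """Join pages (each a list of paragraph strings) into one plain-text string."""
    rendered = [paragraph_separator.join(paragraphs) for paragraphs in pages]
    return page_separator.join(rendered)


text = join_pages([["Intro.", "Scope."], ["Results."]])
```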

Document Merging

Merge Strategies

Sequential Merge

Append documents one after another:

{
  "mergeStrategy": "sequential",
  "options": {
    "pageBreakBetween": true,
    "preserveIndividualFormatting": true,
    "resetPageNumbering": false,
    "includeSourceNames": false
  }
}

Chapter-based Merge

Organize documents as chapters with a generated table of contents (TOC):

{
  "mergeStrategy": "chapters",
  "options": {
    "generateTOC": true,
    "chapterTitles": ["Introduction", "Analysis", "Conclusions"],
    "numberChapters": true,
    "resetPageNumberingPerChapter": false,
    "chapterBreakType": "page"
  }
}

Template-based Merge

Merge into a structured template:

{
  "mergeStrategy": "template",
  "templateFile": "/sandbox/templates/report_template.docx",
  "placeholderMapping": {
    "{{SECTION_1}}": "/sandbox/content/introduction.docx",
    "{{SECTION_2}}": "/sandbox/content/analysis.docx",
    "{{APPENDIX}}": "/sandbox/content/appendix.pdf"
  }
}
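
Before running a template-based merge, it can help to confirm that every placeholder in the template has a source file and that no mapping entry is unused. The sketch below does this on plain text. It assumes the template text has already been extracted (reading a real DOCX file needs a separate library) and that placeholders follow the {{NAME}} style shown above; check_placeholders is a hypothetical helper.

```python
import re

# Matches placeholders such as {{SECTION_1}} (uppercase letters, digits, underscores).
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def check_placeholders(template_text, placeholder_mapping):
    """Return (missing, unused).

    missing: placeholders in the template with no entry in the mapping.
    unused: mapping keys that never appear in the template.
    """
    found = {"{{" + name + "}}" for name in PLACEHOLDER.findall(template_text)}
    mapped = set(placeholder_mapping)
    return sorted(found - mapped), sorted(mapped - found)


template_text = "Summary\n{{SECTION_1}}\n{{SECTION_2}}\n{{GLOSSARY}}"
mapping = {
    "{{SECTION_1}}": "/sandbox/content/introduction.docx",
    "{{SECTION_2}}": "/sandbox/content/analysis.docx",
    "{{APPENDIX}}": "/sandbox/content/appendix.pdf",
}
missing, unused = check_placeholders(template_text, mapping)
```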

Document Splitting

Split Methods

Page-based Splitting

{
  "splitMethod": "pages",
  "configuration": {
    "pagesPerFile": 5,
    "startPage": 1,
    "endPage": null,
    "includeEmptyPages": false,
    "maintainFormatting": true
  }
}

Section-based Splitting

{
  "splitMethod": "sections",
  "configuration": {
    "sectionMarkers": ["h1", "h2"],
    "includeSubsections": true,
    "minimumSectionSize": 2,
    "preserveHierarchy": true
  }
}

Size-based Splitting

{
  "splitMethod": "file_size",
  "configuration": {
    "maxSizePerFile": "5MB",
    "balanceFiles": true,
    "avoidBreakingParagraphs": true
  }
}
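
Size settings such as maxSizePerFile are written as strings like "5MB". A small parser along the following lines can convert them to bytes for comparison. The use of binary multiples (1 KB = 1024 bytes) is an assumption, since the units are not defined in this document, and parse_size is a hypothetical helper.

```python
import re

# Assumption: binary multiples; the documentation does not define the units precisely.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)\s*$", re.IGNORECASE)


def parse_size(value):
    """Convert a size string such as '5MB' into a number of bytes."""
    match = SIZE_PATTERN.match(value)
    if not match:
        raise ValueError(f"Unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.upper()])
```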

Custom Marker Splitting

{
  "splitMethod": "markers",
  "configuration": {
    "markers": ["---SPLIT---", "<!--BREAK-->"],
    "removeMarkers": true,
    "minimumContentBetween": 100
  }
}
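
The marker-based method can be pictured with the following Python sketch, which operates on plain text. It assumes that minimumContentBetween is measured in characters and that a piece shorter than this minimum is merged into the preceding part; both points are assumptions, as the configuration above does not define them.

```python
import re


def split_on_markers(text, markers, remove_markers=True, minimum_content_between=0):
    """Split text at any marker; pieces that are too short join the previous part."""
    if not markers:
        return [text]
    pattern = "|".join(re.escape(m) for m in markers)
    if remove_markers:
        pieces = re.split(pattern, text)
    else:
        # Zero-width split: each marker stays at the start of the part that follows it.
        pieces = re.split(f"(?={pattern})", text)
    parts = []
    for piece in pieces:
        if parts and len(piece.strip()) < minimum_content_between:
            parts[-1] += piece  # too short to stand alone (assumed semantics)
        else:
            parts.append(piece)
    return [p for p in parts if p.strip()]


sample = "alpha---SPLIT---beta<!--BREAK-->gamma"
parts = split_on_markers(sample, ["---SPLIT---", "<!--BREAK-->"])
```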

Content Extraction

Text Extraction

Extract and process text content from documents:

{
  "textExtraction": {
    "formatting": {
      "preserveBold": false,
      "preserveItalics": false,
      "preserveUnderline": false,
      "convertToMarkdown": false
    },
    "structure": {
      "preserveHeadings": true,
      "preserveLists": true,
      "preserveTables": true,
      "preserveParagraphs": true
    },
    "filtering": {
      "removeHeaders": true,
      "removeFooters": true,
      "removePageNumbers": true,
      "removeWatermarks": true,
      "minParagraphLength": 10
    }
  }
}

Image Extraction

Extract embedded images, with control over output format, size filtering, and file naming:

{
  "imageExtraction": {
    "formats": ["png", "jpg", "gif"],
    "quality": {
      "outputFormat": "png",
      "compression": "medium",
      "resolution": 150
    },
    "filtering": {
      "minWidth": 100,
      "minHeight": 100,
      "maxWidth": 4096,
      "maxHeight": 4096,
      "skipDuplicates": true
    },
    "naming": {
      "prefix": "extracted_img",
      "includePageNumber": true,
      "includeIndex": true,
      "sanitizeNames": true
    }
  }
}
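
The filtering settings above amount to a size check plus duplicate detection. The sketch below shows one way this could work, using a SHA-256 hash of the image bytes to detect exact duplicates. The input shape (dicts with width, height, and data) and the hashing approach are illustrative assumptions, not a description of how the node is implemented.

```python
import hashlib


def filter_images(images, min_width=100, min_height=100,
                  max_width=4096, max_height=4096, skip_duplicates=True):
    """Keep images within the size bounds, dropping byte-identical duplicates.

    Each image is a dict with 'width', 'height', and 'data' (bytes); this
    shape is hypothetical and chosen only for the example.
    """
    seen = set()
    kept = []
    for image in images:
        if not (min_width <= image["width"] <= max_width):
            continue
        if not (min_height <= image["height"] <= max_height):
            continue
        digest = hashlib.sha256(image["data"]).hexdigest()
        if skip_duplicates and digest in seen:
            continue
        seen.add(digest)
        kept.append(image)
    return kept
```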

Metadata Extraction

Extract comprehensive document metadata:

{
  "metadataExtraction": {
    "documentProperties": {
      "title": true,
      "author": true,
      "subject": true,
      "keywords": true,
      "createdDate": true,
      "modifiedDate": true,
      "lastSavedBy": true
    },
    "statistics": {
      "pageCount": true,
      "wordCount": true,
      "characterCount": true,
      "paragraphCount": true,
      "imageCount": true,
      "tableCount": true
    },
    "structure": {
      "headingCount": true,
      "listCount": true,
      "linkCount": true,
      "footnoteCount": true
    }
  }
}
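
For intuition about the statistics flags, the following sketch computes three of them from plain text. The counting rules used here (words split on whitespace, paragraphs split on blank lines, characters counted including whitespace) are assumptions; the node may count differently.

```python
def text_statistics(text):
    """Compute simple counts corresponding to three of the statistics flags."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "wordCount": len(text.split()),
        "characterCount": len(text),
        "paragraphCount": len(paragraphs),
    }


stats = text_statistics("First paragraph here.\n\nSecond one.")
```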

Batch Processing

Parallel Document Processing

Process multiple documents simultaneously:

{
  "batchProcessing": {
    "inputDirectory": "/sandbox/documents/batch",
    "outputDirectory": "/sandbox/processed",
    "operations": [
      {
        "type": "convert",
        "targetFormat": "pdf",
        "options": {"quality": "high"}
      },
      {
        "type": "extract_text",
        "outputFile": "{basename}_text.txt"
      }
    ],
    "parallelism": {
      "maxConcurrent": 5,
      "queueSize": 20,
      "retryAttempts": 3
    },
    "filtering": {
      "includeFormats": [".docx", ".doc", ".rtf"],
      "excludePatterns": ["*_temp*", "*_backup*"],
      "minFileSize": "1KB",
      "maxFileSize": "100MB"
    }
  }
}
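
The filtering and parallelism settings can be approximated in a few lines of Python. This sketch selects files by extension and exclude pattern, expands the {basename} placeholder, and runs a per-file function on a bounded thread pool. The helper names are hypothetical, and reading {basename} as the file name without its extension is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(paths, include_formats, exclude_patterns):
    """Return paths whose extension is included and whose name matches no exclude pattern."""
    allowed = {ext.lower() for ext in include_formats}
    selected = []
    for path in paths:
        p = PurePosixPath(path)
        if p.suffix.lower() not in allowed:
            continue
        if any(fnmatchcase(p.name, pattern) for pattern in exclude_patterns):
            continue
        selected.append(path)
    return selected


def expand_output_name(template, input_path):
    """Replace {basename} with the input file name minus its extension (assumed meaning)."""
    return template.replace("{basename}", PurePosixPath(input_path).stem)


def run_batch(paths, process, max_concurrent=5):
    """Apply process(path) to every path using at most max_concurrent worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(process, paths))  # results keep the input order
```

For example, run_batch(selected, convert_one, max_concurrent=5) would mirror the maxConcurrent setting, where convert_one stands for whatever per-file operation is being applied.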

Workflow Automation

Create automated document processing workflows:

{
  "workflow": {
    "name": "document_standardization",
    "steps": [
      {
        "step": "validate",
        "action": "check_format_support",
        "continueOnFailure": false
      },
      {
        "step": "backup",
        "action": "copy_to_backup",
        "location": "/sandbox/backups"
      },
      {
        "step": "convert",
        "action": "convert_to_pdf",
        "options": {"quality": "high", "compression": true}
      },
      {
        "step": "extract",
        "action": "extract_metadata",
        "outputFile": "{basename}_metadata.json"
      },
      {
        "step": "organize",
        "action": "move_to_output",
        "directory": "/sandbox/standardized"
      }
    ],
    "errorHandling": {
      "onStepFailure": "continue_workflow",
      "logErrors": true,
      "notifyOnFailure": true
    }
  }
}
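
The interaction between a step's continueOnFailure flag and the workflow-level onStepFailure policy is not spelled out above. The sketch below shows one plausible reading, in which the per-step flag takes precedence when present. That precedence, the run_workflow function, and the action callables are all assumptions made for illustration.

```python
def run_workflow(steps, actions, on_step_failure="continue_workflow"):
    """Run steps in order and return a list of (step name, status) tuples.

    steps: dicts shaped like the workflow example ('step', 'action',
           optional 'continueOnFailure').
    actions: maps an action name to a callable; hypothetical stand-ins
             for the real operations.
    """
    results = []
    for step in steps:
        try:
            actions[step["action"]](step)
            results.append((step["step"], "ok"))
        except Exception as exc:
            results.append((step["step"], f"failed: {exc}"))
            # Per-step flag overrides the workflow-level policy (assumed precedence).
            keep_going = step.get(
                "continueOnFailure", on_step_failure == "continue_workflow"
            )
            if not keep_going:
                break
    return results
```

Under this reading, a failed validate step with continueOnFailure set to false stops the workflow, while a failure in any other step is logged and the remaining steps still run.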

Quality Control

Validation and Verification

Validate processed output against the source document and report any discrepancies:

{
  "qualityControl": {
    "validation": {
      "checkFileIntegrity": true,
      "verifyFormat": true,
      "validateContent": true,
      "compareFileSize": true
    },
    "verification": {
      "comparePageCount": true,
      "checkTextPreservation": true,
      "validateImageQuality": true,
      "verifyMetadata": true
    },
    "reporting": {
      "generateReport": true,
      "includeStatistics": true,
      "logDiscrepancies": true,
      "reportPath": "/sandbox/quality_reports"
    }
  }
}
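
A check such as checkTextPreservation can be approximated by comparing the text extracted from the source with the text extracted from the output. The sketch below uses Python's difflib for this. The whitespace normalization and the 0.98 threshold are illustrative assumptions, and the node's own verification may work differently.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace runs so layout differences do not count as content changes."""
    return " ".join(text.split())


def text_preservation_ratio(source_text, converted_text):
    """Similarity in [0, 1] between source and converted text after normalization."""
    matcher = SequenceMatcher(None, normalize(source_text), normalize(converted_text))
    return matcher.ratio()


def text_preserved(source_text, converted_text, threshold=0.98):
    """True if the similarity ratio meets the (hypothetical) acceptance threshold."""
    return text_preservation_ratio(source_text, converted_text) >= threshold
```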

Error Recovery

Handle processing errors gracefully:

{
  "errorRecovery": {
    "retryPolicy": {
      "maxAttempts": 3,
      "backoffStrategy": "exponential",
      "retryDelay": 1000
    },
    "fallbackOptions": {
      "useAlternativeConverter": true,
      "reduceQualitySettings": true,
      "skipProblematicPages": false
    },
    "logging": {
      "logLevel": "detailed",
      "includeStackTrace": true,
      "saveFailedFiles": true
    }
  }
}
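
To illustrate the retry policy, the sketch below computes exponential backoff delays and wraps an operation in a retry loop. It assumes that retryDelay is in milliseconds, that the delay doubles after each failed attempt, and that maxAttempts counts the first try; none of these details are stated in the configuration above.

```python
import time


def backoff_delays(max_attempts=3, retry_delay_ms=1000):
    """Delays (ms) to wait before each retry: the base delay doubles every time."""
    return [retry_delay_ms * 2 ** i for i in range(max_attempts - 1)]


def run_with_retry(operation, max_attempts=3, retry_delay_ms=1000, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts, retry_delay_ms)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt] / 1000)  # convert ms to seconds
```

With the settings shown above, a failing operation would be tried three times, waiting 1 second and then 2 seconds between attempts.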

Performance Optimization

Processing Optimization

{
  "performanceSettings": {
    "memory": {
      "maxMemoryUsage": "1GB",
      "enableGarbageCollection": true,
      "streamProcessing": true
    },
    "processing": {
      "enableCaching": true,
      "cacheSize": "256MB",
      "preloadFonts": true,
      "optimizeImages": true
    },
    "concurrency": {
      "maxParallelJobs": 8,
      "queueSize": 50,
      "threadPoolSize": 16
    }
  }
}

Error Handling

Common Processing Errors

| Error Type | Cause | Resolution |
| --- | --- | --- |
| UNSUPPORTED_FORMAT | File format not supported | Check supported format list |
| CORRUPTED_FILE | Source file is damaged | Verify file integrity |
| CONVERSION_FAILED | Format conversion error | Try alternative converter |
| MEMORY_EXCEEDED | File too large for processing | Increase memory or split file |
| PERMISSION_DENIED | Insufficient file permissions | Check file access rights |

Error Response Format

{
  "success": false,
  "error": {
    "type": "CONVERSION_FAILED",
    "message": "Failed to convert DOCX to PDF due to embedded object issues",
    "inputFile": "/sandbox/documents/complex_report.docx",
    "operation": "convert_to_pdf",
    "details": {
      "errorCode": "EMBEDDED_OBJECT_ERROR",
      "affectedPages": [5, 12, 18],
      "objectTypes": ["embedded_excel", "complex_chart"]
    },
    "suggestions": [
      "Try converting without embedded objects",
      "Use alternative output format",
      "Split document and convert sections separately",
      "Check embedded object compatibility"
    ]
  }
}
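
A caller might condense an error response like the one above into a single log line and look up the suggested resolution from the Common Processing Errors table. The following Python sketch does that; summarize_error is a hypothetical helper, and the field names are taken from the example response.

```python
# Resolutions copied from the "Common Processing Errors" table.
RESOLUTIONS = {
    "UNSUPPORTED_FORMAT": "Check supported format list",
    "CORRUPTED_FILE": "Verify file integrity",
    "CONVERSION_FAILED": "Try alternative converter",
    "MEMORY_EXCEEDED": "Increase memory or split file",
    "PERMISSION_DENIED": "Check file access rights",
}


def summarize_error(response):
    """Turn a response object into a short, log-friendly string."""
    if response.get("success", False):
        return "ok"
    error = response.get("error", {})
    error_type = error.get("type", "UNKNOWN")
    summary = f"{error_type}: {error.get('message', 'no message')}"
    pages = error.get("details", {}).get("affectedPages", [])
    if pages:
        summary += " (pages " + ", ".join(str(p) for p in pages) + ")"
    resolution = RESOLUTIONS.get(error_type)
    if resolution:
        summary += f". Resolution: {resolution}"
    return summary
```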

Usage Examples

Document Standardization Workflow

{
  "batchProcessing": {
    "inputDirectory": "/sandbox/mixed_documents",
    "outputDirectory": "/sandbox/standardized",
    "operations": [
      {
        "type": "convert",
        "targetFormat": "pdf",
        "options": {
          "quality": "high",
          "security": {"restrictCopying": false}
        }
      },
      {
        "type": "extract_metadata",
        "outputFile": "/sandbox/metadata/{basename}_info.json"
      }
    ],
    "filtering": {
      "includeFormats": [".doc", ".docx", ".rtf", ".odt"]
    }
  }
}

Report Assembly Pipeline

{
  "workflow": [
    {
      "step": "merge_sections",
      "inputFiles": [
        "/sandbox/sections/executive_summary.docx",
        "/sandbox/sections/financial_analysis.docx",
        "/sandbox/sections/recommendations.docx"
      ],
      "outputFile": "/sandbox/temp/merged_report.docx"
    },
    {
      "step": "convert_to_pdf",
      "inputFile": "/sandbox/temp/merged_report.docx",
      "outputFile": "/sandbox/reports/final_report.pdf",
      "options": {"includeBookmarks": true}
    },
    {
      "step": "extract_summary",
      "inputFile": "/sandbox/reports/final_report.pdf",
      "pages": "1-3",
      "outputFile": "/sandbox/summaries/executive_summary.pdf"
    }
  ]
}
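
The extract_summary step above selects pages with a string such as "1-3". A small parser like the one below can turn that string into page numbers. It assumes the "first-last" form shown in the example, plus a bare number for a single page; other forms such as comma-separated lists are not handled, and parse_page_range is a hypothetical helper.

```python
def parse_page_range(spec):
    """Parse a 'first-last' page spec such as '1-3' into a list of page numbers."""
    first_text, sep, last_text = spec.partition("-")
    first = int(first_text)
    last = int(last_text) if sep else first  # a bare '4' means a single page
    if first < 1 or last < first:
        raise ValueError(f"Invalid page range: {spec!r}")
    return list(range(first, last + 1))
```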

Content Analysis Pipeline

{
  "contentAnalysis": {
    "inputFile": "/sandbox/documents/research_paper.pdf",
    "extractions": [
      {
        "type": "text",
        "outputFile": "/sandbox/analysis/full_text.txt",
        "options": {"preserveStructure": true}
      },
      {
        "type": "images",
        "outputDirectory": "/sandbox/analysis/figures",
        "options": {"minSize": "300x200"}
      },
      {
        "type": "tables",
        "outputDirectory": "/sandbox/analysis/data",
        "format": "csv"
      },
      {
        "type": "metadata",
        "outputFile": "/sandbox/analysis/document_info.json"
      }
    ]
  }
}

Integration Patterns

With File System Tools

Organize processed documents in structured directories with automatic cleanup and archiving.

With Data Analysis Tools

Extract and analyze content from documents for insights and pattern detection.

With DOCX Generation Tools

Convert documents to standardized formats before using as templates or merging.

Best Practices

File Management

  • Maintain backup copies before processing
  • Use descriptive naming conventions
  • Organize output in logical directory structures
  • Implement cleanup procedures for temporary files

Performance

  • Process documents in appropriate batch sizes
  • Monitor memory usage during large operations
  • Use parallel processing for independent operations
  • Cache frequently accessed conversion settings

Quality Assurance

  • Validate input files before processing
  • Compare output with source for accuracy
  • Monitor conversion quality metrics
  • Implement error recovery procedures

Security

  • Validate file types before processing
  • Scan for malicious content
  • Respect document security settings
  • Maintain audit logs for processed files

Getting Started

  1. Identify document processing requirements and workflows
  2. Test conversion and processing operations with sample files
  3. Configure batch processing settings and error handling
  4. Implement quality validation and verification procedures
  5. Set up automated workflows for routine operations
  6. Monitor performance and optimize settings
