Document Processing
Convert, merge, split, and extract content from existing documents, either one at a time or through automated workflows and batch operations.
Features
- Format conversion between multiple document types
- Document merging and splitting operations
- Text extraction and content analysis
- Batch processing with parallel operations
- Metadata manipulation and preservation
Connector Options
The node uses a reusable connector configuration that applies to all processing operations:
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputDirectory | TEXT | No | Default directory for input documents |
| outputDirectory | TEXT | No | Default directory for processed documents |
| tempDirectory | TEXT | No | Temporary directory for processing operations |
| maxFileSize | INT | No | Maximum file size for processing (MB, default: 100) |
| parallelProcessing | BOOLEAN | No | Enable parallel processing (default: true) |
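The defaults in the table above can be mirrored locally to check files before they are submitted. This is a minimal sketch, not the node's own implementation: the `ConnectorOptions` class and its `accepts` method are hypothetical names, and it assumes that 1 MB means 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectorOptions:
    """Local mirror of the connector options table (hypothetical helper)."""
    inputDirectory: Optional[str] = None
    outputDirectory: Optional[str] = None
    tempDirectory: Optional[str] = None
    maxFileSize: int = 100            # MB, documented default
    parallelProcessing: bool = True   # documented default

    def accepts(self, size_bytes: int) -> bool:
        """Return True if a file of this size fits within maxFileSize.

        Assumption: 1 MB is treated as 1024 * 1024 bytes.
        """
        return size_bytes <= self.maxFileSize * 1024 * 1024


opts = ConnectorOptions(inputDirectory="/sandbox/documents")
print(opts.accepts(50 * 1024 * 1024))  # a 50 MB file against the 100 MB default
```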
Methods
convertDocument
Convert documents between different formats with customizable options.
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputFile | TEXT | Yes | Path to the source document |
| outputFile | TEXT | Yes | Path for the converted document |
| targetFormat | TEXT | Yes | Target format: pdf, docx, html, txt, rtf, odt |
| options | Object | No | Conversion-specific options and settings |
{
"inputFile": "/sandbox/documents/report.docx",
"outputFile": "/sandbox/converted/report.pdf",
"targetFormat": "pdf",
"options": {
"quality": "high",
"compression": true,
"preserveFormatting": true,
"includeBookmarks": true,
"security": {
"passwordProtection": false,
"restrictPrinting": false,
"restrictCopying": false
},
"metadata": {
"title": "Quarterly Report",
"author": "Finance Team",
"subject": "Q3 Performance"
}
}
}
Output:
- success (Boolean) - Conversion success status
- outputPath (String) - Path to the converted document
- originalFormat (String) - Source document format
- targetFormat (String) - Converted document format
- fileSize (Object) - Original and converted file sizes
- metadata (Object) - Document properties and conversion info
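A request like the one above can be assembled and checked before it is sent. The sketch below is illustrative only: `build_convert_request` is a hypothetical helper, and the set of accepted formats is copied from the targetFormat row of the parameter table.

```python
import json

# Target formats listed in the convertDocument parameter table.
TARGET_FORMATS = {"pdf", "docx", "html", "txt", "rtf", "odt"}


def build_convert_request(input_file, output_file, target_format, options=None):
    """Assemble a convertDocument payload, rejecting unknown target formats."""
    fmt = target_format.lower()
    if fmt not in TARGET_FORMATS:
        raise ValueError(f"unsupported targetFormat: {target_format!r}")
    request = {
        "inputFile": input_file,
        "outputFile": output_file,
        "targetFormat": fmt,
    }
    if options:
        request["options"] = options  # optional, conversion-specific settings
    return request


payload = build_convert_request(
    "/sandbox/documents/report.docx",
    "/sandbox/converted/report.pdf",
    "pdf",
    {"quality": "high"},
)
print(json.dumps(payload, indent=2))
```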
mergeDocuments
Combine multiple documents into a single document with options for formatting and organization.
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputFiles | Array | Yes | Array of document paths to merge |
| outputFile | TEXT | Yes | Path for the merged document |
| mergeOptions | Object | No | Options for merge behavior and formatting |
| insertOptions | Object | No | Page breaks, spacing, and insertion rules |
{
"inputFiles": [
"/sandbox/documents/chapter1.docx",
"/sandbox/documents/chapter2.docx",
"/sandbox/documents/appendix.docx"
],
"outputFile": "/sandbox/merged/complete_report.docx",
"mergeOptions": {
"preserveFormatting": true,
"includeHeaders": false,
"includeFooters": false,
"resetPageNumbers": true,
"maintainStyles": true
},
"insertOptions": {
"pageBreakBetween": true,
"spacingBetween": 12,
"insertTableOfContents": true,
"addSeparators": false
}
}
splitDocument
Split large documents into smaller files based on various criteria.
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputFile | TEXT | Yes | Path to the document to split |
| outputDirectory | TEXT | Yes | Directory for split document parts |
| splitCriteria | Object | Yes | Rules for how to split the document |
| naming | Object | No | Naming convention for split files |
{
"inputFile": "/sandbox/documents/large_manual.pdf",
"outputDirectory": "/sandbox/split_documents",
"splitCriteria": {
"method": "by_pages",
"pagesPerFile": 10,
"startPage": 1,
"endPage": 100
},
"naming": {
"prefix": "manual_section",
"suffix": "_v1",
"numbering": "sequential",
"includeRange": true
}
}
extractContent
Extract text, images, and metadata from documents for analysis or processing.
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputFile | TEXT | Yes | Path to the source document |
| extractionOptions | Object | Yes | Types of content to extract |
| outputOptions | Object | No | How to save extracted content |
{
"inputFile": "/sandbox/documents/presentation.pptx",
"extractionOptions": {
"text": {
"enabled": true,
"preserveFormatting": false,
"includeNotes": true,
"includeHidden": false
},
"images": {
"enabled": true,
"format": "png",
"quality": "high",
"minSize": "100x100"
},
"metadata": {
"enabled": true,
"includeProperties": true,
"includeStats": true
},
"tables": {
"enabled": true,
"format": "csv",
"preserveStructure": true
}
},
"outputOptions": {
"textFile": "/sandbox/extracted/presentation_text.txt",
"imageDirectory": "/sandbox/extracted/images",
"metadataFile": "/sandbox/extracted/metadata.json"
}
}
Format Conversion
Supported Conversions
| From/To | PDF | DOCX | HTML | TXT | RTF | ODT | PPTX | XLSX |
|---|---|---|---|---|---|---|---|---|
| PDF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| DOCX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| TXT | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| RTF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| ODT | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| PPTX | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
| XLSX | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
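The matrix can be encoded as a lookup so that unsupported pairs are rejected before a conversion is attempted. This sketch transcribes the table above, reading the target columns in order as PDF, DOCX, HTML, TXT, RTF, ODT, PPTX, XLSX; the names `MATRIX` and `can_convert` are hypothetical.

```python
# Target formats, in the column order of the table.
TARGETS = ["pdf", "docx", "html", "txt", "rtf", "odt", "pptx", "xlsx"]

# One row per source format; "1" marks a supported target, "0" an unsupported one.
MATRIX = {
    "pdf":  "11111100",
    "docx": "11111100",
    "html": "11111100",
    "txt":  "11111100",
    "rtf":  "11111100",
    "odt":  "11111100",
    "pptx": "11110010",
    "xlsx": "11110001",
}


def can_convert(source: str, target: str) -> bool:
    """Return True if the table marks source -> target as supported."""
    source, target = source.lower(), target.lower()
    if source not in MATRIX or target not in TARGETS:
        return False
    return MATRIX[source][TARGETS.index(target)] == "1"


print(can_convert("docx", "pdf"), can_convert("pptx", "odt"))
```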
Conversion Options
PDF Output Options
{
"pdfOptions": {
"quality": "high",
"compression": "medium",
"colorMode": "rgb",
"resolution": 300,
"preserveHyperlinks": true,
"includeBookmarks": true,
"security": {
"ownerPassword": "",
"userPassword": "",
"restrictPrinting": false,
"restrictCopying": false,
"restrictAnnotations": false
},
"accessibility": {
"taggedPDF": true,
"structureTree": true,
"altText": true
}
}
}
HTML Output Options
{
"htmlOptions": {
"cssHandling": "embedded",
"imageHandling": "embedded",
"responsive": true,
"compatibility": "html5",
"encoding": "utf-8",
"preserveLayout": true,
"includeMetadata": true,
"optimizeForWeb": true
}
}
Text Extraction Options
{
"textOptions": {
"encoding": "utf-8",
"preserveLineBreaks": true,
"preserveSpacing": false,
"includeHeaders": false,
"includeFooters": false,
"stripFormatting": true,
"paragraphSeparator": "\n\n",
"pageSeparator": "\n\n---PAGE BREAK---\n\n"
}
}
Document Merging
Merge Strategies
Sequential Merge
Append documents one after another:
{
"mergeStrategy": "sequential",
"options": {
"pageBreakBetween": true,
"preserveIndividualFormatting": true,
"resetPageNumbering": false,
"includeSourceNames": false
}
}
Chapter-based Merge
Organize documents as chapters with a generated table of contents:
{
"mergeStrategy": "chapters",
"options": {
"generateTOC": true,
"chapterTitles": ["Introduction", "Analysis", "Conclusions"],
"numberChapters": true,
"resetPageNumberingPerChapter": false,
"chapterBreakType": "page"
}
}
Template-based Merge
Merge into a structured template:
{
"mergeStrategy": "template",
"templateFile": "/sandbox/templates/report_template.docx",
"placeholderMapping": {
"{{SECTION_1}}": "/sandbox/content/introduction.docx",
"{{SECTION_2}}": "/sandbox/content/analysis.docx",
"{{APPENDIX}}": "/sandbox/content/appendix.pdf"
}
}
Document Splitting
Split Methods
Page-based Splitting
{
"splitMethod": "pages",
"configuration": {
"pagesPerFile": 5,
"startPage": 1,
"endPage": null,
"includeEmptyPages": false,
"maintainFormatting": true
}
}
Section-based Splitting
{
"splitMethod": "sections",
"configuration": {
"sectionMarkers": ["h1", "h2"],
"includeSubsections": true,
"minimumSectionSize": 2,
"preserveHierarchy": true
}
}
Size-based Splitting
{
"splitMethod": "file_size",
"configuration": {
"maxSizePerFile": "5MB",
"balanceFiles": true,
"avoidBreakingParagraphs": true
}
}
Custom Marker Splitting
{
"splitMethod": "markers",
"configuration": {
"markers": ["---SPLIT---", "<!--BREAK-->"],
"removeMarkers": true,
"minimumContentBetween": 100
}
}
Content Extraction
Text Extraction
Extract and process text content from documents:
{
"textExtraction": {
"formatting": {
"preserveBold": false,
"preserveItalics": false,
"preserveUnderline": false,
"convertToMarkdown": false
},
"structure": {
"preserveHeadings": true,
"preserveLists": true,
"preserveTables": true,
"preserveParagraphs": true
},
"filtering": {
"removeHeaders": true,
"removeFooters": true,
"removePageNumbers": true,
"removeWatermarks": true,
"minParagraphLength": 10
}
}
}
Image Extraction
Extract embedded images, with settings for output quality, size and duplicate filtering, and file naming:
{
"imageExtraction": {
"formats": ["png", "jpg", "gif"],
"quality": {
"outputFormat": "png",
"compression": "medium",
"resolution": 150
},
"filtering": {
"minWidth": 100,
"minHeight": 100,
"maxWidth": 4096,
"maxHeight": 4096,
"skipDuplicates": true
},
"naming": {
"prefix": "extracted_img",
"includePageNumber": true,
"includeIndex": true,
"sanitizeNames": true
}
}
}
Metadata Extraction
Extract comprehensive document metadata:
{
"metadataExtraction": {
"documentProperties": {
"title": true,
"author": true,
"subject": true,
"keywords": true,
"createdDate": true,
"modifiedDate": true,
"lastSavedBy": true
},
"statistics": {
"pageCount": true,
"wordCount": true,
"characterCount": true,
"paragraphCount": true,
"imageCount": true,
"tableCount": true
},
"structure": {
"headingCount": true,
"listCount": true,
"linkCount": true,
"footnoteCount": true
}
}
}
Batch Processing
Parallel Document Processing
Process multiple documents simultaneously:
{
"batchProcessing": {
"inputDirectory": "/sandbox/documents/batch",
"outputDirectory": "/sandbox/processed",
"operations": [
{
"type": "convert",
"targetFormat": "pdf",
"options": {"quality": "high"}
},
{
"type": "extract_text",
"outputFile": "{basename}_text.txt"
}
],
"parallelism": {
"maxConcurrent": 5,
"queueSize": 20,
"retryAttempts": 3
},
"filtering": {
"includeFormats": [".docx", ".doc", ".rtf"],
"excludePatterns": ["*_temp*", "*_backup*"],
"minFileSize": "1KB",
"maxFileSize": "100MB"
}
}
}
Workflow Automation
Create automated document processing workflows:
{
"workflow": {
"name": "document_standardization",
"steps": [
{
"step": "validate",
"action": "check_format_support",
"continueOnFailure": false
},
{
"step": "backup",
"action": "copy_to_backup",
"location": "/sandbox/backups"
},
{
"step": "convert",
"action": "convert_to_pdf",
"options": {"quality": "high", "compression": true}
},
{
"step": "extract",
"action": "extract_metadata",
"outputFile": "{basename}_metadata.json"
},
{
"step": "organize",
"action": "move_to_output",
"directory": "/sandbox/standardized"
}
],
"errorHandling": {
"onStepFailure": "continue_workflow",
"logErrors": true,
"notifyOnFailure": true
}
}
}
Quality Control
Validation and Verification
Validate inputs, verify outputs against the source, and report discrepancies:
{
"qualityControl": {
"validation": {
"checkFileIntegrity": true,
"verifyFormat": true,
"validateContent": true,
"compareFileSize": true
},
"verification": {
"comparePageCount": true,
"checkTextPreservation": true,
"validateImageQuality": true,
"verifyMetadata": true
},
"reporting": {
"generateReport": true,
"includeStatistics": true,
"logDiscrepancies": true,
"reportPath": "/sandbox/quality_reports"
}
}
}
Error Recovery
Handle processing errors gracefully:
{
"errorRecovery": {
"retryPolicy": {
"maxAttempts": 3,
"backoffStrategy": "exponential",
"retryDelay": 1000
},
"fallbackOptions": {
"useAlternativeConverter": true,
"reduceQualitySettings": true,
"skipProblematicPages": false
},
"logging": {
"logLevel": "detailed",
"includeStackTrace": true,
"saveFailedFiles": true
}
}
}
Performance Optimization
Processing Optimization
{
"performanceSettings": {
"memory": {
"maxMemoryUsage": "1GB",
"enableGarbageCollection": true,
"streamProcessing": true
},
"processing": {
"enableCaching": true,
"cacheSize": "256MB",
"preloadFonts": true,
"optimizeImages": true
},
"concurrency": {
"maxParallelJobs": 8,
"queueSize": 50,
"threadPoolSize": 16
}
}
}
Error Handling
Common Processing Errors
| Error Type | Cause | Resolution |
|---|---|---|
| UNSUPPORTED_FORMAT | File format not supported | Check supported format list |
| CORRUPTED_FILE | Source file is damaged | Verify file integrity |
| CONVERSION_FAILED | Format conversion error | Try alternative converter |
| MEMORY_EXCEEDED | File too large for processing | Increase memory or split file |
| PERMISSION_DENIED | Insufficient file permissions | Check file access rights |
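Client code can turn the table above into a lookup that attaches a resolution hint to a failed response. This is a sketch under stated assumptions: `resolution_for` is a hypothetical helper, and it expects the response to carry a `success` flag and an `error.type` field, as in the Error Response Format section.

```python
# Resolution hints copied from the error table.
RESOLUTIONS = {
    "UNSUPPORTED_FORMAT": "Check supported format list",
    "CORRUPTED_FILE": "Verify file integrity",
    "CONVERSION_FAILED": "Try alternative converter",
    "MEMORY_EXCEEDED": "Increase memory or split file",
    "PERMISSION_DENIED": "Check file access rights",
}


def resolution_for(response: dict):
    """Return a resolution hint for a failed response, or None on success."""
    if response.get("success", False):
        return None
    error_type = response.get("error", {}).get("type")
    # Unknown types fall back to a generic hint (an assumption of this sketch).
    return RESOLUTIONS.get(error_type, "Unknown error type; inspect the error message")


failed = {"success": False, "error": {"type": "CONVERSION_FAILED"}}
print(resolution_for(failed))
```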
Error Response Format
{
"success": false,
"error": {
"type": "CONVERSION_FAILED",
"message": "Failed to convert DOCX to PDF due to embedded object issues",
"inputFile": "/sandbox/documents/complex_report.docx",
"operation": "convert_to_pdf",
"details": {
"errorCode": "EMBEDDED_OBJECT_ERROR",
"affectedPages": [5, 12, 18],
"objectTypes": ["embedded_excel", "complex_chart"]
},
"suggestions": [
"Try converting without embedded objects",
"Use alternative output format",
"Split document and convert sections separately",
"Check embedded object compatibility"
]
}
}
Usage Examples
Document Standardization Workflow
{
"batchProcessing": {
"inputDirectory": "/sandbox/mixed_documents",
"outputDirectory": "/sandbox/standardized",
"operations": [
{
"type": "convert",
"targetFormat": "pdf",
"options": {
"quality": "high",
"security": {"restrictCopying": false}
}
},
{
"type": "extract_metadata",
"outputFile": "/sandbox/metadata/{basename}_info.json"
}
],
"filtering": {
"includeFormats": [".doc", ".docx", ".rtf", ".odt"]
}
}
}
Report Assembly Pipeline
{
"workflow": [
{
"step": "merge_sections",
"inputFiles": [
"/sandbox/sections/executive_summary.docx",
"/sandbox/sections/financial_analysis.docx",
"/sandbox/sections/recommendations.docx"
],
"outputFile": "/sandbox/temp/merged_report.docx"
},
{
"step": "convert_to_pdf",
"inputFile": "/sandbox/temp/merged_report.docx",
"outputFile": "/sandbox/reports/final_report.pdf",
"options": {"includeBookmarks": true}
},
{
"step": "extract_summary",
"inputFile": "/sandbox/reports/final_report.pdf",
"pages": "1-3",
"outputFile": "/sandbox/summaries/executive_summary.pdf"
}
]
}
Content Analysis Pipeline
{
"contentAnalysis": {
"inputFile": "/sandbox/documents/research_paper.pdf",
"extractions": [
{
"type": "text",
"outputFile": "/sandbox/analysis/full_text.txt",
"options": {"preserveStructure": true}
},
{
"type": "images",
"outputDirectory": "/sandbox/analysis/figures",
"options": {"minSize": "300x200"}
},
{
"type": "tables",
"outputDirectory": "/sandbox/analysis/data",
"format": "csv"
},
{
"type": "metadata",
"outputFile": "/sandbox/analysis/document_info.json"
}
]
}
}
Integration Patterns
With File System Tools
Organize processed documents in structured directories with automatic cleanup and archiving.
With Data Analysis Tools
Extract and analyze content from documents for insights and pattern detection.
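As one possible starting point for such analysis, extracted text can be split back into pages and summarized. The sketch assumes the text was written with the `pageSeparator` value from the Text Extraction Options example; `page_word_counts` and `top_terms` are hypothetical helpers, not part of the node.

```python
import re
from collections import Counter

# Separator taken from the textOptions example ("pageSeparator").
PAGE_SEPARATOR = "\n\n---PAGE BREAK---\n\n"


def page_word_counts(text: str, separator: str = PAGE_SEPARATOR) -> list:
    """Return the number of words on each page of the extracted text."""
    return [len(re.findall(r"\w+", page)) for page in text.split(separator)]


def top_terms(text: str, n: int = 3, separator: str = PAGE_SEPARATOR) -> list:
    """Return the n most frequent lowercase words, ignoring page separators."""
    words = re.findall(r"\w+", text.replace(separator, " ").lower())
    return Counter(words).most_common(n)


sample = "Revenue grew in Q3." + PAGE_SEPARATOR + "Costs fell. Revenue outlook is stable."
print(page_word_counts(sample))
print(top_terms(sample, n=1))
```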
With DOCX Generation Tools
Convert documents to standardized formats before using them as templates or merging them.
Best Practices
File Management
- Maintain backup copies before processing
- Use descriptive naming conventions
- Organize output in logical directory structures
- Implement cleanup procedures for temporary files
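The backup and cleanup practices above might look like the following on the caller's side. This is a sketch using only the Python standard library; `process_with_backup` and its `process` callback are hypothetical and stand in for whatever operation is being run.

```python
import shutil
import tempfile
from pathlib import Path


def process_with_backup(source, backup_dir, process):
    """Back up source, then run process(source, workdir) in a temporary directory.

    The temporary directory is removed when processing ends, even on error.
    Returns (backup_path, result_of_process).
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / source.name
    shutil.copy2(source, backup)  # copy contents and timestamps before processing
    with tempfile.TemporaryDirectory() as workdir:
        result = process(source, Path(workdir))
    return backup, result


# Demo on throwaway files; the "processing" step only measures the file size.
root = Path(tempfile.mkdtemp())
sample = root / "report.txt"
sample.write_text("draft")
backup, size = process_with_backup(
    sample, root / "backups", lambda src, workdir: src.stat().st_size
)
print(backup.name, size)
shutil.rmtree(root)  # remove the demo files
```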
Performance
- Process documents in appropriate batch sizes
- Monitor memory usage during large operations
- Use parallel processing for independent operations
- Cache frequently accessed conversion settings
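When a caller drives independent operations itself, a bounded worker pool is one way to apply the parallel-processing advice. The sketch below uses Python's `concurrent.futures`; `process_batch` and the `worker` callback are hypothetical, and the default of 5 mirrors `maxConcurrent` in the batch example rather than any documented limit.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(paths, worker, max_concurrent=5):
    """Apply worker to every path with at most max_concurrent running at once.

    Returns a dict mapping each path to its result, or to the exception raised,
    so that one failing document does not stop the rest of the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {path: pool.submit(worker, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep the failure next to its input
                results[path] = exc
    return results


# Stand-in worker: derives an output name instead of converting anything.
results = process_batch(["a.docx", "b.rtf"], lambda p: p.rsplit(".", 1)[0] + ".pdf")
print(results)
```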
Quality Assurance
- Validate input files before processing
- Compare output with source for accuracy
- Monitor conversion quality metrics
- Implement error recovery procedures
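An error recovery procedure on the caller's side could follow the retry settings shown in the Error Recovery example (`maxAttempts: 3`, `retryDelay: 1000`, exponential backoff). In this sketch, `retry` is a hypothetical helper, and it assumes that "exponential" means the delay doubles after each failed attempt and that `retryDelay` is in milliseconds.

```python
import time


def retry(operation, max_attempts=3, retry_delay_ms=1000, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached.

    Assumption: the delay doubles after each failure (1x, 2x, 4x, ...).
    The last failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(retry_delay_ms * 2 ** (attempt - 1) / 1000)


# Demo: fails twice, then succeeds; delays are recorded instead of slept.
attempts = []
delays = []


def flaky_convert():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("CONVERSION_FAILED")
    return "ok"


outcome = retry(flaky_convert, sleep=delays.append)
print(outcome, delays)
```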
Security
- Validate file types before processing
- Scan for malicious content
- Respect document security settings
- Maintain audit logs for processed files
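File type validation can start with a check that a file's leading bytes agree with its extension. The sketch relies on well-known signatures (PDF files begin with `%PDF-`, RTF files with `{\rtf`, and DOCX, ODT, PPTX, and XLSX files are ZIP containers beginning with `PK\x03\x04`). The helper names are hypothetical, and this check does not replace malware scanning.

```python
from typing import Optional

SIGNATURES = {
    "pdf": b"%PDF-",
    "rtf": b"{\\rtf",
    "zip": b"PK\x03\x04",  # container format behind docx, odt, pptx, xlsx
}
ZIP_BASED = {"docx", "odt", "pptx", "xlsx"}


def sniff(header: bytes) -> Optional[str]:
    """Return the signature name matching the first bytes of a file, if any."""
    for kind, magic in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return None


def extension_matches_content(filename: str, header: bytes) -> Optional[bool]:
    """Compare a file's extension with its leading bytes.

    Returns None for extensions without a fixed signature (for example txt),
    because this check cannot decide those cases.
    """
    ext = filename.rsplit(".", 1)[-1].lower()
    expected = "zip" if ext in ZIP_BASED else ext
    if expected not in SIGNATURES:
        return None
    return sniff(header) == expected


print(extension_matches_content("report.pdf", b"%PDF-1.7\n"))
print(extension_matches_content("report.docx", b"%PDF-1.7\n"))
```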
Getting Started
- Identify document processing requirements and workflows
- Test conversion and processing operations with sample files
- Configure batch processing settings and error handling
- Implement quality validation and verification procedures
- Set up automated workflows for routine operations
- Monitor performance and optimize settings
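Before configuring batch settings, the filtering rules and output naming from the Parallel Document Processing example can be tried out locally. This is a sketch of one plausible reading of those fields: `select_files` and `expand_output` are hypothetical helpers, `excludePatterns` are treated as shell-style globs, and `{basename}` is taken to mean the file name without its extension.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(names, include_formats, exclude_patterns):
    """Keep names whose extension is included and that match no exclude pattern."""
    selected = []
    for name in names:
        if PurePosixPath(name).suffix.lower() not in include_formats:
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude_patterns):
            continue
        selected.append(name)
    return selected


def expand_output(template: str, name: str) -> str:
    """Replace {basename} with the file name stripped of its extension."""
    return template.replace("{basename}", PurePosixPath(name).stem)


names = ["report.docx", "report_temp.docx", "notes.rtf", "image.png", "old_backup.doc"]
chosen = select_files(names, [".docx", ".doc", ".rtf"], ["*_temp*", "*_backup*"])
print(chosen)
print([expand_output("{basename}_text.txt", name) for name in chosen])
```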