Web Scraping

Extract structured data from websites with respect for robots.txt, rate limiting, and comprehensive content processing.

Extract structured data from websites using advanced scraping techniques with built-in compliance features, dynamic content handling, and comprehensive data validation.

Features

CSS selector and XPath-based data extraction
Dynamic content and JavaScript rendering support
Robots.txt compliance and rate limiting
Form interaction and navigation capabilities
Content validation and sanitization

Connector Options

The node uses reusable connector configuration that applies to all scraping operations:

Parameter	Type	Required	Description
`userAgent`	TEXT	No	Custom user agent string for requests
`timeout`	INT	No	Request timeout in milliseconds (default: 30000)
`retryAttempts`	INT	No	Number of retry attempts for failed requests
`respectRobots`	BOOLEAN	No	Check and respect robots.txt (default: true)
`rateLimitDelay`	INT	No	Delay between requests in milliseconds (default: 1000)

Methods

webScraping

Extract structured data from web pages using selectors and extraction rules.

Parameter	Type	Required	Description
`url`	TEXT	Yes	URL of the webpage to scrape
`extractors`	Array	Yes	Data extraction rules and selectors
`options`	Object	No	Scraping behavior and processing options
`dynamic`	Object	No	Dynamic content handling configuration

{
  "url": "https://example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-card",
      "fields": {
        "title": "h3.product-title",
        "price": ".price-value",
        "description": ".product-description",
        "image": "img @src",
        "url": "a @href"
      }
    }
  ],
  "options": {
    "waitForLoad": true,
    "respectRobots": true,
    "validateContent": true
  },
  "dynamic": {
    "enabled": true,
    "waitFor": "networkidle",
    "timeout": 10000
  }
}

Output:

success (Boolean) - Extraction success status
data (Object) - Extracted data organized by extractor names
metadata (Object) - Page metadata and extraction statistics
errors (Array) - Any errors encountered during extraction
performance (Object) - Timing and resource usage information

batchScraping

Scrape multiple URLs with shared configuration and processing rules.

Parameter	Type	Required	Description
`urls`	Array	Yes	List of URLs to scrape
`extractors`	Array	Yes	Shared extraction rules for all URLs
`batchOptions`	Object	No	Batch processing configuration
`parallelism`	INT	No	Number of concurrent scraping operations

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "extractors": [
    {
      "name": "content",
      "selector": "article",
      "fields": {
        "title": "h1",
        "content": ".article-body",
        "author": ".author-name"
      }
    }
  ],
  "batchOptions": {
    "delayBetweenRequests": 2000,
    "continueOnError": true,
    "saveIndividualResults": true
  },
  "parallelism": 3
}

formSubmission

Interact with web forms to access gated content or perform searches.

Parameter	Type	Required	Description
`url`	TEXT	Yes	URL containing the form
`formSelector`	TEXT	Yes	CSS selector for the target form
`formData`	Object	Yes	Data to submit in the form
`submitAction`	TEXT	No	How to submit: click_button, submit_form, enter_key
`waitAfterSubmit`	INT	No	Time to wait after submission (ms)

{
  "url": "https://example.com/search",
  "formSelector": "#search-form",
  "formData": {
    "query": "machine learning",
    "category": "technology",
    "date_range": "past_year"
  },
  "submitAction": "click_button",
  "waitAfterSubmit": 3000
}

Data Extraction

Selector Types

CSS Selectors:

Element selection: div.product, #main-content, article h2
Attribute extraction: img @src, a @href, meta[name="description"] @content
Text content: .title, .description
Multiple elements: .item (returns array)

XPath Expressions:

Complex paths: //div[@class="product"]//span[contains(@class, "price")]
Text nodes: //h1/text(), //div[@id="content"]//text()
Following siblings: //label[text()="Price"]/following-sibling::span
Attribute values: //img/@src, //a/@href

Field Types and Processing

Field Type	Description	Example
`text`	Extract text content	`"title": "h1"`
`attribute`	Get attribute value	`"url": "a @href"`
`html`	Get HTML content	`"content": ".article @html"`
`number`	Parse as number	`"price": ".price @number"`
`date`	Parse as date	`"published": ".date @date"`
`array`	Multiple elements	`"tags": ".tag @array"`

Advanced Extraction Patterns

{
  "extractors": [
    {
      "name": "product_details",
      "selector": ".product",
      "fields": {
        "basic_info": {
          "title": "h2.title",
          "price": ".price @number",
          "availability": ".stock-status"
        },
        "specifications": {
          "selector": ".specs-table tr",
          "fields": {
            "property": "td:first-child",
            "value": "td:last-child"
          }
        },
        "reviews": {
          "selector": ".review",
          "fields": {
            "rating": ".rating @attribute:data-rating @number",
            "comment": ".review-text",
            "author": ".reviewer-name",
            "date": ".review-date @date"
          }
        }
      }
    }
  ]
}

Dynamic Content Handling

JavaScript Rendering

Configure how the scraper handles JavaScript-rendered content:

{
  "dynamic": {
    "enabled": true,
    "engine": "chromium",
    "waitFor": "networkidle",
    "timeout": 15000,
    "viewportSize": {
      "width": 1920,
      "height": 1080
    },
    "actions": [
      {
        "type": "click",
        "selector": "#load-more-button"
      },
      {
        "type": "scroll",
        "direction": "down",
        "distance": "50vh"
      },
      {
        "type": "wait",
        "duration": 2000
      }
    ]
  }
}

Wait Conditions

Wait Type	Description	Usage
`networkidle`	Wait for network requests to finish	Dynamic content loading
`selector`	Wait for specific element to appear	`"waitFor": {"type": "selector", "value": ".data-loaded"}`
`function`	Wait for JavaScript condition	`"waitFor": {"type": "function", "value": "() => window.dataReady"}`
`timeout`	Fixed time delay	`"waitFor": {"type": "timeout", "value": 5000}`

Interaction Capabilities

{
  "interactions": [
    {
      "type": "click",
      "selector": ".cookie-accept-button",
      "optional": true
    },
    {
      "type": "scroll",
      "direction": "bottom",
      "smooth": true
    },
    {
      "type": "type",
      "selector": "#search-input",
      "text": "search query"
    },
    {
      "type": "select",
      "selector": "#dropdown",
      "value": "option-value"
    },
    {
      "type": "wait",
      "condition": "networkidle"
    }
  ]
}

Compliance and Ethics

Robots.txt Compliance

Automatic robots.txt checking and compliance:

{
  "compliance": {
    "checkRobotsTxt": true,
    "respectCrawlDelay": true,
    "userAgent": "Axellero Web Scraper 1.0",
    "contactInfo": "admin@example.com",
    "honorNoIndex": true,
    "respectMetaRobots": true
  }
}

Rate Limiting

{
  "rateLimiting": {
    "requestsPerMinute": 30,
    "delayBetweenRequests": 2000,
    "randomizeDelay": true,
    "respectRetryAfter": true,
    "maxConcurrentRequests": 3,
    "backoffStrategy": "exponential"
  }
}

Content Filtering

{
  "contentFiltering": {
    "blockPersonalData": true,
    "skipAdultContent": true,
    "copyrightDetection": true,
    "malwareCheck": true,
    "maxContentSize": "10MB",
    "allowedContentTypes": [
      "text/html",
      "application/xhtml+xml"
    ]
  }
}

Error Handling

Retry Logic

{
  "errorHandling": {
    "retryAttempts": 3,
    "retryDelay": 1000,
    "retryOn": [
      "timeout",
      "network_error", 
      "rate_limit"
    ],
    "backoffMultiplier": 2,
    "maxRetryDelay": 30000,
    "continueOnError": false
  }
}

Error Response Format

{
  "success": false,
  "error": {
    "type": "EXTRACTION_FAILED",
    "message": "No elements found matching selector '.product-card'",
    "url": "https://example.com/products",
    "selector": ".product-card",
    "suggestions": [
      "Check if the page structure has changed",
      "Verify the CSS selector is correct",
      "Enable dynamic content handling",
      "Check if the page requires user interaction"
    ]
  }
}

Performance Optimization

Caching Strategy

{
  "caching": {
    "enabled": true,
    "ttl": 3600,
    "cacheKey": ["url", "extractors"],
    "storage": "memory",
    "compression": true,
    "invalidateOn": ["page_change", "content_update"]
  }
}

Resource Management

{
  "resources": {
    "maxMemoryUsage": "512MB",
    "maxExecutionTime": 60000,
    "downloadTimeout": 30000,
    "maxRedirects": 5,
    "resourceTypes": {
      "block": ["image", "font", "media"],
      "allow": ["document", "stylesheet", "script", "xhr"]
    }
  }
}

Usage Examples

E-commerce Product Scraping

{
  "url": "https://shop.example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-item",
      "fields": {
        "name": ".product-name",
        "price": ".price-current @number",
        "original_price": ".price-original @number",
        "discount": ".discount-percent",
        "image": ".product-image img @src",
        "rating": ".rating @attribute:data-rating @number",
        "reviews_count": ".reviews-count @number",
        "availability": ".stock-status",
        "product_url": "a @href"
      }
    }
  ],
  "dynamic": {
    "enabled": true,
    "actions": [
      {
        "type": "scroll",
        "direction": "bottom"
      }
    ]
  }
}

News Article Collection

{
  "url": "https://news.example.com/technology",
  "extractors": [
    {
      "name": "articles",
      "selector": "article",
      "fields": {
        "headline": "h2.headline a",
        "summary": ".article-summary",
        "author": ".byline .author-name",
        "publish_date": ".publish-date @date",
        "category": ".category-tag",
        "article_url": "h2.headline a @href",
        "image": ".article-image img @src",
        "read_time": ".read-time @number"
      }
    },
    {
      "name": "pagination",
      "selector": ".pagination",
      "fields": {
        "current_page": ".current-page @number",
        "total_pages": ".total-pages @number",
        "next_url": ".next-page @href"
      }
    }
  ]
}

Table Data Extraction

{
  "url": "https://data.example.com/financial-reports",
  "extractors": [
    {
      "name": "financial_data",
      "selector": "table.data-table tbody tr",
      "fields": {
        "company": "td:nth-child(1)",
        "revenue": "td:nth-child(2) @number",
        "profit": "td:nth-child(3) @number", 
        "growth_rate": "td:nth-child(4) @number",
        "market_cap": "td:nth-child(5) @number",
        "report_date": "td:nth-child(6) @date"
      }
    }
  ],
  "options": {
    "validateContent": true,
    "skipEmptyRows": true
  }
}