
Web Scraping

Extract structured data from websites with built-in compliance features (robots.txt checks and rate limiting), dynamic content handling, and data validation.

Features

  • CSS selector and XPath-based data extraction
  • Dynamic content and JavaScript rendering support
  • Robots.txt compliance and rate limiting
  • Form interaction and navigation capabilities
  • Content validation and sanitization

Connector Options

The node uses reusable connector configuration that applies to all scraping operations:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| userAgent | TEXT | No | Custom user agent string for requests |
| timeout | INT | No | Request timeout in milliseconds (default: 30000) |
| retryAttempts | INT | No | Number of retry attempts for failed requests |
| respectRobots | BOOLEAN | No | Check and respect robots.txt (default: true) |
| rateLimitDelay | INT | No | Delay between requests in milliseconds (default: 1000) |
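
As a minimal sketch of how the documented defaults might combine with user-supplied values, the Python below merges a partial connector configuration onto the defaults listed above. The helper `resolve_connector_options` is hypothetical and not part of the node's API. No default is documented for `userAgent` or `retryAttempts`, so both are left as `None` here.

```python
from typing import Any

# Defaults taken from the connector options above; undocumented ones stay None.
CONNECTOR_DEFAULTS: dict[str, Any] = {
    "userAgent": None,
    "timeout": 30000,        # milliseconds
    "retryAttempts": None,   # no default is documented
    "respectRobots": True,
    "rateLimitDelay": 1000,  # milliseconds
}


def resolve_connector_options(overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge user overrides onto the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(CONNECTOR_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown connector options: {sorted(unknown)}")
    return {**CONNECTOR_DEFAULTS, **overrides}


options = resolve_connector_options({"timeout": 10000, "userAgent": "my-bot/1.0"})
print(options)
```

Rejecting unknown keys early is a design choice of this sketch; it catches typos such as `rateLimitDelayMs` before a request is sent.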

Methods

webScraping

Extract structured data from web pages using selectors and extraction rules.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | TEXT | Yes | URL of the webpage to scrape |
| extractors | Array | Yes | Data extraction rules and selectors |
| options | Object | No | Scraping behavior and processing options |
| dynamic | Object | No | Dynamic content handling configuration |

{
  "url": "https://example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-card",
      "fields": {
        "title": "h3.product-title",
        "price": ".price-value",
        "description": ".product-description",
        "image": "img @src",
        "url": "a @href"
      }
    }
  ],
  "options": {
    "waitForLoad": true,
    "respectRobots": true,
    "validateContent": true
  },
  "dynamic": {
    "enabled": true,
    "waitFor": "networkidle",
    "timeout": 10000
  }
}

Output:

  • success (Boolean) - Extraction success status
  • data (Object) - Extracted data organized by extractor names
  • metadata (Object) - Page metadata and extraction statistics
  • errors (Array) - Any errors encountered during extraction
  • performance (Object) - Timing and resource usage information
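
The output fields above could be consumed along the following lines. This is a sketch only: the helper `summarize_result` is hypothetical, the sample values are made up, and it assumes that each entry in `data` is a list of records keyed by extractor name.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> str:
    """Return a one-line summary of a webScraping result (hypothetical helper)."""
    if not result.get("success", False):
        return f"failed with {len(result.get('errors', []))} error(s)"
    # Assumption: each extractor name maps to a list of extracted records.
    counts = {
        name: len(items) if isinstance(items, list) else 1
        for name, items in result.get("data", {}).items()
    }
    return ", ".join(f"{name}: {n} item(s)" for name, n in counts.items())


# Illustrative result shaped like the documented output fields.
sample = {
    "success": True,
    "data": {"products": [{"title": "Widget"}, {"title": "Gadget"}]},
    "metadata": {},
    "errors": [],
    "performance": {},
}
print(summarize_result(sample))
```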

batchScraping

Scrape multiple URLs with shared configuration and processing rules.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| urls | Array | Yes | List of URLs to scrape |
| extractors | Array | Yes | Shared extraction rules for all URLs |
| batchOptions | Object | No | Batch processing configuration |
| parallelism | INT | No | Number of concurrent scraping operations |

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "extractors": [
    {
      "name": "content",
      "selector": "article",
      "fields": {
        "title": "h1",
        "content": ".article-body",
        "author": ".author-name"
      }
    }
  ],
  "batchOptions": {
    "delayBetweenRequests": 2000,
    "continueOnError": true,
    "saveIndividualResults": true
  },
  "parallelism": 3
}
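
A batch request like the one above can also be assembled programmatically. The sketch below is one possible approach, not the node's own tooling: `build_batch_request` is a hypothetical helper that validates each URL with Python's standard `urllib.parse` and drops duplicates before building the payload.

```python
import json
from urllib.parse import urlparse


def build_batch_request(urls, extractors, parallelism=3, delay_ms=2000):
    """Assemble a batchScraping payload (hypothetical helper)."""
    seen, unique = set(), []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in seen:  # drop duplicates, keep first-seen order
            seen.add(url)
            unique.append(url)
    return {
        "urls": unique,
        "extractors": extractors,
        "batchOptions": {"delayBetweenRequests": delay_ms, "continueOnError": True},
        "parallelism": parallelism,
    }


payload = build_batch_request(
    [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page1",
    ],
    [{"name": "content", "selector": "article", "fields": {"title": "h1"}}],
)
print(json.dumps(payload, indent=2))
```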

formSubmission

Interact with web forms to access gated content or perform searches.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | TEXT | Yes | URL containing the form |
| formSelector | TEXT | Yes | CSS selector for the target form |
| formData | Object | Yes | Data to submit in the form |
| submitAction | TEXT | No | How to submit: click_button, submit_form, enter_key |
| waitAfterSubmit | INT | No | Time to wait after submission (ms) |

{
  "url": "https://example.com/search",
  "formSelector": "#search-form",
  "formData": {
    "query": "machine learning",
    "category": "technology",
    "date_range": "past_year"
  },
  "submitAction": "click_button",
  "waitAfterSubmit": 3000
}

Data Extraction

Selector Types

CSS Selectors:

  • Element selection: div.product, #main-content, article h2
  • Attribute extraction: img @src, a @href, meta[name="description"] @content
  • Text content: .title, .description
  • Multiple elements: .item (returns array)

XPath Expressions:

  • Complex paths: //div[@class="product"]//span[contains(@class, "price")]
  • Text nodes: //h1/text(), //div[@id="content"]//text()
  • Following siblings: //label[text()="Price"]/following-sibling::span
  • Attribute values: //img/@src, //a/@href

Field Types and Processing

| Field Type | Description | Example |
| --- | --- | --- |
| text | Extract text content | "title": "h1" |
| attribute | Get attribute value | "url": "a @href" |
| html | Get HTML content | "content": ".article @html" |
| number | Parse as number | "price": ".price @number" |
| date | Parse as date | "published": ".date @date" |
| array | Multiple elements | "tags": ".tag @array" |
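
To make the field syntax concrete, the sketch below splits a field string into its selector, target attribute, and type conversions. The grammar is inferred from the examples in this document (a selector followed by zero or more ` @modifier` suffixes); the actual parser may behave differently, and `parse_field_spec` is a hypothetical name.

```python
# Modifiers that the examples in this document use as type conversions.
TYPE_MODIFIERS = {"html", "number", "date", "array"}


def parse_field_spec(spec: str) -> dict:
    """Split 'selector @mod1 @mod2' into its parts (assumed grammar)."""
    selector, *modifiers = spec.split(" @")
    attribute, conversions = None, []
    for mod in (m.strip() for m in modifiers):
        if mod in TYPE_MODIFIERS:
            conversions.append(mod)
        elif mod.startswith("attribute:"):
            attribute = mod[len("attribute:"):]  # e.g. "attribute:data-rating"
        else:
            attribute = mod  # shorthand such as "src" or "href"
    return {
        "selector": selector.strip(),
        "attribute": attribute,
        "conversions": conversions,
    }


print(parse_field_spec("img @src"))
print(parse_field_spec(".rating @attribute:data-rating @number"))
```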

Advanced Extraction Patterns

{
  "extractors": [
    {
      "name": "product_details",
      "selector": ".product",
      "fields": {
        "basic_info": {
          "title": "h2.title",
          "price": ".price @number",
          "availability": ".stock-status"
        },
        "specifications": {
          "selector": ".specs-table tr",
          "fields": {
            "property": "td:first-child",
            "value": "td:last-child"
          }
        },
        "reviews": {
          "selector": ".review",
          "fields": {
            "rating": ".rating @attribute:data-rating @number",
            "comment": ".review-text",
            "author": ".reviewer-name",
            "date": ".review-date @date"
          }
        }
      }
    }
  ]
}
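
Nested definitions like the one above can be hard to review by eye. The sketch below flattens a `fields` object into dotted paths so every leaf selector is visible at once. It assumes that an object with both `selector` and `fields` keys is a repeated sub-extractor and that any other object is a plain grouping; `leaf_fields` is a hypothetical helper.

```python
def leaf_fields(fields: dict, prefix: str = ""):
    """Yield (path, selector) pairs for every leaf in a nested fields object."""
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        if isinstance(spec, str):
            yield path, spec
        elif "selector" in spec and "fields" in spec:
            # Assumed: a repeated sub-extractor, marked with [] in the path.
            yield from leaf_fields(spec["fields"], prefix=f"{path}[].")
        else:
            # Assumed: a plain grouping of related fields.
            yield from leaf_fields(spec, prefix=f"{path}.")


fields = {
    "basic_info": {"title": "h2.title", "price": ".price @number"},
    "specifications": {
        "selector": ".specs-table tr",
        "fields": {"property": "td:first-child", "value": "td:last-child"},
    },
}
for path, selector in leaf_fields(fields):
    print(f"{path}: {selector}")
```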

Dynamic Content Handling

JavaScript Rendering

Configure how the scraper handles JavaScript-rendered content:

{
  "dynamic": {
    "enabled": true,
    "engine": "chromium",
    "waitFor": "networkidle",
    "timeout": 15000,
    "viewportSize": {
      "width": 1920,
      "height": 1080
    },
    "actions": [
      {
        "type": "click",
        "selector": "#load-more-button"
      },
      {
        "type": "scroll",
        "direction": "down",
        "distance": "50vh"
      },
      {
        "type": "wait",
        "duration": 2000
      }
    ]
  }
}

Wait Conditions

| Wait Type | Description | Usage |
| --- | --- | --- |
| networkidle | Wait for network requests to finish | Dynamic content loading |
| selector | Wait for specific element to appear | "waitFor": {"type": "selector", "value": ".data-loaded"} |
| function | Wait for JavaScript condition | "waitFor": {"type": "function", "value": "() => window.dataReady"} |
| timeout | Fixed time delay | "waitFor": {"type": "timeout", "value": 5000} |
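
The examples in this document write `waitFor` either as a plain string ("networkidle") or as an object with `type` and `value`. A client could normalize both forms before sending a request, roughly as sketched below. The validation rules are assumptions drawn from the usage column above, and `normalize_wait_for` is a hypothetical helper.

```python
# Expected Python type of "value" for each wait type (assumed from the examples).
WAIT_VALUE_TYPES = {
    "networkidle": type(None),
    "selector": str,
    "function": str,
    "timeout": int,
}


def normalize_wait_for(wait_for):
    """Return waitFor as a {'type', 'value'} dict, validating the value type."""
    if isinstance(wait_for, str):
        wait_for = {"type": wait_for}
    kind = wait_for.get("type")
    if kind not in WAIT_VALUE_TYPES:
        raise ValueError(f"unknown wait type: {kind!r}")
    value = wait_for.get("value")
    if not isinstance(value, WAIT_VALUE_TYPES[kind]):
        raise TypeError(f"wait type {kind!r} got value {value!r}")
    return {"type": kind, "value": value}


print(normalize_wait_for("networkidle"))
print(normalize_wait_for({"type": "timeout", "value": 5000}))
```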

Interaction Capabilities

{
  "interactions": [
    {
      "type": "click",
      "selector": ".cookie-accept-button",
      "optional": true
    },
    {
      "type": "scroll",
      "direction": "bottom",
      "smooth": true
    },
    {
      "type": "type",
      "selector": "#search-input",
      "text": "search query"
    },
    {
      "type": "select",
      "selector": "#dropdown",
      "value": "option-value"
    },
    {
      "type": "wait",
      "condition": "networkidle"
    }
  ]
}

Compliance and Ethics

Robots.txt Compliance

The scraper can check robots.txt automatically and apply its rules:

{
  "compliance": {
    "checkRobotsTxt": true,
    "respectCrawlDelay": true,
    "userAgent": "Axellero Web Scraper 1.0",
    "contactInfo": "admin@example.com",
    "honorNoIndex": true,
    "respectMetaRobots": true
  }
}
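
For readers who want to see what a robots.txt check involves, the sketch below uses Python's standard `urllib.robotparser` module. It illustrates the general technique and is not the node's internal implementation. The robots.txt content is made up for the example and is parsed from a string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "Axellero Web Scraper 1.0"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() applies the Allow/Disallow rules that match the user agent.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/products")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")

# crawl_delay() returns the Crawl-delay value in seconds, or None if absent.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```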

Rate Limiting

{
  "rateLimiting": {
    "requestsPerMinute": 30,
    "delayBetweenRequests": 2000,
    "randomizeDelay": true,
    "respectRetryAfter": true,
    "maxConcurrentRequests": 3,
    "backoffStrategy": "exponential"
  }
}
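
The document does not state how `requestsPerMinute` and `delayBetweenRequests` interact. One plausible reading, sketched below, is that both limits apply and the longer delay wins, with optional jitter on top. Both the combination rule and the jitter range (up to 50% extra) are assumptions of this sketch, and `next_delay_ms` is a hypothetical helper.

```python
import random


def next_delay_ms(cfg, rng=random.random):
    """Delay before the next request, in milliseconds (assumed semantics)."""
    # Assumption: both limits apply, so the stricter (longer) delay wins.
    floor_from_rpm = 60_000 / cfg["requestsPerMinute"]
    delay = max(cfg["delayBetweenRequests"], floor_from_rpm)
    if cfg.get("randomizeDelay"):
        # Assumption: jitter only lengthens the delay, by up to 50%.
        delay *= 1 + 0.5 * rng()
    return delay


config = {"requestsPerMinute": 30, "delayBetweenRequests": 2000, "randomizeDelay": True}
print(next_delay_ms(config))
```

With the values shown in the configuration above, 30 requests per minute implies at least 2000 ms between requests, which matches `delayBetweenRequests`, so the two settings agree.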

Content Filtering

{
  "contentFiltering": {
    "blockPersonalData": true,
    "skipAdultContent": true,
    "copyrightDetection": true,
    "malwareCheck": true,
    "maxContentSize": "10MB",
    "allowedContentTypes": [
      "text/html",
      "application/xhtml+xml"
    ]
  }
}

Error Handling

Retry Logic

{
  "errorHandling": {
    "retryAttempts": 3,
    "retryDelay": 1000,
    "retryOn": [
      "timeout",
      "network_error", 
      "rate_limit"
    ],
    "backoffMultiplier": 2,
    "maxRetryDelay": 30000,
    "continueOnError": false
  }
}
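
To show what these settings imply in practice, the sketch below computes the wait before each retry under a standard exponential backoff: the delay starts at `retryDelay`, is multiplied by `backoffMultiplier` after every attempt, and is capped at `maxRetryDelay`. That this is exactly how the node schedules retries is an assumption; `retry_delays` is a hypothetical helper.

```python
def retry_delays(cfg):
    """Return the wait (ms) before each retry under capped exponential backoff."""
    delays = []
    delay = cfg["retryDelay"]
    for _ in range(cfg["retryAttempts"]):
        delays.append(min(delay, cfg["maxRetryDelay"]))
        delay *= cfg["backoffMultiplier"]
    return delays


config = {
    "retryAttempts": 3,
    "retryDelay": 1000,
    "backoffMultiplier": 2,
    "maxRetryDelay": 30000,
}
print(retry_delays(config))  # [1000, 2000, 4000]
```

Under the same assumption, raising `retryAttempts` to 7 would give waits of 1000, 2000, 4000, 8000, 16000, 30000, and 30000 ms, because the cap takes effect from the sixth retry onward.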

Error Response Format

{
  "success": false,
  "error": {
    "type": "EXTRACTION_FAILED",
    "message": "No elements found matching selector '.product-card'",
    "url": "https://example.com/products",
    "selector": ".product-card",
    "suggestions": [
      "Check if the page structure has changed",
      "Verify the CSS selector is correct",
      "Enable dynamic content handling",
      "Check if the page requires user interaction"
    ]
  }
}

Performance Optimization

Caching Strategy

{
  "caching": {
    "enabled": true,
    "ttl": 3600,
    "cacheKey": ["url", "extractors"],
    "storage": "memory",
    "compression": true,
    "invalidateOn": ["page_change", "content_update"]
  }
}
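
The `cacheKey` setting names the request fields that identify a cached entry. One common way to turn such fields into a key is to hash a canonical JSON serialization of them, as sketched below. How the node actually derives its keys is not documented here, so treat `cache_key` as a hypothetical illustration.

```python
import hashlib
import json


def cache_key(request: dict, key_fields=("url", "extractors")) -> str:
    """Derive a stable cache key from selected request fields (assumed scheme)."""
    material = {name: request.get(name) for name in key_fields}
    # sort_keys makes the serialization independent of dict ordering.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request_a = {
    "url": "https://example.com/products",
    "extractors": [{"name": "products", "selector": ".product-card"}],
}
request_b = {
    "extractors": [{"selector": ".product-card", "name": "products"}],
    "url": "https://example.com/products",
    "options": {"waitForLoad": True},  # not part of the key, so it is ignored
}
print(cache_key(request_a) == cache_key(request_b))
```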

Resource Management

{
  "resources": {
    "maxMemoryUsage": "512MB",
    "maxExecutionTime": 60000,
    "downloadTimeout": 30000,
    "maxRedirects": 5,
    "resourceTypes": {
      "block": ["image", "font", "media"],
      "allow": ["document", "stylesheet", "script", "xhr"]
    }
  }
}

Usage Examples

E-commerce Product Scraping

{
  "url": "https://shop.example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-item",
      "fields": {
        "name": ".product-name",
        "price": ".price-current @number",
        "original_price": ".price-original @number",
        "discount": ".discount-percent",
        "image": ".product-image img @src",
        "rating": ".rating @attribute:data-rating @number",
        "reviews_count": ".reviews-count @number",
        "availability": ".stock-status",
        "product_url": "a @href"
      }
    }
  ],
  "dynamic": {
    "enabled": true,
    "actions": [
      {
        "type": "scroll",
        "direction": "bottom"
      }
    ]
  }
}

News Article Collection

{
  "url": "https://news.example.com/technology",
  "extractors": [
    {
      "name": "articles",
      "selector": "article",
      "fields": {
        "headline": "h2.headline a",
        "summary": ".article-summary",
        "author": ".byline .author-name",
        "publish_date": ".publish-date @date",
        "category": ".category-tag",
        "article_url": "h2.headline a @href",
        "image": ".article-image img @src",
        "read_time": ".read-time @number"
      }
    },
    {
      "name": "pagination",
      "selector": ".pagination",
      "fields": {
        "current_page": ".current-page @number",
        "total_pages": ".total-pages @number",
        "next_url": ".next-page @href"
      }
    }
  ]
}
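
The `pagination` extractor above returns the information needed to follow a listing across pages. A sketch of that step is shown below, using `urljoin` from Python's standard library to resolve a relative `next_url` against the current page. It assumes the pagination result is a single object with the three fields defined above; `next_page_url` is a hypothetical helper.

```python
from urllib.parse import urljoin


def next_page_url(page_url: str, pagination: dict):
    """Return the absolute URL of the next page, or None on the last page."""
    current = pagination.get("current_page")
    total = pagination.get("total_pages")
    next_url = pagination.get("next_url")
    on_last_page = current is not None and total is not None and current >= total
    if not next_url or on_last_page:
        return None
    # Resolve relative links such as "/technology?page=2".
    return urljoin(page_url, next_url)


print(next_page_url(
    "https://news.example.com/technology",
    {"current_page": 1, "total_pages": 3, "next_url": "/technology?page=2"},
))
```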

Table Data Extraction

{
  "url": "https://data.example.com/financial-reports",
  "extractors": [
    {
      "name": "financial_data",
      "selector": "table.data-table tbody tr",
      "fields": {
        "company": "td:nth-child(1)",
        "revenue": "td:nth-child(2) @number",
        "profit": "td:nth-child(3) @number", 
        "growth_rate": "td:nth-child(4) @number",
        "market_cap": "td:nth-child(5) @number",
        "report_date": "td:nth-child(6) @date"
      }
    }
  ],
  "options": {
    "validateContent": true,
    "skipEmptyRows": true
  }
}

Integration Patterns

With Web Search Tools

Use search results as input URLs for targeted content extraction workflows.

With File System Tools

Save scraped data to structured files for offline processing and analysis.
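
As a rough illustration of this pattern, the sketch below converts a list of extracted records into CSV text with Python's standard `csv` module. The record shape is an assumption (a list of flat dictionaries), `records_to_csv` is a hypothetical helper, and writing the returned text to disk is left to whichever file system tool the workflow uses.

```python
import csv
import io


def records_to_csv(records: list) -> str:
    """Serialize flat record dictionaries to CSV text (hypothetical helper)."""
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{"title": "A", "price": 1.5}, {"title": "B"}]
print(records_to_csv(records))
```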

With Data Analysis Tools

Process extracted data for insights, patterns, and statistical analysis.

Best Practices

Selector Strategy

  • Use specific, stable CSS selectors
  • Avoid brittle selectors that depend on styling
  • Test selectors across different page states
  • Implement fallback selectors for critical data

Ethical Scraping

  • Always check and respect robots.txt
  • Implement appropriate delays between requests
  • Use descriptive user agent strings
  • Monitor and limit resource usage

Data Quality

  • Validate extracted data formats
  • Handle missing or malformed data gracefully
  • Implement data cleaning and normalization
  • Monitor extraction success rates
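
One simple way to monitor extraction success is to measure the share of records in which every required field is present and non-empty. The sketch below shows that calculation; the record shape and the helper name `extraction_success_rate` are assumptions for illustration.

```python
def extraction_success_rate(records: list, required_fields: list) -> float:
    """Fraction of records whose required fields are all present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(record.get(field) not in (None, "") for field in required_fields)
        for record in records
    )
    return complete / len(records)


records = [
    {"title": "A", "price": 1},
    {"title": "", "price": 2},     # empty title counts as missing
    {"title": "C", "price": None}, # missing price
    {"title": "D", "price": 0},    # zero is a valid value
]
print(extraction_success_rate(records, ["title", "price"]))  # 0.5
```

A sudden drop in this rate between runs is often an early sign that the page structure has changed and the selectors need updating.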

Performance

  • Cache frequently accessed pages
  • Use batch operations for multiple URLs
  • Optimize selector complexity
  • Monitor memory and CPU usage

Getting Started

  1. Identify target websites and data requirements
  2. Analyze page structure and create extraction rules
  3. Configure compliance and rate limiting settings
  4. Test extraction with sample pages
  5. Implement error handling and validation
  6. Monitor performance and adjust configuration
