Web Scraping
Extract structured data from websites using CSS selector and XPath-based extraction, with built-in robots.txt compliance, rate limiting, dynamic content rendering, and content validation.
Features
- CSS selector and XPath-based data extraction
- Dynamic content and JavaScript rendering support
- Robots.txt compliance and rate limiting
- Form interaction and navigation capabilities
- Content validation and sanitization
Connector Options
The node uses a reusable connector configuration that applies to all scraping operations:
| Parameter | Type | Required | Description |
|---|---|---|---|
| userAgent | TEXT | No | Custom user agent string for requests |
| timeout | INT | No | Request timeout in milliseconds (default: 30000) |
| retryAttempts | INT | No | Number of retry attempts for failed requests |
| respectRobots | BOOLEAN | No | Check and respect robots.txt (default: true) |
| rateLimitDelay | INT | No | Delay between requests in milliseconds (default: 1000) |
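As a sketch, a connector configuration setting all of these options might look like the following (parameter names are from the table above; the values are illustrative):

```json
{
  "userAgent": "ExampleBot/1.0 (contact@example.com)",
  "timeout": 30000,
  "retryAttempts": 3,
  "respectRobots": true,
  "rateLimitDelay": 1000
}
```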
Methods
webScraping
Extract structured data from web pages using selectors and extraction rules.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | TEXT | Yes | URL of the webpage to scrape |
| extractors | Array | Yes | Data extraction rules and selectors |
| options | Object | No | Scraping behavior and processing options |
| dynamic | Object | No | Dynamic content handling configuration |
```json
{
  "url": "https://example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-card",
      "fields": {
        "title": "h3.product-title",
        "price": ".price-value",
        "description": ".product-description",
        "image": "img @src",
        "url": "a @href"
      }
    }
  ],
  "options": {
    "waitForLoad": true,
    "respectRobots": true,
    "validateContent": true
  },
  "dynamic": {
    "enabled": true,
    "waitFor": "networkidle",
    "timeout": 10000
  }
}
```

Output:
- success (Boolean) - Extraction success status
- data (Object) - Extracted data organized by extractor names
- metadata (Object) - Page metadata and extraction statistics
- errors (Array) - Any errors encountered during extraction
- performance (Object) - Timing and resource usage information
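The exact response shape depends on the configured extractors. As an illustrative sketch, a successful run of the products example above might return something like this (the individual metadata and performance field names are assumptions, not documented values):

```json
{
  "success": true,
  "data": {
    "products": [
      {
        "title": "Sample Product",
        "price": "19.99",
        "description": "A short product description.",
        "image": "https://example.com/images/sample.jpg",
        "url": "https://example.com/products/sample"
      }
    ]
  },
  "metadata": {
    "pageTitle": "Products",
    "elementsMatched": 1
  },
  "errors": [],
  "performance": {
    "totalTimeMs": 1850
  }
}
```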
batchScraping
Scrape multiple URLs with shared configuration and processing rules.
| Parameter | Type | Required | Description |
|---|---|---|---|
| urls | Array | Yes | List of URLs to scrape |
| extractors | Array | Yes | Shared extraction rules for all URLs |
| batchOptions | Object | No | Batch processing configuration |
| parallelism | INT | No | Number of concurrent scraping operations |
```json
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "extractors": [
    {
      "name": "content",
      "selector": "article",
      "fields": {
        "title": "h1",
        "content": ".article-body",
        "author": ".author-name"
      }
    }
  ],
  "batchOptions": {
    "delayBetweenRequests": 2000,
    "continueOnError": true,
    "saveIndividualResults": true
  },
  "parallelism": 3
}
```

formSubmission
Interact with web forms to access gated content or perform searches.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | TEXT | Yes | URL containing the form |
| formSelector | TEXT | Yes | CSS selector for the target form |
| formData | Object | Yes | Data to submit in the form |
| submitAction | TEXT | No | How to submit: click_button, submit_form, enter_key |
| waitAfterSubmit | INT | No | Time to wait after submission (ms) |
```json
{
  "url": "https://example.com/search",
  "formSelector": "#search-form",
  "formData": {
    "query": "machine learning",
    "category": "technology",
    "date_range": "past_year"
  },
  "submitAction": "click_button",
  "waitAfterSubmit": 3000
}
```

Data Extraction
Selector Types
CSS Selectors:
- Element selection: `div.product`, `#main-content`, `article h2`
- Attribute extraction: `img @src`, `a @href`, `meta[name="description"] @content`
- Text content: `.title`, `.description`
- Multiple elements: `.item` (returns an array)

XPath Expressions:
- Complex paths: `//div[@class="product"]//span[contains(@class, "price")]`
- Text nodes: `//h1/text()`, `//div[@id="content"]//text()`
- Following siblings: `//label[text()="Price"]/following-sibling::span`
- Attribute values: `//img/@src`, `//a/@href`
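Assuming the node accepts XPath expressions in the same selector and field slots as CSS selectors (every example in this document uses CSS only, so treat this as an illustration rather than confirmed behavior), a field map mixing both styles might look like:

```json
{
  "extractors": [
    {
      "name": "pricing",
      "selector": ".product",
      "fields": {
        "title": "h2.title",
        "price": "//span[contains(@class, \"price\")]/text()"
      }
    }
  ]
}
```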
Field Types and Processing
| Field Type | Description | Example |
|---|---|---|
| text | Extract text content | "title": "h1" |
| attribute | Get attribute value | "url": "a @href" |
| html | Get HTML content | "content": ".article @html" |
| number | Parse as number | "price": ".price @number" |
| date | Parse as date | "published": ".date @date" |
| array | Multiple elements | "tags": ".tag @array" |
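As an illustration, a single extractor combining several of these field types (the selectors are hypothetical):

```json
{
  "name": "article",
  "selector": "article",
  "fields": {
    "title": "h1",
    "body": ".article-body @html",
    "read_time": ".read-time @number",
    "published": ".publish-date @date",
    "tags": ".tag @array"
  }
}
```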
Advanced Extraction Patterns
```json
{
  "extractors": [
    {
      "name": "product_details",
      "selector": ".product",
      "fields": {
        "basic_info": {
          "title": "h2.title",
          "price": ".price @number",
          "availability": ".stock-status"
        },
        "specifications": {
          "selector": ".specs-table tr",
          "fields": {
            "property": "td:first-child",
            "value": "td:last-child"
          }
        },
        "reviews": {
          "selector": ".review",
          "fields": {
            "rating": ".rating @attribute:data-rating @number",
            "comment": ".review-text",
            "author": ".reviewer-name",
            "date": ".review-date @date"
          }
        }
      }
    }
  ]
}
```

Dynamic Content Handling
JavaScript Rendering
Configure how the scraper handles JavaScript-rendered content:
```json
{
  "dynamic": {
    "enabled": true,
    "engine": "chromium",
    "waitFor": "networkidle",
    "timeout": 15000,
    "viewportSize": {
      "width": 1920,
      "height": 1080
    },
    "actions": [
      {
        "type": "click",
        "selector": "#load-more-button"
      },
      {
        "type": "scroll",
        "direction": "down",
        "distance": "50vh"
      },
      {
        "type": "wait",
        "duration": 2000
      }
    ]
  }
}
```

Wait Conditions
| Wait Type | Description | Usage |
|---|---|---|
| networkidle | Wait for network requests to finish | Dynamic content loading |
| selector | Wait for specific element to appear | "waitFor": {"type": "selector", "value": ".data-loaded"} |
| function | Wait for JavaScript condition | "waitFor": {"type": "function", "value": "() => window.dataReady"} |
| timeout | Fixed time delay | "waitFor": {"type": "timeout", "value": 5000} |
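For example, the structured waitFor form from the table can replace the plain "networkidle" string used earlier, holding extraction until a marker element appears:

```json
{
  "dynamic": {
    "enabled": true,
    "waitFor": {
      "type": "selector",
      "value": ".data-loaded"
    },
    "timeout": 15000
  }
}
```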
Interaction Capabilities
```json
{
  "interactions": [
    {
      "type": "click",
      "selector": ".cookie-accept-button",
      "optional": true
    },
    {
      "type": "scroll",
      "direction": "bottom",
      "smooth": true
    },
    {
      "type": "type",
      "selector": "#search-input",
      "text": "search query"
    },
    {
      "type": "select",
      "selector": "#dropdown",
      "value": "option-value"
    },
    {
      "type": "wait",
      "condition": "networkidle"
    }
  ]
}
```

Compliance and Ethics
Robots.txt Compliance
Automatic robots.txt checking and compliance:
```json
{
  "compliance": {
    "checkRobotsTxt": true,
    "respectCrawlDelay": true,
    "userAgent": "Axellero Web Scraper 1.0",
    "contactInfo": "admin@example.com",
    "honorNoIndex": true,
    "respectMetaRobots": true
  }
}
```

Rate Limiting
```json
{
  "rateLimiting": {
    "requestsPerMinute": 30,
    "delayBetweenRequests": 2000,
    "randomizeDelay": true,
    "respectRetryAfter": true,
    "maxConcurrentRequests": 3,
    "backoffStrategy": "exponential"
  }
}
```

Content Filtering
```json
{
  "contentFiltering": {
    "blockPersonalData": true,
    "skipAdultContent": true,
    "copyrightDetection": true,
    "malwareCheck": true,
    "maxContentSize": "10MB",
    "allowedContentTypes": [
      "text/html",
      "application/xhtml+xml"
    ]
  }
}
```

Error Handling
Retry Logic
```json
{
  "errorHandling": {
    "retryAttempts": 3,
    "retryDelay": 1000,
    "retryOn": [
      "timeout",
      "network_error",
      "rate_limit"
    ],
    "backoffMultiplier": 2,
    "maxRetryDelay": 30000,
    "continueOnError": false
  }
}
```

Error Response Format
```json
{
  "success": false,
  "error": {
    "type": "EXTRACTION_FAILED",
    "message": "No elements found matching selector '.product-card'",
    "url": "https://example.com/products",
    "selector": ".product-card",
    "suggestions": [
      "Check if the page structure has changed",
      "Verify the CSS selector is correct",
      "Enable dynamic content handling",
      "Check if the page requires user interaction"
    ]
  }
}
```

Performance Optimization
Caching Strategy
```json
{
  "caching": {
    "enabled": true,
    "ttl": 3600,
    "cacheKey": ["url", "extractors"],
    "storage": "memory",
    "compression": true,
    "invalidateOn": ["page_change", "content_update"]
  }
}
```

Resource Management
```json
{
  "resources": {
    "maxMemoryUsage": "512MB",
    "maxExecutionTime": 60000,
    "downloadTimeout": 30000,
    "maxRedirects": 5,
    "resourceTypes": {
      "block": ["image", "font", "media"],
      "allow": ["document", "stylesheet", "script", "xhr"]
    }
  }
}
```

Usage Examples
E-commerce Product Scraping
```json
{
  "url": "https://shop.example.com/products",
  "extractors": [
    {
      "name": "products",
      "selector": ".product-item",
      "fields": {
        "name": ".product-name",
        "price": ".price-current @number",
        "original_price": ".price-original @number",
        "discount": ".discount-percent",
        "image": ".product-image img @src",
        "rating": ".rating @attribute:data-rating @number",
        "reviews_count": ".reviews-count @number",
        "availability": ".stock-status",
        "product_url": "a @href"
      }
    }
  ],
  "dynamic": {
    "enabled": true,
    "actions": [
      {
        "type": "scroll",
        "direction": "bottom"
      }
    ]
  }
}
```

News Article Collection
```json
{
  "url": "https://news.example.com/technology",
  "extractors": [
    {
      "name": "articles",
      "selector": "article",
      "fields": {
        "headline": "h2.headline a",
        "summary": ".article-summary",
        "author": ".byline .author-name",
        "publish_date": ".publish-date @date",
        "category": ".category-tag",
        "article_url": "h2.headline a @href",
        "image": ".article-image img @src",
        "read_time": ".read-time @number"
      }
    },
    {
      "name": "pagination",
      "selector": ".pagination",
      "fields": {
        "current_page": ".current-page @number",
        "total_pages": ".total-pages @number",
        "next_url": ".next-page @href"
      }
    }
  ]
}
```

Table Data Extraction
```json
{
  "url": "https://data.example.com/financial-reports",
  "extractors": [
    {
      "name": "financial_data",
      "selector": "table.data-table tbody tr",
      "fields": {
        "company": "td:nth-child(1)",
        "revenue": "td:nth-child(2) @number",
        "profit": "td:nth-child(3) @number",
        "growth_rate": "td:nth-child(4) @number",
        "market_cap": "td:nth-child(5) @number",
        "report_date": "td:nth-child(6) @date"
      }
    }
  ],
  "options": {
    "validateContent": true,
    "skipEmptyRows": true
  }
}
```

Integration Patterns
With Web Search Tools
Use search results as input URLs for targeted content extraction workflows.
With File System Tools
Save scraped data to structured files for offline processing and analysis.
With Data Analysis Tools
Process extracted data for insights, patterns, and statistical analysis.
Best Practices
Selector Strategy
- Use specific, stable CSS selectors
- Avoid brittle selectors that depend on styling
- Test selectors across different page states
- Implement fallback selectors for critical data
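To illustrate the point about brittle selectors: a selector built from layout and utility classes breaks whenever the site restyles, while one anchored on a structural attribute survives cosmetic changes. The data-testid attribute below is a hypothetical hook, not something every site provides:

```json
{
  "fields": {
    "price_brittle": "div > div:nth-child(3) > span.text-red-500",
    "price_stable": "[data-testid=\"product-price\"]"
  }
}
```

Prefer the second form wherever the target page exposes stable ids, data attributes, or semantic elements.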
Ethical Scraping
- Always check and respect robots.txt
- Implement appropriate delays between requests
- Use descriptive user agent strings
- Monitor and limit resource usage
Data Quality
- Validate extracted data formats
- Handle missing or malformed data gracefully
- Implement data cleaning and normalization
- Monitor extraction success rates
Performance
- Cache frequently accessed pages
- Use batch operations for multiple URLs
- Optimize selector complexity
- Monitor memory and CPU usage
Getting Started
1. Identify target websites and data requirements
2. Analyze page structure and create extraction rules
3. Configure compliance and rate limiting settings
4. Test extraction with sample pages
5. Implement error handling and validation
6. Monitor performance and adjust configuration