
Web Crawler

The Web Crawler node turns unstructured web pages into clean JSON payloads. It replaces fragile hand-written scraping scripts for tasks such as monitoring competitor pages or aggregating news, and it scrapes both static HTML and dynamic Single Page Applications (SPAs).

Data Extraction / Action

What can you do with Web Crawler?

Advanced CSS Selector Engine

Target specific DOM elements, such as product prices or article titles, using standard CSS selectors. You do not need to walk complex HTML hierarchies by hand.

Automated List Aggregation

Pull repeating list data into distinct workflow items. The output arrays can be passed to your downstream Data Processing pipelines.

Dynamic JavaScript Rendering

Integrates with headless Chrome to execute React or Vue sites, so you can scrape modern SPAs that load their data lazily.

Detailed Usage & Configuration

The Web Crawler node lets you build Data Processing pipelines without writing and maintaining your own scraping scripts on top of tools like Puppeteer. It requests a target URL, parses the resulting DOM (rendering it first when JavaScript rendering is enabled), and queries elements with your selectors. If you already have the raw markup, use the HTML Node to parse it offline.

1. CSS Selectors Configuration

To extract information, define an output property name and the CSS selector that matches it. The engine maps the matching DOM elements to your workflow variables.

  • Use h1.article-title to extract the text of a heading.
  • Use .product-price to capture a price as a string.
  • Use img.thumbnail with the Return Attribute set to src to capture image URLs.
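The mapping from output properties to selectors can be illustrated with a small standalone sketch. This is not the node's implementation: the `extract` function, the `FIELDS` layout, and the sample markup are hypothetical. It uses only Python's standard library and supports just simple selectors of the form tag.class or .class.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


def parse_selector(selector):
    """Split 'h1.article-title' or '.product-price' into (tag or None, classes)."""
    tag, *classes = selector.split(".")
    return (tag or None), set(classes)


class FirstMatch(HTMLParser):
    """Records the text (or one attribute) of the first matching element."""

    def __init__(self, selector, attribute=None):
        super().__init__()
        self.tag, self.classes = parse_selector(selector)
        self.attribute = attribute
        self.depth = 0      # > 0 while inside the matched element
        self.done = False
        self.value = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            return
        attrs = dict(attrs)
        element_classes = set((attrs.get("class") or "").split())
        if self.tag in (None, tag) and self.classes <= element_classes:
            if self.attribute is not None or tag in VOID_TAGS:
                # Attribute mode (or a void element with no text to read).
                self.value = attrs.get(self.attribute) if self.attribute else ""
                self.done = True
            else:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.value = "".join(self._text).strip()
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self._text.append(data)


def extract(html, fields):
    """Return {property: value} for each configured selector (hypothetical shape)."""
    result = {}
    for name, spec in fields.items():
        parser = FirstMatch(spec["selector"], spec.get("attribute"))
        parser.feed(html)
        parser.close()
        result[name] = parser.value
    return result


SAMPLE = """
<article>
  <h1 class="article-title">Autumn Sale</h1>
  <span class="product-price">$19.99</span>
  <img class="thumbnail" src="/img/sale.png">
</article>
"""

FIELDS = {
    "title": {"selector": "h1.article-title"},
    "price": {"selector": ".product-price"},
    "image": {"selector": "img.thumbnail", "attribute": "src"},
}

record = extract(SAMPLE, FIELDS)
```

Here `record` holds one value per configured property, with the image field taken from the src attribute rather than from text content.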

2. Array & List Extraction

When scraping e-commerce category pages or sitemaps, enable the Return Array option. A selector like ul.products li a then returns every matching element as a distinct item. You can pass this array into a Loop Node to process each item in turn.
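As a rough illustration of what Return Array changes, the sketch below collects every element matched by a descendant selector such as ul.products li a. It is a simplified stand-in, not the node's engine: the `extract` function and its `return_array` parameter are hypothetical names, only tag.class steps separated by spaces are supported, and the markup is assumed to use explicit closing tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


def parse_step(step):
    """Split one selector step like 'ul.products' into (tag or None, classes)."""
    tag, *classes = step.split(".")
    return (tag or None), set(classes)


def step_matches(step, element):
    tag, classes = step
    return tag in (None, element[0]) and classes <= element[1]


class ListCollector(HTMLParser):
    """Collects the text of every element matching a descendant selector."""

    def __init__(self, selector):
        super().__init__()
        self.steps = [parse_step(s) for s in selector.split()]
        self.stack = []            # currently open elements as (tag, classes)
        self.capture_depth = None  # stack depth of the element being captured
        self.items = []
        self._text = []

    def _selector_matches(self):
        # The last step must match the innermost element; earlier steps
        # must match ancestors in order, searching outward.
        *ancestors, target = self.steps
        if not step_matches(target, self.stack[-1]):
            return False
        position = len(self.stack) - 2
        for step in reversed(ancestors):
            while position >= 0 and not step_matches(step, self.stack[position]):
                position -= 1
            if position < 0:
                return False
            position -= 1
        return True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))
        if self.capture_depth is None and self._selector_matches():
            self.capture_depth = len(self.stack)
            self._text = []

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray end tag
        while self.stack:
            if self.capture_depth == len(self.stack):
                self.items.append("".join(self._text).strip())
                self.capture_depth = None
            open_tag, _ = self.stack.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.capture_depth is not None:
            self._text.append(data)


def extract(html, selector, return_array=False):
    collector = ListCollector(selector)
    collector.feed(html)
    collector.close()
    if return_array:
        return collector.items
    return collector.items[0] if collector.items else None


SAMPLE = """
<ul class="products">
  <li><a href="/p/1">Kettle</a></li>
  <li><a href="/p/2">Toaster</a></li>
  <li><a href="/p/3">Blender</a></li>
</ul>
<a href="/about">About us</a>
"""

names = extract(SAMPLE, "ul.products li a", return_array=True)
first = extract(SAMPLE, "ul.products li a")
```

With `return_array=True` the result is a list with one entry per link inside the product list, which is the shape a loop step would iterate over. The link outside ul.products is not matched.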

3. JavaScript Rendering (SPA Support)

Many modern websites (such as Shopee or Amazon) rely heavily on client-side frameworks (React, Vue) to load and hydrate data after the initial response. A plain HTTP request to such pages may return only an empty skeleton.

When "Render JavaScript" is enabled, the Web Crawler connects to a Headless Browser microservice that executes the page scripts before scraping.
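One way to decide whether a page needs "Render JavaScript" is to check whether the static response is an empty skeleton. The heuristic below is only a sketch under assumptions of our own: the `looks_like_empty_skeleton` name and the 200-character threshold are arbitrary choices, not part of the node.

```python
from html.parser import HTMLParser

# Content inside these tags is not visible page text.
SKIPPED = {"script", "style", "noscript", "title"}


class SkeletonProbe(HTMLParser):
    """Counts <script> tags and visible text characters in a document."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.visible_chars += len(data.strip())


def looks_like_empty_skeleton(html, min_visible_chars=200):
    """Heuristic: scripts are present but almost no visible text was served."""
    probe = SkeletonProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.visible_chars < min_visible_chars


SPA_SHELL = (
    "<html><head><title>Shop</title></head><body>"
    '<div id="root"></div><script src="/static/app.js"></script>'
    "</body></html>"
)
STATIC_PAGE = (
    "<html><body><h1>News</h1><p>"
    + "Lorem ipsum dolor sit amet. " * 20
    + "</p></body></html>"
)

needs_rendering = looks_like_empty_skeleton(SPA_SHELL)
static_is_fine = not looks_like_empty_skeleton(STATIC_PAGE)
```

A check like this can produce false positives on very short static pages, so treat it as a hint rather than a rule.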

4. Browserless Configuration (Self-Hosted)

To use JS Rendering, the node must be linked to a running Browserless Docker instance. Configure this on your backend server by adding the following variable to your .env file:

BROWSERLESS_API_URL=http://localhost:3001/content

If this environment variable is missing, the node throws an error rather than failing silently, so a misconfiguration cannot cause unnoticed data loss during your automation runs.
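The fail-fast behaviour described above can be sketched as follows. This is an illustration, not the node's source: `build_render_request` and `MissingBrowserlessConfig` are hypothetical names, and the request shape assumes that the Browserless /content endpoint accepts a POST with a JSON body containing a url field, which you should confirm against the Browserless version you run.

```python
import json
import os
import urllib.request


class MissingBrowserlessConfig(RuntimeError):
    """Raised when BROWSERLESS_API_URL is not configured (hypothetical name)."""


def build_render_request(target_url, env=os.environ):
    """Build (but do not send) a render request for the given page URL."""
    endpoint = env.get("BROWSERLESS_API_URL")
    if not endpoint:
        # Fail loudly instead of returning an empty result.
        raise MissingBrowserlessConfig(
            "BROWSERLESS_API_URL is not set; JS rendering cannot run."
        )
    # Assumed request body: {"url": "<page to render>"}.
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_render_request(
    "https://example.com/",
    env={"BROWSERLESS_API_URL": "http://localhost:3001/content"},
)
# To actually render the page, send the request, for example:
# html = urllib.request.urlopen(request, timeout=30).read().decode("utf-8")
```

Passing the environment as a parameter keeps the check easy to test. In normal use the default `os.environ` picks up the value loaded from your .env file.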

🛡️ Security & Anti-Bot: Sites protected by Web Application Firewalls (such as Cloudflare or DataDome) may block headless requests with a 403 Forbidden error. Consider using Stealth Mode or proxy IPs.
💡 Performance Tip: Leave JS Rendering disabled unless the page requires it. Static HTML extraction skips the headless browser, so it is faster and uses far less memory.

Frequently Asked Questions

  • Why does my scrape return empty JSON payloads? This usually happens when the CSS selector does not match anything or the target website is a dynamic SPA. Verify your selectors in Chrome DevTools, or enable JS Rendering.
  • How do I prevent out-of-memory errors when crawling very large pages? The node uses a Zero-Allocation streaming engine internally, but whatever you extract is still stored in the downstream JSON state. Avoid selecting the entire body tag so that the state stays small.