
Web Crawler

The Web Crawler node is a powerful data extraction tool that visits URLs, parses the DOM, and extracts precise information using CSS selectors. Turn unstructured web pages into structured JSON data effortlessly for monitoring competitors, aggregating news, or scraping product prices.


What can you do with Web Crawler?

CSS Selector Engine

Surgically target exact elements on a complex page (like a product price tag, an article title, or an image source) using familiar, standard CSS selectors.

Automated List Extraction

Pull robust, repeating lists of data (like all hyperlinks on a sitemap, or all products in a retail category) and output them flawlessly as individual workflow items.

Lightweight DOM Parsing

Parse structured modern web pages and return the clean, parsed text or HTML attributes directly to your subsequent processing nodes.

Detailed Usage & Configuration

The Web Crawler node provides targeted data extraction without the overhead of heavyweight scripting tools like Puppeteer. It requests a URL, parses the returned HTML into a DOM, and lets you query elements with jQuery-style selectors.

1. Configuring Selectors

Once you input a target URL, define an output property name and its corresponding CSS Selector:

  • h1.article-title extracts the main header text.
  • .product-price extracts the numeric pricing string.
  • img.main-image paired with the Return Attribute src extracts the image URL instead of its text.
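The node itself is configured in the UI rather than in code, but its selector behavior can be sketched with Python's BeautifulSoup library (an illustrative assumption; the node does not actually run Python). The sample HTML below is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical static page mirroring the three selector examples above.
html = """
<html><body>
  <h1 class="article-title">Quarterly Market Report</h1>
  <span class="product-price">$19.99</span>
  <img class="main-image" src="https://example.com/widget.png">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Default behavior: return the element's text content.
title = soup.select_one("h1.article-title").get_text(strip=True)
price = soup.select_one(".product-price").get_text(strip=True)

# With "Return Attribute: src", the attribute value is returned instead of text.
image = soup.select_one("img.main-image")["src"]

print(title)  # Quarterly Market Report
print(price)  # $19.99
print(image)  # https://example.com/widget.png
```

Each output property in the node maps to one such selector query, and the results are merged into a single JSON object.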

2. Returning Arrays & Lists

By toggling "Return Array", a single selector like ul.nav-menu li a will output an array containing every matching link's text, rather than just the first match. Combine this natively with the Loop Node to iterate through each extracted link in turn.
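The difference between the two modes can be sketched as the difference between a single-match and an all-matches query, again using BeautifulSoup as a stand-in for the node's selector engine (the menu markup is hypothetical):

```python
from bs4 import BeautifulSoup

html = """
<ul class="nav-menu">
  <li><a href="/home">Home</a></li>
  <li><a href="/docs">Docs</a></li>
  <li><a href="/pricing">Pricing</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "Return Array" off: only the first matching element is returned.
first = soup.select_one("ul.nav-menu li a").get_text(strip=True)

# "Return Array" on: every match becomes one item in the output array.
links = [a.get_text(strip=True) for a in soup.select("ul.nav-menu li a")]

print(first)  # Home
print(links)  # ['Home', 'Docs', 'Pricing']
```

A downstream Loop Node would then receive the array and run once per item.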

3. Real-World Limitations

This crawler retrieves static HTML markup returned by the server upon the initial request. It does not execute client-side JavaScript. If the target website heavily relies on React/Vue to lazily render data after the page loads, the crawler won't "see" that data. Always verify the raw page source first.
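One quick way to verify this before building a workflow is to check whether your selector matches the server-rendered markup at all. The helper below is a hypothetical sketch (not part of the node); a JavaScript-rendered single-page app typically ships only an empty mount point in its initial HTML:

```python
from bs4 import BeautifulSoup

def selector_in_static_html(html: str, selector: str) -> bool:
    """Return True if the CSS selector matches the raw, server-rendered markup."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None

# Static page: the price exists in the initial HTML, so the crawler can see it.
static = '<div class="product-price">$42.00</div>'

# SPA page: the markup is an empty mount point; data arrives later via JavaScript.
spa = '<div id="root"></div><script src="/bundle.js"></script>'

print(selector_in_static_html(static, ".product-price"))  # True
print(selector_in_static_html(spa, ".product-price"))     # False
```

If the check fails on the raw source, look for an underlying JSON API or a server-rendered version of the page instead.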