How does the 5-layer web scraping pipeline work for AI lead generation?

The hierarchy tries layers in order: Static HTTP (regex + schema.org), JS Data (__NEXT_DATA__/React props), Structural (page classification + priority crawling), Semantic (4-method contact extraction), and Browser (Playwright stealth). Each layer only escalates when confidence is too low, minimizing expensive browser renders for B2B sales intelligence.

Why does the web scraping pipeline use a hierarchy instead of Playwright for everything?

Playwright browser rendering is 10-50x slower and more resource-intensive than HTTP requests. The 5-layer hierarchy extracts data from 60-70% of company websites using fast static methods. Only JavaScript-heavy SPAs and click-to-reveal content require the browser layer, keeping the lead enrichment platform fast and cost-effective.

What are the per-company budget limits in the web scraping pipeline?

Each company has a fixed budget: at most 15 pages crawled, at most 3 Playwright browser renders, and at most 120 seconds of total runtime. The Structural layer prioritizes pages (team > about > contact) to maximize contact yield within these limits for AI lead generation.

Can the web scraping pipeline handle modern JavaScript frameworks like React and Next.js?

Yes. Layer 2 (JS Data) extracts __NEXT_DATA__ from Next.js, data-react-props from React, inline JSON payloads, and discovers internal API endpoints -- all without launching a browser. Layer 5 (Browser) handles SPAs that require full rendering for the B2B sales intelligence pipeline.

Platform

5 Layers

Scraping Hierarchy

Each layer is tried in order. The system only escalates to the next (more expensive) layer when the current one can't achieve the required confidence threshold. This minimizes browser renders and maximizes throughput.
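As a rough illustration of the escalation rule above, the sketch below runs layers cheapest-first and stops at the first one that meets a confidence threshold. The names `LayerResult` and `run_hierarchy`, the way confidence is represented, and the default threshold of 0.5 are assumptions made for this example, not the platform's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class LayerResult:
    """Hypothetical result type: what a layer found and how sure it is."""
    contacts: List[str] = field(default_factory=list)
    confidence: float = 0.0


# A layer is any callable that takes a URL and returns a LayerResult.
Layer = Callable[[str], LayerResult]


def run_hierarchy(
    url: str,
    layers: List[Tuple[str, Layer]],
    threshold: float = 0.5,  # assumed value; the page does not state a global threshold
) -> Tuple[Optional[str], LayerResult]:
    """Try layers in order and return the best result seen so far."""
    best_name, best = None, LayerResult()
    for name, layer in layers:
        result = layer(url)
        if result.confidence > best.confidence:
            best_name, best = name, result
        if result.confidence >= threshold:
            break  # confident enough: do not escalate to a more expensive layer
    return best_name, best
```

Ordering the list from Static to Browser means the expensive layers are only called when every cheaper layer before them fell short.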

Static Layer

HTTP + Regex + Schema.org

Fast HTTP requests with BeautifulSoup parsing. Extracts emails, phone numbers, and structured data (JSON-LD, schema.org microdata) from raw HTML. Handles the 60-70% of company websites that serve static content.

Techniques

httpx/requests with anti-detection headers
Regex patterns for email and phone extraction
JSON-LD and schema.org structured data parsing
Meta tag extraction (OG, Twitter cards)
Sitemap.xml parsing for page discovery

Escalation Rule

Escalates if confidence < threshold or page returns empty/minimal content
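A minimal sketch of the Static layer's regex and JSON-LD steps, applied to HTML that has already been fetched. To stay dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup; the email pattern and the `extract_static` function are illustrative assumptions, and phone extraction is left out.

```python
import json
import re
from html.parser import HTMLParser

# Deliberately simple email pattern; a production pattern would be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_static(html: str) -> dict:
    """Return unique emails and parsed JSON-LD objects found in raw HTML."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    parser = JsonLdParser()
    parser.feed(html)
    structured = []
    for raw in parser.blocks:
        try:
            structured.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
    return {"emails": emails, "json_ld": structured}
```

An empty result from a function like this is the kind of signal that would trigger escalation to the JS Data layer.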

JS Data Layer

__NEXT_DATA__, React Props, API Endpoints

Extracts data from JavaScript-rendered frameworks without launching a browser. Parses __NEXT_DATA__ (Next.js), data-react-props attributes, inline JSON payloads, and discovers internal API endpoints that serve structured data.

Techniques

__NEXT_DATA__ JSON extraction (Next.js SSR)
data-react-props attribute parsing (React)
Inline script JSON payload detection
XHR/fetch API endpoint discovery
GraphQL introspection endpoint probing

Escalation Rule

Escalates if no JS data structures found or extracted data is incomplete
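The `__NEXT_DATA__` technique can be sketched as follows: Next.js embeds page state as JSON inside a script tag with that id, so it can be read from the raw HTML without a browser. The helper names and the idea of searching the payload for an `email` key are assumptions for illustration; real payload shapes vary by site.

```python
import json
import re
from typing import Any, Iterator, Optional

# Matches the script tag Next.js uses to embed server-side page state.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Recursively yield every value stored under `key` in nested JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
```

A `None` return corresponds to the "no JS data structures found" case in the escalation rule.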

Structural Layer

Page Classification & Priority Crawling

Classifies discovered pages by type (/team, /about, /contact, /leadership, /staff) and crawls them in priority order. Budget-aware: max 15 pages per company with intelligent page selection.

Techniques

URL pattern classification (team, about, contact, careers)
Internal link graph analysis
Sitemap-guided page discovery
Priority scoring (team pages > about > contact > other)
Cross-page contact accumulation

Escalation Rule

Escalates if classified pages don't contain extractable contact data
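One way to picture URL classification and priority scoring under the 15-page budget is the sketch below. The priority order (team > about > contact > other) comes from this page; the numeric scores, the keyword lists, and the function names are assumptions, and pages such as careers simply fall into "other" here.

```python
from typing import List, Tuple
from urllib.parse import urlparse

# (page type, score, path keywords). Scores and keywords are illustrative.
PAGE_TYPES = [
    ("team", 3, ("team", "leadership", "staff")),
    ("about", 2, ("about",)),
    ("contact", 1, ("contact",)),
]


def classify(url: str) -> Tuple[str, int]:
    """Classify a URL by keywords in its path and return (type, score)."""
    path = urlparse(url).path.lower()
    for page_type, score, keywords in PAGE_TYPES:
        if any(keyword in path for keyword in keywords):
            return page_type, score
    return "other", 0


def plan_crawl(urls: List[str], max_pages: int = 15) -> List[str]:
    """Order URLs by descending priority and keep only what the budget allows."""
    # sorted() is stable, so URLs with equal scores keep their discovery order.
    ranked = sorted(urls, key=lambda u: classify(u)[1], reverse=True)
    return ranked[:max_pages]
```

Sorting before truncating means low-value pages are the first to be dropped when a site exposes more candidates than the budget permits.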

Semantic Layer

4-Method Contact Extraction

Deep content analysis using 4 cascading extraction methods. Combines structured data parsing, visual layout analysis, proximity heuristics, and social profile matching to find decision makers.

Techniques

JSON-LD person/organization extraction
Team card CSS pattern detection (photo + name + title)
Heuristic proximity analysis (name near email/phone within DOM distance)
LinkedIn profile URL extraction and matching
Company general email separated from personal contacts

Escalation Rule

Escalates if semantic methods find < 2 contacts with confidence > 0.5
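The escalation rule above and the separation of general company emails from personal contacts could look roughly like this. The `Contact` type, the list of general inbox names, and the function names are assumptions for the example; only the "fewer than 2 contacts with confidence above 0.5" rule is taken from the page.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Local parts treated as shared company inboxes; this list is an assumption.
GENERAL_LOCAL_PARTS = {"info", "contact", "hello", "sales", "support", "office", "admin"}


@dataclass
class Contact:
    email: str
    confidence: float
    name: str = ""


def is_general(email: str) -> bool:
    """True when the address looks like a shared inbox rather than a person."""
    return email.split("@", 1)[0].lower() in GENERAL_LOCAL_PARTS


def split_contacts(contacts: List[Contact]) -> Tuple[List[Contact], List[Contact]]:
    """Separate general company addresses from personal contacts."""
    general = [c for c in contacts if is_general(c.email)]
    personal = [c for c in contacts if not is_general(c.email)]
    return general, personal


def should_escalate(
    contacts: List[Contact],
    min_contacts: int = 2,
    min_confidence: float = 0.5,
) -> bool:
    """Escalate when fewer than min_contacts exceed min_confidence."""
    confident = [c for c in contacts if c.confidence > min_confidence]
    return len(confident) < min_contacts
```

Whether general inboxes should count toward the two-contact minimum is not stated on the page; this sketch counts every contact.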

Browser Layer

Playwright Stealth Rendering

Full Playwright browser rendering for JavaScript-heavy SPAs that resist all other methods. Budget-capped at 3 browser renders per company to control costs. Handles click-to-reveal content, infinite scroll, and modal-based team directories.

Techniques

Playwright stealth mode with anti-detection
Click-to-reveal email/phone interaction
Infinite scroll handling for team directories
Modal and accordion content expansion
Screenshot-based fallback for heavily protected sites

Escalation Rule

Final layer -- if browser rendering fails, the company is marked for manual review
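A hedged sketch of the render cap: the loop below stops after three render attempts per company, and the default renderer uses Playwright's synchronous API to load a page in headless Chromium. Stealth configuration, click-to-reveal interaction, and scroll handling are not shown. The function names, the `networkidle` wait, and the 30-second timeout are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable


def playwright_render(url: str) -> str:
    """Render one page in headless Chromium and return the final HTML."""
    # Imported lazily so the budget logic below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle", timeout=30_000)
            return page.content()
        finally:
            browser.close()


def render_within_budget(
    urls: Iterable[str],
    render: Callable[[str], str] = playwright_render,
    max_renders: int = 3,  # per-company cap stated on this page
) -> Dict[str, str]:
    """Render URLs in order until the render budget is spent."""
    rendered: Dict[str, str] = {}
    attempts = 0
    for url in urls:
        if attempts >= max_renders:
            break
        attempts += 1
        try:
            rendered[url] = render(url)
        except Exception:
            continue  # assumption: a failed render still counts against the budget
    return rendered
```

Passing the renderer as a parameter keeps the cap logic separate from the browser, so it can be exercised with a stub in place of a real Chromium launch.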

15

Max pages per company

3

Max browser renders

120s

Max runtime per company
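The per-company limits stated on this page (15 pages, 3 browser renders, 120 seconds) could be tracked with a small budget object like the one sketched here. The class name, its methods, and the choice to count a browser render as one of the 15 pages are assumptions for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CompanyBudget:
    """Tracks page, render, and runtime limits for a single company."""
    max_pages: int = 15
    max_renders: int = 3
    max_seconds: float = 120.0
    pages: int = 0
    renders: int = 0
    started: float = field(default_factory=time.monotonic)

    def elapsed(self) -> float:
        """Seconds since this budget was created."""
        return time.monotonic() - self.started

    def can_fetch(self) -> bool:
        """True while both the page and runtime limits have room left."""
        return self.pages < self.max_pages and self.elapsed() < self.max_seconds

    def can_render(self) -> bool:
        """True while a fetch is allowed and render slots remain."""
        return self.can_fetch() and self.renders < self.max_renders

    def record_fetch(self, rendered: bool = False) -> None:
        """Count one crawled page, and one render if a browser was used."""
        self.pages += 1  # assumption: a browser render also counts as a page
        if rendered:
            self.renders += 1
```

A crawler would check `can_fetch()` or `can_render()` before each request and call `record_fetch()` afterwards. `time.monotonic()` is used so the runtime limit is unaffected by system clock changes.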


Related Pipeline Pages

12-Stage Enrichment Pipeline

Platform

Full lead enrichment platform architecture

/platform/enrichment

7 Intelligence Modules

Platform

OSINT modules powering B2B sales intelligence

/platform/intelligence

Contact Intelligence Engine

Platform

Actor engine using the web scraping pipeline

/platform/actors/contact-intelligence-engine

Data Enrichment

Solution

Lead enrichment platform for CRM data

/solutions/data-enrichment

LeadsLogix vs Clay

Compare

Web scraping pipeline vs manual waterfall enrichment

/compare/vs-clay

All Features

Features

Full AI lead generation platform capabilities

/features
