5-layer crawling that adapts to every site
Static HTML -> JS Data -> Structural -> Semantic -> Browser
Most scrapers use one method for everything. LeadsLogix escalates through 5 layers only when confidence is too low -- minimizing detection while maximizing extraction. Per-company budgets cap at 15 pages, 3 browser renders, and 120 seconds.
The Problem with Traditional Web Scraping
Intelligent Escalation, Not Brute Force
Each layer fires only if the previous layer didn't extract enough data. Browser rendering is the last resort, not the default.
5-Layer Hierarchy
Static HTTP + regex + schema.org -> JS data (__NEXT_DATA__, data-react-props, API endpoints) -> Structural (/team, /about, /contact) -> Semantic (4-method contact extraction) -> Playwright browser rendering.
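The escalation rule can be sketched roughly as follows. This is an illustration under assumptions, not the LeadsLogix implementation: `LayerResult`, `crawl_with_escalation`, the stub layers, and the threshold of 70 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LayerResult:
    """Hypothetical layer output: extracted fields plus a 0-100 confidence."""
    data: Dict[str, str]
    confidence: int


Layer = Tuple[str, Callable[[str], LayerResult]]


def crawl_with_escalation(domain: str, layers: List[Layer], threshold: int = 70):
    """Run layers in order and stop once confidence reaches the threshold."""
    merged: Dict[str, str] = {}
    used: List[str] = []
    best = 0
    for name, run in layers:
        result = run(domain)
        used.append(name)
        merged.update(result.data)
        best = max(best, result.confidence)
        if best >= threshold:
            break  # later, costlier layers never fire
    return merged, used, best


# Stub layers standing in for the real extractors (assumed values).
def static_layer(domain: str) -> LayerResult:
    return LayerResult({"company": "Acme"}, 40)


def js_data_layer(domain: str) -> LayerResult:
    return LayerResult({"team_page": "/team"}, 80)


def browser_layer(domain: str) -> LayerResult:
    return LayerResult({}, 95)


data, used, confidence = crawl_with_escalation(
    "acme.com",
    [("static", static_layer), ("js_data", js_data_layer), ("browser", browser_layer)],
)
print(used)  # ['static', 'js_data']
```

In this run the browser layer is never called, because the second layer already clears the assumed threshold.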
Anti-Detection
Human-like delays (2-5s), user agent rotation, and per-domain rate limiting (max 10 concurrent per IP). If a CAPTCHA is detected, the crawl stops automatically and sets the requires_manual flag.
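A minimal sketch of the delay, rotation, and CAPTCHA-stop behavior, using only the Python standard library. The user agent strings are placeholders, and the marker list is an assumption about how a CAPTCHA page might be recognized; real detection is likely more involved.

```python
import random
from typing import Dict, Sequence

# Placeholder pool; a real deployment would use realistic browser strings.
USER_AGENTS = ["ExampleAgent/1.0", "ExampleAgent/2.0", "ExampleAgent/3.0"]

# Assumed markers: CSS class names used by common CAPTCHA widgets.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha")


def next_delay(rng: random.Random, low: float = 2.0, high: float = 5.0) -> float:
    """Pick a randomized pause in seconds within the 2-5s range."""
    return rng.uniform(low, high)


def pick_user_agent(rng: random.Random, agents: Sequence[str] = USER_AGENTS) -> str:
    """Rotate user agents by choosing one at random per request."""
    return rng.choice(agents)


def check_response(html: str) -> Dict[str, bool]:
    """Stop the crawl and flag for manual review if a CAPTCHA marker appears."""
    lowered = html.lower()
    blocked = any(marker in lowered for marker in CAPTCHA_MARKERS)
    return {"stop": blocked, "requires_manual": blocked}
```

A caller would pass the result of `next_delay` to `time.sleep` before each request and abandon the domain as soon as `check_response` reports `stop`.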
Structured Data Extraction
JSON-LD, schema.org microdata, Open Graph, and Twitter Card extraction. __NEXT_DATA__ and data-react-props parsing for React/Next.js sites.
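As an illustration, JSON-LD blocks and the __NEXT_DATA__ payload can both be read from script tags without rendering the page. The sketch below uses Python's standard html.parser; the class name `ScriptDataParser` and the sample HTML are made up for the example.

```python
import json
from html.parser import HTMLParser


class ScriptDataParser(HTMLParser):
    """Collect JSON-LD blocks and the __NEXT_DATA__ payload from script tags."""

    def __init__(self):
        super().__init__()
        self._kind = None   # which kind of script we are currently inside
        self._buf = []
        self.json_ld = []
        self.next_data = None

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        if attr.get("type") == "application/ld+json":
            self._kind = "json_ld"
        elif attr.get("id") == "__NEXT_DATA__":
            self._kind = "next_data"
        self._buf = []

    def handle_data(self, data):
        if self._kind:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._kind:
            return
        try:
            payload = json.loads("".join(self._buf))
        except json.JSONDecodeError:
            payload = None  # skip malformed JSON rather than fail the page
        if payload is not None:
            if self._kind == "json_ld":
                self.json_ld.append(payload)
            else:
                self.next_data = payload
        self._kind = None


SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
<script id="__NEXT_DATA__" type="application/json">{"props": {"pageProps": {"team": []}}}</script>
</head></html>
"""

parser = ScriptDataParser()
parser.feed(SAMPLE_HTML)
```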
Cross-Page Context
Multi-page context accumulator merges contacts found across /team, /about, /contact, /leadership, and /people pages into unified profiles.
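One way such an accumulator could work is sketched below. It assumes contacts are plain dicts with a `name` field and that a lowercased name is a good enough merge key; both are simplifications for illustration, and the sample contact is invented.

```python
from typing import Dict, List


def merge_contacts(pages: Dict[str, List[dict]]) -> List[dict]:
    """Merge contacts found on several pages into one profile per person."""
    profiles: Dict[str, dict] = {}
    for path, contacts in pages.items():
        for contact in contacts:
            key = contact["name"].strip().lower()  # assumed merge key
            profile = profiles.setdefault(
                key, {"name": contact["name"].strip(), "sources": []}
            )
            # Fill in only the fields the profile does not have yet.
            for field_name, value in contact.items():
                if value and not profile.get(field_name):
                    profile[field_name] = value
            if path not in profile["sources"]:
                profile["sources"].append(path)
    return list(profiles.values())


merged = merge_contacts({
    "/team": [{"name": "Jane Doe", "title": "Head of Sales"}],
    "/contact": [{"name": "jane doe", "email": "jane@acme.com"}],
})
```

Here the title from /team and the email from /contact end up on a single profile, with both pages recorded as sources.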
35+ Crawl Paths
Automatic discovery of team pages, about pages, contact pages, leadership directories, and organizational charts across 35+ URL patterns.
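Path discovery could look something like this sketch, which filters the links found on a page against a small pattern list. Only the five paths named on this page are included; the full set of 35+ patterns is not listed here, so the regex is a deliberately partial assumption.

```python
import re
from typing import Iterable, List
from urllib.parse import urljoin, urlparse

# Partial, assumed pattern list built from the paths named in the text.
PATH_PATTERN = re.compile(r"^/(team|about|contact|leadership|people)(/|$)", re.IGNORECASE)


def discover_pages(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Return same-site links whose path matches a known crawl pattern."""
    base_host = urlparse(base_url).netloc
    found: List[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # ignore links that leave the company's own site
        if PATH_PATTERN.match(parsed.path) and absolute not in found:
            found.append(absolute)
    return found


pages = discover_pages(
    "https://acme.com/",
    ["/team", "/blog", "about/", "https://other.example/team", "/team"],
)
```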
Budget Controls
Per-company limits: 15 pages max, 3 browser renders, 120-second runtime. Prevents runaway crawls while ensuring thorough extraction.
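The three limits can be modeled as a small budget object that the crawler consults before each fetch. This is a sketch only: `CrawlBudget` and its methods are hypothetical names, and counting pages and renders separately is an assumption.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CrawlBudget:
    """Per-company limits: 15 pages, 3 browser renders, 120 seconds."""
    max_pages: int = 15
    max_renders: int = 3
    max_seconds: float = 120.0
    pages: int = 0
    renders: int = 0
    started: float = field(default_factory=time.monotonic)

    def expired(self) -> bool:
        """True once the runtime limit has elapsed."""
        return time.monotonic() - self.started >= self.max_seconds

    def take_page(self) -> bool:
        """Reserve one page fetch, or refuse if a limit is hit."""
        if self.expired() or self.pages >= self.max_pages:
            return False
        self.pages += 1
        return True

    def take_render(self) -> bool:
        """Reserve one browser render, or refuse if a limit is hit."""
        if self.expired() or self.renders >= self.max_renders:
            return False
        self.renders += 1
        return True
```

A crawler loop would call `take_page()` before every request and stop the company's crawl on the first `False`.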
Crawling Pipeline
Each stage processes data sequentially with full checkpoint/resume capability.
URL Resolution
Resolve domain, follow redirects, validate SSL, check against bad domain filter list.
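The offline part of this stage, normalizing the input and checking it against a filter list, might be sketched as below. Redirect following and SSL validation need network access and are left out. The blocklist entry is a placeholder, not a real list.

```python
from urllib.parse import urlparse

# Placeholder entry; the actual bad domain filter list is not shown here.
BAD_DOMAINS = {"example.invalid"}


def normalize_domain(raw: str) -> str:
    """Reduce a URL or bare host to a lowercase domain without 'www.'."""
    raw = raw.strip().lower()
    if "://" not in raw:
        raw = "https://" + raw  # urlparse needs a scheme to find the host
    host = urlparse(raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_allowed(domain: str, blocklist=BAD_DOMAINS) -> bool:
    """Reject empty hosts, blocklisted domains, and their subdomains."""
    if not domain:
        return False
    return not any(domain == bad or domain.endswith("." + bad) for bad in blocklist)
```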
Static Layer
HTTP GET with anti-detection headers. Parse HTML with regex and extract schema.org, JSON-LD, and Open Graph metadata.
JS Data Layer
Extract __NEXT_DATA__, data-react-props, inline JSON, and API endpoint data from page source without rendering.
Structural Layer
Discover and crawl /team, /about, /contact, /leadership, /people pages. Build cross-page context map.
Semantic Layer
4-method contact extraction: JSON-LD structured data, team card detection, heuristic proximity analysis, LinkedIn X-ray.
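Of the four methods, heuristic proximity analysis is the easiest to illustrate. The sketch below pairs each email with the nearest name-like phrase within a character window. The regexes, the 80-character window, and the sample text are assumptions chosen for the example, and the name pattern is intentionally naive.

```python
import re
from typing import Dict, List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Naive assumption: a name is two capitalized words in a row.
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def pair_by_proximity(text: str, window: int = 80) -> List[Dict[str, str]]:
    """Attach each email to the closest name found within the window."""
    names = [(m.start(), m.group()) for m in NAME_RE.finditer(text)]
    pairs: List[Dict[str, str]] = []
    for match in EMAIL_RE.finditer(text):
        nearby = [
            (abs(match.start() - pos), name)
            for pos, name in names
            if abs(match.start() - pos) <= window
        ]
        if nearby:
            pairs.append({"name": min(nearby)[1], "email": match.group()})
    return pairs


TEXT = (
    "Jane Doe, Head of Sales, jane@acme.com. Our office is open daily. "
    "John Roe, Engineer, john@acme.com"
)
pairs = pair_by_proximity(TEXT)
```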
Browser Layer
Playwright rendering for JS-heavy pages. Budget-capped at 3 renders per company. Singleton pool management.
Quality Assessment
Score extraction confidence 0-100. Flag companies below threshold for re-crawl with deeper methods.
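A toy version of the scoring step is shown below. The field names, the weights, and the threshold of 60 are invented purely for illustration; the page does not state how the real score is computed.

```python
from typing import Dict

# Invented weights that sum to 100; not the product's actual scoring.
WEIGHTS: Dict[str, int] = {
    "company_name": 20,
    "contacts": 40,
    "emails": 25,
    "structured_data": 15,
}


def score_extraction(record: dict) -> int:
    """Add up the weight of every field that was actually extracted."""
    score = sum(weight for key, weight in WEIGHTS.items() if record.get(key))
    return max(0, min(100, score))


def needs_recrawl(score: int, threshold: int = 60) -> bool:
    """Flag the company for a deeper crawl when the score is below threshold."""
    return score < threshold
```

For example, a record with only a company name scores 20 under these weights and would be flagged for re-crawl.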
Technical Workflow
# Single company crawl
python -m tools.website_crawler --domain acme.com

# Batch crawl from CSV
python -m tools.enrichment.pipeline --input companies.csv

# 5-layer hierarchy with budget controls
# Layer 1: Static HTTP (httpx + regex + schema.org)
# Layer 2: JS Data (__NEXT_DATA__, API endpoints)
# Layer 3: Structural (/team, /about, /contact discovery)
# Layer 4: Semantic (4-method contact extraction)
# Layer 5: Browser (Playwright, max 3 renders/company)

# Resume interrupted crawl
python -m tools.enrichment.pipeline --input companies.csv --resume
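How a resume flag might skip work that has already been done can be sketched with a JSON checkpoint file. This is an assumption about the mechanism, not a description of the actual pipeline; `run_batch`, `load_done`, and the checkpoint format are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Set


def load_done(checkpoint: Path) -> Set[str]:
    """Read the set of already-crawled domains, if a checkpoint exists."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def run_batch(domains: Iterable[str], checkpoint: Path,
              crawl: Callable[[str], None], resume: bool = False) -> Set[str]:
    """Crawl each domain once, saving progress after every success."""
    done = load_done(checkpoint) if resume else set()
    for domain in domains:
        if domain in done:
            continue  # finished in an earlier run
        crawl(domain)
        done.add(domain)
        checkpoint.write_text(json.dumps(sorted(done)))
    return done


# Demo: the second run resumes and only crawls the new domain.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "checkpoint.json"
    run_batch(["a.example", "b.example"], checkpoint_file, calls.append)
    run_batch(["a.example", "b.example", "c.example"], checkpoint_file,
              calls.append, resume=True)
```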
API Access
/api/v1/crawl
Crawl a single domain with configurable layer depth and budget limits.
/api/v1/crawl/batch
Submit batch crawl job for multiple domains. Returns job ID for status polling.
/api/v1/crawl/{jobId}/status
Check crawl job progress: pages visited, layers used, contacts found.
/api/v1/crawl/{domain}/data
Retrieve extracted structured data, contacts, and metadata for a domain.
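For client code, the four endpoint paths can be assembled with small helpers like these. The base URL is a placeholder, and the HTTP methods, request bodies, and authentication are not documented on this page, so the sketch builds URLs only.

```python
from urllib.parse import quote

# Placeholder host; the real API base URL is not given on this page.
BASE_URL = "https://api.example.com"


def crawl_url() -> str:
    """Endpoint for crawling a single domain."""
    return f"{BASE_URL}/api/v1/crawl"


def batch_url() -> str:
    """Endpoint for submitting a batch crawl job."""
    return f"{BASE_URL}/api/v1/crawl/batch"


def status_url(job_id: str) -> str:
    """Endpoint for polling a job's progress."""
    return f"{BASE_URL}/api/v1/crawl/{quote(job_id, safe='')}/status"


def data_url(domain: str) -> str:
    """Endpoint for retrieving extracted data for a domain."""
    return f"{BASE_URL}/api/v1/crawl/{quote(domain, safe='')}/data"
```

Percent-encoding the path segment keeps an unexpected slash in a job ID or domain from changing the route.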
Use Cases
Pre-Event Intelligence
Crawl all exhibitor websites before a trade show to extract team pages, contact info, and company profiles.
CRM Enrichment
Batch crawl domains from your CRM to fill missing company data, contacts, and social profiles.
Competitive Analysis
Monitor competitor websites for team changes, new hires, and organizational structure updates.
Market Research
Crawl industry directories and company listings to build comprehensive market maps.
Tech Stack Detection
Extract technology signals from website source code, meta tags, and JS frameworks.
Lead Qualification
Crawl prospect websites to assess company size, team structure, and contact availability before outreach.
Industry Applications
Manufacturing
Industrial catalogs, product pages, and team directories with heavy HTML content.
SaaS / Technology
React/Next.js sites with JS-rendered content; browser-layer extraction is used when the JS data layer falls short.
Professional Services
Team pages, partner directories, and practice area listings.
E-Commerce
Vendor pages, supplier directories, and wholesale buyer portals.
Performance Metrics
Platform Preview
See how LeadsLogix processes, verifies, and delivers your leads in real time.
Scraper Console
Create crawl jobs, monitor progress, view queue depths and rate limit status.
Extraction Results
View extracted contacts, structured data, and confidence scores per domain.
Layer Usage Analytics
See which crawling layers are used most, and which sites require browser rendering.
Integrations