Scraping Hierarchy
Each layer is tried in order. The system escalates to the next (more expensive) layer only when the current one cannot reach the required confidence threshold. This keeps browser renders to a minimum and throughput high.
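The escalation logic can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: `LayerResult`, `run_cascade`, and the default threshold of 0.7 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class LayerResult:
    contacts: List[str] = field(default_factory=list)
    confidence: float = 0.0


# A layer is a (name, callable) pair; the callable takes a URL and returns a LayerResult.
Layer = Tuple[str, Callable[[str], LayerResult]]


def run_cascade(
    url: str, layers: List[Layer], threshold: float = 0.7  # threshold value is an assumption
) -> Tuple[Optional[str], LayerResult]:
    """Try layers cheapest-first and stop at the first one that meets the threshold."""
    best_name, best = None, LayerResult()
    for name, extract in layers:
        result = extract(url)
        if result.confidence > best.confidence:
            best_name, best = name, result
        if result.confidence >= threshold:
            break  # good enough: skip the more expensive layers
    return best_name, best
```

If no layer reaches the threshold, the sketch returns the best result seen so the caller can decide whether to flag the company for manual review.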
Static Layer
HTTP + Regex + Schema.org
Fast HTTP requests with BeautifulSoup parsing. Extracts emails, phone numbers, and structured data (JSON-LD, schema.org microdata) from raw HTML. Handles 60-70% of company websites that serve static content.
Techniques
- httpx/requests with anti-detection headers
- Regex patterns for email and phone extraction
- JSON-LD and schema.org structured data parsing
- Meta tag extraction (OG, Twitter cards)
- Sitemap.xml parsing for page discovery
Escalation Rule
Escalates if confidence < threshold or page returns empty/minimal content
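A rough sketch of the static pass, covering email extraction and JSON-LD parsing only. It uses standard-library regular expressions in place of BeautifulSoup so that it runs without dependencies. The function names, the email pattern, and the 200-character "minimal content" cutoff are assumptions made for illustration.

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
SCRIPT_RE = re.compile(r"<script.*?</script>", re.IGNORECASE | re.DOTALL)


def extract_static(html: str) -> dict:
    """Pull email addresses and JSON-LD objects out of raw HTML."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    structured = []
    for raw in JSONLD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD blocks are skipped
        structured.extend(data if isinstance(data, list) else [data])
    return {"emails": emails, "structured": structured}


def static_should_escalate(html: str, result: dict, min_chars: int = 200) -> bool:
    """Escalate when the page is near-empty or nothing was extracted.

    min_chars is an assumed cutoff for "minimal content".
    """
    without_scripts = SCRIPT_RE.sub(" ", html)
    visible = " ".join(re.sub(r"<[^>]+>", " ", without_scripts).split())
    nothing_found = not (result["emails"] or result["structured"])
    return len(visible) < min_chars or nothing_found
```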
JS Data Layer
__NEXT_DATA__, React Props, API Endpoints
Extracts data from JavaScript-rendered frameworks without launching a browser. Parses __NEXT_DATA__ (Next.js), data-react-props attributes, inline JSON payloads, and discovers internal API endpoints that serve structured data.
Techniques
- __NEXT_DATA__ JSON extraction (Next.js SSR)
- data-react-props attribute parsing (React)
- Inline script JSON payload detection
- XHR/fetch API endpoint discovery
- GraphQL introspection endpoint probing
Escalation Rule
Escalates if no JS data structures found or extracted data is incomplete
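As an illustration of the `__NEXT_DATA__` technique: Next.js pages embed their page props as JSON in a `<script id="__NEXT_DATA__">` tag, which can be read without a browser. The sketch below is an assumption-laden example, not the platform's implementation; `extract_next_data` and `find_values` are hypothetical helpers, and the key being searched for (such as `email`) depends entirely on the target site.

```python
import json
import re
from typing import Any, Iterator, Optional

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id=["\']__NEXT_DATA__["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Recursively yield every value stored under `key` in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
```

A `None` result from `extract_next_data` corresponds to the "no JS data structures found" case in the escalation rule.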
Structural Layer
Page Classification & Priority Crawling
Classifies discovered pages by type (/team, /about, /contact, /leadership, /staff) and crawls them in priority order. Budget-aware: at most 15 pages per company, selected by priority score.
Techniques
- URL pattern classification (team, about, contact, careers)
- Internal link graph analysis
- Sitemap-guided page discovery
- Priority scoring (team pages > about > contact > other)
- Cross-page contact accumulation
Escalation Rule
Escalates if classified pages don't contain extractable contact data
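One way to picture the classification and budgeted ordering is the sketch below. The team > about > contact > other ordering and the 15-page cap come from the description above; grouping leadership and staff with team pages, ranking careers below contact, and the names `classify` and `plan_crawl` are assumptions.

```python
import re
from typing import Dict, List
from urllib.parse import urlparse

# Lower number = higher priority.
PAGE_PRIORITY: Dict[str, int] = {
    "team": 0, "leadership": 0, "staff": 0,  # assumed to rank together
    "about": 1,
    "contact": 2,
    "careers": 3,  # assumed position
}
OTHER_PRIORITY = 4
MAX_PAGES = 15


def classify(url: str) -> str:
    """Classify a URL by the first known page-type token in its path."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", urlparse(url).path.lower()) if t]
    for token in tokens:
        if token in PAGE_PRIORITY:
            return token
    return "other"


def plan_crawl(urls: List[str], budget: int = MAX_PAGES) -> List[str]:
    """Deduplicate, sort by page-type priority, and cut to the page budget."""
    unique = list(dict.fromkeys(urls))  # keeps discovery order
    ranked = sorted(unique, key=lambda u: PAGE_PRIORITY.get(classify(u), OTHER_PRIORITY))
    return ranked[:budget]
```

Because Python's sort is stable, pages of equal priority stay in discovery order.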
Semantic Layer
4-Method Contact Extraction
Deep content analysis using 4 cascading extraction methods. Combines structured data parsing, visual layout analysis, proximity heuristics, and social profile matching to find decision makers.
Techniques
- JSON-LD person/organization extraction
- Team card CSS pattern detection (photo + name + title)
- Heuristic proximity analysis (name near email/phone within DOM distance)
- LinkedIn profile URL extraction and matching
- Company general email separated from personal contacts
Escalation Rule
Escalates if semantic methods find fewer than 2 contacts with confidence above 0.5
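Two pieces of this layer lend themselves to a short sketch: separating general company mailboxes from personal contacts, and the escalation check. The `Contact` type, the list of generic prefixes, and the function names are hypothetical; only the thresholds (2 contacts, confidence 0.5) come from the rule above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed list of role-based mailbox prefixes; a real list would be longer.
GENERIC_PREFIXES = {"info", "contact", "hello", "sales", "support", "admin", "office"}


@dataclass
class Contact:
    name: str
    email: str
    confidence: float


def split_general(contacts: List[Contact]) -> Tuple[List[Contact], List[Contact]]:
    """Return (personal, general) based on the local part of each email."""
    personal: List[Contact] = []
    general: List[Contact] = []
    for c in contacts:
        local = c.email.split("@", 1)[0].lower()
        (general if local in GENERIC_PREFIXES else personal).append(c)
    return personal, general


def semantic_should_escalate(
    contacts: List[Contact], min_contacts: int = 2, min_conf: float = 0.5
) -> bool:
    """Escalate when fewer than min_contacts contacts exceed min_conf."""
    strong = [c for c in contacts if c.confidence > min_conf]
    return len(strong) < min_contacts
```

Whether general mailboxes should count toward the two-contact minimum is not stated in the rule; the sketch leaves that choice to the caller by checking whichever list is passed in.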
Browser Layer
Playwright Stealth Rendering
Full Playwright browser rendering for JavaScript-heavy SPAs that resist all other methods. Budget-capped at 3 browser renders per company to control costs. Handles click-to-reveal content, infinite scroll, and modal-based team directories.
Techniques
- Playwright stealth mode with anti-detection
- Click-to-reveal email/phone interaction
- Infinite scroll handling for team directories
- Modal and accordion content expansion
- Screenshot-based fallback for heavily protected sites
Escalation Rule
Final layer: if browser rendering fails, the company is marked for manual review
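The per-company render cap could be enforced with a small budget tracker like the one below. This is a sketch under stated assumptions: `RenderBudget` is a hypothetical class, and the `renderer` argument stands in for whatever function actually drives Playwright, which is not shown here. Only the cap of 3 renders comes from the description above.

```python
from typing import Callable, Dict, Optional


class RenderBudget:
    """Tracks browser renders per company and refuses requests past the cap."""

    def __init__(self, max_renders: int = 3) -> None:
        self.max_renders = max_renders
        self._used: Dict[str, int] = {}

    def remaining(self, company: str) -> int:
        return self.max_renders - self._used.get(company, 0)

    def render(self, company: str, url: str, renderer: Callable[[str], str]) -> Optional[str]:
        """Call renderer(url) if budget remains; return None otherwise."""
        if self.remaining(company) <= 0:
            return None  # budget exhausted: caller can flag for manual review
        self._used[company] = self._used.get(company, 0) + 1
        try:
            return renderer(url)
        except Exception:
            return None  # in this sketch a failed render still consumes budget
```

Charging failed renders against the budget is a design choice made for the example; it keeps costs bounded on sites that consistently block rendering.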