14 rules between noise and your CRM
Junk removal, entity resolution, fuzzy dedup, confidence scoring
Web scraping captures everything: navigation text masquerading as names, social handles misidentified as emails, UI elements captured as titles, and placeholder contacts. The validation pipeline applies 14 precision rules plus entity resolution so that only real, reachable decision makers make it into your CRM.
The Dirty Data Problem
Precision Cleanup with Entity Resolution
14 targeted rules remove specific junk patterns. Entity resolution merges duplicates. Confidence scoring ranks what remains.
14-Rule Junk Removal
Each rule targets a specific false positive pattern: nav text, social handles, UI elements, generic emails, hosting contacts, placeholders, bad domains, role addresses, duplicates, empty contacts, non-persons, low-confidence, name=company, and social domains.
Entity Resolution
Fuzzy matching across sources: name similarity, email domain validation, company context matching. Merge duplicate contacts into unified profiles.
Confidence Scoring
0-100 score per contact: source reliability (+30), domain match (+20), structured email (+15), corporate email (+10), verification tier (+25).
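As a rough illustration, the point weights above can be combined into a single score. This is a minimal sketch, not the pipeline's actual scorer: the weights come from the list above, while the `Signals` class, its 0.0 to 1.0 fractions, and the `confidence_score` function are assumptions made for the example.

```python
from dataclasses import dataclass

# Point weights as listed above; they sum to 100.
WEIGHTS = {
    "source_reliability": 30,
    "domain_match": 20,
    "structured_email": 15,
    "corporate_email": 10,
    "verification_tier": 25,
}


@dataclass
class Signals:
    """Hypothetical per-contact signals, each a fraction from 0.0 to 1.0."""
    source_reliability: float = 0.0
    domain_match: float = 0.0
    structured_email: float = 0.0
    corporate_email: float = 0.0
    verification_tier: float = 0.0


def confidence_score(signals: Signals) -> int:
    """Sum each weight scaled by its signal, clamping every signal to 0.0-1.0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(getattr(signals, name), 0.0), 1.0)
        total += weight * value
    return round(total)
```

A contact from a fully reliable source with a matching domain, and nothing else, would score 50 under these assumptions.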
Name Validation
Detect non-person entries: department names, job postings, company names used as person names, and placeholder text.
Bad Domain Filter
Mandatory filter list: dnb.com, alibaba.com, made-in-china.com, wikipedia.org, and 14 more aggregator/directory domains.
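A bad-domain check might look like the sketch below. Only the four domains named above are included, since the other entries are not listed here. The function names are illustrative, and treating subdomains such as en.wikipedia.org as matches is an assumption.

```python
# Only the four domains named above; the remaining entries are not listed here.
BAD_DOMAINS = {"dnb.com", "alibaba.com", "made-in-china.com", "wikipedia.org"}


def is_bad_domain(domain: str) -> bool:
    """True for a listed domain or any of its subdomains (assumed behavior)."""
    d = domain.strip().lower().rstrip(".")
    return any(d == bad or d.endswith("." + bad) for bad in BAD_DOMAINS)


def filter_bad_domains(contacts):
    """Keep contact dicts whose email domain is not on the bad-domain list."""
    kept = []
    for contact in contacts:
        email = contact.get("email") or ""
        domain = email.rpartition("@")[2]
        if not is_bad_domain(domain):
            kept.append(contact)
    return kept
```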
Pre-Verification Cleanup
Before verification, remove noreply@*, no-reply@*, and hosting@gabia.com, then cap each domain at 5 emails to prevent circuit breaker loops.
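The pre-verification step can be sketched as follows. The blocked addresses and the cap of 5 come from the description above. The function name is hypothetical, and keeping the first five addresses per domain in input order is an assumption.

```python
from collections import defaultdict

BLOCKED_LOCAL_PARTS = {"noreply", "no-reply"}
BLOCKED_ADDRESSES = {"hosting@gabia.com"}
MAX_PER_DOMAIN = 5


def pre_verification_cleanup(emails):
    """Drop blocked addresses, then keep at most MAX_PER_DOMAIN per email domain."""
    kept = []
    per_domain = defaultdict(int)
    for raw in emails:
        email = raw.strip().lower()
        local, sep, domain = email.partition("@")
        if not sep or not local or not domain:
            continue  # skip strings that are not addresses at all
        if email in BLOCKED_ADDRESSES or local in BLOCKED_LOCAL_PARTS:
            continue
        if per_domain[domain] >= MAX_PER_DOMAIN:
            continue  # assumed policy: first five per domain win
        per_domain[domain] += 1
        kept.append(email)
    return kept
```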
Validation Pipeline
Stages run in sequence, and each one supports full checkpoint/resume.
Format Validation
Check email syntax (RFC 5322), phone format, URL validity. Reject malformed entries.
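A minimal sketch of these format checks, using only the standard library. The email pattern is a pragmatic subset of RFC 5322 rather than the full grammar, and the phone rule (7 to 15 digits with an optional leading +) is an assumption, not the pipeline's actual rule.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: a pragmatic subset of RFC 5322, not the full grammar.
EMAIL_RE = re.compile(
    r"^[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$"
)
# Hypothetical phone rule: optional +, then 7 to 15 digits once separators are stripped.
PHONE_RE = re.compile(r"^\+?\d{7,15}$")


def is_valid_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_phone(value: str) -> bool:
    digits = re.sub(r"[\s().-]", "", value)
    return bool(PHONE_RE.match(digits))


def is_valid_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```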
Navigation Text Filter
Remove names extracted from navigation menus, headers, and footers.
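One simple way to approximate this filter is an exact match against known navigation labels. The labels below are a hypothetical sample chosen for illustration. The actual rule may also rely on where in the page the text was found.

```python
# Hypothetical sample of navigation labels; a real list would be larger.
NAV_TERMS = {"home", "about us", "contact us", "privacy policy", "careers", "menu"}


def is_navigation_text(name: str) -> bool:
    """True when the extracted 'name' is just a navigation label."""
    normalized = " ".join(name.lower().split())
    return normalized in NAV_TERMS


def drop_navigation_names(names):
    return [n for n in names if not is_navigation_text(n)]
```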
Social Handle Filter
Detect and remove social media handles (@company) misidentified as emails.
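A handle such as @company has no local part before the @, which makes it easy to tell apart from an address. The sketch below assumes handles of up to 30 letters, digits, underscores, or dots. Both the pattern and the function names are illustrative.

```python
import re

# A bare handle starts with "@" and has nothing before it (assumed shape).
HANDLE_RE = re.compile(r"^@[A-Za-z0-9_.]{1,30}$")


def is_social_handle(value: str) -> bool:
    return bool(HANDLE_RE.match(value.strip()))


def drop_social_handles(candidates):
    """Remove bare handles from a list of candidate email strings."""
    return [c for c in candidates if not is_social_handle(c)]
```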
UI Element Filter
Remove buttons, links, and UI text captured as contact titles.
Generic Email Filter
Flag info@, admin@, support@, sales@, noreply@, no-reply@ addresses.
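Because this rule flags rather than removes, a sketch can simply annotate each contact. The prefixes are the ones listed above. The `is_generic` field name and the contact-as-dict shape are assumptions.

```python
GENERIC_LOCAL_PARTS = {"info", "admin", "support", "sales", "noreply", "no-reply"}


def is_generic_email(email: str) -> bool:
    local, _, _ = email.strip().lower().partition("@")
    return local in GENERIC_LOCAL_PARTS


def flag_generic_emails(contacts):
    """Return copies of the contact dicts with an added "is_generic" flag."""
    return [
        {**c, "is_generic": is_generic_email(c.get("email") or "")}
        for c in contacts
    ]
```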
Bad Domain Filter
Remove contacts from aggregator/directory domains (18+ filtered).
Placeholder Detection
Catch John Doe, Test User, Example Name, and other placeholder patterns.
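Placeholder detection could combine an exact-name list with a few prefix patterns. The three names come from the description above. The extra prefixes (test, example, sample, dummy) are hypothetical stand-ins for the unlisted patterns.

```python
import re

# Names given above.
PLACEHOLDER_NAMES = {"john doe", "test user", "example name"}
# Hypothetical extra pattern standing in for the unlisted placeholder patterns.
EXTRA_PATTERN = re.compile(r"^(test|example|sample|dummy)\b")


def is_placeholder(name: str) -> bool:
    """Match on the lowercased name with runs of whitespace collapsed."""
    normalized = " ".join(name.lower().split())
    if normalized in PLACEHOLDER_NAMES:
        return True
    return bool(EXTRA_PATTERN.match(normalized))
```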
Entity Resolution
Fuzzy dedup across extraction sources. Merge duplicates by name + domain matching.
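Name + domain matching can be approximated with the standard library's `difflib.SequenceMatcher`. This is a sketch under stated assumptions, not the pipeline's matcher: the 0.85 threshold, the greedy merge order, and the rule that the first record's values win are all illustrative choices.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # hypothetical similarity cutoff


def email_domain(contact):
    return (contact.get("email") or "").lower().rpartition("@")[2]


def same_person(a, b):
    """Require a shared, non-empty email domain and similar names."""
    domain = email_domain(a)
    if not domain or domain != email_domain(b):
        return False
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= NAME_THRESHOLD


def merge(a, b):
    """Keep a's values and fill any empty field from b."""
    merged = dict(a)
    for key, value in b.items():
        if not merged.get(key):
            merged[key] = value
    return merged


def resolve_entities(contacts):
    """Greedily fold each contact into the first unified profile it matches."""
    unified = []
    for contact in contacts:
        for i, existing in enumerate(unified):
            if same_person(existing, contact):
                unified[i] = merge(existing, contact)
                break
        else:
            unified.append(contact)
    return unified
```

With these settings, "Jonathan Smith" and "Jonathon Smith" at the same domain merge into one profile, while the same name at a different domain stays separate.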
Confidence Scoring
Score each surviving contact 0-100 based on source reliability and data quality.
Tier Classification
Classify into HIGH/MEDIUM/LOW/SKIP based on composite confidence score.
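Tier classification reduces to a threshold lookup on the score. The cutoffs below (80, 50, 30) are hypothetical, since the actual thresholds are not stated here. Only the four tier names come from the description above.

```python
# Hypothetical cutoffs; the actual thresholds are not stated in the text.
TIER_CUTOFFS = [(80, "HIGH"), (50, "MEDIUM"), (30, "LOW")]


def classify_tier(score: int) -> str:
    """Return the first tier whose cutoff the score meets, else SKIP."""
    for cutoff, tier in TIER_CUTOFFS:
        if score >= cutoff:
            return tier
    return "SKIP"
```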
Technical Workflow
# MANDATORY after every enrichment run (shell)
python tools/cleanup_contacts.py
# The cleanup script is also importable as a module (Python)
from tools.cleanup_contacts import cleanup_contacts
# Pre-verification checklist:
# 1. Run cleanup_contacts.py FIRST
# 2. Cap at 5 emails per email domain
# 3. Remove hosting@gabia.com, noreply@*, no-reply@*
# 4. Then run /verify
# Output: database/clean_contacts_{date}.csv + .xlsx
# Color-coded Excel with priority tiers
API Access
/api/v1/contacts/validate
Validate and clean a list of contacts. Returns cleaned list with removed entries and reasons.
/api/v1/contacts/dedup
Entity resolution and dedup on a contact list. Returns merged unified contacts.
/api/v1/contacts/score
Score contacts 0-100 without cleaning. Returns confidence scores and tier classification.
/api/v1/contacts/bad-domains
List of filtered bad domains (aggregator/directory sites).
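For orientation, a request to the validate endpoint might be built as shown below. Only the path comes from the list above. The host, the POST method, and the `{"contacts": [...]}` body shape are assumptions, and authentication is left out because it is not described here.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_validate_request(contacts):
    """Build (but do not send) a JSON request for the validate endpoint."""
    body = json.dumps({"contacts": contacts}).encode("utf-8")  # assumed payload shape
    return Request(
        BASE_URL + "/api/v1/contacts/validate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )
```

The returned object can be passed to `urllib.request.urlopen` once the real host and credentials are known.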
Use Cases
Post-Scrape Cleanup
Remove junk contacts after web scraping before adding to CRM or running email campaigns.
CRM Hygiene
Periodically validate existing CRM contacts to remove stale, duplicate, and low-quality entries.
Import Quality Gate
Validate and clean purchased lead lists before importing into your database.
Pre-Campaign Cleaning
Clean and score contacts before outbound campaigns to maximize deliverability and response rates.
Vendor Data Audit
Evaluate data vendor quality by running their output through the validation pipeline.
Merge Preparation
Clean and dedup data from multiple sources before running the merge engine.
Industry Applications
Technology
JS-heavy sites produce more extraction noise, which calls for more thorough cleanup.
Marketing Agencies
Client data quality assurance before campaign execution.
Manufacturing
Trade show lead lists with exhibitor portal noise.
Financial Services
Regulatory requirements for accurate contact data.
Platform Preview
See how LeadsLogix processes, verifies, and delivers your leads in real time.
Cleanup Report
Summary of removed contacts by rule: how many were caught by each of the 14 rules.
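A per-rule summary like this can be tallied with `collections.Counter`. The sketch assumes each removed entry is a dict carrying a `rule` key. Both that shape and the rule names used with it are hypothetical.

```python
from collections import Counter


def cleanup_report(removed):
    """Count removed contacts per rule, most frequent first."""
    counts = Counter(entry["rule"] for entry in removed)
    return counts.most_common()
```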
Before/After Comparison
Side-by-side view of raw extracted data vs. cleaned validated output.
Confidence Distribution
Score distribution of validated contacts across your dataset.