Lead generation is one of the most practical uses of web scraping. Instead of buying a static contact list, teams collect fresh public business signals from directories, maps, marketplaces, search results, company websites, job boards, and review platforms. The output is not just a list of names; it is a structured dataset that sales, marketing, and operations teams can filter, enrich, score, and route. The hard part is not extracting one page. It is deciding which sources are legitimate, which fields matter, how to validate the data, and how to keep the workflow compliant.
Common sources
Good lead scraping starts with the source type.

| Source | What it is good for | Typical fields |
|---|---|---|
| Local directories and maps | Local businesses by category and geography | Business name, address, phone, website, category, rating, review count, hours |
| Industry directories | Niche B2B targeting | Company name, specialty, location, certifications, contact page URL |
| Marketplaces | Sellers, agencies, vendors, or service providers | Seller name, profile URL, offer category, reviews, response rate |
| Company websites | Direct contact enrichment | Email, phone, social links, office locations, leadership pages |
| Job boards | Buying intent and growth signals | Hiring role, department, location, tech stack clues, company size |
| Social and professional networks | Public profile and company context | Name, title, company, location, public posts, company page URL |
Field design
Do not collect every visible field by default. Start with the decision the lead list must support.

Core company fields:

- Company or business name
- Website
- Category or industry
- Address, city, region, and country
- Phone number
- Source URL
- Date collected
- Rating and review count
- Employee count or location count
- Job openings or hiring department
- Technology signals from the website
- Social profile URLs
- Recent activity or last review date
- Opening hours or operating status
- Public email address
- Contact page URL
- LinkedIn company URL
- Decision-maker public profile URL
- Role/title when available from public data
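
One way to keep the field list deliberate is to write it down as a typed record before building any scraper. The sketch below is illustrative only: the class name `LeadRecord`, the attribute names, the grouping in the comments, and the `completeness` helper are assumptions made for this example, not a schema prescribed by any tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class LeadRecord:
    """Hypothetical lead schema covering a subset of the fields listed above."""

    # Provenance is required: every row should say where and when it was collected.
    company_name: str
    source_url: str
    date_collected: date
    # Company fields
    website: Optional[str] = None
    category: Optional[str] = None
    address: Optional[str] = None
    city: Optional[str] = None
    region: Optional[str] = None
    country: Optional[str] = None
    phone: Optional[str] = None
    # Signals that may help with qualification
    rating: Optional[float] = None
    review_count: Optional[int] = None
    job_openings: List[str] = field(default_factory=list)
    # Public contact fields
    public_email: Optional[str] = None
    contact_page_url: Optional[str] = None

    def completeness(self) -> float:
        """Return the fraction of optional scalar fields that are filled in."""
        optional = [
            self.website, self.category, self.address, self.city,
            self.region, self.country, self.phone, self.rating,
            self.review_count, self.public_email, self.contact_page_url,
        ]
        filled = sum(1 for value in optional if value is not None)
        return filled / len(optional)


lead = LeadRecord(
    company_name="Example Dental",
    source_url="https://directory.example/listing/1",
    date_collected=date(2024, 1, 15),
    website="https://example-dental.example",
    phone="+1 555 0100",
)
print(round(lead.completeness(), 2))
```

Making `source_url` and `date_collected` mandatory means a row without provenance cannot be constructed at all, which later makes staleness and source-confidence checks easier.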
Enrichment workflow
Lead scraping often works best as a chain:

- Discover companies. Use maps, directories, search results, or marketplaces to build the initial list.
- Normalize company records. Clean names, addresses, phone formats, categories, and URLs.
- Enrich from websites. Visit the company website to collect public emails, contact pages, social links, and location data.
- Add intent signals. Jobs, reviews, recent posts, new locations, or product listings can indicate timing.
- Score and segment. Rank leads by fit, completeness, recency, geography, or buying signal.
- Export to the CRM. Push only qualified records, not every scraped row.
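
The last two steps of the chain (score, then export only qualified records) can be sketched in a few lines. Everything here is an assumption chosen for illustration: the dictionary keys, the weights, the review threshold of 20, and the `min_score` cutoff would all need tuning against real conversion data.

```python
from typing import Dict, List


def score_lead(lead: Dict) -> int:
    """Score a lead on completeness and intent signals (hypothetical weights)."""
    score = 0
    if lead.get("website"):
        score += 2  # a website makes further enrichment possible
    if lead.get("public_email"):
        score += 2  # a public contact channel exists
    if lead.get("job_openings"):
        score += 3  # hiring is treated as an intent signal
    if (lead.get("review_count") or 0) >= 20:
        score += 1  # enough reviews to suggest an active business
    return score


def qualified(leads: List[Dict], min_score: int = 4) -> List[Dict]:
    """Keep only leads at or above the cutoff, highest score first."""
    keep = [lead for lead in leads if score_lead(lead) >= min_score]
    return sorted(keep, key=score_lead, reverse=True)


leads = [
    {"name": "Acme Dental", "website": "https://acme-dental.example",
     "public_email": "info@acme-dental.example",
     "job_openings": ["Office manager"], "review_count": 35},
    {"name": "Quiet Bakery", "website": "https://quiet-bakery.example"},
]
for lead in qualified(leads):
    print(lead["name"], score_lead(lead))
```

Only the rows returned by `qualified` would be pushed to the CRM; the rest stay in the working dataset for later enrichment or re-scoring.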
Data quality checks
Lead data gets messy quickly. Build these checks into the pipeline:

- Deduplicate by domain, phone, and address. Business names vary.
- Separate headquarters from branches. A chain can have many locations but one corporate site.
- Validate emails. Do not assume every scraped email is deliverable or appropriate for outreach.
- Track stale records. Closed businesses, old job posts, and outdated review counts change lead quality.
- Keep source confidence. A direct website contact page is stronger than a copied directory field.
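
The first two checks can pull against each other: deduplicating on domain alone would collapse every branch of a chain into one row. The sketch below shows one possible compromise, assuming records are plain dictionaries with `website`, `phone`, and `address` keys (names chosen for this example). It treats two rows as duplicates only when they share a normalized domain and a normalized location, using the phone number as the location when no address is present.

```python
import re
from typing import Dict, List, Optional
from urllib.parse import urlparse


def domain_key(url: Optional[str]) -> str:
    """Reduce a URL to a lowercase host without a leading 'www.'."""
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def phone_key(raw: Optional[str]) -> str:
    """Keep digits only, so formatting differences do not matter."""
    return re.sub(r"\D", "", raw or "")


def address_key(raw: Optional[str]) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", (raw or "").lower()).strip()


def dedupe(records: List[Dict]) -> List[Dict]:
    """Keep the first record seen for each (domain, location) pair."""
    seen = set()
    unique = []
    for rec in records:
        location = address_key(rec.get("address")) or phone_key(rec.get("phone"))
        key = (domain_key(rec.get("website")), location)
        if not any(key):
            unique.append(rec)  # nothing to compare on, so keep the row
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


records = [
    {"name": "Acme Dental", "website": "https://www.acme-dental.example/",
     "address": "12 Main St., Springfield"},
    {"name": "ACME Dental LLC", "website": "http://acme-dental.example/contact",
     "address": "12 main st springfield"},
    {"name": "Acme Dental North", "website": "https://acme-dental.example",
     "address": "98 Oak Ave, Springfield"},
]
print([r["name"] for r in dedupe(records)])
```

In this sample the two spellings of the same office merge into one row, while the branch at a different address survives as its own record. A stricter or looser matching rule may suit other datasets better.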
Compliance and ethics
Lead scraping touches personal and business contact data, so the operating rules matter.

- Collect public data only when you have a legitimate use case.
- Avoid sensitive personal data unless you have a clear legal basis.
- Respect robots.txt, site terms, rate limits, and opt-out requests.
- Do not scrape logged-in networks in ways that violate account policies.
- Keep unsubscribe, suppression, and deletion workflows connected to your CRM.
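
Two of these rules lend themselves to small automated guards: checking robots.txt before fetching a URL, and filtering out suppressed contacts before export. The sketch below uses Python's standard `urllib.robotparser`; the user agent string `lead-bot`, the `public_email` key, and the helper names are assumptions for this example. It parses robots.txt text passed in as a string so that it runs offline; a real crawler would fetch the file from the target site, and passing these checks does not by itself establish compliance with site terms or applicable law.

```python
from typing import Dict, Iterable, List
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = "lead-bot") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def apply_suppression(leads: List[Dict], suppressed_emails: Iterable[str]) -> List[Dict]:
    """Drop leads whose public email appears on the suppression list."""
    blocked = {e.strip().lower() for e in suppressed_emails if e.strip()}
    return [
        lead for lead in leads
        if (lead.get("public_email") or "").strip().lower() not in blocked
    ]


robots = build_robots("User-agent: *\nDisallow: /private/\n")
print(allowed(robots, "https://example.com/contact"))
print(allowed(robots, "https://example.com/private/team"))

leads = [
    {"name": "Acme Dental", "public_email": "Info@acme-dental.example"},
    {"name": "Quiet Bakery", "public_email": "hello@quiet-bakery.example"},
]
print(apply_suppression(leads, ["info@acme-dental.example"]))
```

Running the suppression filter at export time, rather than only at collection time, means an opt-out recorded in the CRM after a scrape still takes effect on the next push.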