Lead generation is one of the most practical uses of web scraping. Instead of buying a static contact list, teams collect fresh public business signals from directories, maps, marketplaces, search results, company websites, job boards, and review platforms. The output is not just a list of names; it is a structured dataset that sales, marketing, and operations teams can filter, enrich, score, and route. The hard part is not extracting one page. It is deciding which sources are legitimate, which fields matter, how to validate the data, and how to keep the workflow compliant.

Common sources

Good lead scraping starts with the source type.
Source | What it is good for | Typical fields
Local directories and maps | Local businesses by category and geography | Business name, address, phone, website, category, rating, review count, hours
Industry directories | Niche B2B targeting | Company name, specialty, location, certifications, contact page URL
Marketplaces | Sellers, agencies, vendors, or service providers | Seller name, profile URL, offer category, reviews, response rate
Company websites | Direct contact enrichment | Email, phone, social links, office locations, leadership pages
Job boards | Buying intent and growth signals | Hiring role, department, location, tech stack clues, company size
Social and professional networks | Public profile and company context | Name, title, company, location, public posts, company page URL
For example, a local agency might scrape Google Maps for “dentists in Austin”, enrich each website for email and social links, then score leads by rating, review count, website quality, and whether the business is running ads. A B2B SaaS team might start from job postings and look for companies hiring roles that imply a need for their product.

Field design

Do not collect every visible field by default. Start with the decision the lead list must support.
Core company fields:
  • Company or business name
  • Website
  • Category or industry
  • Address, city, region, and country
  • Phone number
  • Source URL
  • Date collected
Useful qualification fields:
  • Rating and review count
  • Employee count or location count
  • Job openings or hiring department
  • Technology signals from the website
  • Social profile URLs
  • Recent activity or last review date
  • Opening hours or operating status
Useful outreach fields:
  • Public email address
  • Contact page URL
  • LinkedIn company URL
  • Decision-maker public profile URL
  • Role/title when available from public data
Keep provenance. Every row should include where it came from and when it was collected. That makes deduplication, opt-out handling, and data refresh much easier.
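The field groups above can be sketched as a single record type. This is a minimal illustration, not a required schema: the class name `LeadRecord`, the field names, and the example URLs are all assumptions chosen for this example. The point it shows is that `source_url` is mandatory and `collected_at` is filled in automatically, so no row can exist without provenance.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LeadRecord:
    """Hypothetical lead row; field names are illustrative assumptions."""

    # Core company fields
    name: str
    source_url: str  # provenance: the page this row was collected from
    website: Optional[str] = None
    category: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    phone: Optional[str] = None
    # Qualification fields
    rating: Optional[float] = None
    review_count: Optional[int] = None
    # Outreach fields
    public_email: Optional[str] = None
    contact_page_url: Optional[str] = None
    # Provenance: UTC timestamp set when the record is created
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example row; the URLs use reserved example domains, not real listings.
lead = LeadRecord(
    name="Example Dental",
    source_url="https://directory.example/listing/123",
    website="https://exampledental.example",
    city="Austin",
)
row = asdict(lead)  # plain dict, ready for CSV or CRM export
```

Fields that a source does not provide stay `None` rather than being guessed, which keeps later completeness scoring honest.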

Enrichment workflow

Lead scraping often works best as a chain:
  1. Discover companies. Use maps, directories, search results, or marketplaces to build the initial list.
  2. Normalize company records. Clean names, addresses, phone formats, categories, and URLs.
  3. Enrich from websites. Visit the company website to collect public emails, contact pages, social links, and location data.
  4. Add intent signals. Jobs, reviews, recent posts, new locations, or product listings can indicate timing.
  5. Score and segment. Rank leads by fit, completeness, recency, geography, or buying signal.
  6. Export to the CRM. Push only qualified records, not every scraped row.
Google Maps templates from tools like Apify and Octoparse commonly start with search terms, locations, URLs, or place IDs and return structured business fields such as name, address, phone, website, rating, review count, category, coordinates, and hours. Some templates also enrich contacts from the business website. That pattern is a good model: separate discovery from enrichment instead of expecting one source to contain every field.
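Steps 2, 5, and 6 of the chain can be sketched in a few lines of Python. The helper names (`normalize_domain`, `normalize_phone`, `score_lead`), the dictionary keys, the score weights, and the `MIN_SCORE` threshold are all assumptions made for this example. Real scoring rules depend on the product being sold and should be tuned against actual conversion data.

```python
import re
from urllib.parse import urlparse


def normalize_domain(url):
    """Reduce a URL to a lowercase host without a leading 'www.'."""
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host or None


def normalize_phone(phone):
    """Keep digits only; a fuller pipeline would also handle country codes."""
    if not phone:
        return None
    digits = re.sub(r"\D", "", phone)
    return digits or None


def score_lead(lead):
    """Toy fit score; the weights and thresholds are illustrative assumptions."""
    score = 0
    if lead.get("website"):
        score += 2
    if lead.get("public_email"):
        score += 2
    if (lead.get("rating") or 0) >= 4.0:
        score += 1
    if (lead.get("review_count") or 0) >= 50:
        score += 1
    if lead.get("open_jobs"):  # intent signal from job boards
        score += 2
    return score


# Made-up rows standing in for discovered and enriched records.
leads = [
    {"name": "Example Dental", "website": "https://www.exampledental.example",
     "public_email": "info@exampledental.example", "rating": 4.6,
     "review_count": 120},
    {"name": "Sparse Listing", "rating": 3.2},
]

MIN_SCORE = 4  # assumed cutoff for CRM export
qualified = [lead for lead in leads if score_lead(lead) >= MIN_SCORE]
```

Only `qualified` would be pushed to the CRM. The low-scoring rows stay in the working dataset, where they can be enriched further or dropped at the next refresh.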

Data quality checks

Lead data gets messy quickly. Build these checks into the pipeline:
  • Deduplicate by domain, phone, and address rather than by name. Business names vary too much across sources to serve as a key.
  • Separate headquarters from branches. A chain can have many locations but one corporate site.
  • Validate emails. Do not assume every scraped email is deliverable or appropriate for outreach.
  • Track stale records. Closed businesses, expired job posts, and outdated review counts all degrade lead quality.
  • Keep source confidence. A direct website contact page is stronger than a copied directory field.
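The deduplication check can be sketched as follows. This is a simplified illustration: the function name `dedupe` and the row keys (`domain`, `phone`, `address`) are assumptions, and the rule used here (drop a row if it shares any one key with an earlier row) is deliberately blunt.

```python
import re


def _digits(value):
    """Strip everything except digits so phone formats compare equal."""
    return re.sub(r"\D", "", value or "")


def _text_key(value):
    """Lowercase and collapse punctuation so address variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", (value or "").lower()).strip()


def dedupe(rows):
    """Keep the first row seen for each domain, phone, or address key."""
    seen = set()
    unique = []
    for row in rows:
        keys = set()
        if row.get("domain"):
            keys.add(("domain", row["domain"].lower()))
        if _digits(row.get("phone")):
            keys.add(("phone", _digits(row.get("phone"))))
        if _text_key(row.get("address")):
            keys.add(("address", _text_key(row.get("address"))))
        if keys & seen:
            continue  # shares at least one key with an earlier row
        seen |= keys
        unique.append(row)
    return unique


# Made-up rows: the first two describe the same business under different names.
rows = [
    {"name": "Joe's Pizza", "domain": "joespizza.example",
     "phone": "(512) 555-0100", "address": "12 Main St., Austin"},
    {"name": "Joes Pizza LLC", "domain": None,
     "phone": "512-555-0100", "address": None},
    {"name": "Other Cafe", "domain": "othercafe.example",
     "phone": "512-555-0199", "address": "40 Oak Ave, Austin"},
]
unique_rows = dedupe(rows)
```

Because any shared key counts as a match, this version would also merge branches of a chain that share one corporate domain. A chain-aware variant would need a stricter rule, for example requiring domain and address to match together.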

Compliance and ethics

Lead scraping touches personal and business contact data, so the operating rules matter.
  • Collect public data only when you have a legitimate use case.
  • Avoid sensitive personal data unless you have a clear legal basis.
  • Respect robots.txt, site terms, rate limits, and opt-out requests.
  • Do not scrape logged-in networks in ways that violate account policies.
  • Keep unsubscribe, suppression, and deletion workflows connected to your CRM.
For B2B outreach, compliance often depends on jurisdiction, message type, lawful basis, and how the data is used after collection. Treat scraping as one part of a governed lead process, not a shortcut around consent or privacy requirements.
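Two of these rules are easy to enforce in code: checking robots.txt before fetching, and filtering suppressed contacts before export. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt string so it runs without network access. The user agent `lead-bot`, the helper names, and the sample data are assumptions for this example. Passing these checks does not make a workflow compliant on its own; site terms and privacy law still apply.

```python
from urllib.robotparser import RobotFileParser

# Inline sample; a real crawler would load each site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="lead-bot"):
    """Return True if the parsed robots.txt rules allow this URL."""
    return parser.can_fetch(user_agent, url)


def drop_suppressed(rows, suppressed_emails):
    """Remove rows whose email is on the opt-out (suppression) list."""
    blocked = {email.strip().lower() for email in suppressed_emails}
    return [
        row for row in rows
        if (row.get("public_email") or "").strip().lower() not in blocked
    ]


# Made-up rows and suppression list for illustration.
rows = [
    {"name": "Example Dental", "public_email": "info@exampledental.example"},
    {"name": "Opted Out Co", "public_email": "Owner@optedout.example"},
]
exportable = drop_suppressed(rows, ["owner@optedout.example"])
```

Running the suppression filter at export time, rather than only at collection time, means a contact who opts out after being scraped is still excluded from the next CRM sync.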

When templates help

Prebuilt templates are useful when the target source is common: Google Maps, Yelp, Yellow Pages, Amazon sellers, LinkedIn jobs, or other frequently used directories. Platforms like Apify, Bright Data, and Octoparse package much of the repetitive work: pagination, field mapping, browser execution, proxy handling, retries, and exports. Custom workflows make sense when the source is niche, the field mapping is unusual, or the lead logic depends on several sources. In either case, the important design is the same: discover, enrich, validate, score, and export only the records you can responsibly use.