E-commerce data collection turns product pages and marketplaces into structured datasets. Retailers use it to monitor competitors. Brands use it to watch reseller activity and reviews. Market researchers use it to understand category trends. Product teams use it to identify gaps in assortment, content, and customer sentiment. The sources are familiar: Amazon, Walmart, eBay, Shopify stores, brand sites, marketplace seller pages, review pages, and category pages. The engineering challenge is that each source represents the same commercial facts with different page layouts and different anti-bot posture.

What to collect

E-commerce scraping usually starts with a product catalog.

| Data type | Example fields |
| --- | --- |
| Product identity | Title, brand, ASIN/SKU/GTIN/UPC, model, product URL |
| Pricing | Current price, list price, discount, coupon, subscription price |
| Availability | In stock, out of stock, delivery estimate, seller availability |
| Seller data | Seller name, marketplace seller ID, fulfilled-by signal |
| Product content | Images, description, feature bullets, specifications |
| Reviews | Rating, review count, review text, review date, helpful votes |
| Ranking | Best-seller rank, search position, category rank |
| Variants | Size, color, pack count, style, region |

Templates from Octoparse, Apify, and Bright Data commonly separate listing, detail, and review extraction. That mirrors how e-commerce sites are structured. Listing pages provide breadth; detail pages provide full product facts; review pages provide sentiment and quality signals.
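
One way to picture the listing, detail, and review split is as three record types that share a product identifier. The sketch below is illustrative only: the class names, field names, and the `reviews_for` helper are assumptions made for this example, not the schema of any particular template.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ListingRecord:
    """One product card from a category or search page (breadth)."""
    product_id: str  # e.g. ASIN or SKU, whichever the source exposes
    product_url: str
    title: str
    price: Optional[float] = None
    search_position: Optional[int] = None


@dataclass
class DetailRecord:
    """Full product facts from a detail page."""
    product_id: str
    brand: Optional[str] = None
    description: str = ""
    feature_bullets: list[str] = field(default_factory=list)
    availability: Optional[str] = None


@dataclass
class ReviewRecord:
    """One review from a review page (sentiment and quality signals)."""
    product_id: str
    rating: float
    text: str
    review_date: str  # ISO 8601 date string
    helpful_votes: int = 0


def reviews_for(product_id: str, reviews: list[ReviewRecord]) -> list[ReviewRecord]:
    """Join reviews back to a product through the shared identifier."""
    return [r for r in reviews if r.product_id == product_id]
```

Keeping the three record types separate lets each one refresh on its own schedule while the shared `product_id` ties them back together.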

Common workflows

Catalog monitoring

Scrape category pages or search results to discover products, sellers, and rankings. Store product URLs and IDs as refresh targets.

Product detail enrichment

Visit detail pages for discovered products. Collect descriptions, specs, images, variants, seller information, and availability.

Review analysis

Collect reviews separately from product facts. Review pages often paginate independently and may require sorting by newest to support monitoring.
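
Sorting by newest makes incremental collection straightforward: walk the pages in order and stop at the first review that was already stored. The following is a minimal sketch under stated assumptions. `fetch_page` is a hypothetical callable standing in for whatever actually loads a review page, and each review is assumed to be a dict with a stable `"id"` key.

```python
from typing import Callable

Review = dict  # assumed shape: {"id": str, ...}


def collect_new_reviews(
    fetch_page: Callable[[int], list[Review]],
    seen_ids: set[str],
    max_pages: int = 50,
) -> list[Review]:
    """Collect reviews newest-first until a known review or an empty page."""
    new_reviews: list[Review] = []
    for page in range(1, max_pages + 1):
        reviews = fetch_page(page)
        if not reviews:
            break  # ran past the last page
        for review in reviews:
            if review["id"] in seen_ids:
                # Everything after this point was collected on an earlier run.
                return new_reviews
            new_reviews.append(review)
    return new_reviews
```

The early return only holds if the source really does sort by newest; with any other sort order, every page has to be visited.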

Price and stock tracking

Refresh selected products on a schedule. Store timestamped snapshots so the team can detect price changes, promotions, stockouts, and seller changes.
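
Change detection then reduces to comparing two consecutive snapshots of the same product. The sketch below assumes a hypothetical `Snapshot` record with a handful of fields; a real pipeline would likely track more (coupons, list price, delivery estimate). It covers price, stock, and seller changes only and does not try to classify promotions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One timestamped observation of a product (hypothetical schema)."""
    product_id: str
    captured_at: datetime
    price: Optional[float]
    in_stock: bool
    seller: Optional[str]


def detect_changes(previous: Snapshot, current: Snapshot) -> list[str]:
    """Describe what changed between two snapshots of the same product."""
    if previous.product_id != current.product_id:
        raise ValueError("snapshots belong to different products")
    changes = []
    if previous.price != current.price:
        changes.append(f"price: {previous.price} -> {current.price}")
    if previous.in_stock and not current.in_stock:
        changes.append("stockout")
    elif not previous.in_stock and current.in_stock:
        changes.append("restock")
    if previous.seller != current.seller:
        changes.append(f"seller: {previous.seller} -> {current.seller}")
    return changes


before = Snapshot("B000TEST", datetime(2024, 5, 1, tzinfo=timezone.utc), 19.99, True, "Acme")
after = Snapshot("B000TEST", datetime(2024, 5, 2, tzinfo=timezone.utc), 17.49, False, "Acme")
changes = detect_changes(before, after)
```

Storing the snapshots themselves, rather than only the detected changes, keeps the history available if the comparison rules change later.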

Platform differences

| Platform type | Notes |
| --- | --- |
| Large marketplaces | Rich data, heavy anti-bot defenses, many variants and sellers |
| Brand stores | Cleaner product structure, often Shopify or similar commerce platforms |
| Long-tail retailers | Less standardization, but lighter defenses |
| Review-heavy marketplaces | Strong sentiment value, separate review pagination |
| B2B catalogs | Often require login, quote requests, or region-specific pricing |

Amazon is the classic example. Search and category pages expose product cards with title, price, rating, review count, image, and ASIN-like identifiers. Product pages add descriptions, feature bullets, specifications, seller details, variants, best-seller rank, and stock or delivery hints. Review pages add text, rating, reviewer signals, helpful count, and verification status. Treat each page type as a different dataset.

Data normalization

E-commerce data needs cleanup before analysis.
  • Normalize currency and region.
  • Convert pack counts into unit price.
  • Separate product price from shipping.
  • Standardize availability states.
  • Map variants to parent products.
  • Deduplicate identical products across URLs.
  • Preserve source timestamps.
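
Two of these steps, unit pricing and availability states, can be sketched in a few lines. The mapping table and function names below are assumptions for illustration; the raw availability strings a real source emits will differ and need their own mapping.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical mapping from raw page text to a small set of standard states.
AVAILABILITY_MAP = {
    "in stock": "in_stock",
    "available": "in_stock",
    "out of stock": "out_of_stock",
    "currently unavailable": "out_of_stock",
    "sold out": "out_of_stock",
}


def normalize_availability(raw: str) -> str:
    """Map raw availability text to a standard state, or 'unknown'."""
    return AVAILABILITY_MAP.get(raw.strip().lower(), "unknown")


def unit_price(item_price: Decimal, pack_count: int) -> Decimal:
    """Price per unit, rounded to cents. item_price must exclude shipping."""
    if pack_count < 1:
        raise ValueError("pack_count must be at least 1")
    return (item_price / pack_count).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

`Decimal` is used instead of `float` so that repeated price arithmetic does not accumulate binary rounding error.
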
For cross-site comparison, product matching is the core problem. Use exact identifiers where possible, then fall back to title, brand, model, pack count, size, and image similarity.
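
The identifier-first, attribute-fallback order can be sketched as below. This is a simplified illustration, not a production matcher: the `Product` fields and the 0.85 title threshold are assumptions, title similarity uses the standard library's `difflib.SequenceMatcher`, and size and image similarity are left out entirely.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Product:
    title: str
    brand: str
    gtin: Optional[str] = None
    model: Optional[str] = None
    pack_count: int = 1


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not matter."""
    return " ".join(text.lower().split())


def is_same_product(a: Product, b: Product, title_threshold: float = 0.85) -> bool:
    # 1. Exact identifiers decide the match whenever both sides have one.
    if a.gtin and b.gtin:
        return a.gtin == b.gtin
    # 2. Fallback: brand and pack count must agree.
    if _norm(a.brand) != _norm(b.brand) or a.pack_count != b.pack_count:
        return False
    # 3. Model numbers must agree when both are present.
    if a.model and b.model and _norm(a.model) != _norm(b.model):
        return False
    # 4. Finally, require the normalized titles to be close enough.
    ratio = SequenceMatcher(None, _norm(a.title), _norm(b.title)).ratio()
    return ratio >= title_threshold
```

The threshold is a tuning knob: a lower value merges more listings at the cost of more false matches, so it is worth checking against a hand-labeled sample before trusting it.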

Anti-bot and scale

E-commerce sites are among the most protected scrape targets because pricing, reviews, and inventory are commercially sensitive. Expect JavaScript rendering, rate limits, CAPTCHAs, IP reputation checks, browser fingerprinting, and layout experiments that serve different page structures to different visitors. At small scale, careful pacing and a real browser may be enough. At larger scale, use cloud execution, subtask splitting, proxy rotation, coherent fingerprints, and retry logic. For sites like Amazon, prebuilt scrapers can save time because they already encode page-type handling and field mapping.
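
Retry logic and pacing are commonly combined as exponential backoff with jitter. The sketch below is one possible shape, under stated assumptions: `fetch` is a hypothetical zero-argument callable wrapping the actual request, the delay parameters are illustrative defaults, and the broad `except Exception` should be narrowed to transient errors (timeouts, HTTP 429 or 5xx responses) in real use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Upper bound for each retry delay: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 4,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying failures with exponentially growing, jittered waits."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random time up to the current bound so that
            # parallel workers do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("retries must be non-negative")
```

The `sleep` parameter exists so the function can be exercised in tests without actually waiting.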

Compliance boundaries

Scrape responsibly. Respect robots.txt and site terms, avoid personal or sensitive data unless you have a legitimate basis, and prefer official APIs or partner feeds when they are available and suitable. For marketplaces, seller and reviewer data can raise additional policy and privacy concerns.

E-commerce scraping is most valuable when it feeds a defined decision: repricing, assortment planning, review monitoring, reseller compliance, or market research. Start from that decision, then design the fields and cadence around it.