E-commerce data collection

E-commerce data collection turns product pages and marketplaces into structured datasets. Retailers use it to monitor competitors. Brands use it to watch reseller activity and reviews. Market researchers use it to understand category trends. Product teams use it to identify gaps in assortment, content, and customer sentiment. The sources are familiar: Amazon, Walmart, eBay, Shopify stores, brand sites, marketplace seller pages, review pages, and category pages. The engineering challenge is that each source presents the same commercial facts through a different page layout and behind a different anti-bot posture.

What to collect

E-commerce scraping usually starts with a product catalog. The main data types and their typical fields are:

Data type	Example fields
Product identity	Title, brand, ASIN/SKU/GTIN/UPC, model, product URL
Pricing	Current price, list price, discount, coupon, subscription price
Availability	In stock, out of stock, delivery estimate, seller availability
Seller data	Seller name, marketplace seller ID, fulfilled-by signal
Product content	Images, description, feature bullets, specifications
Reviews	Rating, review count, review text, review date, helpful votes
Ranking	Best-seller rank, search position, category rank
Variants	Size, color, pack count, style, region

Templates from Octoparse, Apify, and Bright Data commonly separate listing, detail, and review extraction. That mirrors how e-commerce sites are structured. Listing pages provide breadth; detail pages provide full product facts; review pages provide sentiment and quality signals.
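As a minimal sketch of the listing / detail / review split, the three page types can be modeled as separate record types that share a source and product ID. The class and field names below are illustrative assumptions, not a schema used by any of the tools named above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from typing import Dict, Optional


def _utc_now() -> datetime:
    # Timezone-aware collection timestamp.
    return datetime.now(timezone.utc)


@dataclass
class ListingRecord:
    """One product card seen on a category or search page (breadth)."""
    source: str                      # hypothetical site label, e.g. "shop-a"
    product_id: str                  # ASIN, SKU, or other site-level ID
    url: str
    title: str
    search_position: Optional[int] = None
    collected_at: datetime = field(default_factory=_utc_now)


@dataclass
class DetailRecord:
    """Full product facts from a detail page."""
    source: str
    product_id: str
    brand: Optional[str] = None
    gtin: Optional[str] = None
    price: Optional[Decimal] = None
    currency: Optional[str] = None
    availability: Optional[str] = None
    variants: Dict[str, str] = field(default_factory=dict)
    collected_at: datetime = field(default_factory=_utc_now)


@dataclass
class ReviewRecord:
    """One review from a review page (sentiment and quality signals)."""
    source: str
    product_id: str
    review_id: str
    rating: float
    text: str
    review_date: Optional[str] = None
    helpful_votes: int = 0
    collected_at: datetime = field(default_factory=_utc_now)
```

Keeping the three types separate lets each one be refreshed on its own schedule while still joining on `(source, product_id)`.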

Common workflows

Catalog monitoring

Scrape category pages or search results to discover products, sellers, and rankings. Store product URLs and IDs as refresh targets.
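One way to keep refresh targets is a map keyed by source and product ID, so that a product rediscovered under a different URL does not create a second target. This is a sketch under assumptions: the card dictionaries and their keys (`source`, `product_id`, `url`, `position`) are hypothetical output of a listing scraper.

```python
from typing import Dict, List, Tuple

TargetKey = Tuple[str, str]  # (source, product_id)


def merge_targets(targets: Dict[TargetKey, dict], cards: List[dict]) -> List[TargetKey]:
    """Add newly discovered product cards to the refresh-target map.

    Returns the keys that were not known before this call.
    """
    new_keys = []
    for card in cards:
        key = (card["source"], card["product_id"])
        if key in targets:
            continue  # already tracked; keep the first URL we saw
        targets[key] = {
            "url": card["url"],
            "first_seen_position": card.get("position"),
        }
        new_keys.append(key)
    return new_keys


targets: Dict[TargetKey, dict] = {}
discovered = merge_targets(targets, [
    {"source": "shop-a", "product_id": "P1", "url": "https://shop-a.example/p/1", "position": 1},
    {"source": "shop-a", "product_id": "P2", "url": "https://shop-a.example/p/2", "position": 2},
    # Same product reached through a tracking URL: not added again.
    {"source": "shop-a", "product_id": "P1", "url": "https://shop-a.example/p/1?ref=cat"},
])
```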

Product detail enrichment

Visit detail pages for discovered products. Collect descriptions, specs, images, variants, seller information, and availability.

Review analysis

Collect reviews separately from product facts. Review pages often paginate independently and may require sorting by newest to support monitoring.
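Sorting by newest makes incremental collection possible: walk the pages until the first already-stored review appears, then stop. The sketch below assumes a hypothetical `fetch_page(page_number)` callable that returns reviews newest-first as dictionaries with an `id` key. It also assumes the site does not pin older reviews to the top of the list.

```python
from typing import Callable, List, Set


def collect_new_reviews(
    fetch_page: Callable[[int], List[dict]],
    seen_ids: Set[str],
    max_pages: int = 50,
) -> List[dict]:
    """Return reviews newer than anything in seen_ids, newest first."""
    new_reviews: List[dict] = []
    for page in range(1, max_pages + 1):
        reviews = fetch_page(page)
        if not reviews:
            break  # ran out of pages
        for review in reviews:
            if review["id"] in seen_ids:
                # Everything after this point is older and already stored.
                return new_reviews
            new_reviews.append(review)
    return new_reviews


# Stand-in for a real review scraper: three pages, newest first.
FAKE_PAGES = {
    1: [{"id": "r5"}, {"id": "r4"}],
    2: [{"id": "r3"}, {"id": "r2"}],
    3: [{"id": "r1"}],
}
requested: List[int] = []


def fake_fetch(page: int) -> List[dict]:
    requested.append(page)
    return FAKE_PAGES.get(page, [])


fresh = collect_new_reviews(fake_fetch, seen_ids={"r1", "r2", "r3"})
```

The `max_pages` cap is a safety limit for the first run, when nothing has been stored yet.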

Price and stock tracking

Refresh selected products on a schedule. Store timestamped snapshots so the team can detect price changes, promotions, stockouts, and seller changes.
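Change detection then reduces to comparing consecutive snapshots of the same product. The following is a minimal sketch; the `Snapshot` fields and the event strings are illustrative assumptions, and a real pipeline would likely also track promotions and coupons.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    product_id: str
    collected_at: datetime
    price: Optional[Decimal]
    in_stock: bool
    seller: Optional[str]


def diff_snapshots(prev: Snapshot, curr: Snapshot) -> List[str]:
    """Describe what changed between two snapshots of one product."""
    events = []
    if prev.price is not None and curr.price is not None and prev.price != curr.price:
        events.append(f"price {prev.price} -> {curr.price}")
    if prev.in_stock and not curr.in_stock:
        events.append("stockout")
    elif not prev.in_stock and curr.in_stock:
        events.append("restock")
    if prev.seller != curr.seller:
        events.append(f"seller {prev.seller} -> {curr.seller}")
    return events


monday = Snapshot("P1", datetime(2024, 1, 1, tzinfo=timezone.utc), Decimal("19.99"), True, "Seller A")
tuesday = Snapshot("P1", datetime(2024, 1, 2, tzinfo=timezone.utc), Decimal("17.49"), False, "Seller A")
changes = diff_snapshots(monday, tuesday)
```

Prices are held as `Decimal` rather than `float` so that equality checks are not affected by binary rounding.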

Platform differences

Platform type	Notes
Large marketplaces	Rich data, heavy anti-bot defenses, many variants and sellers
Brand stores	Cleaner product structure, often Shopify or similar commerce platforms
Long-tail retailers	Less standardization, but lighter defenses
Review-heavy marketplaces	Strong sentiment value, separate review pagination
B2B catalogs	Often require login, quote requests, or region-specific pricing

Amazon is the classic example. Search and category pages expose product cards with title, price, rating, review count, image, and ASIN-like identifiers. Product pages add descriptions, feature bullets, specifications, seller details, variants, best-seller rank, and stock or delivery hints. Review pages add text, rating, reviewer signals, helpful count, and verification status. Treat each page type as a different dataset.

Data normalization

E-commerce data needs cleanup before analysis.

Normalize currency and region.
Use pack counts to compute a per-unit price.
Separate product price from shipping.
Standardize availability states.
Map variants to parent products.
Deduplicate identical products across URLs.
Preserve source timestamps.
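Two of these steps, unit pricing and availability states, can be sketched as follows. The state names and the phrase map are assumptions for illustration; a real map would be built from the phrases each source actually uses.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical mapping from raw site phrases to a small set of states.
AVAILABILITY_MAP = {
    "in stock": "in_stock",
    "available": "in_stock",
    "out of stock": "out_of_stock",
    "sold out": "out_of_stock",
    "currently unavailable": "out_of_stock",
    "pre-order": "preorder",
}


def normalize_availability(raw: str) -> str:
    """Map a raw availability phrase to a standard state, or 'unknown'."""
    return AVAILABILITY_MAP.get(raw.strip().lower(), "unknown")


def unit_price(price: Decimal, pack_count: int) -> Decimal:
    """Price per single unit, rounded to two decimal places."""
    if pack_count < 1:
        raise ValueError("pack_count must be at least 1")
    return (price / pack_count).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For example, `unit_price(Decimal("23.94"), 6)` gives `Decimal("3.99")`, which makes a six-pack comparable with a single-item listing. Unrecognized phrases map to `"unknown"` rather than being guessed, so they can be reviewed and added to the map.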

For cross-site comparison, product matching is the core problem. Use exact identifiers where possible, then fall back to title, brand, model, pack count, size, and image similarity.
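The identifier-first, attribute-fallback order can be sketched like this. The record keys (`gtin`, `upc`, `brand`, `title`, `pack_count`) and the 0.85 title threshold are assumptions for illustration; image similarity is left out, and the title comparison uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for whatever text-similarity measure a production matcher would use.

```python
from difflib import SequenceMatcher
from typing import Optional


def match_products(a: dict, b: dict, title_threshold: float = 0.85) -> Optional[str]:
    """Return 'exact', 'fuzzy', or None for two product records."""
    # 1. Exact identifiers: if both sides carry the same kind of ID,
    #    that ID decides the outcome either way.
    for key in ("gtin", "upc"):
        if a.get(key) and b.get(key):
            return "exact" if a[key] == b[key] else None

    # 2. Fallback: brand and pack count must agree before titles are compared.
    if (a.get("brand") or "").lower() != (b.get("brand") or "").lower():
        return None
    if a.get("pack_count") != b.get("pack_count"):
        return None

    similarity = SequenceMatcher(
        None, a.get("title", "").lower(), b.get("title", "").lower()
    ).ratio()
    return "fuzzy" if similarity >= title_threshold else None


site_a = {"brand": "Acme", "title": "Acme Widget Pro 500ml", "pack_count": 2}
site_b = {"brand": "ACME", "title": "Acme Widget Pro 500 ml", "pack_count": 2}
result = match_products(site_a, site_b)
```

Returning the match type rather than a plain boolean lets downstream analysis treat fuzzy matches with more caution than identifier matches.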

Anti-bot and scale

E-commerce sites are among the most protected scrape targets because pricing, reviews, and inventory are commercially sensitive. Expect JavaScript rendering, rate limits, CAPTCHAs, IP reputation checks, browser fingerprinting, and layout experiments that change page structure between visits. At small scale, careful pacing and a real browser may be enough. At larger scale, use cloud execution, subtask splitting, proxy rotation, coherent fingerprints, and retry logic. For sites like Amazon, prebuilt scrapers can save time because they already encode page-type handling and field mapping.
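Retry logic is usually the simplest of these pieces to get right. A minimal sketch, assuming a hypothetical `fetch(url)` callable that raises `RetryableError` for transient failures such as rate-limit responses or timeouts: each retry waits twice as long as the previous one, plus random jitter so that parallel workers do not retry in lockstep.

```python
import random
import time
from typing import Callable, List


class RetryableError(Exception):
    """Transient failure such as a rate-limit response or a timeout."""


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)        # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))    # add jitter
    raise ValueError("max_attempts must be at least 1")


# Demo with a fake fetcher that fails twice, then succeeds.
# The sleep function is replaced with a recorder so the demo does not wait.
calls: List[str] = []
waits: List[float] = []


def flaky_fetch(url: str) -> str:
    calls.append(url)
    if len(calls) < 3:
        raise RetryableError("rate limited")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "https://shop.example.com/p/1", sleep=waits.append)
```

Passing `sleep` as a parameter is a design choice for testability; in production the default `time.sleep` applies.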

Compliance boundaries

Scrape responsibly. Respect robots.txt and site terms, avoid personal or sensitive data unless you have a legitimate basis, and prefer official APIs or partner feeds when they are available and suitable. For marketplaces, seller and reviewer data can raise additional policy and privacy concerns.

E-commerce scraping is most valuable when it feeds a defined decision: repricing, assortment planning, review monitoring, reseller compliance, or market research. Start from that decision, then design the fields and cadence around it.
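The robots.txt part of the compliance check can be automated with Python's standard `urllib.robotparser`. The sketch below parses an inline example file so it runs without network access; the rules, the `example-bot` user agent, and the `shop.example.com` URLs are all made up for illustration. Against a real site, `set_url()` followed by `read()` would load the live file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /account
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-bot"
product_allowed = parser.can_fetch(USER_AGENT, "https://shop.example.com/products/widget")
cart_allowed = parser.can_fetch(USER_AGENT, "https://shop.example.com/cart")
delay_seconds = parser.crawl_delay(USER_AGENT)  # None if the file sets no delay
```

A robots.txt check covers only one of the boundaries above; site terms and privacy obligations still need to be reviewed separately.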
