The program that performs web scraping is usually called a “scraper” or “crawler”. It visits pages, reads their HTML content, and pulls out the specific data you need, which is work that would otherwise mean copying information from web pages by hand.
Common use cases
Common use cases include collecting product prices from e-commerce sites, gathering news articles, pulling job listings, monitoring competitors, or building datasets for research.
How it works
A typical scraping workflow looks like this: send an HTTP request to a URL, receive the HTML response, parse it to find the relevant elements (using CSS selectors or XPath), and then store the extracted data in a structured format like CSV, JSON, or a database.
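The workflow above can be sketched with nothing but Python's standard library. This is a minimal illustration, not a production scraper: the `fetch_html` helper, the `ProductParser` class, the `example-scraper/0.1` user agent, and the `name`/`price` class names in the sample markup are all assumptions made up for this example. The demo parses an inline HTML string so it runs without network access; in real use you would pass the result of `fetch_html(url)` to the parser instead.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Steps 1-2: send an HTTP GET request and return the HTML response body."""
    # The User-Agent string is a hypothetical placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ProductParser(HTMLParser):
    """Step 3: collect the text of <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # completed records
        self._field = None   # field currently being read, if any
        self._current = {}   # record being assembled

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span":
            for field in ("name", "price"):
                if field in classes:
                    self._field = field

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
            # Emit a row once both fields have been seen.
            if {"name", "price"} <= self._current.keys():
                self.rows.append(self._current)
                self._current = {}


def to_csv(rows) -> str:
    """Step 4: store the extracted records in a structured format (CSV)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Inline sample standing in for a downloaded page (hypothetical markup).
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$19.50</span></li>
</ul>
"""

parser = ProductParser()
parser.feed(SAMPLE_HTML)
csv_text = to_csv(parser.rows)
print(csv_text)
```

Libraries such as Beautiful Soup replace the hand-written parser class with CSS-selector or XPath lookups, but the four steps stay the same.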
How it differs from APIs and manual collection
Compared with collecting data by hand, scraping does the same copy-and-paste work automatically — so it’s faster, repeatable, and scales to thousands of pages, at the cost of some upfront setup and ongoing maintenance when a site’s layout changes.
Compared with an API, scraping reads the same HTML a browser shows rather than a dedicated data feed. When a site offers an official API, that’s usually the more reliable and sanctioned option: the data is already structured, stable, and documented. Scraping is what you reach for when no API exists, when the API doesn’t expose the data you need, or when its limits are too restrictive — with the trade-off that it depends on the page’s markup and can break when that markup changes.
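To make the trade-off concrete, the sketch below reads the same price from a JSON API response and from page markup. Everything here is hypothetical: the payload, the two page versions, and the `price`/`amount` class names are invented for illustration. The point is only that the API field keeps its name, while the HTML extractor silently finds nothing once the class it relies on is renamed.

```python
import json
from html.parser import HTMLParser

# Hypothetical data: one API payload and two versions of the same page.
API_RESPONSE = '{"name": "Widget", "price": 9.99}'
PAGE_V1 = '<div class="product"><span class="price">$9.99</span></div>'
PAGE_V2 = '<div class="product"><span class="amount">$9.99</span></div>'  # class renamed


class PriceParser(HTMLParser):
    """Extract the first <span class="price"> value as a float."""

    def __init__(self):
        super().__init__()
        self.price = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        self._in_price = tag == "span" and dict(attrs).get("class") == "price"

    def handle_data(self, data):
        if self._in_price and self.price is None:
            self.price = float(data.strip().lstrip("$"))

    def handle_endtag(self, tag):
        self._in_price = False


def price_from_api(body: str) -> float:
    """Structured feed: the field is named and typed already."""
    return json.loads(body)["price"]


def price_from_html(html: str):
    """Scraping: depends on the markup; returns None if the element is not found."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.price


print(price_from_api(API_RESPONSE))
print(price_from_html(PAGE_V1))
print(price_from_html(PAGE_V2))
```

A scraper that returns `None` rather than raising can fail quietly, so it is worth checking for empty results after each run.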
Popular tools and libraries for web scraping include Python’s Beautiful Soup and Scrapy, JavaScript’s Puppeteer and Cheerio, and no-code platforms like Octoparse. No-code platforms take a different approach from script-based libraries: Octoparse, for example, runs pages in a real browser and lets you build a scraper by pointing and clicking, simulating the way a person browses and selects data rather than writing extraction code. For pages that rely heavily on JavaScript to render content, browser automation tools (like Puppeteer or Playwright) drive a headless browser, one that runs without a visible window, so that scripts execute as they would for a real visitor and the dynamically loaded data becomes available.
Things to keep in mind
A few things to keep in mind when scraping: always check a site’s robots.txt file and terms of service to understand what’s allowed, keep request rates low enough to avoid overloading servers, and be aware that some jurisdictions legally restrict scraping certain types of data, such as personal data or copyrighted content.
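As a rough sketch of the first two points, Python's standard `urllib.robotparser` module can check whether a URL is allowed and read a `Crawl-delay` directive if the site declares one. The `example-scraper` user agent, the sample robots.txt rules, the `example.com` URLs, and the `load_rules` and `allowed_urls` helpers are assumptions made for this example. The rules are parsed from an inline string so the demo needs no network access; against a live site you would call `set_url()` and `read()` on the parser instead. Note that robots.txt does not cover a site's terms of service, which still need to be read separately.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent name

# Hypothetical robots.txt content standing in for a downloaded file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, sleep=time.sleep, default_delay=1.0):
    """Yield only the URLs robots.txt permits, pausing between them.

    Uses the site's Crawl-delay when declared, otherwise default_delay seconds.
    """
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            sleep(delay)  # wait before the next request
        first = False
        yield url


rules = load_rules(ROBOTS_TXT)
urls = [
    "https://example.com/products",
    "https://example.com/private/data",
    "https://example.com/about",
]

# The demo records the pauses instead of actually sleeping.
pauses = []
permitted = list(allowed_urls(rules, urls, sleep=pauses.append))
print(permitted)
print(pauses)
```

In a real scraper you would leave `sleep` at its default so the loop actually waits between requests.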