Many modern websites no longer serve their content as static HTML. Frameworks like React, Vue, and Angular build the page dynamically in the browser: the server sends a minimal HTML shell along with JavaScript bundles, and the actual content appears only after that JavaScript executes on the client. This is why a basic HTTP scraper built on something like Python’s requests library often comes back with a nearly empty page. It fetches the raw HTML but never runs the JavaScript that would populate it with the data you actually want.

There are three approaches to solving this, and the right one depends on how the site is built. The cheapest approach wins when it applies, so check them from cheapest to most expensive, in the order given under “Try the cheapest approach first” at the end of this page.

Approach 1: Render the page in a real browser

The universal solution is to run a real browser engine that executes JavaScript exactly as a user’s browser does. The page loads, scripts run, API calls fire, content renders, and only then does the scraper extract data from the fully populated DOM. This works against any JS-rendered site, but it costs more: you’re spinning up a browser instance for every page, which uses real memory and CPU.

Which browser engine to run is its own choice: different libraries (Puppeteer, Playwright, Selenium), different cloud APIs, different headed-vs-headless trade-offs. The full menu is in The browser runtime landscape, and the headed-vs-headless decision that matters most for scraping is covered in Headed vs headless browsers. Runtime overhead adds up at scale, so an engine purpose-built for scraping can cost significantly less per page than a stock browser.

Approach 2: Intercept the underlying API

JS-rendered pages don’t conjure their content from nothing. They fetch it from backend APIs over XHR or fetch requests. If you can identify those endpoints, you can call them directly and get clean, structured JSON without rendering a browser at all. This is faster, lighter, and more reliable than full-page rendering: no DOM to wait on, no selectors to break when a layout changes, and the data arrives already parsed. The catch is that these APIs can be undocumented, require authentication, or be rate-limited, and they may change without notice.

So the workflow is: open the page in a real browser with DevTools’ Network panel open, watch the requests as the content loads, and look for the XHR / fetch calls whose responses contain the data you want. Once you’ve found them, you can often reproduce them outside the browser entirely, sometimes with a single curl command.

Octoparse builds this directly into its visual editor. Its built-in browser exposes the same network panel as DevTools, but it lets you select an underlying API response the same way you’d select a DOM element: point and click, and the task uses that endpoint instead of rendering the page. This collapses the typical “open DevTools, find the call, copy headers, rebuild the request” loop into a single visual step.

This approach is the strongest answer to JS-rendered pages whenever it’s available, and it’s more often available than people assume. It’s worth checking before reaching for a browser.
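As a rough illustration of reproducing such a call outside the browser, here is a minimal Python sketch that uses only the standard library. The endpoint URL, the query parameters, and the function names build_api_request and fetch_json are hypothetical placeholders. In practice you would copy the real URL and whichever headers the endpoint requires from the request you found in the Network panel.

```python
import json
from typing import Any, Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_request(
    endpoint: str,
    params: Dict[str, Any],
    headers: Optional[Dict[str, str]] = None,
) -> Request:
    """Rebuild a GET request observed in the Network panel.

    `endpoint`, `params`, and `headers` are whatever you copied from
    DevTools; nothing here is specific to a real site.
    """
    url = f"{endpoint}?{urlencode(params)}"
    # Ask for JSON by default; caller-supplied headers take precedence.
    merged = {"Accept": "application/json", **(headers or {})}
    return Request(url, headers=merged)


def fetch_json(request: Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON body (performs network I/O)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    req = build_api_request(
        "https://example.com/api/products",
        {"page": 2, "per_page": 50},
        headers={"Referer": "https://example.com/products"},
    )
    print(req.full_url)
```

Building the request separately from sending it makes it easy to compare what you are about to send against the request the browser made, which helps when an endpoint rejects calls that are missing a header or cookie.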

Approach 3: Detect server-side rendering

Some sites that use client-side frameworks also implement server-side rendering (SSR) or static site generation for performance and SEO. In those cases, the initial HTML response actually does contain the full content, meaning a lightweight HTTP request may be all you need. View the page source (Cmd+U / Ctrl+U, not “Inspect”, which shows the live DOM); if the data is already there in the raw HTML, you can skip the browser overhead entirely.

Some sites go further and serve different content to different user-agents, pre-rendering for search-engine crawlers, for example. Setting the request’s user-agent to a known search bot can occasionally unlock the server-rendered version of a page that otherwise requires JavaScript. Use this when it works, with the usual caveats about respecting a site’s terms.
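The view-source check can also be scripted. The sketch below, using only Python’s standard library, fetches the raw HTML without running any JavaScript and tests whether a value you expect to see is present in the visible text. The function names, the default user-agent string, and the sample markup are illustrative assumptions. Because it deliberately ignores script and style contents, data that a site embeds as JSON inside a script tag would need a separate check.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_server_rendered(html: str, expected_text: str) -> bool:
    """True if a value you know is on the page appears in the raw HTML text."""
    return expected_text in visible_text(html)


def fetch_raw_html(url: str, user_agent: str = "Mozilla/5.0", timeout: float = 10.0) -> str:
    """Fetch the page as the server sends it (performs network I/O)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Made-up markup: an empty client-rendered shell vs. a server-rendered page.
    shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
    ssr = '<html><body><div id="root"><h1>Blue Widget</h1></div></body></html>'
    print(is_server_rendered(shell, "Blue Widget"), is_server_rendered(ssr, "Blue Widget"))
```

The user_agent parameter is there so the same check can be repeated with a different user-agent string when testing whether a site serves a pre-rendered variant.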

A few practical tips when a real browser is needed

When you do end up rendering in a browser, the failure modes are predictable:
  • Wait for the right thing, not for “load.” The browser’s load event fires when the HTML and assets are in, but client-rendered content may still be on its way. Wait for the specific selector or text you need to appear, not for the page to “be done.”
  • Watch for lazy loading. Content that renders only when scrolled into view will not exist until the scraper scrolls. Most browser-automation libraries can simulate scrolling; the trick is knowing you need to.
  • Client-side routing trips traditional scrapers. In SPAs, navigating from /products to /products/42 may change the URL without triggering a full page load. Logic that watches for page-load events misses the transition entirely; wait for content changes instead.
  • Infinite scroll and “load more” need interaction, not just observation. For a deeper treatment of these patterns, see Handle pagination.
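The first two tips can be sketched in Python with Playwright, one of the libraries mentioned earlier. This is a minimal sketch under stated assumptions, not a complete scraper: Playwright is a third-party package that must be installed separately, the function names are hypothetical, and the scroll distance, pause, and round limit are arbitrary values you would tune per site. “Load more” buttons are not handled here.

```python
from typing import Callable, List


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    max_rounds: int = 20,
) -> int:
    """Scroll until the page height stops growing.

    Returns the number of scroll rounds performed. Takes plain callables
    so the loop can be exercised without a browser.
    """
    last = get_height()
    for round_no in range(1, max_rounds + 1):
        scroll_down()
        height = get_height()
        if height == last:  # nothing new was loaded by this scroll
            return round_no
        last = height
    return max_rounds


def scrape_rendered(url: str, selector: str, timeout_ms: int = 15_000) -> List[str]:
    """Render `url` in headless Chromium and return the text of `selector` matches."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Don't rely on the "load" event; wait for the content itself.
            page.goto(url, wait_until="domcontentloaded")
            page.wait_for_selector(selector, timeout=timeout_ms)

            def scroll_down() -> None:
                page.mouse.wheel(0, 4000)   # arbitrary scroll distance
                page.wait_for_timeout(500)  # arbitrary pause for lazy content

            scroll_until_stable(
                lambda: page.evaluate("document.body.scrollHeight"),
                scroll_down,
            )
            return page.locator(selector).all_inner_texts()
        finally:
            browser.close()
```

A call such as scrape_rendered("https://example.com/products", ".product-card") would wait for the first matching element, scroll until the page stops growing, and then collect the text of every match; both the URL and the selector are placeholders.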

Try the cheapest approach first

A reliable strategy is to check in this order:
  1. View source. If the content is in the raw HTML, you’re done — a single HTTP request will do.
  2. Inspect network calls. If the page fetches its content from an API, call that API directly. Faster, lighter, more reliable.
  3. Render in a real browser. When neither of the above applies, run the page. Pick the runtime that fits your operator profile and the target site’s defenses.
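The checklist above amounts to a small decision ladder. The Python sketch below encodes it; the names Strategy and choose_strategy are hypothetical, and the two boolean inputs stand for the results of your own view-source and Network-panel checks.

```python
from enum import Enum


class Strategy(Enum):
    PLAIN_HTTP = "plain HTTP request"
    DIRECT_API = "direct API call"
    BROWSER = "browser rendering"


def choose_strategy(content_in_raw_html: bool, api_endpoint_found: bool) -> Strategy:
    """Pick the cheapest approach that applies, checked in cost order."""
    if content_in_raw_html:
        # Step 1: the data is already in the page source.
        return Strategy.PLAIN_HTTP
    if api_endpoint_found:
        # Step 2: the page fetches its data from an endpoint you can call.
        return Strategy.DIRECT_API
    # Step 3: neither shortcut applies, so render the page.
    return Strategy.BROWSER


if __name__ == "__main__":
    print(choose_strategy(content_in_raw_html=False, api_endpoint_found=True).value)
```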
The reason this order matters: each step up is more expensive in latency, infrastructure, and breakage risk. A site that genuinely requires browser rendering can cost orders of magnitude more to scrape than one whose API you can call directly, and the cost compounds over thousands of pages.

A single workflow can end up needing more than one approach across different page types: some pages server-rendered, others API-fetched, others fully client-rendered. In that case, a platform that lets you switch between strategies in the same task (visual selection on rendered pages, network inspection for APIs, browser-runtime choice when rendering is needed) saves the setup and integration work of stitching three different toolchains together.