Pagination is the navigation layer of a scraper. After the scraper can fetch, render, and extract one page, it still needs to answer a practical question: where is the next batch of records, and how do I know when there are no more? Most pagination failures come from treating every site like a numbered page list. In practice, a catalog might use URL parameters, a next button, infinite scroll, a load-more button, an API offset, or an opaque cursor token. Some sites combine several of these patterns.

Start with the request, not the UI

Before writing pagination logic, open DevTools and watch what changes when you move to the next batch.
  1. Open the Network tab and filter to Fetch/XHR.
  2. Click the next page, scroll down, or press the load-more button.
  3. Inspect the request URL, query parameters, request body, and response.
  4. Decide whether the scraper should follow links, interact with the page, or call an API endpoint directly.
Use the UI as a clue, but trust the network request. A button that says “Load more” might call a simple API with offset=40. A page link might actually hydrate results through JavaScript after the URL changes.
| What changes | What to try first |
| --- | --- |
| URL includes page=2, p=2, or /page/2 | Loop through numbered URLs |
| An <a> link points to the next page | Follow the href until it disappears or becomes disabled |
| Content appears after scrolling | Find the XHR request; use browser scrolling only if needed |
| Content appears after clicking a button | Reuse the API request or click the button in a browser session |
| JSON includes next_cursor, endCursor, has_more, or offset | Paginate through the API response |
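Part of the inspection step can be scripted. The sketch below takes a request URL copied from the Network tab and reports which query parameters look like pagination controls. The function name `pagination_params` and the key list are assumptions made for illustration; the key names are simply the ones mentioned in this guide, and a real site may use others.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names this guide mentions; extend the set for the site at hand.
PAGINATION_KEYS = {"page", "p", "start", "offset", "limit", "cursor", "after"}


def pagination_params(url):
    """Return the query parameters of `url` that look like pagination controls."""
    query = parse_qs(urlsplit(url).query)
    return {
        key: values[0]
        for key, values in query.items()
        if key.lower() in PAGINATION_KEYS
    }


found = pagination_params("https://example.com/api/products?offset=40&limit=20&sort=price")
print(found)
```

Comparing the result for two consecutive requests shows which parameter actually moves between batches.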

Numbered pages

Numbered pagination is the simplest case because the next location is visible in the URL:
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/catalog/page/3
The scraper can increment the page number and stop when the response contains no items, fewer items than expected, or a known 404/empty-state page.
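import requests
from bs4 import BeautifulSoup

all_products = []

for page in range(1, 100):  # hard upper bound so the loop cannot run forever
    response = requests.get(
        "https://example.com/products", params={"page": page}, timeout=10
    )
    if response.status_code == 404:  # out-of-range pages may return 404
        break
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    cards = soup.select(".product-card")
    if not cards:  # empty page: no more results
        break

    for card in cards:
        title = card.select_one(".title")
        if title:  # skip cards that have no title element
            all_products.append(title.get_text(strip=True))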
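The loop above stops when a page has no product cards. A slightly more defensive sketch also stops when a page repeats content that was already seen. The helper names `page_fingerprint` and `should_stop` are hypothetical, and hashing the list of titles is only one possible way to compare pages.

```python
import hashlib


def page_fingerprint(titles):
    """Hash the titles on a page so two pages can be compared cheaply."""
    joined = "\n".join(titles)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def should_stop(titles, seen_fingerprints):
    """Return True for an empty page or a page whose content was already seen."""
    if not titles:
        return True
    fingerprint = page_fingerprint(titles)
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False


seen = set()
print(should_stop(["Lamp", "Desk"], seen))  # first time this page is seen
print(should_stop(["Lamp", "Desk"], seen))  # same content again
```

In the numbered-page loop, the extracted titles for each page would be passed to `should_stop` before they are appended to the results.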
Watch for page indexes that start at 0, parameter names such as p or start, and sites that return the first page again when the page number is out of range. A repeated first page is worse than an empty page because it can create duplicate data without obvious errors.

Next links

Some sites do not expose page numbers and offer only a “Next” link or arrow. If the element is a normal anchor, treat pagination as link following:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
seen_urls = set()

while url and url not in seen_urls:
    seen_urls.add(url)
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    for card in soup.select(".product-card"):
        print(card.select_one(".title").get_text(strip=True))

    next_link = soup.select_one("a[rel='next'], a.next")
    url = urljoin(url, next_link["href"]) if next_link and next_link.get("href") else None
The seen_urls guard matters. Misconfigured sites sometimes point the final “Next” link back to the current page or to page one. Also check disabled states such as aria-disabled="true", disabled, or a disabled class before trusting the link.
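The disabled-state check can be folded into a small helper. This is a sketch, not a complete rule set: `next_href` is a hypothetical name, it accepts any object with a `.get()` method (a BeautifulSoup tag works, and so does a plain dict), and treating an href of "#" as a dead link is an assumption.

```python
def next_href(link):
    """Return the href of a next link, or None if it is missing or disabled."""
    if link is None:
        return None

    classes = link.get("class") or []
    if isinstance(classes, str):  # tolerate a space-separated class string
        classes = classes.split()

    disabled = (
        link.get("aria-disabled") == "true"
        or link.get("disabled") is not None  # bare boolean attribute
        or "disabled" in classes
    )
    href = link.get("href")
    if disabled or not href or href == "#":  # "#" as a placeholder is an assumption
        return None
    return href


print(next_href({"href": "/products?page=2", "class": ["next"]}))
print(next_href({"href": "/products?page=2", "aria-disabled": "true"}))
```

In the link-following loop, the result of `soup.select_one(...)` would be passed to `next_href`, and the loop ends when it returns None.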

Infinite scroll

Infinite scroll looks like a browser-only problem, but it usually has an API underneath it. Scroll once with DevTools open and look for a request that fetches the next group of records. The useful parameters are often named offset, page, after, cursor, or limit. When the endpoint is usable, call it directly:
import requests

offset = 0
limit = 24
products = []

while True:
    data = requests.get(
        "https://example.com/api/products",
        params={"offset": offset, "limit": limit},
        timeout=10,
    ).json()

    batch = data.get("items", [])
    if not batch:
        break

    products.extend(batch)
    offset += len(batch)
Use a browser only when the API is hard to call outside the page because of authentication, signed parameters, or complex client-side state.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")

    previous_count = 0
    for _ in range(40):
        page.mouse.wheel(0, 4000)
        page.wait_for_timeout(1500)

        current_count = page.locator(".product-card").count()
        if current_count == previous_count:
            break
        previous_count = current_count

    print(page.locator(".product-card").count())
    browser.close()
For infinite scroll, do not rely only on page height. Some layouts keep changing height because of ads, images, or virtualized lists. Item count, network idle, and a maximum scroll count make a safer combination.
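One way to combine those three signals is sketched below. The function name `scroll_until_stable` and the `patience` parameter are assumptions: instead of stopping the first time the count is unchanged, the loop waits for a few unchanged rounds in a row. The `page` argument is expected to behave like a Playwright sync `Page`.

```python
def scroll_until_stable(page, selector, max_scrolls=40, patience=2, pause_ms=1500):
    """Scroll until the item count stops growing; return the final count."""
    previous = page.locator(selector).count()
    unchanged_rounds = 0

    for _ in range(max_scrolls):  # hard cap on scrolling
        page.mouse.wheel(0, 4000)
        try:
            # Network idle is a hint only; pages with constant traffic never reach it.
            page.wait_for_load_state("networkidle", timeout=5000)
        except Exception:
            pass  # fall back to the fixed pause below
        page.wait_for_timeout(pause_ms)

        current = page.locator(selector).count()
        if current == previous:
            unchanged_rounds += 1
            if unchanged_rounds >= patience:
                break
        else:
            unchanged_rounds = 0
            previous = current

    return previous
```

Inside the `sync_playwright()` block shown earlier, this could be called as `scroll_until_stable(page, ".product-card")` after `page.goto(...)`.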

Load-more buttons

A load-more button is controlled infinite scroll. The page waits for a click before requesting the next batch. That makes pacing easier because the scraper can wait, validate the new item count, and retry if the request fails. If the button calls a clean API, use that API. If not, click the button in a browser loop:
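from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")

    button = page.locator("button.load-more")
    while button.is_visible() and button.is_enabled():
        before = page.locator(".product-card").count()
        button.click()
        try:
            # Wait until the card count grows past its pre-click value.
            page.wait_for_function(
                "(count) => document.querySelectorAll('.product-card').length > count",
                arg=before,
                timeout=10_000,
            )
        except PlaywrightTimeoutError:
            # No new records appeared: treat this as the final batch.
            break

    browser.close()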
The important check is not just “button clicked”; it is “new records appeared.” Buttons can fail silently, become disabled, or remain visible after the final batch.

Offset and cursor APIs

Modern sites often paginate data at the API layer. Offset pagination asks for a numeric position:
/api/products?offset=40&limit=20
Cursor pagination asks for the next opaque token returned by the previous response:
{
  "items": [],
  "pageInfo": {
    "hasNextPage": true,
    "endCursor": "eyJpZCI6MTAwfQ=="
  }
}
Cursor pagination is more stable when records are added or removed while you scrape. Instead of saying “skip the first 40 rows”, the cursor says “continue after this known position.”
import requests

cursor = None
seen_cursors = set()

while True:
    params = {"limit": 50}
    if cursor:
        params["after"] = cursor

    data = requests.get(
        "https://example.com/api/products", params=params, timeout=10
    ).json()
    for item in data.get("items", []):
        print(item["name"])

    page_info = data.get("pageInfo", {})
    cursor = page_info.get("endCursor")
    # Stop on the last page, or if the cursor is missing or repeats.
    if not page_info.get("hasNextPage") or not cursor or cursor in seen_cursors:
        break
    seen_cursors.add(cursor)
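To make the stability argument concrete, the toy simulation below pages through an in-memory list of record IDs and deletes one record between the first and second request. The two helper functions are hypothetical stand-ins for an offset endpoint and a cursor endpoint, and the cursor is assumed to mean "the last ID seen".

```python
def page_by_offset(rows, offset, limit):
    """Offset pagination: skip `offset` rows, then take `limit`."""
    return rows[offset:offset + limit]


def page_after(rows, after_id, limit):
    """Cursor pagination: take `limit` rows whose ID comes after `after_id`."""
    remaining = [row for row in rows if after_id is None or row > after_id]
    return remaining[:limit]


rows = list(range(1, 10))  # record IDs 1..9, sorted ascending

first_offset_page = page_by_offset(rows, 0, 3)
first_cursor_page = page_after(rows, None, 3)

rows.remove(2)  # a record is deleted while the scrape is in progress

second_offset_page = page_by_offset(rows, 3, 3)
second_cursor_page = page_after(rows, first_cursor_page[-1], 3)

print(second_offset_page)  # [5, 6, 7]: record 4 was silently skipped
print(second_cursor_page)  # [4, 5, 6]: continues right after record 3
```

The offset request loses record 4 because every later row shifted one position to the left, while the cursor request is anchored to a known record rather than a position.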
For API pagination, handle rate limits deliberately. Respect Retry-After, retry temporary failures with backoff, and store progress if the job is large enough that restarting from page one would be expensive.
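A minimal retry sketch under stated assumptions: `fetch_with_backoff` is a hypothetical helper, `fetch` is any callable that returns an object with `status_code` and `headers` (a requests response fits), and the list of retryable status codes is a common choice rather than a rule. Only the numeric form of Retry-After is honored here; the HTTP-date form falls back to exponential backoff.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}  # assumed set of temporary failures


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying temporary failures with Retry-After or backoff."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt

        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)  # the server said how long to wait
        else:
            # Exponential backoff with a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        sleep(delay)

    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```

With requests, `fetch` could be `lambda u: requests.get(u, timeout=10)`. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.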

Hybrid pagination

Real sites often combine patterns:
  • A category has numbered pages, but each page lazy-loads more products after scrolling.
  • A search page starts with a load-more button, then switches to numbered links.
  • A tabbed interface has separate pagination for “New”, “Popular”, and “Sale”.
  • A listing page paginates result URLs, then each detail page has its own paginated reviews or comments.
Handle these as nested loops. Keep the outer loop responsible for the larger navigation unit, and keep each inner loop responsible for one repeated action. Track unique IDs across the whole run so duplicate records do not leak into the output.
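The nested-loop shape can be sketched as follows, using the tabbed-interface case as the example. The function `crawl_tabs`, the `fetch_page(tab, page)` callable, and the `id` field are all hypothetical; the point is only the structure: outer loop per tab, inner loop per page, one shared set of seen IDs.

```python
def crawl_tabs(tabs, fetch_page, max_pages=500):
    """Collect unique records across several tabs, each with its own pagination."""
    seen_ids = set()  # shared across the whole run, not reset per tab
    records = []

    for tab in tabs:  # outer loop: the larger navigation unit
        for page in range(1, max_pages + 1):  # inner loop: one repeated action
            batch = fetch_page(tab, page)
            if not batch:
                break
            for item in batch:
                if item["id"] in seen_ids:
                    continue  # already collected from another tab or page
                seen_ids.add(item["id"])
                records.append(item)

    return records


# Toy data: product 2 appears under both "New" and "Sale".
fake_site = {
    ("New", 1): [{"id": 1}, {"id": 2}],
    ("New", 2): [{"id": 3}],
    ("Sale", 1): [{"id": 2}, {"id": 4}],
}
result = crawl_tabs(["New", "Sale"], lambda tab, page: fake_site.get((tab, page), []))
print([record["id"] for record in result])  # [1, 2, 3, 4]
```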

Practical safeguards

  • Define a stop signal. Empty result sets, missing next links, disabled buttons, hasNextPage: false, repeated cursors, and max-iteration limits are all valid stop signals.
  • Detect duplicates. Infinite scroll and cursor APIs can repeat records when data changes mid-run. Store stable IDs or canonical URLs.
  • Throttle navigation. Add small randomized waits between batches. Browser automation should wait for content changes, not only fixed timeouts.
  • Log failures. If one page fails after retries, record the URL or cursor and continue when possible.
  • Prefer APIs when they are legitimate and stable. Direct API pagination is usually faster and easier to validate than driving a browser.
  • Use a visual tool when speed matters more than custom code. In Octoparse, pagination can be configured visually for common next-page, load-more, and infinite-scroll flows, then run locally or in the cloud.
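Several of these safeguards fit into one small control loop. The generator below is a sketch, not a reference implementation: `paginate` is a hypothetical name, and `fetch(cursor)` is assumed to return a pair of `(items, next_cursor)`. It stops on an empty batch, a missing cursor, a repeated cursor, or the iteration cap, and it can pause a randomized interval between batches.

```python
import random
import time


def paginate(fetch, max_iterations=1000, pause=0.0):
    """Yield items batch by batch until one of the stop signals fires."""
    cursor = None
    seen_cursors = set()

    for _ in range(max_iterations):  # stop signal: iteration cap
        items, next_cursor = fetch(cursor)
        yield from items

        if not items:  # stop signal: empty result set
            return
        if next_cursor is None or next_cursor in seen_cursors:
            return  # stop signal: no cursor, or a cursor that repeats
        seen_cursors.add(next_cursor)
        cursor = next_cursor

        if pause:
            time.sleep(random.uniform(pause, 2 * pause))  # randomized throttle


# Toy endpoint whose last page points back to an earlier cursor.
pages = {None: ([1, 2], "a"), "a": ([3], "b"), "b": ([4], "a")}
collected = list(paginate(lambda cursor: pages[cursor]))
print(collected)  # [1, 2, 3, 4]
```

Without the repeated-cursor check, the toy endpoint above would loop between cursors "a" and "b" until the iteration cap.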
Pagination is not just “go to the next page.” It is the scraper’s control loop. Once that loop has clear next-step logic, a reliable stop condition, and duplicate protection, the scraper can move through a site without silently stopping at page one or spinning forever.