Many useful pages sit behind a login: account dashboards, internal portals, member directories, saved searches, order histories, and private reports. From a scraper’s point of view, the problem is not only “enter a username and password.” The real task is maintaining a valid authenticated session long enough to navigate and extract data reliably.

Authenticated scraping should be limited to data you are allowed to access. A login does not remove legal, contractual, privacy, or platform-policy obligations. Treat it as a higher-risk workflow: use authorized accounts, respect access controls, and avoid collecting data outside the account’s intended permissions.

What changes after login

A public page can often be fetched with a plain HTTP request. A logged-in page usually depends on several pieces of browser state:
  • Session cookies that prove the user has already authenticated.
  • CSRF tokens or request headers that the site expects on form submissions and API calls.
  • Local storage or session storage values used by single-page apps.
  • Redirect logic that sends unauthenticated visitors back to /login.
  • Expiration rules that invalidate sessions after time, inactivity, IP changes, or security events.
This is why copying a logged-in URL into a scraper often fails. The URL is only the visible part; the session state is what lets the page load.

Approach 1: Reuse cookies from a real session

The most common approach is to sign in once, save the resulting cookies, and reuse them in later scraping runs. This works well when the site keeps sessions alive for hours or days and does not bind the session too tightly to one device or IP address. With code, the pattern looks like this:
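import requests

# Reuse one Session so the saved cookie is sent on every request.
session = requests.Session()
session.cookies.update({
    "sessionid": "saved-session-cookie",  # placeholder name and value
})

# A timeout keeps the run from hanging if the site stops responding.
response = session.get("https://example.com/account/orders", timeout=30)
print(response.status_code)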
In production, do not hard-code cookies in source files. Store them securely, rotate them when they expire, and treat them like credentials. Cookie reuse is strongest when the target pages are mostly server-rendered or when the underlying API accepts the same session cookies as the browser. It becomes fragile when the site uses short-lived tokens, device checks, or frequent reauthentication prompts.
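
One way to keep cookies out of source files is to load them at run time from a file whose location comes from the environment. The sketch below assumes a hypothetical `SCRAPER_COOKIE_FILE` environment variable and a JSON file that maps cookie names to values; both are illustrative choices, not part of any particular tool.

```python
import json
import os
from pathlib import Path


def load_cookies(path):
    """Read a {"name": "value"} cookie mapping from a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("cookie file must contain a JSON object")
    # Coerce to strings so the result can be handed to an HTTP client as-is.
    return {str(name): str(value) for name, value in data.items()}


def cookies_from_env(var_name="SCRAPER_COOKIE_FILE"):
    """Load cookies from the file named by an environment variable.

    SCRAPER_COOKIE_FILE is a hypothetical name chosen for this example.
    """
    path = os.environ.get(var_name)
    if not path:
        raise RuntimeError(f"{var_name} is not set")
    return load_cookies(path)
```

With `requests`, the result can be passed straight to `session.cookies.update(cookies_from_env())`. The cookie file itself should stay out of version control and be readable only by the account that runs the scraper.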

Approach 2: Log in with a browser

Some sites require a real browser login flow. The scraper opens the login page, fills the form, submits it, waits for the account page, then saves the browser state for later runs.
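from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Headed mode lets a person watch the flow and step in if a prompt appears.
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()

    page.goto("https://example.com/login")
    # Placeholder selectors and credentials; load real secrets from secure storage.
    page.fill("#email", "user@example.com")
    page.fill("#password", "correct-horse-battery-staple")
    page.click("button[type='submit']")
    # Wait for the post-login redirect before saving anything.
    page.wait_for_url("**/account/**")

    # Persist cookies and local storage for later runs.
    context.storage_state(path="auth-state.json")
    browser.close()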
Later runs can load the saved state:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Start from the cookies and local storage saved by the login run.
    context = browser.new_context(storage_state="auth-state.json")
    page = context.new_page()
    page.goto("https://example.com/account/orders")
    print(page.title())
    browser.close()
Browser-based login is slower than cookie-only requests, but it handles JavaScript-heavy login pages, redirects, and storage values more reliably.
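
Before reusing a saved state file, it can help to check cheaply whether its cookies have already expired. The sketch below assumes the JSON layout that Playwright's `storage_state()` writes (a top-level `cookies` list whose entries carry `name` and `expires`, with `-1` marking a session cookie). The function names and the `sessionid` default are illustrative.

```python
import json
import time
from pathlib import Path


def live_cookie_names(state_path, now=None):
    """Return names of cookies in a saved state file that have not expired.

    Assumes each cookie entry has "name" and "expires" (Unix seconds, or -1
    for a session cookie with no fixed expiry).
    """
    now = time.time() if now is None else now
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    live = []
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        if expires == -1 or expires > now:
            live.append(cookie["name"])
    return live


def state_looks_usable(state_path, required=("sessionid",), now=None):
    """True if the file exists and every required cookie is still unexpired.

    "sessionid" is a placeholder; use the cookie name the target site sets.
    """
    if not Path(state_path).is_file():
        return False
    live = set(live_cookie_names(state_path, now))
    return all(name in live for name in required)
```

This is only a pre-check: a cookie with a future expiry can still be invalidated on the server, so the scraper also needs to recognize a logged-out response at request time.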

Approach 3: Replay the login API

Sometimes the login form sends a simple POST request to an authentication endpoint. If the flow is straightforward and allowed by the site’s terms, you can reproduce that request directly with an HTTP client, capture the resulting cookies, and continue through the site. This is usually less stable than browser login because authentication flows often include CSRF tokens, bot checks, device fingerprinting, or rotating hidden fields. Inspecting the Network tab will tell you whether the login is simple enough to replay or whether a browser session is the safer route.
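
When a login form carries a CSRF token or other hidden fields, a replayed request has to send them back. As a minimal sketch, the standard-library parser below collects hidden inputs from the login page HTML and merges them with the credentials. The `email` and `password` field names are assumptions; read the real ones from the form or the Network tab.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_html, email, password):
    """Merge the page's hidden fields with the credentials to submit.

    Collects hidden inputs from every form on the page, which is enough for
    a simple login page but too broad for pages with several forms.
    """
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    payload.update({"email": email, "password": password})
    return payload
```

With a `requests.Session`, the flow would be to `get` the login page, pass `response.text` to `build_login_payload`, and `post` the result as form data on the same session so the cookies set by the first request are sent along.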

MFA, SSO, and security prompts

Multi-factor authentication changes the design. A scraper should not try to bypass MFA. Instead, design around it:
  • Use a human-in-the-loop login step, then save the authenticated session.
  • Prefer official APIs or service accounts when the site provides them.
  • Expect sessions to expire and build a refresh or re-login workflow.
  • Avoid using personal accounts for unattended production jobs.
Single sign-on flows can be even more constrained because they may cross domains, require organization policies, or trigger security prompts when the browser identity changes. For those cases, a persistent browser profile or an official integration is usually more reliable than a raw HTTP client.
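
The "expect sessions to expire" point can be sketched as a small retry wrapper. Here `fetch` and `login` are hypothetical caller-supplied callables: `fetch` raises `SessionExpired` when it sees a logged-out response, and `login` may be a scripted browser login or a prompt asking a person to complete MFA and save a fresh session.

```python
class SessionExpired(Exception):
    """Raised by a fetch function when it detects a logged-out response."""


def fetch_with_relogin(fetch, login, max_logins=1):
    """Call fetch(); on SessionExpired, call login() and try again.

    Stops after max_logins re-login attempts so a broken login flow does
    not turn into repeated hits on the login form.
    """
    logins = 0
    while True:
        try:
            return fetch()
        except SessionExpired:
            if logins >= max_logins:
                raise
            login()
            logins += 1
```

Capping the number of re-logins is deliberate: repeated failed logins are a common trigger for account lockouts and security alerts.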

Session safety

Authenticated scraping can lock accounts, trigger alerts, or expose sensitive data if handled casually. Use these safeguards:
  • Separate accounts by task. Do not mix unrelated scraping jobs in the same logged-in session.
  • Keep IP and browser identity stable within one session. Switching proxies mid-session can look like account takeover.
  • Rate-limit actions. Logged-in areas often have stricter monitoring than public pages.
  • Detect logout pages. A scraper should recognize when it has been redirected to login instead of parsing the login page as real data.
  • Store secrets securely. Credentials, cookies, and storage state files are all sensitive.
  • Log access intentionally. Keep enough run history to diagnose failures, but avoid writing private page content or credentials to logs.
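
As one illustration of the "detect logout pages" safeguard, the sketch below flags a response that looks like a login page instead of real data. The path list and HTML markers are assumptions chosen for the example; tune them to the target site, because a password field can also appear on legitimate account pages.

```python
from urllib.parse import urlsplit

# Hypothetical hints; adjust to the site being scraped.
LOGIN_PATHS = ("/login", "/signin", "/sso")
LOGIN_MARKERS = ('type="password"', 'name="password"')


def looks_logged_out(final_url, status_code, body):
    """Heuristically decide whether a response is a login page, not data."""
    if status_code == 401:
        return True
    # Compare the path of the URL reached after any redirects.
    path = urlsplit(final_url).path.rstrip("/").lower()
    if any(path == p or path.startswith(p + "/") for p in LOGIN_PATHS):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in LOGIN_MARKERS)
```

With `requests`, this would be called as `looks_logged_out(response.url, response.status_code, response.text)` before any parsing, so that a redirected login page is never stored as scraped content.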

How Octoparse fits

Visual scraping tools usually handle authenticated pages in two ways: they let the user sign in through an embedded browser, then they preserve the resulting cookies and browser session; or they simulate the login steps as part of the task so the scraper can sign in before extraction starts. Octoparse follows this general model. For sites that can stay logged in, users can authenticate in the built-in browser and reuse the saved cookies/session across runs. For sites that require a fresh login flow, the task can simulate the login behavior before navigating to the target pages. The same practical limits still apply: MFA may require human intervention, sessions can expire, and the account must be authorized to access the data being collected.

When not to scrape behind a login

Do not scrape authenticated content when you do not have permission, when the data belongs to other users, when the site’s terms prohibit the use case, or when an official export/API exists and is the appropriate channel. Logged-in scraping is best reserved for authorized workflows where the account holder is collecting their own accessible data or operating within an approved business process. The technical pattern is simple: authenticate, preserve session state, navigate, detect expiry, and refresh when needed. The hard part is operating it responsibly.