Many useful pages sit behind a login: account dashboards, internal portals, member directories, saved searches, order histories, and private reports. From a scraper’s point of view, the problem is not only “enter a username and password.” The real task is maintaining a valid authenticated session long enough to navigate and extract data reliably. Authenticated scraping should be limited to data you are allowed to access. A login does not remove legal, contractual, privacy, or platform-policy obligations. Treat it as a higher-risk workflow: use authorized accounts, respect access controls, and avoid collecting data outside the account’s intended permissions.
What changes after login
A public page can often be fetched with a plain HTTP request. A logged-in page usually depends on several pieces of browser state:
- Session cookies that prove the user has already authenticated.
- CSRF tokens or request headers that the site expects on form submissions and API calls.
- Local storage or session storage values used by single-page apps.
- Redirect logic that sends unauthenticated visitors back to /login.
- Expiration rules that invalidate sessions after time, inactivity, IP changes, or security events.
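As a rough illustration of how these pieces fit together, the sketch below bundles saved cookies and an optional CSRF token into request headers, checks the age of the session, and treats a redirect to /login as a sign that the session is no longer valid. It uses only the Python standard library. `SavedSession`, `is_login_redirect`, `fetch`, and the `X-CSRF-Token` header name are hypothetical choices for this example; real sites differ in header names, login paths, and expiry rules.

```python
import time
import urllib.request
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import urlparse


@dataclass
class SavedSession:
    """Hypothetical container for the state a logged-in request depends on."""

    cookies: Dict[str, str]
    csrf_token: Optional[str] = None
    created_at: float = field(default_factory=time.time)

    def headers(self) -> Dict[str, str]:
        # Cookies travel in a single Cookie header as name=value pairs.
        headers = {"Cookie": "; ".join(f"{k}={v}" for k, v in self.cookies.items())}
        if self.csrf_token:
            # Assumption: the site reads the token from X-CSRF-Token.
            headers["X-CSRF-Token"] = self.csrf_token
        return headers

    def is_expired(self, max_age_seconds: float, now: Optional[float] = None) -> bool:
        # Age-based check only; sites may also expire sessions on inactivity or IP change.
        now = time.time() if now is None else now
        return now - self.created_at > max_age_seconds


def is_login_redirect(final_url: str, login_path: str = "/login") -> bool:
    """True when a request ended up on the login page instead of the target."""
    return urlparse(final_url).path.rstrip("/") == login_path


def fetch(url: str, session: SavedSession) -> str:
    """Fetch a page with the saved state; raise if the site redirected to login."""
    request = urllib.request.Request(url, headers=session.headers())
    with urllib.request.urlopen(request, timeout=30) as response:
        # urlopen follows redirects, so geturl() is the final URL.
        if is_login_redirect(response.geturl()):
            raise PermissionError("session is no longer authenticated")
        return response.read().decode("utf-8", errors="replace")
```

Checking the final URL before parsing keeps a silent logout from turning into a batch of scraped login pages.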
Approach 1: Reuse cookies from a real session
The most common approach is to sign in once, save the resulting cookies, and reuse them in later scraping runs. This works well when the site keeps sessions alive for hours or days and does not bind the session too tightly to one device or IP address.
Approach 2: Log in with a browser
Some sites require a real browser login flow. The scraper opens the login page, fills the form, submits it, waits for the account page, then saves the browser state for later runs.
Approach 3: Replay the login API
Sometimes the login form sends a simple POST request to an authentication endpoint. If the flow is straightforward and allowed by the site’s terms, you can reproduce that request directly with an HTTP client, capture the resulting cookies, and continue through the site. This is usually less stable than browser login because authentication flows often include CSRF tokens, bot checks, device fingerprinting, or rotating hidden fields. Inspecting the Network tab will tell you whether the login is simple enough to replay or whether a browser session is the safer route.
MFA, SSO, and security prompts
Multi-factor authentication changes the design. A scraper should not try to bypass MFA. Instead, design around it:
- Use a human-in-the-loop login step, then save the authenticated session.
- Prefer official APIs or service accounts when the site provides them.
- Expect sessions to expire and build a refresh or re-login workflow.
- Avoid using personal accounts for unattended production jobs.
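One way to combine the human-in-the-loop step with a re-login workflow is sketched below using Playwright for Python. A person completes the login and any MFA prompt in a visible browser window, the script saves the browser state to a file, and later runs repeat the interactive step only when that file is missing or older than a chosen age. The URL, the `**/account**` pattern, the file name, and the 12-hour default are placeholders, and the helper names are hypothetical. A fresh file is not proof of a live session, since the site can end a session earlier.

```python
import os
import time
from typing import Optional

STATE_PATH = "storage_state.json"        # hypothetical file name
LOGIN_URL = "https://example.com/login"  # placeholder URL
ACCOUNT_URL_GLOB = "**/account**"        # placeholder pattern for the post-login page


def state_is_fresh(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if a saved state file exists and is no older than max_age_seconds."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) <= max_age_seconds


def interactive_login(path: str = STATE_PATH) -> None:
    """Open a visible browser, let a person finish login and MFA, then save state."""
    # Imported here so the freshness check works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(LOGIN_URL)
        # The person types credentials and the MFA code; wait up to 5 minutes.
        page.wait_for_url(ACCOUNT_URL_GLOB, timeout=300_000)
        # Writes cookies and local storage to a JSON file.
        context.storage_state(path=path)
        browser.close()
    # The state file is as sensitive as a password; restrict it to the owner.
    os.chmod(path, 0o600)


def ensure_session(path: str = STATE_PATH, max_age_seconds: float = 12 * 3600) -> str:
    """Run the interactive login only when the saved state is missing or stale."""
    if not state_is_fresh(path, max_age_seconds):
        interactive_login(path)
    return path
```

An unattended run can then call `ensure_session()` and pass the returned path to `browser.new_context(storage_state=path)` to start already signed in.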
Session safety
Authenticated scraping can lock accounts, trigger alerts, or expose sensitive data if handled casually. Use these safeguards:
- Separate accounts by task. Do not mix unrelated scraping jobs in the same logged-in session.
- Keep IP and browser identity stable within one session. Switching proxies mid-session can look like account takeover.
- Rate-limit actions. Logged-in areas often have stricter monitoring than public pages.
- Detect logout pages. A scraper should recognize when it has been redirected to login instead of parsing the login page as real data.
- Store secrets securely. Credentials, cookies, and storage state files are all sensitive.
- Log access intentionally. Keep enough run history to diagnose failures, but avoid writing private page content or credentials to logs.
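Three of these safeguards translate fairly directly into code: detecting logout pages, rate-limiting actions, and keeping secrets out of logs. The sketch below is one possible shape for them, using only the Python standard library. The function and class names, the login path suffixes, and the list of sensitive keys are assumptions for illustration and would need tuning for a real site.

```python
import re
import time
from urllib.parse import urlparse

# Assumption: keys that should never be written to logs verbatim.
SENSITIVE_KEYS = {"password", "cookie", "set-cookie", "authorization", "token"}


def looks_like_login_page(final_url: str, html: str) -> bool:
    """Heuristic: the URL path is a login route or the page has a password field."""
    path = urlparse(final_url).path.lower().rstrip("/")
    if path.endswith(("/login", "/signin")):
        return True
    return re.search(r'<input[^>]+type=["\']?password', html, re.IGNORECASE) is not None


class MinIntervalLimiter:
    """Enforce a minimum gap between actions within one logged-in session."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous action, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def redact(record: dict) -> dict:
    """Return a copy of a log record with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in record.items()
    }
```

A run loop would call `limiter.wait()` before each request, check `looks_like_login_page` on each response before parsing it, and pass log records through `redact` before writing them.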