Selectors are the foundation of any rule-based scraper: they tell the tool exactly which elements on a page to extract data from. A well-written selector keeps working through minor site updates; a brittle one breaks the moment a developer changes a class name or rearranges the layout. Understanding how to write durable selectors is one of the most practical skills in web scraping, and the place to invest time even if you use a visual tool that generates selectors for you. There are two main selector languages: CSS selectors and XPath. Both can locate elements in an HTML document, but they work differently, and each has strengths the other lacks.
CSS selectors
CSS selectors use the same syntax web developers write in stylesheets, which makes them feel intuitive if you have any front-end experience. They select elements based on tag names, classes, IDs, attributes, and their relationships to other elements. For example, `div.product-card h2` selects all `h2` elements inside `div` elements with the class `product-card`. CSS selectors are generally shorter, easier to read, and faster for browsers to evaluate. They’re the better default choice for most straightforward extraction tasks.
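To make the semantics of `div.product-card h2` concrete, here is a minimal sketch that hand-rolls the same match using only Python's standard library. It is a stand-in for illustration, not a CSS engine: the class name `ProductTitleParser`, the helper `product_titles`, and the sample markup are all hypothetical, and in practice a library with CSS selector support would do this in one call.

```python
from html.parser import HTMLParser


class ProductTitleParser(HTMLParser):
    """Collects the text of every <h2> nested at any depth inside a
    <div> whose class list contains "product-card"."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of (tag, class_set) for open elements
        self.in_target = False  # True while inside a matching <h2>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Descendant combinator: any open ancestor may be the product card.
        inside_card = any(
            t == "div" and "product-card" in c for t, c in self.open_tags
        )
        self.open_tags.append((tag, classes))
        if tag == "h2" and inside_card:
            self.in_target = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_target = False
        # Pop back to the matching open tag (tolerates unclosed elements).
        while self.open_tags and self.open_tags.pop()[0] != tag:
            pass

    def handle_data(self, data):
        if self.in_target:
            self.titles[-1] += data


def product_titles(html):
    parser = ProductTitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


SAMPLE = """
<h2>Site heading</h2>
<div class="product-card featured">
  <div class="body"><h2>Blue Widget</h2></div>
</div>
<div class="product-card"><h2>Red Gadget</h2></div>
"""

print(product_titles(SAMPLE))
```

Note that the heading outside any product card is skipped, while the one nested two levels deep is still matched: the space in the selector means "descendant at any depth", not "direct child".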
XPath
XPath is more powerful but more verbose. It treats the HTML document as a tree and lets you navigate in any direction: not just downward from parent to child, but also upward to parents, sideways to siblings, and across the document. Crucially, XPath can select elements based on their text content, as in `//a[contains(text(), "Next Page")]`, which standard CSS selectors cannot do. This makes XPath essential for tasks like finding a button by its label, or locating a table cell based on what it says. XPath also supports conditions, functions, and complex predicates, giving it more expressive power for unusual page structures.
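The sketch below shows text-based matching, sibling lookup, and upward navigation using Python's built-in `xml.etree.ElementTree`. Two caveats: the standard library implements only a subset of XPath and does not support `contains()`, so the example substitutes an exact-text predicate (`[.='Next Page']`, available in Python 3.7+), and the sample markup is a hypothetical, well-formed snippet. A full XPath 1.0 engine such as lxml would accept the `contains(text(), ...)` form shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; it must be well-formed XML for ElementTree.
PAGE = """
<html><body>
  <table>
    <tr><td>Price</td><td>$19.99</td></tr>
    <tr><td>Stock</td><td>12</td></tr>
  </table>
  <a href="/about">About</a>
  <a href="/page/2">Next Page</a>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Text-based match: the link whose full text is exactly "Next Page".
next_link = root.find(".//a[.='Next Page']")

# Sibling relationship: in the row that has a cell saying "Price",
# take the second cell.
price_cell = root.find(".//tr[td='Price']/td[2]")

# Upward navigation: from the cell saying "Stock", step to its parent row.
stock_row = root.find(".//td[.='Stock']/..")

print(next_link.get("href"))
print(price_cell.text)
print([cell.text for cell in stock_row])
```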
When to pick which
The trade-off comes down to simplicity versus flexibility. CSS handles the majority of common cases more cleanly. XPath is the tool you reach for when the HTML structure is awkward, when you need text-based matching, or when the element you want can only be identified by its relationship to a sibling or ancestor rather than a direct parent.
Writing durable selectors
Regardless of which language you use, durability comes down to a few principles:
- Avoid auto-generated class names. Frameworks like React, Vue, and Angular often produce class names that change on every build; `.css-1a2b3c` and friends are landmines.
- Prefer semantic attributes. `id`, `data-*`, `role`, and `aria-label` are less likely to change because they carry meaning beyond styling. A selector anchored on `[data-product-id]` survives a stylesheet rewrite; one anchored on `.flex-row__inner--lg` does not.
- Keep selectors short. The longer the chain of parent-child relationships, the more likely some intermediate layout change breaks the whole chain.
- Avoid positional selectors. `nth-child` and absolute XPath paths like `/html/body/div[3]/div[2]/ul/li[1]` assume the element stays in an exact spot in the DOM, which is almost never true.
- Anchor to the nearest stable landmark. Rather than traversing from the root of the document, find the closest element with a meaningful, stable identifier and select relative to it.
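The principles above can be turned into a rough lint check. The following is a heuristic sketch only: the function name `brittleness_warnings`, the regular expressions, and the depth threshold are assumptions chosen for illustration, and the step-splitting is naive (it does not handle spaces inside XPath predicates).

```python
import re

# Illustrative patterns; real generated class names vary by toolchain.
GENERATED_CLASS = re.compile(r"\.(css|sc|jsx)-[0-9a-zA-Z]{4,}")
POSITIONAL = re.compile(r":nth-(child|of-type)\(|\[\d+\]")
STABLE_ATTR = re.compile(r"#[\w-]+|\[@?(id|data-[\w-]+|role|aria-label)\b")


def brittleness_warnings(selector, max_depth=4):
    """Return a list of reasons a CSS or XPath selector may be fragile."""
    warnings = []
    if GENERATED_CLASS.search(selector):
        warnings.append("auto-generated class name")
    if POSITIONAL.search(selector):
        warnings.append("positional step")
    if selector.startswith("/html"):
        warnings.append("absolute path from document root")
    # Naive step count: split on CSS combinators and XPath slashes.
    steps = [s for s in re.split(r"\s*>\s*|\s+|/+", selector) if s]
    if len(steps) > max_depth:
        warnings.append("long chain")
    if not STABLE_ATTR.search(selector):
        warnings.append("no stable attribute")
    return warnings


for candidate in (
    "/html/body/div[3]/div[2]/ul/li[1]",
    ".css-1a2b3c > span:nth-child(2)",
    "[data-product-id] h2",
):
    print(candidate, "->", brittleness_warnings(candidate))
```

A check like this is best treated as a prompt for review rather than a verdict: a flagged selector may still be the only option on a page with no stable attributes.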
The Shadow DOM challenge
One emerging obstacle for selectors is Shadow DOM, a web component feature that encapsulates a section of the DOM behind a shadow root boundary. Standard XPath and CSS selectors evaluated against the main document cannot reach into a shadow root, which means elements inside web components are invisible to traditional scraping approaches. As more sites adopt web components for modular UI, this is becoming a real practical problem: a scraper might see the outer shell of a component but not the content rendered inside it. The fix requires a tool that can pierce shadow roots. Playwright’s CSS and text selector engines pierce open shadow roots by default (its XPath engine does not); Octoparse extends XPath with a custom syntax that does the same, letting a generated selector reach into a shadow root the same way it would address a regular DOM subtree.
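The effect of the shadow boundary can be illustrated with a toy model. This is not a browser API and not how any particular tool is implemented: `Node`, `shadow_root`, and `find_all` are hypothetical names, and the tree is a plain Python structure in which a component's internal content is kept apart from its ordinary children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    # Content rendered inside the component, hidden behind the boundary.
    shadow_root: list = field(default_factory=list)


def find_all(node, tag, pierce=False):
    """Depth-first search for `tag`. Shadow content is only visited
    when pierce=True, mimicking a shadow-piercing selector engine."""
    found = [node] if node.tag == tag else []
    visible = node.children + (node.shadow_root if pierce else [])
    for child in visible:
        found += find_all(child, tag, pierce)
    return found


# A page with one regular <span> and one <span> inside a web component.
page = Node("body", children=[
    Node("product-card", shadow_root=[Node("span")]),
    Node("span"),
])

print(len(find_all(page, "span")))               # ordinary traversal
print(len(find_all(page, "span", pierce=True)))  # shadow-piercing traversal
```

The ordinary traversal finds the `product-card` shell but only one `span`; the piercing traversal finds both.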
How Octoparse approaches selectors
Octoparse primarily uses XPath for element targeting, which is a deliberate choice. XPath’s tree-navigation model maps naturally to the way non-technical users think about page structure: “the price inside this product card” translates more directly into an XPath expression than a CSS selector chain. This keeps things approachable for users who aren’t developers but still need to review or adjust a selector when something changes. More importantly, Octoparse doesn’t just generate any working XPath when a user clicks on an element; it applies an intelligent attribute prioritization algorithm. The system evaluates available attributes by their semantic stability: meaningful identifiers like `id`, `data-*`, and `role` are preferred over volatile class names or fragile positional indices. The auto-generated XPath is designed to survive minor site changes: a reshuffled layout or an updated stylesheet won’t break a selector that anchors on a stable `data-product-id` attribute rather than a third-div-inside-the-second-section path.
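The general idea of attribute prioritization can be sketched as follows. This is not Octoparse's algorithm, whose details are not given here: the function `build_xpath`, the ranking order, and the fallback behavior are assumptions made purely for illustration.

```python
def stability_rank(attr_name):
    """Lower is more stable. The ordering is an illustrative assumption."""
    if attr_name == "id":
        return 0
    if attr_name.startswith("data-"):
        return 1
    if attr_name in ("role", "aria-label"):
        return 2
    return None  # e.g. class or style: treated as too volatile to anchor on


def build_xpath(tag, attrs):
    """Build a relative XPath anchored on the most stable attribute present.

    Assumes attribute values contain no double quotes.
    """
    ranked = sorted(
        (stability_rank(name), name)
        for name in attrs
        if stability_rank(name) is not None
    )
    if not ranked:
        # No stable attribute: fall back to the bare tag and let the
        # caller decide how to narrow it down.
        return f"//{tag}"
    _, best = ranked[0]
    return f'//{tag}[@{best}="{attrs[best]}"]'


print(build_xpath("div", {"class": "css-1a2b3c", "data-product-id": "42"}))
print(build_xpath("button", {"class": "btn", "role": "button", "id": "buy"}))
```

The first call ignores the volatile class and anchors on `data-product-id`; the second prefers `id` over `role`.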
For pages built with web components, the same generator produces selectors using Octoparse’s custom XPath extension for Shadow DOM, so an element inside a shadow root is addressable the same way an element in the regular DOM is, without dropping out of XPath to a different selector dialect.
The team continues to refine this along two more fronts:
- AI-assisted XPath generation that evaluates broader page context to produce more resilient selectors
- AI-powered self-repair that detects broken selectors and updates them based on the changed page structure