Raw scraped data is rarely ready to use as-is. Octoparse lets you refine extracted fields before export, so you can clean text, reshape values, remove unwanted characters, and extract only the part of a field you need. Use data refinement when you want your exported data to be closer to the final format required by a spreadsheet, database, workflow, or downstream system.
What you can refine
Clean field values
Remove, replace, trim, or reformat text before it is exported.
Extract part of a value
Use matching rules or regular expressions to keep only the text pattern you need.
Standardize output
Add prefixes, remove repeated text, or normalize inconsistent values across rows.
Troubleshoot field structure
Check whether a value can be split into separate fields based on the page’s source structure.
When to refine data
Refine data when the field is correct, but the value needs cleanup. Common examples include:
- Removing labels such as Price: or Rating:
- Extracting a number from a longer text string
- Replacing unwanted characters or spaces
- Adding a prefix or suffix
- Matching a specific pattern with RegEx
- Cleaning HTML-derived values such as image attributes or text embedded in source code
Access Clean Data
You can refine a field from the field editor.
Common refinement operations
| Operation | Use it for |
|---|---|
| Replace | Replace a specific string with another value, or remove it by replacing it with an empty string. |
| Add prefix or suffix | Add fixed text before or after each extracted value. |
| Match with RegEx | Extract text that follows a repeatable pattern. |
| Trim or remove characters | Clean extra spaces, symbols, or unwanted text fragments. |
| Extract from HTML or attributes | Pull values from raw HTML, image URLs, alt, src, or similar attributes. |
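To make the effect of these operations concrete, here is a minimal Python sketch that mimics a chain of cleaning steps on one sample value. It is not Octoparse code: Octoparse applies these steps in its own interface, and the function names (`trim`, `replace_text`, `match_regex`, `add_affixes`) and the sample value are assumptions made up for illustration.

```python
import re


def trim(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def replace_text(value: str, old: str, new: str = "") -> str:
    """Replace a fixed string; an empty replacement removes it."""
    return value.replace(old, new)


def match_regex(value: str, pattern: str) -> str:
    """Keep only the first match of the pattern, or an empty string."""
    match = re.search(pattern, value)
    return match.group(0) if match else ""


def add_affixes(value: str, prefix: str = "", suffix: str = "") -> str:
    """Add fixed text before and/or after the value."""
    return f"{prefix}{value}{suffix}"


# Hypothetical raw field value, as it might come off a product page.
raw = "  Price: $1,299.00  "

step1 = trim(raw)                                # "Price: $1,299.00"
step2 = replace_text(step1, "Price: ")           # "$1,299.00"
step3 = match_regex(step2, r"[\d,]+\.\d{2}")     # "1,299.00"
step4 = add_affixes(step3, prefix="USD ")        # "USD 1,299.00"
```

Each step takes the output of the previous one, which is the same mental model as stacking cleaning steps on a field: simple steps first, a pattern match only where the simple steps cannot isolate the value.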
Use RegEx for pattern-based cleanup
Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace rules. Use RegEx when you need to:
- Extract a number from a sentence
- Match text before or after a known delimiter
- Keep only part of an HTML attribute
- Remove repeating patterns across many rows
- Clean values that vary slightly from page to page
Use RegEx only when simpler cleaning rules are not enough. For straightforward cleanup, operations such as replace, trim, prefix, and suffix are easier to maintain.
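Two of the cases above, extracting a number from a sentence and matching text after a known delimiter, can be sketched with Python's `re` module. This is only an illustration of the patterns: the helper names and sample strings are hypothetical, and the regex flavor used inside Octoparse may differ in small details from Python's.

```python
import re


def first_number(text: str) -> str:
    """Return the first integer or decimal number in the text, or ''."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else ""


def after_delimiter(text: str, delimiter: str) -> str:
    """Return the text that follows the first occurrence of the delimiter."""
    # re.escape keeps delimiters such as "." or "|" from being read as regex syntax.
    match = re.search(re.escape(delimiter) + r"\s*(.+)", text)
    return match.group(1).strip() if match else ""


rating = first_number("Rated 4.5 out of 5 by 120 customers")  # "4.5"
sku = after_delimiter("SKU: AB-1234", ":")                    # "AB-1234"
missing = first_number("No reviews yet")                      # ""
```

Returning an empty string when nothing matches keeps rows without the pattern visible in the output instead of failing silently, which makes inconsistent pages easier to spot during preview.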
Example: extract a value from an attribute
Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image attribute such as alt="5 stars" or in a source value such as src.
A typical workflow is:
Select the element
Select the element that contains the value you need, such as a rating icon or text block.
Choose the source value
Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.
Extract the target value
Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.
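The last step of the workflow above can be sketched offline in Python, assuming the field was captured as OuterHTML. The HTML snippet and the `attribute_value` helper are hypothetical, and a regular expression is only a reasonable fit here because the input is a single short element, not a whole page.

```python
import re

# Hypothetical OuterHTML value captured for a rating icon.
outer_html = '<img class="rating" src="/img/stars-5.png" alt="5 stars">'


def attribute_value(html: str, name: str) -> str:
    """Return the double-quoted value of the named attribute, or ''."""
    pattern = rf'\b{re.escape(name)}\s*=\s*"([^"]*)"'
    match = re.search(pattern, html)
    return match.group(1) if match else ""


alt_text = attribute_value(outer_html, "alt")    # "5 stars"
src_value = attribute_value(outer_html, "src")   # "/img/stars-5.png"

# Keep only the numeric part of the alt text.
digits = re.search(r"\d+", alt_text)
rating_value = digits.group(0) if digits else ""  # "5"
```

If the attribute can be selected directly, that is usually the more robust choice; the regex route is a fallback for values embedded in larger HTML fragments.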
Limits of field refinement
Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a multi-line text block appears as several lines visually but is actually one single element in the page source, Octoparse may treat it as one field. In that case, you may not be able to split it into separate fields by visual line breaks alone. Check the source structure and use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.
Best practices
- Refine fields after confirming the correct element is selected.
- Use simple cleaning steps before trying RegEx.
- Preview each step before saving.
- Keep field names clear so exported data is easy to understand.
- Avoid over-cleaning if the downstream system can handle formatting later.
- Document complex RegEx patterns so teammates can maintain the task.
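One way to document a complex pattern is to keep an annotated copy next to the task, for example in a team notes file or a small test script. The sketch below uses Python's `re.VERBOSE` mode to attach a comment to each part of a price pattern. The pattern and the `parse_price` helper are hypothetical, and this is an assumption about workflow rather than an Octoparse feature: the RegEx field in Octoparse may not accept comments or verbose syntax, so the compact form of the pattern would go into the task itself.

```python
import re

# Annotated reference copy of a price pattern.
PRICE_PATTERN = re.compile(
    r"""
    (?P<currency>[$€£])         # one currency symbol
    \s*                         # optional space between symbol and amount
    (?P<amount>
        \d{1,3}(?:,\d{3})*      # whole part, with optional thousands separators
        (?:\.\d{2})?            # optional two-digit cents
    )
    """,
    re.VERBOSE,
)


def parse_price(text: str):
    """Return (currency, amount) for the first price in the text, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return match.group("currency"), match.group("amount")


first = parse_price("Now only $ 1,299.00 (was $1,499.00)")  # ("$", "1,299.00")
none_found = parse_price("Free shipping")                    # None
```

Keeping a few sample inputs alongside the annotated pattern gives teammates a quick way to check that a later change still matches what the task expects.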