Raw scraped data is rarely ready to use as-is. Octoparse lets you refine extracted fields before export, so you can clean text, reshape values, remove unwanted characters, and extract only the part of a field you need. Use data refinement when you want your exported data to be closer to the final format required by a spreadsheet, database, workflow, or downstream system.

What you can refine

Clean field values

Remove, replace, trim, or reformat text before it is exported.

Extract part of a value

Use matching rules or regular expressions to keep only the text pattern you need.

Standardize output

Add prefixes, remove repeated text, or normalize inconsistent values across rows.

Troubleshoot field structure

Check whether a value can be split into separate fields based on the page’s source structure.

When to refine data

Refine data when the field is correct, but the value needs cleanup. Common examples include:
  • Removing labels such as Price: or Rating:
  • Extracting a number from a longer text string
  • Replacing unwanted characters or spaces
  • Adding a prefix or suffix
  • Matching a specific pattern with RegEx
  • Cleaning HTML-derived values such as image attributes or text embedded in source code
If the wrong element is being extracted, fix the field selection first. If the right element is selected but the value format is messy, use refinement rules.

Access Clean Data

You can refine a field from the field editor.
1. Select the field. In the data preview or field list, select the extracted field you want to clean.
2. Open the field menu. Click the ... menu for that field.
3. Choose Clean Data. Select Clean Data to open the data cleaning workflow.
4. Add a cleaning step. Click Add Step, then choose the operation you want to apply.
5. Preview the result. Check the preview value before saving the rule.

Common refinement operations

Operation | Use it for
Replace | Replace a specific string with another value, or remove it by replacing it with an empty string.
Add prefix or suffix | Add fixed text before or after each extracted value.
Match with RegEx | Extract text that follows a repeatable pattern.
Trim or remove characters | Clean extra spaces, symbols, or unwanted text fragments.
Extract from HTML or attributes | Pull values from raw HTML, image URLs, alt, src, or similar attributes.
A “string” is a sequence of characters, which can include letters, digits, spaces, symbols, and punctuation marks. An empty string contains no characters. For example, replacing a piece of text with an empty string removes that text from the value.
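To make the effect of these operations concrete, here is a minimal Python sketch of what replace, prefix or suffix, and trim steps do to a single value. It is an illustration only: Octoparse applies these steps through the Clean Data interface, and the function names and the sample value below are hypothetical, not part of Octoparse.

```python
# Illustration only: plain-Python equivalents of common cleaning steps.
# These helper names are hypothetical and are not Octoparse APIs.

def replace_text(value: str, old: str, new: str = "") -> str:
    """Replace every occurrence of `old`; an empty `new` removes it."""
    return value.replace(old, new)


def add_prefix(value: str, prefix: str) -> str:
    """Put fixed text in front of the value."""
    return prefix + value


def add_suffix(value: str, suffix: str) -> str:
    """Put fixed text after the value."""
    return value + suffix


def trim(value: str) -> str:
    """Strip leading and trailing whitespace."""
    return value.strip()


raw = "  Price: $19.99  "                       # made-up sample value
step1 = trim(raw)                                # "Price: $19.99"
step2 = replace_text(step1, "Price: ")           # label replaced with an empty string
step3 = replace_text(step2, "$")                 # currency symbol removed
step4 = add_suffix(step3, " USD")                # fixed text appended
print(step4)
```

Each step takes the output of the previous one, which mirrors how a sequence of cleaning steps builds up the final value.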

Use RegEx for pattern-based cleanup

Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace rules. Use RegEx when you need to:
  • Extract a number from a sentence
  • Match text before or after a known delimiter
  • Keep only part of an HTML attribute
  • Remove repeating patterns across many rows
  • Clean values that vary slightly from page to page
You can access the RegEx tool from the Clean Data workflow, or from the Tools area in the left sidebar. If you are not familiar with RegEx syntax, use the built-in RegEx tool to generate a pattern from examples instead of writing the expression manually.
Use RegEx only when simpler cleaning rules are not enough. For straightforward cleanup, operations such as replace, trim, prefix, and suffix are easier to maintain.
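As a rough illustration of pattern-based cleanup, the sketch below shows two of the cases listed above: extracting a number from a sentence and keeping the text after a known delimiter. It uses Python's re module; the helper names and sample values are hypothetical, and the syntax accepted by Octoparse's RegEx tool may differ in details, so treat the patterns as a starting point and confirm them in the preview.

```python
import re
from typing import Optional


def extract_first_number(text: str) -> Optional[str]:
    """Return the first integer or decimal number in `text`, if any."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None


def text_after(text: str, delimiter: str) -> Optional[str]:
    """Return the text that follows the first `delimiter` on the same line."""
    match = re.search(re.escape(delimiter) + r"\s*(.+)", text)
    return match.group(1).strip() if match else None


# Made-up sample values for illustration.
print(extract_first_number("Rated 4.5 out of 5"))   # number inside a sentence
print(text_after("SKU: AB-1234", ":"))              # text after a delimiter
print(extract_first_number("No rating yet"))        # no match returns None
```

Returning None when nothing matches makes it easy to spot rows where the pattern did not apply, which is worth checking when values vary from page to page.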

Example: extract a value from an attribute

Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image attribute such as alt="5 stars", or in the src value of the image. A typical workflow is:
1. Select the element. Select the element that contains the value you need, such as a rating icon or text block.
2. Choose the source value. Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.
3. Customize the field. Open the field menu and choose Customize Field or Clean Data.
4. Extract the target value. Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.
5. Preview before saving. Confirm that the preview shows the expected value before running the task.
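The extraction step in this workflow can be sketched in Python as follows. This is a hedged illustration, not Octoparse's implementation: the helper name and the sample HTML are hypothetical, and the pattern assumes attribute values are wrapped in double quotes.

```python
import re
from typing import Optional


def attribute_value(outer_html: str, name: str) -> Optional[str]:
    """Return the value of attribute `name` from an HTML snippet.

    Assumes double-quoted attribute values. The lookbehind keeps
    `src` from matching inside a longer name such as `data-src`.
    """
    pattern = r'(?<![\w-])' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, outer_html)
    return match.group(1) if match else None


# Hypothetical OuterHTML value for a rating icon.
html = '<img class="rating" src="/img/stars-5.png" alt="5 stars">'

alt = attribute_value(html, "alt")            # full attribute text
rating = re.search(r"\d+", alt).group(0)      # keep only the number
print(alt, rating)
```

If the page exposes the attribute directly through Other Attributes, selecting it there is simpler than matching against the raw HTML.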

Limits of field refinement

Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a text block appears as several lines on the page but is a single element in the page source, Octoparse may treat it as one field, and you may not be able to split it into separate fields by visual line breaks alone. Check the source structure, then use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.
If a value cannot be separated because the website stores it as a single element, data cleaning may not be enough. You may need to adjust the selected element, inspect the HTML, or extract a different source value.
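The following Python sketch illustrates why the source structure matters. When the HTML of a single element contains explicit markers such as br tags, a pattern can split on them; when the extracted text has no marker, there is nothing to split on. The helper name and sample values are hypothetical, and the tag-stripping pattern is a simplification that assumes well-formed tags.

```python
import re
from typing import List


def split_on_br(html: str) -> List[str]:
    """Split an HTML fragment on <br> tags and strip remaining tags."""
    parts = re.split(r"<br\s*/?>", html, flags=re.IGNORECASE)
    cleaned = [re.sub(r"<[^>]+>", "", part).strip() for part in parts]
    return [part for part in cleaned if part]


# One element whose lines are separated by <br> in the source.
with_markers = "<p>Acme Corp<br>123 Main St<br/>Springfield</p>"
print(split_on_br(with_markers))

# Text with no separator in the source stays in one piece.
without_markers = "Acme Corp 123 Main St Springfield"
print(split_on_br(without_markers))
```

In the second case the lines seen on the page come from styling rather than from the source, so extracting a different source value or selecting a different element is the more reliable fix.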

Best practices

  • Refine fields after confirming the correct element is selected.
  • Use simple cleaning steps before trying RegEx.
  • Preview each step before saving.
  • Keep field names clear so exported data is easy to understand.
  • Avoid over-cleaning if the downstream system can handle formatting later.
  • Document complex RegEx patterns so teammates can maintain the task.