Refine data

Raw scraped data is rarely ready to use as-is. Octoparse lets you refine extracted fields before export, so you can clean text, reshape values, remove unwanted characters, and extract only the part of a field you need. Use data refinement when you want your exported data to be closer to the final format required by a spreadsheet, database, workflow, or downstream system.

What you can refine

Clean field values

Remove, replace, trim, or reformat text before it is exported.

Extract part of a value

Use matching rules or regular expressions to keep only the text pattern you need.

Standardize output

Add prefixes, remove repeated text, or normalize inconsistent values across rows.

Troubleshoot field structure

Check whether a value can be split into separate fields based on the page’s source structure.

When to refine data

Refine data when the field is correct, but the value needs cleanup. Common examples include:

Removing labels such as `Price:` or `Rating:`
Extracting a number from a longer text string
Replacing unwanted characters or spaces
Adding a prefix or suffix
Matching a specific pattern with RegEx
Cleaning HTML-derived values such as image attributes or text embedded in source code

If the wrong element is being extracted, fix the field selection first. If the right element is selected but the value format is messy, use refinement rules.

Access Clean Data

You can refine a field from the field editor.

Select the field

In the data preview or field list, select the extracted field you want to clean.

Open the field menu

Click the ... menu for that field.

Choose Clean Data

Select Clean Data to open the data cleaning workflow.

Add a cleaning step

Click Add Step, then choose the operation you want to apply.

Preview the result

Check the preview value before saving the rule.

Common refinement operations

Operation	Use it for
Replace	Replace a specific string with another value, or remove it by replacing it with an empty string.
Add prefix or suffix	Add fixed text before or after each extracted value.
Match with RegEx	Extract text that follows a repeatable pattern.
Trim or remove characters	Clean extra spaces, symbols, or unwanted text fragments.
Extract from HTML or attributes	Pull values from raw HTML, image URLs, `alt`, `src`, or similar attributes.

A “string” is a sequence of characters, which can include letters, digits, spaces, symbols, and punctuation marks. An empty string contains no characters, so replacing a matched string with an empty string removes it from the value.
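To make the Replace, trim, and Add prefix operations concrete, here is a minimal Python sketch of the same transformations applied to a few sample values. It is not Octoparse code and does not describe how Octoparse implements these steps. The function name `clean_price`, the sample rows, and the `USD ` prefix are assumptions chosen for illustration.

```python
def clean_price(raw: str) -> str:
    """Mimic three cleaning steps on one value (illustrative only)."""
    value = raw.replace("Price:", "")  # replace the label with an empty string
    value = value.strip()              # trim leading and trailing whitespace
    return "USD " + value              # add a fixed prefix


# Hypothetical extracted values with inconsistent spacing.
rows = ["Price: 19.99", "  Price: 5.00  ", "Price:120"]
cleaned = [clean_price(r) for r in rows]
print(cleaned)  # ['USD 19.99', 'USD 5.00', 'USD 120']
```

The order of the steps matters: the label is removed first, so the trim step also removes the space that followed the label.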

Use RegEx for pattern-based cleanup

Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace rules. Use RegEx when you need to:

Extract a number from a sentence
Match text before or after a known delimiter
Keep only part of an HTML attribute
Remove repeating patterns across many rows
Clean values that vary slightly from page to page

You can access the RegEx tool from the Clean Data workflow, or from the Tools area in the left sidebar. If you are not familiar with RegEx syntax, use the built-in RegEx tool to generate a pattern from examples instead of writing the expression manually.

Use RegEx only when simpler cleaning rules are not enough. For straightforward cleanup, operations such as replace, trim, prefix, and suffix are easier to maintain.
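As an illustration of the first two cases in the list above, the following Python sketch extracts a number from a sentence and the text after a known delimiter. The patterns, the `SKU:` delimiter, and the helper names are assumptions for this example. The RegEx tool in Octoparse may use a different regular expression flavor, so test any pattern in the tool's preview before relying on it.

```python
import re

# A number with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
# The run of non-space characters that follows the "SKU:" delimiter.
AFTER_SKU = re.compile(r"SKU:\s*(\S+)")


def first_number(text):
    """Return the first number found in text, or None if there is none."""
    match = NUMBER.search(text)
    return match.group(0) if match else None


def sku(text):
    """Return the token after 'SKU:', or None if the delimiter is missing."""
    match = AFTER_SKU.search(text)
    return match.group(1) if match else None


print(first_number("Only 1,250 items left in stock"))  # 1,250
print(first_number("Rated 4.5 out of 5"))              # 4.5
print(sku("Brand: Acme | SKU: AB-1029"))               # AB-1029
```

Both helpers return `None` when nothing matches, which makes rows that do not follow the expected pattern easy to spot.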

Example: extract a value from an attribute

Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image's `alt` attribute, such as `alt="5 stars"`, or in another attribute such as `src`. A typical workflow is:

Select the element

Select the element that contains the value you need, such as a rating icon or text block.

Choose the source value

Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.

Customize the field

Open the field menu and choose Customize Field or Clean Data.

Extract the target value

Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.

Preview before saving

Confirm that the preview shows the expected value before running the task.
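The last two steps can be sketched in Python as a pattern applied to an OuterHTML value. The sample markup, the class name, and the helper `rating_from_outer_html` are hypothetical. The pattern assumes the rating is written as `alt="N stars"`, which will not hold for every site.

```python
import re

# Capture the number inside alt="5 stars" or alt="4.5 stars" (or "1 star").
ALT_RATING = re.compile(r'alt="(\d+(?:\.\d+)?) stars?"')


def rating_from_outer_html(outer_html):
    """Return the rating as a float, or None if no matching alt text exists."""
    match = ALT_RATING.search(outer_html)
    return float(match.group(1)) if match else None


sample = '<img class="rating" src="/img/stars-5.png" alt="5 stars">'
print(rating_from_outer_html(sample))  # 5.0
```

When the attribute itself can be selected as the source value, that is usually simpler than matching against the full HTML, because the pattern then only has to handle the text `5 stars`.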

Limits of field refinement

Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a text block appears as several lines on the page but is a single element in the page source, Octoparse may treat it as one field, and you may not be able to split it into separate fields by visual line breaks alone. Check the source structure, then use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.

If a value cannot be separated because the website stores it as a single element, data cleaning may not be enough. You may need to adjust the selected element, inspect the HTML, or extract a different source value.
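The sketch below shows why inspecting the HTML helps. It assumes a hypothetical element whose visual line breaks are `<br>` tags inside a single `<p>`, and splits the exported OuterHTML value in Python after export. The markup and the helper `split_on_br` are illustrative assumptions, and this is post-processing outside Octoparse rather than a built-in refinement step.

```python
import re


def split_on_br(outer_html):
    """Split one element's HTML into lines at <br> tags (illustrative only)."""
    # Drop the element's own opening and closing tags.
    inner = re.sub(r"^<[^>]+>|</[^>]+>$", "", outer_html.strip())
    # Split on <br>, <br/>, or <br />, in any letter case.
    parts = re.split(r"<br\s*/?>", inner, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


address = "<p>Acme Corp<br>123 Main St<br/>Springfield</p>"
print(split_on_br(address))  # ['Acme Corp', '123 Main St', 'Springfield']
```

If the lines are separated by something other than `<br>` tags, such as nested elements, selecting those inner elements as separate fields is likely a better fix than splitting text.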

Best practices

Refine fields after confirming the correct element is selected.
Use simple cleaning steps before trying RegEx.
Preview each step before saving.
Keep field names clear so exported data is easy to understand.
Avoid over-cleaning if the downstream system can handle formatting later.
Document complex RegEx patterns so teammates can maintain the task.
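One way to document a complex pattern is to keep an annotated copy alongside the task notes, together with a few sample inputs. The Python sketch below uses `re.VERBOSE` for inline comments. The price pattern and the group names are assumptions for illustration, and the Octoparse RegEx tool may not accept verbose-style comments, so treat this as a note for teammates rather than a pattern to paste in unchanged.

```python
import re

PRICE = re.compile(
    r"""
    (?P<currency>[$€£])          # currency symbol
    \s*                          # optional whitespace after the symbol
    (?P<amount>\d+(?:\.\d{2})?)  # whole amount, optionally with two-digit cents
    """,
    re.VERBOSE,
)

# Sample inputs double as documentation of what the pattern should match.
match = PRICE.search("Now only $ 19.99 each")
print(match.group("currency"), match.group("amount"))  # $ 19.99
```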
