Refining data with regex

Raw scraped data is rarely clean enough to use directly. Once your selectors have located the values you want, the strings they pull off the page often arrive with surrounding noise. A product price might come with currency symbols, whitespace, and trailing text. A phone number might appear in three different formats across the same site. Dates could be written as “May 15, 2026” in one place and “2026-05-15” in another. Regular expressions (regex) are the standard tool for cleaning, extracting, and reshaping this kind of messy text into consistent, structured fields.

Regex works by defining a pattern that describes the shape of the text you want. The engine scans a string, finds matches, and lets you extract or replace them. You don’t need to master every edge case of regex syntax to get real value from it: a handful of common patterns covers the vast majority of scraping cleanup tasks.
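As a minimal sketch of that scan, match, and extract-or-replace cycle, the snippet below uses Python's standard re module. The sample strings and variable names are invented for illustration and are not taken from any particular site.

```python
import re

# Hypothetical scraped string with surrounding noise.
raw = "  Price: $1,299.99 (incl. tax)  "

# search() returns the first match object, or None if nothing matches.
match = re.search(r"\d[\d,]*\.\d{2}", raw)
price_text = match.group(0) if match else None

# sub() replaces every match; here it drops the thousands separators.
price = float(re.sub(r",", "", price_text)) if price_text else None

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", "3 items, 12 reviews")

print(price_text, price, numbers)
```

Checking for None before calling group() matters in scraping code, since a page that changes layout will usually produce no match rather than an error.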

Common patterns for cleanup

A small library of reusable patterns handles most real-world cases:

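Prices. [\d,.]+ matches a run of digits, commas, and periods, which lets you pull the numeric part out of a string and leave currency symbols and surrounding text behind. Because it also matches a stray comma or period on its own, the stricter (\d{1,3}(?:,\d{3})*(?:\.\d{2})?) is a better fit for standard formats like 1,299.99. Note that the stricter pattern expects comma-grouped thousands, so an ungrouped value such as 12999.50 is only partially matched.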
Phone numbers. \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} handles common US formats — with or without parentheses, separated by dashes, dots, or spaces.
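Emails. [\w.+-]+@[\w-]+\.[\w.]+ catches most standard email addresses, one of the best-known regex use cases. It is deliberately loose: it does not validate addresses, and because the final character class includes a period, it can pick up sentence-ending punctuation that directly follows an address.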
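Dates. \d{4}-\d{2}-\d{2} matches ISO format such as 2026-05-15; \w+ \d{1,2}, \d{4} handles May 15, 2026 style strings. Neither pattern checks that the month or day is valid, so parse the match with a date library if you need that guarantee.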
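HTML tag stripping. Replacing matches of <[^>]+> with an empty string removes tags from a fragment. It is a quick fix for simple markup, not a substitute for an HTML parser.
Whitespace normalization. Replacing matches of \s+ with a single space collapses runs of whitespace, including tabs and newlines.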
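The patterns above could be wired into a few small helpers along these lines. This is a sketch using Python's re and datetime modules, not a definitive implementation: the helper names (clean_text, first_match, parse_price, normalize_date) and the sample inputs are hypothetical. Replacing tags with a space rather than an empty string is also an assumption made here, so that adjacent words are not glued together.

```python
import re
from datetime import datetime

# Patterns taken from the list above.
PRICE = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
LONG_DATE = re.compile(r"\w+ \d{1,2}, \d{4}")
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_text(fragment):
    """Replace tags with a space, then collapse whitespace runs."""
    without_tags = TAG.sub(" ", fragment)
    return WHITESPACE.sub(" ", without_tags).strip()


def first_match(pattern, text):
    """Return the first matching substring, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


def parse_price(text):
    """Extract a US-formatted price and convert it to a float."""
    found = first_match(PRICE, text)
    return float(found.replace(",", "")) if found else None


def normalize_date(text):
    """Return an ISO date string from either supported format, or None."""
    iso = first_match(ISO_DATE, text)
    if iso:
        return iso
    long_form = first_match(LONG_DATE, text)
    if long_form:
        try:
            # %B expects an English month name under the default C locale.
            return datetime.strptime(long_form, "%B %d, %Y").date().isoformat()
        except ValueError:
            return None
    return None


print(parse_price("Now only $1,299.99 each"))
print(first_match(PHONE, "Call (555) 123-4567 today"))
print(first_match(EMAIL, "Contact sales@example.com for details"))
print(normalize_date("Posted on May 15, 2026"))
print(clean_text("<p>  Hello,\n <b>world</b> </p>"))
```

Compiling each pattern once at module level keeps the definitions in one place, which makes them easier to test and reuse across fields.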

Practical tips

Test against real samples. Real pages have quirks idealized examples don’t. Run the pattern against output from the actual target site before trusting it.
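Use non-greedy quantifiers (*?, +?). By default, * and + are greedy: when both a short and a long match are possible, they take the longest. That’s often not what you want, for example when you mean to stop at the first closing tag rather than the last. Appending ? makes the quantifier take the shortest match instead.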
Use capture groups. Parentheses () let you extract just the part you care about from a larger match — useful when you need to match context around the data but only keep the data.
Watch locale edge cases. A price regex built for US formatting (1,299.99) will misbehave on European numbers (1.299,99) where the comma and period swap roles. The same applies to dates, phone numbers, and decimal notations.
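The greedy, capture-group, and locale tips can be illustrated with a short Python sketch. The HTML snippet, the label string, and the parse_number helper are all invented for this example. The decimal_sep parameter is one hypothetical way to handle the comma and period swap, and it assumes you already know which convention the target site uses.

```python
import re

html = '<span class="a">first</span><span class="b">second</span>'

# Greedy: .* runs to the LAST "<" it can reach.
greedy = re.search(r">(.*)<", html).group(1)
# Non-greedy: .*? stops at the FIRST "<".
lazy = re.search(r">(.*?)<", html).group(1)

# Capture group: match the "Price:" context, keep only the number.
label = "Price: $1,299.99 (was $1,499.00)"
current = re.search(r"Price:\s*\$([\d,]+\.\d{2})", label).group(1)


def parse_number(text, decimal_sep="."):
    """Parse the first number in text, given the decimal separator in use."""
    thousands_sep = "," if decimal_sep == "." else "."
    m = re.search(r"\d[\d.,]*", text)
    if not m:
        return None
    # Drop trailing punctuation such as a sentence-ending period.
    digits = m.group(0).rstrip(".,")
    digits = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(digits)


print(greedy)
print(lazy)
print(current)
print(parse_number("1,299.99"))
print(parse_number("1.299,99 €", decimal_sep=","))
```

Passing the separator explicitly avoids guessing the locale from the string itself, which is ambiguous for values like 1,299 or 1.299.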

How Octoparse approaches it

Regex has a well-earned reputation for being difficult to write and harder to read. A complex pattern can look like line noise, and a small mistake can silently match the wrong data or miss valid entries. The cost of getting it wrong isn’t a crash; it’s silently corrupted data you only notice downstream, sometimes much later.

Octoparse builds regex support directly into the extraction workflow. Users can apply regex transformations to any scraped field as a post-processing step, cleaning and reshaping data before export without writing standalone scripts. The pattern, the field it applies to, and the cleaned output all live inside the same task definition.

For users who aren’t comfortable writing regex from scratch, Octoparse also offers AI-assisted regex generation: describe what you need in plain language (extract the dollar amount, get the phone number, strip the trailing newline) and the AI produces a working pattern. That pattern can then be tested against the actual scraped data within the platform; you see what it matches, and adjust if needed. The AI handles the syntax; the user validates the result.

This addresses the two real costs of regex in a scraping workflow: writing patterns from a blank slate, and verifying they do what you intended on real data. Both are absorbed into the visual editor rather than left as separate engineering tasks.

The takeaway

For most scraping projects, you won’t need deeply complex regex. A small library of reusable patterns for prices, emails, dates, phone numbers, and whitespace cleanup will cover the majority of cleaning tasks. Keep the patterns as simple as they can be while still matching accurately, document them if you plan to maintain the scraper long-term, and lean on AI generation when the syntax gets unwieldy. The goal is clean, consistent output — regex is the means, not the end.
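If you maintain patterns in your own scripts, one way to document them is Python's re.VERBOSE flag, which allows whitespace and comments inside the pattern. The sketch below rewrites the US phone pattern from earlier in that style. The constant name US_PHONE and the sample numbers are hypothetical.

```python
import re

# Same phone pattern as above, written with inline documentation.
US_PHONE = re.compile(
    r"""
    \(?\d{3}\)?    # area code, optionally wrapped in parentheses
    [-.\s]?        # optional separator: dash, dot, or whitespace
    \d{3}          # next three digits
    [-.\s]?        # optional separator
    \d{4}          # last four digits
    """,
    re.VERBOSE,
)

# Keep a few known-good and known-bad samples next to the pattern.
should_match = ["(555) 123-4567", "555.123.4567", "555 123 4567", "5551234567"]
should_not_match = ["555-12-4567", "12345"]

for sample in should_match:
    assert US_PHONE.fullmatch(sample), sample
for sample in should_not_match:
    assert US_PHONE.fullmatch(sample) is None, sample
```

Keeping sample strings beside the pattern gives a future maintainer a quick way to confirm that an edit has not changed what it matches.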
