Regex is a fantastic tool if you know how to use it. Sadly, it’s not the easiest thing in the world to read. With the proper expression, though, you can automate a lot of tedious text work. Regex is ideal for searching, matching, and manipulating text, as in these eight cases.
Why Use Regex?
Regex (short for REGular EXpressions) makes it easy to clean up and standardize text. While it’s versatile as an editing tool, it has its limits in application. The best way to think about regex is as a super-powerful wildcard search or search-and-replace that you can use whenever you need it (where it’s supported, of course). Microsoft Excel recently started supporting regex, making it a useful skill to learn.
But what exactly can you use a regex for? Let’s look at eight common examples.
Some programs don’t support regex, and others only support it partially. Apps like Notepad++ are perfect for running regex searches with their find and replace functions.
Fix Copy-Pasted PDF Text
We’ve all been there: you copy some text from a PDF and paste it into your own document, only to have weird spacing and artifacts come along with it. But did you know that regex could help with that? Enter this pattern into your find-and-replace function:
Find: [^\S\r\n]{2,}|\s*\r?\n\s*\r?\n\s*
Replace: \n
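This regex will make your text editor:
- Find any run of two or more spaces or tabs
- Reduce multiple line breaks to a single line break
- Get rid of stray spaces around those line breaks
Because both halves of the pattern share one replacement, runs of spaces also become line breaks. If you’d rather collapse them into a single space, run the part before the | with a space as the replacement, then the part after it with \n. Either way, this should turn whatever text you copied into something useful.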
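If you’d rather script the cleanup than use a find-and-replace dialog, here’s a minimal Python sketch. It reuses the two halves of the pattern above as separate passes. The function name clean_pdf_text and the extra trailing-space pass are my own additions for illustration, not part of the original pattern.

```python
import re

# Two or more horizontal whitespace characters (spaces or tabs), never line breaks.
MULTI_SPACE = re.compile(r"[^\S\r\n]{2,}")
# A run of two line breaks, plus any whitespace around it.
MULTI_BREAK = re.compile(r"\s*\r?\n\s*\r?\n\s*")
# Assumption: an extra pass for spaces or tabs left at the end of a line.
TRAILING = re.compile(r"[^\S\r\n]+(?=\r?\n|$)")


def clean_pdf_text(text: str) -> str:
    """Collapse space runs, squeeze blank lines, and trim line endings."""
    text = MULTI_SPACE.sub(" ", text)    # runs of spaces become one space
    text = MULTI_BREAK.sub("\n", text)   # blank-line gaps become one line break
    text = TRAILING.sub("", text)        # drop spaces left before a line break
    return text.strip()


print(clean_pdf_text("Hello    world  \n\n\n  Next   line "))
```

With that sample input, the result is two tidy lines: "Hello world" and "Next line".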
Bulk-Renaming Downloaded Files
In more than one unfortunate incident, I downloaded a set of files, and they came with odd names appended. If you have a bulk-rename tool like Advanced Renamer, you can use a regex to clean up those filenames into something more recognizable. If you have a series of files with symbols all over the place, you can use your renamer with this regex:
Find: [^a-zA-Z0-9-.]
Replace: -
This keeps letters, numbers, periods, and existing dashes as they are but replaces everything else with dashes.
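The same idea works in a short script if you don’t have a renaming tool handy. This is a sketch rather than a finished renamer: clean_filename is a hypothetical helper, and collapsing repeated dashes is an extra step I’ve assumed you’d want. It only builds the new name and doesn’t touch any files on disk.

```python
import re

# Anything that is not a letter, digit, period, or dash.
UNSAFE = re.compile(r"[^a-zA-Z0-9.-]")


def clean_filename(name: str) -> str:
    """Replace unsafe characters with dashes and tidy up the result."""
    cleaned = UNSAFE.sub("-", name)
    # Assumption: collapse runs of dashes so "a - b" doesn't become "a---b".
    cleaned = re.sub(r"-{2,}", "-", cleaned)
    return cleaned.strip("-")


print(clean_filename("Holiday Photos [2023] #1.jpg"))
```

Here the sample name comes out as Holiday-Photos-2023-1.jpg.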
Currency Formatting
Let’s say you have a file with a ton of currency in different formats. You don’t want to manually go through each of those currencies and fix it to the format you want, especially if they’re in multiple weird formats. Here’s what you’ll use for your regex:
Find: \$?\s*(\d+(?:\,\d{3})*(?:\.\d{2})?)\s*(?:USD|dollars?)?
Replace: $\1
This regex will scrub through your currency file, strip stray spaces and trailing labels like USD or dollars, and leave a dollar sign followed by the amount. It keeps the cents where they’re present, but it won’t add missing decimal places.
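If you need every amount to end up with exactly two decimal places, a script can do the padding that a plain replace can’t. The sketch below assumes each value sits in its own field (one amount per string). The helper name normalize_currency is hypothetical, and rejecting values that don’t fit the pattern is my own design choice.

```python
import re
from decimal import Decimal

# Same pattern as above, minus the unnecessary backslash before the comma.
CURRENCY = re.compile(r"\$?\s*(\d+(?:,\d{3})*(?:\.\d{2})?)\s*(?:USD|dollars?)?")


def normalize_currency(value: str) -> str:
    """Return a single amount formatted like $1,234.00."""
    match = CURRENCY.fullmatch(value.strip())
    if match is None:
        # Assumption: fail loudly rather than guess at odd input.
        raise ValueError(f"not a recognizable amount: {value!r}")
    amount = Decimal(match.group(1).replace(",", ""))
    return f"${amount:,.2f}"


for raw in ["1,234 USD", "$ 50 dollars", "$99.99"]:
    print(raw, "->", normalize_currency(raw))
```

Using Decimal instead of float avoids rounding surprises with money, and fullmatch makes sure the whole field is an amount rather than just part of it.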
Standardize Date Formats
I’ve been in several situations where I’ve had to extract dates into a standardized format, like moving them from text into a database. When you’re faced with something like this, you can use a regex to find the dates so they can be converted into a simplified format (in this case, YYYY-MM-DD). The regex for this would be:
Find: \b(?:\d{1,2}[-/\.]\d{1,2}[-/\.]\d{2,4})|(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[\s.-]?\d{1,2}(?:st|nd|rd|th)?[\s,.-]?\d{2,4})\b
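This finds dates written in numeric form (like 3/7/2024) or with a month name (like Mar 7 2024). Since there’s no replacement string, the pattern only locates them; rewriting each match as YYYY-MM-DD takes a script or a tool that can reorder the captured pieces.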
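Here’s a rough Python sketch of the conversion step. It makes several assumptions that you should check against your own data: numeric dates are in US month/day/year order, years have four digits, and anything that isn’t a real calendar date is left untouched. The two patterns are simplified variants of the one above with capture groups added, and standardize_dates is a hypothetical name.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Assumption: numeric dates are month/day/year with a four-digit year.
NUMERIC = re.compile(r"\b(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\b")
# Month name, day (optionally with st/nd/rd/th), then a four-digit year.
NAMED = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[\s.-]?"
    r"(\d{1,2})(?:st|nd|rd|th)?[\s,.-]*(\d{4})\b"
)


def _iso(year: int, month: int, day: int, original: str) -> str:
    """Return YYYY-MM-DD, or the original text if the date is impossible."""
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return original


def standardize_dates(text: str) -> str:
    text = NUMERIC.sub(
        lambda m: _iso(int(m[3]), int(m[1]), int(m[2]), m[0]), text
    )
    text = NAMED.sub(
        lambda m: _iso(int(m[3]), MONTHS[m[1].lower()], int(m[2]), m[0]), text
    )
    return text


print(standardize_dates("Due 3/7/2024 and Jan 5, 2024; also March 3rd 2021."))
```

That sample prints "Due 2024-03-07 and 2024-01-05; also 2021-03-03." If your source uses day/month order, swap the first two groups in the numeric pass.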
Strip HTML Tags
When you copy something from the web, it sometimes has HTML tags attached to it. Luckily, regex has a handy method for stripping HTML tags from a document:
Find: <[^>]+>|&[^;]+;|\s*\n\s*
Replace: \n
This regex will sift through the document, find the HTML tags, and wipe them, along with extra line breaks and HTML entities (like ‘&amp;’). If you only need to remove the tags themselves, you can replace them with nothing using this simpler regex:
Find: <[^>]+>
Replace: (empty)
However, if your document uses weird HTML tags, formatting, or entities, you might encounter problems. The first regex is a general cleanup that also handles entities and line breaks, while the second only targets the tags themselves.
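In a script, you can combine the tag-stripping regex with Python’s standard html.unescape function. Note the difference in approach: this sketch decodes entities into the characters they stand for instead of deleting them, which is my own choice and not what the patterns above do. The function name strip_html is hypothetical.

```python
import re
from html import unescape

# Anything that looks like a tag: "<", then non-">" characters, then ">".
TAG = re.compile(r"<[^>]+>")
# One or more line breaks, plus any whitespace around them.
LINE_BREAKS = re.compile(r"\s*\n\s*")


def strip_html(markup: str) -> str:
    """Remove tags, decode entities, and squeeze line breaks."""
    text = TAG.sub("", markup)
    text = unescape(text)              # "&amp;" becomes "&", and so on
    text = LINE_BREAKS.sub("\n", text)
    return text.strip()


print(strip_html("<p>Fish &amp; chips</p>\n\n<p>Tea</p>"))
```

This is fine for quick cleanup of pasted snippets. For messy or nested markup, a real HTML parser will be more reliable than any regex.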
Extract URLs From Text
Sometimes, you have a document with URLs buried inside the text. Pulling out those URLs shouldn’t require a manual search through the entire document, and regex will save you time. Most URLs in running text start with http or https, and we can use that knowledge in our regex:
Find: (https?:\/\/)?([\w\-]+(\.[\w\-]+)+\.?(:\d+)?(\/\S*)?)|((www\.)?[\w\-]+(\.[\w\-]+)+\.?(:\d+)?(\/\S*)?)
While this extractor will find your URLs, it has a few issues. It can miss malformed URLs, and because the http or https prefix is optional in the pattern, it can also pick up text that merely looks like a domain, such as a filename. It isn’t designed to collect email addresses either, but there’s another pattern that you can use specifically for emails.
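If you only care about links that carry an explicit http or https prefix, a much simpler pattern is easier to reason about. The sketch below deliberately swaps in that stricter pattern instead of the one above, so bare domains like www.example.com won’t be found. The name extract_urls and the punctuation trimming are assumptions for illustration.

```python
import re

# Assumption: only match URLs with an explicit http:// or https:// prefix.
URL = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> list[str]:
    """Return every http(s) URL, minus trailing sentence punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL.findall(text)]


sample = "See https://example.com/docs, or http://test.org:8080/a?b=1."
print(extract_urls(sample))
```

For the sample sentence, this returns https://example.com/docs and http://test.org:8080/a?b=1 without the comma and period that followed them.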
Extract Email Addresses
One of the most common problems I encounter when doing data scraping or email list validation is pulling email addresses out of a text file. Emails typically follow a pattern that makes them easy for regex to find. For an email search function, we’ll do something like this:
Find: (?:[a-z0-9!
This might look like a lot, but it basically searches for anything that has the pattern
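For a quick extraction job, a far simpler pattern is often good enough. The Python sketch below uses a deliberately simplified stand-in, not the full pattern shown above, so it will miss some unusual but valid addresses. The name extract_emails is hypothetical.

```python
import re

# Simplified assumption: local part, "@", then a dotted domain name.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return each address once, in the order it first appears."""
    return list(dict.fromkeys(EMAIL.findall(text)))


sample = (
    "Contact jane.doe@example.com or BOB@mail.example.org. "
    "jane.doe@example.com again"
)
print(extract_emails(sample))
```

The dict.fromkeys trick drops repeats while keeping the original order, so the sample yields two addresses rather than three.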
Format Social Media Handles
When moving data from a form to or from a database, you sometimes have to fix some formatting problems. An excellent case in point is social media handles. The regex for cleaning them up is:
Find: (?:^|[^@\w])[@\s]*(\w{1,30})
Replace: @$1
This covers the general case for formatting social media handles, but each platform has its own rules for usernames. To enforce those platform-specific rules, you’d need to wrap this regex in a script (in Python, for example). Even so, debugging your Python code with regex might be a bit more complicated.
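As a starting point for that kind of script, here’s a minimal sketch that normalizes one handle at a time and rejects anything that doesn’t fit. It assumes each form field holds a single handle. The 30-character limit comes from the pattern above, while the name normalize_handle and the choice to return None for bad input are my own.

```python
import re
from typing import Optional

# Optional leading "@" signs and spaces, then 1 to 30 word characters.
HANDLE = re.compile(r"[@\s]*(\w{1,30})\s*")


def normalize_handle(raw: str) -> Optional[str]:
    """Return the handle as "@name", or None if the field doesn't fit."""
    match = HANDLE.fullmatch(raw)
    if match is None:
        return None  # Assumption: the caller decides how to report bad input.
    return "@" + match.group(1)


for raw in ["  @@jane_doe ", "john", "jane doe"]:
    print(repr(raw), "->", normalize_handle(raw))
```

The first two inputs come back as @jane_doe and @john, and the third is rejected because of the space in the middle. Platform-specific checks, such as a shorter length limit, could be added after the match. Keep in mind that \w in Python also matches non-English letters, which some platforms don’t allow.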
Regex Is Not a Silver Bullet
There’s a saying that if you have a hammer, every problem starts to look like a nail. As an experienced coder, I can say that’s 100% true regarding regex. There are several places online that can help you learn regex, but you should use this knowledge sparingly.
Including a regex in your code can complicate your debugging process. Regexes don’t lend themselves to commenting either, making it more difficult to share code with others. Finally, they are part of an automation system; if the source data is bad, the results will also be bad. While regex is a powerful tool, it’s better for some things than for others.