The Magic of Regular Expressions

What are Regular Expressions?

A regular expression is basically a pattern that text is matched against. For example, say I specify a pattern: a three-letter word followed by two spaces and then two digits. Text matching that would be ‘xzc  25’ or ‘uzu  88’, etc. Some uses for Regular Expressions (Regex) might be validating email addresses, phone numbers or URLs. You can even use Regex to capture (parse) different pieces of information from text.
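To make that concrete, here’s a minimal sketch of the ‘three letters, two spaces, two digits’ pattern in Python. Python is just my pick for illustration, the name matches_example is made up, and I’m assuming ‘letter’ means a plain A-Z or a-z.

```python
import re

# Three letters, exactly two spaces, then two digits.
# Assumption: "letter" means an ASCII letter (A-Z or a-z).
EXAMPLE_PATTERN = re.compile(r"[A-Za-z]{3} {2}\d{2}")


def matches_example(text: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return EXAMPLE_PATTERN.fullmatch(text) is not None


print(matches_example("xzc  25"))  # True
print(matches_example("uzu  88"))  # True
print(matches_example("uzu 88"))   # False: only one space
```

fullmatch is used here so that extra text before or after the pattern counts as a failure, which is usually what you want when validating.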

So... how do you use them?

I’ll use an actual ‘problem’ that I used Regex for as an example. So I was making an FtpClient and I had to get file information from what was basically a long continuous string containing many different sets of information. It looked like this:

-rw-r--r-- 1 redacted redacted 3486 Apr 3 2012 .bashrc
-rw-r--r-- 1 redacted redacted 751 Dec 2 2013 .cshrc
-rw-r--r-- 1 redacted redacted 248 Dec 2 2013 .login
drwx---r-x 7 redacted redacted 8 Sep 4 17:27 public_html
-rw-r--r-- 1 redacted redacted 20 Sep 5 09:48 test.txt
drwx------ 2 redacted redacted 5321 Sep 5 00:08 www_logs

So each line corresponds to information on a specific file or folder. I added the multiline flag, which makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string, so each line can be matched on its own. This is the pattern that I used to extract the data from the above lines:

What’s this weird text with symbols?

Seems intimidating, right? It really isn’t. The way Regex works is that you use certain symbols to indicate the type of content you’re expecting.

  • \d – Digits
  • \w – Word characters (letters, digits and underscore)
  • . – Any character (except a newline, by default)
  • \s – Whitespace (spaces, tabs, newlines)
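Here’s a quick way to see what each of those symbols picks up. This is a small Python sketch; the sample string is made up for the demonstration.

```python
import re

sample = "id_42 costs $7"

digits = re.findall(r"\d", sample)    # every single digit
words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)    # every whitespace character
anything = re.findall(r".", "a\nb")   # any character except the newline

print(digits)    # ['4', '2', '7']
print(words)     # ['id_42', 'costs', '7']
print(spaces)    # [' ', ' ']
print(anything)  # ['a', 'b']
```

Notice that \w treats the underscore as part of a word but stops at the dollar sign.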

Then you can add modifiers (usually called quantifiers) that say how many times the thing right before them should appear.

  • * – 0 or more
  • + – 1 or more
  • ? – 0 or 1
  • {x} – Exactly x times
  • {x,y} – Between x and y times (inclusive)
  • {x,} – At least x times
  • x|y – Either x or y (a choice between two patterns)
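Each of these can be tried out in a line or two. Below is a small Python sketch; the helper name full is made up, and it just reports whether the whole string fits the pattern.

```python
import re


def full(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(full(r"ab*", "a"))          # True: * allows zero b's
print(full(r"ab+", "a"))          # False: + needs at least one b
print(full(r"colou?r", "color"))  # True: the u is optional
print(full(r"\d{3}", "1234"))     # False: exactly three digits wanted
print(full(r"\d{2,4}", "1234"))   # True: between two and four digits
print(full(r"\d{2,}", "1"))       # False: at least two digits wanted
print(full(r"cat|dog", "dog"))    # True: either alternative will do
```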

After that you can group sections up into capture groups by surrounding them with parentheses, like (xy), so you can refer to them later if your Regex matches (and even within the pattern itself, but that’s a harder topic to cover now). Note: C# allows you to name groups using (?<GroupSomething>xyz) so you can reference them by name later if the pattern matches a text.
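As a rough illustration of groups, here’s a Python sketch. One difference to watch for: Python spells a named group (?P<name>...) where C# uses (?<name>...). The ‘size=3486’ string and the group names key and value are made up for the example.

```python
import re

# Python spells a named group (?P<name>...); C# writes it (?<name>...).
pair = re.fullmatch(r"(?P<key>\w+)=(?P<value>\d+)", "size=3486")

print(pair.group("key"))    # size
print(pair.group("value"))  # 3486
print(pair.group(1))        # size  (groups are also numbered from 1)

# Referring to a group inside the same pattern: \1 means
# "the same text that group 1 just captured", so this finds a doubled letter.
doubled = re.search(r"(\w)\1", "hello")
print(doubled.group(0))     # ll
```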

Now looking back to the Regex I wrote, I’ll break it down piece by piece. Compare this to one of the lines, -rw-r--r-- 1 redacted redacted 220 Apr 18 2010 .bash_logout:

  1. (?<dir>d|-) – I expect a ‘d’ or ‘-’ character. [-]
  2. (?<owner>.{3})(?<group>.{3})(?<public>.{3}) – I expect any 3 characters followed by another 3 followed by another 3. [rw-r--r--]
  3. \s+(?<files>\d) – Then some space followed by a single digit. [  1]
  4. \s+(?<user>\w+) – Then some more spaces followed by some text [   redacted]
  5. \s+(?<idk>\w+) – Even more spaces followed by some more text [   redacted]
  6. \s+(?<size>\d+) – Again, spaces followed by a number [  220]
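Stitched together, those six pieces can be run against a listing. The Python sketch below is only an approximation of the idea: it covers just the six pieces shown above (the date and file name parts are left out), the leading ^ anchor is my own addition so that the multiline flag has something to act on, and the group names are rewritten in Python’s (?P<name>...) spelling.

```python
import re

# A few lines in the same shape as the listing from the FTP server.
LISTING = """\
-rw-r--r-- 1 redacted redacted 3486 Apr 3 2012 .bashrc
drwx---r-x 7 redacted redacted 8 Sep 4 17:27 public_html
-rw-r--r-- 1 redacted redacted 20 Sep 5 09:48 test.txt
"""

# Only the six pieces broken down above; the rest of each line is ignored.
# Assumption: the ^ anchor is added here, it is not one of the six pieces.
ENTRY = re.compile(
    r"^(?P<dir>d|-)"
    r"(?P<owner>.{3})(?P<group>.{3})(?P<public>.{3})"
    r"\s+(?P<files>\d)"
    r"\s+(?P<user>\w+)"
    r"\s+(?P<idk>\w+)"
    r"\s+(?P<size>\d+)",
    re.MULTILINE,  # lets ^ match at the start of every line
)

entries = [m.groupdict() for m in ENTRY.finditer(LISTING)]
for entry in entries:
    print(entry["dir"], entry["owner"], entry["group"], entry["public"], entry["size"])
# - rw- r-- r-- 3486
# d rwx --- r-x 8
# - rw- r-- r-- 20
```

Without re.MULTILINE the ^ would only match at the very start of the string, so only the first entry would be found.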

You get the idea by now. Another way of using Regex is to validate things. Writing a validation Regex can be simple, but people generally try to challenge themselves into writing the shortest, most efficient Regex possible. Here’s one of my favourites, used for validating IP (IPv4) addresses:

/\b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9])\.){3}(?:(?:2([0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9]))\b/ig
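If you want to try that one out, here’s a hedged Python sketch. The pattern is copied as-is, minus the surrounding slashes and the trailing ig, which are delimiters and flags rather than part of the pattern. The helper name is_ipv4 is made up, and using fullmatch for validation is my choice: with a plain search the \b boundaries would happily pick an address out of a longer string.

```python
import re

# The IP address pattern from the text, split over two lines for readability.
IP_PATTERN = re.compile(
    r"\b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9])\.){3}"
    r"(?:(?:2([0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9]))\b"
)


def is_ipv4(text: str) -> bool:
    """Validate: the whole string must be one dotted address."""
    return IP_PATTERN.fullmatch(text) is not None


print(is_ipv4("192.168.1.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False: 256 is out of range
print(is_ipv4("1.2.3"))            # False: only three parts

# Searching instead of validating: pull an address out of longer text.
found = IP_PATTERN.search("ping 10.0.0.1 now")
print(found.group(0))              # 10.0.0.1
```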

How do I learn more about this thing?

If you want to explore the world of Regular Expressions and learn how to apply them to your projects, visit RegExr. And before you go: DO NOT, I REPEAT, DO NOT TRY TO PARSE HTML USING REGEX. Thanks for reading.

