Regexing Shit


Problem

I have a text file where I want to slide colons at the end of “columns” to the beginning of the next column. It is a text file, not a spreadsheet. It looks like this (actually this is it):

…
# OID                     short name          long name
1.3.6.1.4.1.311.20.2     :dom:                Domain Controller
1.3.6.1.4.1.311.21.1     :MsCaV:              Microsoft CA Version
1.3.6.1.4.1.311.10.3.4.1 :msEFSFR:            Microsoft EFS File Recovery
1.3.6.1.5.5.8.2.2        :iKEIntermediate:    IP security end entity
0.2.262.1.10.7.20        :nameDistinguisher:  
…

I could double click the whitespace to select it all, cut it and paste it on the other side of the colon. That’s too much motor coordination, which I don’t have: I’ve been known to lose my balance standing still in a windless environment and fall flat on my ass. I ain’t no ballet dancer.

Fortunately there’s regex (short for “regular expressions”) for that: a way of generalizing shit so you can describe traits and patterns in text, match them and use them in something. It’s a standard feature of the Search and Replace function in code-oriented text editors and Integrated Development Environments (IDEs), used by developers to create and edit code. You might be using one already; examples are: Notepad++, TextWrangler, Visual Studio Code, BBEdit, Eclipse, Atom, the text editors in GNOME or in Synology’s DiskStation Manager, etc. Sometimes it’s called “grep” search.

Imagine you arrived a little late and your teacher (who’s rumored to be such a spaz/klutz unless they’re balancing on a d**k) has already started, so you sneak in from the back. Speaking of sneaking in from the back…

I’ll start halfway in, which I think might make it easier to grasp. I already slid the first colon from the end of the OID column to the beginning of the short name column; now I need to do the same for the colon between short name and long name. This is the expression I used:

(^.*)(:)(-|\w)+(:)(\s+)(?=\w+)

Meaning…

  • (
    Creates a matching group
  • ^
    Start of line
  • .
    Any character except new lines
  • *
    …if there’s no match, no biggie; if there’s a match, repeat it unlimited times
  • )(
    Ends matching group, opens next matching group.
  • :
    Matches a colon, this stops the effect of .* that would’ve matched up to the end of the line otherwise.
  • )(
    Next matching group

  • -
    Hyphen, dash character
  • |
    Means “or”. So, in the group () either - or \w are sought after.
  • \w
    Means word character, as in a programmer’s definition of word, not yours or mine, which is to say, an alphanumeric character: the basic Roman/Latin alphabet, numbers and the underscore, that’s it (some engines also count accented and other Unicode letters).
  • )
    End group and…
  • +
    Keep matching it as many times as possible, similar to * but if it doesn’t match at least once, the whole match fails (whereas * would continue regardless). Altogether (-|\w)+ means: continue matching this group, which may contain alphanumeric or hyphen characters, until you run out of characters to match.
  • (
    New group
  • :
    Colon
  • )
    End group. Up until this point, (^.*:) would be another, much simpler way of matching it all (.*) from the beginning (^) then stopping at the last (:) because it would be singled out (the first : would’ve been covered by .*), but remember I was expanding my earlier expression up to the first colon.
  • \s
    Whitespace character (spaces, tabs and the like)
  • +
    …multiplied endlessly or abort if there’s none.
  • (?=
    Starts group with positive lookahead. It means that I want the next regex I’m about to put in to match, but pretend it’s not there for the actual output. It’s used as a delimiter for the match.
  • \w
    word character (in my expression it’s followed by +, so one or more of them). …If I remember correctly, I think word comes from bits and bytes, where a bit, which is a 1 or 0 value, true or false, etc., cannot represent much for us, but 8 of them, or a byte, can start to spell out characters we use, like digits and letters.
  • )
    Closes the group.
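
If you’d rather poke at it than read bullet points, here is a minimal sketch in Python that runs the same expression against one row. The sample rows and their spacing are my own shortened stand-ins, and Python’s re module is an assumption here; your editor’s regex flavor may differ in small ways.

```python
import re

# The expression from the post, unchanged.
PATTERN = re.compile(r"(^.*)(:)(-|\w)+(:)(\s+)(?=\w+)")

# Sample rows (spacing shortened), first colon already moved next to the short name.
line = "1.3.6.1.4.1.311.20.2     :dom:      Domain Controller"
no_long_name = "0.2.262.1.10.7.20        :nameDistinguisher:  "

match = PATTERN.search(line)

# The lookahead checks that a word character follows but does not consume it,
# so the match itself stops right after the whitespace.
matched_text = match.group(0)
rest_of_line = line[match.end():]

print(repr(matched_text))
print(repr(rest_of_line))

# With nothing after the whitespace the lookahead fails, so there is no match at all.
print(PATTERN.search(no_long_name))
```

The last line is worth noticing: rows without a long name are skipped entirely by this version of the expression, because of the lookahead.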

A word about groups and nesting…

Groups aren’t obligatory but they make the next part possible. Each group in a regex receives a number (or you can also explicitly name them, but it’s kinda cumbersome) which can be used to recall them right after so they can be reordered. It’s kind of awesome.

Groups are numbered by their opening parenthesis, (, counting from left to right, even if they’re nested, so…

(…)(…(…))(…)(…)
1…)2…3…))4…)5…)

To recall them, they’re referenced using their number or name (if given one) preceded by the $ character (some editors use a backslash instead, as in \2). If they’re nested, recalling the outermost group also includes the inner group; for example, $2 above includes the $3 group.
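
A small sketch of the numbering and recalling, again in Python as an assumption. The letters a to e and the color=red string are made up for illustration, and Python spells the references \2 or \g<name> where many editors use $2.

```python
import re

# Same shape as the (…)(…(…))(…)(…) diagram: group 3 is nested inside group 2.
pattern = re.compile(r"(a)(b(c))(d)(e)")
match = pattern.search("abcde")

print(match.groups())   # every numbered group, ordered by opening parenthesis
print(match.group(2))   # the outer group includes the inner group's text
print(match.group(3))   # the inner group on its own

# Reordering on replace: groups 4, 2, 1, 5 in that order.
swapped = pattern.sub(r"\4\2\1\5", "abcde")
print(swapped)

# Named groups: (?P<name>...) in Python, recalled as \g<name> in the replacement.
named = re.sub(r"(?P<key>\w+)=(?P<value>\w+)", r"\g<value>=\g<key>", "color=red")
print(named)
```

Recalling group 2 drags the nested group 3 along with it, which is exactly the trick used further down to work around the broken replacement.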

Or in my case:

(^.*)(:)(-|\w)+(:)(\s+)(?=\w+)
1    2  3      4  5    6      

So, I need to switch groups $4 and $5’s places, thus

$1$2$3$5$4$6

…might just do it. It didn’t. I messed it up instead:

Fortunately, IDEs have another feature that reverts what you just did, this you might have heard of, it’s called “undo”. It’s gonna be a thing, I tells ya.
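
For anyone who does want to dig in, here is a sketch of two things that are probably behind the mess, assuming Python’s re module behaves like the editor did (it may not, exactly). A repeated group such as (-|\w)+ only keeps the last thing it matched, and a lookahead isn’t a numbered group, so there is no group 6 to recall. The sample row is a shortened stand-in.

```python
import re

PATTERN = re.compile(r"(^.*)(:)(-|\w)+(:)(\s+)(?=\w+)")
line = "1.3.6.1.4.1.311.20.2     :dom:      Domain Controller"

match = PATTERN.search(line)

# Only five numbered groups exist: the (?=...) lookahead does not capture.
print(PATTERN.groups)

# (-|\w)+ repeats group 3 once per character, so the group ends up holding
# only the last character it matched, not the whole short name.
print(repr(match.group(3)))

# The attempted replacement, minus the nonexistent group 6 (Python raises an
# error when a replacement refers to a group that does not exist).
broken = PATTERN.sub(r"\1\2\3\5\4", line)
print(repr(broken))
```

The short name gets chewed down to its last letter, which is one way to end up with a file that needs undoing.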

Hmm… The same expression that had just matched no longer matches for replacement. I could either try to understand what went wrong or wrap it up so I don’t have to deal with it. It’s sort of like handing it off to subprocess so I don’t need to know what went on, just the results to work on:

((^.*)(:)(-|\w)+)(:)(\s+)
12    3  4       5  6
replaced as…
$1$6$5

Groups $1-3 were wrapped in another group, becoming groups $2-4 individually, or group $1 as a whole; groups $4-5 were bumped up as well, to $5-6.
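
Here is a rough Python version of that replacement, as a sketch rather than the one true way. The two rows are shortened stand-ins for the real file, and \1\6\5 is Python’s spelling of $1$6$5. The outer group keeps everything before the colon in one piece, which is why the repeated inner group no longer matters.

```python
import re

# The wrapped expression from the post; group 1 holds everything before the colon.
PATTERN = re.compile(r"((^.*)(:)(-|\w)+)(:)(\s+)")

# Sample rows (spacing shortened for readability).
lines = [
    "1.3.6.1.4.1.311.20.2     :dom:      Domain Controller",
    "0.2.262.1.10.7.20        :nameDistinguisher:  ",
]

# Go line by line: \s also matches newlines, so running this over the whole
# text at once could drag a colon onto the next line.
fixed = [PATTERN.sub(r"\1\6\5", line) for line in lines]

for line in fixed:
    print(repr(line))
```

Without the lookahead, the row that has no long name gets its colon moved too, to the very end of the line.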

et voici:

Learn more (and practice)

There are tons of websites about regex. The one I’ve liked the most is RegExr. It explains how they work much better than I possibly could, and you can use it without creating an account. I would advise against creating one anyway, or, get uBlock Origin and, as you should do everywhere you sign up, use a spamcatcher mailbox for at least a month before switching your account’s address to the real address. It’s very rare for a service not to allow email address changes.

Alternatively, I have my own cheat sheet you’re welcome to use. It’s got no trackers or ads or anything like that, it won’t teach you interactively though…

Cheat

I’ve memorized most of what I need, I don’t use them on the reg though, so oftentimes I forget some, specifically the look-around expressions so I made myself a little cheat sheet.

This is my personal cheatsheet for REGEX:

https://docs.vitanetworks.link/s/YapV3TySy#

It’s a HedgeDoc server BTW, which you’re also welcome to use. Besides using a domain account, which only a handful of people do, it’s only usable anonymously or with a GitHub account. I did not add more services because the options available are much more privacy-invasive than GitHub: Twitter, Facebook, Google… those.