Section to post tutorials on how to manage and read your eComics

TOPIC: Regular Expressions Tutorial

Regular Expressions Tutorial 4 years 3 months ago #36289

  • jkthemac
  • jkthemac's Avatar
  • Offline
  • Platinum Boarder
  • Posts: 768
  • Thank you received: 253
  • Karma: 55
Regular Expressions for Pattern Matching Comic Names

Regular expressions are a powerful way of matching strings, but there is a barrier to getting started with them, because the websites out there that detail how to use them tend to be written by technical people for technical people.

They can also be used in many different ways, and some tutorials focus on one specific use and neglect the more general ones.

For Comic Rack you don't need to know everything about Regular Expressions to make them useful; you just need to know how they work and how to match the kinds of names that comics have.

Examples of when basic Regular Expressions may be useful:

• When some comic volumes begin with "The" and others don't, but you want to match both (X-Men & The X-Men)

• When adjectives are used and you want to include them with the regular comics (Uncanny Avengers, Secret Avengers)

• When you wish to include Annuals, Specials or Giant-Sized issues (Fantastic Four Giant-Sized & Fantastic Four Annual)

• When you want to match all annuals regardless of a tagged-on year like 98 or 2002


The vagaries of comic book titles make for an endless list of possible reasons.

Why not use multiple lines in a Smart List?

Simple answer is you can, and most times it would probably be easier.

Take our X-Men example above, we could use a smart list and match "X-Men" in one line and "The X-Men" in another. It works and I would probably do this most of the time.

But imagine you wanted to quickly edit it to look for another comic, then you would need to edit both lines. (Again not a problem for such a simple example but that's why it's a simple example.)
The administrator has disabled public write access.
The following user(s) said Thank You: forkicks, 600WPMPO, rmagere, gyrop, jericko, Xelloss, boshuda, Harry, scottycondron

Re: Regular Expressions Tutorial 4 years 3 months ago #36290

  • jkthemac
Part 1: What is a Regular Expression?
Introducing ^ and $


For the purposes of this tutorial it is a string of symbols and words that allows Comic Rack to search for a matching hypothetical string of words.

This Regular Expression:

^The text we are looking for$

Would match the string

"The text we are looking for"

As long as the spaces were all in the same place and the capital letters were all just the same. It wouldn't match:

"The text we are Looking for" (capital L)
or
"the text we are looking for" (lower case the)
or
"The  text we are looking for" (two spaces after The)

Starting at the beginning moving towards the end
The ^ Symbol and the $ Symbol


Let's look closely at the expression. It starts with a special character, a ^.
This represents the beginning of the field we are searching, so let's imagine a comic with the title "This is Not The text we are looking for".

If we used the regular expression ^The text we are looking for$ to match the title of our comics, Comic Rack would see the ^ symbol and start checking from the beginning.

First it would see a 'T', so far so good, so it would carry on and check the next character, an 'h', still good. Next, though, is an 'i'. That doesn't match the 'e' the expression expects, so Comic Rack will stop there. No need to look further; this comic won't be added to our search result.

At the end there is a '$', which matches the end of a string.

So if we stepped through the string "The text we are looking for." you might think that it would match fine, but the full stop wasn't expected. Comic Rack expected a $, which means no more characters after the last letter. Anything following is not a match, so it won't add the comic to our search result.

And that is all a regular expression really does: it steps through each comic field that it is asked to and checks one character at a time to see if the expression matches the field. If it matches, it adds the comic to the list; if it doesn't, it moves on to the next field or comic until it runs out of comics.
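If you want to try the anchored expression outside Comic Rack, a few lines of Python will do. This is only a sketch: it assumes Python's re module treats ^ and $ the same way as Comic Rack does for a simple pattern like this, and the list of titles is made up for the test.

```python
import re

# The fully anchored expression from the post.
pattern = re.compile(r"^The text we are looking for$")

# Hypothetical titles, including the near misses described above.
candidates = [
    "The text we are looking for",
    "The text we are Looking for",              # capital L
    "the text we are looking for",              # lower case the
    "The  text we are looking for",             # two spaces after The
    "The text we are looking for.",             # trailing full stop
    "This is Not The text we are looking for",  # wrong beginning
]

# Map each title to whether the expression matched it.
results = {title: bool(pattern.search(title)) for title in candidates}

for title, matched in results.items():
    print(f"{matched!s:5}  {title!r}")
```

Only the first title should come back as a match; every other one trips over a single character somewhere along the way.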

So already we have a way of matching any Comic that begins with a single word or anything else for that matter. We just don't use the $ symbol.

^The
Matches
"The X-Men" and "The Avengers" and even a hypothetical comic called "The"



Comic Rack will stop stepping through the string as soon as it matches the 'e'. It won't go any further, so it won't disallow anything that starts correctly.

Also, we have a way of matching any comic that ends with a word or string. We just leave out the ^ symbol.

Avengers$
Matches
"The Avengers" and "New Avengers" and "Secret Avengers"
But not
"Avengers Assemble"



What Comic Rack does here is step forward one letter at a time, but at each position it is looking for the whole string of letters, and the string it is looking for includes the end of the field. So with Avengers Assemble it will match Avengers starting at the first A, but because there is a space after it rather than the end of the field, that attempt fails and it moves on to the v. It still keeps going one letter at a time just in case it can find a match later on. After all, it doesn't know there isn't a comic called "Avengers: To me my Avengers". What if Xavier came back from the dead to lead the Avengers?
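The two half-anchored expressions can be checked the same way. A minimal sketch, assuming Python's re module stands in for Comic Rack's matcher; the titles list is just the examples from this post gathered together.

```python
import re

titles = [
    "The X-Men",
    "The Avengers",
    "The",
    "New Avengers",
    "Secret Avengers",
    "Avengers Assemble",
]

# ^The : anchored at the start only, anything may follow.
starts_with_the = [t for t in titles if re.search(r"^The", t)]

# Avengers$ : anchored at the end only, anything may come before.
ends_with_avengers = [t for t in titles if re.search(r"Avengers$", t)]

print(starts_with_the)
print(ends_with_avengers)
```

"Avengers Assemble" drops out of the second list because the end of the field does not come straight after "Avengers".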
Last Edit: 4 years 3 months ago by 600WPMPO. Reason: Added screenshots
The following user(s) said Thank You: perezmu, 600WPMPO, gyrop, romsnesrom, Xelloss, docdoom, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36296

  • jkthemac
Part 2: The next logical step

What makes databases like Comic Rack so useful for finding things is that they can be programmed with logic. And this allows us to ask complex questions. Questions that we might ask in English like “Show me all the X-Men Comics”.

But these seemingly simple questions become quite complicated when we realise that we need to explain what we mean by an X-Men comic. And to be honest, collectors would probably have a pretty lengthy argument over this question.

In Comic Rack we could use the Search box at the top right, put in X-Men and select search all, and we would probably get a pretty useful list. But the resulting list is going to be pretty long and include all sorts of comics, even comics that have nothing to do with Marvel but have a description that compares them to the X-Men comics.

Logical expressions allow us to be very specific in our search.

We could ask:

• Show all the Comics with X-Men in the series name
• Show all the Comics with X-Men in the series title and Avengers in the series title.
• Show all the Comics with Uncanny in the title but not X-Men

All of these are pretty easy using Smart Lists but you will often need to use more than one line in the search query.

Alternate searches, OR the | symbol

The most useful questions to ask Comic Rack will include the word OR.
Let’s imagine we want a list of all of the comics just called X-Men. Part 1 gave us a pretty easy expression to match:

^X-Men$
In our Smart list it would be entered like this:



And, would result in this:



But we might be disappointed because it doesn’t include the 60s run “The X-Men”.

What we want is an expression that matches The X-Men OR X-Men. And the way we specify OR is with the | character.

This character is sometimes tricky to find on a keyboard, so have a search for it. It’s not a capital i or a little L; it is a vertical bar that extends above and below the other characters. On my keyboard it is “shift+\” but it does move around. If you can’t find it anywhere you can copy and paste it from this page.

In logic the expression apples|oranges means apples OR oranges, and we can use it in a regular expression thus:

^X-Men$|^The X-Men$

You can probably see by now that this can be stated in English as:
A string that contains only X-Men OR only The X-Men.



It is very specific, it won’t allow The Uncanny X-Men to slip into the list.

What Comic Rack does with this is the same as with every regular expression. It starts at the beginning and works its way through, so it is going to try to match ‘X-Men’ first. When it hits the | symbol it knows to expect an alternate search, and seeing that the next symbol is a ^ it will go back to the beginning of ‘series’ and look for a first letter T etc.

Now it may seem odd to worry about how much work this is for Comic Rack, but searches can get pretty complex and complexity equals time. So it is a good idea to phrase our queries in such a way that they take less effort. And we do this by understanding that Regular Expressions start from the beginning and work along one character at a time.

So the query above is doing more than it needs to: it looks for a series that ends in X-Men twice. It would be easier for Comic Rack to do this:

^(|The )X-Men$

But in one easy step we have gone from simple to alchemical, but only because this is harder to translate into English:

Match a comic that begins with either nothing or “The ” followed by X-Men and then nothing else.



It is easier for Comic Rack because it never has to go back to the beginning. It starts at the ^ and then looks for either nothing or The followed by a space. It then looks for the next letter to be an X, and so on; if it finds anything else it moves on to the next comic.
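To convince yourself that the long and the short form pick out the same comics, you could run both over a handful of series names. This is a sketch under the assumption that Python's re module handles | and the empty alternative the way Comic Rack does; the series list is invented for the test.

```python
import re

# Hypothetical series names to test against.
series = [
    "X-Men",
    "The X-Men",
    "The Uncanny X-Men",
    "X-Men Annual",
    "Uncanny X-Men",
]

# Two complete alternatives, each anchored at both ends.
long_form = re.compile(r"^X-Men$|^The X-Men$")

# One anchored expression with an optional "The " at the front.
short_form = re.compile(r"^(|The )X-Men$")

long_hits = [s for s in series if long_form.search(s)]
short_hits = [s for s in series if short_form.search(s)]

print(long_hits)
print(short_hits)
```

Both lists should contain just "X-Men" and "The X-Men", so the choice between the two forms is about tidiness and effort, not about results.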

Let this digest, play around with Smart Lists that have various positions for the | symbol and watch the results.

Maybe try and make a query that starts with The and Ends in X-Men but includes anything in the middle.
But every time you get either an expected or unexpected result think how Comic Rack is actually getting to that answer in its step by step way. Because once you get your head around that part the rest will fall into place.

(Part three won’t be for a couple of days anyway so you may as well experiment.)
Last Edit: 4 years 3 months ago by jkthemac.
The following user(s) said Thank You: cYo, 600WPMPO, gyrop, romsnesrom, Xelloss, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36299

  • jkthemac
Part 3: Matching in the real world
Introducing *

Now I refuse to fall into the trap that every other Regular Expression article falls into and immediately start looking at all of the other logic operators. Sure they have their uses, but for searching they are much more specialist. Instead we will stick with what is most useful.

So far we have done some stuff that is pretty easy to do with standard Smart Lists, so let’s move into an area where Regular Expressions excel, the real world.

In the real world we have comics that are named incorrectly, have strange punctuation or have been scraped from a crowd-sourced database where the editors do things strangely with their own agendas. This is ignoring the naming conventions that the comic companies themselves sometimes use.

In the real world we can have a number of comics that may or may not be capitalised properly, possibly contain a ':' character, may have too many spaces between words, and otherwise are less than ideal for searching.

Well, I picked X-Men as an example for a reason!

It is easy when you are in a hurry to forget to capitalise the M, or maybe miss out the -, or even put 2 spaces after ‘The’.

So here is a search that finds all instances of ‘X-Men’ ignoring some of the basic mistakes:

(X|x) *(|-) *(m|M)en

Yep, it's looking less like English, but nearly everything has been covered already with the exception of that * character. Let's step through it and see what it does.

And we start as usual at the beginning. Indeed, let’s look at what isn’t even at the beginning: there is no ^ character, so Comic Rack will step through the name of each comic series one character at a time. It isn’t worried about the pattern being at the beginning; it can be anywhere.

(X|x)

Pretty simple: X or x. Comic Rack will step one character at a time looking for an X or an x.

<space>*

This one is new. The star symbol applies to whatever is immediately in front of it, and in this case that is a space. It allows any number of that character, including none of them.

So it will match:

<Nothing>
<Space>
<Space><Space>
<Space><Space><Space>
Etc.

(|-)

This is either nothing or a hyphen

<space>*

As before any number of spaces including none.

(m|M)en

And finally, similar to how we checked above for 'x', we check for 'Men' with or without a capital M.
So put it all together:

(X|x) *(|-) *(m|M)en

Will match:

X-Men
X - Men
x-men
x- Men
X -Men
X- Men
x Men
X men
XMen



And many other combinations.
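Here is a small Python sketch of the same search, in case you want to throw your own misspellings at it. It assumes Python's re module agrees with Comic Rack on * and |; the other_titles list is made up to show a few things that should not match.

```python
import re

# Tolerant X-Men pattern: optional spaces, optional hyphen, either case for X and M.
pattern = re.compile(r"(X|x) *(|-) *(m|M)en")

sloppy_titles = [
    "X-Men", "X - Men", "x-men", "x- Men", "X -Men",
    "X- Men", "x Men", "X men", "XMen",
]

# Hypothetical titles that should be rejected:
# "X-Man" ends in "an", "X-MEN" has capital "EN", "The Avengers" has no x at all.
other_titles = ["X-Man", "The Avengers", "X-MEN"]

matched = [t for t in sloppy_titles + other_titles if pattern.search(t)]
print(matched)
```

Note that "X-MEN" is rejected because only the X and the M were given an upper and lower case choice.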

It might be complex but it is also very compact and you don't need to create a whole load of lines in your search.

Of course, once you got these results you would probably correct all of the errors, but you would be reasonably sure you hadn't missed any. And you have free time to sort out Spider-Man.
Last Edit: 4 years 3 months ago by 600WPMPO. Reason: Added screenshots
The following user(s) said Thank You: gyrop, romsnesrom, Xelloss, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36300

  • jkthemac
Interlude

The post above shows an interesting phenomenon when using Regular Expressions: sometimes you will get results you don't expect. In this case Flex Mentallo: it fits the pattern but it's not what we are looking for.

What we need to keep that from happening is a regular expression that doesn't worry about the X being the first character, but does insist that it is the first character in a word.

There is something that would help here. \b matches a word boundary, i.e. it matches at the beginning and at the end of a word.

\bX-Men\b would expect a non-word character (or the start or end of the field) before and after X-Men. Word characters will be defined in another tutorial, but suffice to say that if we enclose the expression in this format we can stop these rogue results.

\b(X|x) *(|-) *(m|M)en\b
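The difference the two \b markers make can be seen side by side in Python. A sketch only, assuming Python's idea of a word boundary is close enough to Comic Rack's for plain letters; the titles are examples I picked, not a real library.

```python
import re

loose = re.compile(r"(X|x) *(|-) *(m|M)en")
strict = re.compile(r"\b(X|x) *(|-) *(m|M)en\b")

titles = ["The X-Men", "X-Men: Legacy", "x men", "Flex Mentallo"]

loose_hits = [t for t in titles if loose.search(t)]
strict_hits = [t for t in titles if strict.search(t)]

print(loose_hits)   # includes the rogue result
print(strict_hits)  # rogue result gone
```

In "Flex Mentallo" the x sits in the middle of a word, so there is no boundary in front of it and the strict version gives up.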

Last Edit: 4 years 3 months ago by 600WPMPO. Reason: Added screenshots
The following user(s) said Thank You: 600WPMPO, gyrop, romsnesrom, Xelloss, boshuda, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36347

  • jkthemac
Part 4: Dots in front of your eyes.
Introducing the ‘.’

So far we have tried to match specific things, sometimes those specific things have been ‘nothing’, but it has always been a specific nothing. But Regular Expressions get interesting when we start to look for non-specific patterns.

In a way the expression:

^X-Men

Is non-specific, as it doesn’t matter what comes next, but we may want to get specific in our non-specificity. (Try saying that 5 times fast.)

The dot character ‘.’ matches any single character, so we could get very specific with an expression like:

^X............$

This will only match a 13 character long target string beginning with ‘X‘.

(Target string is the correct jargon for the text you are matching with your regular expression.)

It is basically matching the X at the beginning and 12, but only 12, anythings in a row.


The dot is useful when you are not fussy about what you are matching, only the amount of space it takes up.

In a search within Comic Rack it could be useful for nonspecific separation characters. For example, there are a number of X-Men titles that begin with ‘X-Men:’ and the following would match those:

^X-Men..\b



But it would also match any other incorrect punctuation because it is just looking for two characters before the next word begins.

You may remember the \b from the interlude above; here it marks the beginning of the next word.

Or this:

Spider.Man



Would not only match Spider-Man but also Spider Man or Spider/Man or even SpiderzMan.
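The three dot expressions from this part can be tried out like this. It is a sketch assuming Python's re module treats the dot as Comic Rack does (any single character); the test titles are my own picks.

```python
import re

# ^X............$ : an X followed by exactly 12 more characters (13 in total).
thirteen = re.compile(r"^X............$")
thirteen_hits = [t for t in ["X-Men Forever", "X-Men Legacy", "X-Factor"]
                 if thirteen.search(t)]

# ^X-Men..\b : two characters of any kind, then the start of a word.
separator = re.compile(r"^X-Men..\b")
separator_hits = [t for t in ["X-Men: Legacy", "X-Men Legacy"]
                  if separator.search(t)]

# Spider.Man : exactly one character of any kind between Spider and Man.
spider = re.compile(r"Spider.Man")
spider_hits = [t for t in ["Spider-Man", "Spider Man", "Spider/Man",
                           "SpiderzMan", "SpiderMan"]
               if spider.search(t)]

print(thirteen_hits)
print(separator_hits)
print(spider_hits)
```

"SpiderMan" is left out of the last list because the dot insists on one character being there, even though it doesn't care which.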


Other related reasons to use dots.

A classic example might be a numbered reading list like this:

00) Uncanny X-Men #319: Untapped Potential
01) X-Factor #109: The Waking
02) Uncanny X-Men #320: The Son Rises In The East
03) X-Men #40: The Killing Time
04) Uncanny X-Men #321: Auld Lang Syne
05) Cable #20: An Hour Of Last Things
06) X-Men #41: Dreams Die

When these lists are long and I am trying to sort my comics into a reading list I use one of the many text editors that allow you to use regular expressions in the find field and in this case I would perform this search for each X-Men title.

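\n....X-Men

‘\n’ matches the line break at the end of the previous line, so here it acts a bit like ^ but for the beginning of a line (new line). My search is looking for a line beginning with four non-specific characters followed by ‘X-Men’. (Because there is no line break in front of it, the very first line of the list can never match this way.)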
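For anyone without a regex-capable text editor to hand, the same reading list search can be sketched in Python. The second pattern is an alternative I am adding for comparison, not something from the post: it uses ^ together with Python's re.MULTILINE flag, which makes ^ match at the start of every line.

```python
import re

reading_list = """00) Uncanny X-Men #319: Untapped Potential
01) X-Factor #109: The Waking
02) Uncanny X-Men #320: The Son Rises In The East
03) X-Men #40: The Killing Time
04) Uncanny X-Men #321: Auld Lang Syne
05) Cable #20: An Hour Of Last Things
06) X-Men #41: Dreams Die"""

# A line break, four characters of any kind, then X-Men.
newline_hits = re.findall(r"\n....X-Men", reading_list)

# Alternative: ^ at the start of each line (needs the MULTILINE flag in Python).
multiline_hits = re.findall(r"^....X-Men", reading_list, flags=re.MULTILINE)

print(newline_hits)
print(multiline_hits)
```

Both searches find entries 03 and 06 and skip the Uncanny X-Men lines, because "Unca" is not "X-Me".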
Last Edit: 4 years 3 months ago by 600WPMPO. Reason: Added screenshots
The following user(s) said Thank You: 600WPMPO, romsnesrom, Xelloss, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36351

  • jkthemac
Part 5: Time to be Insensitive

I have been using capital letters for my examples above, and they are a great way to teach the "|" symbol. But, for my own sanity I stopped doing that and moved on.

This is mainly because many comic collections will probably have been corrected at some stage, (mostly through scraping) but partly because in Comic Rack's particular implementation of Regular Expressions you can turn on case insensitivity.

Firstly if you want to turn on insensitivity you would use:

(?i)

The i stands for insensitive.

If you want to turn it off again at some point:

(?-i)

So this expression:

(?i)^X-Men: (?-i)G

Will not worry about capitals for X-Men: but will go on to be fussy about the capital G and anything following.

If you want to turn insensitivity on or off for one specific part of the regular expression you can enclose that part within the brackets like this:

(?i)^the (?-i:Avengers) annual

Which is generally case insensitive but will be fussy about only the name Avengers.
The : is part of the syntax not something that the expression is trying to match, so it is being case sensitive for anything in the brackets after the : symbol.

It matches:

The Avengers Annual
the Avengers annual



But not:

The avengers Annual

And it will remain insensitive for the rest of the Expression.

(Note: because it just modifies the behaviour of the regular expression I placed it before the "^" in the previous examples. I think it helps the clarity but Comic Rack isn't fussy.)
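A word of caution if you test these outside Comic Rack: not every regular expression engine accepts the same switches. The sketch below uses Python's re module (3.6 or later), which understands (?i) at the very start and the bracketed (?-i:...) and (?i:...) forms, but as far as I know does not accept a bare (?-i) part-way through. So the X-Men: G example is rewritten here with (?i:...) around the insensitive part; that rewrite is my assumption about an equivalent, not something Comic Rack requires.

```python
import re

# Insensitive overall, but fussy about the exact capitals in "Avengers".
scoped = re.compile(r"(?i)^the (?-i:Avengers) annual")

annual_titles = [
    "The Avengers Annual",
    "the Avengers annual",
    "THE Avengers ANNUAL",
    "The avengers Annual",
]
scoped_hits = [t for t in annual_titles if scoped.search(t)]

# Stand-in for (?i)^X-Men: (?-i)G : only the bracketed part ignores case.
partial = re.compile(r"(?i:^X-Men: )G")
partial_hits = [t for t in ["x-men: Gold", "X-MEN: Gold", "X-Men: gold"]
                if partial.search(t)]

print(scoped_hits)
print(partial_hits)
```

The lower case "avengers" and the lower case "gold" are the only two titles turned away.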
Last Edit: 3 months 6 days ago by jkthemac.
The following user(s) said Thank You: 600WPMPO, GlopGlop31, romsnesrom, Xelloss, boshuda, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36362

  • jkthemac
Interlude 2: Time for an uncanny example

I have recently been reading the X-Men comics from the beginning, and the frustrating thing is the way the title keeps changing. In comic language the indicia remains the same but the actual title changes. The run starts with ‘The X-Men’ and continues with the occasional Annual and Giant-Size, but then starts to change into The Uncanny X-Men, firstly with an annual and later in the regular run.

So I wanted to collect the title up to 1980 into a Smart List that contains these comics.

The X-Men
X-Men Annual
Giant-Size X-Men
The Uncanny X-Men
The Uncanny X-Men Annual

I used this as a start:

(?i)^(|the )(giant-size |uncanny |)x-men( annual|)$




Let us break that down:

(?i)^

As we just covered in the previous part (?i)^ is turning on case insensitivity and specifying that we are starting at the beginning of the string (the technical term is anchoring).

Now if we ignore everything in the brackets we see ‘x-men’ in the middle. (Mercifully the one thing these titles all have in common.)

So I started with that and then added the things that may or may not be in front and behind it in the title.

So in front we have:

(|the )(giant-size |uncanny |)

The first part

(|the )

Matches

the<space> OR <nothing>

It has to include nothing because there is no ‘the’ in front of ‘Giant-Size X-Men’ or ‘X-Men Annual’.

(giant-size |uncanny |)

Matches

giant-size<space> OR uncanny<space> OR <nothing>

Again we need the nothing for ‘X-Men Annual’

After x-men we have

( annual|)$

Which anchors at the end of the target string:

<space>annual OR <nothing>

It’s simple really!
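If you would like to see the whole expression at work, this Python sketch runs it over the five wanted titles plus a few extras. It assumes Python's re module behaves like Comic Rack for this pattern, and the extra titles are ones I added for contrast.

```python
import re

pattern = re.compile(
    r"(?i)^(|the )(giant-size |uncanny |)x-men( annual|)$"
)

wanted = [
    "The X-Men",
    "X-Men Annual",
    "Giant-Size X-Men",
    "The Uncanny X-Men",
    "The Uncanny X-Men Annual",
]

# Hypothetical extra titles to show what is kept out and what slips in.
extras = ["Astonishing X-Men", "X-Men Unlimited", "X-Men"]

wanted_hits = [t for t in wanted if pattern.search(t)]
extra_hits = [t for t in extras if pattern.search(t)]

print(wanted_hits)
print(extra_hits)
```

All five wanted titles match. Of the extras, plain "X-Men" also gets through, because every optional part is allowed to be nothing.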

P.S.

The purist in me doesn't want to include 'adjectiveless X-Men'

X-Men(1991)
X-Men Annual(1992)
X-Men(2004)
X-Men(2010)
X-Men(2013)

But we haven't covered negations and two volumes will be called X-Men Annual, so my purist Smart List ends up like this:


Last Edit: 4 years 3 months ago by jkthemac.
The following user(s) said Thank You: 600WPMPO, GlopGlop31, romsnesrom, jericko, Xelloss, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36363

  • jkthemac
Part 6: Getting Classy with numbers and punctuation

With dots we covered those times we are not concerned with which character we match, so what about when we are?

Character classes allow us to match things like any numeric character, any capital letter or anything from a list of characters.

To return to our incorrect punctuation example we could do this:

Spider[ _./~]Man

Let's write it as Spider[<space>_./~]Man for clarity, but remember you can't actually type <space> in the expression; use a real space.

Which would match any of these:

The Amazing Spider Man
Spider~Man Team-Up
Spider_Man 2099
etc.

We place the possible characters that we are looking for between square brackets, and the Regular Expression will match ONE AND ONLY ONE of the characters in the class.

My emphasis is there because this is such a common confusion with character classes, and I think it's born from this kind of Expression:

Spider-[Man]

Which will indeed match 'Spider-Man', but only because 'Man' begins with an M.

Spider-[Man]<space>

Matches:

Spider-M<space>
Spider-a<space>
Spider-n<space>

Which in any sane collection is going to be no match at all.
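The one-and-only-one rule is easy to see in a quick Python test. A sketch, assuming Python's re module treats character classes as Comic Rack does; the titles are examples only.

```python
import re

# One character from the class between Spider and Man.
punctuation = re.compile(r"Spider[ _./~]Man")
punctuation_hits = [
    t for t in [
        "The Amazing Spider Man",
        "Spider~Man Team-Up",
        "Spider_Man 2099",
        "Spider-Man",   # hyphen is not in the class
        "SpiderMan",    # no character at all
    ]
    if punctuation.search(t)
]

# [Man] is a single character: M, a or n. The trailing space exposes that.
single_letter = re.compile(r"Spider-[Man] ")
single_hits = [t for t in ["Spider-Man Team-Up", "Spider-M Team-Up"]
               if single_letter.search(t)]

print(punctuation_hits)
print(single_hits)
```

"Spider-Man Team-Up" fails the second expression because after the M it wants a space and finds an a.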

Probably the most common use of Character Classes is finding numbers, for example:

^[0123456789]

Would find anything beginning with a number.

But this would get pretty cumbersome, so luckily there is not just one shorter way to do this but several.

Firstly we can use a range within a class to do exactly the same:

^[0-9]

Or we can use a handy shortcut \d so this:

[1-2][90]\d\d

would match any year from 1900-2099

But note this will match some other years as well. I won't tell you, just look at it carefully from left to right and you can probably work out a few other ranges it will find.

(Regular Expressions are about matching characters and we so easily imagine that computers think like us that when they don't we are sometimes surprised.)

This highlights one of the things it is always worth remembering about regular expressions: sometimes it's OK to use an expression that isn't perfect. In my collection there isn't a single comic that matches that expression outside 1900-2099, so I would use it.

One use for this could be looking for those annoying annuals with the year in their series name (I find them annoying anyway, maybe it's just me).

This will find most of them:

(?i)annual(:|)[- _./~]('|)(\d\d|[12][90]\d\d)

And breaking that down there are three parts:

(?i)annual(:|)[-<space>_./~]

We are turning on case insensitivity, matching "annual" and either a ':' or nothing, and then one of a few possible punctuation characters, including a space.

('|)

Checking for those dates like '94, we look for ' or nothing.

(\d\d|[12][90]\d\d)

The part before the | is looking for any two digits like 99 or 54, and the part after is the one we used earlier that matches appropriate 4 digit years.

So matches include:

Spider-Girl Annual 1999
X-Man Annual '96
Avengers annual: 2000
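Here is the dated annuals expression as a Python sketch, for checking your own awkward cases. It assumes Python's re module reads \d and the character class the same way Comic Rack does; the undated titles are ones I added to show what gets left alone.

```python
import re

dated_annual = re.compile(
    r"(?i)annual(:|)[- _./~]('|)(\d\d|[12][90]\d\d)"
)

titles = [
    "Spider-Girl Annual 1999",
    "X-Man Annual '96",
    "Avengers annual: 2000",
    "Fantastic Four Annual",   # no year at all
    "X-Men Annual",            # no year at all
]

dated = [t for t in titles if dated_annual.search(t)]
undated = [t for t in titles if not dated_annual.search(t)]

print(dated)
print(undated)
```

The three dated titles are found and the two plain annuals are passed over, since there is nothing after "Annual" for the separator and digits to match.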


We can make this look tidier but that is for the next tutorial.
Last Edit: 4 years 3 months ago by jkthemac.
The following user(s) said Thank You: 600WPMPO, GlopGlop31, romsnesrom, Xelloss, Harry

Re: Regular Expressions Tutorial 4 years 3 months ago #36384

  • jkthemac
Here are some of the Smart Lists for those of you going cross-eyed trying to copy them manually.

File Attachment:

File Name: Uncanny X-Men.cbl
File Size:1 KB


The corresponding Adjectiveless X-Men list not used above:

File Attachment:

File Name: i X-Men.cbl
File Size:1 KB


File Attachment:

File Name: Dated Annuals.cbl
File Size:0 KB


And the one that kicked it off in the Feature Requests section

File Attachment:

File Name: Hawkeye.cbl
File Size:0 KB
Last Edit: 4 years 3 months ago by jkthemac.
The following user(s) said Thank You: Xelloss, Harry