Section to post tutorials on how to manage and read your eComics

TOPIC: Regular Expressions Tutorial V2

Regular Expressions Tutorial V2 3 years 11 months ago #36416

  • jkthemac
  • jkthemac's Avatar
  • Offline
  • Platinum Boarder
  • Posts: 748
  • Thank you received: 244
  • Karma: 54
Part 10: The Repetitive Part
Introducing +, ? & {,}



Repetition is for those times when you are looking for a particular number of things. If you look back at Part 3, we used * to match zero or more of whatever we placed before it.

This is fine, but sometimes you want to match at least one character, and then you would use the +.

So whereas

^New *Avengers$

will match:

NewAvengers
New<Space>Avengers
New<Space><Space>Avengers
New<Space><Space><Space>Avengers etc.

^New +Avengers$

will match:

New<Space>Avengers
New<Space><Space>Avengers
New<Space><Space><Space>Avengers etc.

And if you wanted to be even more specific you could use ?, which matches zero or one occurrence of whatever is in front of it.

^New ?Avengers$

only matches:

NewAvengers
New<Space>Avengers
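
The three quantifiers can be compared side by side. This is only a minimal sketch using Python's re module for illustration (the regex engine in your comic manager may differ in other details); the titles list and the matching_titles helper are made up for the example.

```python
import re

# Hypothetical titles with zero, one, two and three spaces between the words.
titles = ["NewAvengers", "New Avengers", "New  Avengers", "New   Avengers"]

def matching_titles(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [t for t in candidates if re.search(pattern, t)]

star = matching_titles(r"^New *Avengers$", titles)      # zero or more spaces
plus = matching_titles(r"^New +Avengers$", titles)      # one or more spaces
optional = matching_titles(r"^New ?Avengers$", titles)  # zero or one space

print(star)
print(plus)
print(optional)
```

The * list contains all four titles, the + list drops NewAvengers, and the ? list keeps only the first two.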

The most powerful repetition expression is {,}. With this you can specify the minimum and maximum number of times something can be repeated.

[0-9]{4,4}

Will match anything with 4 digits in a row.

or perhaps a less practical example

(?i)(tin){2,2}

Will match any Tintin series.
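
As a rough check of the two {,} examples, here is a small sketch in Python's re module (chosen only for illustration; the sample strings are invented).

```python
import re

# {4,4} means "at least 4 and at most 4", i.e. exactly four repetitions.
four_digits = re.compile(r"[0-9]{4,4}")
print(bool(four_digits.search("Annual 1985")))  # four digits in a row
print(bool(four_digits.search("Issue 123")))    # only three digits

# (?i) ignores case; (tin){2,2} needs "tin" twice with nothing in between.
tintin = re.compile(r"(?i)(tin){2,2}")
print(bool(tintin.search("The Adventures of Tintin")))
print(bool(tintin.search("Rin Tin Tin")))       # the space breaks the repeat
```

Note that a longer run of digits, such as 12345, still contains four digits in a row, so it would be matched too.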

You don't have to specify the maximum: {2,} would mean two or more. With a range of 2 to 4 digits, our annual search could be written like this:

(?i)annual(:|)[- _./~]('|)\d{2,4}

(Note: This isn't identical to our previous annual search because it isn't fussy about the first 2 digits.)
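
To see what the annual search accepts, here is a hedged sketch in Python's re module, with the digit class written as \d. The sample titles are invented for the example.

```python
import re

# annual, an optional colon, one separator, an optional apostrophe, 2 to 4 digits.
annual = re.compile(r"(?i)annual(:|)[- _./~]('|)\d{2,4}")

# Hypothetical series titles.
samples = [
    "Spider-Man Annual 1985",
    "X-Men Annual '97",
    "Avengers Annual: 2001",
    "Hulk Annual 7",   # only one digit
    "Annual1985",      # no separator before the digits
]

found = [s for s in samples if annual.search(s)]
print(found)
```

Only the first three titles are found: the fourth has too few digits and the fifth has no separator character.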
Last Edit: 3 years 11 months ago by jkthemac.
The administrator has disabled public write access.
The following user(s) said Thank You: perezmu, Xelloss, Harry

Re: Regular Expressions Tutorial V2 3 years 11 months ago #36417

  • jkthemac
Part 11: Looking for 2 things and Look-ahead.

Now that we have covered repetition we can use the incredibly useful command: '.+'.

This basically looks for one or more of any character. We can use it to look for one word followed anywhere in the target string by another word:

X-Men.+Avengers

We can even look for two words either way around by using or:

(X-Men.+Avengers|Avengers.+X-Men)
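
A quick sketch of both forms in Python's re module (used only for illustration; the titles are made up):

```python
import re

one_way = re.compile(r"X-Men.+Avengers")
either_way = re.compile(r"(X-Men.+Avengers|Avengers.+X-Men)")

# Hypothetical titles.
print(bool(one_way.search("X-Men and the Avengers")))   # X-Men comes first
print(bool(one_way.search("Avengers vs. X-Men")))       # wrong order for this one
print(bool(either_way.search("Avengers vs. X-Men")))    # second alternative matches
print(bool(either_way.search("Uncanny X-Men")))         # only one of the two words
```

Because .+ needs at least one character, the two words must be separated by something.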


There is another way to be flexible with multiple words, and that is to use the Look-ahead feature.

Remember we started off by making it clear that Regular Expressions go through the target string one character at a time, and commands like OR step through the string looking for the first alternative and then go back to check again for the other alternative.

Well there is a way to make regular expressions behave a little differently, by using the look-ahead syntax '(?=...)':

(?=.*X-Men)

Finds anything with X-Men in the string, and so it appears to be identical to:

X-Men

But the key difference is that, instead of stepping through the string one character at a time looking for X, the regular expression keeps its place and looks ahead for the X. Once it finds an X it looks ahead to match the other letters, but once they are matched it goes right back to where it started looking ahead, without using up any characters.

So this Expression:

The(?=.*X-Men)

is identical to

(?=.*X-Men)The

Both will, perhaps unintuitively, look for The FOLLOWED BY X-Men.
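
The equivalence can be checked with a small sketch in Python's re module (for illustration only; the titles are invented):

```python
import re

after = re.compile(r"The(?=.*X-Men)")
before = re.compile(r"(?=.*X-Men)The")

# Hypothetical titles.
titles = ["The Uncanny X-Men", "X-Men: The End", "The Avengers"]

after_results = [bool(after.search(t)) for t in titles]
before_results = [bool(before.search(t)) for t in titles]
print(after_results)
print(before_results)

# The look-ahead only checks; the matched text is just "The".
matched_text = after.search("The Uncanny X-Men").group()
print(matched_text)
```

Both patterns find only the first title. In "X-Men: The End" the word The is there, but no X-Men comes after it.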

This allows us to not bother with the OR command and effectively make an AND:

(?=.*X-Men)(?=.*Avengers)(?=.*The)

Which looks ahead from the beginning for each of three words, only matching when they are all found.
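
A minimal sketch of the AND behaviour, again in Python's re module purely for illustration, with made-up titles:

```python
import re

all_three = re.compile(r"(?=.*X-Men)(?=.*Avengers)(?=.*The)")

# Hypothetical titles.
titles = [
    "The Avengers vs. X-Men",          # all three words
    "X-Men: The End of the Avengers",  # all three, in a different order
    "Avengers vs. X-Men",              # no "The"
    "The Uncanny X-Men",               # no "Avengers"
]

found = [t for t in titles if all_three.search(t)]
print(found)

# Look-aheads use up no characters, so the match itself is empty.
matched_text = all_three.search(titles[0]).group()
```

The order of the words in the title does not matter, which is what makes this behave like AND.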
Last Edit: 3 years 11 months ago by jkthemac.
The following user(s) said Thank You: Xelloss

Re: Regular Expressions Tutorial V2 3 years 11 months ago #36418

  • jkthemac
Part 12: Getting Negative with Look-ahead

Another useful thing about look-ahead is negation. We can specify that words are not ahead in the string with (?!...)

Uncanny(?!.*X-Men)

Will Match:

Uncanny X-Force
Uncanny Avengers
Uncanny

But not:

Uncanny X-Men

In English this is 'Search for series where Uncanny is not followed by X-Men'.
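
Here is a small sketch of the negative look-ahead in Python's re module (for illustration only). Note that the negative form is written (?!...) with no = sign inside it.

```python
import re

# "Uncanny" with no "X-Men" anywhere after it.
not_x_men = re.compile(r"Uncanny(?!.*X-Men)")

titles = ["Uncanny X-Force", "Uncanny Avengers", "Uncanny", "Uncanny X-Men"]
found = [t for t in titles if not_x_men.search(t)]
print(found)
```

Everything except Uncanny X-Men is found.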

And this brings up a condition that might surprise you:

Spider(?!.*\bMan\b)
(Remember \b is a word boundary, so \bMan\b is making sure Man isn't inside a longer word like Management.)

Will not only match:

Spider-Woman
Spider-Girl

But also:

Spider-Man Family Featuring Spider-Clan

Because Spider-Clan has a Spider not followed by Man.
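
The surprise is easier to see when you check where the match happens. A hedged sketch in Python's re module, using the negative look-ahead form of the pattern:

```python
import re

no_man_after = re.compile(r"Spider(?!.*\bMan\b)")

print(bool(no_man_after.search("Spider-Woman")))
print(bool(no_man_after.search("Spider-Girl")))
print(bool(no_man_after.search("Spider-Man")))   # Man follows, so no match

long_title = "Spider-Man Family Featuring Spider-Clan"
match = no_man_after.search(long_title)
# The first Spider is rejected; the match starts at the second one.
match_start = match.start()
print(match_start == long_title.rfind("Spider"))
```

So a title with Spider-Man in it can still be found if a later Spider has no Man after it.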


Without the .* we can specify that a word or character is not immediately followed by another:

Spider(?!-)

Matches

The Spider
Scarlet Spider

But not:

Spider-Man
Spider-Woman
Spider-Girl
Spider-Clan
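
A short sketch of the same test in Python's re module (for illustration; the last title is an invented extra case):

```python
import re

no_hyphen = re.compile(r"Spider(?!-)")

matches = ["The Spider", "Scarlet Spider"]
non_matches = ["Spider-Man", "Spider-Woman", "Spider-Girl", "Spider-Clan"]

print([bool(no_hyphen.search(t)) for t in matches])
print([bool(no_hyphen.search(t)) for t in non_matches])

# Only the very next character is checked, so a run-together spelling passes.
print(bool(no_hyphen.search("Spiderman")))
```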
Last Edit: 3 years 11 months ago by jkthemac.
The following user(s) said Thank You: perezmu, Xelloss

Re: Regular Expressions Tutorial V2 3 years 11 months ago #36497

  • jkthemac
Real Example:

I recently used Regular Expressions to tidy up the Format field in my collection.

I had many comics listed as "Scan" but a few stragglers listed as "scan", i.e. without capitalisation. I couldn't just use a smart list based on 'Format Contains scan' because that would be case insensitive.

So instead I just changed Contains to Regular Expression leaving this Smart List:

Name "scan"
Match [Format] regex "scan"


This is case sensitive.

This is probably the simplest use of Regular Expressions, but it is pretty useful when tidying up individual fields.
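
The difference between the two kinds of matching can be sketched in Python's re module (for illustration only; the formats list is invented and this is not how the smart list itself is implemented):

```python
import re

# Hypothetical values of the Format field.
formats = ["Scan", "scan", "Scan", "Digital"]

# A plain regex is case sensitive: only the lowercase stragglers are found.
stragglers = [f for f in formats if re.search(r"scan", f)]

# Adding (?i) gives the case-insensitive behaviour described for Contains.
any_case = [f for f in formats if re.search(r"(?i)scan", f)]

print(stragglers)
print(any_case)
```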
Last Edit: 3 years 3 months ago by jkthemac.
The following user(s) said Thank You: Xelloss

Re: Regular Expressions Tutorial V2 3 years 11 months ago #36498

  • jkthemac
Reserved space
xx
To comment or ask questions please respond to the first tutorial thread:

Regular Expressions Tutorial
The following user(s) said Thank You: Xelloss

Re: Regular Expressions Tutorial V2 1 year 2 months ago #45616

  • Xelloss
hahaha, thanks for the tutorial...

Just to comment: I used your tutorial for a problem I had (and solved it!)

While playing with a script I was writing, I messed up the "Series" field of my whole collection, replacing it in each comic with a "simpler to match" version (e.g. "Spider-Man Annual" became "spidermanannual"). In case someone else is writing a script: if you load a copy of all the books in the library with the CR command, it isn't really a copy, it is a pointer to the real collection, SO DON'T MODIFY ANYTHING THERE unless you want to modify it in the library too :pinch:. I had to scrape every comic again, but as that took literally days, I had to do it in parts, which raised the big problem of knowing which ones were already corrected and which ones were still to be fixed...

The first thing I did, as the series was in the file name, was to use the filename to check the series. For example, if the filename contained "Spider-Man" and the series was "spiderman", I knew the series was not OK. That worked for most comics (although I had to rescrape all the comics with special symbols not allowed in filenames, such as : / ", but there were only a few hundred of them, so no problem there). But then I realised that one-word series were, with case-insensitive matching, the same even when not fixed ("Wolverine" and "wolverine"). So I needed to find any series starting with a non-capital letter, but I couldn't find a way to do it with the tools I had...

Fortunately, I found your tutorial (I read it all, by the way, and it's really FANTASTIC: not only very easy to understand, but even funny and entertaining) and I just used:

^[A-Z]

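That pattern matches a series that starts with a capital letter, so by negating the match (or by using ^[^A-Z] directly) I found all books whose series started with a non-capital letter or a symbol...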
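
A sketch of this check in Python's re module (for illustration only; the series list is invented). Since ^[A-Z] on its own finds values that do start with a capital letter, the unfixed ones are the values that fail it, or equivalently the ones that match ^[^A-Z].

```python
import re

# Hypothetical Series values, some already fixed and some not.
series = ["Wolverine", "wolverine", "Spider-Man Annual", "spidermanannual", "100 Bullets"]

# Inverting ^[A-Z]: everything that does NOT start with a capital letter.
not_capital = [s for s in series if not re.search(r"^[A-Z]", s)]

# The same idea written as a single pattern with a negated character class.
negated_class = [s for s in series if re.search(r"^[^A-Z]", s)]

print(not_capital)
print(negated_class)
```

Values that start with a digit or symbol, like 100 Bullets, are caught as well, so they need a manual look.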

I am sure I will use more complex expressions for a lot of other things, but even if it were only that, it saved me from checking every comic in my library one by one XD

As usual: forgive my pathetic English... I am not good at writing in English, it is 4am here and I don't know half of the things I write, so I don't need to reread my post to know it is an insult to the language XD
Last Edit: 1 year 2 months ago by Xelloss.

Regular Expressions Tutorial V2 2 months 1 week ago #47666

  • jkthemac
Placing these examples here after discussing substrings in another thread:

Contains word that starts with Substring:

(?i)\bsuper\p{L}+\b

Ends with Substring:

\p{L}+man\b

Substring in Middle:

\p{L}+man\p{L}+

Start, middle, end or whole word:

(?i)\p{L}*man\p{L}*

Whole word (without hyphen):

(?i)(^|[^-\p{L}])man\b

String starts with whole word:

(?i)^superman\b

String starts with a word that ends in the substring:

^\p{L}+man\b

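I don't believe I talked about \p{L} in my tutorial. It is a more general way of referring to a letter, which includes any Unicode letter from multiple languages.

[^...] is a way of specifying that a character or characters are not in a position: it matches any single character that is not listed inside the brackets, so [^-] matches one character that is not a hyphen. Note that it still has to match a character, so on its own it cannot match at the very start of the string.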
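
Three of the substring patterns can be tried out with the sketch below. It comes with a loud assumption: Python's built-in re module does not understand \p{L}, so the sketch substitutes [^\W\d_] as an approximate "any letter" class (the third-party regex package accepts \p{L} directly). The LETTER name and the sample titles are invented, and the "ends with" pattern is written with a closing \b so the word really has to end there.

```python
import re

# Approximate stand-in for \p{L}: a word character that is not a digit or underscore.
LETTER = r"[^\W\d_]"

starts_with = re.compile(rf"(?i)\bsuper{LETTER}+\b")  # word starting with "super"
ends_with = re.compile(rf"{LETTER}+man\b")            # word ending in "man"
in_middle = re.compile(rf"{LETTER}+man{LETTER}+")     # "man" inside a word

print(bool(starts_with.search("Supergirl")))      # super + more letters
print(bool(starts_with.search("Super Friends")))  # "Super" is a whole word here
print(bool(ends_with.search("Batman")))
print(bool(ends_with.search("Humanity")))         # the word does not end at "man"
print(bool(in_middle.search("Humanity")))
print(bool(in_middle.search("Batman")))           # nothing after "man"
```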
Last Edit: 2 months 1 week ago by jkthemac.
The following user(s) said Thank You: rmagere

Regular Expressions Tutorial V2 2 months 1 week ago #47678

  • rmagere
Thank you!
I would suggest (for clarity) that the "variable" part in your examples is coloured differently.
Once you realise that you are using superman as the example it becomes clear; however, given my lack of familiarity with regex, it took a little bit of time to get there :)
