

WebComics

ComicRack supports WebComic (.cbw) files. With WebComics, ComicRack can read comics directly from web pages and display them as if they were standard eComics (CBR, CBZ). WebComics can be exported to other formats. If the definition supports it, a WebComic can update itself to add new pages (for example for daily or weekly comics).

File Format

A WebComic is an XML-based file with the .cbw extension. The basic structure is the following:

<WebComic>
  <Info/>
  <Variables>
    <Variable Key="Base" Value="http://www.milehighcomics.com/firstlook/marvel/avengers500/" />
  </Variables>
  <Images>
    <Image Url="{Base}cover.jpg" />
    <Image Url="{Base}[0:1-24].jpg" />
  </Images>
</WebComic>

Elements

Info

This is where ComicRack stores its eComic info (when you edit the values with the Comic Book Dialog or execute a script to get the values).

Variables

This is an optional collection of textual variables that you can reuse in the image entries (with the {key} construct).
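The {key} substitution can be pictured with a small Python sketch. This is only an illustration of the idea, not ComicRack's actual implementation: the function name substitute_variables is hypothetical, and leaving unknown keys untouched is an assumption.

```python
import re

def substitute_variables(url: str, variables: dict[str, str]) -> str:
    """Replace each {Key} placeholder with the value of the matching variable."""
    def replace(match: re.Match) -> str:
        # Assumption: placeholders without a matching variable are left as they are.
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\{(\w+)\}", replace, url)

# The variable from the basic structure example above.
variables = {"Base": "http://www.milehighcomics.com/firstlook/marvel/avengers500/"}
expanded = substitute_variables("{Base}cover.jpg", variables)
print(expanded)
```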

Images / Image

This is a collection of Image entries that define the actual pages of the WebComic. In the simplest case this can be direct links to images on the internet or it can be complex scraping definitions.

Image Types

ComicRack supports three different Image element types to add pages to your WebComic.

Url

<Image Type="Url" Url="http://lala.com/someimage.jpg"/>

This is the simplest type of page link. It just tells ComicRack to add the linked image as a new page. As this type is the default, you can omit the Type attribute.

This type supports defining references to multiple pages with one entry. The syntax is

<Image Type="Url" Url="http://lala.com/page[format:a-b].jpg"/>

where format is a number format string like “0” or “00”, a is the start number, b is the end number.

So adding an image element like

<Image Type="Url" Url="http://lala.com/page[00:8-11].jpg"/>

is the same as

<Image Type="Url" Url="http://lala.com/page08.jpg"/>
<Image Type="Url" Url="http://lala.com/page09.jpg"/>
<Image Type="Url" Url="http://lala.com/page10.jpg"/>
<Image Type="Url" Url="http://lala.com/page11.jpg"/>
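The expansion of a [format:a-b] range can be sketched in Python as follows. This is a rough illustration, not ComicRack's own code: the name expand_range is hypothetical, and the sketch assumes the format string consists only of zeros (such as "0" or "00"), which it treats as a minimum number of digits.

```python
import re

# Matches a range such as [00:8-11]: zero-only format, start number, end number.
RANGE = re.compile(r"\[(0+):(\d+)-(\d+)\]")

def expand_range(url: str) -> list[str]:
    """Expand the first [format:a-b] range in url into one URL per page number."""
    match = RANGE.search(url)
    if match is None:
        return [url]  # no range present, the URL is a single page
    width = len(match.group(1))  # "00" -> pad to at least two digits
    start, end = int(match.group(2)), int(match.group(3))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [prefix + str(n).zfill(width) + suffix for n in range(start, end + 1)]

pages = expand_range("http://lala.com/page[00:8-11].jpg")
for page in pages:
    print(page)
```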

BrowseScraper

The BrowseScraper type is intended for WebComics that do not have an index page, but rather a start page with an image on it and a next button to get to the next page.

BrowseScrapers use regular expressions. The basic structure is

<Image Type="BrowseScraper" Url="start page link|regex for the image link|regex for the next page link"/>

Alternatively, you can omit the Type attribute and start the Url with a '?':

<Image Url="?start page link|regex for the image link|regex for the next page link"/>

Or you can define the three parts separately (for example if they are very complex or contain the | delimiter in the regular expression):

<Image Url="?start page link">
  <Parts>
    <Part>regex for the image link</Part>
    <Part>regex for the next page link</Part>
  </Parts>
</Image>

So let's try this with an example: a classic daily web comic, www.penny-arcade.com. We want all the 2010 issues.

<Image Url="?http://www.penny-arcade.com/comic/2010/1/1/">
  <Parts>
    <Part>&quot;http.*/\d\d.*jpg&quot;</Part>
    <Part>(?&lt;link&gt;&quot;/comic/.+&quot;)(?=.Next)</Part>
  </Parts>
</Image>

The Url is the start page for our scraper. The first part defines the regular expression for getting the link to the JPEG image(s). The second part gets the link that ComicRack follows to move one page forward. If this part does not match, or the link is one that ComicRack has already scraped, the scraping ends.

Also note that, as you are in an XML file, you need to write special characters like " or > as XML entities (like &quot; or &gt;).
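If you prefer not to escape a regular expression by hand, a small script can do it. The following Python sketch uses the standard library's xml.sax.saxutils.escape to turn the raw next-page expression from the example above into the text of a Part element. The variable names are only illustrative.

```python
from xml.sax.saxutils import escape

# The raw .NET regular expression, before XML escaping.
regex = r'(?<link>"/comic/.+")(?=.Next)'

# escape() handles &, < and > by default; the extra mapping also escapes quotes.
part = "<Part>" + escape(regex, {'"': "&quot;"}) + "</Part>"
print(part)
```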

When ComicRack updates the WebComic it rereads the last link page and checks if a new page has been added.
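The scraping loop described above can be sketched in Python. This is a simplified model under stated assumptions, not ComicRack's implementation: browse_scrape, the fetch callback and the in-memory PAGES site are hypothetical, every expression is assumed to contain a link group, and Python writes named groups as (?P<link>...) where .NET uses (?<link>...).

```python
import re

# Hypothetical in-memory site standing in for real web pages: URL -> HTML.
PAGES = {
    "/comic/1": '<img src="/img/001.jpg"> <a href="/comic/2">Next</a>',
    "/comic/2": '<img src="/img/002.jpg"> <a href="/comic/3">Next</a>',
    "/comic/3": '<img src="/img/003.jpg">',
}

def browse_scrape(start, image_regex, next_regex, fetch):
    """Collect image links page by page, following the next link each time."""
    images, visited, url = [], set(), start
    # Stop when there is no next link or the link was already scraped.
    while url is not None and url not in visited:
        visited.add(url)
        html = fetch(url)
        images.extend(m.group("link") for m in re.finditer(image_regex, html))
        next_match = re.search(next_regex, html)
        url = next_match.group("link") if next_match else None
    return images

IMAGE_REGEX = r'src="(?P<link>[^"]+\.jpg)"'
NEXT_REGEX = r'href="(?P<link>[^"]+)">Next'
images = browse_scrape("/comic/1", IMAGE_REGEX, NEXT_REGEX, PAGES.__getitem__)
print(images)
```

Passing the page loader in as fetch keeps the sketch runnable without network access; a real loader would download the page instead.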

IndexScraper

The IndexScraper is intended for web comics that have a central index page for all their pages. The general format is

<Image Type="IndexScraper" Url="index page|[!]regex for page links|[!]regex for page links on these pages|...|[!]regex for the images"/>

The scraper supports a chain of n pages to get from the index page to the actual images. This way it supports link chains like Index Page→Month Links→Day Pages→Images on Day Page.

The optional ! in front of a regex tells the scraper to reverse the matches. This is helpful if the index page lists newest first.

As with the BrowseScraper, you can also omit the Type attribute and simply start the Url with a !, or put the regular expressions into a Parts list.

Let's look at an example: http://www.abandoncomic.com/ - Abandon: First Vampire

<Image Url="!http://www.abandoncomic.com/?page_id=6">
  <Parts>
    <Part>!href=&quot;(?&lt;link&gt;.+\?p=\d+)</Part>
    <Part>src=&quot;(?&lt;link&gt;.*/comics/.*\.jpg)&quot;</Part>
  </Parts>
</Image>

The Url is a link to the index page. The first part is the regex to find the individual pages, the second part is to find the image links on these pages. As the index page is newest first, we reverse the order with the !.

When ComicRack updates the WebComic it rereads the index page.
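The chain of regular expressions can be modelled with a short Python sketch. It is a guess at the general idea rather than ComicRack's real code: index_scrape, fetch and the in-memory SITE are hypothetical, reversing the matches separately for each page is an assumption, and Python's (?P<link>...) stands in for the .NET (?<link>...) syntax.

```python
import re

# Hypothetical in-memory site: the index lists the newest page first.
SITE = {
    "/index": '<a href="/p3">3</a> <a href="/p2">2</a> <a href="/p1">1</a>',
    "/p1": '<img src="/comics/1.jpg">',
    "/p2": '<img src="/comics/2.jpg">',
    "/p3": '<img src="/comics/3.jpg">',
}

def index_scrape(index_url, regexes, fetch):
    """Apply each regex to the pages found by the previous one; the last yields images."""
    urls = [index_url]
    for regex in regexes:
        reverse = regex.startswith("!")  # a leading ! reverses the matches
        pattern = regex[1:] if reverse else regex
        found = []
        for url in urls:
            links = [m.group("link") for m in re.finditer(pattern, fetch(url))]
            if reverse:
                links.reverse()  # assumption: reversal happens per page
            found.extend(links)
        urls = found
    return urls

images = index_scrape(
    "/index",
    ['!href="(?P<link>[^"]+)"', r'src="(?P<link>[^"]+\.jpg)"'],
    SITE.__getitem__,
)
print(images)
```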

How to create WebComics

Open your web comic's page in the browser of your choice. Decide whether you need to create a Url based (simple) or a regex based (BrowseScraper, IndexScraper) WebComic.

To find the regular expressions, select “View Source” in your browser and copy the HTML code into a regex testing tool of your choice. Play around until the regular expression matches what you need. Once you think you're done, put the expressions into the WebComic file and open it with ComicRack.

Please note that ComicRack works with the .NET implementation of regular expressions. If the expression contains a group named link, the value of that group is used. Otherwise the whole matched expression is used.
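The link group rule can be illustrated with a few lines of Python. This only mirrors the rule as described here and is not ComicRack code: matched_value is a hypothetical helper, and Python spells the named group (?P<link>...) while .NET uses (?<link>...).

```python
import re

def matched_value(pattern: str, text: str):
    """Return the link group if the pattern defines one, else the whole match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    if "link" in match.re.groupindex:  # the pattern has a group named link
        return match.group("link")
    return match.group(0)

html = '<img src="http://lala.com/someimage.jpg">'
with_group = matched_value(r'src="(?P<link>[^"]+)"', html)
without_group = matched_value(r'http://[^"]+\.jpg', html)
print(with_group, without_group)
```

A link group is useful when the expression needs surrounding context (such as src=") to find the right place, but only part of the match is the actual link.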

If you start ComicRack with the -ssc command-line switch, ComicRack will display a log of all the scraping actions. This may help when debugging WebComics.

Regular expression

.NET Framework Regular Expressions - MSDN Library regex documentation

RegexBuddy - regex testing utility, commercial software

Expresso - regex testing utility, freeware

