General discussion about ComicRack

TOPIC: Scanners technical question... and possible forum community project

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48335

  • Xelloss
  • Platinum Boarder
  • Posts: 455
  • Thank you received: 117
  • Karma: 24
This is more a technical question than anything else...

I was reading the great The Organizer 2015 tutorial PDF, and I noticed I had never paid much attention to comic scanners. I renamed my comic files long ago (both the archives and the files inside them), so I lost that source of information for "capturing it" with a script.

Now I was thinking of a way to recover this data, which seems an impossible task. HOWEVER, I thought of this:

If scanners "scan" or do some kind of unique digital work on comic pages... then it is almost impossible for the same comic page from different scanners to have exactly the same "image coding". In other words, I could use the CRCs of the image files in zip files as a "signature" of a scanner's work, couldn't I? Of course, for this I would need to know which CRCs match which scanner, but that is something anyone who saved the scanner in their comic files could provide... (as long as they didn't re-encode the page files, which isn't very likely unless they resized the images or changed their quality)

I could make a script that runs on comics whose scanner is known, saves this data in a file, and builds a huge database of CRC-scanner matches (for each comic and each version). Another script could then use it to fill in this data in comics where it was lost... (in other people's libraries)

In other words, I am talking about a scanner info provider based on page file CRCs... Do you think this would work?

Of course, all this rests on the hypothesis that scanners re-encode their comic pages and don't take raw files from some common source... because in that second case, different scanners would have the same page files... and it would be impossible to tell one from the other...

If this works, however, I could make a script that, for example, stores the CRC of every page of a comic and "links" all these CRCs to the scanner saved in a certain field (the field used in the tutorial, for example). Then, with this info stored in a common database, I could search a library for comics that have at least 50% of these CRCs (to make sure the comic is the same and not just a few shared pages that could be ads) and save the linked scanner in the same field...
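
A minimal sketch of the 50% rule described above, in plain Python (not a ComicRack script). The function name `find_scanner`, the `min_fraction` parameter and the database layout (a list of entries, each with a `scanner` string and a set of page `crcs`) are assumptions made for illustration, and the sample CRC values are made up.

```python
def find_scanner(library_crcs, database, min_fraction=0.5):
    """Return the scanner of the best-matching database entry, or None.

    An entry matches when at least min_fraction of ITS page CRCs are found
    in the library comic, so a comic with deleted pages can still match.
    """
    library_crcs = set(library_crcs)
    best_scanner, best_fraction = None, 0.0
    for entry in database:
        known = entry["crcs"]
        if not known:
            continue
        fraction = len(known & library_crcs) / len(known)
        if fraction >= min_fraction and fraction > best_fraction:
            best_scanner, best_fraction = entry["scanner"], fraction
    return best_scanner


# Hypothetical shared database with made-up CRC values.
database = [
    {"scanner": "Zone-Empire", "crcs": {0x11, 0x12, 0x13, 0x14}},
    {"scanner": "Minutemen-Toth", "crcs": {0x21, 0x22, 0x23, 0x24}},
]

# Three of the four known pages are present, so the entry matches.
print(find_scanner([0x11, 0x12, 0x13, 0x99], database))
# Only one shared page (it could be an ad), so nothing is returned.
print(find_scanner([0x11, 0x99], database))
```

Measuring the fraction against the database entry rather than against the library comic is a design choice: it keeps the match working when pages were removed from the library copy.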

To build a successful database that recognises most comics, I would also need help from people who save this data in their comics (and don't re-encode the pages by compressing or resizing them) and have a lot of comics in cbz format... Just run my script and upload the output file to the forum :)

Once the database is built, the script could recognise the scanner of a comic even if it was renamed, has deleted pages, was exported to another format, or was modified multiple times... always provided that somebody who saved the scanner by another method (the filename, for example) uploaded the CRC-scanner data to the shared database, of course...

Edit: Now that I think of it, if scanners work as I think they do, I could even use shared pages (ads and scanner information pages) to recognise scanners in comics that are not even in the database (assuming the same ad or the same scanner page means the same scanner). It is a pity I have deleted these pages from my comics, but a lot of people keep them in their comic files!
Last Edit: 4 weeks 1 day ago by Xelloss.
The administrator has disabled public write access.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48338

  • Xelloss
I will write here how it would work, so other users can spot problems or add ideas to the procedure... (I will update it as I find improvements)

Export scanner info from libraries:

First, look for all comics in the library that have the scanner field filled in (and save these comics in a list)
Then, among the comics with an empty scanner field, check whether they still have scanner information in the filename (if so, add them to the list and fill in the scanner field)
Then, among the rest of the comics, check which ones have the information inside the file (page filenames and folders), and fill in the scanner field
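
A small sketch of the filename step, assuming the scanner tag is the last parenthesised group of the name, as in the filename examples given in this thread. The function name `scanner_from_name` and the rule for skipping years and "of NN" counts are assumptions for illustration; real libraries will have names this does not cover.

```python
import re

# Last "(...)" group of the name, optionally followed by a file extension.
_LAST_TAG = re.compile(r"\(([^()]+)\)\s*(?:\.\w+)?$")


def scanner_from_name(name):
    """Return the trailing parenthesised tag of a comic name, or None."""
    match = _LAST_TAG.search(name)
    if match is None:
        return None
    tag = match.group(1).strip()
    # Skip tags that are clearly not scanner names: a year or an "of 04" count.
    if re.fullmatch(r"\d{4}", tag) or re.fullmatch(r"of \d+", tag, re.IGNORECASE):
        return None
    return tag


print(scanner_from_name("bla bla bla (Zone-Empire).cbz"))
print(scanner_from_name("Age of Heroes 01 (2010).cbz"))
```

The same function can be applied to a folder name found inside the archive, since such folders are reported to keep the original release name.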

Once I have a list of all comics and their scanner info, I open each file and save the CRC of every page (linking the scanner with the CRCs) -> I have some experience with this, as I wrote a script that compares CRCs across files to find duplicate pages
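
As an illustration of collecting page CRCs, here is a sketch using Python's standard `zipfile` module, which exposes the stored CRC-32 of each entry as `ZipInfo.CRC` without decompressing the data. The function name `page_crcs` and the list of image extensions are assumptions, and the demo uses an in-memory archive with fake page data in place of a real .cbz file.

```python
import io
import zipfile
import zlib

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def page_crcs(cbz):
    """Return {entry name: CRC-32} for the image entries of a zip-based comic.

    Only the zip directory is read; no page is decompressed.
    """
    with zipfile.ZipFile(cbz) as archive:
        return {
            info.filename: info.CRC
            for info in archive.infolist()
            if info.filename.lower().endswith(IMAGE_EXTENSIONS)
        }


# Demo: an in-memory archive standing in for a real .cbz file.
page = b"fake image bytes"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("001.jpg", page)
    archive.writestr("ComicInfo.xml", b"<ComicInfo/>")

crcs = page_crcs(buffer)
# The stored value is the CRC-32 of the uncompressed page data.
print(crcs == {"001.jpg": zlib.crc32(page)})
```

`page_crcs` also accepts a path on disk, so `page_crcs("some comic.cbz")` would work the same way for a real file.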

While doing the above, I will look for pages that have the same CRC in different comics (ads and scanner info pages duplicated across comics). If such a page is repeated in 2 or more comics of the same scanner (and only of that scanner), I will save this CRC-scanner pair in a special list...

Then I will search the library for all comics with no scanner information that contain CRCs from the special list, and add the scanner info (if they have these pages, I know their scanner and can fill it in).
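
The special list described above could be built along these lines. This is a sketch only: the function name `signature_crcs` and the input shape (one `(scanner, page CRCs)` pair per identified comic) are assumptions, and the CRC values are made up.

```python
from collections import defaultdict


def signature_crcs(comics):
    """Return {crc: scanner} for pages shared by 2+ comics of exactly one scanner.

    comics: iterable of (scanner, iterable of page CRCs), one item per comic.
    """
    scanners_by_crc = defaultdict(set)
    comic_count_by_crc = defaultdict(int)
    for scanner, crcs in comics:
        for crc in set(crcs):  # count each page once per comic
            scanners_by_crc[crc].add(scanner)
            comic_count_by_crc[crc] += 1
    return {
        crc: next(iter(scanners))
        for crc, scanners in scanners_by_crc.items()
        if len(scanners) == 1 and comic_count_by_crc[crc] >= 2
    }


comics = [
    ("Zone-Empire", {1, 2, 100}),
    ("Zone-Empire", {3, 4, 100}),     # page 100 repeats: a tag or ad page
    ("Minutemen-Toth", {5, 300}),
    ("Zone-Empire", {6, 300}),        # page 300 is shared by two scanners
]
print(signature_crcs(comics))
```

Page 300 is left out because it appears under two different scanners, which is the case where the files may come from a common source.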

Once I have filled in the scanner field in every comic I can, I would build the CRC-scanner list once more and export it to a file that could be uploaded to the forum to be added to the shared database...

Import scanner info into libraries:

Once you have downloaded the latest version of the shared database, you run the script and it looks up each comic's CRCs in the database. If it finds more than one in a comic, or just one of the special ones (ads and scanner info pages), the script fills in the scanner it is linked to (in the case of two regular CRCs from regular pages, only if the scanner linked to both CRCs is the same)
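
The import rule above can be sketched as a single decision function. The name `scanner_for_comic` and the two lookup tables (`regular` and `special`, both mapping a CRC to a scanner) are assumptions for illustration. When more than two regular pages match, this sketch takes the strict reading that all of them must agree.

```python
def scanner_for_comic(page_crcs, regular, special):
    """Decide a comic's scanner from its page CRCs, or return None."""
    # One hit on a special page (ad or scanner info page) is enough.
    for crc in page_crcs:
        if crc in special:
            return special[crc]
    # Otherwise require at least two regular pages credited to the same scanner.
    hits = [regular[crc] for crc in set(page_crcs) if crc in regular]
    if len(hits) >= 2 and len(set(hits)) == 1:
        return hits[0]
    return None


regular = {1: "Zone-Empire", 2: "Zone-Empire", 3: "Minutemen-Toth"}
special = {100: "Minutemen-Toth"}

print(scanner_for_comic([1, 2, 50], regular, special))   # two agreeing pages
print(scanner_for_comic([1, 50], regular, special))      # one page is not enough
print(scanner_for_comic([1, 3], regular, special))       # pages disagree
print(scanner_for_comic([50, 100], regular, special))    # one special page
```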

I require two pages for regular pages because, in my experience with CRCs across millions of pages, it is common to find one or two false positives; to reduce them, I compare two pages (two false matches in the same comic are almost impossible)
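
A rough back-of-the-envelope check of that experience, assuming CRC-32 values behave like uniformly random 32-bit numbers (an idealisation, not something guaranteed for real image files). Both function names are made up for this sketch.

```python
import math


def expected_colliding_pairs(n_files, bits=32):
    """Expected number of file pairs sharing a checksum among n_files files."""
    return n_files * (n_files - 1) / 2 / 2**bits


def chance_of_accidental_match(database_size, bits=32):
    """Chance that one unrelated page matches some entry of the database."""
    # 1 - (1 - 2^-bits)^database_size, computed without losing precision.
    return -math.expm1(database_size * math.log1p(-(2.0 ** -bits)))


pairs = expected_colliding_pairs(1_000_000)
single = chance_of_accidental_match(1_000_000)
print(f"colliding pairs among a million pages: about {pairs:.0f}")
print(f"one page against a million entries:    about {single:.1e}")
print(f"two independent pages both matching:   about {single ** 2:.1e}")
```

Under that assumption a million pages are expected to contain on the order of a hundred colliding pairs somewhere in the set, which fits with occasionally seeing a false positive, while requiring two matches in the same comic roughly squares the per-page chance.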

After this, I run the same procedure as in the export part, to auto-populate comic scanners by matching ad or scanner page CRCs

Note: I use the CRC hashes in cbz/zip files because it is easy for a script to read them from the zip file without loading all the pages... as zip files contain a directory that lists the CRC of every file... For those who don't know what CRCs are, they are checksums (zip uses the 32-bit CRC-32) that are used, first, to check whether the uncompressed file is corrupted and, second, to easily see whether two files are duplicates (as a single changed bit in a file changes the CRC). CRC comparison is not a 100% reliable method, as different files CAN have the same CRC... but in small groups of files (fewer than millions of files) the probability of false-positive comparisons is very low...
Last Edit: 4 weeks 1 day ago by Xelloss.
The following user(s) said Thank You: Alan Scott

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48341

  • Alan Scott
  • Gold Boarder
  • Posts: 264
  • Thank you received: 20
  • Karma: 10
Well, I'm very interested in this... I too didn't keep the scanner names when I started, as there was no tag for scanner information when I began using ComicRack back in 2008. Since then I've come to regret not having it, for choosing between duplicates and other reasons. I've been doing this manually, one comic file at a time, and as I've gone along I've found the two easiest ways to retrieve the data: either the scanner used a unique filename for their scanner .jpg, so if you find that you know who was the source, or they nested the comic images in total within a folder that is identical to the original name of the comic, scanner name included. The only problem for me is that I'm still having to do this one file at a time. The only other solution I've found is to browse a file share source that hashes your files and you can look through alternate seeders and hopefully they would have a full run of comics that matches yours with the scanner names and makes it a little easier.

Your proposed script sounds something like that, with the added bonus that we can use the matches to automatically populate the metadata. I absolutely would love this! Sign me up.
... The failure to appreciate... is perfectly understandable, because the readership never evaluates old material in the context of the cultural climate in which it was created, or the state of the art at the time it was created.
Marty Pasko

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48342

  • Xelloss
"The only other solution I've found is to browse a file share source that hashes your files and you can look through alternate seeders and hopefully they would have a full run of comics that matches yours with the scanner names and makes it a little easier."

I am interested in that, could you elaborate a bit more? File share source? Seeders? Where?

Also, when you talk about hashes, are you talking about cbz hashes? Because if so, when you store data in the xml inside the cbz file, you modify the hash. My idea is to work with the hashes INSIDE the cbz file, those of the image files... those are usually not modified (the files are only renamed or deleted)
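
A small demonstration of that difference, using in-memory zip archives with fake page data. The helper names `build_cbz` and `entry_crcs` are made up for this sketch, and SHA-1 merely stands in for whatever whole-file hash a sharing client computes.

```python
import hashlib
import io
import zipfile


def build_cbz(entries):
    """Build a zip archive in memory from (name, data) pairs; return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries:
            archive.writestr(name, data)
    return buffer.getvalue()


def entry_crcs(cbz_bytes):
    """Return {entry name: CRC-32} read from the zip directory."""
    with zipfile.ZipFile(io.BytesIO(cbz_bytes)) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


pages = [("001.jpg", b"page one"), ("002.jpg", b"page two")]
original = build_cbz(pages)
tagged = build_cbz(pages + [("ComicInfo.xml", b"<ComicInfo/>")])

# The hash of the whole archive changes once the xml is added...
whole_file_changed = hashlib.sha1(original).digest() != hashlib.sha1(tagged).digest()
# ...but every page keeps the CRC it had before.
page_crcs_kept = all(
    entry_crcs(tagged)[name] == crc for name, crc in entry_crcs(original).items()
)
print(whole_file_changed, page_crcs_kept)
```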

Auto-populating the data is by far the easier part... it is only a couple of lines in a script... the problem is having the data to know what to populate... (which is what I am trying to achieve with this)

"the scanner used a unique filename for their scanner .jpg, so if you find that you know who was the source, or they nested the comic images in total within a folder that is identical to the original name of the comic, scanner name included."

Could you give me examples of this? As I export all my comics to cbz format as part of my organizing system, I also lost all the page and folder names inside the cbz files of my comics D:. If I am going to write a script that interprets this info, I need as many examples as possible... I will try to work with new comic files downloaded from now on... but if you have some experience with identifying scanners this way, you can give me a better idea...

All the same, I think the true potential of my idea is in finding duplicate pages in DIFFERENT comics... because by identifying just one ad or scanner ad page... you can identify the scanner of A LOT of other comics that also share this page... (scanners reuse pages across different comics 99% of the time)
Last Edit: 4 weeks 1 day ago by Xelloss.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48343

  • Xelloss
Also, another question... imagine you have a comic named: "bla bla bla (Zone-Empire).cbz" or "bla bla bla (Minutemen-Toth).cbr"

What exactly is the scanner? "Zone-Empire"? "Empire"? "Minutemen"? "Minutemen-Toth"? "Toth"?
Last Edit: 4 weeks 1 day ago by Xelloss.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48344

  • Alan Scott
• 1. Cool, wasn't sure I could mention it. The source is DC++. What you've shared will be hashed against all other files on DC++ and if you look through your own file list and search for alternates you can find everyone else's share of that file and see if you can find one with the scanner info. If it's a complete series you can then browse the file list for it and see if you match the whole list, which makes it a little easier to get the info. You still have to do it manually but it is a little easier.

• 2. Okay, first a file name match. Zone-Empire uses a couple of different photos for his tag, but one example is zdelirium.jpg. No one else uses that name, so if you scan the zip/rar files and get a hit on that file name, you can auto-populate with the script that the file scanner is Zone-Empire without doubt. Now, for a folder match. Say for example it's Age of Heroes 01 (2010). When you open the file with WinRAR or what have you, the first item you find is a folder named "Age of Heroes 01 (of 04) (2010) (Minutemen-CalamityCoyote)". That's the original filename. So if you can use a script to scan the comic files and find folders like that, you can also use that to auto-populate.
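
Both checks could look roughly like this for zip-based comics, using Python's standard `zipfile` module (it does not read .cbr/RAR files, which would need a third-party library). The names `KNOWN_TAG_IMAGES`, `scanner_from_archive` and `make_cbz` are assumptions for illustration, and the tag table holds only the one pair mentioned above; it is also an assumption that the scanner tag is the last parenthesised group of the folder name.

```python
import io
import posixpath
import re
import zipfile

# Tag image name -> scanner. Only this one pair comes from the thread;
# a real table would have to be collected by hand.
KNOWN_TAG_IMAGES = {"zdelirium.jpg": "Zone-Empire"}

# Last "(...)" group at the end of a folder name.
_LAST_TAG = re.compile(r"\(([^()]+)\)$")


def scanner_from_archive(cbz):
    """Guess the scanner from entry names in a zip-based comic, or return None."""
    with zipfile.ZipFile(cbz) as archive:
        names = archive.namelist()
    # 1. A known tag image anywhere in the archive.
    for name in names:
        base = posixpath.basename(name).lower()
        if base in KNOWN_TAG_IMAGES:
            return KNOWN_TAG_IMAGES[base]
    # 2. A top-level folder that still carries the original release name.
    for name in names:
        folder, sep, _ = name.partition("/")
        match = _LAST_TAG.search(folder.strip()) if sep else None
        if match and not match.group(1).strip().isdigit():
            return match.group(1).strip()
    return None


def make_cbz(names):
    """Demo helper: an in-memory zip whose (empty) entries have the given names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    return buffer


folder = "Age of Heroes 01 (of 04) (2010) (Minutemen-CalamityCoyote)"
print(scanner_from_archive(make_cbz([folder + "/001.jpg"])))
print(scanner_from_archive(make_cbz(["001.jpg", "zdelirium.jpg"])))
```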

The only other way I've found scanner info within a comic file is a rare text document, an .nfo or maybe an .sfv file that may have it, but these are so rare and so inconsistent that it wouldn't be worth the trouble to handle them.

It goes without saying that if you converted a cbr to cbz to save a ComicRack xml within and didn't save the scanner info then that data is lost because the file contents won't match any hashes. The best you can hope for is to manually view a scanner tag image you recognize.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48345

  • Alan Scott
Empire (and most others, current and defunct) places their group name at the end, Minutemen at the front. So Zone is the individual, Empire is the group, Minutemen is the group, Toth is the individual. Both parts of the name are needed for proper credit.
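
The naming convention described above could be encoded like this. It is a sketch only: the two group tables contain just the groups named in this thread, the function name `split_credit` is made up, and any tag that matches neither table is returned whole with no group.

```python
# Groups written at the end of the tag (Individual-Group).
GROUPS_LAST = {"Empire"}
# Groups written at the front of the tag (Group-Individual).
GROUPS_FIRST = {"Minutemen"}


def split_credit(tag):
    """Split a scanner tag into (group, individual); group is None if unknown."""
    for group in GROUPS_FIRST:
        if tag.startswith(group + "-"):
            return group, tag[len(group) + 1:]
    for group in GROUPS_LAST:
        if tag.endswith("-" + group):
            return group, tag[: -len(group) - 1]
    return None, tag


print(split_credit("Zone-Empire"))
print(split_credit("Minutemen-Toth"))
print(split_credit("SomeSoloScanner"))
```

Since both parts are needed for proper credit, the full tag is still what would be stored in the scanner field; the split is only useful for grouping or searching.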

I have occasionally found scanners that either didn't work with a group for a release, or failed to place the group name on their own name, but this is not common.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48346

  • Xelloss
Alan Scott wrote:
Empire (and most others, current and defunct) places their group name at the end, Minutemen at the front. So Zone is the individual, Empire is the group, Minutemen is the group, Toth is the individual. Both parts of the name are needed for proper credit.

I have occasionally found scanners that either didn't work with a group for a release, or failed to place the group name on their own name, but this is not common.

Great! That is consistent with what I was seeing... as when the last name changed (the name of the scanner), the same pages (in ads) have different CRCs... this means they are encoded or scanned by a different method... which means I can tell one from the other using CRCs...

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48348

  • Xelloss
Alan Scott wrote:
It goes without saying that if you converted a cbr to cbz to save a ComicRack xml within and didn't save the scanner info then that data is lost because the file contents won't match any hashes. The best you can hope for is to manually view a scanner tag image you recognize.

With my method, if you don't use options that alter the images (changing the size, for example), you don't lose the hashes of the image files, so no, you don't lose the info! (as the hash is of the image files, not of the whole comic). That is what I am talking about doing. Even if you delete pages, as I do with scan ads, you still don't lose the hashes of the pages you kept, and you can still use them as a method to compare files! (of course, you first need the hashes of pages from already identified comics)
Last Edit: 4 weeks 1 day ago by Xelloss.

Scanners technical question... and possible forum community project 4 weeks 1 day ago #48349

  • Xelloss
Another question... speaking of the scanner page (when present), is it usually from the individual scanner or from the scanner group? (I never paid attention to that before)