Python Scripts for ComicRack

TOPIC: Missing Issues using ComicVine (New Version 06-DEC-2014)

Missing Issues using ComicVine (New Version 06-DEC-2014) 6 years 9 months ago #12321

  • Samael69
OK, at long last I've worked out some of the stability issues. Multi-threading still doesn't work, but there's a workaround so you can see some semblance of progress.

Disclaimer:
This is a self-contained application and does not write anything to your library. It is not a replacement for the current find gaps script. It only works on collections that have been scraped with the ComicVine Scraper, and it ignores every entry that does not have a ComicVine (CVDB) URL in the Web Page field of the metadata. It will not find missing issues that have not been entered on ComicVine. I have not tested it with huge libraries, say 30,000+ books, though I think they should work fine.
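
To illustrate the filtering rule in the disclaimer, here is a minimal Python sketch (not the application's own Java code). It assumes the library XML stores each book as a `Book` element with a `Web` child holding the Web Page field; those element names and the tiny sample document are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature library. The element names (Book, Series, Web)
# are assumptions for this sketch, not taken from the application.
SAMPLE = """<ComicDatabase>
  <Books>
    <Book File="Hard Boiled 01.cbz">
      <Series>Hard Boiled</Series>
      <Web>http://www.comicvine.com/hard-boiled-hard-boiled/37-121911/</Web>
    </Book>
    <Book File="Dusk_-_Poor_Tom.cbz">
      <Series>Dusk</Series>
    </Book>
  </Books>
</ComicDatabase>"""


def split_by_scrape_status(xml_text):
    """Return (scraped, skipped) lists of Book elements."""
    root = ET.fromstring(xml_text)
    scraped, skipped = [], []
    for book in root.iter("Book"):
        # A missing or empty Web field means the book was never scraped.
        web = (book.findtext("Web") or "").strip()
        if "comicvine.com" in web.lower():
            scraped.append(book)
        else:
            skipped.append(book)
    return scraped, skipped


scraped, skipped = split_by_scrape_status(SAMPLE)
for book in skipped:
    print("NOT Scraped by ComicVine:", book.get("File"))
```

Only the books in `scraped` would take part in the comparison; everything in `skipped` is reported and otherwise ignored.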

In the archive you will find two files:

Run_cvDB.bat - Windows does not need this to run JAR files, but it opens a DOS console, which is where you will see progress. If you run the application without it, it will work just fine, but it will sit there like a dead horse until it's done or you kill it.

FindMissingIssues.jar - This is the application.

In the batch file there are several lines.

This application requires Java 6.0 (JRE 1.6). If you have multiple versions of Java installed, you must point to the correct version.

If you try to run it on an incompatible version you'll get an error something like "Unsupported major.minor version 49.0".

This is the execution line; you'll have to edit it to match where you put the JAR file.

"C:\Program Files (x86)\Java\jre6\bin\java" -Xms64m -Xmx1024m -jar "I:\New folder\test\FindMissingIssues.jar"

REM "C:\Program Files\Java\jre6\bin\java" -Xms64m -Xmx1024m -jar "F:\New folder\test\FindMissingIssues.jar"

pause

The lines beginning with REM are commented out and ignored. You should create the line that works and remove the others as they're simply included as examples. "Pause" just keeps the DOS window open until you hit a key to close it.

If you're unsure how to find java.exe or still have problems, try this: open a DOS window by clicking Start and typing "cmd" into the "Run" or "Search programs and files" field, then enter the commands below. There's a very high probability that Java is installed on your C drive.
C:\Users\USER_NAME>c:

C:\Users\USER_NAME>cd \

C:\>dir/s java.exe
 Volume in drive C has no label.
 Volume Serial Number is AC1D-224C

 Directory of C:\jdk\bin

20/10/2009  06:57 PM           139,264 java.exe
               1 File(s)        139,264 bytes

 Directory of C:\jdk\jre\bin

20/10/2009  06:57 PM           139,264 java.exe
               1 File(s)        139,264 bytes

 Directory of C:\Program Files\Java\jdk1.6.0_18\bin

03/11/2009  06:58 PM           165,888 java.exe
               1 File(s)        165,888 bytes

 Directory of C:\Program Files\Java\jdk1.6.0_18\jre\bin

03/11/2009  06:58 PM           165,888 java.exe
               1 File(s)        165,888 bytes

 Directory of C:\Program Files (x86)\Java\jre6\bin

11/10/2009  05:17 AM           145,184 java.exe
               1 File(s)        145,184 bytes
The output will look something like this and you can get the path you require. As you can see, I have multiple versions installed, but I am using the last one, the one that came with the Java Runtime Environment (JRE).
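
If you would rather not read through the `dir/s` output by hand, the same search can be sketched in Python. This is only an illustration of the idea; the function name `find_java` is made up for this example, and it simply lists every folder that contains a `java.exe`.

```python
import os


def find_java(root, name="java.exe"):
    """Walk `root` and return every directory that contains `name`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Windows file names are case-insensitive, so compare in lower case.
        if any(f.lower() == name.lower() for f in filenames):
            hits.append(dirpath)
    return hits


if __name__ == "__main__":
    # Searching a whole drive can take a while.
    for folder in find_java("C:\\"):
        print(folder)
```

Pick the folder that belongs to the JRE you want and use it in the execution line of the batch file.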

Once you run it, you'll see a screen with a number of check-boxes and fields.
  • Update All Issues - Forces the application to rescrape information of ALL issues, even if they already exist in the cache.
  • Update All Volumes - This is the default functionality. Basically it will skip all issues which have already been scraped and only check volumes for new issues.
  • Enable Blacklist - Forces the selected volumes to not show on the final output. This basically says I don't care if I'm missing these issues. For me, this contains most of the mainstream superhero stuff that I have single issues of as part of events or crossovers.
  • Hide 0, 0.5 Issues - This will hide negative, 0 and partial issues from the output. So examples might be "-1", "0", "0.5", "1.5", "100.1", etc.
  • Rebuild Local Cache - This overrides "Update All Issues" and "Update All Volumes" and rebuilds everything as if it's the first run.
The first field shows the default path for your "ComicDb.xml" file. You can choose an alternate file if you like. I don't force anything here and do no error checking: you can choose any file of any type, and the application will simply fail if it's not correct. The Output field is where the results are displayed when the run completes. They are also saved to a file, "missingIssues.txt", in the same folder as the JAR.
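
The "Hide 0, 0.5 Issues" rule described in the list above can be sketched as a small Python check. This is an illustration under the assumption that "partial" means any issue number with a fractional part; how the application itself treats non-numeric issue numbers is not stated, so this sketch simply leaves them visible.

```python
import math


def is_hidden_issue(number):
    """True for negative, zero, or fractional issue numbers."""
    try:
        value = float(number)
    except ValueError:
        # Assumption: non-numeric issue numbers (e.g. "A") are not hidden.
        return False
    if not math.isfinite(value):
        return False
    return value <= 0 or not value.is_integer()


issues = ["-1", "0", "0.5", "1", "1.5", "2", "100.1", "A"]
visible = [n for n in issues if not is_hidden_issue(n)]
print(visible)
```

With the option enabled, only the entries left in `visible` would appear in the output.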

The Blacklist shows your blacklist. You should restart the application after editing the blacklist. There is a bug here in how it displays. Clicking in the Output window will add titles to the blacklist. You still have to press the Save button...there is no confirmation. You can edit the Blacklist directly and save, but should only remove entries. The Blacklist will look something like this after clicking a number of titles...(Clicking issue lines will also add them to the blacklist, but really won't do any good. :P )

Green Lantern Corps (DC Comics - 2006) (18248) Adventure Comics (DC Comics - 2009) (25643) Green Arrow (DC Comics - 2010) (32002) Justice League of America (DC Comics - 2006) (18127) Outsiders (DC Comics - 2009) (25804) R.E.B.E.L.S. (DC Comics - 2009) (25833)

When you restart the application, it will be nicely formatted like this...

Green Lantern Corps (DC Comics - 2006) (18248)
Adventure Comics (DC Comics - 2009) (25643)
Green Arrow (DC Comics - 2010) (32002)
Justice League of America (DC Comics - 2006) (18127)
Outsiders (DC Comics - 2009) (25804)
R.E.B.E.L.S. (DC Comics - 2009) (25833)

The last number in the line is the ComicVine volume number. The brackets are very important here.
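
To show why the brackets matter, here is a hedged Python sketch of how a blacklist line in the format above could be parsed. The regular expression is my own reading of the "Title (Publisher - Year) (VolumeID)" layout, not the application's actual parser.

```python
import re

# Assumed layout: Title (Publisher - Year) (VolumeID)
BLACKLIST_LINE = re.compile(
    r"^(?P<title>.+) \((?P<publisher>.+) - (?P<year>\d{4})\) "
    r"\((?P<volume_id>\d+)\)\s*$"
)


def parse_blacklist(text):
    """Return {volume_id: title} for every well-formed line."""
    volumes = {}
    for line in text.splitlines():
        match = BLACKLIST_LINE.match(line.strip())
        if match:
            volumes[int(match.group("volume_id"))] = match.group("title")
    return volumes


sample = """Green Lantern Corps (DC Comics - 2006) (18248)
R.E.B.E.L.S. (DC Comics - 2009) (25833)
Green Arrow (DC Comics - 2010) 32002"""
blacklisted = parse_blacklist(sample)
print(blacklisted)
```

The third sample line has lost the brackets around its volume number, so this sketch cannot recognise it and the volume would not be blacklisted.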

When you press the "RUN" button, you should start seeing output on the DOS window...primarily in the form of CV URLs. Output on the DOS window should look something like this:
Processing volume: Hard Boiled
     http://www.comicvine.com/hard-boiled-hard-boiled/37-121911/
     http://www.comicvine.com/hard-boiled-/37-121934/
     http://www.comicvine.com/hard-boiled-issue-3/37-150994/
Processing volume: Icaro
     http://www.comicvine.com/icaro-icaro/37-257110/
     http://www.comicvine.com/icaro-icaro/37-257812/
Processing volume: Adastra in Africa
     http://www.comicvine.com/adastra-in-africa-original-graphic-novel/37-142371/
NOT Scraped by ComicVine: E:\Books\Comic Books\COMIX a History of Comic Books in America by Les Daniels c2c.cbz
NOT Scraped by ComicVine: E:\Books\Comic Books\European (Misc)\Dusk - Poor Tom\Dusk_-_Poor_Tom.cbz
Processing volume: Echoes
     http://www.comicvine.com/echoes-/37-252849/
Processing volume: Empowered
     http://www.comicvine.com/empowered-/37-116039/

SAVING XML. 48 records processed

Processing volume: Sin City: Hell and Back
     http://www.comicvine.com/sin-city-hell-and-back-/37-48371/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48372/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48373/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48374/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48375/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48376/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48377/
     http://www.comicvine.com/sin-city-hell-and-back-/37-106675/
     http://www.comicvine.com/sin-city-hell-and-back-/37-48378/

On the first run, it will hit the CV API for every issue and every volume. My collection is about 18,000 books and takes about 1.5 hours to run the first time through, if CV doesn't go down. It saves the cache every 25 issues processed, so if ComicVine goes down you can rerun with the default settings; it will only check volumes and pick up issues where it left off.
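
The save-every-25-issues behaviour can be sketched in Python as a checkpointing loop. This is an illustration of the idea rather than the application's code; `scrape` and `save` are hypothetical callbacks standing in for the ComicVine query and the cache write.

```python
def process_with_checkpoints(issues, scrape, save, done=None, interval=25):
    """Scrape issues not yet cached, saving the cache every `interval` items."""
    done = dict(done or {})
    since_save = 0
    for issue_id in issues:
        if issue_id in done:
            continue  # already in the cache from an earlier run
        done[issue_id] = scrape(issue_id)
        since_save += 1
        if since_save == interval:
            save(done)
            since_save = 0
    if since_save:
        save(done)  # final write for the last partial batch
    return done


# Usage with stand-in callbacks: 60 issues produce saves at 25, 50 and 60.
save_sizes = []
cache = process_with_checkpoints(
    range(60),
    scrape=lambda issue_id: "details for %d" % issue_id,
    save=lambda snapshot: save_sizes.append(len(snapshot)),
)
print(save_sizes)
```

If a run dies after the second save, passing the saved cache back in as `done` means only the remaining issues are scraped again.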

Now, if you've only added a few issues or added entries to the Blacklist and don't want to run a full compare, uncheck both "Update All Issues" and "Update All Volumes". This will still add new items to the cache and scrape their details from CV, but will essentially skip everything else. It all still has to be loaded into memory, so you'll see it fly by in the DOS window, but it should only take a few minutes.

By the time it finishes, it generates 5 files in the same folder as the JAR:
  • missingIssues.txt - This is the primary output generated by the application.
  • blacklist.txt - This is the blacklist.
  • filelessentries.txt - This is the fileless entry file perezmu said he'd build a script for, for those who wish to create fileless entries.
  • localCache.xml - This is the cache built from your local ComicDb.xml file.
  • remoteCache.xml - This is the cache built with details scraped from CV.

If it stops for a long period of time (5-10 minutes), and this WILL periodically happen as this software is NOT fault tolerant, it means there was an issue with ComicVine. It will have to be killed and restarted (use the default settings). As I said, it saves the cache every 25 issues, so it's not all bad. After you press the "RUN" button, the application will appear frozen; bring the DOS window forward to monitor activity.

In my experience with this, if it returns an entry for something you know you have, it usually means it was scraped incorrectly with the ComicVine Scraper. Wrong volume, wrong issue...things like that.
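
The comparison at the heart of the report can be pictured as a per-volume set difference between what ComicVine lists and what the library holds. The Python sketch below is only an illustration; the volume IDs and issue lists are made-up sample data, and the real application works from its two cache files.

```python
def issue_sort_key(number):
    """Sort numeric issue numbers by value, non-numeric ones after them."""
    try:
        return (0, float(number), "")
    except ValueError:
        return (1, 0.0, number)


def find_missing(remote, local):
    """Both arguments map a volume ID to a collection of issue numbers."""
    missing = {}
    for volume_id, issues in remote.items():
        gaps = set(issues) - set(local.get(volume_id, ()))
        if gaps:
            missing[volume_id] = sorted(gaps, key=issue_sort_key)
    return missing


# Made-up sample data: volume 111 lacks issues 2 and 10 locally.
remote = {111: ["1", "2", "3", "10"], 222: ["1"]}
local = {111: ["1", "3"], 222: ["1"]}
print(find_missing(remote, local))
```

A book scraped against the wrong volume or issue never lands in the right `local` set, which is why it shows up as missing even though you own it.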

Lastly, I currently don't handle deleted issues. If you remove something from your library, you'll have to rebuild the local cache to get rid of it. I may look at doing this in the future, but currently it's just not that important to me. To do this would add a lot of overhead.

NOW, occasionally you WILL see things in the output you know you have already. There are a number of reasons for this.
  1. Multiple entries for the same book
  2. Volumes built incorrectly, e.g. issues and TPBs in the same volume.
  3. The issue(s) was scraped incorrectly
  4. The URL for the issues has changed due to someone doing maintenance on ComicVine
  5. Someone entered new data on ComicVine
I once had over 50 issues of Heavy Metal reappear because someone did some refactoring on ComicVine.

What's missing? Well, currently you cannot blacklist a single issue, only entire volumes. And you cannot clear strays out of the cache without rebuilding. Given the length of time it takes to rebuild, the ability to remove strays is my next highest priority.



INFORMATION RELATED TO THE 23-OCT-2011 VERSION

These are somewhat larger than the originals due to the libraries required for the new XML parser. I've put them on SendSpace because they exceed the maximum size allowed on the board.

Most of the changes are under the hood to optimize performance. The sorting is a new feature. I provide options for
  • Position - This is the order in which volumes have been added to the library. The "Sort Order" does not affect this type of sort. It will always sort ascending for this option.
  • Series Title - This is alphabetic by title. Can be ordered either ascending or descending. I ignore "The", but not "A" or "An".
  • Date - This is, in theory, the most recent date an issue was added to a volume on ComicVine. This is NOT the date of the newest issue: if #1 (of 4) was the last issue entered, that date will be used. Again, it can be ordered either ascending or descending. NOTE: An extra attribute must be created in the cache files, so YOU WILL HAVE TO REBUILD YOUR CACHE TO SORT BY DATE!
The only other obvious changes are
  • It will now identify issues which have an ID that is no longer valid on ComicVine. These will be displayed at the top of the final report, if any are found.
  • The progress window now shows issue, number, volume rather than the ComicVine URL.
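
As an illustration of the Series Title sort described above (ignore a leading "The", but not "A" or "An"), here is a short Python sketch. The sample titles are just examples, and the case-insensitive comparison is an assumption.

```python
def title_sort_key(title):
    """Drop a leading 'The ' for sorting; 'A' and 'An' are kept."""
    lowered = title.strip().lower()
    if lowered.startswith("the "):
        lowered = lowered[4:]
    return lowered


def sort_titles(titles, descending=False):
    return sorted(titles, key=title_sort_key, reverse=descending)


titles = ["The Sandman", "Green Arrow", "A Distant Soil", "Adventure Comics"]
print(sort_titles(titles))
print(sort_titles(titles, descending=True))
```

"The Sandman" sorts under S, while "A Distant Soil" stays under A.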

31-MAR-2013 Changes (Version 1.3)
  • Fix bug related to new ComicVine API.
  • Added functionality to report the highest issue number on ComicVine.

03-MAY-2013 Changes (Version 1.4)
  • Removed the "AAAAA...", "BBBBB..." etc debug comments
  • Changed the software to use internal numbering for sorting. Non-numeric issue numbers will now display in the missing list. Unicode characters are still converted to notepad-friendly representations.
  • Changed the save interval from 25 to 100 to reduce disk writes.
  • Added a time-stamp to the missingIssues.txt file so that previous files do not get overwritten. This allows quick compares to find differences between runs, which could save a great deal of time with large collections.
  • Added a section at the bottom of missingIssues.txt to identify issues with non-numeric characters. If you have a remote issue but no matching local issue, you potentially need to rescrape that issue... or you're missing it.
  • Added support for several more non-standard issue number types
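
The time-stamped output makes it easy to compare two runs. The Python sketch below shows one way to do that; the file name pattern is a guess for illustration (the actual time-stamp format used by the application is not stated here), and the comparison is a plain set difference on report lines.

```python
from datetime import datetime


def timestamped_name(now, stem="missingIssues", ext=".txt"):
    """Build a file name such as missingIssues_20130503_120000.txt (assumed pattern)."""
    return "%s_%s%s" % (stem, now.strftime("%Y%m%d_%H%M%S"), ext)


def compare_runs(old_lines, new_lines):
    """Return (newly_missing, no_longer_missing) between two reports."""
    old, new = set(old_lines), set(new_lines)
    return sorted(new - old), sorted(old - new)


print(timestamped_name(datetime(2013, 5, 3, 12, 0, 0)))
added, resolved = compare_runs(
    ["Echoes #2", "Empowered #3"],
    ["Empowered #3", "Icaro #2"],
)
print(added, resolved)
```

Lines that appear only in the newer report are new gaps; lines that appear only in the older one have been filled or blacklisted since.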

16-JUN-2014 Changes (Version 1.5)
  • This version is essentially untested as I do not have a library that is in a state that will facilitate testing. Please post any issues encountered here.
  • It will only look at books that have a custom field containing the volume ID. If you have old collections that have not been scraped since custom fields were added (Some time in the last year I think) you will need to rescrape all these books. I'd suggest creating a smartlist to find books that do not have custom fields defined, but do have tags containing "CVDB". These are what will have to be rescraped. I'd suggest starting this work as soon as possible.
  • It will no longer make calls to ComicVine for every book. Unless the "Update All Volumes" or the Rebuild box is checked (Essentially, now the same and I will likely remove one of them), the only time ComicVine will be queried will be the first time a new volume is encountered. As such, there will no longer be a box for updating all issues.
  • Because it will no longer be updating issue information from ComicVine directly it will no longer be able to detect invalid issue IDs. This is a sad loss, but necessary.
  • In place of the update all issues box, there will be a throttle where you can define a time delay, in seconds, between queries. This will hopefully allow people to run things without running into speed limit issues or having to baby-sit it too much. Personally, I would rather it run slow than have to baby-sit it.
  • Like the Scraper, the new version will require everyone to have their own API key. The one you got or will get for the scraper will work just fine. They do not need to be different, but I'd wait a while after scraping before running this just to ensure you're starting with a fresh speed limit.
  • The delay and API key fields have save buttons so that you only have to enter the values once.
  • Sorting by date may be broken, if it ever worked, but I'm waiting to see if other options open up that will make this more useful.
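
The throttle idea from the list above, a fixed delay in seconds between queries, can be sketched in Python like this. It is an illustration only; the class name and the injectable `clock` and `sleep` parameters are my own choices so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Ensure at least `delay_seconds` pass between consecutive queries."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous query, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(2.0)
    for volume in ["Hard Boiled", "Icaro", "Echoes"]:
        throttle.wait()  # call this right before each ComicVine query
        print("Querying volume:", volume)
```

A longer delay makes the run slower but less likely to hit the speed limit.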

21-JUN-2014 Changes
  • A couple bug fixes
  • I've also removed "Update All Volumes" and re-enabled the "Rebuild Local Cache". Both require the same number of queries to ComicVine and the rebuild option is less prone to cause bogus data.

25-JUN-2014 Changes
  • Fix a parsing bug caused by a bug in the StAX XML library.

06-DEC-2014 Changes
  • Alright, I have the "&" issue fixed, I think. Volumes with "&" in their names will now display the full title, so "Sam & Twitch" will display as "Sam and Twitch" rather than "& Twitch". The issue related to the unicode characters 1/2, 1/4, 3/4 and infinity, as well as anything else that was non-numeric, seemed to be nothing more than extra messages that did have an obscure use when I was grabbing my own issue information. Now, not so much, so I have removed them. The conversions and missing issue search seem to actually be working fine.
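
As a rough picture of the conversions mentioned above, here is a Python sketch that swaps "&" and a few unicode characters for notepad-friendly text. The exact replacement strings are assumptions for illustration; the source only says that "&" ends up as "and" and that the fraction and infinity characters are converted.

```python
# Assumed replacement table; only the "&" -> "and" mapping is confirmed above.
REPLACEMENTS = {
    "&": "and",
    "\u00bd": "1/2",       # vulgar fraction one half
    "\u00bc": "1/4",       # vulgar fraction one quarter
    "\u00be": "3/4",       # vulgar fraction three quarters
    "\u221e": "infinity",  # infinity sign
}


def notepad_friendly(text):
    """Replace characters that plain-text editors may not display well."""
    for char, plain in REPLACEMENTS.items():
        text = text.replace(char, plain)
    return text


print(notepad_friendly("Sam & Twitch"))
print(notepad_friendly("\u00bd"))
```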

Download Link: www.sendspace.com/file/eoko75
Last Edit: 2 years 11 months ago by Samael69.

Re: Missing Issues using ComicVine 6 years 9 months ago #12350

  • perezmu
Hi,

It is a pity you cannot do that integrated in ComicRack, because I think the best way to go would be to create fully scraped fileless entries in the database...

But in any case, please, keep your effort and share it!

Cheers!:woohoo:

Re: Missing Issues using ComicVine 6 years 9 months ago #12351

  • Wedge
I took cbanack at his word when he said "An ambitious script writer could try to steal some of the code from Comic Vine Scraper"

I just copied the whole scraper folder to get it going as a proof of concept. I'll attach my version of the scrapeengine file. My Python programming experience is literally two days, so I'm sure there are better ways of going about it, but it may be of use to someone. I marked my changes with comments.
The fileless creation part is only slightly different from the one in your Story Arc script, perezmu.
Attachments:
Last Edit: 6 years 9 months ago by Wedge.

Re: Missing Issues using ComicVine 6 years 9 months ago #12352

  • perezmu
Wedge, so, this creates fileless ecomics for missing issues in the same volume as that of the comics being scraped?

Re: Missing Issues using ComicVine 6 years 9 months ago #12353

  • perezmu
perezmu wrote:
Wedge, so, this creates fileless ecomics for missing issues in the same volume as that of the comics being scraped?

Yep, I've seen it does... great! Thanks!

Re: Missing Issues using ComicVine 6 years 9 months ago #12354

  • Wedge
For every missing issue, it will create a fileless comic that has nothing but a ComicVine ID in the tag, but then immediately scrapes it, so yes, assuming the scrape succeeds.
I didn't see much point filling in any of the other fields since they were going to be overwritten anyway.

Obviously it breaks the normal function of the ComicVine Scraper, so I just duplicated the folder and renamed it.
I don't know enough about Python or the scraper to make a proper job of it.

Also, like Samael said in the other thread, it would probably be a good idea to store ComicVine's volume ID so you could query the list of issues directly the next time.
Last Edit: 6 years 9 months ago by Wedge.

Re: Missing Issues using ComicVine 6 years 9 months ago #12355

  • perezmu
Nice patch in the meantime, though! Thanks!

Re: Missing Issues using ComicVine 6 years 9 months ago #12371

  • Samael69
perezmu wrote:
Hi,

It is a pity you cannot do that integrated in ComicRack, because I think the best way to go would be to create fully scraped fileless entries in the database...

But in any case, please, keep your effort and share it!

Cheers!:woohoo:

I could generate fileless entries; I have no issues inserting entries directly into the XML library. I built my cache files in XML, so I have most of the XML processing logic already built and would only have to make it a little more generic. The only issue I have is the book GUID that ComicRack assigns. Are they generated by some prescribed formula, or are they just randomly generated, say by date, so they are unique? The entries wouldn't be fully scraped, but they would be there and should then be able to be scraped with the ComicVineScraper.
Last Edit: 6 years 9 months ago by Samael69.

Re: Missing Issues using ComicVine 6 years 9 months ago #12372

  • perezmu
Samael69 wrote:
I could generate fileless entries; I have no issues inserting entries directly into the XML library. I built my cache files in XML, so I have most of the XML processing logic already built and would only have to make it a little more generic. The only issue I have is the book GUID that ComicRack assigns. Are they generated by some prescribed formula, or are they just randomly generated, say by date, so they are unique? The entries wouldn't be fully scraped, but they would be there and should then be able to be scraped with the ComicVineScraper.

Ooooops, I don't really know about the GUID...! Another approach would be to create a text file, simple, like "CVDBID Series Issue Volume" for each missing issue, and then that could be easily read from within ComicRack with a very simple python script (if you choose to go that way, I could do that for you).

Re: Missing Issues using ComicVine 6 years 9 months ago #12373

  • perezmu
perezmu wrote:
Ooooops, I don't really know about the GUID...! Another approach would be to create a text file, simple, like "CVDBID Series Issue Volume" for each missing issue, and then that could be easily read from within ComicRack with a very simple python script (if you choose to go that way, I could do that for you).

I mean, the Python script could simply read that info and generate the fileless comics (no need for a GUID, just a call to the appropriate function CR exposes), and in a later step they could be scraped using CVS.
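
A minimal sketch of the reading side of that idea, in Python. It assumes one missing issue per line with the four fields "CVDBID Series Issue Volume" separated by tabs; the separator, the record type and the sample values are all assumptions, and the call that would actually create the fileless comic inside ComicRack is left out.

```python
from collections import namedtuple

MissingIssue = namedtuple("MissingIssue", "cvdb_id series issue volume")


def parse_fileless_entries(lines):
    """Parse tab-separated 'CVDBID Series Issue Volume' lines (assumed format)."""
    entries = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip malformed lines rather than guessing
        entries.append(MissingIssue(*[part.strip() for part in parts]))
    return entries


# Made-up sample lines for illustration.
sample_lines = ["12345\tSample Series\t2\t1990\n", "\n", "not a valid line\n"]
entries = parse_fileless_entries(sample_lines)
for entry in entries:
    print(entry.cvdb_id, entry.series, entry.issue, entry.volume)
```

Each parsed record carries enough to create one fileless entry, which could then be scraped in a later step.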