Section to post tutorials on how to manage and read your eComics

TOPIC: How to scrape your collection quickly and efficiently

How to scrape your collection quickly and efficiently 3 years 8 months ago #38884

  • WraithTDK
  • Junior Boarder
  • Posts: 36
  • Thank you received: 1
  • Karma: 0
With the recent problems with the scraper due to heavy use, I thought it would be appropriate to compile a guide on how to scrape more efficiently. This is particularly important for people who have just found CR and want to scrape a collection they've been building for years. I had over 100,000 books when I started, and I would just scrape the whole thing overnight, do as much as I could the next day, and often start the whole process over again whenever I had to reboot my computer or hand it over to my wife.

Here, then, are my tips for how to scrape your collection as quickly as possible, while minimizing the need to re-scrape. Please respond with your own tips and together, we'll see if we can't bring that API usage down.

1.) This line:



Is your best friend. Build smart lists around it, and it will show you only the files that have not been scraped, so you're not wasting your time with re-scrapes. When I started, I would scrape thousands and thousands of books, manually identify the ones it wouldn't recognize, stop for the night, come back, and rescrape EVERYTHING. Using this line saves me a ton of time and prevents a flood of server hits. If you want to scrape everything that hasn't been scraped in a particular directory, add this line:



2.) Get your files named consistently. Once everything is scraped, you can use the organizer script to rename your files according to metadata, which is awesome. But until then, you can save yourself a lot of time by making sure that your files have fairly "clean" names before you scrape them. Easiest way to do that? Ant Renamer. There are similar products, but this one is free and works well. Get to know it.

For example, I downloaded tons of books that had numbers at the start of the file name. Some were story-arc bundles with the files numbered chronologically, some were just packs where the files had year/month before them. What I did was gather all the ones with numbers before them, sort them by name, and then use the "character deletion" option to remove all the numbers. The scraper went from recognizing NONE of the files (because it was trying to reconcile the numbers as part of the file name) to recognizing the vast majority of them.

The ideal naming scheme seems to be "[series name] [issue #]" . Again, once you have everything scraped, you can use the organizer plugin to rename all of them based on metadata, using whatever scheme you want.
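If you'd rather script the cleanup than do it in a GUI renamer, here is a minimal Python sketch of the same idea. It is not what Ant Renamer does internally; the function names and the pattern are my own assumptions for illustration. It strips a leading run of digits and separators (like "01 - " or "2013-05 ") and only builds a list of proposed renames, so nothing is touched until you've looked it over.

```python
import os
import re
from pathlib import Path

# Assumed pattern: a prefix that starts with a digit and continues with
# digits, whitespace, dots, underscores, or hyphens (e.g. "01 - ", "2013-05 ").
LEADING_NUMBERS = re.compile(r"^\d[\d\s._-]*")


def strip_leading_numbers(name):
    """Return the file name with any leading numeric prefix removed."""
    stem, ext = os.path.splitext(name)
    cleaned = LEADING_NUMBERS.sub("", stem)
    # If the whole stem was numeric (e.g. "300.cbz"), keep it unchanged.
    return (cleaned or stem) + ext


def plan_renames(folder):
    """Return (old_path, new_path) pairs without renaming anything."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        new_name = strip_leading_numbers(path.name)
        target = path.with_name(new_name)
        # Skip files that need no change or would collide with an existing file.
        if new_name != path.name and not target.exists():
            plan.append((path, target))
    return plan
```

Review the output of plan_renames() first, then call old.rename(new) on each pair. Watch out for series whose real title starts with a number (100 Bullets, for instance); this sketch would strip that too, so keep those in a separate folder.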

3.) If you have a huge amount of a particular series (for example, let's say you downloaded a mega-pack containing all 900+ issues of Action Comics), try to gather them all into one folder. Then, go to comicvine.com, find that series (you want the "volume" page, not a page for an individual issue), and copy and paste the URL into a text file. Save the text file in the same folder as the comic series, and name it cvinfo.txt. This tells the scraper that every comic in that folder belongs to that particular series, which can speed up the process considerably.
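If you have a lot of series folders to set up, writing the file by hand gets tedious. Here's a small Python sketch that drops a cvinfo.txt into a folder. The function name is made up for this example, and the URL in the usage note is a placeholder, not a real volume page; paste in the one you copied from your browser.

```python
from pathlib import Path


def write_cvinfo(folder, volume_url):
    """Write the given volume page URL to cvinfo.txt inside folder."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a folder")
    target = folder / "cvinfo.txt"
    # Store only the URL itself, with surrounding whitespace trimmed.
    target.write_text(volume_url.strip(), encoding="utf-8")
    return target
```

Usage would look something like write_cvinfo(r"D:\Comics\Action Comics", "<volume page URL you copied>"), where both the path and the URL are yours to fill in.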

4.) If at all possible, don't scrape on Wednesday. This won't actually speed up your process, but CV has said that's when they see the biggest usage spike (edit: pweasel points out that this is due to Wednesday being when the scanning scene usually releases 0-day packs), so if you could avoid that day, it would benefit the community as a whole.
I am currently reading every Marvel Superhero comic book ever printed, in chronological order, and blogging about the milestones, footnotes, and other interesting moments I read at http://www.wraithscomicjourney. I'll be adding DC when I hit 1985, and other companies when they launch.
Last Edit: 3 years 8 months ago by WraithTDK.
The administrator has disabled public write access.

How to scrape your collection quickly and efficiently 3 years 8 months ago #38886

  • RevQuixo
  • Gold Boarder
  • Posts: 280
  • Thank you received: 26
  • Karma: 12

How to scrape your collection quickly and efficiently 3 years 8 months ago #38887

  • WraithTDK
  • Junior Boarder
  • Posts: 36
  • Thank you received: 1
  • Karma: 0
RevQuixo wrote:
Useful if you don't want to install another program, but I still recommend Ant Renamer. It's been helpful for other things, as well. For example, I had a lot of comics that ended with "02 of 02 covers" because they had more than one cover. Without fail, this would result in me having to manually verify them. I ran a search for "covers" in my comic directory, and told Ant to delete 15 characters going from right to left. Problem solved.
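The same trick can be sketched in Python for anyone who prefers a script. Instead of counting 15 characters from the right, this version matches the "NN of NN covers" tail with a pattern, so it also copes with suffixes of a different length. The function name and the pattern are assumptions for illustration, not something Ant Renamer provides.

```python
import os
import re

# Assumed suffix shape: "<number> of <number> covers" at the end of the name.
COVER_SUFFIX = re.compile(r"\s*\d+\s+of\s+\d+\s+covers\s*$", re.IGNORECASE)


def strip_cover_suffix(name):
    """Remove a trailing 'NN of NN covers' tag, keeping the file extension."""
    stem, ext = os.path.splitext(name)
    return COVER_SUFFIX.sub("", stem) + ext
```

Names without the tag pass through unchanged, so it is safe to run over a whole folder listing before deciding what to rename.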

How to scrape your collection quickly and efficiently 3 years 8 months ago #38904

  • pweasel
  • Expert Boarder
  • Posts: 124
  • Thank you received: 18
  • Karma: 8
Wednesday is the scanner scene's weekly dropoff date, so that's the spike :P
CRW 0.9.178 x64 on Win10
CRA 1.80 on Nexus 10

How to scrape your collection quickly and efficiently 3 years 7 months ago #39133

  • Symmetry
  • Senior Boarder
  • Posts: 43
  • Thank you received: 3
  • Karma: 2
Good guide! I don't have much use for it myself, but people with less curated collections will. One suggestion I have is to take a look at Bulk Rename Utility rather than Ant Renamer; I've tried both, and BRU is the one I've stuck with for years (well, in Windows...in Linux, mv is your best friend :)). The reason is that, if I remember correctly, it has more powerful multi-folder renaming tools, it has a shell extension, and I prefer its interface.

How to scrape your collection quickly and efficiently 2 years 11 months ago #41367

  • Corwin
  • Fresh Boarder
  • Posts: 13
  • Thank you received: 1
  • Karma: 0
Is there a certain script used to add the Format when scraping files?
And does anything on my Comic Vine profile get reflected in Comic Rack?

For example, let's say I'm following All New X-Men (volume) on Comic Vine. Is there a script to automatically put it on my pull list? Just curious.

Thanks!

How to scrape your collection quickly and efficiently 1 week 4 days ago #48692

  • Arsen01
  • Fresh Boarder
  • Posts: 3
  • Karma: 0
Oh, hello guys. I am Arsen. I am a new member here. I have no idea about that. )))