Python Scripts for ComicRack

TOPIC: Comic Vine Scraper further Development

Comic Vine Scraper further Development 2 years 1 month ago #43589

  • boshuda
  • boshuda's Avatar
  • Offline
  • Gold Boarder
  • Posts: 295
  • Thank you received: 64
  • Karma: 8
cbanack wrote:
boshuda wrote:
I've got the code branched (cloned, whatever) and have made some version of the changes. I want to make it a more complete removal, clean it up, and test it out before pushing my changes into a proper release. Plus I'd like cbanack's blessing before making a proper release. Quick and dirty is to simply comment out the offending code in cvdb.py and hope nobody clicks the 'more covers' link. However, since there's a workaround the urgency seems lessened. I would like to more fully remove the hidden option and the more covers link, as well as change the signature of the function and everything that calls it all the way out to the option.

It looks like the ability to get those two currently problematic pieces of data (community ratings and alternative covers) might be coming to the official Comic Vine API soon. So rather than completely removing all references to that data and the features that use it, you might want to just remove the bad http request from the code and make the data always be blank/empty. I.e. make it so clicking on 'Search for More Covers' just doesn't find anything, and scraping the community rating just never finds any value. The rest of the CVS code should be able to handle not finding any data in these two cases, since that already happens with some comics.

And in the future if those two pieces of data become part of the proper Comic Vine API (or the new API server that's being built), it would be a simple matter to start obtaining those details again during scrapes, and then all the features would just start working again.

Since you're forking the scraper, I would also suggest taking some time to go through the code and rebrand it; give it a new name and author (maybe 'Comic Vine Scraper Remake', or 'Boshuda's Scraper', or whatever.) That will avoid a lot of confusion and ensure people know which fork they are using. If you search for uses of the strings 'Comic Vine Scraper' and 'Cory Banack', that should find you most of what you'd want to change.

cbanack, what do you do to test CVS? You have the Run Unit Tests in there, as well as the Run Comic Vine Scraper which scrapes the simulated books. But do you have some [in]formal testing procedure you follow before releasing? Maybe a checklist of things to look for (version incrementing, readme updating, anything like that)?

  • I make sure the version number is changed (I never re-release with the same version number, as that just confuses people.)
  • I run the unit tests. The 'Run Comic Vine Scraper' launch target is mostly used during actual development, to iteratively test new features.
  • When I'm ready to do real testing (i.e. just before I release a new version) I usually try to grab a pile of recent releases and scrape them, utilizing as many different variations of settings and features as possible, and obviously focusing more heavily on any areas that I've changed.

All good suggestions, and many of them things I was considering. I've renamed it comic-rack-scraper on GitHub. Now the arduous task of renaming it in the code :).

I think what I'll investigate doing is:
1. Do nothing with slow_data in _query_issue(), which should be completely invisible to the user and essentially nullifies the 'Try To Choose Series Automatically' option
2. Change the link for more covers to call up the user's default web browser and take the user to the issue in question
2.a Disable/remove the "more covers" portion of the text completely (given how the state machine appears to handle this, it is probably easier to make the panel act as though it was clicked).
3. Ignore the hidden SCRAPE_RATING setting so that no matter its state it doesn't attempt to get that information
The administrator has disabled public write access.
The following user(s) said Thank You: romsnesrom

Comic Vine Scraper further Development 2 years 1 month ago #43590

  • krandor
  • krandor's Avatar
  • Offline
  • Gold Boarder
  • Posts: 204
  • Thank you received: 21
  • Karma: 4
boshuda wrote:
[...]
I think what I'll investigate doing is:
1. Do nothing with slow_data in _query_issue(), which should be completely invisible to the user and essentially nullifies the 'Try To Choose Series Automatically' option
2. Change the link for more covers to call up the user's default web browser and take the user to the issue in question
2.a Disable/remove the "more covers" portion of the text completely (given how the state machine appears to handle this, it is probably easier to make the panel act as though it was clicked).
3. Ignore the hidden SCRAPE_RATING setting so that no matter its state it doesn't attempt to get that information

How hard would it be to have 'Try To Choose Series Automatically' still run, but only use the single image provided by the API? Yes, that would make it much less effective, but less effective is better than not having it at all.
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43591

  • hyperspacerebel
  • hyperspacerebel's Avatar
  • Offline
  • Junior Boarder
  • Posts: 31
  • Thank you received: 9
  • Karma: 1
krandor wrote:
How hard would it be to have 'Try To Choose Series Automatically' still run, but only use the single image provided by the API? Yes, that would make it much less effective, but less effective is better than not having it at all.

Should be really easy. Skipping the "slow" http request and not adding additional image URLs to the issue object should let the auto-matcher continue to run as normal. It would just have fewer covers to work with, so it wouldn't get an automatic match as often, but it still would sometimes.

In theory, I believe that could be achieved by simply removing the if block at line 462 of cvdb.py (or making its condition false).
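The shape of that change can be sketched like this; the function names and data layout are illustrative, not the real code at line 462 of cvdb.py:

```python
def gather_image_urls(issue, fetch_extra_covers=False):
    """Collect cover image URLs for the auto-matcher to score against.

    With fetch_extra_covers left False (the condition made false, as
    described above), only the single image the official API returns
    is used -- the auto-matcher still runs, just with fewer candidates.
    """
    urls = [issue["image_url"]] if issue.get("image_url") else []
    if fetch_extra_covers:  # the if block to remove / make false
        urls.extend(_fetch_alternate_covers(issue))
    return urls

def _fetch_alternate_covers(issue):
    # Stand-in for the unofficial, now-problematic http request.
    raise NotImplementedError("unofficial endpoint; disabled")

# The matcher still gets the one API-provided cover:
issue = {"id": 42, "image_url": "https://example.com/cover.jpg"}
urls = gather_image_urls(issue)
```

Because the disabled branch is never taken, the removed request is never attempted and the rest of the pipeline sees an ordinary, shorter list of covers.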
Last Edit: 2 years 1 month ago by hyperspacerebel.
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43594

  • 2skoops
  • 2skoops's Avatar
  • Offline
  • Fresh Boarder
  • Posts: 13
  • Thank you received: 4
  • Karma: 0
I just saw this post over at the CV forums, and it looks like API responses will be limited to 100 returned items.

Does this affect the current scraper at all? Does it ever request more than 100 at a time? It's unclear from the post whether requesting more than 100 would throw an error, or just limit the response to 100. Either way, seems like this should be tested to make sure it doesn't break things with CVS.
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43598

  • Paradoxic
  • Paradoxic's Avatar
  • Offline
  • Fresh Boarder
  • Posts: 2
  • Karma: 0
Isn't it currently something like 200 items every 15 minutes? So I'd assume it would become 100 items every 15 minutes; it would just increase the time it takes to synchronize your comic collection.

But I don't know. For whatever reason the devs at Comic Vine aren't specific about anything, and you have to drag every little detail out of them. 100 items now, but still within a 15-minute timeframe, or over 24 hours? It annoys me a lot that they don't post specific information.
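Whatever the exact quota turns out to be, a client can stay under it with a simple sliding-window throttle. A sketch, with the quota numbers as placeholders since Comic Vine hasn't published exact figures:

```python
import time
from collections import deque

class Throttle:
    """Allow at most max_calls within any window_secs-long window
    (e.g. the rumoured 200 calls per 15 minutes)."""

    def __init__(self, max_calls=200, window_secs=15 * 60):
        self.max_calls = max_calls
        self.window_secs = window_secs
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        """Block until another call is permitted, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_secs:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call falls out of the window.
            time.sleep(self.window_secs - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())
```

Calling `throttle.wait()` before each API request keeps the client within the limit regardless of which timeframe Comic Vine ends up enforcing.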
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43601

  • 2skoops
  • 2skoops's Avatar
  • Offline
  • Fresh Boarder
  • Posts: 13
  • Thank you received: 4
  • Karma: 0
Paradoxic, I don't think they're talking about the number of API calls you can make in a given timeframe. They're talking about the number of results that can be returned with a single API call.

The confusing part for me is that I thought when you did a search for a series with a lot of matches (e.g. "Batman"), it already only returned 100 results, which is why CVS had to make multiple search requests for some series (leading to increased API usage). Cory talked about that in the "Am I blocked?" thread at the CV forums. So, if there already was a limit of 100 results, what is this new post talking about? Were there other API requests that could return more than 100? And if so, will those sections of CVS work once CV adds these new restrictions?
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43602

  • cbanack
  • cbanack's Avatar
  • Offline
  • Platinum Boarder
  • Posts: 1327
  • Thank you received: 508
  • Karma: 182
2skoops wrote:
Paradoxic, I don't think they're talking about the number of API calls you can make in a given timeframe. They're talking about the number of results that can be returned with a single API call.

The confusing part for me is that I thought when you did a search for a series with a lot of matches (e.g. "Batman"), it already only returned 100 results, which is why CVS had to make multiple search requests for some series (leading to increased API usage). Cory talked about that in the "Am I blocked?" thread at the CV forums. So, if there already was a limit of 100 results, what is this new post talking about? Were there other API requests that could return more than 100? And if so, will those sections of CVS work once CV adds these new restrictions?

Yup, that's how I understand it too. Not sure what the Comicvine guy is talking about--their API has returned a maximum of 100 results per call for years now. In fact I've been telling them that they could speed things up and reduce the load on their server if they just returned all the results in one call instead of making people do multiple API calls just to get the complete result set.

In any event, the scraper already knows how to handle this. So even if they do change the number of results per call, it shouldn't cause any problems.
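The paging behaviour being described can be sketched as follows. The HTTP request is abstracted into a `fetch_page` callable so the paging logic is visible; the field name `number_of_total_results` matches the Comic Vine API's JSON responses, but everything else here is illustrative rather than the actual CVS code:

```python
def search_all(fetch_page, page_size=100):
    """Collect every result by paging in blocks of page_size,
    the way CVS already pages through large search result sets.

    fetch_page(limit, offset) must return a dict with keys
    'results' (a list) and 'number_of_total_results' (an int).
    """
    results, offset = [], 0
    while True:
        page = fetch_page(page_size, offset)  # one API call per page
        results.extend(page["results"])
        offset += page_size
        if offset >= page["number_of_total_results"]:
            return results

# Demo against a fake in-memory "API" of 250 items (three paged calls):
items = list(range(250))

def fake_fetch(limit, offset):
    return {"results": items[offset:offset + limit],
            "number_of_total_results": len(items)}

all_results = search_all(fake_fetch)
```

Since the loop already adapts to whatever total the server reports, lowering (or raising) the per-call cap changes only the number of calls made, not the final result set.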
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43603

  • Paradoxic
  • Paradoxic's Avatar
  • Offline
  • Fresh Boarder
  • Posts: 2
  • Karma: 0
@2skoops

Oh, ok, I am sorry then. It was just a guess from me :)
The administrator has disabled public write access.

Comic Vine Scraper further Development 2 years 1 month ago #43605

  • Jothay
  • Jothay's Avatar
  • Offline
  • Senior Boarder
  • Posts: 47
  • Thank you received: 24
  • Karma: 6
I've posted the first commit to GitHub here:
github.com/Jothay/comic-vinescraper-api

I'm working the BusinessWorkflows project today and the Services project tomorrow.
The administrator has disabled public write access.
The following user(s) said Thank You: dockens, krandor

Comic Vine Scraper further Development 2 years 1 month ago #43620

  • iohanr
  • iohanr's Avatar
  • Offline
  • Junior Boarder
  • Posts: 23
  • Thank you received: 17
  • Karma: 8
Jothay wrote:
I've posted the first commit to GitHub here:
github.com/Jothay/comic-vinescraper-api

I'm working the BusinessWorkflows project today and the Services project tomorrow.

Hey Jothay, just fyi - I ordered the server hardware and it should be arriving on Tuesday. I went with a Lenovo ThinkServer TS440 rather than building my own. It seems to be a good value and can be upgraded easily when needed.
The administrator has disabled public write access.
The following user(s) said Thank You: 600WPMPO, rmagere, dockens, romsnesrom, krandor