TOPIC: Comic Vine Scraper

Comic Vine Scraper 2 weeks 1 day ago #49016

  • krandor
  • Gold Boarder
  • Posts: 282
  • Thank you received: 31
  • Karma: 5
cbanack wrote:
As far as changing the scraper to only return part of the search results, this is not something I am interested in doing (though as always, you are free to fork the project on github and make your own changes.)

There really is no guarantee that the ComicVine Search API (as it currently works, or as it will work in the future) will always return the comic series you are looking for within the first 10, 20, 50 or even 100 results. When the search is working properly (with ANDing of terms), then it only makes sense to return ALL of the results that the user has asked for (otherwise I'll inevitably get people complaining that they can't find a comic that they are looking for). And when the search is broken (as it currently is) it is a waste of my time to try to patch over the problem with a semi-correct short term fix that I'll just have to remove when they get their search API working correctly again.

Completely understand. Hopefully they will get it changed back to working like it used to and won't come back and say it is just going to be like this going forward. Just have to wait and see.

If they did say this is the permanent solution, maybe the compromise would be to return the first X results and, like it used to have for covers, a "search more comics" button to then fetch all of them.

Hopefully they will just fix it.

Always appreciate your work on the scraper and your willingness to keep making changes when needed to keep it working. I'll keep passing our concerns on to the devs at GB, and hopefully they will resolve these issues, especially since the ComicVine editors don't like it either.
The administrator has disabled public write access.

Comic Vine Scraper 2 weeks 1 day ago #49017

  • krandor
  • Gold Boarder
  • Posts: 282
  • Thank you received: 31
  • Karma: 5
beardyandy wrote:
Cheers for coming back so quickly.
There's no other part, all in comicrack but I'll track down what's going wrong in a bit more detail - just seemed to have blocked my API key at present. Whoops

Normally those blocks are pretty short-term.

Comic Vine Scraper 2 weeks 1 day ago #49018

  • krandor
  • Gold Boarder
  • Posts: 282
  • Thank you received: 31
  • Karma: 5
I just hope that all the extra API hits they are getting prompt them to change the search algorithm rather than limit the API again. That is my concern right now. I'm scraping some stuff now and hit a stretch where it is pulling 2000-3000 entries per comic.

Comic Vine Scraper 2 weeks 1 day ago #49019

  • oraclexview
  • Moderator
  • aka SoundWave
  • Posts: 912
  • Thank you received: 182
  • Karma: 38
I honestly wouldn’t bother scraping anything at this time other than for testing purposes, and even then it seems to make more sense to do the testing either directly with the API or on the ComicVine site since that’s where the problem lies.

Comic Vine Scraper 2 weeks 1 day ago #49020

  • cbanack
  • Platinum Boarder
  • Posts: 1351
  • Thank you received: 528
  • Karma: 185
Yup, I wouldn't bother scraping either...the search API is currently broken, as far as I'm concerned.

Unless there's some way to force it to use AND based searching...brackets or quotes or the AND keyword. Seems like the kind of thing they might build in.

Comic Vine Scraper 2 weeks 1 day ago #49021

  • krandor
  • Gold Boarder
  • Posts: 282
  • Thank you received: 31
  • Karma: 5
cbanack wrote:
Yup, I wouldn't bother scraping either...the search API is currently broken, as far as I'm concerned.

Unless there's some way to force it to use AND based searching...brackets or quotes or the AND keyword. Seems like the kind of thing they might build in.

It really is. Xelloss's Auto-fill volume information script plus your scraper are the only things that make it even marginally useful right now, but when you hit a stack of stuff with lots of words, it sucks.

If they want OR as the default for the site so they never return 0 results, they need an AND option for the API in some form, and they really need an AND option for the site search too.

This search "upgrade" has been a disaster. I think the big issue is that the GB devs made it, so they didn't think about how comic searches on CV differ from game searches on GB. The two are not the same: with comics you have issues and characters and aliases and all these things that factor in, and while some of those exist in the game world, it is not at all to the same extent. Their new search revamp doesn't appear to have taken any of that into account.

Comic Vine Scraper 2 weeks 12 hours ago #49023

  • Xelloss
  • Platinum Boarder
  • Posts: 574
  • Thank you received: 139
  • Karma: 29
I will try to play with the script a bit later and see if I can make a temporary patch using your idea...

I am not sure I will be able to, as cbanack's script is MUCH more complex than what I am used to doing in Python with my scripts XD (I am not a real programmer, I only play a bit with Python to achieve what I need U_U)
Last Edit: 2 weeks 12 hours ago by Xelloss.

Comic Vine Scraper 2 weeks 9 hours ago #49024

  • Xelloss
  • Platinum Boarder
  • Posts: 574
  • Thank you received: 139
  • Karma: 29
Based on krandor's idea, I made this patch:


File Attachment: cvdb.zip (8 KB)

Just replace the .py file in your Comic Vine folder (you can open this folder by double-clicking the script in the options menu).

Please make a backup of the original file first, and try this "temporary patch" at your own risk.

I tried it for a few comics, and it seems to work OK... If there are more than 100 results, the result bar will show the 1000+ results, but it will stop loading when the real results finish (in theory).

I don't know if it filters things it shouldn't, so test it and tell me what you think :)

(btw, the patch code is very inelegant, and more importantly I don't know how it can affect the rest of the script. I just tinkered with a piece of the code I understood until I got it working. It is only supposed to make the script a bit more useful until a real fix can be implemented.)

Note: the CV script code is so elegant and uses so many things whose purpose I don't know... I learned a lot about Python just by reading it and trying to find the piece of code I was looking for :P. It is incredible how much work and complexity cbanack put into this project; I am really amazed at the level of detail in some of the functions...

Edited: YOU HAVE TO RESTART ComicRack once the file has been replaced for the change to kick in.

Edit2:

About the new search CV implemented: now that I look at it, I think it is on purpose, and it may even be a good idea. The previous search would give you only the exact matches for your search; if you missed a word or misspelled it, you would get no results, or at least not the result you were looking for. The new one will show you ALL the possible results, even if a result is not a perfect match, BUT it sorts them according to how good the match is (Google does something similar). For the script this can be a nightmare, as it will load ALL the possible results... but I think this is designed for searches where you read the results and ask for more ONLY if the results already shown are not what you are looking for. It is NOT meant for loading ALL the results without reading the first ones...

So the patch I implemented does just that: every time a new "pack" is downloaded, it reads it for possible good matches. When a certain level of match is reached (in my case, when a result is not a perfect match, but this can be changed), it stops reading results and returns the ones already read.
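
A minimal sketch of the stop-early idea described above, not the actual patch code. `fetch_page` and `is_exact_match` are hypothetical stand-ins (they are not names from the CV script), and the sketch assumes the API returns results sorted best match first:

```python
def collect_until_mismatch(fetch_page, is_exact_match, page_size=100):
    """Download ranked results pack by pack, stopping after the first pack
    that contains a non-exact match.

    fetch_page(offset, limit) -> list of results (hypothetical downloader)
    is_exact_match(result)    -> bool (hypothetical matching rule)
    """
    kept = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        saw_mismatch = False
        for result in page:
            if is_exact_match(result):
                kept.append(result)
            else:
                saw_mismatch = True
        # Stop once the ranking has moved past the exact matches,
        # or when the server has nothing more to send.
        if saw_mismatch or len(page) < page_size:
            return kept
        offset += page_size


# Usage with fake data standing in for the real search API.
fake_results = ["Batman Beyond", "Batman Beyond Unlimited", "Batman", "Beyond"]


def fake_fetch(offset, limit):
    return fake_results[offset:offset + limit]


def has_both_words(name):
    return "batman" in name.lower() and "beyond" in name.lower()


print(collect_until_mismatch(fake_fetch, has_both_words, page_size=2))
```

The whole pack is still filtered before stopping, so an exact match that happens to sit after a weaker one in the same pack is kept.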

Just think of the new search API as a shop owner... You ask him for a blue metal cover for an iPhone. If he has it, he will show it (or them) first, but if he doesn't, or you don't like the ones he shows you, he will not just say "no, I don't have that" or "that is all I have". He will show you other blue covers for iPhone, or metal covers for iPhone, then just covers for iPhone... then just things for iPhone... When you leave the shop is your decision. Of course, that is a problem if you want to make a list of all blue metal iPhone covers automatically... which is what is happening with this script now. So I just told the buyer: "when the shop owner starts talking about something that is not exactly a blue metal iPhone cover, leave the shop and bring me the list up to there" :P

PS: There is no way to know the exact number of matches until you reach a "not exact match". That is why the results counter is now buggy and will just stop in the middle of the search. In any case, most searches will not show the results counter (as most searches reach a non-exact match in the first download of 100 results), so it is not that big an issue...
Last Edit: 2 weeks 9 hours ago by Xelloss.
The following user(s) said Thank You: oraclexview

Comic Vine Scraper 2 weeks 8 hours ago #49025

  • Xelloss
  • Platinum Boarder
  • Posts: 574
  • Thank you received: 139
  • Karma: 29
If anyone has a modified version of the file and wants to know the exact lines added or modified, here they are:
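#PATCH by Xelloss (relies on the "re" module being imported in the file)

         for serie in series_refs:
             remove = False
             # strip case and punctuation so "Spider-Man" and "spiderman" compare equal
             serie2 = re.sub('[^a-zA-Z0-9]+', '', str(serie).lower())
             for term in search_terms_s.split(" "):
                 term2 = re.sub('[^a-zA-Z0-9]+', '', str(term).lower())
                 if term2 not in serie2:
                     # a searched word is missing: not an exact match, so
                     # cap the result count to stop further downloads
                     num_results_n = iteration
                     remove = True
             if not remove:
                 series_refs2.add(serie)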
This piece of code is added in two places, with different indentation, and is 99% of the change.

For each already downloaded result, it checks whether it is an exact match (that is, whether it contains all the searched words). If it is, the result is added to the result set; if not, num_results_n (the number of results) is modified so that the script stops downloading more.

The first time, it runs on the first 100 results; the second time, on the later downloads when more than 100 results are needed.
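
To try the comparison outside ComicRack, here is a small standalone sketch of the same check. The function names `normalize` and `contains_all_terms` are made up for illustration and do not appear in the CV script:

```python
import re


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-zA-Z0-9]+", "", str(text).lower())


def contains_all_terms(series_name, search_terms):
    """Return True if every space-separated search word occurs in the name."""
    name = normalize(series_name)
    return all(normalize(term) in name for term in search_terms.split(" "))


print(contains_all_terms("The Amazing Spider-Man", "amazing spider-man"))  # True
print(contains_all_terms("Spider-Man 2099", "amazing spider-man"))         # False
# Caveat: this is a substring test, so short words match inside longer ones.
print(contains_all_terms("Batman", "bat"))                                 # True
```

Because it is a plain substring test on the squashed name, it can keep a few results that are not really exact matches, which seems an acceptable trade-off for a temporary filter.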

The other modified lines are:
series_refs2 = set()  -> new line to add a new set

and
return set() if cancelled_b[0] else series_refs2  -> changed the set output to the new one

That is all (I think)
Last Edit: 2 weeks 8 hours ago by Xelloss.
The following user(s) said Thank You: oraclexview

Comic Vine Scraper 2 weeks 8 hours ago #49028

  • krandor
  • Gold Boarder
  • Posts: 282
  • Thank you received: 31
  • Karma: 5
Thanks, Xelloss. Glad my research could be of help. I'll have to test it out tonight. A wonky counter is certainly better than having to wait for 5000 entries to be pulled down.
The following user(s) said Thank You: Xelloss