TOPIC: Comic Vine Scraper

Comic Vine Scraper 8 months 2 weeks ago #48997

  • Scuttle
  • Junior Boarder
  • Posts: 28
  • Thank you received: 15
  • Karma: 6
Would there be a way to add an option of only returning X results?
The administrator has disabled public write access.

Comic Vine Scraper 8 months 2 weeks ago #48998

  • cbanack
  • Platinum Boarder
  • Posts: 1367
  • Thank you received: 550
  • Karma: 186
Scuttle wrote:
Would there be a way to add an option of only returning X results?

It wouldn't speed anything up, since the results do not come back in the order that you see them in the Scraper -- the scraper only sorts the "best" matches to the top of the list after it has received them all.

I rather doubt that they will be keeping an OR based search without also offering some way of doing an AND based search too. The results are not useful, and they waste a lot of their server resources.

Comic Vine Scraper 8 months 2 weeks ago #49000

  • beardyandy
  • Senior Boarder
  • Posts: 48
  • Thank you received: 5
  • Karma: 0
cbanack - as you're there please

Is there something I'm missing about dealing with special characters? I think it may be a problem with ComicRack itself, rather than your scraper (or somewhere else in my workflow).

But I've had a few instances of special characters being downloaded that seem to cause problems (I'm using MariaDB, if that makes any difference).

e.g. superscript 'th' in comicvine.gamespot.com/the-7-k-sword/4050-73243/
Omega in comicvine.gamespot.com/ral-grad/4050-38325/
em dashes in a few

Any suggestions?
Last Edit: 8 months 2 weeks ago by beardyandy.

Comic Vine Scraper 8 months 2 weeks ago #49001

  • krandor
  • Gold Boarder
  • Posts: 313
  • Thank you received: 34
  • Karma: 5
cbanack wrote:
Scuttle wrote:
Would there be a way to add an option of only returning X results?

It wouldn't speed anything up, since the results do not come back in the order that you see them in the Scraper -- the scraper only sorts the "best" matches to the top of the list after it has received them all.

I rather doubt that they will be keeping an OR based search without also offering some way of doing an AND based search too. The results are not useful, and they waste a lot of their server resources.

Running an admittedly small sample of things through the API directly, it appears that if you search with 3 words, the entries containing all 3 come first, then those matching 2, and finally those matching 1. So I'm not so sure that grabbing the first X wouldn't help here, given how the results are being returned.

As an example, if I search for the volume mentioned earlier, "Ice Cream Man", directly on the API, this is what I get. Notice Ice+Cream+Man is first, then stuff with Ice and Man (in this case Spider-Man and ice stuff).

<response>
<error>
<![CDATA[ OK ]]>
</error>
<limit>100</limit>
<offset>0</offset>
<number_of_page_results>100</number_of_page_results>
<number_of_total_results>2135</number_of_total_results>
<status_code>1</status_code>
<results>
<volume>
<name>
<![CDATA[ Ice Cream Man ]]>
</name>
<resource_type>
<![CDATA[ volume ]]>
</resource_type>
</volume>
<volume>
<name>
<![CDATA[ Harley Chan the Ice-Cream Man ]]>
</name>
<resource_type>
<![CDATA[ volume ]]>
</resource_type>
</volume>
<volume>
<name>
<![CDATA[ The Amazing Spider-Man: Skating on Thin Ice ]]>
</name>
<resource_type>
<![CDATA[ volume ]]>
</resource_type>
</volume>
<volume>
<name>
<![CDATA[ Brer Rabbit In Ice Cream For the Party ]]>
</name>
<resource_type>
<![CDATA[ volume ]]>
</resource_type>
</volume>
<volume>
<name>
<![CDATA[
The Amazing Spider-Man: Skating on Thin Ice: Double Trouble
]]>
</name>
<resource_type>
<![CDATA[ volume ]]>
</resource_type>
</volume>
<volume>
<name>
<![CDATA[ Raspberry Ice Cream War ]]>
</name>
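
A rough way to check this ordering offline is to count how many of the search words each returned name contains. The sketch below is only an illustration: SAMPLE is a hand-trimmed, well-formed stand-in built from the names above (not the full response), and `words`, `match_counts` and `is_ranked_by_matches` are made-up helper names, not part of the scraper or the Comic Vine API. Splitting "Ice-Cream" and "Spider-Man" into separate words is also an assumption about how the search tokenizes names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for the response above (assumption: same element
# names, but closed properly so it can be parsed).
SAMPLE = """<response>
  <results>
    <volume><name><![CDATA[ Ice Cream Man ]]></name></volume>
    <volume><name><![CDATA[ Harley Chan the Ice-Cream Man ]]></name></volume>
    <volume><name><![CDATA[ The Amazing Spider-Man: Skating on Thin Ice ]]></name></volume>
    <volume><name><![CDATA[ Brer Rabbit In Ice Cream For the Party ]]></name></volume>
    <volume><name><![CDATA[ Raspberry Ice Cream War ]]></name></volume>
  </results>
</response>"""


def words(text):
    """Lower-cased set of alphanumeric words; hyphens and colons split words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_counts(xml_text, query):
    """Return (name, number of query words found in name) in response order."""
    root = ET.fromstring(xml_text)
    query_words = words(query)
    return [
        (node.text.strip(), len(query_words & words(node.text)))
        for node in root.iter("name")
    ]


def is_ranked_by_matches(counts):
    """True if the match counts never increase from one result to the next."""
    numbers = [count for _, count in counts]
    return all(a >= b for a, b in zip(numbers, numbers[1:]))


counts = match_counts(SAMPLE, "Ice Cream Man")
for name, count in counts:
    print(count, name)
print("ranked by words matched:", is_ranked_by_matches(counts))
```

On this small sample the counts go 3, 3, 2, 2, 2, which is consistent with the ordering described above, but a handful of names proves nothing about the API in general.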

Comic Vine Scraper 8 months 2 weeks ago #49002

  • krandor
  • Gold Boarder
  • Posts: 313
  • Thank you received: 34
  • Karma: 5
cbanack wrote:
I rather doubt that they will be keeping an OR based search without also offering some way of doing an AND based search too. The results are not useful, and they waste a lot of their server resources.

And the Comic Vine people don't like the OR either, because it makes finding specific issues very difficult for them, but in this case neither we nor they have control. Here is a post from the CV admin on the topic. So in this case, we and CV are on the same team, which is kinda weird since we often blame CV directly....

@krandor: according to other API posts, it's been using OR instead of AND the whole time the new engine's been in use, just like on CV. We don't like it either, as it brings up way too many things that have nothing to do with what we search for, and it makes it hard for editors to find specific things when checking if something exists before we create it. Sometimes it almost feels intentional, so that no search will ever return zero results, or so that it will increase traffic by making people click through multiple pages to find something, but I don't think it is on purpose.

Comic Vine Scraper 8 months 2 weeks ago #49009

  • Xelloss
  • Platinum Boarder
  • Posts: 596
  • Thank you received: 150
  • Karma: 30
krandor wrote:
Running an admittedly small sample of things through the API directly, it appears that if you search with 3 words, the entries containing all 3 come first, then those matching 2, and finally those matching 1. So I'm not so sure that grabbing the first X wouldn't help here, given how the results are being returned.

If that is the case, the script could stop requesting results at the first "page" that contains a result missing one of the words... this way we turn an OR into an AND.
That only works if the API behaves as you said, though...
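
The stopping rule described here could be sketched roughly as follows. Everything in it is an assumption for illustration: `fetch_page(offset, limit)` stands in for whatever actually requests one page of names from the API, `search_until_partial` is a made-up name, and the whole idea depends on the API really returning full matches before partial ones, which nobody in this thread has confirmed.

```python
import re


def words(text):
    """Lower-cased set of alphanumeric words; hyphens and colons split words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def has_all_terms(name, query):
    """True if every word of the query appears in the name (an AND match)."""
    return words(query) <= words(name)


def search_until_partial(fetch_page, query, page_size=100):
    """Collect AND matches page by page, stopping at the first page that is
    empty or contains at least one result missing a query word.

    fetch_page(offset, limit) is a hypothetical callable returning a list of
    result names for that page.
    """
    kept = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        full_matches = [name for name in page if has_all_terms(name, query)]
        kept.extend(full_matches)
        if not page or len(full_matches) < len(page):
            break  # partial matches have started (or results ran out)
        offset += page_size
    return kept


# Tiny in-memory stand-in for the API, using names quoted in this thread.
FAKE_RESULTS = [
    "Ice Cream Man",
    "Harley Chan the Ice-Cream Man",
    "The Amazing Spider-Man: Skating on Thin Ice",
    "Raspberry Ice Cream War",
]


def fake_fetch(offset, limit):
    return FAKE_RESULTS[offset:offset + limit]


print(search_until_partial(fake_fetch, "Ice Cream Man", page_size=2))
```

If the ordering assumption is wrong even occasionally, this silently drops valid results, so it is a trade of completeness for fewer requests.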
Last Edit: 8 months 2 weeks ago by Xelloss.

Comic Vine Scraper 8 months 2 weeks ago #49010

  • krandor
  • Gold Boarder
  • Posts: 313
  • Thank you received: 34
  • Karma: 5
Xelloss wrote:
If that is the case, the script could stop requesting results at the first "page" that contains a result missing one of the words... this way we turn an OR into an AND.
That only works if the API behaves as you said, though...

And I'll be the first to admit I've only done limited testing, but in the tests I've done the AND results have been at the top. That would make sense: if you really are going to OR the terms, you'd still want the most relevant results (i.e. most words matched) at the top. It would be the logical way to do it, but a lot about this search upgrade hasn't been logical...

Comic Vine Scraper 8 months 2 weeks ago #49012

  • cbanack
  • Platinum Boarder
  • Posts: 1367
  • Thank you received: 550
  • Karma: 186
beardyandy wrote:
cbanack - as you're there please

Is there something I'm missing about dealing with special characters? I think it may be a problem with ComicRack itself, rather than your scraper (or somewhere else in my workflow).

But I've had a few instances of special characters being downloaded that seem to cause problems (I'm using MariaDB, if that makes any difference).

e.g. superscript 'th' in comicvine.gamespot.com/the-7-k-sword/4050-73243/
Omega in comicvine.gamespot.com/ral-grad/4050-38325/
em dashes in a few

Any suggestions?

It looks like both the Scraper and Comic Vine handle these special Unicode characters properly (aside from the currently broken search). It is not uncommon for other programs to mess up Unicode characters, though, as this is something that many native English-speaking programmers forget to take into account when they are writing software. If you have other programs in your process, you should investigate whether they are breaking those special characters. If you are writing one of those programs yourself, you should learn about the differences between the ASCII and Unicode character sets, and make sure you are handling Unicode properly.
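
As a small illustration of the ASCII versus Unicode difference, the sketch below checks whether a few strings survive being encoded and decoded with a given character encoding. The sample strings are made up to contain the kinds of characters mentioned above (superscript "th", omega, an em dash) and are not taken from Comic Vine; `survives_round_trip` and `misdecoded` are hypothetical helper names.

```python
# Made-up samples: superscript "th" (U+1D57, U+02B0), Greek capital omega
# (U+03A9) and an em dash (U+2014).
SAMPLES = ["7\u1d57\u02b0", "\u03a9", "one \u2014 two"]


def survives_round_trip(text, encoding):
    """True if text can be encoded and decoded again without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


def misdecoded(text):
    """What UTF-8 bytes look like when a program wrongly reads them as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


for sample in SAMPLES:
    print(
        ascii(sample),
        "utf-8:", survives_round_trip(sample, "utf-8"),
        "ascii:", survives_round_trip(sample, "ascii"),
        "latin-1:", survives_round_trip(sample, "latin-1"),
    )

# Typical symptom of a mismatch somewhere in a workflow: two junk characters
# where a single omega should be.
print(ascii(misdecoded("\u03a9")))
```

All three samples survive UTF-8 and none survive ASCII or Latin-1. The same applies to any step that stores or forwards the text, for example a database table or connection set to a single-byte character set.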

Comic Vine Scraper 8 months 2 weeks ago #49013

  • cbanack
  • Platinum Boarder
  • Posts: 1367
  • Thank you received: 550
  • Karma: 186
As far as changing the scraper to only return part of the search results, this is not something I am interested in doing (though as always, you are free to fork the project on GitHub and make your own changes).

There really is no guarantee that the ComicVine Search API (as it currently works, or as it will work in the future) will always return the comic series you are looking for within the first 10, 20, 50 or even 100 results. When the search is working properly (with ANDing of terms), then it only makes sense to return ALL of the results that the user has asked for (otherwise I'll inevitably get people complaining that they can't find a comic that they are looking for). And when the search is broken (as it currently is) it is a waste of my time to try to patch over the problem with a semi-correct short term fix that I'll just have to remove when they get their search API working correctly again.
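
To put a number on what returning everything costs under the current OR search: the response quoted earlier in the thread reports a `number_of_total_results` of 2135 at a `limit` of 100. A minimal sketch, assuming results are paged purely by `offset` and `limit` as those fields suggest (`page_offsets` is a made-up helper name, not part of the scraper):

```python
def page_offsets(total_results, limit=100):
    """Offsets needed to request every page of a result set."""
    return list(range(0, total_results, limit))


offsets = page_offsets(2135, 100)
print(len(offsets), "requests, last offset", offsets[-1])
```

For that one "Ice Cream Man" search this comes to 22 requests, ending at offset 2100.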
The following user(s) said Thank You: Crave

Comic Vine Scraper 8 months 2 weeks ago #49014

  • beardyandy
  • Senior Boarder
  • Posts: 48
  • Thank you received: 5
  • Karma: 0
Cheers for coming back so quickly.
There's no other part, it's all in ComicRack, but I'll track down what's going wrong in a bit more detail. I just seem to have gotten my API key blocked at present. Whoops.