TOPIC: COMIC.ORG offline "pseudo scraper" - Need help with a project!

COMIC.ORG offline "pseudo scraper" - Need help with a project! 10 months 2 weeks ago #46761

  • Xelloss
Over the last few weeks I have been working on a way to sync the offline database of comic.org (which has a lot of info comicvine doesn't have, such as format, brand, etc.) with comics already scraped from comicvine... (using the best of both sites)

The first step, and the most complicated one, is to build a list of every comicvolume_id (from comicvine) with its corresponding id or ids in the comic.org database...

To do that, I installed a MySQL server and downloaded the database dump from the comic.org site. I extracted the data I needed from it, then scraped the first issue of each volume in the comicvine database and extracted that data as well... (which comes to about 75k series in comicvine and 100k series in comic.org)

With these two lists and the values I chose for the sync, I wrote a HUGE algorithm that managed to link about 70% of the comicvine volumes to a comic.org volume... (it doesn't seem like a lot, but with about 20% of the comicvine volumes not present in the comic.org database (and the other way around too) and a lot of differences in the rules for how comics are "grouped" into volumes... I think this is about as far as I can get on my own in an automatic way)
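
For anyone wondering what a rule-based sync like this could look like, here is a minimal sketch. It is not the actual script: the field names (`name`, `year`, `publisher`), the three rules and the `normalize` helper are all assumptions made up for illustration. The only idea taken from the thread is that rules are tried in order and each link records which rule produced it.

```python
import re

def normalize(name):
    """Lowercase, replace punctuation with spaces, drop a leading 'the'."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = re.sub(r"^\s*the\s+", "", name)
    return " ".join(name.split())

# Hypothetical rules, ordered from strictest to loosest. The number of the
# rule that fires is stored with the link, so a higher number means the two
# series had less in common.
RULES = [
    lambda cv, org: (cv["name"] == org["name"] and cv["year"] == org["year"]
                     and cv["publisher"] == org["publisher"]),
    lambda cv, org: (normalize(cv["name"]) == normalize(org["name"])
                     and cv["year"] == org["year"]),
    lambda cv, org: (normalize(cv["name"]) == normalize(org["name"])
                     and abs(cv["year"] - org["year"]) <= 1),
]

def link_volumes(cv_volumes, org_series):
    """Return ({cv_id: (org_id, rule_number)}, [unlinked cv ids])."""
    links, unlinked = {}, []
    for cv in cv_volumes:
        for rule_no, rule in enumerate(RULES, start=1):
            match = next((o for o in org_series if rule(cv, o)), None)
            if match is not None:
                links[cv["id"]] = (match["id"], rule_no)
                break
        else:
            # No rule matched this volume against any series.
            unlinked.append(cv["id"])
    return links, unlinked
```

This naive version compares every volume against every series, which is slow at 75k x 100k; a real script would probably index the series by normalized name first.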

This list is still incomplete (I assume about 40% of the still-unlinked series do have a correct comic.org id) and there is still a lot of fixing to do on the ids already linked (I estimate 95% of them are correct, but I am still finding wrongly linked ones)

With this in mind, I am asking if someone wants to give me a hand... it only involves looking through the huge list of already linked ids to find errors (to improve the sync rules) and looking through the still unlinked ids to find ones that can be linked (to make new sync rules)

The idea is that once we have a more reliable list of linked ids (it will never be perfect, but it can still be improved a lot), we use it to make an offline data file with all the information that can be added to comics already scraped with comicvine (not so big, something like 200 MB I think) and write a script that fills it into the comics automatically in seconds... (also adding any data or corrections anyone wants to contribute to the database; I am also working on a way to do that automatically, with another script that reads already completed data from your collections)
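
As a rough illustration of that last step, the fill-in script could boil down to a lookup by comicvine volume id. This is only a sketch under assumptions: the JSON layout of the data file, the `cv_volume_id` key and the field names are hypothetical, and a real ComicRack script would read and write the comic objects through ComicRack itself.

```python
import json

def load_extra_data(json_text):
    """Parse the offline data file: {comicvine volume id: {field: value}}."""
    raw = json.loads(json_text)
    # JSON object keys are always strings, so convert them back to ints.
    return {int(volume_id): fields for volume_id, fields in raw.items()}

def complete_comic(comic, extra_by_volume):
    """Return a copy of `comic` with empty fields filled from the data file.

    Values the user already has are never overwritten.
    """
    extra = extra_by_volume.get(comic.get("cv_volume_id"))
    if not extra:
        return comic  # volume not linked, nothing to add
    merged = dict(comic)
    for field, value in extra.items():
        if not merged.get(field):
            merged[field] = value
    return merged
```

Filling only empty fields keeps the script safe to run repeatedly over a whole collection.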

Well... that's it. I don't know if anyone understood any of what I said or if anyone is interested, but I wanted to at least hear what you think of my idea... I will continue tinkering with the sync script between the two databases a bit more, and then I will post the list here for anyone interested :)
Last Edit: 10 months 2 weeks ago by Xelloss.
The administrator has disabled public write access.
The following user(s) said Thank You: perezmu, dockens, romsnesrom

COMIC.ORG offline "pseudo scraper" - Need help with a project! 10 months 2 weeks ago #46768

  • Xelloss
As promised... The latest version of the list...

Remember it is done automatically... so there are a lot of "WTF?!" links...

The list of linked comics has the ids used in both comicvine and comic.org... and a number in "()": this is the number of the rule in my script that produced the link... Supposedly (or ideally), the larger this number, the more likely the link is incorrect (as the two series have less in common)

mega.nz/#!pBtllJqT!UON_NLgt6g8MvgzWYTOfleQDMhyAxwfg2xkKGydCwc0

Feel free to send me any mistakes you see (and there are hundreds, I am still improving it)... it will help me improve the rules and create new ones...

In the unlinked comics list you can also see comics from both comicvine and comic.org... Most of them exist in only one database, but there are still hundreds that could be linked and weren't because no rule detected they were the same...

Once this list is "usable", I will use it to write a script that, from a small offline database dump, adds the comic.org data comicvine doesn't have (format, for example) to comics already scraped with comicvine...
Last Edit: 10 months 2 weeks ago by Xelloss.

COMIC.ORG offline "pseudo scraper" - Need help with a project! 10 months 1 week ago #46781

  • perezmu
This is a huge effort and an awesome project. Having coded the original comic vine scraper, I can tell. I also tried to do the same thing you are doing, linking comicvine with the official marvel api, and I understand well the process you are going through! I am afraid I cannot help you as of today, but thanks for the effort!

Cheers!

COMIC.ORG offline "pseudo scraper" - Need help with a project! 10 months 1 week ago #46787

  • Xelloss
Ok, I spent my weekend building an SQL database with all the comicvine info (from the XMLs returned by the API)

As I didn't know anything about parsing (in the end I had to parse it "by hand" anyway because the python xml reader is horrible), anything about sql, anything about anything XD... it took me a while, but yesterday I finished building a very simple sql database with the volumes, issues and publishers from comicvine... (some data is still missing... as I only used the volumes, issues and publishers functions of the CV API (which can retrieve 100 results at a time), and the info in these functions is limited - no character info, for example)
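
For readers who want to try something similar without parsing by hand, here is a minimal sketch using only Python's standard library (`xml.etree.ElementTree` and `sqlite3`). It is not the code described above. The XML snippet is a made-up sample and its element names are assumptions, so check them against the real API responses before relying on them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; a real API response may be shaped differently.
SAMPLE_XML = """
<response>
  <results>
    <volume>
      <id>101</id>
      <name>Example Volume</name>
      <count_of_issues>12</count_of_issues>
      <publisher><id>7</id><name>Example Publisher</name></publisher>
    </volume>
  </results>
</response>
"""

def store_volumes(conn, xml_text):
    """Insert every <volume> element into a simple table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS volumes ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "issue_count INTEGER, publisher_id INTEGER)"
    )
    rows = []
    for vol in ET.fromstring(xml_text).iter("volume"):
        publisher_id = vol.findtext("publisher/id")  # None if no publisher
        rows.append((
            int(vol.findtext("id")),
            vol.findtext("name"),
            int(vol.findtext("count_of_issues", default="0")),
            int(publisher_id) if publisher_id else None,
        ))
    # OR REPLACE lets the same page of results be loaded twice safely.
    conn.executemany("INSERT OR REPLACE INTO volumes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Since the API returns results in pages of 100, a loader like this would be called once per page.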

All the same, now I have A LOT of new data to play with (for example the number of issues in each volume) that I can use to improve my sync algorithm...
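
One possible way to use the issue counts, sketched under assumptions (the `issues` key and the scoring formula are invented for illustration, not taken from the sync script): when several series share a name, prefer the one whose issue count is closest.

```python
def issue_count_score(cv_count, org_count):
    """Similarity between two issue counts, from 0.0 to 1.0 (identical)."""
    if cv_count <= 0 or org_count <= 0:
        return 0.0  # missing or unknown count gives no evidence
    return min(cv_count, org_count) / max(cv_count, org_count)

def best_candidate(cv_volume, candidates):
    """Pick the candidate whose issue count is closest, or None if empty."""
    return max(
        candidates,
        key=lambda c: issue_count_score(cv_volume["issues"], c["issues"]),
        default=None,
    )
```

A ratio is used instead of a plain difference so that 1 issue versus 2 counts as a bigger mismatch than 100 versus 101.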

I will begin working on this when I have some free time :)

COMIC.ORG offline "pseudo scraper" - Need help with a project! 10 months 1 week ago #46788

  • Xelloss
perezmu wrote:
This is a huge effort and an awesome project. Having coded the original comic vine scraper, I can tell. I also tried to do the same thing you are doing, linking comicvine with the official marvel api, and I understand well the process you are going through! I am afraid I cannot help you as of today, but thanks for the effort!

Cheers!

My final objective is to sync the comicrack comics with the DC and Marvel Wikia... but for that I first had to find a way to "filter" the TPBs and reprints out of the comicvine database. As CV doesn't separate original comics from "reprints" and TPBs (and similar), I began this project to sync it with comic.org, which does make this separation... (making format scraping at large scale possible)

Of course that was the original thought; then it ended up becoming a project of its own XD

ps: I know what you are thinking... the wikia databases are almost pure text data, so a lot of UGLY parsing is needed to scrape info from them... but I have already done some work on that, and I think it is quite possible to scrape at least 90% of the info there... (I am also working in the wikias to standardize the data format XD)

Needless to say, I am learning a lot of things with all this (sql, python, xml, wikia code, etc), and it seems to be the only way I can learn these things without wanting to commit suicide XDDDD
Last Edit: 10 months 1 week ago by Xelloss.
The following user(s) said Thank You: khaoohs, romsnesrom

COMIC.ORG offline "pseudo scraper" - Need help with a project! 8 months 2 weeks ago #47077

  • Alan Scott
Hello! I would like to ask if you are still developing this project. It's been quite some time since I saw something happening here to excite me, and this is VERY exciting. I would love to have access to data like cover prices and genres, among other data available on comic.org. Access to data like the Wikias would be lovely too. Thank you very much!
... The failure to appreciate... is perfectly understandable, because the readership never evaluates old material in the context of the cultural climate in which it was created, or the state of the art at the time it was created.
Marty Pasko

COMIC.ORG offline "pseudo scraper" - Need help with a project! 8 months 2 weeks ago #47087

  • Xelloss
Because of personal life, this is on hiatus for now... But not forgotten :P

The big problem with this is the automatic matching between comicvine ids and comic.org ids... I have made some progress on that, with about a 70% success rate... but it still needs A LOT of work...
Last Edit: 8 months 2 weeks ago by Xelloss.

COMIC.ORG offline "pseudo scraper" - Need help with a project! 8 months 1 week ago #47108

  • DanielFJorge
Hey, this is a nice project... the quickest way to achieve a reliable match is image recognition of the covers. This would solve almost all your problems... Of course there are covers that are very similar (for instance, a TPB cover is almost always the same as the first cover of the series), but you can use other information like publication year, series name, format, etc. to build a confidence index for the match. I'm positive this would correctly classify over 99% of things... Here is an article that introduces how to build an image search engine:
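
To make the confidence index idea concrete, here is a small sketch. The weights, the five-year window and the field names are arbitrary assumptions, and the cover similarity is passed in as a number between 0 and 1 because the image comparison itself is out of scope here.

```python
from difflib import SequenceMatcher

# Hypothetical weights; they would need tuning against known-good links.
WEIGHTS = {"cover": 0.5, "name": 0.3, "year": 0.2}

def confidence(cv, org, cover_similarity):
    """Combine cover, name and year evidence into one score from 0 to 1."""
    name_sim = SequenceMatcher(
        None, cv["name"].lower(), org["name"].lower()
    ).ratio()
    # Full credit for the same year, fading to zero at 5 years apart.
    year_sim = max(0.0, 1.0 - abs(cv["year"] - org["year"]) / 5.0)
    return (WEIGHTS["cover"] * cover_similarity
            + WEIGHTS["name"] * name_sim
            + WEIGHTS["year"] * year_sim)
```

With these weights, a TPB that reuses the first issue's cover and title but was published many years later scores about 0.8 instead of 1.0, so the year is what separates it from the original series.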

www.pyimagesearch.com/2014/12/01/complet...ngine-python-opencv/

Good luck

COMIC.ORG offline "pseudo scraper" - Need help with a project! 8 months 1 week ago #47128

  • Xelloss
hahahaha, that is quite an ambitious take on the problem... I don't know if I am up to that task D:

On top of that, it would mean one of two things: either downloading all the cover images or doing the matching online... either one would take A LOT OF TIME to accomplish...
Last Edit: 8 months 1 week ago by Xelloss.

COMIC.ORG offline "pseudo scraper" - Need help with a project! 8 months 1 week ago #47129

  • DanielFJorge
HAHAHA... it seems scary, but it is not that hard... you do not have to download the images manually... take a look at scrapy for python... it is an awesome lib for building web crawlers... downloading all the images would take no more than 10 minutes... although... programming a crawler to do that would take at most 2h... then you will have complete automation to correlate any comics databases with each other