Thursday, July 29, 2010
How Did You Learn To Web Scrape Using Python?
A place to meet other Developers
#6196
How Did You Learn To Web Scrape Using Python? 4 Months, 2 Weeks ago Karma: 8
So I've been trying to learn to use Python. I really don't think anything will come of it, as I've not really shown any aptitude in coding thus far in my life, but I will at least try. I've found a video tutorial on YouTube, nothing professional but it's a good way to learn. Thus far, most of the tutorials I've found are based on general application building. What I'll most want to do is web scraping, and that's where I've had some trouble finding resources. Everything I've come across has been pushing other software that, as far as I know, none of you use. This leaves me with nothing that can teach me how to do what I want.

I'd like to ask all of you that have done this, how did you learn? I've noticed from your posts that a couple of you are fairly new to coding, so hopefully you can fully remember the steps you took to get here. Do you recall if you came across any video tutorials for web scraping with Python? For this old dog, watching someone go through the steps is a big help. Are there any particular tips you can offer?

Thanks in advance for any help here.
Alan Scott
Expert Boarder
Posts: 156
... The failure to appreciate... is perfectly understandable, because the readership never evaluates old material in the context of the cultural climate in which it was created, or the state of the art at the time it was created.
Marty Pasko
 
#6203
Re: How Did You Learn To Web Scrape Using Python? 4 Months, 2 Weeks ago Karma: 55
Web scraping is fairly easy and language independent.

You need two things:
* A method to download data from an internet URL
* A regular expression engine to take the retrieved text apart

Try learning regex first. Take a web site and save the HTML from within the browser to a file. Get yourself a regex testing tool (RegexBuddy, etc.) and start playing with it. When you know regex, the rest is easy.
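
For anyone who wants to see those two pieces side by side, here is a minimal sketch in current Python 3 (not the IronPython that ComicRack scripts run on). The sample HTML, the pattern and the function names are made up for illustration, and fetch_html is defined but never called so the example works without a network connection.

```python
import re
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its text (not called below; needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Stand-in for HTML saved from the browser; the markup is invented for this example.
SAMPLE_HTML = """
<ul>
  <li><a href="/issue/1">Example Series #1</a></li>
  <li><a href="/issue/2">Example Series #2</a></li>
</ul>
"""

# Capture the href value and the link text of each anchor tag.
LINK_PATTERN = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(html):
    """Return a list of (url, title) tuples found in the HTML text."""
    return LINK_PATTERN.findall(html)


links = extract_links(SAMPLE_HTML)
for url, title in links:
    print(url, "->", title)
```

A pattern this simple only matches anchors written exactly in that shape, which is why practising on a saved copy of the real page first is a good idea.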
cYo
Moderator
Posts: 1130
Last Edit: 2010/03/12 11:31 By cYo.
 
#6213
Re:How Did You Learn To Web Scrape Using Python? 4 Months, 2 Weeks ago Karma: 18
Hey Alan!

I could make simple python programs before getting here... just need the basics: variables, lists, dictionaries, loops and the like.

Then to do something functional for ComicRack, what I did was actually "reverse engineering" others' scripts... and then googling a lot on particular problems I found along the way. Let me know if I can help you any further!

Cheers!
perezmu
Platinum Boarder
Posts: 413
 
#6370
Re:How Did You Learn To Web Scrape Using Python? 4 Months, 1 Week ago Karma: 4
Hey Alan!

I figured I'd chime in on this one since I'm mos def still a newbie at Python/IronPython coding. I would suggest that you first take the time to learn the basics... especially if programming isn't really your thing. So I would say start off with what perezmu suggested. Learn generic basic Python rules and syntax first. Learn variables, lists, dictionaries, loops, if statements, tuples, etc. If you're able to pick that stuff up quickly, I would go ahead and try learning functions and classes (where a lot of the real power of programming in any language lies) as well. Python in its basic form can be studied at its source for free here. Also, you can try searching Yahoo! and Bing (I'm not a Google lemming just yet, but you can search there too) for other sites that can show you the basic usage of Python. There are quite a few great technical books on Python too. A search on Amazon will reveal these books to you. I don't know about you, but the library system in my California county allows me to access many of the eBook versions of these books online for free with an Electronic Library Card.
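
To give a rough idea of what those basics look like together, here is a small sketch in Python 3. All of the names and data in it are invented for the example; it only shows a variable, a list, a tuple, a dictionary, a loop, an if statement and a function working together.

```python
# Variables
series = "Example Series"

# A list holds ordered items; a tuple is a fixed group of values.
issue_numbers = [1, 2, 3, 4]
first_and_last = (issue_numbers[0], issue_numbers[-1])

# A dictionary maps keys to values (here: issue number -> title).
issue_titles = {1: "First Story", 2: "Second Story"}


def describe(number, titles):
    """Return a one-line description, using an if statement for missing titles."""
    if number in titles:
        return "#%d: %s" % (number, titles[number])
    return "#%d: (no title recorded)" % number


# A loop that calls the function once per issue.
descriptions = []
for number in issue_numbers:
    descriptions.append(describe(number, issue_titles))

for line in descriptions:
    print(series, line)
```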

Then I'd begin learning IronPython, which the great ComicRack supports, by first starting at the source here. There are some great books on this fork of Python as well. The book IronPython Cookbook seems to be the one that is referred to the most.

I would also suggest, like perezmu did, to reverse engineer whatever code/app/programs you can get your hands on. All my programming knowledge over the years has come from doing this and reading books... no formal training at all. So it can get you far for very little money, if any, spent. With that said, you might want to think about signing up at Experts Exchange here for a small fee a month if you are serious about wanting to learn a lot really fast by using the problems and solutions of other coders for your needs.

Another VERY BIG thing you'll need to learn, just as cYo said, is REGULAR EXPRESSIONS! I've found that this is the holy grail of parsing/scraping needs. It's not the easiest thing to learn, but once you've got it, you're the man! cYo's suggestion of using Regex Buddy for this is great too! Even though their site's purpose is to have people use their product, the site also explains a lot of the basics of regular expressions very well in my opinion.

I hope all this helps, buddy. Also, if you reverse engineer the scripts by cYo, Stonepaw, cbanack, perezmu, and wadegiles, to name a few, then you'll have learned quite a bit to get you up and running and qualify you as mini-dangerous. Another note is if you know any VBScript, VBA or Visual Basic coding (basically any Microsoft object-oriented programming) then that'll help you with being a little familiar with the .NET Framework.
oraclexview
aka SoundWave
Expert Boarder
Posts: 153
 
#6371
Re:How Did You Learn To Web Scrape Using Python? 4 Months, 1 Week ago Karma: 39
As a completely self-taught programmer I figured I'd put my own $0.02 in.

As the others have said, learn the basics of Python and IronPython first. Dissecting other scripts is a great help. Don't hesitate to post here if you don't understand what something does; I'm sure one of us can help you out.

As oraclexview said, the IronPython Cookbook is handy. And if you have it in PDF format like I do, you can even read it with ComicRack.

Create a simple script first just to get the feel of writing something and making it work. It doesn't have to be complicated, maybe just something that prints out the series and title to the console for selected comics.
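
A first script along those lines could look something like the sketch below. It is plain Python 3 and does not use ComicRack's actual scripting objects: the Comic type and its series and title fields are hypothetical stand-ins, so inside ComicRack the real book objects and hook setup would replace them.

```python
from collections import namedtuple

# Hypothetical stand-in for a comic entry; ComicRack's real book objects are not shown here.
Comic = namedtuple("Comic", ["series", "title"])


def format_comic(comic):
    """Build the line to print for one comic."""
    return "%s - %s" % (comic.series, comic.title)


def print_selected(comics):
    """Print series and title for each selected comic and return the printed lines."""
    lines = [format_comic(comic) for comic in comics]
    for line in lines:
        print(line)
    return lines


# Invented sample data standing in for the user's selection.
selected = [
    Comic("Example Series", "First Issue"),
    Comic("Example Series", "Second Issue"),
]
printed = print_selected(selected)
```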

A good editor is important for me to help catch stupid errors before I run the script. I currently use Komodo Edit and SharpDevelop for writing scripts. But feel free to try others out.

As for web scraping: for my current script project I am using Html Agility Pack, which is a .NET library for parsing HTML into a Document Object Model. I then use XPath expressions to find the sections I want and then use regular expressions and basic string operations (split, strip, replace) to parse out what I want.
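
The same parse, then XPath, then regex and string operations flow can be sketched in plain Python 3. This is not Html Agility Pack: it swaps in the standard library's xml.etree.ElementTree, which only handles well-formed markup and a limited XPath subset, and the page content below is invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="issue">
      <h2>  Example Series #12  </h2>
      <span class="date">Released: 2010-03-19</span>
    </div>
    <div class="ad"><h2>Buy now!</h2></div>
  </body>
</html>
"""

# Step 1: parse the markup into a tree of elements.
root = ET.fromstring(PAGE.strip())

# Step 2: XPath-style query for every div whose class attribute is "issue".
issue_divs = root.findall(".//div[@class='issue']")

results = []
for div in issue_divs:
    # Step 3: basic string operations to clean up and split the heading.
    heading = div.find("h2").text.strip()
    series, number = heading.split("#")

    # Step 4: a regular expression to pull the date out of the surrounding text.
    date_text = div.find("span[@class='date']").text
    match = re.search(r"\d{4}-\d{2}-\d{2}", date_text)

    results.append({
        "series": series.strip(),
        "number": int(number),
        "released": match.group(0) if match else None,
    })

print(results)
```

Narrowing down to the right section with the tree query first means the regular expression only has to deal with a short piece of text, which keeps the pattern simple.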

I use Expresso to test regular expressions. It is based on the .NET implementation of regex, but as long as you know the differences between the Python and .NET regex engines, it works great.
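
One concrete example of such a difference is how named groups are written: .NET uses (?<name>...), while Python's re module expects (?P<name>...). The sketch below (Python 3, with an invented sample string) shows the Python spelling working and the .NET spelling being rejected by Python's re module.

```python
import re

text = "Released: 2010-03-19"

# Python spelling of a named group: (?P<name>...)
python_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = python_pattern.search(text)
year = match.group("year")
print(year, match.group("month"), match.group("day"))

# The .NET spelling (?<name>...) is not valid in Python's re module.
dotnet_style = r"(?<year>\d{4})"
try:
    re.compile(dotnet_style)
    dotnet_style_accepted = True
except re.error:
    dotnet_style_accepted = False
print("(?<name>...) accepted by Python's re:", dotnet_style_accepted)
```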

Also internet search engines are super useful for finding solutions to problems.
Stonepaw
Moderator
Posts: 334
Last Edit: 2010/03/19 07:26 By Stonepaw.
 
#6514
Re:How Did You Learn To Web Scrape Using Python? 4 Months ago Karma: 8
Hey all, I want to thank you for all the feedback. It's already been helpful. I can't code anything yet, but at least now I know why.

@ cYo - Thanks for the tip about RegexBuddy, I picked that up and it's really nice to work with. I think RegEx is the one I should focus on learning first, as Python's syntax seems to make some sense to me already, but RegEx, on the other hand, for me it looks like some kind of alien language. I think I'll get better at it as I come to understand all the different parts, but for now it's a little tough.

@ Perezmu - Yep, I've definitely been looking at the different scripts to work out how things are done. One quick question you could answer is, the first few scraping scripts consisted of one .py file, but now those like yours and cbanack's have multiple files. What is the benefit of using multiple files?

@ oraclexview - I see you mentioned .NET... what do you recommend I learn about it specifically? Should I learn the whole language or just certain aspects of it?

@ Stonepaw - I've been focusing more on learning RegEx at the moment just because it is the one I'm more challenged by, but I have been keeping up with Python learning too. Could you tell me, what is the difference between Python and IronPython? Extra methods? I haven't yet worked with your suggested tools but I am working my way to them. I think I might know how to pull off your simple script suggestion; I'll post here when I try it.

Again, thanks to you all for the help so far. Whether or not I ever get good at this remains to be seen, but I'll try to post here about my experiences every once in a while so others that are thinking about learning can get a true idea of what it's like.
Alan Scott
Expert Boarder
Posts: 156
 
#6523
Re:How Did You Learn To Web Scrape Using Python? 4 Months ago Karma: 55
Try your new 1337 regex skillz on creating some WebComics
cYo
Moderator
Posts: 1130
 
#6533
Re:How Did You Learn To Web Scrape Using Python? 4 Months ago Karma: 4
Alan Scott wrote:
but RegEx, on the other hand, for me it looks like some kind of alien language.
I'll have to agree with you here...RegEx looks like a big monster to me at this stage of my learning as well. I'm looking to implement it in my scripts also, but I have to 1st know exactly how it works, and boy is it a chore.

What is the benefit of using multiple files?
You guys can expand on this, but simply, you can look at a Python file as a module. Writing code in a modular fashion allows the developer to break the code out into pieces that are more directly related. This can allow you to use/write code more efficiently. For example, you can have a form written in its own Python file that is referenced many times in a separate Python file. If I've got this right, a Python file is basically imported as a module object, and its functions and classes are reached through that object. So, you could write the code in a separate file and import it, or write it directly in the same file and reference it there. I think that's more of a developer preference. I would say the benefit of having a separate file is so that you can use the function(s) of that code over again in other scripts that you write.
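
Here is a small sketch of that idea in Python 3. The module name title_cleanup and the function tidy_title are invented for the example, and the module file is written to a temporary folder only so the sketch runs on its own; in a real project it would just be a .py file sitting next to the main script and a plain import line would do.

```python
import importlib
import os
import sys
import tempfile

# Source of a small helper module; normally this would simply be a file
# named title_cleanup.py next to the main script.
MODULE_SOURCE = '''
def tidy_title(raw):
    """Collapse repeated whitespace and trim the ends."""
    return " ".join(raw.split())
'''

# Write the module to a temporary folder so the example is self-contained.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "title_cleanup.py"), "w") as handle:
    handle.write(MODULE_SOURCE)

# Make the folder importable, then import the file by its name (minus ".py").
sys.path.insert(0, module_dir)
title_cleanup = importlib.import_module("title_cleanup")

# Functions defined in the file are reached through the module object.
tidied = title_cleanup.tidy_title("  Example   Series  #12 ")
print(tidied)
```

Any other script that can see the same file can import it the same way, which is where the reuse comes from.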

I see you mentioned .NET... what do you recommend I learn about it specifically? Should I learn the whole language or just certain aspects of it?
Well, I personally learn certain aspects based on what I need at the moment. That's why I love programming language reference libraries soooo much! Once you have the basic understanding of a language down, you can then just jump around and pick out the stuff you need most for your application. So I would say, learn about object-oriented programming in general. Then get a feel for the .NET syntax. After that, just pick what you need. Also, always ask others, cuz everyone does things a little different sometimes.

Could you tell me, what is the difference between Python and IronPython? Extra methods?
Python is the alpha and Iron is the omega! Lol, seriously though, IronPython is just an extension of the Python language with pre-defined code written to access the .NET Framework so that you don't have to worry about doing it.
oraclexview
aka SoundWave
Expert Boarder
Posts: 153
Last Edit: 2010/03/26 07:08 By oraclexview.
 
#6535
Re:How Did You Learn To Web Scrape Using Python? 4 Months ago Karma: 39
oraclexview wrote:
Alan Scott wrote:
but RegEx, on the other hand, for me it looks like some kind of alien language.
I'll have to agree with you here...RegEx looks like a big monster to me at this stage of my learning as well. I'm looking to implement it in my scripts also, but I have to 1st know exactly how it works, and boy is it a chore.
Regular expressions can be very complicated and confusing for anybody; however, they are very powerful in the right circumstances.

I found this site was very good at explaining the basics of regex when I was learning how to use them a while ago, it might be of some use to you.

You may also want to try and dissect some RegEx to see what they do. There are a couple in perezmu's toolbox scripts and one or two in my Weekly Releases script.

Oraclexview, your monster comment reminded me of this slightly ridiculous RegEx.

What is the benefit of using multiple files?
oraclexview answered this pretty well; however, for myself it is partly ease of use. It is easier to manage large scripts if the code is spread into small related module files rather than thousands of lines of code in one single file.

When you get into Windows Forms it becomes almost essential to do this in order to keep yourself sane.


Could you tell me, what is the difference between Python and IronPython? Extra methods?
The most popular version of Python, CPython, is written in C. IronPython, however, is written in C# and runs on .NET, which allows you to use all of .NET's code libraries (around 100 built-in, I think) while keeping the syntax, basic functionality and strengths of the Python language.

The unfortunate side effect is that most popular CPython modules are incompatible with IronPython, but since most things can be accomplished with .NET libraries, it's not that big a loss.
Stonepaw
Moderator
Posts: 334
 
#6548
Re:How Did You Learn To Web Scrape Using Python? 4 Months ago Karma: 18
Hi Alan,

Regarding the use of various files, it has been answered so well that nothing is left to say! I only want to add that the original reason why I used several files was because I wanted the info in Imprints.py to be easily edited by the user without having to mess with the code, so I thought putting it into a separate file just made sense!

Cheers!
perezmu
Platinum Boarder
Posts: 413
 