Wednesday, 17 May 2017

A web spider for Monster

Here we are...

In this post I will write a web spider.
If you don't know what a spider is, you can check this link: https://en.wikipedia.org/wiki/Web_crawler

Which kind of spider?
We can scrape almost anything from the web: links, photos, prices... but right now I need a job. So I will write a spider to crawl a job website.

This one, for example: https://www.monster.de/

Yes, I know... there is the newsletter, but... have you ever tried subscribing to one?
All you get is junk mail, and the useful messages usually arrive too late.
Besides, the real goal of those websites is not to find you a job but to keep users on the site for as long as possible, just like any other website...


For these reasons, many websites are hostile to spiders.

Generally, scraping a website in a non-aggressive way does not cause any problems, at least in most cases...
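
To give an idea of what "non-aggressive" can mean in practice, here is a small sketch of a Scrapy settings.py fragment. The setting names are real Scrapy settings, but the values and the user agent string are only my assumptions for illustration, not the ones used by the spider in this post.

```python
# settings.py fragment (a sketch, values are assumptions, tune them yourself)

# Respect the rules the site publishes in robots.txt.
ROBOTSTXT_OBEY = True

# Wait a few seconds between two requests to the same site.
DOWNLOAD_DELAY = 3

# Never send more than one request at a time to the same domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy slow down automatically when the server responds slowly.
AUTOTHROTTLE_ENABLED = True

# Hypothetical user agent: say who you are and how to contact you.
USER_AGENT = "my-job-spider (+http://example.com/contact)"
```

With only three pages to fetch, a delay of a few seconds costs nothing and keeps the spider far from anything a site could consider an attack.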



Here I use Python 2.7 + Scrapy.



What am I doing here?
I am just scraping the first page of results, ordered by date, of monster.ch, monster.de and monster.it, and extracting the title, the link and the location of each ad.

Monster uses JSON to show the job ads on the website. All I do here is extract every JSON element in the page and take what I am looking for from them.
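
The idea can be sketched with the standard library alone. Please take this as an illustration and not as the exact code of my spider: I am assuming that the JSON sits inside <script type="application/ld+json"> tags, and the keys "title", "url" and "location" as well as the function names are hypothetical, so check the real page source before relying on them.

```python
import json
import re

# Assumption: each job ad is embedded as JSON in a script tag of this type.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_blocks(html):
    """Return every JSON value found in the matching script tags."""
    blocks = []
    for raw in SCRIPT_RE.findall(html):
        try:
            blocks.append(json.loads(raw))
        except ValueError:
            # The script body was not valid JSON, so skip it.
            continue
    return blocks


def extract_jobs(html):
    """Collect title, link and location from each block (hypothetical keys)."""
    jobs = []
    for block in extract_json_blocks(html):
        # A block may hold one ad or a list of ads.
        entries = block if isinstance(block, list) else [block]
        for entry in entries:
            if not isinstance(entry, dict) or "title" not in entry:
                continue
            jobs.append({
                "title": entry.get("title"),
                "link": entry.get("url"),
                "location": entry.get("location"),
            })
    return jobs
```

In a Scrapy spider a helper like this could be called from the parse callback with the page body (response.text in recent Scrapy versions), yielding one item per job.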

Where to store the results of our scraping session?

You can have a look in the log file or, if you prefer another format, the Scrapy framework offers many ways to store the output in a file that is more comfortable to read:
https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports
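
As a small example, a feed export can be switched on from settings.py. This is a sketch that assumes a Scrapy 1.x release like the one used with Python 2.7 here. The file name is made up, and newer Scrapy versions replaced these two settings with a single FEEDS dictionary, so check the documentation for your version.

```python
# settings.py fragment (a sketch for Scrapy 1.x, the file name is an assumption)

# Write one JSON object per line, which is easy to read and to grep.
FEED_FORMAT = "jsonlines"

# Hypothetical output file, created in the directory where the spider runs.
FEED_URI = "monster_jobs.jl"
```

Each scraped job then ends up as one line in the output file, with its title, link and location.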

Next time I will show how to store the results in a MySQL database.

Stay tuned!
