In this post I will write a web spider.
If you don't know what a spider is, have a look here: https://en.wikipedia.org/wiki/Web_crawler
Which kind of spider?
We can scrape everything from the web: links, photos, prices... but right now I need a job, so I will write a spider to crawl a job website.
https://www.monster.de/, for example.
Yes, I know... there is the newsletter, but have you ever tried subscribing to one?
All you get is junk mail, and the useful messages usually arrive too late.
Besides, the real goal of these websites is not to find you a job but to keep users on the site for as long as possible, just like any other website...
For these reasons many websites are hostile to spiders.
In most cases, though, scraping a website in a non-aggressive way does not cause any problems.
Here I use Python 2.7 + Scrapy.
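To give an idea of what "non-aggressive" could mean in a Scrapy project, here is a minimal sketch of a settings.py. The setting names are standard Scrapy settings, but every value is my own assumption for illustration, and the user agent string is hypothetical.

```python
# settings.py (sketch): polite crawling options for a Scrapy project.
# All values are illustrative assumptions, tune them for the target site.

# Respect the rules published in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to how fast the server responds.
AUTOTHROTTLE_ENABLED = True

# Identify the spider honestly; this name and URL are hypothetical.
USER_AGENT = "job-spider (+https://example.com/contact)"
```

A slower crawl takes a bit longer, but for a single results page per site the difference is hardly noticeable.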
What am I doing here?
I am just scraping the first page of results, ordered by date, of monster.ch, monster.de and monster.it, and extracting each ad's title, link and location.
Monster uses JSON to render the job ads on the website, so I simply extract every JSON element in the page and take from it what I am looking for.
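The idea of pulling JSON elements out of a page can be sketched with the standard library alone. This is not the actual spider: I assume here that each ad sits in a script tag of type application/ld+json, and the field names (title, url, location), the function extract_jobs and the sample HTML are all hypothetical. The real markup on Monster may well look different.

```python
import json
import re

# Assumption: each job ad is embedded as JSON inside a
# <script type="application/ld+json"> tag. The real markup may differ.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_jobs(html):
    """Return a list of dicts with title, link and location for each ad."""
    jobs = []
    for raw in SCRIPT_RE.findall(html):
        try:
            data = json.loads(raw)
        except ValueError:
            # Skip script blocks that do not contain valid JSON.
            continue
        # A block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            # Ignore anything that does not look like a job ad.
            if not isinstance(item, dict) or "title" not in item:
                continue
            jobs.append({
                "title": item.get("title"),
                "link": item.get("url"),
                "location": item.get("location"),
            })
    return jobs


# Hypothetical page fragment used only to try the function out.
SAMPLE_HTML = """
<html><body>
<script type="application/ld+json">
{"title": "Python Developer", "url": "https://example.com/job/1", "location": "Berlin"}
</script>
<script type="application/ld+json">not json at all</script>
</body></html>
"""

if __name__ == "__main__":
    for job in extract_jobs(SAMPLE_HTML):
        print(job)
```

In a Scrapy spider the same function could be fed with the body of the downloaded page inside the parse callback, yielding one item per dictionary.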
Where do we store the results of our scraping session?
You can have a look in the log file or, if you prefer another format, the Scrapy framework offers many ways to store the output in a file that is more comfortable to read:
https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports
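As a rough example of a feed export, the following settings.py lines would write the scraped items to a JSON file. This is a sketch for the Scrapy 1.x releases that still run on Python 2.7; newer Scrapy releases configure the same thing through a single FEEDS setting instead. The file name jobs.json is just my assumption.

```python
# settings.py (sketch): export scraped items to a JSON file.
# Applies to Scrapy 1.x; the output file name is a hypothetical choice.

# Serialization format of the exported feed.
FEED_FORMAT = "json"

# Where the feed is written, relative to the working directory.
FEED_URI = "jobs.json"

# Keep non-ASCII characters (e.g. German umlauts) readable in the output.
FEED_EXPORT_ENCODING = "utf-8"
```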
Next time I will show how to store the results in a MySQL database.
Stay tuned!