Monday, 7 June 2021

Looking for a house (part 3)

Here is the last part of my research: an ML model.
I will build a decision tree model on data previously scraped and loaded into an SQL database.
First of all, I create the training data:

In the column "eval" I insert a score from 1 to 3 (1 = not interesting, 2 = not bad, 3 = interesting).

This table is used to train our model as done below:

The SQLAlchemy engine is created and passed directly as a parameter to the pandas function pandas.read_sql().

The variables x and y are used to generate the training and the test data (without further parameters, the data are randomly split into 75% training data and 25% test data).

With that we are able to create our model.

The model comes from the scikit-learn library, whose DecisionTreeClassifier is what I was looking for.

Once the model is trained, we can feed it the whole dataset to obtain the "eval" value for each house.
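The steps described so far can be sketched as follows. This is a minimal illustration, not the original script: the feature columns (price, rooms, surface) and the toy rows are hypothetical, the table is built by hand instead of being read with pandas.read_sql(), and train_test_split is assumed to be the splitting function because its default matches the 75%/25% split mentioned above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training table; in the post it comes from pandas.read_sql().
# The feature columns are assumptions made only for this illustration.
houses = pd.DataFrame({
    "price":   [120000, 95000, 310000, 180000, 250000, 99000, 400000, 150000],
    "rooms":   [3, 2, 5, 4, 4, 2, 6, 3],
    "surface": [80, 55, 160, 100, 120, 60, 200, 85],
    "eval":    [3, 2, 1, 3, 2, 2, 1, 3],
})

x = houses[["price", "rooms", "surface"]]
y = houses["eval"]

# With no test_size given, train_test_split keeps 25% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)

# Accuracy on the held-out rows (not meaningful on toy data, shown for the API).
test_accuracy = model.score(x_test, y_test)

# Feed a full table back to the model to get an "eval" for every house.
# Here the toy table is reused; in the post it is the whole scraped dataset.
houses["predicted_eval"] = model.predict(x)
interesting = houses[houses["predicted_eval"] == 3]
```

Fixing random_state only makes the toy run repeatable; it is not something the post mentions.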

The script below is commented to clarify each step.

 

Finally, here are some interesting houses (eval = 3) picked out by my model.


Wednesday, 26 May 2021

Looking for a house (part 2)

In my previous article (Looking for a house, part 1) I described how I obtain data from a real-estate website.

With a dedicated pipeline I pushed all the data into a CouchDB database.

I prefer to store data in a NoSQL database. In my case CouchDB acts as a kind of data lake where the data are stored, waiting to be processed later. With a NoSQL database I am not constrained to keep a fixed schema or fixed data types. This makes it easier to add or remove fields in the Scrapy item or to adapt the spider to another HTML page.

CouchDB also has a useful HTTP/REST API (I use just the Python library requests to create a new document), and with its sync feature I can easily synchronize the database between different CouchDB instances.
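As an illustration of that HTTP API, here is a minimal sketch of the request that creates a document. The server address, database name and document fields are hypothetical. The request is built with urllib from the standard library so that it runs without extra packages and without a server; with requests, the equivalent call would be requests.post(url, json=document).

```python
import json
import urllib.request


def build_create_request(base_url, database, document):
    """Build (but do not send) the HTTP request that creates a CouchDB document.

    POST /{db} with a JSON body asks CouchDB to store the document under a
    server-generated _id.
    """
    url = f"{base_url.rstrip('/')}/{database}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: server address, database name and document fields.
request = build_create_request(
    "http://localhost:5984", "houses", {"city": "roma", "price": 250000}
)

# Sending it needs a running CouchDB instance (and usually credentials):
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)  # JSON body with the new document id and rev
```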

Here is one document as an example:

My target now is to export the CouchDB documents into a Postgres database. To achieve this I use two other tools: the well-known Python ORM library SQLAlchemy and an ETL tool, bonobo.

Here is the complete data model:



and the ER schema:

In the Couchdb_extract function I extract all the documents from the database:

In the "transform" function the JSON documents are mapped to the SQLAlchemy model defined at the beginning.

Finally, the "load" function inserts all the processed data into Postgres.

A bonobo service is needed to instantiate the connection with Postgres (more here).

The function get_graph chains all our ETL functions together.


In the main function we make the script executable through the command:

bonobo run etl_script.py

Now the data are ready to be analysed in an SQL database.
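The extract, transform and load steps of this post can also be sketched without bonobo, as plain generators chained together. This is a minimal illustration, not the original script: the documents, the house table and its columns are hypothetical, and an in-memory SQLite database stands in for Postgres.

```python
import sqlite3


def extract():
    """Yield raw documents; stands in for reading them from CouchDB."""
    yield {"_id": "a1", "city": "roma", "price": "250000"}
    yield {"_id": "b2", "city": "milano", "price": "410000"}


def transform(documents):
    """Map each JSON-like document to a row with fixed columns and types."""
    for doc in documents:
        yield (doc["_id"], doc["city"], int(doc["price"]))


def load(rows, connection):
    """Insert the processed rows and return how many were written."""
    cursor = connection.executemany(
        "INSERT INTO house (id, city, price) VALUES (?, ?, ?)", rows
    )
    connection.commit()
    return cursor.rowcount


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE house (id TEXT PRIMARY KEY, city TEXT, price INTEGER)"
)

# Chaining the three steps plays the role that get_graph has in the bonobo script.
inserted = load(transform(extract()), connection)
stored = connection.execute(
    "SELECT id, city, price FROM house ORDER BY id"
).fetchall()
```

Because each step is a generator, documents flow through one at a time, which is the same idea as the chained nodes of the ETL graph.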

Friday, 14 May 2021

Looking for a house (Part 1)

Looking for a house... or rather, for a way to collect real estate market data. It could be interesting to see how prices are trending during this pandemic period.

First of all, let's build a spider to collect data from some real estate websites. In this article I focus on just one site: casa.it

As I have already done with the job websites, I will collect all the data in a database through a pipeline, but that is a topic for my next article.

I use the well-known Scrapy framework to build the spider. I am interested in the information marked by the red squares:




The crawler looks like this:

start_urls is built with a list comprehension that generates the URLs from a Python list containing only some relevant cities.

In this case it is easy to generate a url for each city:

 https://www.casa.it/srp/?tr=vendita&ft=<city_name>

for example

https://www.casa.it/srp/?tr=vendita&ft=roma
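The URL generation described above can be sketched as follows. The list of cities is hypothetical (the post does not say which ones were used), and urlencode is just one way to assemble the query string.

```python
from urllib.parse import urlencode

# Hypothetical selection of cities; the post does not list the ones used.
cities = ["roma", "milano", "torino"]

BASE_URL = "https://www.casa.it/srp/"

# One search URL per city, following the pattern ?tr=vendita&ft=<city_name>.
start_urls = [
    f"{BASE_URL}?{urlencode({'tr': 'vendita', 'ft': city})}" for city in cities
]
```

In a Scrapy spider this list would be assigned to the start_urls class attribute.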

The crawler rules follow the "seguente" (next) button at the end of the web page:



The second gist contains the XPath queries.

All together:


From this point the items are ready to be stored through a dedicated pipeline (an example here).


Tuesday, 23 May 2017

How to save the scraped data from Scrapy to a MySQL DB

After a simple example of a web spider (written here), I will write a pipeline that stores the data in a MySQL database.

From the Scrapy documentation:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

In order to process the item, in our case, we need a custom pipeline that stores the item in a MySQL database.

I chose SQLAlchemy to store the items.



I have written three parts here: the first establishes a connection to the database (if you want to try it, don't forget to insert your credentials), the second creates the table in the database and the third defines the model "job".

In SQLAlchemy a user-defined class represents a database table, and the instances of that class represent the table's rows. In our case the model "job" represents the table in the database.

The SQLAlchemy Object Relational Mapper presents a method of associating user-defined Python classes with database tables, and instances of those classes (objects) with rows in their corresponding tables. (from here).
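The class-to-table and instance-to-row mapping can be sketched as follows. This is a minimal illustration, not the original code: the column names and lengths are assumptions, and an in-memory SQLite database replaces MySQL so that no credentials are needed. It assumes SQLAlchemy 1.4 or later, where declarative_base is importable from sqlalchemy.orm.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Job(Base):
    """The class maps to the table "job"; each instance maps to one row."""

    __tablename__ = "job"

    # Hypothetical columns, chosen only for this illustration.
    id = Column(Integer, primary_key=True)
    title = Column(String(200))
    link = Column(String(500), unique=True)
    location = Column(String(100))


# In-memory SQLite keeps the sketch runnable; a MySQL URL would have the form
# "mysql+<driver>://user:password@host/dbname".
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)  # creates the table described by the model

Session = sessionmaker(bind=engine)
session = Session()
session.add(
    Job(title="Python developer", link="https://example.com/1", location="Zurich")
)
session.commit()

row_count = session.query(Job).count()
first_title = session.query(Job).first().title
```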


After that we need to write our custom pipeline.


The init method provides a connection to the database.
The next method, process_item, is called for each parsed item. It simply instantiates a Job object for each item and checks whether the item already exists in the database. If it does, a new line is added to the log file; otherwise the pipeline inserts the item into the database.
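The check-then-insert logic can be sketched as follows. This is a minimal illustration, not the original pipeline: it uses sqlite3 from the standard library instead of SQLAlchemy and MySQL, and the class name, table layout and the choice of the link as the duplicate key are assumptions. A Scrapy pipeline is a plain Python class, so the sketch runs without Scrapy installed.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


class JobPipeline:
    """Store each scraped item once, keyed by its link (an assumed key)."""

    def __init__(self, database=":memory:"):
        # Open the connection once, when the pipeline is created.
        self.connection = sqlite3.connect(database)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS job "
            "(link TEXT PRIMARY KEY, title TEXT, location TEXT)"
        )

    def process_item(self, item, spider):
        # Scrapy calls this method for every item the spider yields.
        exists = self.connection.execute(
            "SELECT 1 FROM job WHERE link = ?", (item["link"],)
        ).fetchone()
        if exists:
            logger.info("Item already in database: %s", item["link"])
        else:
            self.connection.execute(
                "INSERT INTO job (link, title, location) VALUES (?, ?, ?)",
                (item["link"], item["title"], item["location"]),
            )
            self.connection.commit()
        return item  # hand the item on to any later pipeline


# Calling the pipeline by hand, as Scrapy would for two identical items.
pipeline = JobPipeline()
item = {
    "title": "Python developer",
    "link": "https://example.com/1",
    "location": "Zurich",
}
pipeline.process_item(item, spider=None)
pipeline.process_item(item, spider=None)
stored_rows = pipeline.connection.execute("SELECT COUNT(*) FROM job").fetchone()[0]
```

The second call finds the link already stored, so only a log line is written and the table keeps a single row.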

Now we just need to modify the file settings.py in this way:
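The relevant setting is ITEM_PIPELINES. A minimal sketch of the entry, where the module path and class name are hypothetical and must be replaced with those of your own project:

```python
# settings.py
# "myproject.pipelines.JobPipeline" is a hypothetical path: use the module
# and class name of your own pipeline.
ITEM_PIPELINES = {
    "myproject.pipelines.JobPipeline": 300,  # 0-1000, lower values run first
}
```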

That's all


Wednesday, 17 May 2017

A web-spider for Monster

Here we are...

In this post I will write a web-spider.
If you don't know what a spider is, you can check this link: https://en.wikipedia.org/wiki/Web_crawler

Which kind of spider?
We can scrape everything from the web: links, photos, prices... but right now I need a job. So I will write a spider to crawl a job website.

https://www.monster.de/ this one, for example.

Yes, I know... there is the newsletter, but... have you ever tried to subscribe to a newsletter?
All you get is junk mail, and the useful messages usually arrive too late.
Besides, the goal of those websites is not really to find you a job: they just try to push users to spend as much time as possible on their sites, just like any other website...


For these reasons many websites are so hostile to spiders.

Generally, scraping a website in a non-aggressive way does not cause any problems, at least in most cases...



Here I use Python 2.7 + Scrapy.



What am I doing here?
I am just scraping the first page, ordered by date, of monster.ch, monster.de and monster.it, and I extract from each the ad's title, link and location.

Monster uses JSON to show the job ads on the website. All I do here is extract every JSON element in the page and take from it what I am looking for.
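The idea of pulling JSON elements out of a page can be sketched as follows. This is a minimal illustration under stated assumptions: the real Monster markup is not shown in the post, so the page fragment, the script type "application/ld+json" and the key names are all hypothetical.

```python
import json
import re

# Hypothetical page fragment: the real Monster markup is not shown in the post.
html = """
<html><body>
<script type="application/ld+json">
{"title": "Python Developer", "url": "https://example.com/job/1",
 "jobLocation": {"address": {"addressLocality": "Berlin"}}}
</script>
<script type="application/ld+json">
{"title": "Data Engineer", "url": "https://example.com/job/2",
 "jobLocation": {"address": {"addressLocality": "Zurich"}}}
</script>
</body></html>
"""

# Capture the text between the opening and closing script tags.
SCRIPT_PATTERN = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def extract_jobs(page):
    """Return title, link and location for every JSON block in the page."""
    jobs = []
    for raw in SCRIPT_PATTERN.findall(page):
        data = json.loads(raw)
        jobs.append({
            "title": data["title"],
            "link": data["url"],
            "location": data["jobLocation"]["address"]["addressLocality"],
        })
    return jobs


jobs = extract_jobs(html)
```

Inside a spider the JSON blocks would normally be selected from the response with an XPath or CSS query rather than a regular expression; the parsing with json.loads stays the same.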

Where do we store the results of our scraping session?

You can have a look at the log file or, if you prefer another format, the Scrapy framework offers many ways to store the output in a file that is more comfortable to read:
https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports

Next time I will show how to store the results in a MySQL database.

Stay tuned!

Friday, 6 May 2016

OpenMediaVault: the guide for Italians

Why for Italians?

Here you can find the official guide to installing OpenMediaVault:

http://wiki.openmediavault.org/index.php?title=Installation

OK, it is in English, but I think the pictures are more than exhaustive.

So what is there to add?

It seems there is a bug, and installing in Italian causes some problems with chroot.

So if you want to install your beloved (if you don't love it yet, you soon will) OpenMediaVault, it is a good idea to set "English" as the default installation language and change it once the installation is complete.

See you next time!

Thursday, 10 December 2015

Let's see if...

... I can revive this old blog...

Let's start with a simple

"Hello World"

...