Wednesday, 26 May 2021

Looking for a house (part 2)

In my previous article (looking for house part 1 ) I described how I obtain data from a real-estate website.

With the a dedicated pipeline I pushed all data in a Couchdb database.

I prefer to store data in a NoSql database. In my case the Couchdb is something like a Datalake where the data are stored waiting to be processed  in a next time. With a Noslq database I am not constrain to keep a fix schema or a fixed datatype. In this way I am more flexible to add or remove Fields in the Scrapy item or adapt the spider to another html page.  

Couchdb has also a useful http/rest API (I use just the python library requests to create a new document) and with its sync feature I can easily synchronize the database between different Couchdb instances.

Here one document as example:

My target now is to export the Couchdb documents in a Postgres database. To achieve this target I use 2 other tools: the famous ORM Python library sqlalchemy and an ETL Tool, bonobo.

Here the complete data model: 



and the ER schema:

in the Couchdb_extract function I extract all the documents from the database: 

in the "transform" function the json documents have been mapped to the Slqalchemy Model defined at the beginning

finally in the "load" method inserts all the processed data in Postgres.

A bonobo service is needed to instantiate the connection with Postgres (more here).

the function get_graph chains all our  ETL functions together.


and in the main function we make the script executable through the command:

bonobo run etl_script.py

Now the data are ready to be analysed from a sql database.

No comments:

Post a Comment