In my previous article (looking for house part 1 ) I described how I obtain data from a real-estate website.
With the a dedicated pipeline I pushed all data in a Couchdb database.
I prefer to store data in a NoSql database. In my case the Couchdb is something like a Datalake where the data are stored waiting to be processed in a next time. With a Noslq database I am not constrain to keep a fix schema or a fixed datatype. In this way I am more flexible to add or remove Fields in the Scrapy item or adapt the spider to another html page.
Couchdb has also a useful http/rest API (I use just the python library requests to create a new document) and with its sync feature I can easily synchronize the database between different Couchdb instances.
Here one document as example:
My target now is to export the Couchdb documents in a Postgres database. To achieve this target I use 2 other tools: the famous ORM Python library sqlalchemy and an ETL Tool, bonobo.
Here the complete data model:
and the ER schema:
in the Couchdb_extract function I extract all the documents from the database:
in the "transform" function the json documents have been mapped to the Slqalchemy Model defined at the beginning
finally in the "load" method inserts all the processed data in Postgres.
A bonobo service is needed to instantiate the connection with Postgres (more here).
the function get_graph chains all our ETL functions together.
and in the main function we make the script executable through the command:
bonobo run etl_script.py
Now the data are ready to be analysed from a sql database.


No comments:
Post a Comment