Tuesday, 23 May 2017

How to save scraped data from Scrapy to a MySQL DB

After a simple example of a web spider (written here), I will write a pipeline that stores the data in a MySQL database.

From the Scrapy documentation:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

In order to process the item, in our case, we need a custom pipeline which stores the item in a MySQL database.

I chose SQLAlchemy to store the items.



I have written three pieces of code here: the first establishes a connection to the database (if you want to try it, don't forget to insert your credentials), the second creates the table in the database, and the third defines the model "job".

In SQLAlchemy, a user-defined class represents a database table, and the instances of that class represent the table's rows. In our case the model "job" represents the table in the database.

The SQLAlchemy Object Relational Mapper presents a method of associating user-defined Python classes with database tables, and instances of those classes (objects) with rows in their corresponding tables. (from here).
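A minimal sketch of what these three pieces could look like. The column names (title, company, url), the database name scrapy_db and the pymysql driver are assumptions for illustration, not part of the original project, so adapt them to the fields your spider yields.

```python
from sqlalchemy import Column, Integer, String, create_engine

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and later
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()

# Placeholder URL: insert your own user, password, host and database name.
# The "pymysql" driver is an assumption; any MySQL driver SQLAlchemy supports will do.
DATABASE_URL = "mysql+pymysql://user:password@localhost:3306/scrapy_db"


def db_connect(url=DATABASE_URL):
    """Return an SQLAlchemy engine for the given database URL."""
    return create_engine(url)


def create_table(engine):
    """Create every table declared on Base that does not exist yet."""
    Base.metadata.create_all(engine)


class Job(Base):
    """Maps the "job" table; each instance corresponds to one row."""
    __tablename__ = "job"

    id = Column(Integer, primary_key=True)
    title = Column(String(255))    # hypothetical field
    company = Column(String(255))  # hypothetical field
    url = Column(String(255))      # hypothetical field
```

Passing the URL as a parameter makes it easy to try the model against another backend first, for example db_connect("sqlite:///:memory:").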


After that we need to write our custom pipeline.


The __init__ method provides a connection to the database.
The next method, process_item, is called for each item parsed. It simply instantiates a Job object for each item and checks whether the item already exists in the database. If it does, a new line is added to the log file; otherwise the pipeline inserts the item into the database.
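A rough sketch of a pipeline along these lines. The class name MySQLPipeline, the item fields title and url, and the choice of url as the duplicate check are assumptions; the compact Job model is only a stand-in so the example runs on its own, and in a real project it would be imported from the models module.

```python
import logging

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and later
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

logger = logging.getLogger(__name__)
Base = declarative_base()


class Job(Base):
    # Stand-in model with hypothetical columns.
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    title = Column(String(255))
    url = Column(String(255))


class MySQLPipeline:
    # Placeholder URL: insert your own credentials and database name.
    def __init__(self, url="mysql+pymysql://user:password@localhost:3306/scrapy_db"):
        engine = create_engine(url)
        Base.metadata.create_all(engine)  # make sure the table exists
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Insert the item unless a row with the same url is already stored."""
        session = self.Session()
        try:
            existing = session.query(Job).filter_by(url=item["url"]).first()
            if existing is not None:
                logger.info("Item already in database: %s", item["url"])
            else:
                session.add(Job(title=item["title"], url=item["url"]))
                session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item  # pass the item on to any later pipeline
```

Opening one session per item keeps the example simple; returning the item at the end lets any pipeline that runs afterwards receive it as well.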

Now we just need to modify the file settings.py in this way:
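A minimal sketch of that change, assuming the project package is called myproject and the pipeline class is named MySQLPipeline (both names are hypothetical). ITEM_PIPELINES is the Scrapy setting that enables pipelines; the integer sets the order in which they run, with lower values running first.

```python
# settings.py
# Enable the custom pipeline. Replace the dotted path with the real
# location of your pipeline class; values are conventionally in the 0-1000 range.
ITEM_PIPELINES = {
    "myproject.pipelines.MySQLPipeline": 300,
}
```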

That's all

