Crawling data with Scrapy


Crawling, or extracting data from websites, is a common task today. Doing it manually is tedious and time-consuming, so this post introduces a web crawling framework – Scrapy – to automate the work.

What is Scrapy?

Scrapy is an open-source web scraping framework for Python, designed for crawling websites and extracting data from them in a fast, efficient, and structured way. It provides tools to handle requests, parse HTML, follow links, and store extracted data in various formats (JSON, CSV, databases, etc.).

Key features

Efficiency: Scrapy is highly optimized for web scraping. It supports concurrent requests and can handle thousands of requests with minimal resource usage.
Built-in features: It has built-in support for handling HTTP requests, responses, following links, and extracting data, which saves a lot of time compared to writing custom scrapers.
Extensibility: Scrapy is modular and extensible. You can easily add new middleware, pipelines, or custom spiders.
Automation: You can set up Scrapy to crawl a large number of pages automatically, follow links, and scrape data continuously (e.g., using cron jobs).

How Scrapy works

Starting a spider: When you run “scrapy crawl myspider”, Scrapy loads the spider class (myspider), starts making requests to the URLs defined in start_urls, and parses the responses.
Making requests: The spider creates and sends HTTP requests using the scrapy.Request class. Scrapy handles request scheduling, throttling, and retries automatically.
Processing responses: When a response is received, the parse method (or the method given as the callback) is called. You can extract data from the response using Scrapy’s selectors and then either:
- yield the extracted data as a dictionary or Item, or
- yield another request to follow links to additional pages (for example, to scrape paginated content).
Storing data: The extracted data is typically passed through pipelines, where it can be stored in files or databases, or processed further (a minimal sketch follows below).
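
As an illustration of that last step, here is a minimal item pipeline sketch. The class name and logic are made up for this example and assume the items carry a "text" field; pipelines are enabled through the ITEM_PIPELINES setting in settings.py.

from scrapy.exceptions import DropItem

class DropEmptyTextPipeline:
    # illustrative pipeline: discard items that have no "text" value
    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem("Item has no text")
        return item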

Crawler example

You need Python to run this example; you can download and install it from the official Python website.

Scrapy setup

First, create a Python virtual environment so the project’s packages do not conflict with your system packages. I will store the virtual environment data in the example folder. The commands below create and activate the virtual environment.

python -m venv scrapyenv
source scrapyenv/bin/activate  
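
If you are on Windows (Command Prompt), the activation command is:

scrapyenv\Scripts\activate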

Next, run the commands below to install Scrapy and initialize the project. The project name is “mycrawler”.

pip install Scrapy
scrapy startproject mycrawler

After running these commands, you will get a project structure like the one below.

mycrawler/
    scrapy.cfg
    mycrawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg: Configures the Scrapy project.
items.py: Defines the data structure for the scraped items.
middlewares.py: Middlewares that process requests and responses.
pipelines.py: Defines how the scraped data will be processed and stored.
settings.py: Global settings for your Scrapy project.
spiders/: Directory where you’ll define the spiders (classes to crawl websites).
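
As an example, a few settings that are commonly tuned in settings.py are shown below (these setting names are standard Scrapy settings; the values are just an illustration):

# settings.py (excerpt)
ROBOTSTXT_OBEY = True       # respect the target site's robots.txt
DOWNLOAD_DELAY = 0.5        # wait 0.5 seconds between requests to the same site
CONCURRENT_REQUESTS = 16    # maximum number of concurrent requests (Scrapy's default)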

Creating a spider

We will create a spider to crawl the sample website below.

(screenshot: a quote from quotes.toscrape.com, showing the quote text, its author, and tags)

As you can see, each quote has its text, an author, and some tags. We will define a data structure that matches the data we are going to crawl. To do this, we will edit the items.py file.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyQuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Next, we will create the spider and start crawling from the first page. There is a Next button at the bottom of each page, so we will follow it until we have collected all of the quotes.
spiders/my_quote_spider.py

import scrapy

from ..items import MyQuoteItem

class MyQuoteSpider(scrapy.Spider):
    name = "my_quote_spider"
    start_urls = ["https://quotes.toscrape.com"]
    
    def parse(self, response):
        # get the quote data on this page
        for quote in response.css("div.quote"):
            item = MyQuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("span small::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()

            yield item

        # crawl the next page if it is available
        next_url = response.css("li.next > a::attr(href)").get()
        if next_url:
            yield response.follow(next_url, callback=self.parse)

In the code above I used CSS selectors to select the data; you can also use XPath. A selector can match multiple elements, so use getall() (or its older alias extract()) to get all matches as a list, or get() (extract_first()) to get only the first one.
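
For reference, roughly equivalent XPath expressions for the same selections would look like this (assuming the page structure stays as it is today):

quote.xpath('.//span[@class="text"]/text()').get()
quote.xpath('.//span/small/text()').get()
quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()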
To execute the spider, run the command below in your project folder (mycrawler). I will export the results to a JSON file; other formats such as CSV are also supported.

scrapy crawl my_quote_spider -o quotes.json
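
To export CSV instead, simply change the output file extension and Scrapy will pick the matching exporter:

scrapy crawl my_quote_spider -o quotes.csv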

If everything goes right, you will get a JSON file like the one below.

(screenshot: contents of the generated quotes.json file)
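
Each entry in quotes.json mirrors the fields of MyQuoteItem. Abridged and reformatted for readability, the output should look roughly like this (the actual text comes from the site):

[
  {
    "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  ...
]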

Now you can use the extracted data as you wish!

Summary

Scrapy is a full-featured framework that simplifies the task of web scraping and crawling. It provides tools to manage requests, parse web pages, follow links, and store scraped data, all in an efficient, scalable way. Whether you’re scraping a small blog or building a large web crawler, Scrapy can handle the complexity.
I used only a small part of Scrapy’s functionality in this post. You can check the Scrapy documentation to explore more features!
References:
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
CSS selectors – CSS: Cascading Style Sheets | MDN (mozilla.org)
