Crawling data with Scrapy


Crawling, or extracting data from websites, is a common task today. Doing it manually is tedious and time-consuming, so this post introduces a web crawling framework – Scrapy – to automate the work.

What is Scrapy?

Scrapy is an open-source web scraping framework for Python, designed for crawling websites and extracting data from them in a fast, efficient, and structured way. It provides tools to handle requests, parse HTML, follow links, and store extracted data in various formats (JSON, CSV, databases, etc.).

Key features

Efficiency: Scrapy is highly optimized for web scraping. It supports concurrent requests and can handle thousands of requests with minimal resource usage.
Built-in features: It has built-in support for handling HTTP requests, responses, following links, and extracting data, which saves a lot of time compared to writing custom scrapers.
Extensibility: Scrapy is modular and extensible. You can easily add new middleware, pipelines, or custom spiders.
Automation: You can set up Scrapy to crawl a large number of pages automatically, follow links, and scrape data continuously (e.g., using cron jobs).

How Scrapy works

Starting a spider: When you run “scrapy crawl myspider”, Scrapy loads the spider class (myspider), starts making requests to the URLs defined in start_urls, and parses the responses.
Making requests: The spider creates and sends HTTP requests using the scrapy.Request class. Scrapy handles request scheduling, throttling, and retries automatically.
Processing responses: When a response is received, the parse method (or the method given as the callback) is called. You can extract data from the response using Scrapy’s selectors and then either:
- yield the extracted data as a dictionary or Item, or
- yield another request to follow links to additional pages (for example, to scrape paginated content).
Storing data: The extracted data is typically passed through pipelines, where it can be stored in files or databases, or processed further (a minimal sketch follows below).
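
As an illustration of that last step, here is a minimal item pipeline sketch. The class name and logic are made up for this example and assume the items carry a "text" field; pipelines are enabled through the ITEM_PIPELINES setting in settings.py.

from scrapy.exceptions import DropItem

class DropEmptyTextPipeline:
    # illustrative pipeline: discard items that have no "text" value
    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem("Item has no text")
        return item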

Crawler example

You need Python to run this example; you can download and install it from the official Python website.

Scrapy setup

First, create a Python virtual environment so the project’s packages do not conflict with your system packages. I will store the virtual environment data in the example folder. The commands below create and activate the virtual environment.

python -m venv scrapyenv
source scrapyenv/bin/activate  
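
If you are on Windows (Command Prompt), the activation command is:

scrapyenv\Scripts\activate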

Next, run the commands below to install Scrapy and initialize the project. The project name is “mycrawler”.

pip install Scrapy
scrapy startproject mycrawler

After running these commands, you will get a project structure like the one below.

mycrawler/
    scrapy.cfg
    mycrawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg: Configures the Scrapy project.
items.py: Defines the data structure for the scraped items.
middlewares.py: Middlewares that process requests and responses.
pipelines.py: Defines how the scraped data will be processed and stored.
settings.py: Global settings for your Scrapy project.
spiders/: Directory where you’ll define the spiders (classes to crawl websites).
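
As an example, a few settings that are commonly tuned in settings.py are shown below (these setting names are standard Scrapy settings; the values are just an illustration):

# settings.py (excerpt)
ROBOTSTXT_OBEY = True       # respect the target site's robots.txt
DOWNLOAD_DELAY = 0.5        # wait 0.5 seconds between requests to the same site
CONCURRENT_REQUESTS = 16    # maximum number of concurrent requests (Scrapy's default)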

Creating a spider

We will create a spider to crawl the sample website below.

(screenshot: a quote from quotes.toscrape.com, showing the quote text, its author, and tags)

As you can see, each quote has its text, an author, and some tags. We will define a data structure that matches the data we are going to crawl. To do this, we will edit the items.py file.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyQuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Next, we will create the spider and start crawling from the first page. There is a Next button at the bottom of each page, so we will follow it until we have collected all of the quotes.
spiders/my_quote_spider.py

import scrapy

from ..items import MyQuoteItem

class MyQuoteSpider(scrapy.Spider):
    name = "my_quote_spider"
    start_urls = ["https://quotes.toscrape.com"]
    
    def parse(self, response):
        # get the quote data on this page
        for quote in response.css("div.quote"):
            item = MyQuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("span small::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()

            yield item

        # crawl the next page if it is available
        next_url = response.css("li.next > a::attr(href)").get()
        if next_url:
            yield response.follow(next_url, callback=self.parse)

In the code above I used CSS selectors to select the data; you can also use XPath. A selector can match multiple elements, so use getall() (or its older alias extract()) to get all matches as a list, or get() (extract_first()) to get only the first one.
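
For reference, roughly equivalent XPath expressions for the same selections would look like this (assuming the page structure stays as it is today):

quote.xpath('.//span[@class="text"]/text()').get()
quote.xpath('.//span/small/text()').get()
quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()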
To execute the spider, run the command below in your project folder (mycrawler). I will export the results to a JSON file; other formats such as CSV are also supported.

scrapy crawl my_quote_spider -o quotes.json
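
To export CSV instead, simply change the output file extension and Scrapy will pick the matching exporter:

scrapy crawl my_quote_spider -o quotes.csv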

If everything goes right, you will get a JSON file like the one below.

(screenshot: contents of the generated quotes.json file)
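
Each entry in quotes.json mirrors the fields of MyQuoteItem. Abridged and reformatted for readability, the output should look roughly like this (the actual text comes from the site):

[
  {
    "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  ...
]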

Now you can use the extracted data as you wish!

Summary

Scrapy is a full-featured framework that simplifies the task of web scraping and crawling. It provides tools to manage requests, parse web pages, follow links, and store scraped data, all in an efficient, scalable way. Whether you’re scraping a small blog or building a large web crawler, Scrapy can handle the complexity.
I used only a small part of Scrapy’s functionality in this post. You can check the Scrapy documentation to explore more features!
References:
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
CSS selectors – CSS: Cascading Style Sheets | MDN (mozilla.org)
