Crawling data with Scrapy
Extracting data from websites is a common task today. Doing it by hand is tedious and time-consuming, so this post introduces Scrapy, a web crawling framework that automates the work.
What is Scrapy?
Scrapy is an open-source web scraping framework for Python, designed for crawling websites and extracting data from them in a fast, efficient, and structured way. It provides tools to handle requests, parse HTML, follow links, and store extracted data in various formats (JSON, CSV, databases, etc.).
Key features
Efficiency: Scrapy is built for web scraping. It sends requests concurrently and asynchronously, so a single process can work through thousands of pages without waiting on each response in turn.
Built-in features: It has built-in support for handling HTTP requests, responses, following links, and extracting data, which saves a lot of time compared to writing custom scrapers.
Extensibility: Scrapy is modular and extensible. You can easily add new middleware, pipelines, or custom spiders.
Automation: You can set up Scrapy to crawl a large number of pages automatically, follow links, and scrape data continuously (e.g., using cron jobs).
How Scrapy works
Starting a spider: When you run “scrapy crawl myspider”, Scrapy loads the spider class whose name attribute is myspider, starts making requests to the URLs defined in start_urls, and passes the responses to the spider for parsing.
Making requests: The spider creates and sends HTTP requests using the scrapy.Request class. Scrapy handles request scheduling, throttling, and retries automatically.
Processing responses: When a response is received, the parse method (or the method given as the request's callback) is called. You can extract data from the response using Scrapy’s selectors and then either:
- Yield the extracted data as a dictionary or Item.
- Yield another request to follow links to additional pages (for example, to scrape paginated content).
Storing data: The extracted data is typically passed through pipelines, where it can be stored in files, databases, or processed further.
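The storage step can be sketched as an item pipeline. This is a minimal sketch, not the code that the generated pipelines.py contains: the class name JsonLinesPipeline and the output file name items.jl are assumptions made for illustration, and it uses the classic process_item(item, spider) signature. It relies only on the standard library, so it can be tried without running a crawl.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each scraped item as one JSON line."""

    def __init__(self, path="items.jl"):
        # The file name is an assumption for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so later pipelines can process it too.
        return item
```

A pipeline like this only runs once it is registered in the ITEM_PIPELINES setting, using the dotted path to the class and an order number such as 300.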
Crawler example
You need Python to run this example. You can download and install it from python.org.
Scrapy setup
First, create a Python virtual environment. This keeps the project's packages from conflicting with your system packages. The commands below create a virtual environment in the scrapyenv folder and activate it (on Linux or macOS).
python -m venv scrapyenv
source scrapyenv/bin/activate
Next, run the commands below to install Scrapy and initialize the project. The project name is “mycrawler”.
pip install Scrapy
scrapy startproject mycrawler
After running these commands, the generated project contains the following files and folders.
scrapy.cfg: Configures the Scrapy project.
items.py: Defines the data structure for the scraped items.
middlewares.py: Middlewares that process requests and responses.
pipelines.py: Defines how the scraped data will be processed and stored.
settings.py: Global settings for your Scrapy project.
spiders/: Directory where you’ll define the spiders (classes to crawl websites).
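As a rough illustration of what goes into settings.py, the fragment below tunes the scheduling, throttling, and retry behaviour mentioned earlier. The setting names are real Scrapy settings, but the values are assumptions chosen for a small, polite crawl, not recommendations taken from the generated project.

```python
# settings.py (fragment): the values below are illustrative assumptions.

BOT_NAME = "mycrawler"

# Respect the robots.txt rules of the target site.
ROBOTSTXT_OBEY = True

# Upper bound on the number of requests Scrapy performs in parallel.
CONCURRENT_REQUESTS = 8

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5

# Let Scrapy adjust the delay based on observed response latency.
AUTOTHROTTLE_ENABLED = True

# Retry failed requests (for example timeouts) up to this many extra times.
RETRY_ENABLED = True
RETRY_TIMES = 2
```

Lower concurrency and a non-zero delay trade crawl speed for a lighter load on the target site.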
Creating a spider
We will create a spider to crawl the sample website https://quotes.toscrape.com.
Each quote on the site has its content, an author, and some tags. We will define a data structure that matches the data we are going to crawl. To do this, we will edit the items.py file.
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyQuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Next, we will create the spider and start crawling from the top page. There is a next button at the bottom of each page, so we will follow it until we have collected all of the quotes.
spiders/my_quote_spider.py
import scrapy

from ..items import MyQuoteItem


class MyQuoteSpider(scrapy.Spider):
    name = "my_quote_spider"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Get the quote data on this page
        for quote in response.css("div.quote"):
            item = MyQuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("span small::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item

        # Crawl the next page if it is available
        next_url = response.css("li.next > a::attr(href)").get()
        if next_url:
            # response.follow resolves the relative link against the current page URL
            yield response.follow(next_url, callback=self.parse)
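Links found on a page can be root-relative, page-relative, or already absolute, so the next-page URL is best resolved against the current page's URL rather than glued onto a fixed prefix. The sketch below uses only the standard library to show the difference; Scrapy's response.urljoin() resolves links against the response URL in the same way. The example paths are assumptions for illustration.

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com"
current_page = "https://quotes.toscrape.com/page/2/"

# Root-relative link: concatenation and urljoin give the same result.
concatenated = base + "/page/3/"
joined = urljoin(current_page, "/page/3/")

# Page-relative link: concatenation produces a malformed URL.
bad = base + "page/3/"
good = urljoin(base + "/", "page/3/")

# Already-absolute link: urljoin returns it unchanged.
absolute = urljoin(current_page, "https://quotes.toscrape.com/page/3/")

print(concatenated)
print(joined)
print(bad)
print(good)
print(absolute)
```

Only the third line printed is broken (the host and path run together); every urljoin result points at the same page.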
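In the code above I used CSS selectors to select data. You can also use XPath. A selector can match several elements, so use getall() to get all matches as a list, or get() to get only the first match (or None if nothing matches). The older names extract() and extract_first() do the same thing and still work.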
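To execute the spider, run the command below in your project folder (mycrawler). Here I export a JSON file; other formats such as CSV are also supported. Note that -o appends to an existing file, while recent Scrapy versions also offer -O to overwrite it.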
scrapy crawl my_quote_spider -o quotes.json
If everything goes right, you will get a quotes.json file containing the scraped quotes.
Now you can use extracted data as you wish!
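As one way of using the exported data, the sketch below loads JSON output and counts quotes per author. The function name and the sample records are assumptions made for illustration; the field names text, author, and tags follow the item defined earlier, and the JSON export is a single array of such objects.

```python
import json
from collections import Counter


def count_quotes_by_author(json_text):
    """Return a Counter mapping author name to number of quotes."""
    items = json.loads(json_text)
    return Counter(item["author"] for item in items)


# Made-up sample in the same shape as the export (not real crawl output).
sample = json.dumps([
    {"text": "first sample quote", "author": "Author A", "tags": ["x"]},
    {"text": "second sample quote", "author": "Author B", "tags": []},
    {"text": "third sample quote", "author": "Author A", "tags": ["y"]},
])

counts = count_quotes_by_author(sample)
print(counts.most_common(1))  # [('Author A', 2)]
```

With a real crawl, the input text would instead be read from quotes.json, for example with open("quotes.json", encoding="utf-8").read().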
Summary
Scrapy is a full-featured framework that simplifies the task of web scraping and crawling. It provides tools to manage requests, parse web pages, follow links, and store scraped data, all in an efficient, scalable way. Whether you’re scraping a small blog or building a large web crawler, Scrapy can handle the complexity.
I used only a small part of Scrapy's functionality in this post. You can check the Scrapy documentation to explore more features!
References:
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
CSS selectors – CSS: Cascading Style Sheets | MDN (mozilla.org)