1.9K+ Stars! Crawlee: an efficient, reliable web crawling and browser automation library that can download HTML, PDF, JPG, PNG, and other file formats


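Project Overview

Crawlee^[1] is a Python library for web scraping and browser automation, designed for building reliable crawlers. It can download HTML, PDF, JPG, PNG, and other files from websites, and it works with BeautifulSoup, Playwright, and raw HTTP requests.

Crawlee supports both headful and headless modes and includes proxy rotation.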

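Project Features

Highlights

Supports both BeautifulSoup and Playwright, so it can handle static pages as well as pages that need a real browser.
Automatic retries, proxy rotation, and session management keep crawlers stable.
Built on standard Asyncio, so asynchronous code stays concise and efficient.
Rich configuration options make it highly customizable for project-specific needs.
Open source and backed by Apify, so crawlers are easy to deploy and run on the Apify platform.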
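
To make "automatic retries and proxy rotation" more concrete, here is a minimal standard-library sketch of the general idea. It is not Crawlee's implementation: `fetch_with_retries`, `BlockedError`, the fake transport, and the proxy URLs are all hypothetical names made up for illustration.

```python
from __future__ import annotations

import asyncio
import itertools
from collections.abc import Awaitable, Callable


class BlockedError(Exception):
    """Raised by the hypothetical fetch function when a request is blocked."""


async def fetch_with_retries(
    url: str,
    fetch: Callable[[str, str], Awaitable[str]],
    proxies: list[str],
    max_retries: int = 3,
) -> str:
    """Call fetch(url, proxy) up to max_retries times, rotating proxies between attempts."""
    proxy_cycle = itertools.cycle(proxies)
    last_error: Exception | None = None
    for attempt in range(max_retries):
        proxy = next(proxy_cycle)
        try:
            return await fetch(url, proxy)
        except BlockedError as error:
            last_error = error
            # Short exponential backoff before retrying through the next proxy.
            await asyncio.sleep(0.01 * 2**attempt)
    raise RuntimeError(f'{url} failed after {max_retries} attempts') from last_error


async def demo() -> tuple[str, list[str]]:
    used: list[str] = []

    async def fake_fetch(url: str, proxy: str) -> str:
        # Simulated transport (assumption): the first proxy is blocked, the second works.
        used.append(proxy)
        if proxy == 'http://proxy-a:8000':
            raise BlockedError(proxy)
        return f'<html>{url}</html>'

    body = await fetch_with_retries(
        'https://crawlee.dev',
        fake_fetch,
        ['http://proxy-a:8000', 'http://proxy-b:8000'],
    )
    return body, used


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

In this sketch the first attempt fails through the blocked proxy and the second succeeds through the next one. A real crawler would also track sessions and cookies, which are left out here.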

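Features

A unified interface for HTTP and headless browser crawling.
Automatic parallel crawling based on available system resources.
Written in Python with type hints, which improves the development experience and reduces bugs.
Automatic retries on errors or when the crawler gets blocked.
Integrated proxy rotation and session management.
Configurable request routing that directs each URL to the appropriate handler.
A persistent queue of URLs to crawl.
Pluggable storage for both tabular data and files.
Robust error handling.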
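
The request routing item is easier to picture with a toy example. The sketch below uses only the standard library. Its `Router` and `Request` classes are hypothetical stand-ins written for this post, not Crawlee's own classes, and the `'DETAIL'` label is an arbitrary example value.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Request:
    url: str
    label: str | None = None


class Router:
    """Toy router: maps a request label to a handler, with a default fallback."""

    def __init__(self) -> None:
        self._handlers: dict[str, Callable[[Request], str]] = {}
        self._default: Callable[[Request], str] | None = None

    def default_handler(self, func: Callable[[Request], str]) -> Callable[[Request], str]:
        # Used as a decorator: registers the fallback handler.
        self._default = func
        return func

    def handler(self, label: str) -> Callable:
        # Used as a decorator factory: registers a handler for one label.
        def decorator(func: Callable[[Request], str]) -> Callable[[Request], str]:
            self._handlers[label] = func
            return func

        return decorator

    def dispatch(self, request: Request) -> str:
        handler = self._handlers.get(request.label) if request.label else None
        handler = handler or self._default
        if handler is None:
            raise LookupError(f'No handler for label {request.label!r}')
        return handler(request)


router = Router()


@router.default_handler
def handle_listing(request: Request) -> str:
    return f'listing:{request.url}'


@router.handler('DETAIL')
def handle_detail(request: Request) -> str:
    return f'detail:{request.url}'


if __name__ == '__main__':
    print(router.dispatch(Request('https://example.com/')))
    print(router.dispatch(Request('https://example.com/p/1', label='DETAIL')))
```

Requests without a label, or with a label that has no registered handler, fall through to the default handler in this sketch.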

Usage

Installation

Crawlee is available on PyPI as the crawlee package. The basic install command is:

pip install crawlee

To use BeautifulSoupCrawler, install crawlee with the beautifulsoup extra:

pip install 'crawlee[beautifulsoup]'

To use PlaywrightCrawler, install crawlee with the playwright extra, then install the Playwright dependencies:

pip install 'crawlee[playwright]'
playwright install
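
After installing, a quick sanity check can confirm which optional pieces are importable. This is a small sketch using only the standard library. The function name `installed_crawler_backends` is hypothetical, and the import names `bs4` and `playwright` are assumptions about which modules the two extras provide.

```python
from importlib.util import find_spec


def installed_crawler_backends() -> dict[str, bool]:
    """Report whether crawlee and its optional backends can be imported."""
    # Keys are labels used in this post; values are assumed top-level import names.
    modules = {
        'crawlee': 'crawlee',
        'beautifulsoup': 'bs4',
        'playwright': 'playwright',
    }
    return {name: find_spec(module) is not None for name, module in modules.items()}


if __name__ == '__main__':
    for name, available in installed_crawler_backends().items():
        print(f'{name}: {"installed" if available else "missing"}')
```

This only checks that the Python packages are present. It does not verify that `playwright install` has finished setting up Playwright's own dependencies.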

Using the Crawlee CLI

To get started quickly with the Crawlee CLI, first make sure Pipx^[2] is installed:

pipx --help

Then run the CLI and choose a template:

pipx run crawlee create my-crawler

If crawlee is already installed, you can run it directly:

crawlee create my-crawler

Examples

Crawlee provides examples for its different crawler types, including BeautifulSoupCrawler and PlaywrightCrawler.

BeautifulSoupCrawler example:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())
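
To see roughly how the request cap and link enqueueing interact in an example like the one above, here is a simplified standard-library model of a crawl loop. It is only a sketch of the general technique (a deduplicated breadth-first queue with a request limit), not how Crawlee is implemented. The `SITE` link graph and the `crawl` function are hypothetical.

```python
from __future__ import annotations

from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE: dict[str, list[str]] = {
    '/': ['/docs', '/blog', '/'],
    '/docs': ['/docs/a', '/docs/b', '/'],
    '/blog': ['/docs'],
    '/docs/a': [],
    '/docs/b': [],
}


def crawl(start_urls: list[str], max_requests_per_crawl: int | None = None) -> list[str]:
    """Visit pages breadth-first, skipping duplicates, until the cap is reached."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited: list[str] = []
    while queue:
        if max_requests_per_crawl is not None and len(visited) >= max_requests_per_crawl:
            break
        url = queue.popleft()
        visited.append(url)  # Stand-in for running the request handler.
        for link in SITE.get(url, []):
            # Enqueue each discovered link once, like a deduplicating request queue.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == '__main__':
    print(crawl(['/'], max_requests_per_crawl=3))
    print(crawl(['/']))
```

With the cap set to 3, the loop stops after three pages even though more links are still queued. Without the cap, it visits every reachable page exactly once.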

PlaywrightCrawler example:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())