Scrapy: A Python Web-Crawling Framework

阿龙的代码在报错

2024-04-11

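Note: The code in this article was written in PyCharm.

Preface

I have wanted to learn this crawler framework for a long time. I recently came across a good article on it, so these are my notes.

Installing Scrapy

1. Installing with Anaconda
If your Python was installed through Anaconda, you can use this method (it is the one I used).
Enter the following command in cmd: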

conda install Scrapy

2. Installing on Windows with pip
Installing on Windows is more involved. The following dependencies need to be installed first:

lxml
pyOpenSSL
Twisted
PyWin32

Once these libraries are installed, Scrapy itself can be installed with:

pip install Scrapy
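
To confirm the installation worked, one option is to check that each package can be found by Python. The following is a minimal sketch using only the standard library. The helper name missing_packages and the REQUIRED list are illustrative, not part of Scrapy. Note that some import names differ from the pip package names (pyOpenSSL imports as OpenSSL, and PyWin32 provides win32api).

```python
from importlib.util import find_spec

# Import names for the libraries listed above (not the pip package names).
# On Windows, "win32api" can be added to check PyWin32 as well.
REQUIRED = ["lxml", "OpenSSL", "twisted", "scrapy"]


def missing_packages(names):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", missing if missing else "none")
```

If the list is empty, the scrapy command should be available in the same environment.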

Creating a Scrapy project

Open cmd and change to the directory where the project should be created.
Then enter the following command:

scrapy startproject project_name

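If creating the project fails with the error "ImportError: DLL load failed: 找不到指定的模块。" (the specified module could not be found), see the article on fixing "ImportError: DLL load failed" when creating a Scrapy project.

In cmd, change into the newly created project directory (the project in this article is named firstpro):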

cd firstpro

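Generate the spider with the following command (the spider name comes first, then the domain it is allowed to crawl):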

scrapy genspider scenery pic.netbian.com

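If no error is reported, the spider has been created.

Scraping wallpaper image links

Changing the settings

Open settings.py and make the following changes (the line numbers refer to the generated file and may differ between Scrapy versions):

Line 20: change the robots.txt setting (ROBOTSTXT_OBEY).
Line 28: change the download delay (DOWNLOAD_DELAY). It is commented out by default; uncommenting it gives 3 seconds, which is too long, so set it to 1 second.
Line 40: add a request header (DEFAULT_REQUEST_HEADERS).
Line 66: enable an item pipeline (ITEM_PIPELINES).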
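
The four changes above could look roughly like the fragment below. This is a sketch, not the author's exact file. The value False for ROBOTSTXT_OBEY and the User-Agent string are assumptions, since the article does not state them. The pipeline path assumes the project is named firstpro, as in the rest of this article.

```python
# settings.py (fragment): a sketch of the four edits described above

# Assumption: the article only says this setting is modified. False is a
# common choice in tutorials; check the target site's terms before using it.
ROBOTSTXT_OBEY = False

# From the article: wait 1 second between requests instead of 3.
DOWNLOAD_DELAY = 1

# Assumption: the article does not say which header is added. A browser-like
# User-Agent is shown here as a placeholder value.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Enable the pipeline class generated for the firstpro project.
# The number sets the order when several pipelines are enabled (lower runs first).
ITEM_PIPELINES = {
    "firstpro.pipelines.FirstproPipeline": 300,
}
```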

Writing items.py

Open items.py and enter the following code:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstproItem(scrapy.Item):
    # Fields collected for each wallpaper
    name = scrapy.Field()  # image name (taken from the alt attribute)
    link = scrapy.Field()  # image link (taken from the src attribute)

Writing the spider file


import scrapy
from ..items import FirstproItem


class ScenerySpider(scrapy.Spider):
    name = 'scenery'
    allowed_domains = ['pic.netbian.com']
    start_urls = ['https://pic.netbian.com/4kfengjing/']  # first listing page
    page = 1

    def parse(self, response):
        for li in response.css('.clearfix li'):
            item = FirstproItem()  # new item per image so yielded items stay independent
            item['name'] = li.css('a img::attr(alt)').extract_first()  # image name
            item['link'] = li.css('a img::attr(src)').extract_first()  # image link

            yield item

        if self.page < 10:  # crawl 10 pages in total
            self.page += 1
            url = f'https://pic.netbian.com/4kfengjing/index_{self.page}.html'  # build the next page URL

            yield scrapy.Request(url=url, callback=self.parse)  # parse the next page with this same method
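
The pagination pattern the spider relies on can be checked without running Scrapy. The sketch below rebuilds the same URL sequence with plain Python. The function name page_urls is illustrative. It assumes, as the spider does, that page 1 is the bare category URL and that later pages follow the index_N.html pattern.

```python
BASE_URL = "https://pic.netbian.com/4kfengjing/"


def page_urls(last_page):
    """Return the listing-page URLs for pages 1 through last_page."""
    urls = [BASE_URL]  # page 1 has no index_N suffix
    for page in range(2, last_page + 1):
        urls.append(f"{BASE_URL}index_{page}.html")
    return urls


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
```

Printing the URLs first is a cheap way to catch an off-by-one error in the page counter before sending any requests.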

Writing pipelines.py

Open pipelines.py and enter the following code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class FirstproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
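
The pipeline above only prints each item. If the results should be kept, one possible variation is to write each item to a file instead. The sketch below is not from the original article: the class name JsonLinesPipeline and the file name scenery.jsonl are hypothetical. It relies on the open_spider and close_spider hooks that Scrapy calls on pipeline objects when a spider starts and finishes.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each item as one JSON line."""

    def __init__(self, path="scenery.jsonl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict() accepts both scrapy.Item objects and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # hand the item on to any later pipeline
```

To try it, the class would need its own entry in ITEM_PIPELINES in settings.py, for example "firstpro.pipelines.JsonLinesPipeline" with a priority number of its own.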

Running the spider

Enter the following command in cmd:

scrapy crawl scenery

Alternatively, create a run.py file in PyCharm with the following code:

from scrapy import cmdline

cmdline.execute('scrapy crawl scenery'.split())  # remember to change the spider name to your own

Notes:

Code source: the original author's blog

This article was reposted from: 学新通技术网
