Item Exporters - Scrapy Framework 8 - Python

gaog2zh

2024-04-25

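1 Preface

We crawl data in order to use it in other applications or systems. To make that convenient, we usually either persist the scraped data or export it. For persistent storage, see the earlier pipeline chapter and the Python-and-databases section; this article focuses on exporting data.

For this purpose, Scrapy provides a set of item exporters for different output formats, such as XML, CSV, and JSON, implemented as classes named XxxItemExporter.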

2 item exporters

As usual, an XxxItemExporter has to be instantiated before it can be used, so let's first look at which classes are available.

2.1 Item Exporters

Class	Parameters	Description
BaseItemExporter	(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False)	Base class
PythonItemExporter	(*, dont_fail=False, **kwargs)	Python built-in types
XmlItemExporter	(file, item_element='item', root_element='items', **kwargs)	XML format
CsvItemExporter	(file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs)	CSV format
PickleItemExporter	(file, protocol=0, **kwargs)	pickle format
PprintItemExporter	(file, **kwargs)	Pretty-print format
JsonItemExporter	(file, **kwargs)	JSON format
JsonLinesItemExporter	(file, **kwargs)	JSON Lines format
MarshalItemExporter	(file, **kwargs)	marshal format

JsonItemExporter versus JsonLinesItemExporter
- Typical JsonItemExporter output
```
[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]
```
- Typical JsonLinesItemExporter output
```
{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
```
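- JsonItemExporter produces a single well-formed JSON array, which suits small amounts of data; JsonLinesItemExporter writes one JSON object per line and suits large outputs, because the result can be processed one record at a time.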
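
To make the difference concrete from the consumer's side, here is a minimal sketch using only the standard json module. The sample data mirrors the two outputs above, and the helper names are made up for illustration. A JSON array has to be parsed as one document, whereas a JSON Lines stream can be read record by record.

```python
import io
import json

# Sample data mirroring the two typical outputs shown above.
json_array_text = '[{"name": "Color TV", "price": "1200"},\n{"name": "DVD player", "price": "200"}]'
json_lines_text = '{"name": "Color TV", "price": "1200"}\n{"name": "DVD player", "price": "200"}\n'


def read_json_array(stream):
    """Parse a JSON array: the whole document is loaded before any item is available."""
    return json.load(stream)


def iter_json_lines(stream):
    """Yield one item per line, so only a single record is held in memory at a time."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


array_items = read_json_array(io.StringIO(json_array_text))
line_items = list(iter_json_lines(io.StringIO(json_lines_text)))
print(array_items == line_items)
```

Both readers return the same items; only the memory profile differs once the file gets large.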

2.2 BaseItemExporter

BaseItemExporter is the base class from which all other exporters inherit. Below we use it to introduce the attributes and methods they share.

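Methods

Method	Parameters	Description
export_item()	item	Exports the given item
serialize_field()	field, name, value	Serializes a field value
start_exporting()		Signals the start of the export; preparation work
finish_exporting()		Signals the end of the export; cleanup work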

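Attributes

Attribute	Default	Description
fields_to_export	None	Fields to export; all fields are exported by default
export_empty_fields	False	Whether to include fields that are declared but empty or unset
encoding		Output encoding
indent	0	Indentation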

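2.3 Instantiation

2.3.1 Requirements

After instantiating an exporter, call the following 3 methods in order:

start_exporting(), once before the first item
export_item(), once for each item
finish_exporting(), once after the last item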

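2.3.2 Field serialization

By default, field values are handed to the underlying serialization library as they are. You can also customize how a field is serialized, in one of 2 ways:

Declare a serializer on the field, for example: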

import scrapy

def serialize_price(value):
    return f'$ {str(value)}'

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field(serializer=serialize_price)

Override the serialize_field() method, for example:

from scrapy.exporters import XmlItemExporter

class ProductXmlExporter(XmlItemExporter):

    def serialize_field(self, field, name, value):
        if name == 'price':
            return f'$ {str(value)}'
        return super().serialize_field(field, name, value)

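2.4 Project example

Taking our earlier crawl of CSDN personal blog posts as an example, we now want to write the scraped data to a file in JSON format. Sample code for pipelines.py: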

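from scrapy.exporters import JsonItemExporter


class JSONPipeline:
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode.
        self.fp = open("../../output/csdn.json", "wb")
        # encoding='utf-8' keeps non-ASCII text readable instead of \uXXXX escapes.
        self.exporter = JsonItemExporter(self.fp, encoding='utf-8')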

    def open_spider(self, spider):
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()

Output:

[{"title": "process-进程详解-python", "publish": "2022-01-12 18:10:09", "approval": 0, "comment": 0, "collection": 0},...
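
For the pipeline above to run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the project package is called myproject (a placeholder; substitute your own package name):

```python
# settings.py
ITEM_PIPELINES = {
    # "myproject" is a placeholder package name.
    # The integer sets the run order: pipelines with lower values run first.
    "myproject.pipelines.JSONPipeline": 300,
}
```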

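2.5 Custom ItemExporter

Many applications, office software in particular, need to work with Excel, but Scrapy does not ship a corresponding exporter. So we implement our own ExcelItemExporter, modeled on BaseItemExporter.

For the detailed process, see: https://www.jianshu.com/p/a50b19b6258d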

Example, again based on the CSDN personal blog crawl. Pipeline code:

class ExcelPipeline:
    def __init__(self):
        self.fp = open("../../output/csdn.xls", "wb")
        # ExcelItemExporter is the custom exporter described above, not part of Scrapy.
        self.exporter = ExcelItemExporter(self.fp, encoding='utf-8')

    def open_spider(self, spider):
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()


3 Postscript

References:

Scrapy导出Excel By Exporter (Exporting to Excel from Scrapy with an Exporter)

Code repository: https://gitee.com/gaogzhen/python-study.git

QQ group: 433529853
