Deep-crawling Naruto character details with scrapy.Request and persisting them to MySQL

阿里多多酱a

2024-04-25

1. Create the project

scrapy startproject NaRuTo
cd NaRuTo

2. Create the spider file

scrapy genspider naruto http://www.4399dmw.com/huoying/renwu/

3. Project structure

[Screenshot: project directory tree]

4. Modify the configuration (settings)

ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = 'ERROR'  # only log messages at ERROR level and above
ITEM_PIPELINES = {
    # 'NaRuTo.pipelines.NarutoPipeline': 300,
    'NaRuTo.pipelines.MysqlPileLine': 300,
}  # enabled item pipelines

5. The spider file (naruto under spiders)

import scrapy
from NaRuTo.items import NarutoItem


class NarutoSpider(scrapy.Spider):
    name = 'naruto'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4399dmw.com/huoying/renwu/']

    def parse(self, response):
        # Extract the detail-page URLs from the list page
        href = response.xpath('//*[@id="iga"]/li/a/@href').extract()
        # The list contains duplicate URLs, so deduplicate with set()
        new_href = list(set(href))
        for url in new_href:
            # Build the absolute URL
            in_url = 'http://www.4399dmw.com' + url
            try:
                # Hand the request back to the Scrapy engine, which crawls it
                # and delivers the response to the callback.
                # Only errors raised while building the Request are caught here.
                yield scrapy.Request(url=in_url,
                                     callback=self.parse_content)
            except Exception as e:
                print('Request failed:', e)

    # Parse the detail page
    def parse_content(self, response):
        # div_list = response.xpath('//*[@id="j-lazyimg"]/div[2]/div[1]/div[2]/div/div/div[2]')
        # for div in div_list:
        # Name
        name = response.xpath('//*[@id="j-lazyimg"]/div[2]/div[1]/div[2]/div/div/div[2]/div[1]/h1/text()').extract_first()
        # Detail
        detail = response.xpath('//*[@id="j-lazyimg"]/div[2]/div[1]/div[2]/div/div/div[2]/div[1]/p[1]/text()').extract_first()
        # Introduction
        introduce = response.xpath('//*[@id="j-lazyimg"]/div[2]/div[1]/div[2]/div/div/div[2]/div[2]/p//text()').extract()
        # Strip '\u3000' (ideographic, full-width space) and '\xa0' (non-breaking space)
        new_introduce = ''.join(introduce).replace('\u3000', '').replace('\xa0', '')
        # Collect the scraped values in a dict
        all_data = {
            "name": name,
            "detail": detail,
            "introduce": new_introduce
        }
        # Instantiate NarutoItem()
        item = NarutoItem()
        item['name'] = all_data['name']
        item['detail'] = all_data['detail']
        item['introduce'] = all_data['introduce']
        # Pass the item on to the pipelines
        yield item
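
The URL handling and text cleanup in the spider can be tried outside Scrapy. The following is a minimal sketch using only the standard library, not the spider's own code: the helper names `absolute_unique_urls` and `clean_text` and the sample hrefs are made up for illustration. It resolves relative links with `urljoin` and removes duplicates while keeping first-seen order, which `set()` does not guarantee.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.4399dmw.com/huoying/renwu/'


def absolute_unique_urls(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(absolute))


def clean_text(fragments):
    """Join text fragments and strip full-width and non-breaking spaces."""
    return ''.join(fragments).replace('\u3000', '').replace('\xa0', '')


if __name__ == '__main__':
    # Hypothetical hrefs, for illustration only
    hrefs = ['/huoying/renwu/1.html', '/huoying/renwu/2.html', '/huoying/renwu/1.html']
    print(absolute_unique_urls(BASE_URL, hrefs))
    print(clean_text(['\u3000\u3000Naruto', '\xa0Uzumaki']))
```

Inside a Scrapy callback, `response.urljoin(url)` does the same resolution against the URL of the current response.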

6. items.py

import scrapy


class NarutoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # ninja name
    detail = scrapy.Field()  # published details
    introduce = scrapy.Field()  # ninja introduction

7. Pipeline (pipelines)

import pymysql


class MysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # Connect to MySQL
        self.conn = pymysql.Connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='***********',
            db='naruto',
            charset='utf8'
        )
        # Create one cursor and reuse it for every item
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes quotes inside the values
        insert_sql = 'insert into all_naruto_data values (%s, %s, %s)'
        try:
            # Execute and commit the insert
            self.cursor.execute(insert_sql, (item['name'], item['detail'], item['introduce']))
            self.conn.commit()
        except Exception as e:
            print('Insert failed:', e)
            self.conn.rollback()
        return item

    # Close the cursor and the connection
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
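
Scraped text often contains quotation marks, and formatting such values straight into the SQL string breaks the statement. Passing them as query parameters avoids that. Below is a small sketch of the idea that runs without a MySQL server by using the standard-library `sqlite3` module as a stand-in. The three-column schema is an assumption that mirrors the item fields (the table definition is not shown in this article), the sample values are invented, and `sqlite3` uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


def save_character(conn, name, detail, introduce):
    """Insert one row, letting the driver quote the values."""
    # sqlite3 uses "?" placeholders; with pymysql the placeholder is "%s"
    conn.execute(
        'INSERT INTO all_naruto_data VALUES (?, ?, ?)',
        (name, detail, introduce),
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(':memory:')
    # Assumed schema: three text columns mirroring the item fields
    conn.execute('CREATE TABLE all_naruto_data (name TEXT, detail TEXT, introduce TEXT)')
    # Made-up sample values; note the embedded double quotes
    save_character(conn, 'Naruto', 'sample detail', 'He said "believe it" often.')
    rows = conn.execute('SELECT name, introduce FROM all_naruto_data').fetchall()
    conn.close()
    return rows


if __name__ == '__main__':
    print(demo())
```

The row is stored with its quotes intact, whereas the same value interpolated into a `"%s"`-quoted SQL string would end the string literal early.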

8. Ninja data (partial)

[Screenshot: part of the stored character data]
