Crawling URLs in order
Question
My problem is relatively simple. I have one spider crawling multiple pages, and I need it to return the data in the order in which the URLs are listed in my code, which is posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results are returned in a random order: for example, it returns the 29th, then the 28th, then the 30th. I tried changing the scheduler order from DFO to BFO, in case that was the problem, but it didn't change anything.
Accepted answer
start_urls defines the URLs that are used by the start_requests method. Your parse method is called with a response for each start URL once its page has been downloaded. But you cannot control loading times: the response for the first start URL might be the last one to reach parse.
One solution: override the start_requests method and attach a meta dict with a priority key to each request it generates. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why or where you need these URLs to be processed in this order.)
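The priority idea above can be sketched roughly as follows. This is a minimal simulation, not a drop-in spider: Request and Response are hypothetical stand-ins for Scrapy's own classes so that the example runs without Scrapy installed, and simulate_crawl and SortByPriorityPipeline are names invented for illustration. It assumes that "do something based on this value" means buffering the items and sorting them once the crawl has finished.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Stand-in for scrapy.Request (assumption: only url and meta are modelled)."""
    url: str
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for a Scrapy response; meta is carried over from the request."""
    url: str
    meta: dict


START_URLS = [
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
]


def start_requests(urls):
    """Tag every request with its position in the original URL list."""
    for index, url in enumerate(urls):
        yield Request(url, meta={"priority": index})


def parse(response):
    """Copy the priority from the response meta onto the scraped item."""
    yield {"priority": response.meta["priority"], "url": response.url}


class SortByPriorityPipeline:
    """Buffers items and restores the request order when the crawl ends."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        self.items.sort(key=lambda item: item["priority"])


def simulate_crawl(urls, arrival_order):
    """Deliver responses in arrival_order and return the items after sorting."""
    requests = list(start_requests(urls))
    pipeline = SortByPriorityPipeline()
    pipeline.open_spider(spider=None)
    for position in arrival_order:
        request = requests[position]
        response = Response(request.url, request.meta)
        for item in parse(response):
            pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.items


# Responses arrive as the 29th, the 28th, then the 30th, as in the question.
ordered = simulate_crawl(START_URLS, arrival_order=[1, 0, 2])
print([item["url"] for item in ordered])
```

In a real project the two functions would be methods of the spider, and the pipeline would write the sorted items out in close_spider rather than just keeping them in memory. Buffering everything only suits small crawls like this one.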
Or make it more or less synchronous: store these start URLs somewhere and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
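The sequential variant could look something like the sketch below. As before, Request is a hypothetical stand-in for scrapy.Request, and run is a toy replacement for the Scrapy engine that passes the URL straight to the callback instead of downloading anything. SequentialSpider and its pending queue are illustrative names, not part of the original answer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

START_URLS = [
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
]


@dataclass
class Request:
    """Stand-in for scrapy.Request (assumption: only url and callback are modelled)."""
    url: str
    callback: Callable


class SequentialSpider:
    def __init__(self, urls):
        # The "storage" from the text: URLs that have not been requested yet.
        self.pending = deque(urls)
        # Only the first URL is a start URL; the rest wait in self.pending.
        self.start_urls = [self.pending.popleft()] if self.pending else []

    def parse(self, response_url):
        # Process the current response and yield its item(s) first.
        yield {"url": response_url}
        # Only then schedule the next URL, so that pages are handled one by one.
        if self.pending:
            yield Request(self.pending.popleft(), callback=self.parse)


def run(spider):
    """Toy engine: handles one request at a time and collects the items."""
    items = []
    queue = deque(Request(url, callback=spider.parse) for url in spider.start_urls)
    while queue:
        request = queue.popleft()
        # A real engine would download request.url here.
        for result in request.callback(request.url):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run(SequentialSpider(START_URLS))
print([item["url"] for item in items])
```

The trade-off is that only one request is ever in flight, so the crawl gives up concurrency. In a real spider, a failed download would also break the chain unless the next request is issued from an error handler as well.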