
Google DataFlow cannot read and write in different locations (Python SDK v0.5.5)


Problem description

I'm writing a very basic DataFlow pipeline using the Python SDK v0.5.5. The pipeline uses a BigQuerySource with a query passed in, which queries BigQuery tables from datasets that reside in the EU.
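
For context, a minimal sketch of such a pipeline (hedged: written against current Apache Beam import paths with placeholder names; the 0.5.x-era google-cloud-dataflow package used slightly different module paths):

# Minimal sketch, not the asker's actual code. Assumes current Apache Beam
# import paths; project, dataset and query below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(project='my-project')  # placeholder project id

with beam.Pipeline(options=options) as p:
    _ = (p
         # BigQuerySource with a query against tables in an EU-located dataset
         | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
             query='SELECT word FROM [my-project:eu_dataset.my_table]'))
         | 'Print' >> beam.Map(print))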

When executing the pipeline I get the following error (project name anonymized):

HttpError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/XXXXX/queries/93bbbecbc470470cb1bbb9c22bd83e9d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Thu, 09 Feb 2017 10:28:04 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Thu, 09 Feb 2017 10:28:04 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "invalid",
    "message": "Cannot read and write in different locations: source: EU, destination: US"
   }
  ],
  "code": 400,
  "message": "Cannot read and write in different locations: source: EU, destination: US"
 }
}

The error also occurs when specifying a project, dataset and table name. However, there's no error when selecting data from the available public datasets (which reside in the US, like shakespeare). I also have jobs running v0.4.4 of the SDK which don't have this error.

The difference between these versions is the creation of a temp dataset, as shown by the warning at pipeline startup:

WARNING:root:Dataset does not exist so we will create it

I've briefly looked at the different versions of the SDK, and the difference seems to be around this temp dataset. It looks like the current version creates a temp dataset by default with a location in the US (taken from master):
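
(The code snippet referenced here did not survive extraction.) To illustrate the underlying constraint, a hedged sketch using the standalone google-cloud-bigquery client rather than the SDK's internal code: a dataset's location is fixed at creation time, and leaving it unset has historically defaulted to the US multi-region, so a temp dataset created without an explicit location cannot receive query results read from EU tables.

# Illustration only (google-cloud-bigquery client, not the SDK internals
# referenced above). All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')          # placeholder
dataset = bigquery.Dataset('my-project.temp_dataset')   # placeholder
dataset.location = 'EU'  # must match the location of the queried tables;
                         # omitting it defaults to the US multi-region
client.create_dataset(dataset)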

I haven't found a way to disable the creation of these temp datasets. Am I overlooking something, or does this indeed no longer work when selecting data from EU datasets?

Accepted answer

#1

Thanks for reporting this issue. I assume you are using DirectRunner. We changed the implementation of the BigQuery read transform for DirectRunner to create a temporary dataset (for SDK versions 0.5.1 and later) in order to support large datasets. It seems we are not setting the region correctly there. We'll look into fixing this.

This issue should not occur if you use DataflowRunner, which creates temporary datasets in the correct region.
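
As a hedged usage sketch (current Apache Beam option names; runner naming varied across the 0.5.x-era SDK releases), switching to DataflowRunner is a matter of pipeline options:

# Hedged sketch: selecting DataflowRunner via pipeline options.
# Project, bucket and region are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder staging bucket
    region='europe-west1',               # run the job near the EU data
)
p = beam.Pipeline(options=options)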
