
Google DataFlow cannot read and write in different locations (Python SDK v0.5.5)


Problem description

I'm writing a very basic DataFlow pipeline using the Python SDK v0.5.5. The pipeline uses a BigQuerySource with a query passed in, which queries BigQuery tables from datasets that reside in the EU.
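
For context, a minimal sketch of such a pipeline (hedged: written against current Apache Beam import paths with placeholder names; the 0.5.x-era google-cloud-dataflow package used slightly different module paths):

# Minimal sketch, not the asker's actual code. Assumes current Apache Beam
# import paths; project, dataset and query below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(project='my-project')  # placeholder project id

with beam.Pipeline(options=options) as p:
    _ = (p
         # BigQuerySource with a query against tables in an EU-located dataset
         | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
             query='SELECT word FROM [my-project:eu_dataset.my_table]'))
         | 'Print' >> beam.Map(print))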

When executing the pipeline I get the following error (project name anonymized):

HttpError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/XXXXX/queries/93bbbecbc470470cb1bbb9c22bd83e9d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Thu, 09 Feb 2017 10:28:04 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Thu, 09 Feb 2017 10:28:04 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "invalid",
    "message": "Cannot read and write in different locations: source: EU, destination: US"
   }
  ],
  "code": 400,
  "message": "Cannot read and write in different locations: source: EU, destination: US"
 }
}

The error also occurs when specifying a project, dataset and table name. However, there's no error when selecting data from the available public datasets (which reside in the US, like shakespeare). I also have jobs running v0.4.4 of the SDK which don't have this error.

The difference between these versions is the creation of a temp dataset, as shown by the warning at pipeline startup:

WARNING:root:Dataset does not exist so we will create it

I've briefly looked at the different versions of the SDK, and the difference seems to be around this temp dataset. It looks like the current version creates a temp dataset by default with a location in the US (taken from master):
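
(The code snippet referenced here did not survive extraction.) To illustrate the underlying constraint, a hedged sketch using the standalone google-cloud-bigquery client rather than the SDK's internal code: a dataset's location is fixed at creation time, and leaving it unset has historically defaulted to the US multi-region, so a temp dataset created without an explicit location cannot receive query results read from EU tables.

# Illustration only (google-cloud-bigquery client, not the SDK internals
# referenced above). All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')          # placeholder
dataset = bigquery.Dataset('my-project.temp_dataset')   # placeholder
dataset.location = 'EU'  # must match the location of the queried tables;
                         # omitting it defaults to the US multi-region
client.create_dataset(dataset)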

I haven't found a way to disable the creation of these temp datasets. Am I overlooking something, or does this indeed no longer work when selecting data from EU datasets?

Accepted answer

#1

Thanks for reporting this issue. I assume you are using DirectRunner. We changed the implementation of the BigQuery read transform for DirectRunner to create a temporary dataset (for SDK versions 0.5.1 and later) in order to support large datasets. It seems we are not setting the region correctly there. We'll look into fixing this.

This issue should not occur if you use DataflowRunner, which creates temporary datasets in the correct region.
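
As a hedged usage sketch (current Apache Beam option names; runner naming varied across the 0.5.x-era SDK releases), switching to DataflowRunner is a matter of pipeline options:

# Hedged sketch: selecting DataflowRunner via pipeline options.
# Project, bucket and region are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder staging bucket
    region='europe-west1',               # run the job near the EU data
)
p = beam.Pipeline(options=options)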
