Whitespace causes problems in BigQuery with Python

from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)

# Refuse to run any query that would bill more than 10 GB.
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)

# working_query is the first query string shown below.
answers_query_job = client.query(working_query, job_config=safe_config)
answers_query_job.to_dataframe()

working_query = """
    SELECT a.id, a.body, a.owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
    INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
        ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
    """

bad_query = """
    SELECT a.id, a.body, a.owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
    INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
        ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
    """
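When two query strings look identical on screen, a quick way to find out whether they differ only in whitespace is to compare them with the standard library. The sketch below is an illustration, not part of the original question: the helper names `whitespace_diff` and `same_ignoring_whitespace` and the two short sample strings are made up for the example.

```python
import difflib


def whitespace_diff(first, second):
    """Return the differing lines, wrapped in repr() so trailing spaces show up."""
    diff = difflib.ndiff(first.splitlines(), second.splitlines())
    # Keep only removed ("- ") and added ("+ ") lines; drop unchanged and hint lines.
    return [repr(line) for line in diff if line.startswith(("- ", "+ "))]


def same_ignoring_whitespace(first, second):
    """True when both strings contain the same tokens, whatever the spacing."""
    return first.split() == second.split()


# Hypothetical sample strings: the second has one trailing space after "q".
query_a = "SELECT a.id\nFROM answers AS a\nJOIN questions AS q\nON q.id = a.parent_id\n"
query_b = "SELECT a.id\nFROM answers AS a\nJOIN questions AS q \nON q.id = a.parent_id\n"

differences = whitespace_diff(query_a, query_b)
for line in differences:
    print(line)
print("Same apart from whitespace:", same_ignoring_whitespace(query_a, query_b))
```

If the two strings are the same apart from whitespace but not equal as strings, the SQL itself is unlikely to be the cause, and it is worth looking at what else depends on the exact query text.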

2 answers

User

Answer 1 · edited 2024-05-19 03:05:57

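I ran some tests with both of your queries, and they execute in the same way.

First, I should point out that the query() method takes a string and uses the job configuration to configure the job. In addition, the documentation does not mention any issue related to extra whitespace in the query string.

Furthermore, if you go to the BigQuery UI and paste and run the queries one at a time, you will see under Job information that both queries process about 23 GB of data, and the same amount is reported as bytes billed. So if you set bigquery.QueryJobConfig(maximum_bytes_billed=23000000000) and omit the to_dataframe() method, both queries above run fine.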

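Update:

According to the documentation, use_query_cache defaults to true, which means that if you run the same query again, the results are retrieved from the earlier run and no bytes are processed. So if you previously ran a query without maximum_bytes_billed set and then run the same query with a maximum set, the query still runs, even if it would process more than the limit you have now set.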
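The interaction between the cache and the billing limit can be sketched with a small toy model. This is not the BigQuery client and not how BigQuery is implemented: `ToyQueryService` is a hypothetical stand-in, the 23 GB figure is the approximate size mentioned above, and the model assumes that cached results are matched on the exact query text (my reading of the caching documentation, not something confirmed in this thread). Under that assumption, one extra space is enough to turn a cache hit into a full, billable run.

```python
class ToyQueryService:
    """Toy stand-in for the behavior described above; NOT the BigQuery client."""

    def __init__(self, bytes_required):
        self.bytes_required = bytes_required  # bytes a full run would bill
        self._cache = set()                   # exact query texts already run

    def run(self, sql, maximum_bytes_billed=None, use_query_cache=True):
        # A cached result bills nothing, so the limit is never reached.
        if use_query_cache and sql in self._cache:
            return {"cache_hit": True, "bytes_billed": 0}
        # Without a cache hit the full scan is checked against the limit.
        if maximum_bytes_billed is not None and self.bytes_required > maximum_bytes_billed:
            raise ValueError(
                f"Query exceeded limit for bytes billed: {maximum_bytes_billed}. "
                f"{self.bytes_required} or higher required."
            )
        self._cache.add(sql)
        return {"cache_hit": False, "bytes_billed": self.bytes_required}


service = ToyQueryService(bytes_required=23 * 10**9)
good = "SELECT 1 FROM t AS q\nWHERE x"
bad = "SELECT 1 FROM t AS q \nWHERE x"  # one trailing space after "q"

service.run(good)                                        # earlier run, no limit set
first = service.run(good, maximum_bytes_billed=10**10)   # served from the toy cache
try:
    service.run(bad, maximum_bytes_billed=10**10)        # different text, no cache hit
    bad_failed = False
except ValueError as error:
    bad_failed = True
    print(error)
print(first)
```

In this model the "good" string succeeds only because of the earlier unrestricted run; with use_query_cache=False it is rejected in the same way as the "bad" one.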

For your case, I ran the following code both in a Python 3 notebook on AI Platform and as a .py file from the shell.

First code

from google.cloud import bigquery
import pandas

client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)

job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
job_config.use_query_cache = False

working_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """
answers_query_job = client.query(working_query, job_config=job_config)
answers_query_job.to_dataframe()

Second code

from google.cloud import bigquery
import pandas

client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)

job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
job_config.use_query_cache = False


bad_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q 
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """

answers_query_job = client.query(bad_query, job_config=job_config)
answers_query_job.to_dataframe()

Neither of the code samples above worked. Both produced the following error:

Query exceeded limit for bytes billed: 10000000000. 24460132352 or higher required.

On the other hand, with job_config = bigquery.QueryJobConfig(maximum_bytes_billed=25000000000), both queries run fine.
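To make the numbers in the error message easier to compare with a configured limit, the message can be parsed with a regular expression. This is only a sketch: the helper name `parse_bytes_billed_error` is hypothetical, and the pattern assumes the message keeps the exact wording shown above, which is not guaranteed.

```python
import re

ERROR_MESSAGE = (
    "Query exceeded limit for bytes billed: 10000000000. "
    "24460132352 or higher required."
)


def parse_bytes_billed_error(message):
    """Return (configured_limit, required_bytes), or None if the text does not match."""
    match = re.search(r"bytes billed: (\d+)\. (\d+) or higher required", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


limit, required = parse_bytes_billed_error(ERROR_MESSAGE)
print("configured limit:", limit)
print("required bytes:  ", required)
print("required in GB (10**9 bytes): ", round(required / 10**9, 2))
print("required in GiB (2**30 bytes):", round(required / 2**30, 2))

# Check a few candidate limits against the required amount.
for candidate in (10**10, 23_000_000_000, 25_000_000_000):
    print(candidate, "is enough" if candidate >= required else "is too low")
```

The required amount is about 22.8 GiB, which matches the roughly 23 GB figure from the UI, but it is about 24.46 GB when counted in units of 10**9 bytes. Since maximum_bytes_billed is given in bytes, the limit has to be at least the number reported in the message, which is why 25000000000 is enough here.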

User

Answer 2 · edited 2024-05-19 03:05:57

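You probably have cost controls enabled (see the documentation).

This error means that the number of bytes your query would scan exceeds the limit set in "maximum bytes billed".

Can you reproduce this error reliably? Whitespace in the query does not look related to cost controls in BigQuery. Maybe it is just a coincidence: either the data got bigger or cost controls were introduced in the meantime.

Edit: Alexandre's answer is correct. The "good query" succeeded because its results came from the cache. I just retried with the following configuration (note: the parameter is use_query_cache, not useQueryCache as written in the comment thread above):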

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10, use_query_cache=False)

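and got the same error for the good query. You can also check cache_hit on the resulting job to see whether the response was served from the cache. In this case it was True whenever the query succeeded: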

print("Cache hit: ")
print(answers_query_job.cache_hit)
