How can I get the table names from a Spark SQL query [PySpark]?

Py4JError: An error occurred while calling o78.tableDesc. Trace:
py4j.Py4JException: Method tableDesc([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:835)

1 answer

User

#1 · Posted 2024-06-08 00:26:46

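I have a solution, but it is rather convoluted. It dumps the Java plan object to JSON (a poor man's serialization), deserializes that into Python objects, then filters the plan nodes and extracts the table names.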

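import json

def get_tables(query: str):
    # Parse only, no analysis; assumes an active SparkSession named `spark` (uses its private _jsparkSession handle).
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    # toJSON() returns a flat JSON list with one entry per plan node.
    plan_items = json.loads(plan.toJSON())
    for plan_item in plan_items:
        # Before analysis, table references appear as UnresolvedRelation nodes.
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            yield plan_item['tableIdentifier']['table']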

Iterating over the function with list(get_tables(query)) yields ['fast_track_gv_nexus', 'buybox_gv_nexus'].
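The filtering step can be tried without a Spark session. The following is a minimal sketch, assuming the plan JSON is a flat list of nodes that each carry a 'class' field; SAMPLE_PLAN_JSON and tables_from_plan_json are hypothetical names, and the sample is written by hand rather than copied from real Spark output.

```python
import json

UNRESOLVED_RELATION = "org.apache.spark.sql.catalyst.analysis.UnresolvedRelation"

# Hand-written sample, NOT real Spark output: it contains only the fields
# that the filter below actually reads.
SAMPLE_PLAN_JSON = json.dumps([
    {"class": "org.apache.spark.sql.catalyst.plans.logical.Project"},
    {"class": "org.apache.spark.sql.catalyst.plans.logical.Join"},
    {"class": UNRESOLVED_RELATION, "tableIdentifier": {"table": "fast_track_gv_nexus"}},
    {"class": UNRESOLVED_RELATION, "tableIdentifier": {"table": "buybox_gv_nexus"}},
])


def tables_from_plan_json(plan_json: str):
    """Yield table names from a plan serialized as a flat JSON list of nodes."""
    for node in json.loads(plan_json):
        # Keep only the nodes that represent a not-yet-resolved table reference.
        if node.get("class") == UNRESOLVED_RELATION:
            yield node["tableIdentifier"]["table"]


print(list(tables_from_plan_json(SAMPLE_PLAN_JSON)))
# prints ['fast_track_gv_nexus', 'buybox_gv_nexus']
```

Separating the JSON filter from the Py4J call like this also makes the logic easy to unit test without a JVM.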

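Note: unfortunately, this breaks for CTEs. A reference to a CTE also appears in the parsed plan as an UnresolvedRelation, so the CTE name is returned as if it were a table.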

To work around this, I had to hack it with a regular expression:

import json
import re

def get_tables(query: str):
    # Parse only, no analysis; assumes an active SparkSession named `spark`.
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    plan_string = plan.toString()
    # CTE names are printed as "CTE [a, b]"; split each match so every name is its own entry.
    cte = {name.strip() for group in re.findall(r"CTE \[(.*?)\]", plan_string) for name in group.split(",")}
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            table_identifier = plan_item['tableIdentifier']
            table = table_identifier['table']
            database = table_identifier.get('database', '')
            table_name = "{}.{}".format(database, table) if database else table
            # Skip names that refer to a CTE rather than a real table.
            if table_name not in cte:
                yield table_name
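The regex step can also be checked in isolation. This is a small sketch, assuming the plan string lists CTE names in the form "CTE [a, b]"; cte_names and the sample text are hypothetical and written by hand for illustration. When several CTEs are defined, a single match holds a comma-separated list, so it has to be split before comparing names.

```python
import re


def cte_names(plan_string: str) -> set:
    """Collect CTE names from text such as 'CTE [a, b]' inside a plan string."""
    names = set()
    for group in re.findall(r"CTE \[(.*?)\]", plan_string):
        # One match may list several names separated by commas.
        names.update(name.strip() for name in group.split(",") if name.strip())
    return names


# Hypothetical plan text, written by hand (not real Spark output).
sample = "CTE [recent, totals]\n:  +- 'Project [*]\n+- 'UnresolvedRelation `recent`"
candidates = ["recent", "totals", "db.orders"]

# Only names that are not CTEs should survive the filter.
print([t for t in candidates if t not in cte_names(sample)])
# prints ['db.orders']
```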
