PyDeequ with PySpark integration: error "'JavaPackage' object is not callable"
I am trying to use PyDeequ with PySpark in my Streamlit app to run comprehensive data quality checks on an uploaded CSV file. I want to use PyDeequ for a range of tests, including completeness, correctness, uniqueness, outlier detection, and date-format correctness. However, I am running into an error that says the 'JavaPackage' object is not callable. Below are the relevant code snippet, the specific tests I am trying to run, and the error message:
import streamlit as st
from pyspark.sql import SparkSession
from pydeequ.analyzers import AnalysisRunner, Completeness

def create_spark_session():
    return SparkSession.builder.appName("DataQualityCheck").getOrCreate()

def read_csv_data(spark, uploaded_file):
    df = spark.read.csv(uploaded_file, header=True, inferSchema=True)
    return df

def main():
    st.title("Data Quality Checker")
    uploaded_file = st.file_uploader("Choose a CSV file:", key="csv_uploader", type="csv")
    if uploaded_file is not None:
        spark = create_spark_session()
        df = read_csv_data(spark, uploaded_file)

        # Run a Completeness analyzer on the "MRN" column
        analysis_runner = AnalysisRunner(spark)
        analysis_result = analysis_runner.onData(df).addAnalyzer(Completeness("MRN")).run()

        # Try to pull the completeness metrics for "MRN" out of the result
        completeness_results = analysis_result['Completeness']
        completeness_mrn = completeness_results['MRN']
        completeness_percent_mrn = completeness_mrn['completeness']
        missing_count_mrn = completeness_mrn['count']

if __name__ == "__main__":
    main()
TypeError: 'JavaPackage' object is not callable
Traceback:
File "E:\Deequ\pydeequ_env\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 542, in _run_script
exec(code, module.__dict__)
File "E:\data_quality.py", line 43, in <module>
completeness_mrn = completeness_results['MRN']
File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 52, in onData
return AnalysisRunBuilder(self._spark_session, df)
File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 124, in __init__
self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBu
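I suspect the problem may be in how I build the Spark session, since my create_spark_session above never references the Deequ jar at all, while the PyDeequ README builds the session with the Deequ Maven coordinates and sets SPARK_VERSION before importing pydeequ. This is only a sketch of what I understand that to look like (the "3.3" value is my own guess for PySpark 3.3.1), not what my app currently does:

import os
os.environ["SPARK_VERSION"] = "3.3"  # my assumption for PySpark 3.3.1; pydeequ reads this at import time

import pydeequ
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("DataQualityCheck")
    # put the Deequ jar on the JVM classpath so pydeequ's JVM calls resolve
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)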
Data quality tests I want to run (a rough sketch of how I might express them follows this list):
- Completeness: ensure that data in certain columns (e.g. "MRN" and "Admission Date") is complete.
- Correctness: verify that data in specific columns conforms to certain formats or correctness rules (e.g. that "MRN" is correctly formatted).
- Uniqueness: check that certain columns contain only unique values (e.g. uniqueness of "MRN").
- Outlier detection: identify any outliers in numeric columns (e.g. "Billing Amount").
- Dates not in the future: ensure that dates in a given column (e.g. "Admission Date") do not lie in the future.
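For context, this is roughly how I was hoping to express the checks above with PyDeequ's VerificationSuite once the basic example runs. It is only a sketch: the column names "Admission Date" and "Billing Amount", the MRN regex, and using a non-negativity constraint as a crude stand-in for real outlier detection are all my own placeholders:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

def run_quality_checks(spark, df):
    check = (
        Check(spark, CheckLevel.Error, "CSV data quality checks")
        .isComplete("MRN")                                   # completeness
        .isComplete("Admission Date")
        .isUnique("MRN")                                     # uniqueness
        .satisfies("MRN rlike '^[0-9]{8}$'",                 # correctness: placeholder MRN format
                   "MRN format", lambda x: x == 1.0)
        .isNonNegative("Billing Amount")                     # crude sanity bound, not true outlier detection
        .satisfies("`Admission Date` <= current_date()",     # dates not in the future
                   "Admission Date not in the future", lambda x: x == 1.0)
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return VerificationResult.checkResultsAsDataFrame(spark, result)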
My environment has PyDeequ 1.2.0 installed, with PySpark downgraded to 3.3.1. Can anyone help me understand why this error occurs and how to resolve it?
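As an aside, I am also not sure my way of reading the results (indexing analysis_result like a dictionary) is right even once the JVM problem is solved; the PyDeequ examples I have seen convert the run result into a Spark DataFrame of metrics first, roughly like this (using the spark and df from my snippet above):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness

result = (
    AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Completeness("MRN"))
    .run()
)
# result is an AnalyzerContext; turn it into a DataFrame with columns
# entity / instance / name / value and read the Completeness metric from there
metrics_df = AnalyzerContext.successMetricsAsDataFrame(spark, result)
metrics_df.show()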