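I'm using Python 3.7. Apologies in advance if my code is a bit messy. This is the first project I've worked on, so I'm learning a lot as I go.
I'm trying to create a program that scans a PDF and parses it for specific expressions (using regex), then compares those results against the data in an Excel spreadsheet and identifies the matches.
At the moment the program successfully extracts the correct information from the PDF and compares it with column B of the spreadsheet to confirm that the data is present and correct.
What I'd like it to do is, for each matching cell in column B, also print the data from the cell next to it in column C.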
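To make the goal concrete, here is a rough sketch of the behaviour I'm after, on made-up data. The rows, the `build_lookup` and `describe_matches` helpers, and the column C text are all hypothetical; in the real program I assume the (B, C) pairs could come from openpyxl's `sheet.iter_rows(min_col=2, max_col=3, values_only=True)` rather than a hard-coded list.

```python
# Hypothetical (column B, column C) pairs standing in for the spreadsheet rows.
# Assumption: with openpyxl these could be read via
#   sheet.iter_rows(min_col=2, max_col=3, values_only=True)
rows = [
    ("EN 301 489-1 V2.1.1", "made-up title A"),
    ("EN 55032:2015", "made-up title B"),
    (None, None),  # empty spreadsheet rows come back as None
]


def build_lookup(pairs):
    """Map each non-empty column B value to the column C value beside it."""
    return {b: c for b, c in pairs if b is not None}


def describe_matches(found, lookup):
    """Return one output line per extracted standard that is in column B."""
    return [
        "Standard found: {} -> {}".format(item, lookup[item])
        for item in found
        if item in lookup
    ]


lookup = build_lookup(rows)
extracted = ["EN 55032:2015", "EN 12345:2000"]  # hypothetical PDF results
for line in describe_matches(extracted, lookup):
    print(line)
```

This assumes each column B cell contains exactly the matched string. If a cell holds extra text around the standard number, the dictionary key would probably need to be the regex match from that cell instead of the raw cell value.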
This is my current code:
import pprint
import re
import sys
import time
import tkinter as tk
from tkinter import filedialog

import openpyxl
import PyPDF2

# Open file dialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()

# Open the PDF and extract the text of every page
with open(file_path, 'rb') as pdfFile:
    reader = PyPDF2.PdfFileReader(pdfFile)
    pageNum = str(reader.numPages)
    print('Your document has ' + pageNum + ' pages' + '\n')
    decCon = ''
    for pN in range(reader.numPages):
        # accumulate, so earlier pages are not overwritten by later ones
        decCon += reader.getPage(pN).extractText()
# print(decCon)  # to test if extracting worked

# find the harmonised standards
# EN 000 000-1 V0.0.0, EN000000-1V0.0.0, EN 00000:0000, EN 00000-0:0000
# (every alternative requires whitespace after "EN", so the unspaced
# form EN000000-1V0.0.0 is not matched by this pattern)
docRegex = re.compile(r'''
    EN\s\d\d\d\s\d\d\d-\d\sV\d\.\d\.\d|
    EN\s\d\d\d\d\d:\d\d\d\d|
    EN\s\d\d\d\d\d-\d:\d\d\d\d
    ''', re.VERBOSE)

# extract the harmonised standards
extractedHs = docRegex.findall(decCon)

# DEBUG - to ensure it is collecting correct data
print('It contains the following standards: ' + '\n')
pprint.pprint(extractedHs)
print('\n' + '\n')

# setup progress bar (cosmetic only)
print('Scanning all ETSI standards...')
toolbar_width = 10
sys.stdout.write("-" * toolbar_width)
for i in range(toolbar_width):
    time.sleep(0.25)
    sys.stdout.write("-")
    sys.stdout.flush()
sys.stdout.write('\n' + '\n' + 'Printing results now...' + "\n" + '\n')

# extract column B from the ETSI spreadsheet
wb = openpyxl.load_workbook('All About Standards.xlsx')
sheet = wb["ETSI Catalog"]
etsi = []
for cell in sheet['B']:
    etsi.append(cell.value)
# print(etsi)  # DEBUG PRINT
extractedEtsi = docRegex.findall(str(etsi))

# comparison code: report every PDF standard that also appears in column B
for item1 in extractedHs:
    for item2 in extractedEtsi:
        if item1 == item2:
            print('Standard found: ' + item2)
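As a side check on the pattern, this small sketch runs the same three alternatives (with the dots escaped) on a made-up sentence. The sample text is hypothetical and only there to show which formats get picked up; the unspaced entry at the end is deliberate, since every alternative requires whitespace after `EN`.

```python
import re

# Same three alternatives as in the program, with the dots escaped so that
# "." only matches a literal dot rather than any character.
doc_regex = re.compile(r'''
    EN\s\d\d\d\s\d\d\d-\d\sV\d\.\d\.\d|
    EN\s\d\d\d\d\d:\d\d\d\d|
    EN\s\d\d\d\d\d-\d:\d\d\d\d
    ''', re.VERBOSE)

# Made-up sample text; the last entry has no spaces on purpose.
sample = "EN 301 489-1 V2.1.1, EN 55032:2015, EN 61000-3:2014, EN301489-1V2.1.1"

matches = doc_regex.findall(sample)
print(matches)
```

If the unspaced form also needs to be caught, one option might be to replace each `\s` with `\s?`, but I haven't tested that against real documents.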
Sorry if my explanation is a bit long-winded; I'll try to explain further or simplify if needed.
Thanks in advance.