Python: Convert PDF to CSV (multi-line cells)

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé Serrure du bas en mauvais état le système est cassé au niveau de la chaînette
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé; Serrure du bas en mauvais état le système ...
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

import os
import io
import shutil
import tabula
import time

start_time = time.time()
path = './'
i = 0
j = 0
for (directory, subdirectories, file) in os.walk(path):
    for f in file:
        if f.endswith('.pdf'):
            df = tabula.read_pdf(str(directory) + "/" + str(f), pages='all')
            i = 0
            j += 1
            for curr_df in df:
                i += 1
                curr_df.to_csv('./' + str(directory) + '-' + str(i) + '.csv')
print("--- convert %d .PDF to %d .CSV in %s seconds ---" % (j, i, time.time() - start_time))

2 Answers

User

#1 · Edited 2024-06-07 06:12:03

@Rjadriansen, this is the error I get:

fixed: PDF-8.csv_fixed.csv
fixed: PDF-5.csv_fixed.csv
fixed: PDF-7.csv_fixed.csv
fixed: PDF-6.csv_fixed.csv
fixed: PDF-2.csv_fixed.csv
fixed: PDF-10.csv_fixed.csv
fixed: PDF-3.csv_fixed.csv
fixed: PDF-4.csv_fixed.csv
Traceback (most recent call last):
  File "corrCSV_v2.py", line 24, in <module>
    process_csv(file)
  File "corrCSV_v2.py", line 12, in process_csv
    if i[0] ==',' or i[0].isnumeric():
IndexError: string index out of range

The error comes from this .csv file:

,Élément,État général,Observations
0,CUISINE,Etat d'usage,
1,CUISINECUISINE 15CUISINE 18

CUISINE 19,,

I think it is caused by the empty line.
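The failure can be reproduced without the file: indexing an empty string raises IndexError, so any blank line that reaches `i[0]` fails the same way. Below is a minimal sketch of one possible guard. The helper name `starts_new_row` is hypothetical and not part of corrCSV_v2.py; it relies on the fact that slicing with `[:1]` returns an empty string instead of raising.

```python
def starts_new_row(line):
    """Return True if the line starts with ',' (header) or a digit (numbered row)."""
    first = line[:1]  # '' for an empty line, so no IndexError is possible
    return first == ',' or first.isnumeric()


# Lines copied from the failing file above, including the blank one.
sample = [
    ",Élément,État général,Observations",
    "0,CUISINE,Etat d'usage,",
    "1,CUISINECUISINE 15CUISINE 18",
    "",
    "CUISINE 19,,",
]

# Reproduce the original error on the blank line.
try:
    sample[3][0]
except IndexError as exc:
    print("blank line:", exc)  # blank line: string index out of range

flags = [starts_new_row(line) for line in sample]
print(flags)  # [True, True, True, False, False]
```

With this check the blank line is simply classified as "not a new row", so the caller still has to decide whether to skip it or merge it.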

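User

#2 · Edited 2024-06-07 06:12:03

You can open the CSV, read its lines, and append every line that does not start with a comma (the header row) or with a digit (a numbered row) to the previous line. Then write the resulting lines to a new CSV file: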

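with open('filename.csv', encoding='utf-8') as f:  # read as UTF-8, the same encoding used for the output
    text = [line.rstrip() for line in f]  # remove newline character with rstrip()

lines = []
for i in text:
    if not i:
        continue  # skip empty lines; i[0] would raise IndexError on them
    if i[0] == ',' or i[0].isnumeric():
        lines.append(i)  # header or numbered row: start a new record
    elif lines:
        lines[-1] = lines[-1] + "; " + i  # continuation: append to the previous record

with open('new_file.csv', mode='wt', encoding='utf-8') as newfile:
    newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()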
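To see what the merge rule does, here is a small in-memory sketch that applies the same rule to a few rows from the question. The function name `merge_continuation_lines` is hypothetical, and the position of the line break inside row 1 is an assumption, since the question only shows that cell flattened.

```python
def merge_continuation_lines(text):
    """Join lines that do not start a new row onto the previous row with '; '."""
    lines = []
    for line in text:
        if not line:
            continue  # blank line: nothing to merge
        if line[0] == ',' or line[0].isnumeric():
            lines.append(line)  # header or numbered row
        elif lines:
            lines[-1] = lines[-1] + "; " + line  # continuation of the previous row
    return lines


raw = [
    ",Élément,État général,Observations",
    "0,ENTRÉE,Etat d'usage,",
    "1,PORTES,Etat d'usage,Chaînette cassé",
    "Serrure du bas en mauvais état",  # assumed continuation line
    "",
    "3,Type de porte,,Porte blindée",
]

merged = merge_continuation_lines(raw)
for row in merged:
    print(row)
```

One limitation of this rule: a continuation line that itself begins with a digit is treated as a new row, so the result is worth checking on real files.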

To process all files in a directory, we can put this in a function and feed every file in the directory to it:

import glob
import os

def process_csv(filepath):
    with open(filepath, encoding='utf-8') as f:  # read as UTF-8, the same encoding used for the output
        text = [line.rstrip() for line in f]  # remove newline character with rstrip()

    lines = []
    for i in text:
        if not i:
            continue  # skip empty lines; i[0] would raise IndexError on them
        if i[0] == ',' or i[0].isnumeric():
            lines.append(i)  # header or numbered row: start a new record
        elif lines:
            lines[-1] = lines[-1] + "; " + i  # continuation: append to the previous record

    fixed_name = os.path.basename(filepath) + '_fixed.csv'
    with open(fixed_name, mode='wt', encoding='utf-8') as newfile:
        newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
    print('fixed: ' + fixed_name)

files = glob.glob('./*.csv')  # use glob to create a list of file paths of the CSV files in a directory

for file in files:  # loop through the list and feed each file to the function process_csv
    process_csv(file)
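One thing to watch: process_csv writes its output as `<name>.csv_fixed.csv` into the working directory, and that name also matches `./*.csv`. Running the script a second time from the same directory would therefore process the already-fixed files again. A minimal sketch of one way to filter them out; the helper name `pending_csv_files` and the file names in the demonstration are hypothetical:

```python
import glob
import os
import tempfile


def pending_csv_files(directory="."):
    """Return CSV paths in `directory` that are not already '_fixed.csv' outputs."""
    pattern = os.path.join(directory, "*.csv")
    return sorted(p for p in glob.glob(pattern) if not p.endswith("_fixed.csv"))


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("PDF-1.csv", "PDF-2.csv", "PDF-1.csv_fixed.csv"):
        open(os.path.join(tmp, name), "w").close()  # create empty placeholder files
    names = [os.path.basename(p) for p in pending_csv_files(tmp)]

print(names)  # ['PDF-1.csv', 'PDF-2.csv']
```

The loop would then iterate over `pending_csv_files('.')` instead of the raw glob result. Writing the fixed files to a separate output directory would be another way to avoid the overlap.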
