Reading table data from a PDF file

Published 2024-04-26 22:37:35


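I am trying to read numeric data from a table as a string in Python. (I have tried many different ways of converting the table to CSV, Excel, and so on, but nothing seems to work perfectly, so I want to try the string approach.) Each line basically looks like this: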

"ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"

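There are 8 columns here. The last number on the right, 838, belongs to one column, -2 760 to another, 12 387 to another, and so on. Does anyone have a clever way of working out which numbers belong to which column?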


Tags: csv, data, methods, strings, numbers, excel, format, tables
1 answer
User
#1 · posted 2024-04-26 22:37:35

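It is hard to solve this exactly without access to the actual data, but essentially you need to parse the PDF table with something other than copy and paste, because copy and paste makes the spacing between columns indistinguishable from the spaces used as thousands separators.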
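To make the ambiguity concrete, here is a minimal sketch that splits the line from the question on whitespace. It only counts tokens and does not try to guess the grouping; the variable names are made up for illustration.

```python
# The flattened line from the question: 8 numeric columns, but the
# thousands separators are spaces, just like the column separators.
row = "ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"

label, *tokens = row.split()

# 15 whitespace-separated tokens for 8 columns: the string alone does not
# say which neighbouring tokens belong to the same number.
print(label, len(tokens))
```

Nothing in the string itself tells you that "12 387" is one value rather than two, which is why the layout information has to come from the PDF.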

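First, I would suggest using a tool such as Xpdf tools, a set of command-line utilities for parsing PDF documents. One of those utilities is called pdftotext.exe, and I have tested it on a sample PDF file called intrum_q317_presentation.pdf.

For example, to extract the table on page 17 of this document: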


You can run the following command:

"C:\Program Files\xpdf-tools-win-4.00\bin64\pdftotext.exe" -table -f 17 -l 17 intrum_q317_presentation.pdf parsed_output.txt
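The same command can also be launched from Python. The sketch below assumes the install path shown in the command above; the helper names build_command and extract_page are hypothetical, and extract_page is defined but not called here because it needs the executable and the PDF to be present.

```python
import subprocess

# Assumed install location, copied from the command above; adjust as needed.
PDFTOTEXT = r"C:\Program Files\xpdf-tools-win-4.00\bin64\pdftotext.exe"


def build_command(pdf_path, txt_path, page, exe=PDFTOTEXT):
    """Return the argument list for extracting a single page in table mode."""
    page = str(page)
    # -table keeps the column layout; -f and -l are the first and last page.
    return [exe, "-table", "-f", page, "-l", page, pdf_path, txt_path]


def extract_page(pdf_path, txt_path, page):
    """Run pdftotext and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(pdf_path, txt_path, page), check=True)


command = build_command("intrum_q317_presentation.pdf", "parsed_output.txt", 17)
print(command)
```

Passing the arguments as a list avoids the quoting problems caused by the space in "Program Files". Either way, the command writes its result to parsed_output.txt.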

It produces the following output (in parsed_output.txt):

Cash flow statement

                                                                  Q3   Q3    Dev    YTD     YTD     Dev

SEK M                                                         2017     2016  %      2017    2016    %

Operating earnings (EBIT)                                         977  506   93     1 921   1 379   39

Depreciation                                                      163  40    308    245     120     104

Amortization and revaluation of purchased debt                    866  389   123    1 845   1 137   62

Income tax paid                                                   -97  -33   194    -283    -187    51

Changes in factoring receivables                                  7    -25   -128   -39     -45     -13

Other changes in working capital                                  5    -60   -108   -8      -119    n/a

Financial net & other non-cash items                          -125     -6    1983   -486    -74     557

Cash flow from operating activities (CFFO)                    1 796    811   121    3 195   2 211   45

Purchases of tangible and intangible fixed assets (CAPEX)         -38  -33   15     -115    -103    12

Purchases of debt                                             -1  124  -732  54     -4 317  -2 188  97

Purchases of shares in subsidiaries and associated companies      -2   -1    100    -171    -89     92

Liquid assets in acquired subsidiaries                            0    0            975     1

Other cash flow form investing activities                         -1   2     -150   -2      6       -133

Cash flow from investing activities (CFFI)                    -1  165  -764  52     -3 630  -2 373  53

Cash flow from investing activities (CFFI)

excl liquid assets in acquired subsidiaries                   -1  165  -764  52     -4 605  -2 374  94

Free cash flow (CFFO - CFFI)                                      631  47    1 243  -435    -167    160

Free cash flow (CFFO - CFFI) excl liquid

assets in acquired subsidiaries                                   631  47    1 243  -1 410  -168    739

                                                                                                17

You can see that this looks a lot like your string, but with wider spacing between the individual columns.
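For a single well-behaved row, the wider gaps are already enough to separate the columns. As a rough sketch, assuming every column gap is at least two spaces and every thousands separator is exactly one space, a regular expression split recovers the cells:

```python
import re

# One row copied from the pdftotext output above.
line = ("Operating earnings (EBIT)"
        "                                         "
        "977  506   93     1 921   1 379   39")

# Split on runs of two or more whitespace characters (column gaps),
# leaving single spaces (thousands separators) inside each cell.
cells = re.split(r"\s{2,}", line.strip())

# Remove the thousands separators and convert the numeric cells.
values = [int(cell.replace(" ", "")) for cell in cells[1:]]

print(cells[0], values)
```

This shortcut breaks down on rows with empty cells (such as "Liquid assets in acquired subsidiaries") or with irregular spacing, so a more robust parse needs to look at the column alignment across all rows.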

We can then use some Python to parse it into a two-dimensional array:

import re

from tabulate import tabulate

# Read the pdftotext output, dropping blank lines and trailing newlines
# (a newline left in place would be mistaken for column content).
with open('C:\\parsed_output.txt') as f:
    raw_lines = [line.rstrip('\n') for line in f if line.strip() != '']

lines = raw_lines[1:-1]  # ignore first line (title) and last line (page number)

# Build a template with 'X' wherever any line has a non-space character
# and ' ' only where every line has a space.
template = ''
for raw_line in lines:
    length = max(len(template), len(raw_line))
    old_template = template.ljust(length)
    line = raw_line.ljust(length)
    template = ''.join(
        ' ' if old_template[i] == ' ' and line[i] == ' ' else 'X'
        for i in range(length)
    )

# Try to work out where each column starts, based on alignment of spaces.
column_count = len(template.split())
column_starts = [0]
start = 0
for i in range(1, column_count):
    start = template.find(' X', start) + 1
    column_starts.append(start)
column_starts.append(len(template))  # add final value to terminate right-most column

# Now divide up each line using the column boundaries.
rows = []
for raw_line in lines:
    line = raw_line.ljust(len(template))
    row = []
    for i in range(column_count):
        value = line[column_starts[i]:column_starts[i + 1]].strip()
        if i > 0:
            # Numeric columns: drop the spaces used as thousands separators.
            value = re.sub(r'\s+', '', value)
        row.append(value)
    rows.append(row)

print(tabulate(rows, tablefmt='grid'))

...which gives the following result:

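+--------------------------------------------------------------+-------+------+------+-------+-------+------+
|                                                              | Q3    | Q3   | Dev  | YTD   | YTD   | Dev  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| SEK M                                                        | 2017  | 2016 | %    | 2017  | 2016  | %    |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Operating earnings (EBIT)                                    | 977   | 506  | 93   | 1921  | 1379  | 39   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Depreciation                                                 | 163   | 40   | 308  | 245   | 120   | 104  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Amortization and revaluation of purchased debt               | 866   | 389  | 123  | 1845  | 1137  | 62   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Income tax paid                                              | -97   | -33  | 194  | -283  | -187  | 51   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Changes in factoring receivables                             | 7     | -25  | -128 | -39   | -45   | -13  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other changes in working capital                             | 5     | -60  | -108 | -8    | -119  | n/a  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Financial net & other non-cash items                         | -125  | -6   | 1983 | -486  | -74   | 557  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from operating activities (CFFO)                   | 1796  | 811  | 121  | 3195  | 2211  | 45   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of tangible and intangible fixed assets (CAPEX)    | -38   | -33  | 15   | -115  | -103  | 12   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of debt                                            | -1124 | -732 | 54   | -4317 | -2188 | 97   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of shares in subsidiaries and associated companies | -2    | -1   | 100  | -171  | -89   | 92   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Liquid assets in acquired subsidiaries                       | 0     | 0    |      | 975   | 1     |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other cash flow form investing activities                    | -1    | 2    | -150 | -2    | 6     | -133 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI)                   | -1165 | -764 | 52   | -3630 | -2373 | 53   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI)                   |       |      |      |       |       |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| excl liquid assets in acquired subsidiaries                  | -1165 | -764 | 52   | -4605 | -2374 | 94   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI)                                 | 631   | 47   | 1243 | -435  | -167  | 160  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI) excl liquid                     |       |      |      |       |       |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| assets in acquired subsidiaries                              | 631   | 47   | 1243 | -1410 | -168  | 739  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+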

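Of course, it is not perfect (for example, "Q3 2017" should be in a single cell), and it is not guaranteed to work with your exact data (for example, you may need to adjust the column widths by hand), but it should get you started.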
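As one possible clean-up step, the two header rows can be merged and the numeric cells converted before writing CSV. This is only a sketch under assumptions: the sample rows are a hand-copied subset of the parsed table, the helper names merge_header and to_number are made up, and cells that are empty or "n/a" are written as blanks.

```python
import csv
import io

# Hand-copied subset of the parsed rows (illustrative, not the full table).
rows = [
    ["", "Q3", "Q3", "Dev", "YTD", "YTD", "Dev"],
    ["SEK M", "2017", "2016", "%", "2017", "2016", "%"],
    ["Operating earnings (EBIT)", "977", "506", "93", "1921", "1379", "39"],
    ["Other changes in working capital", "5", "-60", "-108", "-8", "-119", "n/a"],
    ["Liquid assets in acquired subsidiaries", "0", "0", "", "975", "1", ""],
]


def merge_header(top, bottom):
    """Join the two header rows cell by cell, e.g. 'Q3' + '2017' -> 'Q3 2017'."""
    return [" ".join(part for part in (a, b) if part) for a, b in zip(top, bottom)]


def to_number(cell):
    """Return the cell as an int, or None for blanks and values such as 'n/a'."""
    try:
        return int(cell)
    except ValueError:
        return None


header = merge_header(rows[0], rows[1])
body = [[row[0]] + [to_number(cell) for cell in row[1:]] for row in rows[2:]]

# Write to an in-memory buffer; swap in open('out.csv', 'w', newline='') for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(body)  # None values are written as empty fields
print(buffer.getvalue())
```

Labels that wrap onto two lines, such as the "excl liquid assets" rows, would still need to be joined by hand or with an extra rule.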
