Reading a data file with a new header in the middle of the file using Python

c = pd.read_csv(r'C:\filepath.txt', sep=',', header=None, names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'], skiprows=[0, 1])

                      <Title1>  ...   <Title12>
134849000   -0.420384078515376  ...  244.507248
135016000   -0.406915327374619  ...  244.507248
135183000   -0.406915327374619  ...  244.507248
135349000   -0.406915327374619  ...  244.507248
135516000   -0.406915327374619  ...  244.507248
...                        ...  ...         ...   <-- (somewhere in here there is a new header with three columns)
2316226000   0.349323222511261  ...         NaN
2316393000   0.359268272664523  ...         NaN
2316560000   0.346797179431672  ...         NaN
2316726000   0.291363936474923  ...         NaN
2316893000   0.256587672540276  ...         NaN

[26188 rows x 12 columns]

<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0

3 Answers

User

Answer 1 · edited 2024-06-06 17:24:57

One possible solution:

Temporarily reset the index

c.reset_index(inplace=True)

Find the rows that hold the new columns under the second header

newcols = c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]].reset_index(drop=True)

Rename the new columns

newcols.rename(columns={'<Title1>' : '<Title14>', '<Title2>' : '<Title15>'}, inplace=True)

Add the new columns, drop the rows that belong to the second header, and restore the original index

c = pd.concat([c, newcols], axis=1).dropna().set_index('index')

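User

Answer 2 · edited 2024-06-06 17:24:57

Here are two solutions: the first produces a new file, the second deals with the headers while the csv is being read. The first is suitable if the file will be processed many times, but it requires reading every line at least twice. If you need to read several large files only once, the second approach is preferable.

Solution 1: preprocess your file

Parse the file once to remove the extra headers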

import pandas as pd

# create a second file that keeps only the first occurrence of the header line
with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    header = f_in.readline()
    f_out.write(header)
    for line in f_in:
        # skip any later line that repeats the header verbatim
        if line != header:
            f_out.write(line)

# then read the corrected csv file
pd.read_csv('file_single_header.csv')

Solution 2: treat the header lines as comments and assign the column names manually

# read the first line of the file to get the header and split it into names
import re
import pandas as pd

with open('file.csv', 'r') as f:
    header = re.split(r'\s+', f.readline().strip())

# exclude the header lines (they start with '<') and assign the names manually
pd.read_csv('file.csv', comment='<', names=header)
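
To see what `comment='<'` does on a file shaped like the one in the question, here is a minimal sketch on an in-memory sample. The sample text and the column names `a`, `b`, `c` are made up for illustration. One thing worth noticing: the rows of every section end up stacked under the same column names, so this only suits files whose sections share a layout.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: two header blocks, each followed by data rows.
sample = (
    "<Header1>\n"
    "<Title1><Title2><Title3>\n"
    "134849000,-0.42,53.39\n"
    "<Header2>\n"
    "<Title1><Title4><Title5>\n"
    "134849000,0.12,0\n"
)

# Lines starting with '<' are treated as comments; names are assigned by hand.
df = pd.read_csv(
    io.StringIO(sample),
    comment="<",
    header=None,
    names=["a", "b", "c"],  # illustrative names, not taken from the real file
)

# Defensive clean-up in case a commented line leaves an all-NaN row behind.
df = df.dropna(how="all").reset_index(drop=True)
print(df)
```

Both data rows land in the same three columns, so if the second block has different columns (as in the question) the sections still have to be separated afterwards.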

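Note: it is not clear from your example whether the values are really comma-separated or use another separator. If the separator is whitespace, you need to adjust read_csv as shown below. Also, if the index is stored in the csv file, you need to add a name for it in option 2 (None here).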

# option 1
pd.read_csv('file_single_header.csv', sep=r'\s+')

# option 2
pd.read_csv('file.csv',
            comment='<',
            names=[None] + header,  # added None for the index
            sep=r'\s+',
            index_col=0
           ).dropna(axis=0, how='all')

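Solution 3: fix the malformed csv file

with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    i = 0  # counts the lines that start with '<'
    for line in f_in:
        if line.strip().startswith('<'):
            # keep only the second '<' line (the first title row), rewritten as comma-separated names
            if i == 1:
                f_out.write(','.join(line.strip('<>\n').split('><'))+'\n')
            i += 1
        else:
            f_out.write(line)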
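
When the blocks have different column counts, as in the question, another option is to split the text into sections before handing anything to pandas. This is not one of the solutions above, just a sketch under the assumption that every header or title line starts with `<`; the helper name `read_sections` and the sample text are hypothetical.

```python
import io
import pandas as pd

def read_sections(text):
    """Return one DataFrame per block of data lines between '<...>' header lines."""
    sections, current = [], []
    for line in text.splitlines():
        if line.strip().startswith("<"):
            # A header line closes the block collected so far, if any.
            if current:
                sections.append(current)
                current = []
        elif line.strip():
            current.append(line)
    if current:
        sections.append(current)
    # Parse each block on its own so column counts may differ between blocks.
    return [pd.read_csv(io.StringIO("\n".join(s)), header=None) for s in sections]

# Hypothetical miniature of the file from the question.
sample = (
    "<Header1>\n"
    "<Title1><Title2><Title3>\n"
    "1,2.0,3.0\n"
    "4,5.0,6.0\n"
    "<Header2>\n"
    "<Title1><Title4>\n"
    "1,0.5\n"
    "4,0.7\n"
)
top, bottom = read_sections(sample)
print(top.shape, bottom.shape)
```

Each block keeps its own width, and the resulting frames can then be joined on their shared first column.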

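User

Answer 3 · edited 2024-06-06 17:24:57

I managed to find a relatively robust and simple solution for this particular data set.

After reading the data and skipping the first header: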

raw_data = pd.read_csv('C:datafile.txt', sep=',',header=None, skiprows=[0,1])

I check the first column for non-numeric values to find where the next header is:

a = pd.to_numeric(pd.to_numeric(raw_data[0], errors='coerce').isnull())

Result:

0        False
1        False
2        False
3        False
4        False
         ...  
26183    False
26184    False
26185    False
26186    False
26187    False
Name: 0, Length: 26188, dtype: bool

Then I find the indices where the statement is true:

a = np.where(a)[0]

Result:

[13093 13094]
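
The same detection step can be tried on a small in-memory sample. This is only a sketch: the sample rows are invented, and `np.flatnonzero` is used here in place of `np.where(...)[0]`, which returns the same positions.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical miniature: data rows, then a two-line header, then more data rows.
sample = io.StringIO(
    "134849000,-0.42,53.39\n"
    "135016000,-0.40,53.39\n"
    "<Header2>\n"
    "<Title13><Title14><Title15>\n"
    "134849000,0.12,0\n"
    "135016000,0.17,0\n"
)
raw = pd.read_csv(sample, header=None)

# Values in the first column that cannot be parsed as numbers become NaN.
is_header = pd.to_numeric(raw[0], errors="coerce").isna()

# Row positions of the embedded header lines.
header_rows = np.flatnonzero(is_header.to_numpy())
print(header_rows)
```

In this sample the two header lines sit at positions 2 and 3, mirroring the pair of consecutive indices found in the real file.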

From here, I can simply use those indices to select the data under each of the two headers:

d = raw_data.iloc[:raw_data.index.get_loc(a[0])]
e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]

For e, I also restrict the columns, because the second header only has three columns.

Result:

               0                   1   ...          11          12
0       134849000  -0.420384078515376  ...  189.971016  244.507248
1       135016000  -0.406915327374619  ...  189.971016  244.507248
2       135183000  -0.406915327374619  ...  189.971016  244.507248
3       135349000  -0.406915327374619  ...  189.971016  244.507248
4       135516000  -0.406915327374619  ...  189.971016  244.507248
...           ...                 ...  ...         ...         ...
13088  2316226000   -0.30945361835179  ...  188.914284  243.942856
13089  2316393000    -0.4099956694033  ...  188.914284  243.942856
13090  2316560000    -0.4099956694033  ...  188.914284  243.942856
13091  2316726000    -0.4099956694033  ...  188.914284  243.942856
13092  2316893000  -0.429752713005517  ...  188.914284  243.942856

[13093 rows x 13 columns]

                0                   1  2
13095   134849000   0.120862187115588  0
13096   135016000   0.171543242833847  0
13097   135183000   0.146335932645973  0
13098   135349000    0.09773669641824  0
13099   135516000  0.0882672298282907  0
...           ...                 ... ..
26183  2316226000   0.349323222511261  0
26184  2316393000   0.359268272664523  0
26185  2316560000   0.346797179431672  0
26186  2316726000   0.291363936474923  0
26187  2316893000   0.256587672540276  0

[13093 rows x 3 columns]

Since both data sets share a common column (the first column under each header), I use merge to attach the bottom data set to the top one:

f = pd.merge(d, e, on=0)

Result:

                0                 1_x  ...                 1_y 2_y
0       134849000  -0.420384078515376  ...   0.120862187115588   0
1       135016000  -0.406915327374619  ...   0.171543242833847   0
2       135183000  -0.406915327374619  ...   0.146335932645973   0
3       135349000  -0.406915327374619  ...    0.09773669641824   0
4       135516000  -0.406915327374619  ...  0.0882672298282907   0
...           ...                 ...  ...                 ...  ..
13088  2316226000   -0.30945361835179  ...   0.349323222511261   0
13089  2316393000    -0.4099956694033  ...   0.359268272664523   0
13090  2316560000    -0.4099956694033  ...   0.346797179431672   0
13091  2316726000    -0.4099956694033  ...   0.291363936474923   0
13092  2316893000  -0.429752713005517  ...   0.256587672540276   0

[13093 rows x 15 columns]
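
Put together, the split and merge steps might look like the sketch below on a toy sample. The sample data and the suffixes `_top` and `_bottom` are illustrative choices; `validate="one_to_one"` is an optional extra check that makes pandas raise if a key appears more than once in either part.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical miniature of the file after skipping the first header.
sample = io.StringIO(
    "134849000,-0.42,53.39\n"
    "135016000,-0.40,53.39\n"
    "<Header2>\n"
    "<Title13><Title14><Title15>\n"
    "134849000,0.12,0\n"
    "135016000,0.17,0\n"
)
raw = pd.read_csv(sample, header=None)

# Locate the embedded header rows via the non-numeric entries of column 0.
is_header = pd.to_numeric(raw[0], errors="coerce").isna()
header_rows = np.flatnonzero(is_header.to_numpy())

# Everything before the header is the top block, everything after it the bottom block.
top = raw.iloc[:header_rows[0]].copy()
bottom = raw.iloc[header_rows[-1] + 1:].copy()

# Column 0 was read as text because of the header lines, so convert the key back to numbers.
top[0] = pd.to_numeric(top[0])
bottom[0] = pd.to_numeric(bottom[0])

merged = pd.merge(top, bottom, on=0, suffixes=("_top", "_bottom"), validate="one_to_one")
print(merged)
```

The key column stays named `0`, while the overlapping columns from each block get the chosen suffixes.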

Now I have the correct data set, which I can save with .to_csv while defining my own header.
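
As a sketch of that last step: `to_csv` accepts a list of strings for `header`, which is written in place of the existing column labels. The small frame and the names in `column_names` below are hypothetical stand-ins for the merged data and its real titles.

```python
import io
import pandas as pd

# Hypothetical stand-in for the merged frame `f`.
f = pd.DataFrame({
    0: [134849000, 135016000],
    "1_x": [-0.42, -0.41],
    "1_y": [0.12, 0.17],
})

# Illustrative replacement names, one per column, in column order.
column_names = ["time", "value_a", "value_b"]

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
f.to_csv(buffer, header=column_names, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The list must have exactly one entry per column; `index=False` keeps the row numbers out of the file.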
