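I have a .txt data file containing many columns with different titles. I can read the file with all of its columns and rows. My problem is that the file contains an additional header with three columns, appended after the last data row of the initial header. How can I separate these last three columns from the first ones? I also want to drop the first of the three columns, because it is a copy of the first column, and append the other two column-wise to the columns at the top of the file. This is how I read the file:
import pandas as pd
c = pd.read_csv(r'C:\filepath.txt', sep=',', header=None, names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'], skiprows=[0,1])  # raw string so that "\f" is not read as a form feed
The result is: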
<Title1> ... <Title12>
134849000 -0.420384078515376 ... 244.507248
135016000 -0.406915327374619 ... 244.507248
135183000 -0.406915327374619 ... 244.507248
135349000 -0.406915327374619 ... 244.507248
135516000 -0.406915327374619 ... 244.507248
... ... ... ... <-- (somewhere in here there is a new header with three columns)
2316226000 0.349323222511261 ... NaN
2316393000 0.359268272664523 ... NaN
2316560000 0.346797179431672 ... NaN
2316726000 0.291363936474923 ... NaN
2316893000 0.256587672540276 ... NaN
[26188 rows x 12 columns]
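As can be seen, the "4th quadrant" of the dataset (columns 4-12, row x, 1-indexed) contains NaN values: the three columns are appended after the last row of the first header, so those cells stay empty because the file starts with 12 columns at the top. In addition, each header consists of two lines, of which the first is not needed, so I need to skip those lines.
Example file: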
<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0
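One possible solution:
Temporarily change the index
Find the rows that hold the new columns under the second header
Rename the new columns
Add the new columns, drop the rows that belong to the second header, and restore the original index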
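The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample text is a made-up miniature of the file (four columns on top, three below), the names t1 to t6 are hypothetical, and both blocks are assumed to share the values of the first column.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: 4 columns on top, 3 below (names are made up).
raw = """<Header1>
<T1><T2><T3><T4>
100,1.0,2.0,3.0
200,1.1,2.1,3.1
300,1.2,2.2,3.2
<Header2>
<T1><T5><T6>
100,0.5,0
200,0.6,0
300,0.7,0
"""

names = ["t1", "t2", "t3", "t4"]
df = pd.read_csv(io.StringIO(raw), header=None, names=names, skiprows=[0, 1])

# Rows of the second header start with "<" in the first column.
is_header = df["t1"].astype(str).str.startswith("<")
header_rows = df.index[is_header]
first, last = header_rows[0], header_rows[-1]

# Temporarily use the shared first column as the index of both blocks.
top = df.iloc[:first].set_index("t1")
bottom = df.iloc[last + 1:, :3].set_index("t1")

# Rename the new columns, which were read under the top block's names.
bottom.columns = ["t5", "t6"]

# Add the new columns by index alignment, then restore the original index.
out = top.join(bottom).reset_index()
out["t1"] = out["t1"].astype("int64")  # the key was read as text because of the header rows
print(out)
```

Joining on the index aligns the rows by the value of the first column rather than by position, so the two blocks do not have to be in the same order.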
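Here are two solutions: the first produces a new file, and the second fixes the headers while read_csv runs. The first is useful if the file will be processed many times, but it has to read every line at least twice. If several large files each need to be read only once, the second is preferable.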
Solution 1: preprocess your file
Parse the file once to remove the extra header
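A minimal sketch of such a preprocessing pass, using only the standard library. It is not the original answer's code. It assumes that header and title lines start with "<", that there are exactly two data blocks, and that both blocks list the same keys in the same order; merge_blocks is a hypothetical helper name.

```python
def merge_blocks(text):
    """Return CSV text with the second block's extra columns appended row by row."""
    blocks = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<"):      # header or title line (assumption)
            in_data = False
            continue
        if not in_data:               # first data line after a header starts a new block
            blocks.append([])
            in_data = True
        blocks[-1].append(line.split(","))

    if len(blocks) != 2:
        raise ValueError(f"expected 2 data blocks, found {len(blocks)}")
    top, bottom = blocks
    if len(top) != len(bottom):
        raise ValueError("the two blocks have different numbers of rows")

    merged = []
    for a, b in zip(top, bottom):
        if a[0] != b[0]:              # the first column must match in both blocks
            raise ValueError(f"key mismatch: {a[0]} != {b[0]}")
        merged.append(",".join(a + b[1:]))   # drop the duplicated key column
    return "\n".join(merged) + "\n"


# Hypothetical miniature of the file.
raw = """<Header1>
<T1><T2><T3><T4>
100,1.0,2.0,3.0
200,1.1,2.1,3.1
<Header2>
<T1><T5><T6>
100,0.5,0
200,0.6,0
"""
print(merge_blocks(raw))
```

The returned text can be written to a new file, which a plain read_csv call with header=None and a full list of names can then read without any special handling.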
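Solution 2: treat the header lines as comments and assign the column names manually
Note: it is not clear from your example whether the values are really comma-separated or use another delimiter. If the delimiter is whitespace, you need to adjust the read_csv call accordingly (for example with sep=r'\s+'). Also, if the index is stored in the csv file, you need to add a name for it in option 2 (none is given here).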
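One way this idea might look, as a sketch rather than the original answer's code: the comment parameter of read_csv makes pandas ignore every line that starts with "<", and the column names are supplied by hand. The sample data and the names t1 to t6 are hypothetical, and the sketch assumes the top block never has a missing value in its last column, so that NaN there marks a row of the second block.

```python
import io
import pandas as pd

# Hypothetical miniature of the file (names are made up).
raw = """<Header1>
<T1><T2><T3><T4>
100,1.0,2.0,3.0
200,1.1,2.1,3.1
300,1.2,2.2,3.2
<Header2>
<T1><T5><T6>
100,0.5,0
200,0.6,0
300,0.7,0
"""

names = ["t1", "t2", "t3", "t4"]
# Lines starting with "<" are treated as comments and skipped entirely.
df = pd.read_csv(io.StringIO(raw), comment="<", header=None, names=names)

# Rows of the shorter second block are padded with NaN in the last column.
is_bottom = df["t4"].isna()
top = df[~is_bottom]
bottom = df.loc[is_bottom, ["t1", "t2", "t3"]].rename(columns={"t2": "t5", "t3": "t6"})

# Attach the two extra columns using the shared first column as the key.
out = top.merge(bottom, on="t1", how="left")
print(out)
```

Because the header lines never reach the parser, the first column stays numeric and the file is read only once.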
Solution 3: repair the malformed csv file
I managed to find a relatively robust and simple solution for this particular dataset.
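After reading the data and skipping the first header, I check the first column for non-numeric values to find out where the next header is.
Then I take the indices where that check is true.
From there, I can use those indices to slice out the data under each of the two headers, giving two frames, d and e.
For e, I also select the columns explicitly, because the second header has only three columns.
Since both datasets share a common column (the first column under each header), I use merge to attach the bottom dataset to the top one.
Now I have the correct dataset, which I can save with .to_csv, defining my own header.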
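The whole procedure could look roughly like the sketch below. It is a reconstruction under assumptions, not the author's original code: the sample data, the names t1 to t6 and the output titles are hypothetical, and the result is written to an in-memory buffer instead of a real path.

```python
import io
import pandas as pd

# Hypothetical miniature of the file (names are made up).
raw = """<Header1>
<T1><T2><T3><T4>
100,1.0,2.0,3.0
200,1.1,2.1,3.1
300,1.2,2.2,3.2
<Header2>
<T1><T5><T6>
100,0.5,0
200,0.6,0
300,0.7,0
"""

names = ["t1", "t2", "t3", "t4"]
c = pd.read_csv(io.StringIO(raw), header=None, names=names, skiprows=[0, 1])

# Non-numeric values in the first column mark the rows of the second header.
non_numeric = pd.to_numeric(c["t1"], errors="coerce").isna()
idx = c.index[non_numeric]

# d: rows above the second header; e: rows below it, limited to its three columns.
d = c.iloc[:idx[0]]
e = c.iloc[idx[-1] + 1:, :3].copy()
e.columns = ["t1", "t5", "t6"]

# Attach the bottom dataset to the top one through the shared first column.
merged = d.merge(e, on="t1")
merged["t1"] = pd.to_numeric(merged["t1"])  # the key was read as text

# Save with a self-defined header (a real file path would replace the buffer).
buf = io.StringIO()
merged.to_csv(buf, index=False,
              header=["Title1", "Title2", "Title3", "Title4", "Title14", "Title15"])
print(buf.getvalue())
```

merge performs an inner join by default, so any key that appears in only one of the two blocks is dropped; passing how="left" would keep every row of the top block instead.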