Merging two DataFrames that have no column names in pandas (beginner)
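Short summary:
If your data has duplicate column names, rename one of the duplicates when you read the file.
If the data contains invalid values such as NaN, drop them.
Then merge using the correct approach shown in the answers below.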
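A minimal sketch of the summary above, assuming small inline CSV snippets in place of the real files. The sample rows and the name strike2 are invented for illustration; only dt1, price, dt2, ticker and strike come from the question.

```python
import io
import pandas as pd

# Tiny stand-ins for the two CSV files; the values are invented for illustration.
underlying_csv = io.StringIO("20130627,3.50\n20130628,3.60\n")
options_csv = io.StringIO("20130628,SVXY,32.5,32.5\n20130701,SVXY,,33.0\n")

# Every name is unique: the second strike column gets the made-up name 'strike2'.
underlying = pd.read_csv(underlying_csv, names=["dt1", "price"])
options = pd.read_csv(options_csv, names=["dt2", "ticker", "strike", "strike2"])

# Drop rows that contain NaN before merging.
options = options.dropna()

# An outer merge keeps dates that appear in either frame.
merged = underlying.merge(options, left_on="dt1", right_on="dt2", how="outer")
print(merged)
```

Keep in mind that an outer merge can itself introduce NaN for dates that exist in only one of the two frames.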
This is probably a very simple question.
I have two datasets, both read with pandas.read_csv().
My data lives in two separate CSV files.
I am using the following code:
import mibian
import pandas as pd
underlying = pd.read_csv("txt1.csv", names=['dt1','price']);
options = pd.read_csv("txt2.txt", names=['dt2','ticker','maturity','strike','cP','px','strike','yield','rF','T','rlzd10']);
merged = underlying.merge(options, left_on='dt1', right_on='dt2');
The first few rows of my two datasets look like this:
>>> underlying.head();
0 1
0 20040326 3.579987
1 20040329 3.690494
2 20040330 3.755247
3 20040331 3.719373
4 20040401 3.728671
and
>>> options.head();
0 1 2 3 4 5 6 7 8 9 10
0 20130628 SVXY 20130817 32.5 call 39.22 32.5 0 0.005 0.136986 0.411224
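So the key column I want to merge on is column 0 of both datasets, and I want to keep all the data from both in the result.
How do I do this? The examples I found online all need a named key column, but my data does not have one.
When I run the merge, though, I get the following error: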
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Applications/Spyder.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/Users/jasonmellone/.spyder2/.temp.py", line 12, in <module>
merged = underlying.merge(options, left_on='dt1', right_on='dt2',how='outer');
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/frame.py", line 3723, in merge
suffixes=suffixes, copy=copy)
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 40, in merge
return op.get_result()
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 197, in get_result
result_data = join_op.get_result()
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 722, in get_result
return BlockManager(result_blocks, self.result_axes)
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 1954, in __init__
self._set_ref_locs(do_refs=True)
File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 2091, in _set_ref_locs
'have _ref_locs set' % (block, labels))
AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [dt1], 1 x 372145, dtype: int64] with duplicate items [Index([u'dt1', u'price', u'dt2', u'ticker', u'maturity', u'strike', u'cP', u'px', u'strike', u'yield', u'rF', u'T', u'rlzd10'], dtype='object')] does not have _ref_locs set
I have checked my datasets and there is no duplicated data.
Thanks!
3 Answers
-1
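If you want to merge on the same column in both frames, which is the case here, you can simply use on=0, where 0 is the label of the first column in both DataFrames. This assumes the columns carry the default integer labels, as in the head() output in the question.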
import pandas as pd
merged = underlying.merge(options, on=0, how='outer')
# or
merged = pd.merge(underlying, options, on=0, how='outer')
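As a rough illustration of on=0, here is a sketch with two small hypothetical frames whose columns carry the default integer labels. The second price and the trimmed-down options row are made up for the example.

```python
import pandas as pd

# Hypothetical frames with default integer column labels (0, 1, ...).
underlying = pd.DataFrame([[20040326, 3.579987], [20130628, 3.70]])
options = pd.DataFrame([[20130628, "SVXY", 32.5]])

# Both frames keep their key in the column labelled 0.
merged = pd.merge(underlying, options, on=0, how="outer")
print(merged)
```

Both frames also have a column labelled 1, so pandas appends its default suffixes to tell them apart, which is where names like 1_x and 1_y come from.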
If the key column has a different label in each DataFrame, use the left_on and right_on options instead.
# here 0 is the key column of df1 and 2 is the key column of df2
pd.merge(df1, df2, left_on=0, right_on=2, how='outer')
2
A similar problem led me to this thread. I was getting a KeyError. The fix was to drop the quotes in left_on='0' and write left_on=0 instead.
# raises KeyError when the column labels are integers
merged = underlying.merge(options, left_on='0', right_on='0')
# works: 0 is the integer column label
merged = underlying.merge(options, left_on=0, right_on=0)
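To make the difference concrete, here is a small sketch, assuming frames built without column names so that the labels are integers. The frame names left and right and the sample values are placeholders.

```python
import pandas as pd

# Frames without explicit column names get integer labels.
left = pd.DataFrame([[20130628, 3.58]])
right = pd.DataFrame([[20130628, "SVXY"]])

print(left.columns.tolist())  # integer labels, not strings

raised = False
try:
    # '0' is a string, and no column carries that label.
    left.merge(right, left_on="0", right_on="0")
except KeyError as err:
    raised = True
    print("KeyError:", err)

# The integer 0 matches the actual column label.
merged = left.merge(right, left_on=0, right_on=0)
print(merged)
```

If the frames had been created with string column names such as '0', the quoted form would be the one that works, so it is worth checking df.columns first.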
3
You can still merge on these columns:
merged = underlying.merge(options, left_on='0', right_on='0')
This performs an inner merge, so only rows whose value in column 0 appears in both datasets are kept. If you want all the values, specify outer:
merged = underlying.merge(options, left_on='0', right_on='0', how='outer')
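For comparison, a small sketch of inner versus outer on two hypothetical frames. It uses a named key column dt for readability, and the rows are invented. The indicator=True option adds a _merge column that records where each row came from.

```python
import pandas as pd

# Invented sample rows; only 20130628 appears in both frames.
underlying = pd.DataFrame({"dt": [20040326, 20130628], "price": [3.579987, 3.70]})
options = pd.DataFrame({"dt": [20130628, 20130701], "ticker": ["SVXY", "SVXY"]})

# how='inner' is the default: keep only keys found in both frames.
inner = underlying.merge(options, on="dt")

# how='outer': keep every key; _merge marks left_only / right_only / both.
outer = underlying.merge(options, on="dt", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer)
```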
In [10]:
merged = underlying.merge(options, left_on='0', right_on='0', how='outer')
merged
Out[10]:
0 1_x 1_y 2 3 4 5 6 7 8 \
0 20040326 3.579987 NaN NaN NaN NaN NaN NaN NaN NaN
1 20040329 3.690494 NaN NaN NaN NaN NaN NaN NaN NaN
2 20040330 3.755247 NaN NaN NaN NaN NaN NaN NaN NaN
3 20040331 3.719373 NaN NaN NaN NaN NaN NaN NaN NaN
4 20040401 3.728671 NaN NaN NaN NaN NaN NaN NaN NaN
5 20130628 NaN SVXY 20130817 32.5 call 39.22 32.5 0 0.005
9 10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 0.136986 0.411224
[6 rows x 12 columns]
You will need to rename or move the clashing columns 1_x and 1_y.
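One way to handle the clash, sketched here with made-up single-row frames: pass suffixes to merge to control the generated names, then rename the columns. The names date, price, ticker and maturity are only illustrative.

```python
import pandas as pd

# Single-row stand-ins with default integer column labels.
underlying = pd.DataFrame([[20130628, 3.70]])
options = pd.DataFrame([[20130628, "SVXY", 20130817]])

# suffixes replaces the default '_x' / '_y' on the overlapping column 1.
merged = underlying.merge(
    options, on=0, how="outer", suffixes=("_underlying", "_options")
)

# Give every column a meaningful name afterwards.
renamed = merged.rename(
    columns={0: "date", "1_underlying": "price", "1_options": "ticker", 2: "maturity"}
)
print(renamed)
```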
Before that, it is better to rename the columns to something more meaningful. When you read the CSV file, you can pass in a list of column names:
df = pd.read_csv('data.csv', names=['Id', 'Price'])
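A quick sketch of what names changes, using an inline string instead of data.csv (the two rows are taken from the question's sample output). Without names, header=None leaves integer labels; with names, the labels are the strings you pass.

```python
import io
import pandas as pd

csv_text = "20040326,3.579987\n20040329,3.690494\n"

# No header row and no names: columns are labelled 0 and 1.
df_default = pd.read_csv(io.StringIO(csv_text), header=None)

# Same data with explicit names.
df_named = pd.read_csv(io.StringIO(csv_text), names=["Id", "Price"])

print(df_default.columns.tolist())
print(df_named.columns.tolist())
```

With named columns, the merge key can then be given by name, for example on='Id', instead of by integer label.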