Reading a CSV file with line breaks between columns

import io
import pandas as pd

data = io.StringIO('''
a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22; 32; 42
x13; 23; 33; 43
x14; 24; 34; 44
x15; 25; 35; 45
''')
pd.read_csv(data, sep=';')

2 answers

User
Answer 1 · edited 2024-05-16 21:09:26

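This happens frequently with large CSV files. The way I solve it is to read the file with plain Python and check that the number of separators on each line matches what you expect; any line that does not match is dropped. Once the raw data has been corrected, you can load it into pandas through StringIO. An example based on your broken example: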
import io
import pandas as pd

# Load the file and keep only the lines with the expected number of separators
with open(filepath) as filestream:
    data = [line for line in filestream if line.count(";") == 3]

# Convert the kept lines to a StringIO stream (each line still ends with "\n")
stream = io.StringIO("".join(data))

# And finally read it with pandas
pd.read_csv(stream, sep=';')
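To see what this filter does on a small case, here is a minimal sketch on in-memory data. The sample text and the position of the stray newline are assumptions made up for illustration, and keep_complete_rows is a hypothetical helper name, not part of any library.

```python
import io

# Hypothetical sample: the stray newline inside the "x14" row is an assumed
# position, chosen only to illustrate the problem.
RAW = """a; b; c; d
x10; 20; 30; 40
x14; 24;
34; 44
x15; 25; 35; 45
"""


def keep_complete_rows(lines, sep=";", n_sep=3):
    """Return only the lines that contain exactly n_sep separators."""
    return [line for line in lines if line.count(sep) == n_sep]


# Iterating over a StringIO yields lines with their trailing "\n" intact,
# so the kept lines can be joined with an empty string.
kept = keep_complete_rows(io.StringIO(RAW))
cleaned = "".join(kept)
print(cleaned)
```

Both fragments of the broken row fail the separator check, so the whole x14 row is lost. The cleaned string can then be wrapped in io.StringIO and passed to pd.read_csv(..., sep=';').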

User
Answer 2 · edited 2024-05-16 21:09:26

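I took parts of ivallesp's answer and came up with a solution that keeps the broken lines instead of dropping them.
I am posting it here as documentation for my future self (who tends to forget these things) and for anyone else who runs into a similar problem.
Bad file, with a broken line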
import io
import re
import pandas as pd

infile = io.StringIO('''
a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22; 32; 42
x13; 23; 33; 43
x14; 24; 34; 44
x15; 25; 35; 45
''')

# Strip the whitespace from each line and join the lines with \n
data = '\n'.join([item.strip() for item in infile])

# data is now a plain string rather than a file stream, with \n between lines.
# Remove every newline that is NOT followed by x + two digits (the start of
# a row). The replacement is an empty string, so the stray newline is dropped.
data = re.sub(r'\n(?!x\d\d)', '', data)

# Turn data back into a file stream
data = io.StringIO(data)

# Feed it to pandas.read_csv
pd.read_csv(data, sep=';')
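As a quick check of what the negative lookahead does, the same substitution can be wrapped in a small function and run on a string whose break position is known. This is only a sketch: join_broken_rows and its row_start parameter are hypothetical names, and the sample string, including where the row is split, is invented for illustration.

```python
import re


def join_broken_rows(text, row_start=r"x\d\d"):
    """Remove every newline that is not followed by the start of a new row."""
    # Negative lookahead: a "\n" survives only when row_start matches right
    # after it. The pattern is a raw string so that "\d" reaches the regex
    # engine unchanged.
    pattern = re.compile(r"\n(?!(?:" + row_start + r"))")
    return pattern.sub("", text)


# Hypothetical sample with one row split after its second field.
raw = """a; b; c; d
x10; 20; 30; 40
x14; 24;
34; 44
x15; 25; 35; 45
"""

fixed = join_broken_rows(raw)
print(fixed)
```

The newline inside the x14 row is removed and the two fragments are glued back together. The trailing newline at the end of the text is removed as well, which is harmless for pd.read_csv. Passing row_start=r"\d\d:\d\d" would cover rows that start with a time such as 00:05.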
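Variations
For an actual file on disk (not io.StringIO) I had to make a small modification and remove the .strip(); I am not sure why. Apart from that, the lines can then be joined with an empty string (''.join(...)), since without .strip() each line keeps its own trailing newline.
Finally, my actual file has times in the first column, in the form 00:00, 00:05 and so on. So this is what I actually used: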
import io
import re
import pandas as pd

with open('broken_rows_file.csv', 'r') as infile:
    data = ''.join([item for item in infile])

# Every newline that is NOT followed by ##:## is removed
data = re.sub(r'\n(?!\d\d:\d\d)', '', data)

data = io.StringIO(data)
df = pd.read_csv(data, sep=';')
df

       a   b   c   d
0  00:10  20  30  40
1  00:11  21  31  41
2  00:12  22  32  42
3  00:13  23  33  43
4  00:14  24  34  44
5  00:15  25  35  45
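If the rows have no recognizable prefix for the lookahead to key on, a different technique (not the one used in this answer) is to count separators and keep gluing physical lines together until a full row has been assembled. A minimal sketch, assuming every row has exactly three ; separators; merge_split_rows and the sample data are hypothetical.

```python
import io


def merge_split_rows(lines, sep=";", n_sep=3):
    """Glue consecutive physical lines together until each logical row
    contains at least n_sep separators."""
    rows = []
    buffer = ""
    for line in lines:
        # Drop the line ending and append the fragment to the pending row.
        buffer += line.rstrip("\r\n")
        if buffer.count(sep) >= n_sep:
            rows.append(buffer)
            buffer = ""
    if buffer.strip():
        # Incomplete trailing fragment: kept as-is rather than silently lost.
        rows.append(buffer)
    return rows


# Hypothetical sample with one row split after its second field.
raw = """a; b; c; d
x10; 20; 30; 40
x14; 24;
34; 44
x15; 25; 35; 45
"""

rows = merge_split_rows(io.StringIO(raw))
cleaned = "\n".join(rows)
print(cleaned)
```

One limitation of this sketch: a break that falls after the last separator of a row cannot be detected by counting, because the first fragment already looks complete. In that case a lookahead on the row prefix remains the safer choice.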
