Extracting text from a binary file (using Python 2.7 on Windows 7)

^@^@^@^@^@^@^@^@^@^@^@BLLBBCC^X^X^X^X^X^X^X^X^X ^X^X^X MVT^M EA1123 TEXT TEXT TEXT^M END^M \xaa^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^@TTBBTT^X^X^X^X^X^X^X^X^X ^X^X^X blah blah blah... of control characters.. and then the message comes.. MVT MESSAGE 2 ED1123 etc.

import fileinput
import sys
import re

strfile = r'C:\Users\' \
          r'\Learn\python\mvt\sitatex_test.msgs'

f = open(strfile, 'rb')
contents = f.read() # read whole file in contents

#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)

for msg in msgs:
    #loop through msgs.. to find the first msg then next and so on.
    print "## NEW MESSAGE STARTS HERE ##"
    #for each msg split the lines.. to read line by line
    # stored as list in msglines
    msglines = msg.splitlines()
    line = 0

#then process each msgline with a message
for msgline in msglines:
    line += 1
    #msgline = re.sub(r'[\x00]+', r' ', msgline)
    mystr = msgline
    print mystr
    textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)

3 answers

User
Answer 1 · edited 2024-06-16 10:07:36

Python supports regular expressions too. I don't speak Perl, so I can't tell exactly what your Perl code does, but this Python program may help:
import re

# 'rb' reads the raw bytes, with no newline translation on Windows
with open('yourfile.pst', 'rb') as f:
    contents = f.read()

textstrings = re.findall(r'[\x20-\x7E]+', contents)
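This gives you a list of every string in the file that consists of one or more printable ASCII characters. That may not be exactly what you want, but you can tune it from there.
Note that if you are using Python 3, you have to worry about the distinction between binary data and text data, which makes things more complicated. I am assuming you are on Python 2.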
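To make the binary/text distinction concrete, here is a minimal sketch that applies the same printable-ASCII pattern to bytes, so it should behave the same on Python 2.7 and Python 3. The function name `printable_runs`, the `min_len` parameter and the sample bytes are illustrative assumptions, not part of the answer above.

```python
import re

# One or more printable ASCII bytes (space through tilde).
PRINTABLE = re.compile(br'[\x20-\x7E]+')


def printable_runs(data, min_len=1):
    """Return each run of printable ASCII bytes in data that is at least min_len long."""
    return [run for run in PRINTABLE.findall(data) if len(run) >= min_len]


# Hypothetical sample loosely modelled on the dump in the question.
sample = b'\x00\x00BLLBBCC\x18\x18 MVT\r EA1123 TEXT\r END\r \xaa\x00'
print(printable_runs(sample, min_len=3))
```

Raising `min_len` is a cheap way to drop one- or two-byte runs that are usually noise rather than message text.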

User
Answer 2 · edited 2024-06-16 10:07:36

You said:
Still I will need assistance to eliminate the list altogether but return just a string. like this
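In other words, you have foo = [some_string] and you are doing print foo, which as a side effect shows repr(some_string) wrapped in square brackets that you don't want. So just do print repr(foo[0]).
Several things appear to be unexplained:
You say the useful text is bracketed by \xaaU, but in the sample file there is only a single \xaa (with the U missing) near the start, and nothing else.
You say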
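A short sketch of the difference; `foo` and its contents are placeholders. Single-argument print() is used so the snippet runs on Python 3 as well as 2.7.

```python
foo = ['MVT\r EA1123 TEXT']   # findall always returns a list, even for a single match

as_list = repr(foo)           # square brackets around the repr of each element
as_item = repr(foo[0])        # only the repr of the string itself

print(as_list)
print(as_item)
```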
I have found out that re.findall(r'.+', line1) strips to ...
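What it actually strips is \n (but not \r!!). I would have thought line breaks were worth preserving when trying to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What have you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is meant to consume your output; you write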
I need to parse the text line by line and word by word
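but you seem overly concerned with printing the message with \xab instead of gibberish.
The last 6 lines or so of your latest code (for msgline in msglines: etc.) should be indented one level.
Is it possible to clarify all of the above?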
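The line-ending point can be checked with a small sketch. It compares the `.+` pattern with `str.splitlines()`, which treats `\r`, `\n` and `\r\n` alike; the variable names and the second sample string are assumptions made for illustration.

```python
import re

text = 'abc\r\ndef\r\n\r\n'

# '.' matches everything except '\n', so each '\r' stays attached to its line.
dot_lines = re.findall(r'.+', text)

# splitlines() recognises '\r', '\n' and '\r\n' as line endings and drops them.
split_lines = text.splitlines()

# Passing True keeps the original line endings on each piece.
kept = text.splitlines(True)

# The sample in the question uses bare '\r' (shown as ^M); that works too.
bare_cr = 'MVT\r EA1123 TEXT\r END\r'.splitlines()

print(dot_lines)
print(split_lines)
print(kept)
print(bare_cr)
```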

User
Answer 3 · edited 2024-06-16 10:07:36

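Q: How do I read the file? Binary and text are interspersed

A: Don't bother; just read it as ordinary text and you keep your binary/text dichotomy (otherwise you can't run a regex over it as easily)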

fh = open('/path/to/my/file.ext', 'r')
fh.read()

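If you later want to read the file as binary for some reason, just add a b to the second argument of open:

fh = open('/path/to/my/file.ext', 'rb')
fh.read()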

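Q: Eliminate the unwanted control characters

A: Use the Python re module. Your next question gets at how

Q: Parse the messages between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')

A: The re module has a findall function that works the way you would (mostly) expect.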

import re

mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
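As a variation on the findall call above, the following sketch splits on the delimiter instead of matching between a pair of delimiters, and works on bytes so it can be fed the result of a binary-mode read. It assumes the delimiter is the byte 0xAA optionally followed by `U` (both forms show up in the question), which would also swallow a genuine leading `U` in a message. The names `DELIM`, `CONTROL` and `split_messages`, and the sample bytes, are hypothetical.

```python
import re

# Assumed delimiter: byte 0xAA, optionally followed by 'U'.
DELIM = re.compile(br'\xaaU?')
# Control bytes to drop; '\n' (0x0A) and '\r' (0x0D) are kept so line breaks survive.
CONTROL = re.compile(br'[\x00-\x09\x0b\x0c\x0e-\x1f]+')


def split_messages(data):
    """Split data on the delimiter and return the non-empty, cleaned chunks."""
    chunks = DELIM.split(data)
    cleaned = [CONTROL.sub(b'', chunk).strip() for chunk in chunks]
    return [chunk for chunk in cleaned if chunk]


# Hypothetical two-message sample with padding bytes around each message.
data = b'\x00\x00 MVT\r EA1123\r END\r \xaaU\x00\x18 MVT\r ED1123\r END\r \xaa\x00'
messages = split_messages(data)
for message in messages:
    print(repr(message))
```

Splitting avoids the situation where one delimiter has to serve as both the end of one message and the start of the next.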

Q: Print out the required information

A: There is a print statement for that...

print usefultext

Q: Loop through all the lines... and more files.

fh = open('/some/file.ext', 'r')

for line in fh.readlines():
    pass  # do stuff with each line here

I'll leave it to you to figure out the os module for determining which files exist and how to iterate over them.
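For the file-iteration part, one possible sketch uses `os.listdir` and `os.path`. The function names and the `.msgs` suffix (borrowed from the file name in the question) are assumptions; files are opened in binary mode so the bytes arrive unchanged.

```python
import os


def iter_message_files(directory, suffix='.msgs'):
    """Yield the full path of each regular file in directory whose name ends with suffix."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.endswith(suffix):
            yield path


def read_all(directory, suffix='.msgs'):
    """Return a {path: bytes} dict holding the raw contents of each matching file."""
    contents = {}
    for path in iter_message_files(directory, suffix):
        with open(path, 'rb') as fh:   # binary mode: no newline translation
            contents[path] = fh.read()
    return contents
```

Calling `read_all` on a directory returns the raw bytes per file, which can then be handed to whatever regex-based extraction is in use. `os.walk` would be the natural replacement for `os.listdir` if subdirectories need to be covered as well.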
