Temporary in-memory file for doc-to-docx conversion in Python

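import os

import comtypes.client

# Convert the .doc file to .docx through Word automation (FileFormat=16 is wdFormatDocumentDefault, i.e. .docx)
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(input_file)
doc.SaveAs(output_file, FileFormat=16)
# docx_to_dataframe is the asker's own helper (not shown)
return_dataframe = docx_to_dataframe(output_file)
doc.Close()
word.Quit()
# Delete the intermediate .docx once it has been read
os.remove(output_file)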
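Word automation saves to a path on disk, so one way to keep the intermediate .docx short-lived is to place it in a throwaway directory that is removed automatically, even if reading the file raises. The following is only a sketch under assumptions: `temporary_docx_path` and `convert_and_read` are hypothetical names, and `convert` and `read` are placeholder callables standing in for the Word calls and `docx_to_dataframe` in the snippet above.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_docx_path():
    """Yield a path inside a throwaway directory that is deleted on exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        yield os.path.join(tmp_dir, "converted.docx")


def convert_and_read(input_file, convert, read):
    """Run convert(input_file, output_path), then return read(output_path).

    `convert` and `read` are hypothetical callables supplied by the caller.
    """
    with temporary_docx_path() as output_file:
        convert(input_file, output_file)
        # The result must be fully read before the directory is removed.
        return read(output_file)
```

With Word, `convert` would need to close the document before returning, since an open handle can prevent the directory from being removed on Windows.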

1 answer

User

#1 · Posted 2024-04-24 11:11:53

I had a similar use case, and this is the solution I came up with until I find a better one.

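I basically needed to 1) decode the document file from base64, 2) read the "file" in memory, which yields a mix of characters in Unicode, and 3) capture the text with a regular expression. This is how I did it: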

import base64
import re
from io import BytesIO

import olefile

# Retrieve the base64-encoded document and decode it into bytes (in this case it comes from a DataFrame row)
message = row['text']
text_bytes = message.encode('ascii')
decoded = base64.decodebytes(text_bytes)
# Keep the decoded file in memory instead of writing it to disk
buffer = BytesIO(decoded)
# Open the OLE container and read the raw WordDocument stream
ole = olefile.OleFileIO(buffer)
y = ole.openstream('WordDocument').read()
ole.close()
y = y.decode('latin-1', errors='ignore')
# Replace every character that is not a newline, printable ASCII (including spaces), or in the Latin ranges listed below with an asterisk. This can probably be shortened by combining it with the pattern used in the next step
y = re.sub(r'[^\x0A,\u00c0-\u00d6,\u00d8-\u00f6,\u00f8-\u02af,\u1d00-\u1d25,\u1d62-\u1d65,\u1d6b-\u1d77,\u1d79-\u1d9a,\u1e00-\u1eff,\u2090-\u2094,\u2184-\u2184,\u2488-\u2490,\u271d-\u271d,\u2c60-\u2c7c,\u2c7e-\u2c7f,\ua722-\ua76f,\ua771-\ua787,\ua78b-\ua78c,\ua7fb-\ua7ff,\ufb00-\ufb06,\x20-\x7E]', r'*', y)
# Isolate the body of the text from the rest of the gibberish
p = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')
matches = p.findall(y)
# Remove any asterisks left in the capture group
result = matches[0].replace('*', '')

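In my case I needed to make sure accented characters were not lost during decoding, because my documents are in English, Spanish, and Portuguese, so I chose to decode with latin-1. From there, I used a regex pattern to identify the text I wanted. After decoding, I found that in all of my documents the capture group is preceded by about 400 '*' characters and a ':'. I am not sure whether this holds for every document decoded this way, but I used it as a starting point for a regex pattern that isolates the desired text from the rest of the gibberish.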
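To make the isolation step easier to follow, here is a small sketch that runs the same pattern on a synthetic string rather than a real WordDocument stream. The helper name `extract_body` and the sample text are made up for illustration; the only thing taken from the answer is the regular expression itself. Returning None when nothing matches is an added assumption, since indexing the first match would fail on a document that does not follow the ~400-asterisk layout.

```python
import re

# Pattern copied from the answer: a run of 300-433 asterisks, a lazily
# captured body, then a closing run of at least 15 asterisks.
BODY_PATTERN = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')


def extract_body(masked_text):
    """Return the first captured body without stray asterisks, or None."""
    match = BODY_PATTERN.search(masked_text)
    if match is None:
        return None
    return match.group(1).replace('*', '')


# Synthetic stand-in for a decoded and masked stream (not real Word output).
sample = 'junk' + '*' * 400 + 'Olá, mundo*!' + '*' * 20 + 'more junk'
print(extract_body(sample))  # prints: Olá, mundo!
```

The single asterisk inside the body survives the match because the negative lookahead only stops at a long run of asterisks; it is then stripped by the final `replace`.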
