Reading a .docx file to extract the text along with its font and other formatting information

1 answer

User

#1 · Posted 2024-04-26 02:15:19

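For DocX, it is best to use VBA to collect the details.

However, one "potential" alternative might be to simply strip any style overrides by exporting from WordPad to basic RTF, then look at the redefined characteristics of the target block.

Note: depending on the conversion, this may not be 100% reliable for your goal.

Although we can use WordPad to convert DocX to PDF from the command line, we cannot convert DocX to RTF without a VBS macro, but that is another question.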

From the header we can see CodePage=1252 and 2057 = English (United Kingdom) :-)
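
The header itself is not reproduced in this answer, so the following is only a sketch under assumptions: `SAMPLE_HEADER` is an assumed WordPad-style header, and the helper `read_header_ids` and the small `LCID_NAMES` table are hypothetical names made up for illustration. It shows one way to pull the `\ansicpg` code page and the `\deflang` language ID out of such a header in Python:

```python
import re

# Assumed example of a WordPad-style RTF header; NOT taken from the original file.
SAMPLE_HEADER = r"{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}}"

# Tiny illustrative lookup; real files can carry many other language IDs.
LCID_NAMES = {1033: "English (United States)", 2057: "English (United Kingdom)"}


def read_header_ids(header: str) -> dict:
    """Return the code page and default language ID found in an RTF header."""
    codepage = re.search(r"\\ansicpg(\d+)", header)
    deflang = re.search(r"\\deflang(\d+)", header)
    return {
        "codepage": int(codepage.group(1)) if codepage else None,
        "deflang": int(deflang.group(1)) if deflang else None,
    }


ids = read_header_ids(SAMPLE_HEADER)
print(ids["codepage"], LCID_NAMES.get(ids["deflang"], "unknown"))
```

The code page matters when decoding any escaped bytes in the body, while the language ID only tags the text for proofing, so the two values are kept separate here.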

Reading it by eye:

\b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par

\b - Is the start of Bold
\f0 - Font 0 in the header's font table, Calibri in this file (BEWARE: here 0 is an index, NOT a stop)
\fs24 - Font size in half-points, so the text here is 12 point
\lang9 - Language ID for the text that follows; 9 should be the generic English ID (I forgot this at first, corrections welcome in the comments :-)
 Hello - Has both a leading and a trailing space (the leading one only ends the control word and is to be ignored)
\b0 - My BAD, bold STOPS here, AFTER the space between the words
\i - Start italics (ignore the space before World)
\ul - Start underlining
\i0 - Stop italics (ignore the space before !)
\ulnone - Stop underline (don't ask me why not \ul0)
\fs22 - I will let you guess the default page font height but by now you know it is not 22

\par - THE END, "That's all Folks!" ™
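
The walk-through above can be sketched as a small state machine in Python. This is a minimal illustration, not a full RTF reader: it assumes a flat fragment with no `{}` groups, the names `TOKEN` and `rtf_runs` are hypothetical, and it only tracks the control words discussed here.

```python
import re

# One token is either a control word (\name, optional numeric argument,
# one optional delimiter space) or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([^\\{}]+)")


def rtf_runs(fragment: str):
    """Yield (text, state) pairs for a flat RTF fragment without groups."""
    state = {"bold": False, "italic": False, "underline": False, "size_pt": None}
    for word, arg, text in TOKEN.findall(fragment):
        if text:
            yield text, dict(state)  # copy, so later toggles do not alter it
        elif word == "b":
            state["bold"] = arg != "0"
        elif word == "i":
            state["italic"] = arg != "0"
        elif word == "ul":
            state["underline"] = arg != "0"
        elif word == "ulnone":
            state["underline"] = False
        elif word == "fs" and arg:
            state["size_pt"] = int(arg) / 2  # \fsN is given in half-points
        # \f, \lang, \par and anything else are ignored in this sketch


line = r"\b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par"
for text, state in rtf_runs(line):
    print(repr(text), state)
```

The single space after a control word is consumed as its delimiter, which is why "Hello " keeps only its trailing space and is reported as bold at 12 point, "World" as italic, and "!" as underlined.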

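P.S.

I revisited the source and made 2 corrections; see if you can work out the two changes. "My" clue to the second one is above, but it can easily trip you up when using regular expressions.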

\b\f0\fs22\lang9 Hello,\i \b0 World\ul\i0 !\ulnone\par

whereas ideally it should be

\b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par
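
One way the regular-expression trap can bite, sketched in Python: inside a pattern, `\b` is a word-boundary assertion rather than the two characters backslash and b, and in a non-raw Python string literal `"\b"` is a backspace. The control word therefore needs an escaped backslash in a raw string, and the pattern has to tell `\b` from `\b0`. The helper name `bold_toggles` is hypothetical, and this is only one possible reading of the hint.

```python
import re

line = r"\b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par"

# Unescaped, \b is a regex word boundary: it matches positions only,
# never the RTF control word itself.
print(set(re.findall(r"\b", line)))  # {''}

# Escaped backslash, the exact word "b" (the lookahead rejects longer control
# words that merely start with "b"), then the optional numeric argument.
BOLD = re.compile(r"\\b(?![a-z])(-?\d+)?")


def bold_toggles(fragment: str) -> list:
    """Return 'on' or 'off' for each bold control word, in document order."""
    return ["off" if m.group(1) == "0" else "on" for m in BOLD.finditer(fragment)]


print(bold_toggles(line))  # ['on', 'off']
```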
