How can I extract the comment from a header file using Python, Perl, or sed?

3 votes
3 answers
2407 views
Asked 2025-04-15 23:02

I have a header file like this:

/*
 * APP 180-2 ALG-254/258/772 implementation
 * Last update: 03/01/2006
 * Issue date:  08/22/2004
 *
 * Copyright (C) 2006 Somebody's Name here
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the project nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#ifndef HEADER_H
#define HEADER_H

/* More comments and C++ code here. */

#endif /* End of file. */

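I want to extract only the contents of the first C-style comment and strip the leading " *" from each line, so that I end up with a file containing the following: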

 APP 180-2 ALG-254/258/772 implementation
 Last update: 03/01/2006
 Issue date:  08/22/2004

 Copyright (C) 2006 Somebody's Name here
 All rights reserved.

 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
 are met:
 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. Neither the name of the project nor the names of its contributors
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

 THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.  IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
 FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

Please suggest a simple way to do this with Python, Perl, sed, or any other tool available on Unix. Ideally it would be a one-liner.

3 Answers

-1
sed -i -r "s/[\/\ ]{1}\*[\/\ ]?//g" YOURFILENAME

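This strips the comment markers from your file and leaves everything else untouched. Be aware that it modifies the file (YOURFILENAME) in place; to leave the file alone and print the result to standard output instead, drop the -i option.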
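
To see what that substitution does on individual lines, here is a rough Python sketch of the same regex. The helper name `strip_markers` is an illustrative assumption, and the pattern is a Python approximation of the sed expression rather than an exact port.

```python
import re

# Python approximation of the sed pattern: a slash or a space, then '*',
# then an optional slash or space.
MARKER = re.compile(r"[/ ]\*[/ ]?")

def strip_markers(line: str) -> str:
    """Remove every comment-marker-like sequence from one line."""
    return MARKER.sub("", line)

# A few sample lines, including one that is not a comment at all.
for sample in ["/*", " * Last update: 03/01/2006", " */",
               "#endif /* End of file. */", "int *p;"]:
    print(repr(strip_markers(sample)))
```

The substitution is applied to every line of the file, so later comments and pointer declarations such as `int *p;` are rewritten as well. It does not isolate the first comment on its own.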

4

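Pyparsing ships with patterns for matching the comment formats of various programming languages. Use cStyleComment together with scanString to find the first comment in the source file; the rest is just a few string-handling calls.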

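from pyparsing import cStyleComment

# c_source_file is the path to the header, e.g. "header.h"
with open(c_source_file) as f:
    c_src = f.read()

# scanString yields (tokens, start, end) tuples; take only the first match
cmt = next(cStyleComment.scanString(c_src))[0][0]
# drop the first three characters (" * ") of every line
lines = [l[3:] for l in cmt.splitlines()]
print('\n'.join(lines))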

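scanString is a generator: it yields each match as soon as it is found, before searching for the next one, so only the first comment is ever processed. For your sample, this returns: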

APP 180-2 ALG-254/258/772 implementation 
Last update: 03/01/2006 
Issue date:  08/22/2004 

Copyright (C) 2006 Somebody's Name here 
All rights reserved. 

Redistribution and use in source and binary forms, with or without 
modification, are permitted provided that the following conditions 
are met: 
1. Redistributions of source code must retain the above copyright 
   notice, this list of conditions and the following disclaimer. 
2. Redistributions in binary form must reproduce the above copyright 
   notice, this list of conditions and the following disclaimer in the 
   documentation and/or other materials provided with the distribution. 
3. Neither the name of the project nor the names of its contributors 
   may be used to endorse or promote products derived from this software 
   without specific prior written permission. 

THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND 
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 
ARE DISCLAIMED.  IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE 
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 
SUCH DAMAGE. 
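
If installing pyparsing is not an option, the same idea can be sketched with the standard library alone. This is an alternative under stated assumptions, not the pyparsing approach above: the function name `first_c_comment` is made up for illustration, and the regex does not understand string literals that happen to contain `/*`.

```python
import re

# Non-greedy match of the first /* ... */ block; DOTALL lets '.' span newlines.
C_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def first_c_comment(source: str) -> str:
    """Return the body of the first C-style comment without the ' * ' prefix."""
    match = C_COMMENT.search(source)  # search() stops at the first match
    if match is None:
        return ""
    cleaned = []
    for line in match.group(1).splitlines():
        # Drop leading whitespace, one optional '*', and one following space.
        cleaned.append(re.sub(r"^\s*\*? ?", "", line).rstrip())
    # Trim the empty lines left over from the "/*" and " */" lines.
    return "\n".join(cleaned).strip("\n")
```

Like the generator in the pyparsing version, `search` returns as soon as the first comment is found, so later comments in the file are never examined.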

5

This should do it:

sed -n '/\*\//q; /^\/\*/d; s/^ \* \?//p' <file.h >comment.txt

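Here is how it works. sed (which you have probably heard of) is a command that processes a file line by line and applies a series of rules to each line. Each rule consists of a "selector" and some commands, and the commands are applied to a line only when the selector matches it.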

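The first rule's selector is /\*\//. This is a regular-expression selector that matches any line containing the characters */. Both characters have to be escaped with a backslash: the asterisk because it has a special meaning in regular expressions, and the slash because it delimits the pattern. (I am assuming this matches only the closing line of your comment, and that the whole line should be dropped.) The command is q, meaning "quit", so sed stops processing. Normally it would print the line first, but I passed the -n option, which means "do not print unless explicitly told to."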

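The second rule's selector is /^\/\*/, again a regular-expression selector; it matches the characters /* at the start of a line. As before, I am assuming this line contains no part of the comment text. The d command tells sed to delete the line and move on to the next one.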

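The last rule has no selector, so it applies to every line (unless an earlier command prevented processing from reaching it). Its command is the substitution command s/PATTERN/REPLACEMENT/, which looks for text matching a pattern and replaces it with the replacement text. Here the pattern is ^ \* \?, which matches a space, an asterisk, and zero or one space at the start of the line. The replacement is empty, so sed simply deletes the leading space-asterisk-(space)? sequence. The p is a flag on the substitution command that tells sed to print the result of the substitution. It is required because of the -n option.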
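
The three rules can be mirrored almost one-to-one in Python, which may make the control flow easier to follow. This is only a sketch: the function name `extract_first_comment` is hypothetical, and it relies on the same assumptions as the sed script (the opening /* and the closing */ each sit on a line of their own).

```python
import re
from typing import Iterable, List

def extract_first_comment(lines: Iterable[str]) -> List[str]:
    """Apply the sed script's three rules to each line, in order."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if "*/" in line:           # rule 1: /\*\//q  -> stop processing
            break
        if line.startswith("/*"):  # rule 2: /^\/\*/d -> skip this line
            continue
        # rule 3: s/^ \* \?//p -> strip " *" plus an optional space.
        # Like the p flag, keep the line only if the substitution matched.
        stripped, count = re.subn(r"^ \* ?", "", line)
        if count:
            out.append(stripped)
    return out
```

Passing an open file object works because the function accepts any iterable of lines; joining the returned list with newlines gives the text that the sed command writes to comment.txt.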
