How to write a find-all function (using regular expressions) in awk or sed

2 votes

2 answers

958 views

Asked 2025-04-16 04:07

I have a bash function that runs a Python program, which prints every match of a regular expression found in standard input.

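# Print every match of the regex in $1 found on standard input.
# The parenthesized print works under both Python 2 and Python 3.
function find-all() {
    python -c "import re
import sys
print('\n'.join(re.findall('$1', sys.stdin.read())))"
}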

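When I run it with a regex, for example find-all 'href="([^"]*)"' < index.html, it should return the first group of the regex (that is, the values of the href attributes in index.html).

How can I write this with sed or awk?

2 Answers

Here is an example implemented in gawk (not tested with other awk implementations), saved as find_all.sh: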

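awk -v "patt=$1" '
    # Print every match of patt in str; a and i are local (extra parameters).
    # The three-argument match() is a gawk extension.
    function find_all(str, patt,    a, i) {
        while (str != "" && match(str, patt, a) > 0) {
            # a[0] is the whole match; a[1], a[2], ... are the captured groups
            for (i=0; i in a; i++) print a[i]
            # skip past the match, moving at least one character on an empty match
            str = substr(str, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        }
    }
    $0 ~ patt {find_all($0, patt)}
' -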

Then:

echo 'asdf href="href1" asdf asdf href="href2" asdfasdf
asdfasdfasdfasdf href="href3" asdfasdfasdf' | 
find_all.sh 'href="([^"]+)"'

The output is:

href="href1"
href1
href="href2"
href2
href="href3"
href3

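If you only want to print the captured groups, change i=0 to i=1. With i=0 you get output (the whole match) even when the pattern contains no parentheses.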
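For awk implementations that lack gawk's three-argument match(), a portable sketch can still print each whole match (the a[0] case described above) by using RSTART and RLENGTH. Capture groups are not available this way. The function name find_matches is hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper: print every whole match of the ERE in $1 found on stdin.
# Uses only POSIX awk features: two-argument match(), RSTART and RLENGTH.
find_matches() {
    awk -v patt="$1" '
    {
        str = $0
        # Stop on an empty remainder or an empty match so the loop always ends.
        while (str != "" && match(str, patt) > 0 && RLENGTH > 0) {
            print substr(str, RSTART, RLENGTH)
            # Continue searching after the end of this match.
            str = substr(str, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'a href="x" b href="y"' 'c href="z"' | find_matches 'href="([^"]+)"'
```

The sample call prints href="x", href="y" and href="z", one per line.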

Answered 2025-04-16 by Python大师


I suggest using grep -o.

-o, --only-matching
       Show only the part of a matching line that matches PATTERN.

For example:

$ cat > foo
test test test
test
bar
baz test
$ grep -o test foo
test
test
test
test
test

Update

If you want to extract the href attributes from an HTML file, you can use a command like this:

$ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html
href="style.css"
href="iehacks.css"
href="old/"

You can then extract the values with cut and sed, like this:

$ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html| cut -f2 -d'=' | sed -e 's/"//g'
style.css
iehacks.css
old/
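One caveat with the cut step: -d'=' splits on every =, so a value such as page?id=1 would be cut short. As a sketch, assuming each line printed by grep -o has the form href="...", sed alone can strip the wrapper instead. The function name href_values is hypothetical.

```shell
# Hypothetical helper: print the value of every href="..." attribute on stdin.
href_values() {
    # grep -o emits one href="..." per line; sed then removes the
    # leading href=" and the trailing quote, leaving only the value.
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

printf '%s\n' '<a href="style.css">x</a> <a href="page?id=1">y</a>' | href_values
```

The sample call prints style.css and page?id=1, one per line.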

For something more robust, though, it is better to use an HTML/XML parser.

Answered 2025-04-16 by Python大师
