Python: extracting numbers that follow a keyword in HTML text

line1 = " The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over."
line2 = " The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."

3 answers

User

#1 · edited 2024-06-02 08:12:14

For line2, findall returns more than three matches, but you are trying to unpack them into only three variables.

Use something like this:

householdIncome, familyIncome, perCapitalIncome = re.findall(r"\d+,\d+", line1)[:3]
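A minimal sketch of why the unpacking fails and what the slice changes. The two strings are shortened copies of the sentences from the question (the shortening is an assumption made here for brevity; the dollar amounts are unchanged):

```python
import re

# Shortened copies of the question's sentences; dollar amounts are unchanged.
line1 = ("The median income for a household in the city was $64,411, and the "
         "median income for a family was $78,940. The per capita income for "
         "the city was $22,466.")
line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")

pattern = r"\d+,\d+"

matches1 = re.findall(pattern, line1)   # three matches
matches2 = re.findall(pattern, line2)   # five matches

try:
    a, b, c = matches2                  # five values into three names
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

first_three = matches2[:3]              # slicing lets the unpacking succeed

print(matches1)
print(matches2)
print(unpack_error)
print(first_three)
```

Note that the slice only fixes the count: for line2 the third value it keeps is the male median income, not the per capita income.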

User

#2 · edited 2024-06-02 08:12:14

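re.findall(r"\d+,\d+", line2) returns ['31,893', '38,508', '30,076', '20,275', '16,336']. So the immediate problem is that the regular expression yields five results while you allow for only three. There is a slightly deeper problem, though. Looking at the two sentences, I see that they have different structures. In the first, household income, family income, and per capita income are the first three matches, but in the second the third match is the male median income, not the per capita income. My point is that you need a somewhat more sophisticated analysis of the sentence.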
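One way to make the extraction independent of sentence order is to search for each keyword and take the first dollar amount that follows it. This is only a sketch: the helper name `amount_after` and the keyword list are hypothetical, and it assumes each keyword is followed by its own amount before any other dollar figure appears.

```python
import re

def amount_after(keyword, text):
    """Return the first dollar amount after `keyword` as an int, or None."""
    # [^$]*? lazily skips everything up to the next dollar sign;
    # \d+(?:,\d{3})* matches digits with optional thousands groups.
    m = re.search(re.escape(keyword) + r"[^$]*?\$(\d+(?:,\d{3})*)", text)
    if m is None:
        return None
    return int(m.group(1).replace(",", ""))

# Shortened copy of line2 from the question; dollar amounts are unchanged.
line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")

incomes = {k: amount_after(k, line2)
           for k in ("household", "family", "per capita")}
print(incomes)
```

Because each value is tied to its keyword, the male and female figures in the middle of the sentence no longer shift the result.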

User

#3 · edited 2024-06-02 08:12:14

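As others have pointed out, you will need some additional programming logic. Consider the following example, which uses a regular expression to find the relevant values and, when a second amount follows after "versus", averages the two: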

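import locale
import re
from locale import atoi

# atoi() parses "64,411" using the active locale's thousands separator;
# setlocale raises locale.Error if 'en_US.UTF-8' is not installed.
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')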

lines = ["The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over.",
"The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."]

# define the regex
rx = re.compile(r'''
        (?P<type>household|family|per\ capita)
        \D+
        \$(?P<amount>\d[\d,]*\d)
        (?:
            \s+versus\s+
            \$(?P<amount2>\d[\d,]*\d)
        )?''', re.VERBOSE)

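def afterwork(match):
    # If a second amount follows "versus", return the mean of both (a float);
    # otherwise return the single amount as an int.
    if match.group('amount2'):
        amount = (atoi(match.group('amount')) + atoi(match.group('amount2'))) / 2
    else:
        amount = atoi(match.group('amount'))
    return amount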

result = {}
for index, line in enumerate(lines):
    result['line' + str(index)] = [(m.group('type'), afterwork(m)) for m in rx.finditer(line)]

print(result)
# Output on Python 3.7+, where dicts keep insertion order:
# {'line0': [('household', 64411), ('family', 78940), ('per capita', 22466)], 'line1': [('household', 31893), ('family', 38508), ('per capita', 16336)]}
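In both sample sentences the optional "versus" group never takes part in a match: the male/female figures follow the word "Males", which is not one of the three `type` alternatives, so that sentence is skipped. The sketch below runs the same pattern on a made-up sentence (hypothetical, not from the question) to show what the averaging branch does when it does apply. It swaps `locale.atoi` for a plain `str.replace` (the helper `to_int` is introduced here) so that it runs without the `en_US.UTF-8` locale installed.

```python
import re

# Same pattern as in the answer above.
rx = re.compile(r'''
        (?P<type>household|family|per\ capita)
        \D+
        \$(?P<amount>\d[\d,]*\d)
        (?:
            \s+versus\s+
            \$(?P<amount2>\d[\d,]*\d)
        )?''', re.VERBOSE)

def to_int(text):
    # Locale-independent stand-in for locale.atoi: drop thousands separators.
    return int(text.replace(",", ""))

def afterwork(match):
    # Average the two amounts when the "versus" group matched.
    if match.group("amount2"):
        return (to_int(match.group("amount")) + to_int(match.group("amount2"))) / 2
    return to_int(match.group("amount"))

# Hypothetical sentence, invented for illustration only.
sample = "The median income for a family was $30,076 versus $20,275."

pairs = [(m.group("type"), afterwork(m)) for m in rx.finditer(sample)]
print(pairs)
```

Here the result is a float, the arithmetic mean of the two amounts, whereas a single amount stays an int.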
