Iterating over a list of 73033 elements containing HTML tags and extracting the text from it

NewSoups = [BeautifulSoup(NewR) for NewR in NewRs]
captions = [soup.find_all("div", class_="photocaption") for soup in NewSoups]

flattened_captions = []
for x in captions:
    for y in x:
        flattened_captions.append(y)
print(len(flattened_captions))  # 73033

import re
results = [re.sub('<[^>]*>', '', y) for y in flattened_captions]  # where the error comes from

Traceback (most recent call last):
  File "picked.py", line 22, in <module>
    results = [re.sub('<[^>]*>', '', y) for y in flattened_captions]
  File "/opt/conda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

1 Answer

User

#1 · Posted on 2024-06-11 00:06:57

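What I am posting here is not the most elegant or efficient way to solve the problem in the question. As Welbog pointed out, BeautifulSoup itself provides a way to extract the text. However, since I hit this error when I posted the original question, I was curious where it came from. It turns out that the items in flattened_captions are not strings: find_all returns BeautifulSoup Tag objects, and re.sub only accepts a string or buffer. This is easy to fix by converting each item to a string first. Here is how.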

str_flattened_captions = [str(caption) for caption in flattened_captions]

gains = [re.sub('<[^>]*>', '', item) for item in str_flattened_captions]

Test

print(gains[:5])
r Barbara Schorr ', ' Architect Joan Dineen with Alyson Liss ', ' Author/Designer Carleton Varney with Jim Druckman ', ' Designers Richard Cerrone, Lisa Hyman and Rhonda Eleish (front) in their room called "Holiday Nod To Nature" ']
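
For comparison, here is a minimal sketch of the BeautifulSoup route that Welbog's comment alludes to, using get_text() instead of str() plus a regex. The sample_html string is made up for illustration and stands in for one of the downloaded pages; it also assumes the bs4 package (BeautifulSoup 4) and passes the built-in "html.parser" explicitly, which the original code does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one downloaded page (not from the original post).
sample_html = """
<div class="photocaption"><b>Architect</b> Joan Dineen with Alyson Liss</div>
<div class="photocaption">Author/Designer <i>Carleton Varney</i> with Jim Druckman</div>
<div class="other">not a caption</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
captions = soup.find_all("div", class_="photocaption")

# Each element is a bs4 Tag, not a string, which is why re.sub() rejected it.
print(type(captions[0]).__name__)

# get_text() returns the text content of the tag with the markup removed,
# so no str() conversion or regular expression is needed.
texts = [caption.get_text() for caption in captions]
print(texts)
```

If the captions carry stray whitespace, as the output above suggests, get_text(strip=True) or a plain .strip() on each result should tidy that up.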
