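I am trying to parse the text portion of an SEC EDGAR full-text filing in Python 3, for example: https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt
My goal is to count how many times certain keywords occur in the visible body text of the 10-K filing and save the counts to a dictionary (i.e., I am not interested in any tables, exhibits, etc.).
I am new to Python and would greatly appreciate any help.
This is what I have written so far, but the code does not return the correct number of occurrences and does not capture the main body text that is visible to the end user: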
import requests
from bs4 import BeautifulSoup

# this part I would like to change such that it only collects words visible to the normal user in the page (is that the body?)
def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    print('*'*20)
    return len(words)

def main():
    url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
    word_list = ['assets']
    for word in word_list:
        count = count_words(url, word)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))
        print('--'*20)

# this part I don't understand
if __name__ == '__main__':
    main()
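
Two things in the code above likely explain the wrong count: `soup.find` returns only the first matching text node, and `len(words)` is then the number of characters in that node, not the number of occurrences. A minimal sketch of one possible approach follows. It assumes the full-text file wraps each document in `<DOCUMENT>` ... `</DOCUMENT>` with a `<TYPE>` line and a `<TEXT>` ... `</TEXT>` body, that "visible text" means everything outside `<table>`, `<script>` and `<style>`, and that a keyword should match whole words case-insensitively. The names `VisibleText`, `extract_main_text` and `count_keywords` are hypothetical, and the standard library's `html.parser` is used in place of BeautifulSoup so the sketch runs without extra packages.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed layout of an EDGAR full-text submission file.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


class VisibleText(HTMLParser):
    """Collect text that is not inside <table>, <script> or <style>."""

    SKIP = {"table", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # how many skipped elements we are currently inside
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def extract_main_text(raw, form_type="10-K"):
    """Return the visible text of the first document whose <TYPE> equals form_type."""
    for doc in DOC_RE.findall(raw):
        doc_type = TYPE_RE.search(doc)
        if not doc_type or doc_type.group(1).upper() != form_type.upper():
            continue  # skips exhibits (EX-...), graphics, XBRL, and so on
        body = TEXT_RE.search(doc)
        parser = VisibleText()
        parser.feed(body.group(1) if body else doc)
        parser.close()  # flush any buffered text
        return " ".join(parser.parts)
    return ""


def count_keywords(text, keywords):
    """Return {keyword: number of whole-word, case-insensitive matches}."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {kw: tokens[kw.lower()] for kw in keywords}
```

To use it with the original script, pass `r.text` from `requests.get(...)` to `extract_main_text` and then call `count_keywords(text, word_list)`. SEC's servers may reject requests that do not send a descriptive `User-Agent` header, so supplying one via the `headers` argument is probably needed. As for the last two lines of the script: `if __name__ == '__main__':` runs `main()` only when the file is executed directly, not when it is imported as a module.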