我一直在尝试使用BeautifulSoup制作一个简单的函数,从一个名为Tangorin的网站上抽取所有的例句,并给出一本日语书。我试着写了两种不同类型的函数来提取数据,但我就是无法让它工作。对不起,这是一个很长的问题,但我尝试了一些不起作用的东西,我只编写了3周的代码。我尝试提取的数据结构如下所示:
下面是这个单词页面上一个句子的数据结构英語 作为搜索词,网页https://tangorin.com/sentences?search=英語
我试着把页面上所有的句子都拉到一个单独句子的列表中。每个句子都在下面这样的一个块中,我突出显示了我试图寻找的关键信息
**<div class="entry entry-border sentences undefined ">**
<div class="entry-menu-wrap">
<button class="entry-menu-btn btn">
<svg class="icon" role="img" alt="" width="1em" height="1em" viewBox="0 0 50 50">
<use xlink:href="#icon-chevron-down">
</use>
</svg>
</button>
</div>
<dt class="s-jp" lang="ja">
<a href="/words?search=話 %23n">
<ruby>
**話**
<rt class="roma">hanashi</rt>
</ruby>
</a>
<a href="/words?search=を %23part">
<ruby>
**を**
<rt class="roma">wo</rt>
</ruby>
</a>
<a href="/words?search=為る %23v">
<ruby>
**する**
<rt class="roma">suru</rt>
</ruby>
</a>
<a href="/words?search=事 %23n">
<ruby>
**こと**
<rt class="roma">koto</rt>
</ruby>
</a>
<a href="/words?search=で %23part">
<ruby>
**で**
<rt class="roma">de</rt>
</ruby>
</a>
<a href="/words?search=自分 %23n">
<ruby>
**自分**
<rt class="roma">jibun</rt>
</ruby>
</a>
<a href="/words?search=を %23part">
<ruby>
**を**
<rt class="roma">wo</rt>
</ruby>
</a>
<ruby>
**曝け出す**
<rt class="roma">曝kedasu</rt>
</ruby>
<a href="/words?search=事 %23n">
<ruby>
**こと**
<rt class="roma">koto</rt>
</ruby>
</a>
<a href="/words?search=を %23part">
<ruby>
**を**
<rt class="roma">wo</rt>
</ruby>
</a>
<a href="/words?search=恐れる %23v">
<ruby>
**恐れず**
<rt class="roma">osorezu</rt>
</ruby>
</a>
**、**
<a href="/words?search=英語 %23n">
<mark>
<ruby>
**英語**
<rt class="roma">eigo</rt>
</ruby>
</mark>
</a>
<a href="/words?search=で %23part">
<ruby>
**で**
<rt class="roma">de</rt>
</ruby>
</a>
<a href="/words?search=他人 %23n">
<ruby>
**他人**
<rt class="roma">tanin</rt>
</ruby>
</a>
<a href="/words?search=と %23part">
<ruby>
**と**
<rt class="roma">to</rt>
</ruby>
</a>
<a href="/words?search=喋る %23v">
<ruby>
**しゃべる**
<rt class="roma">shaberu</rt>
</ruby>
</a>
<a href="/words?search=有らゆる %23pn-adj">
<ruby>
**あらゆる**
<rt class="roma">arayuru</rt>
</ruby>
</a>
<a href="/words?search=機会 %23n">
<ruby>
**機会**
<rt class="roma">kikai</rt>
</ruby>
</a>
<a href="/words?search=を %23part">
<ruby>
**を**
<rt class="roma">wo</rt>
</ruby>
</a>
<a href="/words?search=捕らえる %23v">
<ruby>
**とらえ**
<rt class="roma">torae</rt>
</ruby>
</a>
<a href="/words?search=なさる %23v">
<ruby>
**なさい**
<rt class="roma">nasai</rt>
</ruby>
</a>
**。**
<a href="/words?search=そうすれば %23adv">
<ruby>
**そうすれば**
<rt class="roma">sousureba</rt>
</ruby>
</a>
<a href="/words?search=直に %23n">
<ruby>
**じきに**
<rt class="roma">jikini</rt>
</ruby>
</a>
<a href="/words?search=形式張る %23n">
<ruby>
**形式張らない**
<rt class="roma">keishikiharanai</rt>
</ruby>
</a>
<a href="/words?search=会話 %23n">
<ruby>
**会話**
<rt class="roma">kaiwa</rt>
</ruby>
</a>
<a href="/words?search=の %23part">
<ruby>
**の**
<rt class="roma">no</rt>
</ruby>
</a>
<a href="/words?search=場面 %23n">
<ruby>
**場面**
<rt class="roma">bamen</rt>
</ruby>
</a>
<a href="/words?search=で %23part">
<ruby>
**で**
<rt class="roma">de</rt>
</ruby>
</a>
<a href="/words?search=気楽 %23n">
<ruby>
**気楽**
<rt class="roma">kiraku</rt>
</ruby>
</a>
<a href="/words?search=に %23part">
<ruby>
**に**
<rt class="roma">ni</rt>
</ruby>
</a>
<a href="/words?search=慣れる %23v">
<ruby>
**なれる**
<rt class="roma">nareru</rt>
</ruby>
</a>
<a href="/words?search=である %23aux-v">
<ruby>
**であろう**
<rt class="roma">dearou</rt>
</ruby>
</a>
**。**
</dt>
因此,问题是在类内部,我想要的数据以3种方式之一存储在a=href超链接中,然后进一步以ruby缩写形式存储。或者使用不在a=href中的ruby缩写,或者有时在“、”、“?”等的纯文本行中
因此,在朋友的帮助下,我使用Beauty Soup编写了以下代码:
# all of your sentences from anki deck
# also new sentences will go here
ALL_SENTENCES = set()
# This piece of code returns true if char is in the set of all roman letters
# and false if not
def is_english(char):
lower_case = ord("a") <= ord(char) <= ord("z")
upper_case = ord("A") <= ord(char) <= ord("Z")
return lower_case or upper_case
# This piece of code is to take a random Japanese word from a file and
# generate a Tangorin URL to the page of example
# sentencese for that word
def make_url(word):
return f"https://tangorin.com/sentences?search={word}"
# This function searches through all of the descendants that have been added into the total
def filter_jap(sentences):
jap_only = [
[word for word in sentence if not is_english(word[0])] for sentence in sentences
]
for sentence in jap_only:
as_string = "".join(sentence) + "\n"
print(as_string)
def get_random_sentence(all_sentences):
return random.choice(all_sentences)
def get_example_sentences(word):
url = make_url(word)
source = requests.get(url).text
soup = BeautifulSoup(source, "lxml")
all_sentences = []
curr_sentence = ""
for sentence in soup.findAll(
"div", class_="entry entry-border sentences undefined"
):
character_blocks = sentence.dt
for desc in character_blocks.descendants:
# end of sentence detected, add curr sentence to all sentence list
# and reset curr sentence
if desc == "。":
all_sentences.append(curr_sentence)
curr_sentence = ""
# if character is non-english (japanese) add it to current sentence
elif type(desc) == NavigableString and not is_english(desc[0]):
curr_sentence += desc
return all_sentences
def gen_example_sentence(word):
all_sentences = get_example_sentences(word)
random_choice = get_random_sentence(all_sentences)
return(random_choice)
with open('Japanese Words.txt', 'r', encoding="utf-8") as f:
for line in f:
x = gen_example_sentence(line)
print(x)`
此代码的问题在于遇到如下块时:
<ruby>
**曝け出す**
<rt class="roma">曝kedasu</rt>
</ruby>
在这里,ruby缩写的rt类中的项的格式不正确,开头是日语字符,因此通过以英语字符开头的navigablestring类型的后代进行解析的for循环理解在这里完全失败。因此,我尝试使用不同的字符串操作方法,但我的编码技能仍然不够好,因此这是一个严重的失败:
`def make_url(word):
return f"https://tangorin.com/sentences?search={word}"
word = "英語"
url = make_url(word)
source = requests.get(url).text
soup = BeautifulSoup(source, "lxml")
all_sentences = []
curr_sentence = ""
for sentence in soup.findAll("div", class_="entry entry-border sentences undefined"):
y = sentence.dt
for line in y:
if str(type(line)) == "<class 'bs4.element.Tag'>":
x = str(line.ruby)
block = str(x.split('<'))
block = str(block.split('>'))
print(block)[3]
#else:
#curr_sentence += line
#print(curr_sentence)
#curr_sentence = ""`
我不知道如何着手解决这个问题,以获得成功重新编译句子所需的确切信息,然后将它们添加到一组字符串中
对于一些基于日语的额外知识,日语句子中的字符之间没有空格,除了在诸如“、”、“.”“等字符中内置空格之外
此外,ruby标记是一种称为ruby缩写的东西,它是一种描述日语字符位置的方式,以及在上面用英语字符阅读的方式。“roma”代表romaji,意为日语单词的英语字符读物
抱歉,这是一个很长的问题,但我已经在beautifulsoup和其他解析方法上观看了大量youtube视频,我无法理解这个问题
另外,为了防止粗体显示不起作用,在xml块中,****指示符中的内容是我要提取的位。基本上,我想要的是ruby缩写中的内容,而不是rt类中的内容,我还想要那些纯文本项“、”、“”等。如果运行原始后代代码,您将得到99%的准确句子,以查看我想要的输出是什么类型。提前感谢您的帮助
您可以先提取带有英语单词的标记,然后使用
.get_text()
例如(我希望输出正确,我看不懂日语):
印刷品:
相关问题 更多 >
编程相关推荐