How to find the distance between two elements in HTML using BeautifulSoup

0 votes
2 answers
1265 views
Asked 2025-04-18 02:45

The goal is to find the distance between two tags using the BeautifulSoup library, for example between the first external link (an <code>a</code> tag whose <code>href</code> starts with http or https) and the <code>title</code> tag.</p>
<pre><code>import re
from bs4 import BeautifulSoup

html = '&lt;title&gt;stackoverflow&lt;/title&gt;&lt;a href="https://stackoverflow.com"&gt;test&lt;/a&gt;'
soup = BeautifulSoup(html, 'html.parser')
# First link whose href starts with http: or https:
ext_link = soup.find('a', href=re.compile("^https?:", re.IGNORECASE))
title = soup.title
# abs_distance_between_tags() is the function I am looking for; it does not exist
dist = abs_distance_between_tags(ext_link, title)
print(dist)
# desired output: 30
</code></pre>
<p>How can I do this without resorting to regular expressions?</p>
<p>Note that the tags may appear in a different order, and there may be several matches (although we only take the first one, using find()).</p>
<p>I could not find a method in BeautifulSoup that returns the position of a match within the HTML.</p>
</div>
<div class="tags-section"> <a target="_blank" href="/tags/%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96" class="tag">data extraction</a> <a target="_blank" href="/tags/html%E8%A7%A3%E6%9E%90" class="tag">html parsing</a> <a target="_blank" href="/tags/beautifulsoup" class="tag">beautifulsoup</a> <a target="_blank" href="/tags/%E7%BD%91%E9%A1%B5%E7%88%AC%E8%99%AB" class="tag">web scraping</a> <a target="_blank" href="/tags/%E6%A0%87%E7%AD%BE%E5%8C%B9%E9%85%8D" class="tag">tag matching</a> <a target="_blank" href="/tags/%E5%85%83%E7%B4%A0%E8%B7%9D%E7%A6%BB" class="tag">element distance</a> </div>
</div>
</div>
<!-- Answers -->
<div class="card">
<div class="card-header"> <div class="answers-header"> <h2 class="answers-title">2 answers</h2> </div> </div>
<div class="answer-item">
<div class="answer-wrapper">
<div class="answer-voting"> <div class="vote-count">1</div> </div>
<div class="answer-content">
<p>Beautiful Soup 4 now supports <code>Tag.sourceline</code> (the line number where the tag was found) and <code>Tag.sourcepos</code> (the position of the tag within that line).</p>
<p>Reference: <a href="https://beautiful-soup-4.readthedocs.io/en/latest/#line-numbers" rel="nofollow noreferrer">https://beautiful-soup-4.readthedocs.io/en/latest/#line-numbers</a></p>
</div>
</div>
<div class="answer-footer"> <div class="answer-author"> Answered 2025-04-18 by <a href="#" class="author-name">Python大师</a> </div> </div>
</div>
<div class="answer-item">
<div class="answer-wrapper">
<div class="answer-voting"> <div class="vote-count">1</div> </div>
<div class="answer-content">
<p>As you mentioned, it seems that BeautifulSoup cannot give you the exact character position of an element.</p>
<p>Perhaps <a href="https://stackoverflow.com/a/12463426/25097">this answer</a> can help:</p>
<blockquote>
<p>As far as I know, lxml only offers the source line number, which is not enough. See <a href="http://lxml.de/api/lxml.etree._Element-class.html" rel="nofollow noreferrer">the API</a>: <code>Original line number as found by the parser or None if unknown.</code></p>
<p>expat, however, can provide the exact offset within the file: CurrentByteIndex.</p>
<ul>
<li>When read from the start_element handler, it returns the offset of the start of the tag (that is, the '&lt;').</li>
<li>When read from the char_data handler, it returns the offset of the start of the data (that is, the 'B' in your example).</li>
</ul>
</blockquote>
</div>
</div>
<div class="answer-footer"> <div class="answer-author"> Answered 2025-04-18 by <a href="#" class="author-name">Python大师</a> </div> </div>
</div>
</div>
</main>
</div>
</body>
</html>
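The expat approach quoted in the second answer can be sketched roughly as follows. This is only an illustration under stated assumptions: expat is an XML parser, so the markup must be well-formed (the `<root>` wrapper is added here for that reason, and typical real-world HTML would be rejected). The helper names `first_offsets`, `is_title` and `is_external_link`, and this particular definition of `abs_distance_between_tags` (the absolute difference between the byte offsets of the two start tags), are hypothetical and not part of any library. The external-link check uses a plain string prefix test rather than a regular expression.

```python
import xml.parsers.expat


def first_offsets(markup: bytes, matchers: dict) -> dict:
    """Return {label: byte offset of the first start tag accepted by matchers[label]}."""
    found = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        for label, accepts in matchers.items():
            if label not in found and accepts(name, attrs):
                # Inside a start-element handler, CurrentByteIndex is the
                # offset of the '<' that opens the tag.
                found[label] = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.Parse(markup, True)  # True marks this as the final chunk of input
    return found


def is_title(name, attrs):
    return name == "title"


def is_external_link(name, attrs):
    # Plain string check instead of a regular expression.
    href = attrs.get("href", "").lower()
    return name == "a" and href.startswith(("http:", "https:"))


def abs_distance_between_tags(markup: bytes) -> int:
    """Hypothetical helper: distance in bytes between the two start tags."""
    pos = first_offsets(markup, {"title": is_title, "link": is_external_link})
    return abs(pos["link"] - pos["title"])


# Assumption: the sample from the question, wrapped in <root> so it is valid XML.
doc = (b'<root><title>stackoverflow</title>'
       b'<a href="https://stackoverflow.com">test</a></root>')
print(abs_distance_between_tags(doc))  # 28
```

Because only the first match per label is recorded and the result is an absolute difference, the order of the two tags in the document does not matter. The offsets are byte offsets, so for non-ASCII input they can differ from character positions.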