Python regex: look-behind requires fixed-width pattern

10 votes
5 answers
6543 views
Asked 2025-04-15 21:27

When extracting web page titles, I have always used the following regular expression:

(?<=<title.*>)([\s\S]*)(?=</title>)

This expression extracts the content between the tags in a document and leaves out the tags themselves. However, when I use it in Python, it fails with the following error:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern

The code I am using is:

pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

If I adjust it slightly, it works:

pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

However, this does not handle title tags that carry attributes, or other similar cases.

Does anyone know a good way to solve this? Any suggestions are welcome.

Tags: regular expressions, data extraction, web page parsing, debugging, look-behind, attribute handling, HTML tags

5 Answers

6 votes

There is a famous answer about parsing HTML with regular expressions, and it makes the case for "don't use regular expressions to parse HTML" very well.

Answered 2025-04-15 by Python大师

13 votes

Don't try to parse HTML with regular expressions; use a real HTML parsing library instead. A quick search turned up this one (http://docs.python.org/library/htmlparser.html). Extracting information from an HTML file that way is much safer.

Remember, HTML is not a regular language, so regular expressions are simply the wrong tool for extracting information from it.

Answered 2025-04-15 by Python大师

2 votes

If you only want the title tag:

import urllib.request

# Python 3 form of the original snippet (which used urllib2 and the print statement).
# "http://somewhere" is a placeholder URL; UTF-8 decoding is an assumption.
html = urllib.request.urlopen("http://somewhere").read().decode("utf-8", errors="replace")
for item in html.split("</title>"):
    if "<title>" in item:
        # 7 == len("<title>"): print everything after the opening tag
        print(item[item.find("<title>") + 7:])

Answered 2025-04-15 by Python大师
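As a rough illustration of the two directions discussed in this thread, here is a minimal sketch rather than a definitive solution. The first function sidesteps the look-behind entirely: the opening tag, attributes included, is matched as ordinary pattern text and only the title body is captured in a group. The second follows the "use a real parser" advice with the standard-library `html.parser` module. The names `TITLE_RE`, `title_via_regex`, `TitleParser`, and `title_via_parser` are made up for this example, and both functions assume the page has already been decoded to a `str` and that only the first title is wanted.

```python
import re
from html.parser import HTMLParser

# Option 1: no look-behind. The opening tag (with any attributes) is part of
# the match, and group 1 captures only the text between the tags.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_via_regex(html):
    """Return the body of the first <title> element, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Option 2: let the standard-library HTML parser find the tag boundaries.
class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are simply ignored here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def title_via_parser(html):
    """Return the body of the first <title> element, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

For example, `title_via_regex('<title id="t">Hello</title>')` returns `'Hello'`. The regex version is fine for quick scripts on well-behaved pages; the parser version is the safer choice when the markup is messy.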