c++11正则表达式比python慢问题的回答

c++11正则表达式比python慢

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

<h2>通知</h2> 另请参见这个答案：<a href="https://stackoverflow.com/a/21708215">https://stackoverflow.com/a/21708215</a>，它是下面编辑2的基础。 <hr/> 我将循环增加到1000000以获得更好的计时度量。 这是我的Python计时： <pre><code>real 0m2.038s user 0m2.009s sys 0m0.024s </code></pre> 这里有一个相当于你的代码，只是有点漂亮： <pre><code>#include <regex> #include <vector> #include <string> std::vector<std::string> split(const std::string &s, const std::regex &r) { return { std::sregex_token_iterator(s.begin(), s.end(), r, -1), std::sregex_token_iterator() }; } int main() { const std::regex r(" +"); for(auto i=0; i < 1000000; ++i) split("a b c", r); return 0; } </code></pre> 时间安排： <pre><code>real 0m5.786s user 0m5.779s sys 0m0.005s </code></pre> <hr/> 这是一种避免构造/分配向量和字符串对象的优化： <pre><code>#include <regex> #include <vector> #include <string> void split(const std::string &s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1); auto rend = std::sregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string> v; for(auto i=0; i < 1000000; ++i) split("a b c", r, v); return 0; } </code></pre> 时间安排： <pre><code>real 0m3.034s user 0m3.029s sys 0m0.004s </code></pre> 这几乎是100%的性能改进。 向量是在循环之前创建的，并且可以在第一次迭代中增加其内存。之后，通过<code>clear()</code>没有内存释放，向量维护内存并在适当的位置构造字符串。 <hr/> 另一个性能提升将是完全避免构造/破坏<code>std::string</code>，从而避免分配/释放其对象。 这是一个试探性的方向： <pre><code>#include <regex> #include <vector> #include <string> void split(const char *s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } } </code></pre> 时间安排： <pre><code>real 0m2.509s user 0m2.503s sys 0m0.004s </code></pre> 最终的改进是将<code>std::vector</code>的<code>const char *</code>作为返回，其中每个char指针将指向原始<code>s</code>c字符串本身内部的子字符串。问题是，不能这样做，因为它们中的每一个都不会被null终止（为此，请参见后面的示例中使用C++ 1y^ { CD6}}）。 <hr/> 最后的改进也可以通过以下方式实现： <pre><code>#include <regex> #include <vector> #include <string> void split(const std::string &s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string> v; for(auto i=0; i < 1000000; ++i) split("a b c", r, v); // the constant string("a b c") should be optimized // by the compiler. I got the same performance as // if it was an object outside the loop return 0; } </code></pre> <hr/> 我用3.3的叮当声（从树干）和-O3制作了样本。也许其他regex库能够更好地执行，但无论如何，分配/释放常常会影响性能。 <hr/> <h2>增强型正则表达式</h2> 这是c字符串参数示例的<code>boost::regex</code>计时： <pre><code>real 0m1.284s user 0m1.278s sys 0m0.005s </code></pre> 相同的代码，<code>boost::regex</code>和<code>std::regex</code>接口在这个示例中是相同的，只需要更改名称空间和include。 <>希望随着时间的推移，它越来越好，C++ STDLIB正则表达式的实现还处于起步阶段。 <h2>编辑</h2> 为了完成，我尝试了这个（上面提到的“最终改进”建议），但它并没有在任何方面提高等效<code>std::vector<std::string> &v</code>版本的性能： <pre><code>#include <regex> #include <vector> #include <string> template<typename Iterator> class intrusive_substring { private: Iterator begin_, end_; public: intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {} Iterator begin() {return begin_;} Iterator end() {return end_;} }; using intrusive_char_substring = intrusive_substring<const char *>; void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); // This can potentially be optimized away by the compiler because // the intrusive_char_substring destructor does nothing, so // resetting the internal size is the only thing to be done. // Formerly allocated memory is maintained. while(rit != rend) { v.emplace_back(rit->first, rit->second); ++rit; } } int main() { const std::regex r(" +"); std::vector<intrusive_char_substring> v; for(auto i=0; i < 1000000; ++i) split("a b c", r, v); return 0; } </code></pre> 这与<a href="http://cxx1y-array-string-ref.googlecode.com/git/paper.html" rel="noreferrer">array_ref and string_ref proposal</a>有关。下面是使用它的示例代码： <pre><code>#include <regex> #include <vector> #include <string> #include <string_ref> void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.emplace_back(rit->first, rit->length()); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string_ref> v; for(auto i=0; i < 1000000; ++i) split("a b c", r, v); return 0; } </code></pre> 对于带向量返回的<code>split</code>情况，返回<code>string_ref</code>的向量而不是<code>string</code>副本也会更便宜。 <h2>编辑2</h2> 这个新的解决方案能够通过返回获得输出。我使用了Marshall Clow在<a href="https://github.com/mclow/string_view" rel="noreferrer">https://github.com/mclow/string_view</a>找到的<code>string_view</code>（<code>string_ref</code>已重命名）libc++实现。 <pre><code>#include <string> #include <string_view> #include <boost/regex.hpp> #include <boost/range/iterator_range.hpp> #include <boost/iterator/transform_iterator.hpp> using namespace std; using namespace std::experimental; using namespace boost; string_view stringfier(const cregex_token_iterator::value_type &match) { return {match.first, static_cast<size_t>(match.length())}; } using string_view_iterator = transform_iterator<decltype(&stringfier), cregex_token_iterator>; iterator_range<string_view_iterator> split(string_view s, const regex &r) { return { string_view_iterator( cregex_token_iterator(s.begin(), s.end(), r, -1), stringfier ), string_view_iterator() }; } int main() { const regex r(" +"); for (size_t i = 0; i < 1000000; ++i) { split("a b c", r); } } </code></pre> 时间安排： <pre><code>real 0m0.385s user 0m0.385s sys 0m0.000s </code></pre> 请注意，与之前的结果相比，这一速度有多快。当然，它不会在循环中填充一个<code>vector</code>（也可能不会提前匹配任何内容），但是无论如何，您都会得到一个范围，您可以使用基于范围的<code>for</code>来覆盖它，甚至可以使用它来填充一个<code>vector</code>。 由于覆盖<code>iterator_range</code>会在原始<code>string</code>（或以空结尾的字符串）上创建<code>string_view</code>s，因此它非常轻量级，从不生成不必要的字符串分配。 为了比较使用这个<code>split</code>实现但实际上填充了<code>vector</code>我们可以这样做： <pre><code>int main() { const regex r(" +"); vector<string_view> v; v.reserve(10); for (size_t i = 0; i < 1000000; ++i) { copy(split("a b c", r), back_inserter(v)); v.clear(); } } </code></pre> 这使用boost range copy算法在每次迭代中填充向量，计时如下： <pre><code>real 0m1.002s user 0m0.997s sys 0m0.004s </code></pre> 可以看出，与优化的<code>string_view</code>输出参数版本相比没有太大的差异。 注意还有<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3593.html" rel="noreferrer">a proposal for a ^{<cd25>}</a>可以这样工作。

c++11正则表达式比python慢

1 个回答

相关Python问题