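C++11 regex slower than Python

3 answers

User

Answer 1 · edited 2024-05-14 01:25:28

Notes

See also this answer: https://stackoverflow.com/a/21708215, which is the basis for EDIT 2 below.

I increased the loop count to 1000000 to get a better timing measurement.

This is my Python timing: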

real    0m2.038s
user    0m2.009s
sys     0m0.024s

Here is an equivalent of your code, just a bit prettier:

#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s, const std::regex &r)
{
    return {
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator()
    };
}

int main()
{
    const std::regex r(" +");
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r);
    return 0;
}

Timing:

real    0m5.786s
user    0m5.779s
sys     0m0.005s

This is an optimization that avoids constructing and allocating the vector and string objects on every call:

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);
    return 0;
}

Timing:

real    0m3.034s
user    0m3.029s
sys     0m0.004s

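This nearly halves the runtime (from 5.786 s to 3.034 s), a speedup of about 1.9x.

The vector is created before the loop and can grow its memory during the first iteration. After that, clear() does not release any memory: the vector keeps its storage and constructs the strings in place.

Another performance gain would be to avoid constructing and destroying the std::string argument altogether, and with it the allocation and deallocation of its storage.

This is a tentative step in that direction: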
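The point about clear() can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name clear_keeps_capacity is hypothetical, and it assumes a mainstream standard library implementation, where clear() leaves capacity() unchanged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): returns true if
// clear() emptied the vector but left its allocated capacity untouched.
bool clear_keeps_capacity()
{
    std::vector<std::string> v;
    v.push_back("a");
    v.push_back("b");
    v.push_back("c");

    const std::size_t before = v.capacity(); // storage grown by the push_backs
    v.clear();                               // destroys the strings, size() becomes 0

    return v.size() == 0 && v.capacity() == before;
}
```

This is why moving the vector out of the loop pays off: only the first call has to grow the storage, and later calls refill it.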

#include <cstring> // std::strlen
#include <regex>
#include <vector>
#include <string>

void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

Timing:

real    0m2.509s
user    0m2.503s
sys     0m0.004s

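An ultimate improvement would be to return a std::vector of const char *, where each pointer points to a substring inside the original C string s itself. The problem is that you cannot do that, because none of those substrings would be null-terminated (for this, see the use of the C++1y string_ref in a later sample).

The previous improvement (matching over the raw character range) can also be achieved while keeping the std::string parameter: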

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v); // the constant string("a b c") should be optimized
                             // by the compiler. I got the same performance as
                             // if it was an object outside the loop
    return 0;
}

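I built the samples with clang 3.3 (from trunk) and -O3. Other regex libraries may perform better, but in any case allocations and deallocations are frequently a performance hit.

Boost.Regex

This is the boost::regex timing for the C string argument sample: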

real    0m1.284s
user    0m1.278s
sys     0m0.005s

The code is the same: the boost::regex and std::regex interfaces are identical in this sample, so only the namespace and the include need to change.
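One way to make that swap a compile-time switch is a namespace alias. This is only a sketch: the macro USE_BOOST_REGEX and the alias rx are hypothetical names, not something the original answer used, and the Boost branch assumes Boost.Regex is installed and linked.

```cpp
#include <cstring>
#include <string>
#include <vector>

#ifdef USE_BOOST_REGEX // hypothetical macro, e.g. passed as -DUSE_BOOST_REGEX
#include <boost/regex.hpp>
namespace rx = boost;
#else
#include <regex>
namespace rx = std;
#endif

// Same C string split as above, written once against the alias `rx`.
void split(const char *s, const rx::regex &r, std::vector<std::string> &v)
{
    auto rit = rx::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = rx::cregex_token_iterator();
    v.clear();
    for (; rit != rend; ++rit)
        v.push_back(rit->str()); // copy the token out of the input buffer
}
```

With this layout the two libraries can be timed against each other by rebuilding with or without the macro, leaving the split code untouched.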

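Here's hoping it gets better over time; the C++ stdlib regex implementations are still in their infancy.

EDIT

For the sake of completeness, I tried this (the "ultimate improvement" suggestion mentioned above), and it did not improve on the performance of the equivalent std::vector<std::string> &v version in any way: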

#include <regex>
#include <vector>
#include <string>

template<typename Iterator>
class intrusive_substring
{
private:
    Iterator begin_, end_;

public:
    intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}

    Iterator begin() {return begin_;}
    Iterator end() {return end_;}
};

using intrusive_char_substring = intrusive_substring<const char *>;

void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear(); // This can potentially be optimized away by the compiler because
               // the intrusive_char_substring destructor does nothing, so
               // resetting the internal size is the only thing to be done.
               // Formerly allocated memory is maintained.
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->second);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<intrusive_char_substring> v;
    for(auto i=0; i < 1000000; ++i)
        split("a b c", r, v);
    return 0;
}

This relates to the array_ref and string_ref proposal. Here is sample code using it:

#include <regex>
#include <vector>
#include <string>
#include <string_ref>

void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string_ref> v;
    for(auto i=0; i < 1000000; ++i)
        split("a b c", r, v);
    return 0;
}

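For the split variant that returns a vector, it would also be cheaper to return a vector of string_ref rather than string copies.

EDIT 2

This new solution is able to deliver its output by return value. I used Marshall Clow's libc++ implementation of string_view (string_ref got renamed), found at https://github.com/mclow/string_view.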

#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>

using namespace std;
using namespace std::experimental;
using namespace boost;

string_view stringfier(const cregex_token_iterator::value_type &match) {
    return {match.first, static_cast<size_t>(match.length())};
}

using string_view_iterator =
    transform_iterator<decltype(&stringfier), cregex_token_iterator>;

iterator_range<string_view_iterator> split(string_view s, const regex &r) {
    return {
        string_view_iterator(
            cregex_token_iterator(s.begin(), s.end(), r, -1),
            stringfier
        ),
        string_view_iterator()
    };
}

int main() {
    const regex r(" +");
    for (size_t i = 0; i < 1000000; ++i) {
        split("a b c", r);
    }
}

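Timing:

real    0m0.385s
user    0m0.385s
sys     0m0.000s

Note how much faster this is than the previous results. Of course, it does not fill a vector inside the loop (and probably does not match anything in advance either), but you still get a range, which you can iterate over with a range-based for or even use to fill a vector.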

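Since iterating over the iterator_range creates string_views over the original string (or null-terminated string), this is very lightweight and never generates unnecessary string allocations.

For comparison, to use this split implementation while actually filling a vector, we could do this: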
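The same "views into the original buffer" idea can be sketched with standard components only. This sketch assumes a C++17 compiler with std::string_view (instead of the experimental header used above) and std::regex instead of Boost; the function name split_views is hypothetical, and the code was not timed.

```cpp
#include <cstddef>
#include <regex>
#include <string_view>
#include <vector>

// Splits `s` on `r` and returns views into the caller's buffer, so no
// std::string is constructed for the tokens. The views are only valid
// while the buffer behind `s` is alive.
std::vector<std::string_view> split_views(std::string_view s, const std::regex &r)
{
    std::vector<std::string_view> out;
    std::cregex_token_iterator it(s.data(), s.data() + s.size(), r, -1);
    const std::cregex_token_iterator end;
    for (; it != end; ++it)
        out.emplace_back(it->first, static_cast<std::size_t>(it->length()));
    return out;
}
```

Each element is just a pointer and a length, which is why the tokens do not need to be null-terminated. The returned vector itself still allocates, unlike the lazy range above.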

int main() {
    const regex r(" +");
    vector<string_view> v;
    v.reserve(10);
    for (size_t i = 0; i < 1000000; ++i) {
        copy(split("a b c", r), back_inserter(v));
        v.clear();
    }
}

This uses the Boost range copy algorithm to fill the vector in each iteration. The timing is:

real    0m1.002s
user    0m0.997s
sys     0m0.004s

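As can be seen, there is not much difference compared with the optimized string_view output-parameter version.

Note also that there is a proposal for a std::split that would work like this.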

User

Answer 2 · edited 2024-05-14 01:25:28

How about this version? It is not regex, but it solves the split pretty fast...

#include <vector>
#include <string>
#include <algorithm>

size_t split2(const std::string& s, std::vector<std::string>& result)
{
    size_t count = 0;
    result.clear();
    std::string::const_iterator p1 = s.cbegin();
    std::string::const_iterator p2 = p1;
    bool run = true;
    do
    {
        p2 = std::find(p1, s.cend(), ' ');
        result.push_back(std::string(p1, p2));
        ++count;

        if (p2 != s.cend())
        {
            p1 = std::find_if(p2, s.cend(), [](char c) -> bool
            {
                return c != ' ';
            });
        }
        else run = false;
    } while (run);
    return count;
}

int main()
{
    std::vector<std::string> v;
    std::string s = "a b c";
    for (auto i = 0; i < 100000; ++i)
        split2(s, v); 
    return 0;
}

$ time splittest.exe

real    0m0.132s
user    0m0.000s
sys     0m0.109s

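User

Answer 3 · edited 2024-05-14 01:25:28

When optimizing, there are generally two things you want to avoid:

burning CPU cycles on unnecessary work
waiting idly for something to happen (memory fetch, disk read, network read, ...)

The two can pull in opposite directions, because sometimes computing something is faster than keeping everything cached in memory, so it is a balancing game.

Let's analyze your code: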

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +"); // only computed once

    // search for first occurrence of rsplit
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);

    auto rend = std::sregex_token_iterator();

    // simultaneously:
    // - parses "s" from the second to the past the last occurrence
    // - allocates one `std::string` for each match... at least! (there may be a copy)
    // - allocates space in the `std::vector`, possibly multiple times
    auto res = std::vector<std::string>(rit, rend);

    return res;
}

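Can we do better? Well, if we could reuse the existing storage instead of repeatedly allocating and deallocating memory, we should see a significant improvement [1]: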

// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
    static const std::regex rsplit(" +"); // only computed once

    // cregex_token_iterator works on const char*, so pass pointers, not string iterators
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.size(), rsplit, -1);
    auto rend = std::cregex_token_iterator();

    size_t pos = 0;

    // As long as possible, reuse the existing strings (in place)
    for (size_t max = result.size();
         rit != rend && pos != max;
         ++rit, ++pos)
    {
        result[pos].assign(rit->first, rit->second);
    }

    // When more matches than existing strings, extend capacity
    for (; rit != rend; ++rit, ++pos) {
        result.emplace_back(rit->first, rit->second);
    }

    return pos;
} // split

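In the test you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it only allocates memory on the first run (for both rsplit and result) and then keeps reusing the existing memory.

[1]: Disclaimer: I have only proved this code correct, I have not tested it (as Donald Knuth would say).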
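The reuse-based split above never shrinks result, so after a call that yields fewer tokens than an earlier one, the tail of the vector still holds old strings and the caller has to rely on the returned count. A possible variant that trims the tail is sketched here; the name split_reuse and the final resize are additions for illustration, not part of the original answer, and it has not been benchmarked.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Variant of the reuse-based split: existing strings are overwritten in
// place, and entries left over from a longer previous result are dropped.
std::size_t split_reuse(const std::string &s, std::vector<std::string> &result)
{
    static const std::regex rsplit(" +"); // compiled once

    std::cregex_token_iterator rit(s.data(), s.data() + s.size(), rsplit, -1);
    const std::cregex_token_iterator rend;

    std::size_t pos = 0;
    for (; rit != rend; ++rit, ++pos) {
        if (pos < result.size())
            result[pos].assign(rit->first, rit->second);  // reuse the string's buffer
        else
            result.emplace_back(rit->first, rit->second); // grow only when needed
    }
    result.resize(pos); // discard stale entries from earlier calls
    return pos;
}
```

The trade-off is that shrinking destroys the trailing strings and their buffers, so inputs with a varying token count may allocate again later. With a constant token count, as in the benchmark, the resize is a no-op.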
