2. Performance
2.1 Pointers vs. Smart Pointers
Performance difference between raw pointers and smart pointers
At low call frequencies there is no significant performance difference between raw pointers and smart pointers, but at high frequencies the smart pointer performs noticeably worse. Note that these results were measured in a single-threaded environment (a smart pointer's reference count is usually maintained with atomic operations, though on some instruction sets it falls back to a mutex), so one can expect even worse performance under multithreading.
The test code is shown below:
#include <array>
#include <chrono>
#include <iostream>
#include <memory>
using namespace std;

void TestSmPtr()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int* pInt = new int{0};
    auto spInt{make_shared<int>(10)};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *pInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "original pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *spInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "smart pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    delete pInt;
}
The measured performance differences are as follows:
2.2 function, lambda, and call binder
Performance comparison of plain functions, lambdas, std::function, and _callbinder
The std::function object performs worst (internally it dispatches the call through invoke); a lambda is close to a plain function, which shows the compiler optimizes lambdas well (a lambda is compiled into a function object).
When arguments are bound, the bind result (_callbinder) performs worst of all, nearly twice as slow as std::function, because bind returns a _callbinder object whose call in turn goes through invoke. The remaining callable types show no significant differences.
#include <array>
#include <chrono>
#include <functional>
#include <iostream>
using namespace std;

int testaddfunc(int a, int b)
{
    return a + b;
}

void testfuncperf()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    auto flamb = [](int a, int b)
    {
        return a + b;
    };
    function<int(int, int)> fo = flamb;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    fo = testaddfunc;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}
void testfuncperf2()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int adata = 10;
    auto flamb = [adata](int b)
    {
        return adata + b;
    };
    auto fo = bind(flamb, 12);
    auto f1 = bind(testaddfunc, 10, placeholders::_1);
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo();
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "bind result (bound to lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            f1(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "bind result (bound to normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}
The performance differences are shown in the figure:
2.3 Performance Gains from Memory Pools
Using memory resources, available since C++17
Over a workload of 200,000 list operations per run (repeated 100 times; see the code below), std::list was compared under the default allocator, an allocator backed by tcmalloc, the default pmr allocator, pmr with a monotonic_buffer_resource, and pmr with a monotonic_buffer_resource over a user-supplied buffer.
With a custom memory resource, pmr generally performs better.
#include <array>
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <memory_resource>
#include <vector>
using namespace std;

template<typename Func>
auto benchmark(Func test_func, int iterations)
{
    auto tp = chrono::high_resolution_clock::now();
    while (iterations-- > 0)
        test_func();
    auto tp1 = chrono::high_resolution_clock::now();
    auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
    return ms.count();
}
void testmemorypool()
{
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };
    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto default_pmr_alloc = [total_nodes]()
    {
        std::pmr::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto pmr_alloc_no_buf = [total_nodes]()
    {
        std::pmr::monotonic_buffer_resource mbr;
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto pmr_alloc_and_buf = [total_nodes]()
    {
        // initial buffer; once it is exhausted, monotonic_buffer_resource
        // falls back to its upstream resource for further allocations
        std::vector<std::byte> buffer(total_nodes);
        std::pmr::monotonic_buffer_resource mbr{buffer.data(), buffer.size()};
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    const double t1 = benchmark(default_std_alloc, iterations);
    const double t2 = benchmark(default_pmr_alloc, iterations);
    const double t3 = benchmark(pmr_alloc_no_buf, iterations);
    const double t4 = benchmark(pmr_alloc_and_buf, iterations);
    std::cout << std::fixed << std::setprecision(3)
        << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
        << "t2 (default pmr alloc): " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
        << "t3 (pmr alloc no buf):  " << t3 << " ms; t1/t3: " << t1 / t3 << '\n'
        << "t4 (pmr alloc and buf): " << t4 << " ms; t1/t4: " << t1 / t4 << '\n';
    cout << " count " << iterations << endl;
}
// tcmalloc backend: with tcmalloc linked in, the malloc/free calls below
// are serviced by tcmalloc's thread-caching pools
struct _AllocatorT
{
    static void* Allocate(size_t size)
    {
        return malloc(size);
    }
    static void Free(void* p, size_t size)
    {
        free(p);
    }
};
void testtcmemorypool()
{
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };
    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    // STL_Allocator is an external STL-allocator adapter (shipped with
    // tcmalloc) that routes allocations through _AllocatorT above
    auto custom_alloc = [total_nodes]()
    {
        std::list<int, STL_Allocator<int, _AllocatorT>> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    const double t1 = benchmark(default_std_alloc, iterations);
    const double t2 = benchmark(custom_alloc, iterations);
    std::cout << std::fixed << std::setprecision(3)
        << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
        << "t2 (with alloc):        " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
        << std::endl;
}
The measured differences are shown in the figure:
2.4 Performance Gains from Memory Alignment
Performance comparison of _aligned_malloc and malloc
Comparing the two functions, _aligned_malloc shows a performance improvement, though its advantage is not dramatic.
#include <cstdlib>

void testAlignAlloc()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        char* p = (char*)malloc(10 * sizeof(char));
        auto v = benchmark([p] { *p = ('c' + *p) % 127; }, arTestCounts[i]);
        free(p);
        cout << "malloc costs " << v << " count " << arTestCounts[i] << endl;
        // _aligned_malloc/_aligned_free are MSVC-specific
        char* p1 = (char*)_aligned_malloc(10 * sizeof(char), 8);
        v = benchmark([p1] { *p1 = ('c' + *p1) % 127; }, arTestCounts[i]);
        _aligned_free(p1);
        cout << "aligned_malloc costs " << v << " count " << arTestCounts[i] << endl;
    }
}
struct exampleObj
{
    char a;
    int b;
    char c;
};

// alignas is the standard (C++11) spelling; MSVC's __declspec(align(8))
// is equivalent but non-portable
struct alignas(8) exampleObj2
{
    char a;
    int b;
    char c;
};
void testStackAlign()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            exampleObj obj{};
            obj.a = ('c' + obj.a) % 127;
        }, arTestCounts[i]);
        cout << "default struct costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            exampleObj2 obj{};
            obj.a = ('c' + obj.a) % 127;
        }, arTestCounts[i]);
        cout << "aligned struct costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance difference chart:
Stack alignment (alignas):
2.5 Performance Gains from Rvalues
Performance comparison of rvalue, lvalue, and copy semantics
In this test case the rvalue and lvalue versions perform roughly the same, with the rvalue version slightly ahead; the rvalue advantage over copying, however, is clear. The example deliberately adds extra string operations to amplify these gaps.
#include <string>
#include <utility>

string getstringprof(string a)
{
    return a + "_returned";
}

string& getstringrefprof(string& a)
{
    a += "_ref_returned";
    return a;
}

string&& getstringrrefprof(string&& a)
{
    string&& r{ move(a) };
    r += "_rrefreturned";
    return move(r); // refers to the caller's temporary; valid only within the full expression
}

void testrightref()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            getstringprof("teststring");
        }, arTestCounts[i]);
        cout << "copy costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            string teststr("leftrefstring");
            getstringrefprof(teststr);
        }, arTestCounts[i]);
        cout << "lvalue reference costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            getstringrrefprof("rightrefstring");
        }, arTestCounts[i]);
        cout << "rvalue reference costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance differences:
2.6 Range-based for
Performance gains from range-based for
For this test case on a vector, the range-based for loop is dramatically faster than the iterator-based loop, and also noticeably faster than the index-based loop.
#include <random>

void testforperf()
{
    std::random_device rd;
    std::vector<int> vec;
    std::uniform_int_distribution<int> dist(0, 1000);
    for (int i{}; i < 1000; ++i)
    {
        vec.push_back(dist(rd));
    }
    array<int, 3> arTestCounts{10000, 100000, 1000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([&vec] {
            for (auto it = vec.begin(); it != vec.end(); ++it)
                ;
        }, arTestCounts[i]);
        cout << "iterator for costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([&vec] {
            for (int n{}; n < vec.size(); ++n)
                ;
        }, arTestCounts[i]);
        cout << "index for costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([&vec] {
            for (auto& c : vec)
                ;
        }, arTestCounts[i]);
        cout << "ranged for costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance differences:
Comparison of the three approaches
2.7 The Cost of Synchronization
Where the mutex is taken has a large impact on performance
Synchronization protects shared data, typically with a lock; but a carelessly placed lock can noticeably hurt the performance of everything downstream, as the test case below shows.
The takeaway: minimize cross-thread access to shared data, and when it is unavoidable, transfer a moderately sized batch of data per lock acquisition.
#include <iostream>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

void addToVector(std::vector<int>& data, std::mutex& mtx, int value)
{
    std::lock_guard<std::mutex> lock(mtx); // locked; released automatically at scope exit
    data.push_back(value);
}

void addToVector2(std::vector<int>& data, std::mutex& mtx, const list<int>& lst)
{
    std::lock_guard<std::mutex> lock(mtx); // locked
    for (const auto& v : lst)
        data.emplace_back(v);
}
int testMutexVector()
{
    std::vector<int> data;
    std::mutex mtx;
    static const int count = 1000000;
    array<int, 4> costs{};
    // four threads push the same total number of elements, but batch
    // 1 / 1000 / 10000 / 100000 elements per lock acquisition
    std::thread thread1([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 1; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[0] = t;
    });
    std::thread thread2([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 1000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[1] = t;
    });
    std::thread thread3([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 10000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[2] = t;
    });
    std::thread thread4([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 100000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[3] = t;
    });
    thread1.join();
    thread2.join();
    thread3.join();
    thread4.join();
    cout << "1 per tick, cost " << costs[0] << endl;
    cout << "1000 per tick, cost " << costs[1] << endl;
    cout << "10000 per tick, cost " << costs[2] << endl;
    cout << "100000 per tick, cost " << costs[3] << endl;
    return 0;
}
The performance differences are as follows:
3. Summary
C++ is a continuously evolving language; every new standard introduces features and improvements. This article briefly compared modern C++ (C++17, C++20, and later) against C++98 on several new features, with performance measurements to gauge their impact on code.
Automatic type deduction:
Modern C++ offers automatic type deduction through auto and decltype, letting the compiler infer variable types. This simplifies code and improves readability. Since deduction happens entirely at compile time, it carries no runtime cost in itself; benchmarks comparing deduced and explicitly declared types really measure the types chosen, not the deduction.
Range-based for loop:
Modern C++ introduces the range-based for loop, which simplifies access to container elements with a more concise and intuitive syntax. Performance tests can compare the traditional for loop against the range-based version, as in section 2.6.
Smart pointers:
Modern C++ introduces smart pointers, including std::unique_ptr, std::shared_ptr, and std::weak_ptr, for managing dynamically allocated memory. They provide safer and more convenient memory management, avoiding problems such as memory leaks and dangling pointers. Performance tests can assess the cost of smart pointers relative to raw pointers, as in section 2.1.
Concurrency library:
Modern C++ ships a powerful concurrency library, including std::thread, std::mutex, and std::condition_variable, for multithreaded and concurrent programming. These facilities make concurrent code simpler and safer. Performance tests can compare code built on this library against traditional thread APIs.
Finally, I will keep exploring the features and improvements that modern C++ adds over C++98; they provide more powerful and more efficient programming tools. Through continued study and benchmarking, I hope to pick the right features to improve program performance as real needs arise.