
C++20 Syntax and Performance, Part 3 (C++ Syntax Exercises)

baijin 2024-08-13 00:55:44

2. Performance

2.1 Pointers and smart pointers

Performance differences between raw pointers and smart pointers

At low call frequencies there is no significant difference between raw and smart pointers, but at high call frequencies smart pointers perform noticeably worse. Note that these results were measured in a single-threaded environment; since a smart pointer's reference count is usually maintained with atomic operations (and on some instruction sets is implemented with a mutex), its performance in a multithreaded environment can be expected to be worse still.

The test code is shown below:

#include <array>
#include <chrono>
#include <iostream>
#include <memory>
using namespace std;

void TestSmPtr()
{
    array<int,5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int* pInt = new int{0};
    auto spInt{make_shared<int>(10)};
    
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *pInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "original pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *spInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "smart pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    delete pInt; // free the raw pointer; the shared_ptr releases its object automatically
}

The measured performance differences are shown below:

2.2 function, lambda, and bind

Performance comparison of plain functions, lambdas, std::function, and the call binder returned by std::bind

std::function objects perform worst (internally they go through invoke to execute the target). Lambdas are close to plain functions; the compiler clearly optimizes lambdas well (a lambda is compiled into a function object whose call operator can be inlined).

With bound data, the call binder performs worst (nearly twice as slow as std::function, because std::bind returns a call-binder object whose invocation in turn goes through invoke). The other callable types show no significant differences.

#include <functional>
#include <array>
#include <chrono>
#include <iostream>
using namespace std;

int testaddfunc(int a, int b)
{
    return a + b;
}
void testfuncperf()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    auto flamb = [](int a, int b)
    {
        return a + b;
    };
    function<int(int, int)> fo = flamb;

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10,12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(10,12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    fo = testaddfunc;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}
void testfuncperf2()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int adata = 10;
    auto flamb = [adata](int b)
    {
        return adata + b;
    };
    auto fo = bind(flamb,12);
    auto f1 = bind(testaddfunc, 10, placeholders::_1);

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo();
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "call binder (bound lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }

    //fo = testaddfunc;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            f1(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "call binder (bound normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}

The performance differences are shown in the figure:

2.3 Performance gains from memory pools

Using <memory_resource>, available since C++17

In a test of 2,000,000 operations, we compare the performance of std::list with the default allocator, a tcmalloc-backed custom allocator, the default pmr allocator, pmr with a monotonic_buffer_resource, and pmr with a monotonic_buffer_resource backed by a preallocated buffer.

With a custom memory resource, pmr generally performs better.

#include <array>
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <memory_resource>
#include <vector>
using namespace std;
 
template<typename Func>
auto benchmark(Func test_func, int iterations)
{
    auto tp = chrono::high_resolution_clock::now();
    while (iterations-- > 0)
        test_func();
    auto tp1 = chrono::high_resolution_clock::now();
    auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
    return ms.count();
}
 
void testmemorypool()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };

    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    auto default_pmr_alloc = [total_nodes]()
    {
        std::pmr::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    auto pmr_alloc_no_buf = [total_nodes]()
    {
        std::pmr::monotonic_buffer_resource mbr;
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    auto pmr_alloc_and_buf = [total_nodes]()
    {
        std::vector<std::byte> buffer(total_nodes * 32); // roughly enough for all list nodes; any overflow falls back to the upstream resource
        std::pmr::monotonic_buffer_resource mbr{buffer.data(), buffer.size()};
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    {
        {
            const double t1 = benchmark(default_std_alloc, iterations);
            const double t2 = benchmark(default_pmr_alloc, iterations);
            const double t3 = benchmark(pmr_alloc_no_buf, iterations);
            const double t4 = benchmark(pmr_alloc_and_buf, iterations);

            std::cout << std::fixed << std::setprecision(3)
                << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
                << "t2 (default pmr alloc): " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
                << "t3 (pmr alloc  no buf): " << t3 << " ms; t1/t3: " << t1 / t3 << '\n'
                << "t4 (pmr alloc and buf): " << t4 << " ms; t1/t4: " << t1 / t4 << '\n';

            cout << " count " << iterations << endl;
        }
    }
}
// tcmalloc: linking against tcmalloc replaces the malloc/free calls below with
// its thread-caching implementations; STL_Allocator is the adapter template
// from gperftools that turns this Allocate/Free pair into a standard allocator
struct _AllocatorT
{
    static void* Allocate(size_t size)
    {
        return malloc(size);
    }
    static void Free(void* p, size_t size)
    {
        free(p);
    }
};

void testtcmemorypool()
{
    //array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };

    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    auto tcmalloc_alloc = [total_nodes]()
    {
        std::list<int, STL_Allocator<int, _AllocatorT>> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };

    const double t1 = benchmark(default_std_alloc, iterations);
    const double t2 = benchmark(tcmalloc_alloc, iterations);

    std::cout << std::fixed << std::setprecision(3)
        << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
        << "t2 (tcmalloc alloc):    " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
        << std::endl;
}

The measured differences are shown in the figure:

2.4 Performance gains from memory alignment

Performance comparison of _aligned_malloc and malloc

Comparing the two functions, _aligned_malloc shows an improvement, but the advantage is not dramatic.

#include <stdlib.h>
#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC)

void testAlignAlloc()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        char* p = (char*)malloc(10 * sizeof(char));
        auto v = benchmark([p] {*p = ('c' + *p) % 127; }, arTestCounts[i]);
        free(p);
        cout << "malloc costs " << v << " count " << arTestCounts[i] << endl;

        char* p1 = (char*)_aligned_malloc(10 * sizeof(char), 8);
        v = benchmark([p1] {*p1 = ('c' + *p1) % 127; }, arTestCounts[i]);
        _aligned_free(p1);
        cout << "aligned_malloc costs " << v << " count " << arTestCounts[i] << endl;
    }

}

struct exampleObj {
    char a;
    int b;
    char c;
};
// __declspec(align(8)) is MSVC-specific; the portable C++11 spelling is alignas(8)
__declspec(align(8)) struct exampleObj2 {
    char a;
    int b;
    char c;
};

void testStackAlign()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            exampleObj obj{};
            obj.a = ('c' + obj.a) % 127;
            }, arTestCounts[i]);
        cout << "default-aligned struct costs " << v << " count " << arTestCounts[i] << endl;

        v = benchmark([] {
            exampleObj2 obj{};
            obj.a = ('c' + obj.a) % 127;
            }, arTestCounts[i]);
        cout << "aligned struct costs " << v << " count " << arTestCounts[i] << endl;
    }
}

Performance difference figure:

stack alignas

2.5 Performance gains from rvalue references

Performance comparison of rvalue references, lvalue references, and copies

In this test case, the rvalue and lvalue versions perform roughly the same, with the rvalue version slightly ahead; the performance advantage of rvalues over copying, however, is very clear. The example deliberately adds extra string operations to amplify these differences.

string getstringprof(string a)
{
    return a + "_returned";
}
string& getstringrefprof(string& a)
{
    a += "_ref_returned";
    return a;
}
// Note: returning an rvalue reference to the parameter is fragile; the returned
// reference is only valid while the caller's argument (here a temporary) is alive.
string&& getstringrrefprof(string&& a)
{
    string&& r{ move(a) };
    r += "_rrefreturned";
    return move(r);
}
void testrightref()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            getstringprof("teststring");
            }, arTestCounts[i]);
        cout << "string costs " << v << " count " << arTestCounts[i] << endl;

        v = benchmark([] {
                string teststr("leftrefstring");
                getstringrefprof(teststr);
            }, arTestCounts[i]);
        cout << "left reference costs " << v << " count " << arTestCounts[i] << endl;

        v = benchmark([] {
            getstringrrefprof("rightrefstring");
            }, arTestCounts[i]);
        cout << "right reference costs " << v << " count " << arTestCounts[i] << endl;
    }
}

Performance differences:

2.6 Range-based for

Performance gains from range-based for

For this test case on a vector, range-based for performed far better than the iterator-based for, and also noticeably better than the index-based for. (The loop bodies are empty, so an optimizing build may eliminate them entirely; differences like these show up mainly in unoptimized builds.)

#include <random>

void testforperf()
{
    std::random_device rd;
    std::vector<int> vec;
    std::uniform_int_distribution<int> dist(0, 1000);
    for (int i{}; i < 1000; ++i)
    {
        vec.push_back(dist(rd));
    }

    array<int, 3> arTestCounts{10000, 100000, 1000000};// , 10000000, 100000000

    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([&vec] {
            for (auto it = vec.begin(); it != vec.end(); ++it)
                ;
            }, arTestCounts[i]);
            cout << "iterator for costs " << v << " count " << arTestCounts[i] << endl;

        v = benchmark([&vec] {
            for (int n{}; n < vec.size(); ++n)
                ;
            }, arTestCounts[i]);
            cout << "index for costs " << v << " count " << arTestCounts[i] << endl;

        v = benchmark([&vec] {
            for (auto & c : vec)
                ;
            }, arTestCounts[i]);
        cout << "ranged for costs " << v << " count " << arTestCounts[i] << endl;
    }
}

Performance differences:

Comparison of the three

2.7 The cost of synchronization

Where a mutex is placed has a large impact on performance

Synchronization protects shared data, typically with a lock; but a carelessly placed lock can hurt performance considerably, as the test case below shows.

We should therefore minimize cross-thread access to shared data; where access is unavoidable, consider transferring a moderately sized batch of data per lock acquisition.

#include <iostream>
#include <vector>
#include <list>
#include <array>
#include <thread>
#include <mutex>
using namespace std;

void addToVector(std::vector<int>& data, std::mutex& mtx, int value) {
    std::lock_guard<std::mutex> lock(mtx); // acquire the lock
    data.push_back(value);
} // the lock is released automatically when it goes out of scope
void addToVector2(std::vector<int>& data, std::mutex& mtx, const list<int>& lst)
{
    std::lock_guard<std::mutex> lock(mtx); // acquire the lock
    for (const auto& v : lst)
        data.emplace_back(v);
}

int testMutexVector() {

        std::vector<int> data;
        std::mutex mtx;
        static const int count = 1000000;
        array<int, 4> costs{};

        std::thread thread1([&data,&mtx, &costs]() {
            auto t = benchmark([&data,&mtx]() {
                list<int> lst;
                for (int i{}; i < count; )
                {
                    for (int j{}; j < 1; ++j, ++i)
                        lst.push_back(i);
                    addToVector2(data, mtx, lst);
                    lst.clear();
                }
                },1);

            costs[0] = t;
            });
       
        std::thread thread2([&data, &mtx, &costs]() {
            auto t = benchmark([&data, &mtx]() {
                list<int> lst;
                for (int i{}; i < count; )
                {
                    for (int j{}; j < 1000; ++j, ++i)
                        lst.push_back(i);
                    addToVector2(data, mtx, lst);
                    lst.clear();
                }
                }, 1);

            costs[1] = t;
            });

        std::thread thread3([&data, &mtx, &costs]() {
            auto t = benchmark([&data, &mtx]() {
                list<int> lst;
                for (int i{}; i < count; )
                {
                    for (int j{}; j < 10000; ++j, ++i)
                        lst.push_back(i);
                    addToVector2(data, mtx, lst);
                    lst.clear();
                }
                }, 1);

            costs[2] = t;
            });

        std::thread thread4([&data, &mtx, &costs]() {
            auto t = benchmark([&data, &mtx]() {
                list<int> lst;
                for (int i{}; i < count; )
                {
                    for (int j{}; j < 100000; ++j, ++i)
                        lst.push_back(i);
                    addToVector2(data, mtx, lst);
                    lst.clear();
                }
                }, 1);

            costs[3] = t;
            });

        thread1.join();
        thread2.join();
        thread3.join();
        thread4.join();

        cout << "1 per tick, cost " << costs[0] << endl;
        cout << "1000 per tick, cost " << costs[1] << endl;
        cout << "10000 per tick, cost " << costs[2] << endl;
        cout << "100000 per tick, cost " << costs[3] << endl;

    // print the container contents
    //std::lock_guard<std::mutex> lock(mtx);
    //std::cout << "Container contents:";
    //for (int num : data) {
    //    std::cout << " " << num;
    //}
    //std::cout << std::endl;

    return 0;
}

The performance differences are shown below:

3. Summary

C++ is a continually evolving language, and each new version introduces new features and improvements. This series has briefly compared the new features of recent C++ standards (C++17, C++20, and later) against C++98, with performance measurements to assess their impact on code performance.

Automatic type deduction:

Recent C++ standards introduce automatic type deduction: with the keywords auto and decltype, the compiler infers variable types automatically. This simplifies code and makes it more readable. Performance tests can compare code using deduced types with code using explicit type declarations.

Range-based for loop:

Recent C++ standards introduce the range-based for loop, which simplifies iterating over container elements and offers a cleaner, more intuitive syntax. Performance tests can compare the traditional for loop with the range-based for loop.

Smart pointers:

Recent C++ standards introduce smart pointers, including std::unique_ptr, std::shared_ptr, and std::weak_ptr, for managing dynamically allocated memory. Smart pointers provide a safer and more convenient style of memory management, avoiding leaks and dangling pointers. Performance tests can assess the difference between code using smart pointers and code using raw pointers.

Concurrency library:

Recent C++ standards introduce a powerful concurrency library, including std::thread, std::mutex, and std::condition_variable, for multithreading and concurrent operations. These facilities provide rich tools and mechanisms that make concurrent programming simpler and safer. Performance tests can compare code using this library with traditional threading approaches.

Finally, I will continue exploring the new features and improvements that recent C++ standards bring over C++98; they provide more powerful and efficient programming tools. Through continued study and benchmarking, I hope to choose appropriate features based on actual needs to improve program performance.
