2. Performance
2.1 Pointers vs. Smart Pointers
Performance difference between raw pointers and smart pointers
At low call frequencies there is no significant performance difference between raw pointers and smart pointers, but at high frequencies the smart pointer performs noticeably worse. Note that these results were measured in a single-threaded environment (a smart pointer's reference count is usually maintained with atomic operations, though on some instruction sets it falls back to a mutex), so one can expect even worse performance under multithreading.
The test code is shown below:
#include <array>
#include <chrono>
#include <iostream>
#include <memory>
using namespace std;

void TestSmPtr()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int* pInt = new int{0};
    auto spInt{make_shared<int>(10)};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *pInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "original pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            *spInt = n;
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "smart pointer cost " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    delete pInt;
}
The measured performance differences are as follows:
2.2 function, lambda, and call binder
Performance comparison of plain functions, lambdas, std::function, and _callbinder
The std::function object performs worst (internally it dispatches the call through invoke); a lambda is close to a plain function, which shows the compiler optimizes lambdas well (a lambda is compiled into a function object).
When arguments are bound, the bind result (_callbinder) performs worst of all, nearly twice as slow as std::function, because bind returns a _callbinder object whose call in turn goes through invoke. The remaining callable types show no significant differences.
#include <array>
#include <chrono>
#include <functional>
#include <iostream>
using namespace std;

int testaddfunc(int a, int b)
{
    return a + b;
}

void testfuncperf()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    auto flamb = [](int a, int b)
    {
        return a + b;
    };
    function<int(int, int)> fo = flamb;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    fo = testaddfunc;
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "function object (bound to normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}
void testfuncperf2()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    int adata = 10;
    auto flamb = [adata](int b)
    {
        return adata + b;
    };
    auto fo = bind(flamb, 12);
    auto f1 = bind(testaddfunc, 10, placeholders::_1);
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            testaddfunc(10, 12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "normal function costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            flamb(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "lambda costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            fo();
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "bind result (bound to lambda) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto tp = chrono::high_resolution_clock::now();
        for (int n = 0; n < arTestCounts[i]; ++n)
        {
            f1(12);
        }
        auto tp1 = chrono::high_resolution_clock::now();
        auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
        cout << "bind result (bound to normal function) costs " << ms.count() << " count " << arTestCounts[i] << endl;
    }
}
The performance differences are shown in the figure:
2.3 Performance Gains from Memory Pools
Using memory resources, available since C++17
Over a workload of 200,000 list operations per run (repeated 100 times; see the code below), std::list was compared under the default allocator, an allocator backed by tcmalloc, the default pmr allocator, pmr with a monotonic_buffer_resource, and pmr with a monotonic_buffer_resource over a user-supplied buffer.
With a custom memory resource, pmr generally performs better.
#include <array>
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <memory_resource>
#include <vector>
using namespace std;

template<typename Func>
auto benchmark(Func test_func, int iterations)
{
    auto tp = chrono::high_resolution_clock::now();
    while (iterations-- > 0)
        test_func();
    auto tp1 = chrono::high_resolution_clock::now();
    auto ms = chrono::duration_cast<chrono::milliseconds>(tp1 - tp);
    return ms.count();
}
void testmemorypool()
{
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };
    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto default_pmr_alloc = [total_nodes]()
    {
        std::pmr::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto pmr_alloc_no_buf = [total_nodes]()
    {
        std::pmr::monotonic_buffer_resource mbr;
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    auto pmr_alloc_and_buf = [total_nodes]()
    {
        // initial buffer; once it is exhausted, monotonic_buffer_resource
        // falls back to its upstream resource for further allocations
        std::vector<std::byte> buffer(total_nodes);
        std::pmr::monotonic_buffer_resource mbr{buffer.data(), buffer.size()};
        std::pmr::polymorphic_allocator<int> pa{&mbr};
        std::pmr::list<int> list{pa};
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    const double t1 = benchmark(default_std_alloc, iterations);
    const double t2 = benchmark(default_pmr_alloc, iterations);
    const double t3 = benchmark(pmr_alloc_no_buf, iterations);
    const double t4 = benchmark(pmr_alloc_and_buf, iterations);
    std::cout << std::fixed << std::setprecision(3)
        << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
        << "t2 (default pmr alloc): " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
        << "t3 (pmr alloc no buf):  " << t3 << " ms; t1/t3: " << t1 / t3 << '\n'
        << "t4 (pmr alloc and buf): " << t4 << " ms; t1/t4: " << t1 / t4 << '\n';
    cout << " count " << iterations << endl;
}
// tcmalloc backend: with tcmalloc linked in, the malloc/free calls below
// are serviced by tcmalloc's thread-caching pools
struct _AllocatorT
{
    static void* Allocate(size_t size)
    {
        return malloc(size);
    }
    static void Free(void* p, size_t size)
    {
        free(p);
    }
};
void testtcmemorypool()
{
    constexpr int iterations{ 100 };
    constexpr int total_nodes{ 200'000 };
    constexpr int todeletepoint{ 333 };
    constexpr int deletecount{ 100 };
    auto default_std_alloc = [total_nodes]()
    {
        std::list<int> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    // STL_Allocator is an external STL-allocator adapter (shipped with
    // tcmalloc) that routes allocations through _AllocatorT above
    auto custom_alloc = [total_nodes]()
    {
        std::list<int, STL_Allocator<int, _AllocatorT>> list;
        for (int i{}; i != total_nodes; ++i)
        {
            list.push_back(i);
            if (i % todeletepoint == 0)
            {
                for (int j{}; list.size() && j < deletecount; ++j)
                    list.pop_back();
            }
        }
    };
    const double t1 = benchmark(default_std_alloc, iterations);
    const double t2 = benchmark(custom_alloc, iterations);
    std::cout << std::fixed << std::setprecision(3)
        << "t1 (default std alloc): " << t1 << " ms; t1/t1: " << t1 / t1 << '\n'
        << "t2 (with alloc):        " << t2 << " ms; t1/t2: " << t1 / t2 << '\n'
        << std::endl;
}
The measured differences are shown in the figure:
2.4 Performance Gains from Memory Alignment
Performance comparison of _aligned_malloc and malloc
Comparing the two functions, _aligned_malloc shows a performance improvement, though its advantage is not dramatic.
#include <cstdlib>

void testAlignAlloc()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        char* p = (char*)malloc(10 * sizeof(char));
        auto v = benchmark([p] { *p = ('c' + *p) % 127; }, arTestCounts[i]);
        free(p);
        cout << "malloc costs " << v << " count " << arTestCounts[i] << endl;
        // _aligned_malloc/_aligned_free are MSVC-specific
        char* p1 = (char*)_aligned_malloc(10 * sizeof(char), 8);
        v = benchmark([p1] { *p1 = ('c' + *p1) % 127; }, arTestCounts[i]);
        _aligned_free(p1);
        cout << "aligned_malloc costs " << v << " count " << arTestCounts[i] << endl;
    }
}
struct exampleObj
{
    char a;
    int b;
    char c;
};

// alignas is the standard (C++11) spelling; MSVC's __declspec(align(8))
// is equivalent but non-portable
struct alignas(8) exampleObj2
{
    char a;
    int b;
    char c;
};
void testStackAlign()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            exampleObj obj{};
            obj.a = ('c' + obj.a) % 127;
        }, arTestCounts[i]);
        cout << "default struct costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            exampleObj2 obj{};
            obj.a = ('c' + obj.a) % 127;
        }, arTestCounts[i]);
        cout << "aligned struct costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance difference chart:
Stack alignment (alignas):
2.5 Performance Gains from Rvalues
Performance comparison of rvalue, lvalue, and copy semantics
In this test case the rvalue and lvalue versions perform roughly the same, with the rvalue version slightly ahead; the rvalue advantage over copying, however, is clear. The example deliberately adds extra string operations to amplify these gaps.
#include <string>
#include <utility>

string getstringprof(string a)
{
    return a + "_returned";
}

string& getstringrefprof(string& a)
{
    a += "_ref_returned";
    return a;
}

string&& getstringrrefprof(string&& a)
{
    string&& r{ move(a) };
    r += "_rrefreturned";
    return move(r); // refers to the caller's temporary; valid only within the full expression
}

void testrightref()
{
    array<int, 5> arTestCounts{10000, 100000, 1000000, 10000000, 100000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([] {
            getstringprof("teststring");
        }, arTestCounts[i]);
        cout << "copy costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            string teststr("leftrefstring");
            getstringrefprof(teststr);
        }, arTestCounts[i]);
        cout << "lvalue reference costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([] {
            getstringrrefprof("rightrefstring");
        }, arTestCounts[i]);
        cout << "rvalue reference costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance differences:
2.6 Range-based for
Performance gains from range-based for
For this test case on a vector, the range-based for loop is dramatically faster than the iterator-based loop, and also noticeably faster than the index-based loop.
#include <random>

void testforperf()
{
    std::random_device rd;
    std::vector<int> vec;
    std::uniform_int_distribution<int> dist(0, 1000);
    for (int i{}; i < 1000; ++i)
    {
        vec.push_back(dist(rd));
    }
    array<int, 3> arTestCounts{10000, 100000, 1000000};
    for (int i{ 0 }; i < arTestCounts.size(); ++i)
    {
        auto v = benchmark([&vec] {
            for (auto it = vec.begin(); it != vec.end(); ++it)
                ;
        }, arTestCounts[i]);
        cout << "iterator for costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([&vec] {
            for (int n{}; n < vec.size(); ++n)
                ;
        }, arTestCounts[i]);
        cout << "index for costs " << v << " count " << arTestCounts[i] << endl;
        v = benchmark([&vec] {
            for (auto& c : vec)
                ;
        }, arTestCounts[i]);
        cout << "ranged for costs " << v << " count " << arTestCounts[i] << endl;
    }
}
Performance differences:
Comparison of the three approaches
2.7 The Cost of Synchronization
Where the mutex is taken has a large impact on performance
Synchronization protects shared data, typically with a lock; but a carelessly placed lock can noticeably hurt the performance of everything downstream, as the test case below shows.
The takeaway: minimize cross-thread access to shared data, and when it is unavoidable, transfer a moderately sized batch of data per lock acquisition.
#include <iostream>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

void addToVector(std::vector<int>& data, std::mutex& mtx, int value)
{
    std::lock_guard<std::mutex> lock(mtx); // locked; released automatically at scope exit
    data.push_back(value);
}

void addToVector2(std::vector<int>& data, std::mutex& mtx, const list<int>& lst)
{
    std::lock_guard<std::mutex> lock(mtx); // locked
    for (const auto& v : lst)
        data.emplace_back(v);
}
int testMutexVector()
{
    std::vector<int> data;
    std::mutex mtx;
    static const int count = 1000000;
    array<int, 4> costs{};
    // four threads push the same total number of elements, but batch
    // 1 / 1000 / 10000 / 100000 elements per lock acquisition
    std::thread thread1([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 1; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[0] = t;
    });
    std::thread thread2([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 1000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[1] = t;
    });
    std::thread thread3([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 10000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[2] = t;
    });
    std::thread thread4([&data, &mtx, &costs]() {
        auto t = benchmark([&data, &mtx]() {
            list<int> lst;
            for (int i{}; i < count; )
            {
                for (int j{}; j < 100000; ++j, ++i)
                    lst.push_back(i);
                addToVector2(data, mtx, lst);
                lst.clear();
            }
        }, 1);
        costs[3] = t;
    });
    thread1.join();
    thread2.join();
    thread3.join();
    thread4.join();
    cout << "1 per tick, cost " << costs[0] << endl;
    cout << "1000 per tick, cost " << costs[1] << endl;
    cout << "10000 per tick, cost " << costs[2] << endl;
    cout << "100000 per tick, cost " << costs[3] << endl;
    return 0;
}
The performance differences are as follows:
3. Summary
C++ is a continuously evolving language; every new standard introduces features and improvements. This article briefly compared modern C++ (C++17, C++20, and later) against C++98 on several new features, with performance measurements to gauge their impact on code.
Automatic type deduction:
Modern C++ offers automatic type deduction through auto and decltype, letting the compiler infer variable types. This simplifies code and improves readability. Since deduction happens entirely at compile time, it carries no runtime cost in itself; benchmarks comparing deduced and explicitly declared types really measure the types chosen, not the deduction.
Range-based for loop:
Modern C++ introduces the range-based for loop, which simplifies access to container elements with a more concise and intuitive syntax. Performance tests can compare the traditional for loop against the range-based version, as in section 2.6.
Smart pointers:
Modern C++ introduces smart pointers, including std::unique_ptr, std::shared_ptr, and std::weak_ptr, for managing dynamically allocated memory. They provide safer and more convenient memory management, avoiding problems such as memory leaks and dangling pointers. Performance tests can assess the cost of smart pointers relative to raw pointers, as in section 2.1.
Concurrency library:
Modern C++ ships a powerful concurrency library, including std::thread, std::mutex, and std::condition_variable, for multithreaded and concurrent programming. These facilities make concurrent code simpler and safer. Performance tests can compare code built on this library against traditional thread APIs.
Finally, I will keep exploring the features and improvements that modern C++ adds over C++98; they provide more powerful and more efficient programming tools. Through continued study and benchmarking, I hope to pick the right features to improve program performance as real needs arise.