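I have mostly used regular expressions for log analysis, with tools such as awk and grep. C++11 brought regular expressions into the standard, but since our projects rarely needed them, all I really knew was that C++11 supported them.
I remember a group chat last year where someone brought up std::regex and quite a few people complained about it.
I stayed out of that discussion, since I had not looked into the matter and had nothing to base an opinion on. It took a recent bug to show me that those complaints about std::regex were well founded.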
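For a high-traffic service, launching a model or a feature requires an experiment to verify its effect. Typically, traffic enters the experiment platform to be tagged, and the experiment tags returned by the platform are concatenated in a fixed structure and passed downstream. In the beginning there were few experiment tags, so all of them were returned to the client for reporting, and the experiment owners analyzed the data from there. This worked fine for a long time.
As business pressure grew, both the algorithm and product teams needed to run more experiments. Over time the experiments piled up and the tag string grew to several thousand or even more than ten thousand bytes, so pruning useless experiment tags became urgent. The tag string looks like this: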
expa;expb;layerid_def;
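Note that, for certain special reasons, when a request hits no experiment in a given experiment layer, that layer is represented as layerid_def. Analysis showed that these layerid_def entries made up more than half of the whole tag string, so after consulting the algorithm and product teams we decided to remove them. This is really a tiny requirement, a matter of a few lines of code, so my first thought was to use a regex: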
// res holds the raw tag string; the pattern matches "<digits>_def;"
static const std::regex rex("[0-9]*_def;");
std::string result;
std::regex_replace(std::back_inserter(result),
                   res.begin(), res.end(), rex, "");
The code is simple enough not to need much explanation. The result:
Input:
123;345_def;456_def;789
Output:
123;789
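For readers who want to try this locally, here is a minimal self-contained sketch of the same replacement. The helper name strip_default_tags is hypothetical and not from the original service; the pattern and the std::regex_replace call are the ones shown above.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption): removes every "<digits>_def;"
// token from a tag string, using the same pattern as the snippet above.
std::string strip_default_tags(const std::string& tags) {
    static const std::regex rex("[0-9]*_def;");
    std::string result;
    // Copies the input to result, replacing each match with the empty string.
    std::regex_replace(std::back_inserter(result),
                       tags.begin(), tags.end(), rex, "");
    return result;
}
```

Calling strip_default_tags("123;345_def;456_def;789") returns "123;789", matching the input and output listed above.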
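Then one day an alert fired: the service had restarted.
I logged in to the server, found a coredump file, and inspected the stack with gdb.
The core was inside regex. There had been no other release between the last deployment and this coredump, so std::regex was almost certainly the culprit, and I turned to Google with the relevant keywords.
Sure enough, the first few results showed that std::regex crashing is a known problem. The second result was a bug filed against gcc, which included a simple code sample: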
#include <regex>
#include <iostream>
int main() {
    std::string s (100000, '*');
    std::smatch m;
    std::regex r ("^(.*?)$");
    std::regex_search (s, m, r);
    std::cout << s.substr (0, 10) << std::endl;
    std::cout << m.str(1).substr (0, 10) << std::endl;
}
Compiling and running this code locally gives:
Program terminated with signal 11, Segmentation fault.
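#0 0x000000000040a0ae in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:200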
200 void _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-23.el6.x86_64
The stack trace showed exactly the same symptom as the production service.
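Since someone had already filed the bug with GNU, I did not bother reading the source to find the cause myself. I scrolled straight to the bottom of the page and found this reply (Nadav Har'El 2023-04-09 16:02:58 UTC):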
More than 5 years later, more and more projects are discovering this bug the hard way, and moving from std::regex to boost::regex which doesn't have this bug - boost::regex defaults to BOOST_REGEX_NON_RECURSIVE mode, which uses a stack on the heap instead of recursion (but I don't know if the specific examples shown the various duplicates all need this stack in practice, for example it's unfortunate if matching " *" needs to copy the entire input string in a stack). The latest example of this exodus is https://github.com/scylladb/scylladb/pull/13452.
So I think it's about time this issue is solved. Maybe even the Boost implementation can studied for inspiration and implementation ideas?
That reply makes the cause of this coredump fairly clear: the matcher recurses too many times and overflows the stack.
Now let's look at the call stack in gdb:
(gdb) bt
#0 std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:273
#1 0x000000000040a24a in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:257
#2 0x0000000000407d99 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_main_dispatch(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, std::integral_constant<bool, true>) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:87
#3 0x0000000000406892 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_main(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.h:116
#4 0x00000000004068c5 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_search_from_first() ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.h:101
#5 0x0000000000405a9e in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_search() ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:42
#6 0x0000000000404d98 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, false>(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.tcc:95
#7 0x0000000000404718 in bool std::regex_search<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.h:2148
#8 0x00000000004043ff in bool std::regex_search<std::char_traits<char>, std::allocator<char>, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::match_results<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::const_iterator, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.h:2254
The call chain is:
regex_search
-> regex_search
--> __detail::__regex_algo_impl
---> _M_search
----> _M_search_from_first
-----> _M_main
------> _M_main_dispatch
-------> _M_dfs
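Once _M_dfs shows up, the cause of the stack overflow is fairly clear: it is a depth-first search that calls itself (frames #0 and #1 are both _M_dfs), so the recursion depth grows with the length of the input being matched.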
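To make the failure mode concrete, here is a toy sketch. It is not libstdc++'s or Boost's actual code, and the function names are invented for illustration. It hand-codes the lazy pattern "^(.*?)$": at each position it first tries to end the match, and otherwise consumes one more character. The recursive form needs one call frame per input character, while the second form keeps its pending positions in a heap-allocated std::vector, which is the general idea the quoted reply attributes to BOOST_REGEX_NON_RECURSIVE.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy recursive matcher for "^(.*?)$" (illustrative only).
// max_depth records the deepest recursion level reached.
bool match_recursive(const std::string& s, std::size_t pos,
                     std::size_t depth, std::size_t& max_depth) {
    if (depth > max_depth) max_depth = depth;
    if (pos == s.size()) return true;  // '$' matches at end of input
    // Otherwise let '.' consume one more character and recurse.
    return match_recursive(s, pos + 1, depth + 1, max_depth);
}

// Same traversal with an explicit stack on the heap.
// The call-stack depth stays constant no matter how long the input is.
bool match_iterative(const std::string& s) {
    std::vector<std::size_t> pending{0};  // positions still to try
    while (!pending.empty()) {
        const std::size_t pos = pending.back();
        pending.pop_back();
        if (pos == s.size()) return true;  // '$' matches at end of input
        pending.push_back(pos + 1);        // consume one more character
    }
    return false;
}
```

For a 1000-character input, match_recursive reports a maximum depth of 1000, so the depth scales linearly with input length. A real regex executor uses far larger frames than this toy, which is why inputs of a few thousand to a hundred thousand bytes can be enough to exhaust the stack.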
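As for fixes, there are a few options:
• Increase the stack size, from the default 1 MB to 4 MB, though this is not recommended
• Split the string and check each token, without using a regex at all
• Use boost::regex (which defaults to BOOST_REGEX_NON_RECURSIVE mode)
In the end we went with the third option, boost::regex. After testing with long strings, a canary rollout, and then the full rollout, everything has been fine.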
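For comparison, the second option in the list (splitting instead of matching) could look roughly like the sketch below. This is not the code we shipped. The helper name strip_default_tags_split is hypothetical, and the sketch assumes that tags are separated by ';' and that a default tag is any token ending in "_def".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name and rules are assumptions): drops every
// ';'-separated token that ends in "_def"; kept tokens keep their separator.
std::string strip_default_tags_split(const std::string& tags) {
    static const std::string suffix = "_def";
    std::string result;
    result.reserve(tags.size());
    std::size_t start = 0;
    while (start < tags.size()) {
        std::size_t end = tags.find(';', start);
        const bool has_sep = (end != std::string::npos);
        if (!has_sep) end = tags.size();  // last token has no trailing ';'
        const std::size_t len = end - start;
        const bool is_default =
            len >= suffix.size() &&
            tags.compare(end - suffix.size(), suffix.size(), suffix) == 0;
        if (!is_default) {
            result.append(tags, start, len);
            if (has_sep) result.push_back(';');
        }
        start = has_sep ? end + 1 : end;
    }
    return result;
}
```

This runs in a single pass with no recursion, so its stack usage does not depend on the input length. Its behavior is not identical to the regex "[0-9]*_def;" in every edge case: for example, it also drops a trailing "_def" token that has no final ';', and it does not require the part before "_def" to be numeric.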
Source: 高性能架构探索