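[Author: Zhang Yan  Version: v1.0  Last modified: 2008.12.09  When reposting, please cite the original link: http://blog.zyan.cc/post/385/]
Back in July I wrote an article, "Architecture Design for Full-Text Search (Search Engine) over Tens of Millions of Records Based on Sphinx+MySQL". My former company's classified-ads search was built on that architecture with clear results, and a large share of the MySQL queries carrying WHERE conditions were even moved over to Sphinx+MySQL search. That architecture still has limitations, however. First, MySQL's own concurrency is limited: at 200 to 300 concurrent connections, queries and updates become noticeably slow. Second, because the primary key of a MySQL table maps one-to-one to the document ID in the Sphinx index, a site-wide query cannot be built across multiple tables, and adding a new category requires editing the configuration file, which is tedious. Third, because it is tightly integrated with MySQL, Sphinx cannot play to its strengths.
Recently I designed the following new search engine architecture, and I have already written beta versions of the "search query interface" and the "index update interface". In testing on an ordinary PC (Pentium 4 3.6GHz dual-core CPU, 2GB RAM) with 70 million index records, the search query interface averaged 0.0XX seconds per query (on par with search engines such as Baidu, Google, Sogou and Yahoo China; see "Appendix 2" at the end of this article) and sustained up to 5,000 concurrent connections. The index update interface, covering the full cycle of parsing the data, enqueuing it and returning a response to the caller, reached 1,500 requests/sec.
The "queue controller" is the core component. It controls reading from the queue, updates the MySQL main table and delta table, updates Tokyo Tyrant (the search engine's data storage layer), refreshes the Sphinx delta index in near real time (within 1 minute), and periodically merges the Sphinx indexes. I expect to finish its beta version this week.
Diagram notes: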
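1. Search query interface:
① The web application server passes the search keywords and other conditions to the search.php interface on the search engine server via HTTP POST/GET;
②③ search.php queries the Sphinx index service through the Sphinx API (I rewrote the latest Sphinx 0.9.9-rc1 API as a PHP extension in C, sphinx.so) and obtains the list of unique search IDs that match the query (a 15-digit unique search ID: the first 5 digits are the category ID, the last 10 digits are the primary key ID in the original data table);
④⑤ search.php uses these IDs as keys and fetches the corresponding text data from Tokyo Tyrant in a single mget over the Memcache protocol;
⑥⑦ search.php builds excerpts and highlights the keywords in the result set according to the query conditions, then returns it to the web application server in JSON or XML format.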
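The query path above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's search.php: `make_search_id`, `split_search_id`, `highlight` and `search` are hypothetical names, the Sphinx query is replaced by a caller-supplied function, and a plain dict stands in for the Tokyo Tyrant multi-get. The ID layout (5-digit category ID followed by a 10-digit primary key) is the one described in step ②③; it is packed numerically here, so leading zeros only appear if the value is printed zero-padded. Values of this size no longer fit in 32 bits.

```python
import json
import re

PRIMARY_KEY_DIGITS = 10  # trailing digits: primary key in the original table
CATEGORY_DIGITS = 5      # leading digits: category ID


def make_search_id(category_id, primary_key):
    """Pack a category ID and a primary key into one 15-digit search ID."""
    if not 0 < category_id < 10 ** CATEGORY_DIGITS:
        raise ValueError("category_id must fit in 5 digits")
    if not 0 <= primary_key < 10 ** PRIMARY_KEY_DIGITS:
        raise ValueError("primary_key must fit in 10 digits")
    return category_id * 10 ** PRIMARY_KEY_DIGITS + primary_key


def split_search_id(search_id):
    """Return (category_id, primary_key) for a packed search ID."""
    return divmod(search_id, 10 ** PRIMARY_KEY_DIGITS)


def highlight(text, keyword, width=20):
    """Cut an excerpt around the first match and wrap every match in <b> tags."""
    if not keyword:
        return text[: 2 * width]
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[: 2 * width]
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text[start:end])


def search(keyword, index_lookup, store):
    """Steps 2-7: look up IDs, fetch the texts in one pass, excerpt, return JSON."""
    ids = index_lookup(keyword)                        # stand-in for the Sphinx query
    docs = {i: store[i] for i in ids if i in store}    # stand-in for one multi-get
    results = [{"id": i, "excerpt": highlight(docs[i], keyword)}
               for i in ids if i in docs]
    return json.dumps(results, ensure_ascii=False)
```

IDs that the index returns but the store no longer holds are simply skipped, which is one reasonable way to tolerate the short window in which the index and the storage layer disagree.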
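2. Index update interface:
⑴ The web application server tells the update.php interface on the search server, via HTTP POST/GET, which content to add, delete or update;
⑵ update.php processes the received information and writes it into the TT high-speed queue (a queue system I built on top of Tokyo Tyrant);
Note: these two steps can handle more than 1,500 requests per second, enough to serve the search index update calls of a site with 60 million PV.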
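To make the "validate, enqueue, acknowledge" idea concrete, here is a minimal sketch of a FIFO queue layered on a key-value store. It is not the author's TT high-speed queue, whose design the post does not describe: the `KVQueue` class, the key naming scheme and the `handle_update` function are all assumptions for illustration, and a Python dict stands in for Tokyo Tyrant.

```python
import json

VALID_ACTIONS = {"add", "delete", "update"}


class KVQueue:
    """FIFO queue on a key-value store (a dict stands in for the real server)."""

    def __init__(self, store=None, name="q"):
        self.store = {} if store is None else store
        self.name = name
        self.store.setdefault(name + ":head", 0)  # position of the next item to read
        self.store.setdefault(name + ":tail", 0)  # position of the next free slot

    def push(self, item):
        """Append one message and return its position in the queue."""
        pos = self.store[self.name + ":tail"]
        self.store["%s:%d" % (self.name, pos)] = json.dumps(item, ensure_ascii=False)
        self.store[self.name + ":tail"] = pos + 1
        return pos

    def pop_batch(self, limit=50):
        """Remove and return up to `limit` messages in arrival order."""
        head = self.store[self.name + ":head"]
        end = min(head + limit, self.store[self.name + ":tail"])
        items = [json.loads(self.store.pop("%s:%d" % (self.name, i)))
                 for i in range(head, end)]
        self.store[self.name + ":head"] = end
        return items

    def __len__(self):
        return self.store[self.name + ":tail"] - self.store[self.name + ":head"]


def handle_update(queue, action, search_id, text=""):
    """Hypothetical update handler: validate, enqueue, acknowledge immediately."""
    if action not in VALID_ACTIONS:
        return {"ok": False, "error": "unknown action"}
    pos = queue.push({"action": action, "id": search_id, "text": text})
    return {"ok": True, "position": pos}
```

The caller gets its answer as soon as the message is queued, so the request rate does not depend on how fast the indexes are rebuilt. Against a real server the head and tail counters would have to be advanced atomically, which this single-process sketch ignores. On the capacity figure: if the 60 million PV is read as page views per day (an assumption, since the post does not state the period), that averages roughly 694 requests per second, comfortably below 1,500.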
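3. Search index and data storage control:
㈠ The "queue controller" daemon reads messages from the TT high-speed queue in a loop (50 at a time, until it reaches the end);
㈡ The queue controller writes the messages it has read into Tokyo Tyrant, the search engine's data storage layer;
㈢ The queue controller asynchronously writes the messages into the MySQL main table (this table is partitioned every 5 million records and serves only as a permanent backup of the data);
㈣ The queue controller writes the messages into the MySQL delta table;
㈤ Within 1 minute, the queue controller triggers a Sphinx delta index update: the Sphinx indexer uses the MySQL delta table as its data source and builds the delta index. The Sphinx delta index corresponds directly to the MySQL delta table that feeds it;
㈥ Every 3 hours, the queue controller briefly stops reading from the TT high-speed queue and triggers Sphinx to merge the delta index into the main index (this is very fast), and at the same time empties the MySQL delta table. This keeps the delta table at a few thousand to a few hundred thousand rows at most, which greatly speeds up delta index rebuilds. It then resumes reading from the TT high-speed queue and writing into the MySQL delta table.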
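The six steps above amount to a small scheduling loop. The sketch below is one possible shape for it and makes several assumptions that are not in the post: the class and function names are hypothetical, Python lists and a dict stand in for the two MySQL tables and Tokyo Tyrant, and the Sphinx index names `main` and `delta` are placeholders. The `indexer --rotate` and `indexer --merge` invocations are the standard Sphinx command-line forms; here they are only built as argument lists and handed to a `run` callback rather than executed.

```python
import time

BATCH_SIZE = 50               # messages read per queue access (step 1)
DELTA_INTERVAL = 60           # seconds between delta index rebuilds (step 5)
MERGE_INTERVAL = 3 * 60 * 60  # seconds between delta-into-main merges (step 6)


def delta_reindex_cmd(delta="delta"):
    """Rebuild the delta index from the delta table and rotate it in."""
    return ["indexer", "--rotate", delta]


def merge_cmd(main="main", delta="delta"):
    """Merge the delta index into the main index and rotate it in."""
    return ["indexer", "--merge", main, delta, "--rotate"]


class QueueController:
    def __init__(self, read_batch, run=print, clock=time.monotonic):
        self.read_batch = read_batch  # callable(limit) -> list of messages
        self.run = run                # callable(argv); use subprocess.run in practice
        self.clock = clock
        self.store = {}               # stand-in for Tokyo Tyrant
        self.main_table = []          # stand-in for the MySQL main (backup) table
        self.delta_table = []         # stand-in for the MySQL delta table
        self.last_delta = self.last_merge = clock()

    def drain(self):
        """Steps 1-4: read batches until the queue is empty, fan out each message."""
        total = 0
        while True:
            batch = self.read_batch(BATCH_SIZE)
            if not batch:
                return total
            for msg in batch:
                if msg.get("action") == "delete":
                    self.store.pop(msg["id"], None)
                else:
                    self.store[msg["id"]] = msg.get("text", "")
                self.main_table.append(msg)
                self.delta_table.append(msg)
            total += len(batch)

    def tick(self):
        """Run one scheduling pass and report what was done."""
        now = self.clock()
        if now - self.last_merge >= MERGE_INTERVAL:
            # Step 6: the queue is not read during the merge. Rebuilding the
            # delta index first (an assumption) avoids dropping unindexed rows.
            if self.delta_table:
                self.run(delta_reindex_cmd())
            self.run(merge_cmd())
            self.delta_table.clear()
            self.last_merge = self.last_delta = now
            return "merged"
        self.drain()
        if self.delta_table and now - self.last_delta >= DELTA_INTERVAL:
            self.run(delta_reindex_cmd())  # step 5
            self.last_delta = now
            return "delta"
        return "idle"
```

A daemon would call `tick()` in a loop with a short sleep between passes. Injecting `clock` and `run` keeps the timing logic testable without waiting three hours or having Sphinx installed.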
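Open-source software used in this architecture:
1. Sphinx 0.9.9-rc1
2. Tokyo Tyrant 1.1.9
3. MySQL 5.1.30
4. Nginx 0.7.22
5. PHP 5.2.6
Programs developed in-house for this architecture:
1. Search query interface (search.php)
2. Index update interface (update.php)
3. Queue controller
4. PHP extension for the Sphinx 0.9.9-rc1 API (sphinx.so)
5. High-speed queue system based on Tokyo Tyrant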
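Appendix 1: Third-party comparison of MySQL FullText, Lucene search and Sphinx search:
1. Query speed:
MySQL FullText is the slowest. Lucene and Sphinx are about equal in query speed, with Sphinx slightly ahead.
2. Indexing speed:
Sphinx builds indexes the fastest, more than 9 times faster than Lucene in this comparison. This makes Sphinx well suited to a near-real-time search engine.
3. For the detailed comparison data, see the following PDF document: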
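Appendix 2: Search speed analysis of the major Chinese search engines:
Using "APMServ张宴" as the keyword, I compared the search speed of the major Chinese search engines:
1. Baidu:
① First search
② Second search
Analysis: Baidu cached the results of the first search, so the second query was very fast.
2. Google:
① First search
② Second search
Analysis: Google also cached the results of the first search, but both queries were somewhat slower than the corresponding Baidu queries.
3. Sogou:
① First search
② Second search
③ Third search
Analysis: Sogou appears to cache the results of the first search only briefly. The second search was very fast, while the third was slower than the second. Sogou's first search was about as fast as Baidu's.
4. Yahoo China:
① First search
② Second search
Analysis: The search results were not cached. Yahoo China's search speed was about the same as Baidu's first search.
5. NetEase Youdao:
① First search
② Second search
Analysis: Youdao cached the results of the first search. Like Google, however, both of its searches were somewhat slower than the corresponding searches on Baidu, Sogou and Yahoo China.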
地方
2021-3-17 11:29
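"The queue controller writes the messages it has read into the MySQL delta table": how is this delta table actually implemented? Inserts are easy to handle, but updates are hard.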