[Author: Zhang Yan. Version: v1.0. Last modified: 2008.07.27. When reposting, please credit the original link: http://blog.zyan.cc/post/360/]

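  Preface: This article describes a full-text search (search engine) architecture for tens of millions of records that has been validated in a production environment. Only excerpts from the first few chapters are listed here; the full text is not provided.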

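  I ran measurements on a DELL PowerEdge 6850 server (four 64-bit Intel Xeon MP 7110N processors / 8GB RAM) running RedHat AS4 Linux and MySQL 5.1.26 with the MyISAM storage engine and key_buffer=1024M. The test table held 10 million records and had a dozen or so columns of types such as int, datetime, varchar and text, with only a primary key and no other indexes. A SQL query using the PRIMARY KEY as the WHERE condition is very fast, taking only 0.01 seconds.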

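  Sphinx, an open-source full-text search engine from Russia, can hold up to 100 million records in a single index, and with 10 million records its query time is 0.x seconds (on the order of milliseconds). Sphinx builds indexes quickly: indexing 1 million records takes only 3 to 4 minutes, indexing 10 million records completes within 50 minutes, and a delta index containing only the latest 100,000 records can be rebuilt in a few tens of seconds.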

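  Based on the points above, I designed this search engine architecture. It has been running in production for a week with very good results. When I have time, I plan to develop a MySQL storage engine plugin specifically to work with the Sphinx search engine: simple in logic, fast, low in memory use and free of table locks. It would replace MyISAM and resolve the table-lock delays that MyISAM suffers under frequent updates. In addition, distributed search no longer poses any technical problem.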



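  I. Search engine architecture design:
  1. Search engine architecture diagram:
  [Figure: search engine architecture diagram; image not included]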

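  2. Design rationale for the search engine architecture:
  (1) Simplest possible calling convention:
  To make things as easy as possible for front-end web engineers, a single simple SQL statement, "SELECT ... FROM myisam_table JOIN sphinx_table ON (sphinx_table.sphinx_id=myisam_table.id) WHERE query='...';", is all that is needed for efficient search.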
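
  As a rough illustration of how a front end might assemble that statement, here is a minimal Python sketch. The helper name build_search_sql and the "myisam_table.*" column list are hypothetical; the table and column names are taken from the statement above. It assumes SphinxSE's convention of passing search options after the keywords, separated by semicolons (for example mode, offset and limit), and it uses the "%s" placeholder style of common MySQL drivers for Python. Replacing semicolons in user input with spaces is a simplification, not a complete sanitizer.

```python
def build_search_sql(keywords, mode="any", offset=0, limit=20):
    """Return (sql, params) for a search through a SphinxSE table.

    Hypothetical helper: table/column names follow the article's example.
    """
    allowed_modes = {"all", "any", "phrase", "boolean", "extended"}
    if mode not in allowed_modes:
        raise ValueError("unsupported match mode: %r" % mode)

    # ';' separates options inside the SphinxSE query string, so user
    # input must not contain it. Also collapse runs of whitespace.
    cleaned = " ".join(keywords.replace(";", " ").split())

    sphinx_query = "%s;mode=%s;offset=%d;limit=%d" % (
        cleaned, mode, int(offset), int(limit)
    )

    # The whole Sphinx query travels as ONE bound parameter.
    sql = (
        "SELECT myisam_table.* FROM myisam_table "
        "JOIN sphinx_table ON (sphinx_table.sphinx_id = myisam_table.id) "
        "WHERE sphinx_table.query = %s"
    )
    return sql, (sphinx_query,)


sql, params = build_search_sql("mysql; sphinx", mode="any", limit=10)
```

  The returned pair can be handed to a driver's cursor.execute(sql, params), so the keywords are never concatenated into the SQL text itself.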

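  (2) Fast index creation and fast queries:
  ① Sphinx Search is a high-performance full-text search package developed by Andrew Aksyonoff of Russia, released under a dual license: the GPL and a commercial license.
  Features of Sphinx:
  •High-speed indexing (up to 10MB/s, whereas Lucene indexes at 1.8MB/s)
  •High-performance search (on 2-4 GB of text, results are returned in 0.1 seconds on average)
  •High scalability (in testing, up to 100GB of text has been indexed, and a single index can hold 100 million records)
  •Support for distributed search
  •Support for a composite result ranking mechanism that combines phrase-based and statistical ranking
  •Support for any number of document fields (numeric attributes or full-text fields)
  •Support for different search modes ("match all", "match phrase" and "match any")
  •Support for running as a MySQL storage engine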

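  ② Tests by the "High Performance MySQL" expert team show that primary-key lookups of the form "SELECT ... FROM ... WHERE id = ..." (where id is the PRIMARY KEY) can handle more than 10,000 queries per second, while ordinary SELECT queries can handle only tens to hundreds per second:
  [Figure: benchmark chart from the "High Performance MySQL" tests; image not included]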

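  ③ Sphinx is not responsible for storing text fields. Suppose a search index is built with Sphinx over the id, date, title and body columns of a database. When Sphinx is queried by keyword, time, category, range and so on, it returns only the IDs and other non-text information of the matching records. To display title, body and similar fields, you still have to look them up by ID in the MySQL database, or fetch them from another store such as Memcachedb. Installing SphinxSE as a MySQL storage engine, which ties MySQL and Sphinx together, is a convenient way to do this.
  Create a Sphinx-type table and JOIN the primary key ID of the MyISAM table with the ID of the Sphinx table. For the MyISAM table, this amounts to nothing more than a "WHERE id=..." primary-key lookup, while every condition after WHERE is handed to Sphinx. This plays to the strengths of both and gives high-speed search queries.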
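
  The two-step pattern described above (the search engine returns ranked IDs, then the text columns are fetched by primary key) can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for MySQL, the articles table and its contents are made up, and the matched_ids list is a hypothetical stand-in for what Sphinx would return.

```python
import sqlite3


def fetch_rows_by_ids(conn, ids):
    """Fetch (id, title, body) by primary key, preserving the order of ids."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT id, title, body FROM articles WHERE id IN (%s)" % placeholders,
        list(ids),
    ).fetchall()
    # IN (...) does not guarantee order, so re-sort by the search ranking.
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# Stand-in for the MySQL table that stores the text columns (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "first", "body one"), (2, "second", "body two"), (3, "third", "body three")],
)

# Hypothetical ranked result from the search engine: IDs only, no text.
matched_ids = [3, 1]
results = fetch_rows_by_ids(conn, matched_ids)
```

  With SphinxSE the JOIN performs both steps inside MySQL in one statement; the sketch only makes the division of labour explicit.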

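  (3) Separation by service type:
  To ensure data consistency, I lock the tables while Sphinx reads the index source from the MySQL database. Reading the index source takes some time, and because MyISAM read locks and write locks are mutually exclusive, writes could be blocked for a long time and database replication could fall behind. To avoid this, I separated the MySQL instance that provides the "search query service" from the one that provides the "index source service". The MySQL instance listening on port 3306 provides the search query service, and the one listening on port 3406 provides the index source service.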

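  (4) "Main index + delta index" update scheme:
  Typical characteristics of a website: information is published fairly frequently; newly published items are likely to be edited or modified; old posts from more than two days ago rarely change.
  Based on these characteristics, I designed a Sphinx main index and a delta index. Records from before 17:00 on the day before yesterday go into the main index, which is rebuilt automatically once a day in the early morning. Records from 17:00 on the day before yesterday up to the present go into the delta index, which is rebuilt automatically every 3 minutes.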
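
  The boundary between the two indexes can be expressed as a small calculation. The Python sketch below is illustrative only: the function names are hypothetical, and treating a record stamped exactly 17:00:00 as belonging to the delta index is an assumption, since the text does not say which side the boundary falls on.

```python
from datetime import datetime, timedelta


def main_index_cutoff(now):
    """Return 17:00 on the day before yesterday, relative to `now`."""
    day = (now - timedelta(days=2)).date()
    return datetime(day.year, day.month, day.day, 17, 0, 0)


def index_for(record_time, now):
    """Decide which index a record belongs to at time `now`.

    Assumption: a record stamped exactly at the cutoff goes to the delta index.
    """
    return "main" if record_time < main_index_cutoff(now) else "delta"


now = datetime(2008, 7, 27, 10, 30)
cutoff = main_index_cutoff(now)
```

  In a Sphinx configuration the same cutoff would typically appear as a condition in each index source's SQL query, so that the two indexes never overlap.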

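  (5) Combining the Ext3 file system with the tmpfs in-memory file system:
  Rebuilding the delta index every 3 minutes would otherwise cause heavy disk I/O and push up the system load. To avoid this, I keep the main index files on disk and create the delta index files in the tmpfs in-memory file system under "/dev/shm/". Files under "/dev/shm/" reside entirely in memory, so reads and writes are very fast. However, a server reboot wipes everything under "/dev/shm/". To handle this, the directory structure under "/dev/shm/" and the Sphinx delta index are recreated automatically when the server boots.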
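
  The boot-time step of recreating the directory structure could look like the following minimal Python sketch. The helper name and the "sphinx/delta" subdirectory are hypothetical, since the article does not give the actual layout. To keep the sketch safe to run anywhere, it writes under a temporary directory; on the real server the base directory would be /dev/shm.

```python
import os
import tempfile


def ensure_delta_index_dirs(base_dir, subdirs):
    """Create each subdirectory under base_dir if missing; return the paths."""
    created = []
    for name in subdirs:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat on every boot.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Stand-in for /dev/shm so the example does not touch the real tmpfs.
demo_base = tempfile.mkdtemp()
paths = ensure_delta_index_dirs(demo_base, ["sphinx/delta"])  # hypothetical layout
```

  After the directories exist, the delta index itself still has to be rebuilt before searches can use it; the sketch covers only the directory step.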

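  (6) Chinese word segmentation dictionary:
  I compiled a Chinese word segmentation dictionary from my own collected segmentation dictionary, the Sogou Pinyin input method cell dictionaries, the LibMMSeg high-frequency word list and other sources. For various reasons I am not releasing it for now. You can use the Chinese word segmentation dictionary bundled with LibMMSeg.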



  II. MySQL + Sphinx + SphinxSE installation steps:
  1. Install Python support (the following is for CentOS; on other Linux systems, use the corresponding installation method)
yum install -y python python-devel


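  2. Compile and install LibMMSeg (LibMMSeg is a Chinese word segmentation package designed for the Sphinx full-text search engine. It is released under the GPL and uses Chih-Hao Tsai's MMSEG algorithm. In this article it is used to generate the Chinese word segmentation dictionary.)

  The archive "sphinx-0.9.8-rc2-chinese.zip" used below contains mmseg-0.7.3.tar.gz, sphinx-0.9.8-rc2.tar.gz and the Chinese word segmentation patches.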


unzip sphinx-0.9.8-rc2-chinese.zip
tar zxvf mmseg-0.7.3.tar.gz
cd mmseg-0.7.3/
./configure
make
make install
cd ../


  3. Compile and install MySQL 5.1.26-rc, Sphinx and the SphinxSE storage engine
wget http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.1.26-rc.tar.gz/from/http://mirror.x10.com/mirror/mysql/
tar zxvf mysql-5.1.26-rc.tar.gz

tar zxvf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8-rc2/
patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
patch -p1 < ../fix-crash-in-excerpts.patch
cp -rf mysqlse ../mysql-5.1.26-rc/storage/sphinx
cd ../

cd mysql-5.1.26-rc/
sh BUILD/autorun.sh
./configure --with-plugins=sphinx --prefix=/usr/local/mysql-search/ --enable-assembler --with-extra-charsets=complex --enable-thread-safe-client --with-big-tables --with-readline --with-ssl --with-embedded-server --enable-local-infile
make && make install
cd ../

cd sphinx-0.9.8-rc2/
CPPFLAGS=-I/usr/include/python2.4 LDFLAGS=-lpython2.4 ./configure --prefix=/usr/local/sphinx --with-mysql=/usr/local/mysql-search
make
make install
cd ../

mv /usr/local/sphinx/etc/sphinx.conf /usr/local/sphinx/etc/sphinx.conf.old




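  The content after Chapter 2, Section 3 is not published. The table of contents of the full text (24 pages in total) is as follows:

  [Figures: three images of the table of contents; images not included]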



  Added February 5, 2010:

  For the full document, visit: http://blog.zyan.cc/sphinx_search/






