  httpcws 1.0.0 (latest version, released 2009-08-10)







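  Introduction to httpcws
  1. What is httpcws?
  HTTPCWS is an open-source Chinese word segmentation system based on the HTTP protocol. It currently supports Linux only. HTTPCWS performs segmentation through the API of the "ICTCLAS 3.0 2009 shared edition Chinese word segmentation algorithm" and returns the segmented result. HTTPCWS will replace PHPCWS, the Chinese word segmentation extension I developed previously.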

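  ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. It builds on years of research and is based on a multi-layer hidden Markov model. Its main functions are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word detection; it also supports user dictionaries. After five years of development and six kernel upgrades, ICTCLAS has reached version 3.0, with a segmentation accuracy of 98.45% and dictionary data that totals less than 3 MB after compression. ICTCLAS took first place in an evaluation organized by the national 973 expert group and won several first places in the first evaluation organized by SIGHAN, the international Chinese language processing research body. At the time of writing it is described as the best Chinese lexical analyzer in the world.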

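  The commercial edition of ICTCLAS 3.0 is paid. The free ICTCLAS 3.0 shared edition is not open source, and its dictionary is derived from one month of People's Daily text, so many words are missing. I therefore added a custom dictionary of 190,000 entries. HTTPCWS merges it with the ICTCLAS segmentation result and outputs the final segmentation.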

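  Because the ICTCLAS 3.0 2009 shared edition supports only GBK encoding, a UTF-8 string should first be converted to GBK with the iconv function, then segmented with httpcws, and the result finally converted back to UTF-8.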
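  The conversion step described above can be sketched in shell. This is a minimal sketch, assuming the `iconv` command-line tool with GBK support is installed; the function names `to_gbk_urlencoded` and `from_gbk` are illustrative and not part of httpcws. It percent-encodes every byte of the GBK string, which is one valid way to build a form-encoded request body.

```shell
#!/bin/sh
# Hypothetical helper: convert UTF-8 text to GBK, then percent-encode
# every byte so the result can be sent as a form-encoded request body.
to_gbk_urlencoded() {
    printf '%s' "$1" \
        | iconv -f UTF-8 -t GBK \
        | od -An -v -tx1 \
        | tr -d ' \n' \
        | sed 's/../%&/g' \
        | tr 'a-f' 'A-F'
}

# Hypothetical helper: convert a GBK response read from stdin back to UTF-8.
from_gbk() {
    iconv -f GBK -t UTF-8
}

encoded=$(to_gbk_urlencoded "有人的地方就有江湖")
printf '%s\n' "$encoded"
# prints %D3%D0%C8%CB%B5%C4%B5%D8%B7%BD%BE%CD%D3%D0%BD%AD%BA%FE
```

  The encoded value can be passed to `curl -d`, and the response piped through `from_gbk` to read it as UTF-8.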

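  The HTTPCWS software itself (including the httpcws.cpp source file and the dict/httpcws_dict.txt custom dictionary) is released under the New BSD license and may be modified freely. The ICTCLAS shared edition API used by HTTPCWS and the corpus in the dict/Data/ directory are the copyright of the Institute of Computing Technology, Chinese Academy of Sciences, and ictclas.org; their use is subject to the corresponding license terms.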

  2. httpcws Chinese word segmentation online demo

  3. Downloading and installing httpcws
# 32-bit (i386) build
cd /usr/local/
wget http://httpcws.googlecode.com/files/httpcws-1.0.0-i386-bin.tar.gz
tar zxvf httpcws-1.0.0-i386-bin.tar.gz
rm -f httpcws-1.0.0-i386-bin.tar.gz
cd httpcws-1.0.0-i386-bin/
# Raise the soft and hard open-file limits for this shell before starting the server
ulimit -SHn 65535
/usr/local/httpcws-1.0.0-i386-bin/httpcws -d -x /usr/local/httpcws-1.0.0-i386-bin/dict/

# 64-bit (x86_64) build
cd /usr/local/
wget http://httpcws.googlecode.com/files/httpcws-1.0.0-x86_64-bin.tar.gz
tar zxvf httpcws-1.0.0-x86_64-bin.tar.gz
rm -f httpcws-1.0.0-x86_64-bin.tar.gz
cd httpcws-1.0.0-x86_64-bin/
# Raise the soft and hard open-file limits for this shell before starting the server
ulimit -SHn 65535
/usr/local/httpcws-1.0.0-x86_64-bin/httpcws -d -x /usr/local/httpcws-1.0.0-x86_64-bin/dict/



  4. Using httpcws

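# HTTPCWS_URL is a placeholder: the service address is missing from the source
curl -d "有人的地方就有江湖" "$HTTPCWS_URL"
curl -d "%D3%D0%C8%CB%B5%C4%B5%D8%B7%BD%BE%CD%D3%D0%BD%AD%BA%FE" "$HTTPCWS_URL"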

  PHP examples of calling HTTPCWS:

  ①. Segmenting a GBK-encoded string (HTTP POST):
@header('Content-Type: text/html; charset=gb2312');
// Placeholder: set this to the address of your httpcws service (the URL is missing from the source)
$httpcws_url = "";
$text = "有人的地方就有江湖";
$text = urlencode($text);
$opts = array(
    'http' => array(
        'method'  => "POST",
        'header'  => "Content-type: application/x-www-form-urlencoded\r\n" .
                     "Content-length: " . strlen($text) . "\r\n" .
                     "Cookie: foo=bar\r\n",
        'content' => $text,
    ),
);
$context = stream_context_create($opts);
$result = file_get_contents($httpcws_url, false, $context);
echo $result;

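  ②. Segmenting a UTF-8 encoded string (HTTP POST):
@header('Content-Type: text/html; charset=utf-8');
// Placeholder: set this to the address of your httpcws service (the URL is missing from the source)
$httpcws_url = "";
$text = "有人的地方就有江湖";
$text = iconv("UTF-8", "GBK//IGNORE", $text); // httpcws expects GBK input
$text = urlencode($text);
$opts = array(
    'http' => array(
        'method'  => "POST",
        'header'  => "Content-type: application/x-www-form-urlencoded\r\n" .
                     "Content-length: " . strlen($text) . "\r\n" .
                     "Cookie: foo=bar\r\n",
        'content' => $text,
    ),
);
$context = stream_context_create($opts);
$result = file_get_contents($httpcws_url, false, $context);
$result = iconv("GBK", "UTF-8//IGNORE", $result); // convert the GBK response back to UTF-8
echo $result;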

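  ③. Segmenting a GBK-encoded string (HTTP GET):
@header('Content-Type: text/html; charset=gb2312');
// Placeholder: service URL up to and including the query-string prefix (missing from the source)
$httpcws_url = "";
$text = "有人的地方就有江湖";
$text = urlencode($text);
$result = file_get_contents($httpcws_url . $text);
echo $result;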

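  ④. Segmenting a UTF-8 encoded string (HTTP GET):
@header('Content-Type: text/html; charset=utf-8');
// Placeholder: service URL up to and including the query-string prefix (missing from the source)
$httpcws_url = "";
$text = "有人的地方就有江湖";
$text = iconv("UTF-8", "GBK//IGNORE", $text); // httpcws expects GBK input
$text = urlencode($text);
$result = file_get_contents($httpcws_url . $text);
$result = iconv("GBK", "UTF-8//IGNORE", $result); // convert the GBK response back to UTF-8
echo $result;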

  5. httpcws segmentation speed and uses

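  Within a LAN, the average processing time (wait time) of the HTTPCWS segmentation interface is 0.001 seconds. HTTPCWS is built on the libevent + epoll network I/O model; in testing it handled 5,000 to 20,000 requests per second.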


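  HTTPCWS is part of [http://blog.zyan.cc/post/385.htm Architecture Design of a High-Concurrency General-Purpose Search Engine for Hundreds of Millions of Records], where it segments the keywords passed to the search query interface.

  In that architecture, the Sphinx indexing engine handles CJK (Chinese, Japanese, Korean) text with unigram segmentation. Given the text 【反恐行动是国产主视角射击网络游戏】, Sphinx splits it into 【反 恐 行 动 是 国 产 主 视 角 射 击 网 络 游 戏】 and builds an inverted index for each character. A nonexistent word assembled from characters in that sentence, such as 【恐动】, would therefore also match. To avoid this, the query must be quoted: searching for 【"反恐行动"】 matches only the four characters in sequence, and the non-contiguous 【"恐动"】 no longer matches.

  Quoting alone leaves another problem: searches for 【"反恐行动游戏"】 or 【"国产网络游戏"】 find nothing. I therefore put Chinese word segmentation in the search layer (originally as the PHPCWS PHP extension, now httpcws). Searches for "反恐行动游戏" and "国产网络游戏" are split by httpcws into "反恐行动 游戏" and "国产 网络游戏". A PHP function then wraps each space-separated word in quotes, and the search for 【"反恐行动" "游戏"】 or 【"国产" "网络游戏"】 finds the record. Because httpcws sits in the search layer, adding, deleting, or changing dictionary entries only requires restarting the httpcws process; the search index does not need to be rebuilt.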
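  The quoting step can be sketched as follows. The post does this with a PHP function that is not shown, so this is a minimal shell sketch of the same idea, not the author's code. The function name `quote_terms` is hypothetical, and the segmented input is hard-coded from the examples above rather than fetched from httpcws.

```shell
#!/bin/sh
# Hypothetical helper: wrap each whitespace-separated term in double
# quotes so that every term is searched as an exact phrase.
quote_terms() {
    set -f                      # disable globbing while the input is split
    quoted=""
    for term in $1; do          # unquoted on purpose: split on whitespace
        quoted="${quoted:+$quoted }\"$term\""
    done
    set +f
    printf '%s\n' "$quoted"
}

quote_terms "反恐行动 游戏"      # prints: "反恐行动" "游戏"
quote_terms "国产 网络游戏"      # prints: "国产" "网络游戏"
```

  Splitting on whitespace also absorbs repeated or trailing spaces in the segmenter output, so no empty quoted terms are produced.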


