Sphinx, a high-speed full-text search engine: installation guide and download
http://srsman.com/2008/08/sphinx/
Tags: MySQL, postgresql, Sphinx, SQL, Chinese word segmentation
You can compile and install postgresql as usual. When you use tsearch2 for Chinese full-text indexing, the real difference arises when you initialize the database.
Tags: postgresql, Chinese word segmentation
Tags: Chinese word segmentation
Tags: C#, ShootSearch, Chinese word segmentation
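In Lucene, using a Chinese word segmenter written by someone else is simple: just add it to the CLASSPATH. In PyLucene (the JCC version), however, every jar that Python can reference has had its Python call interface generated ahead of time by JCC (let's call it a compiler for now). Put the other way round, a jar that has not been processed by JCC cannot be accessed directly from Python.
So to use a Chinese word segmenter that ships as a jar, you have to rebuild PyLucene. Below the divider is a reply from a helpful person at OSAF (osafoundation.org) explaining how to modify the Makefile so that the jar is packaged together with PyLucene.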
------------------------------------- Reply follows -------------------------------------
Andi Vajda:
To access your class(es) by name from Python, you must have JCC
generate wrappers for it (them). This is what is done line 177 and on
in PyLucene's Makefile. The easiest way for you to add your own Java
classes to PyLucene is to create another jar file with your own
analyzer classes and code and add it to the JCC invocation there.
For example, the Makefile snippet in question currently says:
GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) \
           --package java.lang java.lang.System \
                               java.lang.Runtime \
           --package java.util \
                     java.text.SimpleDateFormat \
           --package java.io java.io.StringReader \
                             java.io.InputStreamReader \
                             java.io.FileInputStream \
           --exclude org.apache.lucene.queryParser.Token \
           --exclude org.apache.lucene.queryParser.TokenMgrError \
           --exclude org.apache.lucene.queryParser.QueryParserTokenManager \
           --exclude org.apache.lucene.queryParser.ParseException \
           --python lucene \
           --mapping org.apache.lucene.document.Document 'get:(Ljava/lang/String;)Ljava/lang/String;' \
           --mapping java.util.Properties 'getProperty:(Ljava/lang/String;)Ljava/lang/String;' \
           --sequence org.apache.lucene.search.Hits 'length:()I' 'doc:(I)Lorg/apache/lucene/document/Document;' \
           --version $(LUCENE_VER) \
           --files $(NUM_FILES)
change the first line to say:
GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) --jar myjar.jar \
...
and rebuild PyLucene. That should be all you need to do. Your jar file
is going to be installed along with lucene's in the lucene egg and it
is going to be put on lucene.CLASSPATH which you use with
lucene.initVM().
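As an illustration of that last step (a sketch, not taken from the reply itself), loading a custom analyzer from Python after the rebuild might look roughly like this. It assumes the flattened, JCC-era PyLucene namespace described in this thread. `MyAnalyzer` is a hypothetical class name standing in for whatever your jar provides, and a no-argument constructor is also an assumption.

```python
def load_analyzer(class_name="MyAnalyzer"):
    """Start the JVM and instantiate a wrapped Java class by simple name.

    `MyAnalyzer` is a hypothetical name; substitute a class from your jar.
    Returns None when PyLucene is not installed.
    """
    try:
        import lucene  # the rebuilt PyLucene egg
    except ImportError:
        return None

    # lucene.CLASSPATH lists the jars bundled in the egg; after the
    # rebuild it should include your own jar as well.
    lucene.initVM(lucene.CLASSPATH)

    # Assumption: wrappers are exposed directly on the `lucene` module
    # (flattened namespace), so the class is looked up by simple name.
    analyzer_class = getattr(lucene, class_name)

    # Assumption: the analyzer has a no-argument constructor.
    return analyzer_class()


if __name__ == "__main__":
    print(load_analyzer())
```

If the lookup raises AttributeError, the class was most likely not wrapped, which usually means the jar was not passed to JCC with --jar.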
Your classes can be declared in any Java package you want. Just make
sure that their names don't clash with other Lucene class names that
you also need to use as the class namespace is flattened in PyLucene.
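Because the namespace is flattened, two classes with the same simple name in different packages would collide. A small, hypothetical pre-build check (not part of PyLucene or JCC) could flag such clashes before you add your jar. The class list below is made up for illustration, and `com.example.MyAnalyzer` is a placeholder.

```python
from collections import defaultdict


def find_name_clashes(qualified_names):
    """Group fully qualified Java class names by simple name and return
    only the simple names that more than one class would claim."""
    by_simple = defaultdict(list)
    for name in qualified_names:
        simple = name.rsplit(".", 1)[-1]  # text after the last dot
        by_simple[simple].append(name)
    return {s: names for s, names in by_simple.items() if len(names) > 1}


if __name__ == "__main__":
    classes = [
        "org.apache.lucene.analysis.Token",
        "org.apache.lucene.queryParser.Token",
        "com.example.MyAnalyzer",  # hypothetical custom class
    ]
    for simple, owners in find_name_clashes(classes).items():
        print(simple, "is claimed by", ", ".join(owners))
```

The two `Token` classes in the sample show the kind of collision involved, which is likely why the Makefile passes --exclude for org.apache.lucene.queryParser.Token.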
For more information about JCC and its command line args see JCC's
README file at [1].
Andi..
[1] http://svn.osafoundation.org/pylucene/trunk/jcc/jcc/README
_______________________________________________
pylucene-dev mailing list
pylucene-...@osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
http://groups.google.com.pe/group/python-cn/browse_thread/thread/0f085de0eab6f039
Tags: Chinese word segmentation
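High efficiency: on QieQie's Celeron PC it segments far more than 20,000 Chinese characters of text per second (actual test results reach 100,000+ characters per second).
High maintainability: the "Paoding" (Cook Ding) metaphor keeps the design vivid and clear.
High flexibility and extensibility: object-oriented design (OOD).
For comparison: the post 《终于突破中文分词的效率问题》 ("Finally breaking through the efficiency problem of Chinese word segmentation"), http://www.lucene.org.cn/read.php?tid=54&fpage=2, reports 6 seconds to parse 2,588 Chinese characters.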
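To put those figures on a common scale, the throughput arithmetic can be sketched as below. It uses only the numbers quoted above (2,588 characters in 6 seconds versus 20,000 and 100,000 characters per second). The hardware and test text differ between the two reports, so treat the ratio as a rough indication only.

```python
def chars_per_second(chars, seconds):
    """Throughput in Chinese characters per second."""
    return chars / float(seconds)


def speedup(fast_rate, slow_rate):
    """How many times faster fast_rate is than slow_rate."""
    return fast_rate / slow_rate


if __name__ == "__main__":
    baseline = chars_per_second(2588, 6)        # the compared segmenter
    paoding_low = chars_per_second(20000, 1)    # conservative Paoding figure
    paoding_high = chars_per_second(100000, 1)  # best reported Paoding figure

    print("baseline: %.0f chars/s" % baseline)
    print("speedup (low):  %.0fx" % speedup(paoding_low, baseline))
    print("speedup (high): %.0fx" % speedup(paoding_high, baseline))
```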
2007-08-08:
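Paoding has since been adjusted and refactored, so the code attached here is now out of date. The latest downloads are at:
http://code.google.com/p/paoding/downloads/list
SVN repository: http://paoding.googlecode.com/svn/trunk/paoding-analysis/
You can also browse the code directly by opening http://paoding.googlecode.com/svn/trunk/paoding-analysis/ in a web browser.
The latest release post on JavaEye is:
http://www.javaeye.com/topic/110148 (Chinese word segmentation: Paoding 2.0.0 released)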