First they ignore you
then they ridicule you
then they fight you
then you win
    -- Mahatma Gandhi
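In fields such as search engines and speech recognition, it is common to count how often each word occurs. Below is a Groovy implementation that prints the six most frequent words together with their counts: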

def content = """
    The Java Collections API is the basis for all the nice support that Groovy gives you
    through lists and maps. In fact, Groovy not only uses the same abstractions, it
    even works on the very same classes that make up the Java Collections API.
"""

// Split the text on whitespace
def words = content.tokenize()

// Map from each word to the number of times it occurs
def wordFrequency = [:]

words.each {
    wordFrequency[it] = wordFrequency.get(it, 0) + 1
}

// Sort the distinct words by ascending frequency
def wordList = wordFrequency.keySet().toList()

wordList.sort { wordFrequency[it] }

def result = ''

// Walk the last six entries backwards, i.e. the most frequent words first
wordList[-1..-6].each {
    result += it.padLeft(12) + ": " + wordFrequency[it] + "\n"
}

println result



Output:

         the: 5
      Groovy: 2
        that: 2
 Collections: 2
        Java: 2
        same: 2

 


If the text you need to process is more complex, you can handle it with regular expressions. Incidentally, Groovy supports regex at the language level!
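As a rough sketch of the regex approach (this is not from the original post, and the helper name `countWords` and the pattern `[A-Za-z]+` are my own assumptions), the words can be extracted with a slashy-string pattern instead of `tokenize()`. Punctuation is then no longer glued to the words, so "API." and "API" are counted as the same word:

```groovy
// Hypothetical helper, not part of the original post:
// counts words after extracting them with a regular expression.
def countWords(String text) {
    def frequency = [:]
    // findAll with a slashy-string regex returns every match as a list of Strings,
    // so trailing punctuation such as "API." or "fact," is dropped.
    text.findAll(/[A-Za-z]+/).each { word ->
        frequency[word] = frequency.get(word, 0) + 1
    }
    return frequency
}

def text = """
    The Java Collections API is the basis for all the nice support that Groovy gives you
    through lists and maps. In fact, Groovy not only uses the same abstractions, it
    even works on the very same classes that make up the Java Collections API.
"""

def frequency = countWords(text)

// Same ranking idea as the post: sort the words by count, then take the last six.
def ranked = frequency.keySet().toList().sort { frequency[it] }
ranked[-1..-6].each { println it.padLeft(12) + ": " + frequency[it] }
```

The match is still case-sensitive, so "The" and "the" stay separate entries. Lowercasing each word before counting would merge them, if that is what you want.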

Posted on 2007-02-01 23:31 by 山风小子 | Views (2405) | Comments (6) | Category: Groovy & Grails

Comments:
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-06 16:57 | Grover, Gavin
It's fun writing things tersely in Groovy.

To get all the repeating words and frequencies, you can just use groupBy() and reverse(). To get all words occurring more than once:

content.tokenize().groupBy{ it }.
collect{ ['key':it.key, 'value':it.value.size()] }.
findAll{ it.value > 1 }.sort{ it.value }.reverse().
each{ println "${it.key.padLeft( 12 )} : $it.value" }

Or to get the 6 most frequent words:

content.tokenize().groupBy{ it }.
collect{ ['key':it.key, 'value':it.value.size()] }.
sort{ it.value }.reverse().
eachWithIndex{ it, i->
if( i < 6 ) println "${it.key.padLeft( 12 )} : $it.value"
}

  
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-06 18:58 | 山风小子
@Grover, Gavin
Thank you, Grover :)
I learnt a lot from you :)
  
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-07 11:18 | Grover, Gavin
You can even write your program in Groovy using Chinese:


(1) In the ASCII file "File1.groovy" which you run:

groovy.lang.MetaClass.setUseReflection(true)
cc= new org.codehaus.groovy.control.CompilerConfiguration()
cc.setSourceEncoding('unicode')
new GroovyShell( cc ).run( new File('File2.groovy') )


(2) In the unicode file "File2.groovy":

def b= new Binding()
b.setVariable('中文',中文) //Chinese for 'Chinese'
def conf=new org.codehaus.groovy.control.CompilerConfiguration()
conf.setSourceEncoding('unicode')
new GroovyShell(b,conf).evaluate( new File('File3.groovy') )
class 中文{
public static Map 组(Collection self, Closure closure) { self.groupBy(closure) }
public static void 打句(Object self, Object value) {self.println(value)}
public static Object 找都(Object self, Closure closure) {self.findAll(closure)}
public static List 分类(Collection self, Closure closure) {self.sort(closure)}
public static void 每(Object self, Closure closure) {self.each(closure)} //每个 (each)
public static int 夵(Object self) {self.size()} //大小 (size)
public static List 割(String self) {self.tokenize()} //分割 (split)
public static List 向后(List self) {self.reverse()}
public static List 集(Object self, Closure closure) {self.collect(closure)}
public static String 满左(String self, Number numberOfChars) {self.padLeft(numberOfChars)}
}


(3) In the unicode file "File3.groovy":

use(中文){

def 物= '''\
The Java Collections API is the basis for all the nice support that Groovy gives you
through lists and maps. In fact, Groovy not only uses the same abstractions, it
even works on the very same classes that make up the Java Collections API.
'''

物.割().组{it}.集{ ['k':it.key, 'v':it.value.夵()] }.
找都{it.v>1}.分类{it.v}.向后().每{打句"${it.k.满左(12)}: $it.v"}

//or:
物.割().组{ it }.集{ ['k':it.key, 'v':it.value.夵()] }.
分类{it.v}[-1..-6].每{打句"${it.k.满左(12)} : $it.v"}

}


Although I'm not a native Chinese speaker, I often use Chinese characters just to make my code shorter.

This example is simple, using Groovy Categories. If you use Groovy Interceptors, you can do more, such as disabling the English names or dynamically parsing the Chinese names, e.g., ['满':'pad', '左':'left']

Cheers, Gavin Grover
  
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-07 12:19 | 山风小子
@Grover, Gavin
Yeah, you're right.
The keywords are all in English, so I prefer to write code in English too :)
But your idea is very interesting, and very useful to anyone whose English is poor. Thank you all the same :)
By the way, your Chinese is quite good :)
  
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-07 12:32 | Grover, Gavin
As soon as someone writes a lexical macro system for Groovy, the first thing I'll do is replace those 50 Groovy/Java keywords with single Chinese characters!
  
# re: Efficient Groovy Programming -- Counting Word Frequencies 2007-03-07 13:17 | 山风小子
It's a pity that Chinese programmers like to write code in English and almost nobody likes to write code in Chinese.
Nowadays, programmers in China who want to earn a high salary need to master English; almost everything they read is in English too. Of course, they talk with each other in Chinese.
Maybe we will write code in Chinese someday, but that day is very far off, and may never come :(
  



This post was edited by the author on 2007-04-09 15:13