桃之夭夭
知天之所为,知人之所为者,至矣。知天之所为者,天而生也;知人之所为者,以其知之所知以养其知之所不知,终其天年而不中道夭者:是知之盛也。
BlogJava | 首页 | 发新随笔 | 发新文章 | 联系 | 聚合 | 管理

2007年9月5日

A very brief introduction to Aurum
Aurum is a Ruby-based LALR(n) parser generator that you can use to develop your own domain specified languages, scripting languages and programming languages.Although it's just yet another parser generator, Aurum is slightly different from other widely used parser generators:

  1. One of major targets of Aurum is to simplify external DSL development, espectually Ruby external DSL.
  2. Aurum uses incremental LALR(n) algorithm instead of the common used LALR(1)/Full LALR(n) algorithm. That means:
    •   Allowing the user to express grammars in a more intuitive mannar.
    •   Making it easier to handle complicated grammars. For exmaple, COBOL(LALR(2 or 3)), simplified nature language(LALR(3+)) and etc.
    •   Closer to Generalized LR in language recognizing but much more faster.
    •   Smaller parsing table comparing to Full LALR/LR(n) algorithm.
  3. Aurum supports grammar reuse, and itslef'll be shipped with some pre-defined common structures. One of the pain points of external DSL is that you have to re-define lots of common structures, such as if statements, block structure and etc. With Aurum, you could simply reuse them.
  4. Aurum uses a Ruby interal DSL as meta-language, and provides a generic lexer/parser as well. You could test your grammar by the comprehensive testing libraries Ruby has(you could even develop your lexer/parser in the TDD fashion).
  5. As the name suggested, Aurum, the Latin word for Gold, is partially inspired by the GOLD Parsing System. The grammar you created with Aurum could be completely independent of any implementation language,even Ruby.(not implemented yet :) )

Ok, let's start from the 'Hello World in Compiler Construction' —— Expression Evaluation

 1 require 'aurum'
 2 
 3 class ExpressionGrammar < Aurum::Grammar
 4   tokens do
 5     ignore string(' ').one_or_more     # <= a
 6     _number range(?0, ?9).one_or_more  # <= b
 7   end
 8 
 9   precedences do  # <= c
10     left '*', '/'
11     left '+', '-'
12   end
13 
14   productions do # <= d
15     expression expression, '+', expression {expression.value = expression1.value + expression2.value} # <= e
16     expression expression, '-', expression {expression.value = expression1.value - expression2.value}
17     expression expression, '*', expression {expression.value = expression1.value * expression2.value}
18     expression expression, '/', expression {expression.value = expression1.value / expression2.value}
19     expression '(', expression, ')'        do expression.value = expression1.value end # <= f
20     expression _number                     {expression.value = _number.value.to_i}
21     expression '+', _number                {expression.value = _number.value.to_i}
22     expression '-', _number                {expression.value = -_number.value.to_i}
23   end
24 end

If you has any experience with other compiler compiler/parser generator, you probably could understand what happens above quite easily. Instead of explaining things like token, character class, and production, I'd like to emphasise some Aurum conventions:
  1. At point a, we use 'ignore' directive to declare the ignored pattern, such as whitespaces etc.'string' is one of the helper methods(others are enum, range and concat), which is used to define lexical patterns. It will create a pattern matching the given string exactly.
  2. At point b, we declare a lexical token named '_number'. In Aurum, lexical tokens, or terminals from syntax perspective, always start with '_'. The expression '_token_name pattern' is equivalent to 'match pattern, :recognized => :_toke_name'. The 'match' directive is a common way to associate lexical action with leixcal pattern.
  3. At point c, we declare operator precedences of the Expression grammar.The eariler the operators definied, the higher precedence they will have.
  4. At point d, we declare syntax rules of Expression grammar. According to Aurum naming convention, all terminals should start with '_' while all nontermainls start with lower case alphabet character. String literals will be interpreted as reserve words, and added to lexer automatically.
  5. At point e, we define a semantic action to the Addition rule. In semantic action, you could access to the objects in value stack via the name of corresponding symbols.If there are more than one symbol with the same name, you could differentiate them by the order they appered in the production.
  6. At point f, we use do..end instead of {..}. Using Ruby internal DSL as meta-langauge is a double-side sword, you have to bear its flaws while enjoying the remaining parts. There is no perfect world, isn't it?

Now, let's find out how we could use this expression grammar. You could use the helper method as below(it will recalcuate lexical table and parsing table for every call, could be quite slow):

1 puts ExpressionGrammar.parse_expression('1+1').value

or use the lexical table and parsing table to create your own lexer & parser:
 
1   lexer = Aurum::Engine::Lexer.new(ExpressionGrammar.lexical_table, '1+1')
2   parser = Aurum::Engine::Parser.new(ExpressionGrammar.parsing_table(:expression))
3   puts parser.parse(lexer).value


At the end of this post, I'd like to give another grammar example coming from Martin Fowler's HelloParserGenerator series:

 1 require 'aurum'
 2 
 3 Item = Struct.new(:name)
 4 
 5 class Catalog < Aurum::Grammar
 6   tokens do
 7     ignore enum(" \r\n").one_or_more
 8     _item range(?a,?z).one_or_more
 9   end
10 
11   productions do
12     configuration configuration, item {configuration.value = configuration1.value.merge({item.value.name => item.value})}
13     configuration _                   {configuration.value = {}}
14     item 'item', _item                {item.value = Item.new(_item.value)}
15   end
16 end
17 
18 config = Catalog.parse_configuration(<<EndOfDSL).value
19   item camera
20   item laser
21 EndOfDSL
22 
23 puts config['camera'].name


P.S.:The post is based on the developing version of Aurum(0.2.0). You could get it from the svn repository.
P.S.P.S.: There is a more complicated example in the examples directory, a simple Smalltalk interpreter. Have fun:)
posted @ 2007-09-05 23:21 Raimundox 阅读(4124) | 评论 (0) | 编辑 收藏
 
A very brief introduction to Aurum
A very brief introduction to Aurum

Aurum是一个用Ruby实现的LALR(n) parser generator(是的,又是一个parser generator),不过它和其他一些广泛应用的parser generator相比略有不同的:

1.Aurum的主要目标之一,是简化external DSL的开发(尤其是ruby external DSL)。
2.Aurum采用增量LALR(n)算法,而不是通常的LALR(1)。这意味着:
  a.不必由于LALR(1)能力的限制,而改写语法,很多在LALR(1)中冲突的语法在LALR(n)中可以比较自然地表达。
  b.由于识别能力的增强,可以处理一些比较复杂的语法,比如COBOL(LALR(2)或LALR(3)),比如一些简化的自然语言(LALR(3+))。
  c.处理能力接近Generalized LR,却快很多
  d.比起Full LALR/LR(n),增量算法生成的语法表更小。
3.出于简化external DSL实现的考虑,Aurum支持语法重用。
4.Aurum采用Ruby internal DSL作为语法声明的元语言,可以利用Ruby丰富的测试框架,有效地对编译/解释/分析器进行测试。
5.正如名字所暗示的,Aurum(Gold的化学名称)的一部分灵感来自GOLD parsing system,它将支持独立于平台和语言的编译器开发。

好,闲话少说,看一个例子,编译原理中的Hello World —— 表达式求值:
 1 require 'aurum'
 2 
 3 class ExpressionGrammar < Aurum::Grammar
 4   tokens do
 5     ignore string(' ').one_or_more     # <= a
 6     _number range(?0, ?9).one_or_more  # <= b
 7   end
 8 
 9   precedences do  # <= c
10     left '*', '/'
11     left '+', '-'
12   end
13 
14   productions do # <= d
15     expression expression, '+', expression {expression.value = expression1.value + expression2.value} # <= e
16     expression expression, '-', expression {expression.value = expression1.value - expression2.value}
17     expression expression, '*', expression {expression.value = expression1.value * expression2.value}
18     expression expression, '/', expression {expression.value = expression1.value / expression2.value}
19     expression '(', expression, ')'        do expression.value = expression1.value end # <= f
20     expression _number                     {expression.value = _number.value.to_i}
21     expression '+', _number                {expression.value = _number.value.to_i}
22     expression '-', _number                {expression.value = -_number.value.to_i}
23   end
24 end


如果诸位对之前有用过compiler compiler或者parser generator的话,应该能看个七七八八吧。我大概解释一下:
 a.这里定义了文法空白,也就是被lexer忽略的部分,在通常的语言中,是空格回车换行之类的字符;string是用于定义lexical pattern的helper方法(出了string之外,还有range, enum和concat);ignore是一个预定义的说明指令,表示若文本匹配给定模式则该文本会被lexer自动忽略,其格式为:
    ignore pattern {//lexical action}
 b.此处为lexical token声明,所有lexical token必须以_开头,其格式为:
    _token_name pattern {//lexical action}
   这里其实是一个简略写法,等价于
    match pattern, :recognize => :_token_name
 c.此处为运算符优先级声明,支持左/右结合运算符(无结合属性运算符开发中);每一行中所有运算符具有相同优先级;比它下一行的运算符高一个优先级。比如在这个例子中,'*'和'/'具有相同优先级,但是比'+'和'-'的优先级别高。
 d.此处为语法规则声明,所使用的symbol主要有三种,nonterminal(小写字母开头),terminal(其实就是lexical token,以_开头)和literal(字符串常量),其中所有literal都会被自动声明为保留字。
 e.此处定义了一条文法规则(加法),以及对应的semantic action。在semantic action中可以直接通过symbol的名字来获取值栈中的对象。如遇到同名symbol,则按照出现顺序进行编号即可。
 f.其实这个没啥,只不过由于我们使用的是Ruby DSL,所以有时候不能都用{},需要do end,这就是一个例子。

最后测试一下实际中如何使用定义好的语法(使用helper method,注意由于分析表没有缓存,每次都会重算语法表,仅仅适用于debug mode。)
  puts ExpressionGrammar.parse_expression('1+1').value
或者通过分析表自己构造lexer和parser
  lexer = Aurum::Engine::Lexer.new(ExpressionGrammar.lexical_table, '1+1')
  parser = Aurum::Engine::Parser.new(ExpressionGrammar.parsing_table(:expression))
  puts parser.parse(lexer).value

最后最后,给另外一个例子,就是Martin Fowler Blog上的HelloParserGenerator系列中所用的语法:

 1 require 'aurum'
 2 
 3 Item = Struct.new(:name)
 4 
 5 class Catalog < Aurum::Grammar
 6   tokens do
 7     ignore enum(" \r\n").one_or_more
 8     _item range(?a,?z).one_or_more
 9   end
10 
11   productions do
12     configuration configuration, item {configuration.value = configuration1.value.merge({item.value.name => item.value})}
13     configuration _                   {configuration.value = {}}
14     item 'item', _item                {item.value = Item.new(_item.value)}
15   end
16 end
17 
18 config = Catalog.parse_configuration(<<EndOfDSL).value
19   item camera
20   item laser
21 EndOfDSL
22 
23 puts config['camera'].name


P.S.:本文是根据Aurum0.2.0写成的,你可以从rubyforge的svn上得到它。
P.S.P.S.: 在exmaples目录里有一个更复杂一些的例子,是一个简单的Smalltalk解释器。


posted @ 2007-09-05 23:12 Raimundox 阅读(4158) | 评论 (0) | 编辑 收藏
 
随笔:35 文章:0 评论:85 引用:0
<2007年9月>
日一二三四五六
2627282930311
2345678
9101112131415
16171819202122
23242526272829
30123456

常用链接

  • 我的随笔
  • 我的评论
  • 我的参与
  • 最新评论

留言簿(12)

  • 给我留言
  • 查看公开留言
  • 查看私人留言

随笔分类

  • ThoughtBlog(20) (rss)
  • 学而篇第一(15) (rss)

随笔档案

  • 2007年11月 (1)
  • 2007年9月 (2)
  • 2007年8月 (1)
  • 2007年7月 (1)
  • 2007年6月 (1)
  • 2007年5月 (2)
  • 2007年3月 (8)
  • 2007年1月 (1)
  • 2006年10月 (1)
  • 2006年8月 (1)
  • 2006年3月 (1)
  • 2006年1月 (3)
  • 2005年12月 (12)

搜索

  •  

最新评论

  • 1. re: A Day of Two Pragmatic Programmers' Life
  • 能否将联系方式发到silkcut@163.com
  • --jj
  • 2. re: Misquotation
  • 哈哈
  • --Jack.Wang
  • 3. re: Selenium Better Pratice
  • 评论内容较长,点击标题查看
  • --林勇
  • 4. re: Selenium Better Pratice
  • 你好!
    还有个疑问:看sourceforge 上的文章,说Selenium是支持SAFS。这在应用中如何体现出来呢??
    非常感谢!
  • --joycezhou
  • 5. re: Selenium Better Pratice
  • 评论内容较长,点击标题查看
  • --joycezhou

阅读排行榜

  • 1. Selenium Better Pratice(6528)
  • 2. Feng Shui for Standard ML Programmers(5605)
  • 3. The Keyword 'end' Drives Me Crazy(5478)
  • 4. Why Rails so effective? Delphi Inside(5224)
  • 5. I'm Smalltalk, Which Programming Language are You?(5210)

评论排行榜

  • 1. 丧钟为谁鸣?(1)(10)
  • 2. Selenium Better Pratice(10)
  • 3. A Day of Two Pragmatic Programmers' Life(9)
  • 4. Say sorry to Object-Oriented Methodology(4)
  • 5. Agile 101: Pair Programming & Simple Design(4)

Powered by: 博客园
模板提供:沪江博客
Copyright ©2008 Raimundox