
Been using it for almost a year now. So good, so good.

posted @ 2007-12-19 20:02 ZelluX | Reads (327) | Comments (2)

The CAL sample programs contain a lot of `sample` instructions; here is a brief introduction found via Google:

Antialiasing

Although making pixels smaller renders an image more finely and relieves aliasing to some extent, as long as pixels are large enough to be distinguished from one another, aliasing is unavoidable. The usual antialiasing approach is multi-point sampling (note: "points", not "pixels"; the difference will become clear below).

I. Theory and methods:

1. Oversampling:

1) Method:

First, render the scene at a higher resolution than the display (front buffer):

suppose the current front/back buffer resolution is 800 × 600; then first render the scene to a 1600 × 1200 render target (a texture).

Then produce the low-resolution result from the high-resolution render target:

take the average color of each 2 × 2 pixel block as the final rendered pixel color (see the C sketch after this list).

2) Advantage: significantly reduces aliasing artifacts.

3) Drawbacks: it needs a larger buffer, and filling that buffer increases the performance cost;

sampling several pixels per output pixel degrades performance further;

because of these drawbacks, D3D did not adopt this antialiasing method.
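A minimal CPU-side sketch of the 2 × 2 downsampling step, assuming 8-bit RGBA pixels and a destination exactly half the source size in each dimension (the types and function name are illustrative; a real oversampling resolve happens on the GPU):

typedef struct { unsigned char r, g, b, a; } Pixel;

/* Average each 2x2 block of the high-resolution buffer into one
   pixel of the low-resolution buffer. */
void downsample2x2(const Pixel *src, int srcW, Pixel *dst, int dstW, int dstH)
{
    for (int y = 0; y < dstH; y++)
        for (int x = 0; x < dstW; x++) {
            const Pixel *r0 = &src[(2 * y) * srcW + 2 * x];     /* top row    */
            const Pixel *r1 = &src[(2 * y + 1) * srcW + 2 * x]; /* bottom row */
            Pixel *out = &dst[y * dstW + x];
            out->r = (r0[0].r + r0[1].r + r1[0].r + r1[1].r) / 4;
            out->g = (r0[0].g + r0[1].g + r1[0].g + r1[1].g) / 4;
            out->b = (r0[0].b + r0[1].b + r1[0].b + r1[1].b) / 4;
            out->a = (r0[0].a + r0[1].a + r1[0].a + r1[1].a) / 4;
        }
}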

2. Multisampling:

1) Method:

Each pixel is shaded only once, but N sample points are taken within it (N depends on the sampling pattern); the final pixel color = the pixel's shaded color × (number of sample points covered by the polygon) / (total number of sample points); a per-channel sketch follows this block.

2) Advantage: it reduces aliasing artifacts without increasing the number of shading samples, and unlike Oversampling it does not require a larger back buffer.

3) Drawback: normally a polygon determines a pixel's color only when it covers the pixel's center (in the pixel pipeline this typically means addressing the appropriate texel and modulating it with the color output by the vertex pipeline). With Multisampling, however, if a polygon covers some of the sample points but not the pixel center, the pixel's color is still determined by that polygon. Texture addressing can then go wrong, which shows up as a distinct artifact with texture atlases: wrong colors along polygon edges!
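Per channel, the resolve described above is just a coverage-weighted scale (illustrative names, not an actual D3D API):

/* Weight the pixel's single shaded color by the fraction of its
   sample points that the polygon covers. */
float resolve_channel(float shaded, int covered, int total)
{
    return shaded * (float)covered / (float)total;
}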

3. Centroid Sampling:

1) Method:

To fix the texture-addressing errors Multisampling causes with texture atlases, the color used as "the pixel's shaded color" is no longer taken at the pixel center but at the centroid of the sample points actually covered by the polygon. This guarantees that the shaded location always lies inside the polygon (that is, the texture address never falls outside it).

2) How to use it:

① Any pixel shader that takes an input with the COLOR semantic applies centroid sampling automatically;

② append the _centroid suffix to a pixel shader input's semantic by hand, for example:

float4 TexturePointCentroidPS( float4 TexCoord : TEXCOORD0_centroid ) : COLOR0
{
    return tex2D( PointSampler, TexCoord );
}

3) Note:

Centroid sampling is mainly for Multisampling with texture atlases; when one whole texture maps to a single polygon mesh, centroid sampling actually introduces errors instead!

posted @ 2007-12-14 13:42 ZelluX | Reads (445) | Comments (0)

CS:APP P521
With classmate CC's help I finally understand this program.
The key is the Generic Cache Memory Organization section on P488; I had read it before, but it left little impression.
A cache is made up of a number (2^s) of slots, each block-size bytes large.
So when B[k][j] is accessed, the run of memory from B[k][j] through B[k][j + bsize - 1] gets cached;
after bsize repetitions, the whole block from B[k][k] through B[k + bsize - 1][k + bsize - 1] is cached,
and the multiplications that follow run much faster. A sketch of the blocked multiply is below.
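A minimal C sketch of the blocked multiply being described, in the spirit of CS:APP's bijk version (not the book's verbatim code; assumes square n-by-n matrices, C zero-initialized, and a bsize that divides n evenly):

/* C += A * B, computed block by block, so each bsize-by-bsize block of B
   stays in the cache while a full sweep over the rows of A reuses it. */
void bmm(int n, int bsize, double A[n][n], double B[n][n], double C[n][n])
{
    for (int kk = 0; kk < n; kk += bsize)
        for (int jj = 0; jj < n; jj += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bsize; j++) {
                    double sum = C[i][j];
                    for (int k = kk; k < kk + bsize; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
}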

posted @ 2007-12-06 13:17 ZelluX | Reads (886) | Comments (0)

lambda really is the way to go
#!/usr/bin/env python
d={'a':1,'b':5,'c':4}
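# sort the dict's items by value, then by key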
print sorted(d.items(), key=lambda (k,v): (v,k))

Help on built-in function sorted in module __builtin__:

sorted(...)
    sorted(iterable, cmp=None, key=None, reverse=False) --> new sorted list

 

posted @ 2007-12-04 18:48 ZelluX | Reads (1100) | Comments (0)

Subject: Re: Explanation, please!
Summary: Original citation
From: td@alice.UUCP (Tom Duff)
Organization: AT&T Bell Laboratories, Murray Hill NJ
Date: 29 Aug 88 20:33:51 GMT
Message-ID: <8144@alice.UUCP>

I normally do not read comp.lang.c, but Jim McKie told me that ``Duff's device'' had come up in comp.lang.c again.  I have lost the version that was sent to netnews in May 1984, but I have reproduced below the note in which I originally proposed the device.  (If anybody has a copy of the netnews version, I would gratefully receive a copy at research!td or td@research.att.com.)

To clear up a few points:

  1. The point of the device is to express general loop unrolling directly in C.  People who have posted saying `just use memcpy' have missed the point, as have those who have criticized it using various machine-dependent memcpy implementations as support.  In fact, the example in the message is not implementable as memcpy, nor is any computer likely to have a memcpy-like idiom that implements it.

     

  2. Somebody claimed that while the device was named for me, I probably didn't invent it.  I almost certainly did invent it.  I had definitely not seen or heard of it when I came upon it, and nobody has ever even claimed prior knowledge, let alone provided dates and times.  Note the headers on the message below:  apparently I invented the device on November 9, 1983, and was proud (or disgusted) enough to send mail to dmr.  Please note that I do not claim to have invented loop unrolling, merely this particular expression of it in C.

     

  3. The device is legal dpANS C.  I cannot quote chapter and verse, but Larry Rosler, who was chairman of the language subcommittee (I think), has assured me that X3J11 considered it carefully and decided that it was legal. Somewhere I have a note from dmr certifying that all the compilers that he believes in accept it.  Of course, the device is also legal C++, since Bjarne uses it in his book.

     

  4. Somebody invoked (or more properly, banished) the `false god of efficiency.'  Careful reading of my original note will put this slur to rest.  The alternative to genuflecting before the god of code-bumming is finding a better algorithm.  It should be clear that none such was available.  If your code is too slow, you must make it faster.  If no better algorithm is available, you must trim cycles.

     

  5. The same person claimed that the device wouldn't exhibit the desired speed-up.  The argument was flawed in two regards:  first, it didn't address the performance of the device, but rather the performance of one of its few uses (implementing memcpy) for which many machines have a high-performance idiom.  Second, the poster made his claims in the absence of timing data, which renders his assertion suspect.  A second poster tried the test, but botched the implementation, proving only that with diligence it is possible to make anything run slowly.

     

  6. Even Henry Spencer, who hit every other nail square on the end with the flat round thing stuck to it, made a mistake (albeit a trivial one).  Here is Henry replying to bill@proxftl.UUCP (T. William Wells):
       >> ... Dollars to doughnuts this
       >> was written on a RISC machine.
       > Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

    I was at Lucasfilm when I invented the device.

     

  7. Transformations like this can only be justified by measuring the resulting code.  Be careful when you use this thing that you don't unwind the loop so much that you overflow your machine's instruction cache.  Don't try to be smarter than an over-clever C compiler that recognizes loops that implement block move or block clear and compiles them into machine idioms.

Here then, is the original document describing Duff's device:

From research!ucbvax!dagobah!td  Sun Nov 13 07:35:46 1983
Received: by ucbvax.ARPA (4.16/4.13)  id AA18997; Sun, 13 Nov 83 07:35:46 pst
Received: by dagobah.LFL (4.6/4.6b)  id AA01034; Thu, 10 Nov 83 17:57:56 PST
Date: Thu, 10 Nov 83 17:57:56 PST
From: ucbvax!dagobah!td (Tom Duff)
Message-Id: <8311110157.AA01034@dagobah.LFL>
To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr, ucbvax!research!rob

Consider the following routine, abstracted from code which copies an array of shorts into the Programmed IO data register of an Evans & Sutherland Picture System II:

 
send(to, from, count)
register short *to, *from;
register count;
{
    do
        *to = *from++;
    while (--count > 0);
}

(Obviously, this fails if the count is zero.)
The VAX C compiler compiles the loop into 2 instructions (a movw and a sobleq, I think.)  As it turns out, this loop was the bottleneck in a real-time animation playback program which ran too slowly by about 50%.  The standard way to get more speed out of something like this is to unwind the loop a few times, decreasing the number of sobleqs.  When you do that, you wind up with a leftover partial loop.  I usually handle this in C with a switch that indexes a list of copies of the original loop body.  Of course, if I were writing assembly language code, I'd just jump into the middle of the unwound loop to deal with the leftovers.  Thinking about this yesterday, the following implementation occurred to me:

 

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;

    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
               } while (--n > 0);
    }
}

Disgusting, no?  But it compiles and runs just fine.  I feel a combination of pride and revulsion at this discovery.  If no one's thought of it before, I think I'll name it after myself.

It amazes me that after 10 years of writing C there are still little corners that I haven't explored fully.  (Actually, I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into.)

Many people (even bwk?) have said that the worst feature of C is that switches don't break automatically before each case label.  This code forms some sort of argument in that debate, but I'm not sure whether it's for or against.

yrs trly
Tom

posted @ 2007-11-29 16:02 ZelluX | Reads (513) | Comments (0)

Actually, the key tool is gprof2dot from Google Code:
http://google-gprof2dot.googlecode.com/

Four output styles. You should also be able to set other information when generating the dot file, such as the time spent in each node; after all, that metric matters more for profiling.
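Typical usage is a pipeline like the following (hypothetical binary name; assumes a gmon.out from a -pg build sits in the current directory):

gprof ./myapp | gprof2dot.py | dot -Tpng -o callgraph.png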



posted @ 2007-11-27 20:04 ZelluX | Reads (541) | Comments (0)

A talk on the ORC (Open Research Compiler), with quite a bit of IPA content:
http://www.blogjava.net/Files/zellux/ORC-PACT02-tutorial.rar

Also, the second edition of the Dragon Book seems to cover a lot about IPA optimization and call graphs. Time to grind through it.

posted @ 2007-11-27 15:24 ZelluX | Reads (294) | Comments (0)

A paper from the High Performance Computing Tools Group, Computer Science Department, University of Houston:
Overview of the Open64 Compiler Infrastructure
VI.4. Interprocedural Analysis
Interprocedural Analysis (IPA) is performed in the following phases of Open64:
• Inliner phase
• IPA local summary phase
• IPA analysis phase
• IPA optimization phase
• IPA miscellaneous
By default, IPA performs function inlining in the inliner facility. The local summary phase is done in the IPL module, and the analysis and optimization phases in the ipa-link module.
During the analysis phase, it does the following:
• IPA_Padding Analysis (common blocks Padding/Split Analysis)
• Construction of the Callgraph
It then performs space and multigot partitioning of the call graph; the partitioning algorithm takes into account whether it is partitioning to solve the space problem or the multigot problem.
During the optimization phase the following phases are performed:
• IPA Global Variable Optimization
• IPA Dead function elimination
• IPA Interprocedural Alias Analysis
• IPA Cloning Analysis (it propagates information about formal parameters used as symbolic terms in array section summaries; this information is later used to trigger cloning)
• IPA Interprocedural Constant propagation
• IPA Array_Section Analysis
• IPA Inlining Analysis
• Array section summaries for the Dependence Analyzer of the Loop Nest Optimizer.

posted @ 2007-11-26 12:53 ZelluX | Reads (392) | Comments (1)

I suddenly have to work on a related compiler-optimization project, so I'm posting some IPA material from sites abroad here first; reaching foreign sites from the education network is inconvenient.

GCC wiki:

Analysis and optimizations that work on more than one procedure at a time. This is usually done by walking the strongly connected components of the call graph and performing some analysis and optimization across some set of procedures (be it the whole program, or just a subset) at once.

GCC has had a callgraph for a few versions now (since GCC 3.4 in the FSF releases), but the procedures didn't have control flow graphs (CFGs) built. The tree-profiling-branch in GCC CVS now has a CFG for every procedure built and accessible from the callgraph, as well as a basic IPA pass manager. It also contains in-progress interprocedural optimizations and analyses: interprocedural constant propagation (with cloning for specialization) and interprocedural type escape analysis.
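A tiny illustration of what interprocedural constant propagation with cloning buys (hypothetical code, not GCC internals): every call site passes n == 4, so the compiler can propagate the constant into scale(), or clone a specialized copy.

static int scale(int x, int n) { return x << n; }

int f(int a) { return scale(a, 4); }  /* every caller passes n == 4 ...      */
int g(int b) { return scale(b, 4); }  /* ... so scale() can be cloned into a
                                         specialized copy computing x << 4   */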

IBM's XL Fortran V10.1 for Linux:

Benefits of interprocedural analysis (IPA)

Interprocedural Analysis (IPA) can analyze and optimize your application as a whole, rather than on a file-by-file basis. Because it runs during the link step of an application build, the entire application, including linked libraries, is available for interprocedural analysis. This whole-program analysis opens your application to a powerful set of transformations available only when more than one file or compilation unit is accessible. IPA optimizations are also effective on mixed-language applications.

 

Figure 2. IPA at the link step

The following are some of the link-time transformations that IPA can use to restructure and optimize your application:

  • Inlining between compilation units (see the sketch after this list)
  • Complex data flow analyses across subprogram calls to eliminate parameters or propagate constants directly into called subprograms.
  • Improving parameter usage analysis, or replacing external subprogram calls to system libraries with more efficient inline code.
  • Restructuring data structures to maximize access locality.
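A minimal two-file sketch of the first item, inlining between compilation units (hypothetical file names): compiling one file at a time, the compiler cannot see add3()'s body from main.c; with IPA at the link step it can, and may inline the call.

/* util.c */
int add3(int a, int b, int c) { return a + b + c; }

/* main.c -- link-time IPA sees add3()'s body and can inline this call,
   which a per-file compile cannot. */
extern int add3(int, int, int);
int main(void) { return add3(1, 2, 3); }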

In order to maximize IPA link-time optimization, you must use IPA at both the compile and link step. Objects you do not compile with IPA can only provide minimal information to the optimizer, and receive minimal benefit. However when IPA is active on the compile step, the resulting object file contains program information that IPA can read during the link step. The program information is invisible to the system linker, and you can still use the object file and link without invoking IPA. The IPA optimizations use hidden information to reconstruct the original compilation and can completely analyze the subprograms the object contains in the context of their actual usage in your application.

During the link step, IPA restructures your application, partitioning it into distinct logical code units. After IPA optimizations are complete, IPA applies the same low-level compilation-unit transformations as the -O2 and -O3 base optimizations levels. Following those transformations, the compiler creates one or more object files and linking occurs with the necessary libraries through the system linker.

It is important that you specify a set of compilation options as consistent as possible when compiling and linking your application. This includes all compiler options, not just -qipa suboptions. When possible, specify identical options on all compilations and repeat the same options on the IPA link step. Incompatible or conflicting options that you specify to create object files, or link-time options in conflict with compile-time options can reduce the effectiveness of IPA optimizations.
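For instance, a consistent IPA build might look like this (hypothetical file names; a sketch only, with the same options repeated on the compile and link steps):

xlf95 -O4 -qipa -c a.f b.f        # compile step: each .o carries IPA summary info
xlf95 -O4 -qipa -o myapp a.o b.o  # link step: whole-program IPA runs here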

Using IPA on the compile step only

IPA can still perform transformations if you do not specify IPA on the link step. Using IPA on the compile step initiates optimizations that can improve performance for an individual object file even if you do not link the object file using IPA. The primary focus of IPA is link-step optimization, but using IPA only on the compile-step can still be beneficial to your application without incurring the costs of link-time IPA.

 

Figure 3. IPA at the compile step

IPA Levels and other IPA suboptions

You can control many IPA optimization functions using the -qipa option and suboptions. The most important part of the IPA optimization process is the level at which IPA optimization occurs. Default compilation does not invoke IPA. If you specify -qipa without a level, or specify -O4, IPA optimizations are at level one. If you specify -O5, IPA optimizations are at level two.

Table 5. The levels of IPA
qipa=level=0
  • Automatically recognizes standard library functions
  • Localizes statically bound variables and procedures
  • Organizes and partitions your code according to call affinity, expanding the scope of the -O2 and -O3 low-level compilation unit optimizer
  • Lowers compilation time in comparison to higher levels, though limits analysis
qipa=level=1
  • Level 0 optimizations
  • Performs procedure inlining across compilation units
  • Organizes and partitions static data according to reference affinity
qipa=level=2
  • Level 0 and level 1 optimizations
  • Performs whole program alias analysis which removes ambiguity between pointer references and calls, while refining call side effect information
  • Propagates interprocedural constants
  • Eliminates dead code
  • Performs pointer analysis
  • Performs procedure cloning
  • Optimizes intraprocedural operations, using specifically:
    • Value numbering
    • Code propagation and simplification
    • Code motion, into conditions and out of loops (see the sketch after this table)
    • Redundancy elimination techniques
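A small sketch of one level-2 item, code motion out of loops (hypothetical code): the product x * y does not change across iterations, so the compiler hoists it out of the loop.

void fill(int *a, int n, int x, int y)
{
    /* The compiler hoists the loop-invariant x * y, as if the
       source had been written this way: */
    int t = x * y;
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}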

IPA includes many suboptions that can help you guide IPA to perform optimizations important to the particular characteristics of your application. Among the most relevant to providing information on your application are:

  • lowfreq which allows you to specify a list of procedures that are likely to be called infrequently during the course of a typical program run. Performance can increase because optimization transformations will not focus on these procedures.
  • partition which allows you to specify the size of the regions within the program to analyze. Larger partitions contain more procedures, which result in better interprocedural analysis but require more storage to optimize.
  • threads which allows you to specify the number of parallel threads available to IPA optimizations. This can provide an increase in compilation-time performance on multi-processor systems.
  • clonearch which allows you to instruct the compiler to generate duplicate subprograms with each tuned to a particular architecture.

Using IPA across the XL compiler family

The XL compiler family shares optimization technology. Object files you create using IPA on the compile step with the XL C, C++, and Fortran compilers can undergo IPA analysis during the link step. Where program analysis shows that objects were built with compatible options, such as -qnostrict, IPA can perform transformations such as inlining C functions into Fortran code, or propagating C++ constant data into C function calls.

posted @ 2007-11-25 23:04 ZelluX | Reads (736) | Comments (0)

     Abstract: from IBM developerWorks. The code in the original article was a mess, so I tidied it up. Although users usually think of Python as a procedural and object-oriented language, it actually contains everything you need for a completely func...  Read the full article

posted @ 2007-11-23 21:15 ZelluX | Reads (1327) | Comments (0)
