Work Log
Wednesday, February 27, 2013
Sunday, February 24, 2013
gnuplot color names
http://www.uni-hamburg.de/Wiss/FB/15/Sustainability/schneider/gnuplot/colors.htm
Wednesday, February 6, 2013
On Stack Replacement
According to this post:
It’s used to convert a running function’s interpreter frame into a JIT’d frame – in the middle of that method. [On-Stack-Replacement (OSR) compilation was first introduced in the famous HotSpot server paper, to the best of my knowledge.]
A simple example copied from the same post:
public static void main(String args[]) {
  S1;
  i=0;
loop:
  if(P) goto done
  S3; A[i++];
  goto loop; // <<--- OSR here
done:
  S2;
}
OSR-compiled function:
void OSR_main() {
  A=value on entry from interpreter;
  i=value on entry from interpreter;
  goto loop;
loop:
  if(P) goto done
  S3; A[i++];
  goto loop;
done:
  ...never reached...
}
After OSR_main is compiled, execution transfers from the ``goto loop'' in the interpreter to OSR_main.
But I am not quite clear about why the ``done'' part of OSR_main is never reached.
Sunday, February 3, 2013
Interrupt handling
In emulation, some asynchronous events may arrive in the middle of execution.
Two kinds of asynchronous events are interrupts and exceptions.
The semantics of interrupts are explained in this paper.
First, it is a hardware-supported asynchronous transfer of control to an interrupt vector based on the signaling of some condition external to the processor core. An interrupt vector is a dedicated or configurable location in memory that specifies the address to which execution should jump when an interrupt occurs. Second, an interrupt is the execution of an interrupt handler : code that is reachable from an interrupt vector.
It is irrelevant whether the interrupting condition originates on-chip (e.g., timer expiration) or off-chip (e.g., closure of a mechanical switch). Interrupts usually, but not always, return to the flow of control that was interrupted. Typically an interrupt changes the state of main memory and of device registers, but leaves the main processor context (registers, page tables, etc.) of the interrupted computation undisturbed.
Tuesday, January 29, 2013
Monday, January 28, 2013
Precise Exception Semantics in Dynamic Compilation
Published at CC 2002.
slide:
paper:
Sunday, January 27, 2013
Indirect branch profiling:
1. what is the frequency of 1-target indirect branches in benchmarks?
1-target indirect branch: an indirect branch instruction that has only one target during execution.
Therefore, when we say n-target indirect branch, we mean an indirect branch that has n targets during execution.
frequency
Assume there are 100 guest indirect branches executed (dynamic count) during execution, and there are 8 1-target indirect branches executed. The frequency is 8/100 = 8%.
N-Target Indirect Branches Frequency Distribution of 400.perlbench with diffmail.pl test input:
This does not seem to be very useful information.
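As a rough sketch of how the frequency defined above could be computed, the snippet below takes a hypothetical trace of (branch address, target address) pairs, one pair per executed indirect branch, and reports which fraction of the dynamic executions belongs to branches with n distinct targets. The trace format and the function name are assumptions for illustration, not the profiler actually used for these numbers.

```python
from collections import defaultdict

def n_target_frequency(trace):
    """Return {n: fraction of dynamic executions from n-target branches}.

    trace is an iterable of (branch_pc, target_pc) pairs (assumed format).
    """
    targets = defaultdict(set)   # branch_pc -> distinct targets seen
    counts = defaultdict(int)    # branch_pc -> dynamic execution count
    for pc, target in trace:
        targets[pc].add(target)
        counts[pc] += 1

    total = sum(counts.values())
    per_n = defaultdict(int)
    for pc, executed in counts.items():
        per_n[len(targets[pc])] += executed
    return {n: executed / total for n, executed in sorted(per_n.items())}

# 100 executed indirect branches: 8 come from a 1-target branch,
# 92 come from a branch that alternates between two targets.
example_trace = [(0x10, 0xA0)] * 8 + [(0x20, 0xB0), (0x20, 0xC0)] * 46
distribution = n_target_frequency(example_trace)
```

With this example, `distribution[1]` is 8/100, matching the 8% case described above.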
2. what is the hit ratio of last-target prediction of indirect branches?
last-target prediction: predict the next target of an indirect branch with its last target.
400.perlbench with ref. input diffmail.pl:
Overall 73.63%
RET 93.99%
INDIRECT_CALL 20.45%
UNCOND_INDIRECT_JMP 48.98%
gcc with ref. input 166.i:
Overall 64.77%
RET 61.08%
INDIRECT_CALL 97.93%
UNCOND_INDIRECT_JMP 66.35%
gcc with ref. input scilab.i
Overall 58.93%
RET 61.54%
INDIRECT_CALL 92.98%
UNCOND_INDIRECT_JMP 44.43%
445.gobmk with ref. test input nngs.tst :
Overall 56.17%
RET 56.01%
INDIRECT_CALL 78.30%
UNCOND_INDIRECT_JMP 67.25%
Mini-cache prediction is not helpful; it degrades performance by about 4%.
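The hit ratios above can be reproduced in principle with a very small model of last-target prediction. The sketch below is only an illustration: the trace format is assumed, the first execution of each branch is counted as a miss (an assumption, since the original measurement does not say how cold misses were treated), and the per-type breakdown (RET, INDIRECT_CALL, UNCOND_INDIRECT_JMP) is left out.

```python
def last_target_hit_ratio(trace):
    """Hit ratio when each indirect branch is predicted to repeat its last target.

    trace is an iterable of (branch_pc, target_pc) pairs (assumed format).
    """
    last_target = {}
    hits = 0
    total = 0
    for pc, target in trace:
        total += 1
        if last_target.get(pc) == target:
            hits += 1
        last_target[pc] = target   # remember the most recent target
    return hits / total if total else 0.0

example = [(1, 0xA), (1, 0xA), (1, 0xB), (1, 0xB), (2, 0xC), (2, 0xC)]
ratio = last_target_hit_ratio(example)
```

In the example, every second execution repeats the previous target, so half of the predictions hit.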
Friday, January 25, 2013
Some interesting slides/papers about trace optimization in JVM
http://researcher.watson.ibm.com/researcher/files/us-pengwu/challeng-potential-trace-compilation.pdf
http://researcher.watson.ibm.com/researcher/files/us-pengwu/oopsla111-wu.pdf
http://researcher.watson.ibm.com/researcher/files/us-pengwu/UIUC-Seminar-Scripting-Languages-05-03.pdf
SINOF: A dynamic-static combined framework for dynamic binary translation
http://dl.acm.org/citation.cfm?id=2350593&CFID=174278273&CFTOKEN=47796794
Similar to a permanent code cache, previously compiled blocks are saved and loaded by future runs.
Saved blocks are analyzed and optimized using runtime profiling information.
1. What kind of analysis and optimization do they use?
2. What kind of information do they collect at runtime?
3. What's the benefit?
First, they use their own IR, and explain why they don't use the LLVM or UQBT IR.
Second, in their evaluation, both the guest ISA and the host ISA are IA32! Still, they report an average of 1.38X, normalized to native execution time.
A low-overhead dynamic optimization framework for multicores
http://dl.acm.org/citation.cfm?id=2370899&CFID=174278273&CFTOKEN=47796794
I can't tell what they do from the abstract.
It is a very short paper (2 pages), but I still have no idea what they did.
Adaptive multi-level compilation in a trace-based Java JIT compiler
http://dl.acm.org/citation.cfm?id=2384630&CFID=173817186&CFTOKEN=20831145
An extension of IBM's trace-based JVM work published at CGO 2011.
The ``meat'' of this paper is: trace recompilation via the Trace-Transition Graph.
That is, they select the traces to be recompiled using the Trace-Transition Graph.
First, the recompiled code fragments are still TRACES, not regions!
Their scenario is that there may be short, fragmented traces because of the maximum-length limit in the initial trace-building phase. They set max-trace-length to two.
They would like to merge those fragmented traces into one trace.
The other interesting part is the construction of the Trace-Transition Graph.
The information needed is: 1. the transitions between traces, and 2. how frequent each transition is.
They did not use hardware performance monitoring information.
Instead, they periodically check which transitions between traces occur and record their frequency.
They use a branch-and-link instruction for transitions between traces, rather than a plain jump.
With this approach, the link register records which trace is the source.
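To make the idea of a trace-transition graph concrete, here is a minimal sketch of the bookkeeping only. It assumes that each sampled transition yields a (source trace, target trace) pair, where the source would be recovered from the link register as described above. The class and method names are hypothetical and are not taken from the paper.

```python
from collections import Counter

class TraceTransitionGraph:
    """Edge-weighted graph of observed trace-to-trace transitions (sketch)."""

    def __init__(self):
        self.edges = Counter()   # (source_trace, target_trace) -> sample count

    def record(self, source_trace, target_trace):
        # Called once per sampled transition.
        self.edges[(source_trace, target_trace)] += 1

    def hottest_successor(self, source_trace):
        # The most frequent successor is the natural candidate for merging.
        candidates = {dst: n for (src, dst), n in self.edges.items()
                      if src == source_trace}
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

graph = TraceTransitionGraph()
for _ in range(3):
    graph.record("A", "B")
graph.record("A", "C")
```

A recompilation pass could then walk the hottest edges to decide which short traces to merge; that selection policy is not modeled here.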
Thursday, December 27, 2012
Monday, December 17, 2012
Friday, December 14, 2012
Wednesday, December 12, 2012
Tuesday, December 4, 2012
Sunday, December 2, 2012
ARM SPEC CPU2006 Native Run
ARM SPEC CPU2006
gcc flags:
-static -O3 -marm -march=armv7-a -mtune=cortex-a8 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ffast-math -ftree-vectorize -funroll-all-loops
ARM Native Run with Ref. input
400.perlbench 2878 2879
401.bzip2 5007 4975
403.gcc 2942 2942
429.mcf 4082 4081
445.gobmk 3265 3267
456.hmmer 3268 3280
458.sjeng 3952 3935
462.libquantum 19697 19732
464.h264ref 4700 4718
471.omnetpp 2772 2782
473.astar 3002 2969
483.xalancbmk 2772 2766
Friday, November 30, 2012
Trace Mode, Ref Input
Trace Mode, CINT2006 reference input:
400.perlbench 9770 6657 1.47 *
401.bzip2 9650 2623 VE
403.gcc 8050 6133 RE
429.mcf 9120 9.01 RE
445.gobmk 10490 10356 1.01 *
456.hmmer 9330 6044 1.54 *
458.sjeng 12100 10927 1.11 *
462.libquantum 20720 22428 0.924 *
464.h264ref 22130 11599 1.91 *
471.omnetpp 6250 204 RE
473.astar 7020 4235 1.66 *
483.xalancbmk 6900 6767 1.02 *
Failed to run 401, 403, 429, 471.
429.mcf needs 839 MB, but we only have 913 MB on the ARM host, which is not enough memory.
After creating a 1 GB swap, mcf runs successfully.
In top, only 17 MB of the swap area is used.
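The last numeric column of the table above appears to be the SPEC ratio, that is, the reference time divided by the measured run time. This reading of the columns is an assumption, although the arithmetic agrees with the listed values. A small check, using three rows copied from the table:

```python
def spec_ratio(reference_time, measured_time):
    """SPEC-style ratio: reference time divided by measured run time."""
    return reference_time / measured_time

# (reference time, measured time) copied from the table above.
rows = {
    "400.perlbench": (9770, 6657),
    "462.libquantum": (20720, 22428),
    "464.h264ref": (22130, 11599),
}

ratios = {name: spec_ratio(ref, run) for name, (ref, run) in rows.items()}
```

A ratio below 1, as for 462.libquantum, means the run took longer than the reference time.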
Thursday, November 29, 2012
IBTC + Victim Performance Evaluation
IBTC IBTC+Victim
2^11 0.997770 0.999741
2^12 0.999387 0.999932
2^13 0.999725 0.999981
2^14 0.999857 0.999992
2^15 0.999887
2^16 0.999949
2^17 0.999955
2^18 0.999995
Block Mode
Performance (No inline)
IBTC(2^11)+Victim
T: 369.863281 0.545624 138.762726 0.000000 230.276672 0.278259
IBTC(2^18)
T: 375.984528 0.372223 138.313660 0.000000 237.024841 0.273804
Performance - Inline
IBTC(2^11)+Victim
T: 367.764587 0.378113 141.984161 0.000000 225.127594 0.274719
IBTC(2^18)
T: 373.440033 0.549896 145.442383 0.000000 227.171478 0.276276
IBTC(2^12)+Victim (make ibtc table 2^16 KB)
T: 4m20s
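For readers unfamiliar with the setup being measured, the following is a minimal model of an indirect branch translation cache (IBTC) backed by a small victim cache. It is a sketch under several assumptions: the table is direct-mapped on the low bits of the guest PC, the victim cache is a tiny fully associative FIFO, and a victim hit is counted as a hit. None of these details are confirmed by the numbers above, and all names are hypothetical.

```python
from collections import OrderedDict

class IBTC:
    """Direct-mapped guest-PC -> host-address table with an optional victim cache."""

    def __init__(self, index_bits, victim_entries=0):
        self.mask = (1 << index_bits) - 1
        self.table = {}                  # slot -> (guest_pc, host_addr)
        self.victim = OrderedDict()      # guest_pc -> host_addr, FIFO order
        self.victim_entries = victim_entries
        self.hits = 0
        self.lookups = 0

    def lookup(self, guest_pc, translate):
        self.lookups += 1
        slot = guest_pc & self.mask
        entry = self.table.get(slot)
        if entry is not None and entry[0] == guest_pc:
            self.hits += 1
            return entry[1]
        if guest_pc in self.victim:
            host = self.victim.pop(guest_pc)   # victim hit
            self.hits += 1
        else:
            host = translate(guest_pc)         # slow path
        # Install the new entry; the evicted one moves to the victim cache.
        if entry is not None and self.victim_entries:
            self.victim[entry[0]] = entry[1]
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        self.table[slot] = (guest_pc, host)
        return host

def hit_ratio(cache, pcs):
    for pc in pcs:
        cache.lookup(pc, translate=lambda p: p + 0x1000)
    return cache.hits / cache.lookups

conflicting = [0, 2, 0, 2]   # both PCs map to slot 0 when index_bits=1
plain = hit_ratio(IBTC(index_bits=1), conflicting)
with_victim = hit_ratio(IBTC(index_bits=1, victim_entries=1), conflicting)
```

The example shows the effect the table above suggests: two branches that conflict in a small table always miss without the victim cache, while a one-entry victim cache turns the repeated accesses into hits.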
Hanging at SSH2_MSG_SERVICE_ACCEPT received
Question: The connection to a remote server is slow, and with ssh -v it hangs at ``SSH2_MSG_SERVICE_ACCEPT received''.
Solution:
Edit /etc/ssh/sshd_config and add one line:
UseDNS no
Restart the ssh service and it is done.
Wednesday, November 28, 2012
work log
*** longjmp causes uninitialized stack frame ***
longjmp corrupt stack exception
abort execution
maybe cpu state content got wrongly overwritten.
R7 base register is overwritten!
Mark R7 as ReservedReg in ARMBaseRegisterInfo.cpp
- mmap memory manager is broken; stop using it until we fix it.
Constant Pool related information and bugs
http://weblogs.java.net/blog/mlam/archive/2008/03/cvm_jit_constan.html
constant pool bug:
Trace fragments are usually much bigger than block fragments. The LLVM ARM JIT fails to place the constant pool within the range of load instructions with an immediate offset (from -4095 to +4095 bytes); this is referred to as the out-of-range bug.
My first thought was that the ARM JIT forgot to take the out-of-range case into consideration, but ARMConstantIslandPass does handle it. I then quickly found that the source of this bug is a wrongly calculated offset.
Why would this happen in trace mode? Because I added one intrinsic, BLOCKLINK, which takes 24 bytes, but I didn't update GetInstSizeInBytes() in ARMBaseInstrInfo. After I added this information to GetInstSizeInBytes(), the offset is calculated correctly.
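The effect of an underestimated instruction size can be illustrated with a toy layout calculation. The sketch below is not LLVM code: it assumes the constant pool sits right after the last instruction, that the unknown pseudo-instruction was sized as 4 bytes before the fix (an assumption made only for this example), and it uses the ARM-mode rule that the PC reads as the instruction address plus 8.

```python
MAX_LDR_OFFSET = 4095   # reach of an ARM-mode LDR with a 12-bit immediate

def distance_to_pool(sizes, load_index):
    """Byte distance from a PC-relative load to a pool placed after the code."""
    load_addr = sum(sizes[:load_index])
    pool_addr = sum(sizes)
    return pool_addr - (load_addr + 8)   # ARM-mode PC = instruction address + 8

# One load, 1019 ordinary 4-byte instructions, then one BLOCKLINK.
estimated = [4] + [4] * 1019 + [4]    # BLOCKLINK wrongly sized as 4 bytes
actual    = [4] + [4] * 1019 + [24]   # BLOCKLINK really takes 24 bytes

estimated_distance = distance_to_pool(estimated, load_index=0)
actual_distance = distance_to_pool(actual, load_index=0)
```

With the estimated sizes the pool looks reachable, while the real layout pushes it just past the limit, which is the kind of mismatch described above.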
Tuesday, November 27, 2012
Work Log
Initialize() -> StartAll() -> Create Queues and Start each threads QCond.Initialize()->
Loop() -> TryGenerateTrace() -> QCond.Wait() - until start_=false
^|----------------------------------------------------|
Before Fork
StopAll() -> set start_ to false for all threads -> -> QCond.Destroy()
After Fork
StartAll()
After inserting tasks into queue, call QCond.Wake()
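The lifecycle above (start, wait on a condition until woken, stop before fork, start again after fork) can be sketched with a condition variable. This is a simplified single-worker model; the class and method names are hypothetical and only loosely mirror StartAll(), StopAll() and QCond.

```python
import threading
from collections import deque

class Worker:
    """One optimization thread that sleeps on a condition instead of polling."""

    def __init__(self):
        self.cond = threading.Condition()
        self.queue = deque()
        self.running = False
        self.results = []
        self.thread = None

    def start(self):                      # roughly StartAll() for one thread
        with self.cond:
            self.running = True
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def _loop(self):
        while True:
            with self.cond:
                # Sleep until there is work or a stop was requested.
                while self.running and not self.queue:
                    self.cond.wait()
                if not self.queue:        # stopped and fully drained
                    return
                task = self.queue.popleft()
            self.results.append(task())   # run the task outside the lock

    def submit(self, task):               # insert a task, then wake the worker
        with self.cond:
            self.queue.append(task)
            self.cond.notify()

    def stop(self):                       # roughly StopAll() for one thread
        with self.cond:
            self.running = False
            self.cond.notify_all()
        self.thread.join()

worker = Worker()
worker.start()
for i in range(3):
    worker.submit(lambda i=i: i * i)
worker.stop()
```

In this sketch, stop() lets the worker drain the remaining tasks before it exits, and calling start() again afterwards creates a fresh thread, which corresponds to the stop-before-fork and start-after-fork steps.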
Friday, November 16, 2012
Related Works
[TODO: Add paper links to all related papers and top 10 to read ]
- IBM PowerVM Lx86 http://www.ibm.com/developerworks/linux/lx86/index.html
- PDF file: http://www.redbooks.ibm.com/redpapers/pdfs/redp4298.pdf
- FX!32
- Dynamo
- Advances and Future Challenges in Binary Translation and Optimization, PROCEEDINGS OF THE IEEE, VOL. 89, NO. 11, NOVEMBER 2001
- Design and Engineering of a Dynamic Binary Optimizer, PROCEEDINGS OF THE IEEE, VOL. 93, NO. 2, FEBRUARY 2005
- Precise Exception Semantics in Dynamic Compilation, Proceeding CC '02 Proceedings of the 11th International Conference on Compiler Construction
- Transmeta
- UQBT, Walkabout
- DynamoRIO
- Persistent code cache
- PIN
- Valgrind
- StarDBT, HDTrans
- Dr. Memory
- QEMU
Friday, November 9, 2012
work log
- @20121109 - 10:50 AM, Run CINT2006, train input, ARM host, block mode.
- expected results: Perlbench and Omnetpp fail, others should be OK.
- waiting...
- 400.perlbench -- 1091 -- S
- 401.bzip2 -- 730 -- S
- 403.gcc -- 427 -- S
- 429.mcf -- 314 -- S
- 445.gobmk -- 3792 -- S
- 456.hmmer -- 786 -- S
- 458.sjeng -- 3158 -- S
- 462.libquantum -- 71.3 -- S
- 464.h264ref -- 1647 -- S
- 471.omnetpp NR
- 473.astar -- 1075 -- S
- 483.xalancbmk -- 2544 -- S
- Same as above except in trace mode.
- expected result: hmmer hang in
- 462.libquantum hang!
- =========================================
- 400.perlbench -- 892 -- S
- 401.bzip2 -- 656 -- S
- 403.gcc -- 363 -- S
- 429.mcf -- 303 -- S
- 445.gobmk -- 3022 VE
- 456.hmmer NR
- 458.sjeng -- 2378 -- S
- 462.libquantum -- 13802 RE
- 464.h264ref -- 250 RE
- 471.omnetpp -- 96.7 RE
- 473.astar -- 984 -- S
- 483.xalancbmk -- 1688 -- S
- =========================================
- @20121110 Run ref input
- Block mode.
- Perlbench hang! input splitmail; in I == E error.
- bzip2 miscompared
- gcc mis-compared
- mcf hang!
- -----------------------------------------------------------------------------------
- Error: 1x400.perlbench 1x401.bzip2 1x403.gcc 1x429.mcf 1x464.h264ref 1x471.omnetpp
- Success: 1x445.gobmk 1x456.hmmer 1x458.sjeng 1x462.libquantum 1x473.astar 1x483.xalancbmk
- -----------------------------------------------------------------------------------
- 400.perlbench 9770 42463 RE
- 401.bzip2 9650 3163 VE
- 403.gcc 8050 8501 RE
- 429.mcf 9120 26049 RE
- 445.gobmk 10490 19462 0.539 *
- 456.hmmer 9330 7932 1.18 *
- 458.sjeng 12100 24805 0.488 *
- 462.libquantum 20720 22558 0.918 *
- 464.h264ref 22130 1937 RE
- 471.omnetpp 6250 238 RE
- 473.astar 7020 5281 1.33 *
- 483.xalancbmk 6900 7353 0.938 *
Thursday, November 8, 2012
work log
- Optimization threads use polling to probe for tasks in the task queue.
- They use 17% CPU just polling an empty task queue continuously.
- Should change to a condition-wait approach!
- 456.hmmer is trapped in an infinite loop when running in trace mode.
- Is it because of the traces? Or is it due to the ``O0'' compiled code?
- When only the ``O0'' compiled code is executed in block mode, hmmer completes successfully.
- So, it is the traces' fault!!! NOT GOOD!
- Status of trace mode:
- 401.bzip2: OK, 142s
- 403.gcc: OK, 416s
- 429.mcf: OK, 56s
- 445.gobmk: OK, 804s
- 456.hmmer, NOT OK, infinite loop
- Due to generated traces.
- 458.sjeng -- 116 -- S
- 462.libquantum -- 14.0 -- S
- 464.h264ref -- 256 RE (SegFault)
- 473.astar -- 100 -- S
- 483.xalancbmk -- 171 -- S
- Debug 456.hmmer
- Check MI used by traces and compared with those used in blocks.
- Fail! they are the same
- Will debugging/fixing h264ref be slightly easier?
Wednesday, November 7, 2012
weblog
The Problem:
When set to llvm::CodeGenOpt::None, some executions can cause a segfault on the ARM host.
Reduced Test Case:
Found in the 483.xalancbmk benchmark:
movd %edi,%xmm1
pshufd $0x0,%xmm1,%xmm0
mov 0x24(%esp),%ebx
lea (%ebx,%ecx,4),%ecx
mov %ecx,0x14(%esp)
xor %ecx,%ecx
mov 0x14(%esp),%ebx
movdqa %xmm0,(%ebx)
add $0x1,%ecx
add $0x10,%ebx
cmp %ebp,%ecx
jb _end
Reason:
The generated ARM code contains the instruction:
vld1.64 {d0-d1}, [sp, :128]
which requires $sp to be 16-byte (128-bit) aligned.
BUT! $sp is not 16-byte aligned!
The interesting thing is that executing this instruction did not throw any exception. Instead, the value of $sp changed! Therefore, any later instruction that accesses the stack causes a segfault.
Solution:
Make sure that $sp is at least 32-byte aligned in the prologue.
Code in prologue generated by TCG ARM:
Before:
----------------------------------------------------------
push {r4, r5, r6, r8, r9, r10, r11, lr}
sub sp, sp, #128 ; 0x80 # reserve some space
bx r0 # go to code cache
pop {r4, r5, r6, r8, r9, r10, r11, pc}
----------------------------------------------------------
After:
----------------------------------------------------------
push {r4, r5, r6, r8, r9, r10, r11, lr}
st sp, [r7, xxx] # store stack pointer
sub sp, sp, #65536 ; 0x10000 # reserve some space
bic sp, sp, 0x1f # align to 32-byte
bx r0 # go to code cache
ld sp, [r7, xxx] # restore stack pointer
pop {r4, r5, r6, r8, r9, r10, r11, pc}
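The bic instruction in the new prologue clears the low five bits of $sp, which rounds it down to a 32-byte boundary. The arithmetic can be checked with a few lines; the sample stack pointer value below is made up for illustration.

```python
def align_down(address, alignment):
    """Round address down to a multiple of alignment (a power of two)."""
    return address & ~(alignment - 1)

sample_sp = 0xbefff6c4                  # hypothetical unaligned stack pointer
aligned_sp = align_down(sample_sp, 32)  # same effect as: bic sp, sp, 0x1f
```

Because the rounding discards the original low bits, the old value cannot be recovered from the aligned one, which is why the prologue stores $sp before aligning and reloads it before the final pop.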
CP1
======================================================
======================================================
Tuesday, October 30, 2012
x86-to-ARM LnQ Status
Block mode, test inputs
400.perlbench NR
401.bzip2 -- 180 -- S
403.gcc -- 466 -- S
429.mcf -- 53.7 -- S
445.gobmk -- 874 -- S
456.hmmer NR
458.sjeng -- 129 -- S
462.libquantum -- 15.7 -- S
464.h264ref -- 0.0248 RE
471.omnetpp -- 62.1 RE
473.astar -- 126 -- S
483.xalancbmk -- 218 -- S
------------------------------------------------------------------------------------
Summary:
- 400.perlbench and 456.hmmer cannot run due to floating-point precision errors
- 464.h264ref
- 471.omnetpp
Trace Mode, test inputs
400.perlbench NR
401.bzip2 -- 140 -- S
403.gcc -- 335 RE
429.mcf -- 53.5 -- S
445.gobmk -- 533 RE
456.hmmer NR
458.sjeng -- 106 -- S
462.libquantum -- 14.3 -- S
464.h264ref -- 0.0257 RE
471.omnetpp -- 50.3 RE
473.astar -- 104 -- S
483.xalancbmk -- 169 -- S
------------------------------------------------------------------------------------
Summary:
- Errors inherited from block mode
- 403.gcc
- 445.gobmk
===============================================================
Guest applications are re-compiled with gcc 4.7 to make sure they really use SSE instructions.
- Perlbench still stuck in arith.t due to precision problem
- bzip OK
- gcc: KILLILL, illegal instruction due to incorrect encoding of VST1LNd32
- fix alignment encoding for these instructions in getAddrMode6AddressOpValue() in ARMCodeEmitter.cpp
- Re-test all but perlbench and hmmer
- hang on leslie3d, try to find out why...
- trap in infinite loop; maybe due to precision error?
- skip it.
- Omnetpp: Fail to run; try to find out why...
- QEMU cannot run either.
- Bwaves: mis-compare; does QEMU get the same result?
- Gamess: mis-compare; does QEMU get the same result?
- NOTE: unlike bwaves, there are minus signs where it should not appear
- Milc: mis-compare;
- Zeus: segfault.
- Reason: On ARM Linux, shared libraries such as ld-2.13.so and libpthread-2.13.so are loaded starting at 0x40000000, and the image of the x86 guest starts at 0x08048000. Zeus asks for 0x45efa000 bytes of memory for its image, which cannot fit in the ``hole'' between 0x08048000 and 0x40000000.
- So I moved the qemu image to 0x90000000;
- The guest image is then finally placed between 0x40000000 and 0x90000000.
- However, during execution the guest asks for more memory, and it eventually breaks...
- Gromacs: mis-compare
- Leslie3d: mis-compare
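The address-space problem noted for Zeus above is simple arithmetic and can be verified directly. The addresses are the ones quoted in the note; the helper name is hypothetical, and the check ignores whatever the shared libraries themselves occupy above 0x40000000.

```python
GUEST_IMAGE_BASE = 0x08048000   # start of the x86 guest image
SHARED_LIB_BASE = 0x40000000    # where ARM Linux loads shared libraries
QEMU_IMAGE_BASE = 0x90000000    # new location of the qemu image
ZEUS_REQUEST = 0x45efa000       # bytes requested by Zeus for its image

def fits(start, end, size):
    """True if size bytes fit in the address range [start, end)."""
    return end - start >= size

hole = SHARED_LIB_BASE - GUEST_IMAGE_BASE
fits_in_hole = fits(GUEST_IMAGE_BASE, SHARED_LIB_BASE, ZEUS_REQUEST)
fits_after_move = fits(SHARED_LIB_BASE, QEMU_IMAGE_BASE, ZEUS_REQUEST)
```

The hole is 0x37fb8000 bytes, which is smaller than the request, while the range up to 0x90000000 is large enough for the initial image but leaves little room for later allocations.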
Summary:
- QEMU FP precision problem
- Perlbench and hmmer trap in infinite loop due to FP precision problem
- Omnetpp fail to run;
- Only cactusADM and namd succeed among the floating-point benchmarks...
- 9 CINT2006 benchmarks run successfully.
- Measure timing...
400.perlbench NR
401.bzip2 -- 163 -- S
403.gcc -- 507 -- S
429.mcf -- 56.1 -- S
445.gobmk -- 955 -- S
456.hmmer NR
458.sjeng -- 126 -- S
462.libquantum -- 16.3 -- S
464.h264ref -- 387 VE
471.omnetpp NR
473.astar -- 114 -- S
483.xalancbmk -- 230 -- S
======================================================================
Trace :
401.bzip2: NOT OK
- Infinite loop
403.gcc: mis-compare
429.mcf: OK
445.gobmk: OK
458.sjeng: OK
462.libquantum: OK
473.astar: OK
464.h264ref: NOT OK
*** longjmp causes uninitialized stack frame ***: /home/tk/lnq/install/bin/qemu-i386 terminated
Aborted (core dumped)
483.xalancbmk: NOT OK
- Terminate without printing anything
400.perlbench NR
401.bzip2 NR
403.gcc -- 460 VE
429.mcf -- 56.5 -- S
445.gobmk -- 874 -- S
456.hmmer NR
458.sjeng -- 112 -- S
462.libquantum -- 14.7 -- S
464.h264ref NR
471.omnetpp NR
473.astar -- 98.4 -- S
483.xalancbmk NR
===============================================================
Debug Trace Mode:
===============================================================
First, try to find out whose fault it is, and always fight with an easiest-bug-first strategy.
===============================================================
GCC: Mis-compare
- When in trace mode, the blocks are compiled with IFastEnable and Opt::None options. So, check whether this error comes from fast instruction selection mode.
- Set optimization options to ``Default'' and disable IFastEnable.
- This error is gone.
- Confirm!
- Now debugging becomes simple: run $l gcc twice, with FastISel enabled and disabled, and compare the logs to locate where it went wrong.
- FAIL!
- Another approach:
- EnableFastISel does not affect correctness.
- llvm optimization does!
- When opt is set to None, gcc got segfault!
- When opt is set to Less, gcc runs successfully.
- Run block with None, and compare used MI.
- Run experiment!
- Nothing found!
- Still don't know why llvm::CodeGenOpt::None causes the fault; try to find a reduced example.
- Run CINT2006 with llvm::CodeGenOpt::None
- Wait for result...
- Error: 1x401.bzip2 1x403.gcc 1x445.gobmk 1x464.h264ref 1x483.xalancbmk
- Success: 1x429.mcf 1x458.sjeng 1x462.libquantum 1x473.astar 1x999.specrand
Monday, October 29, 2012
Work log
In target_i386/cpu.h
- CPU86_LDouble =:: struct { uint64_t low, uint16_t high; }
Note:
- x86 versions of perlbench - arith.t and omnetpp fail to run on ARM with QEMU-0.13 official version but run successfully in QEMU-1.1.1
- I tried two compiler : gcc-4.7 and gcc-4.4
- The reason is that QEMU does not use CONFIG_SOFTFLOAT for i386 target, even when host is ARM. As a result, QEMU uses only 64-bit double for x86 80-bit floating point registers.
- This is strange! I never heard DY report that he fails to run omnetpp, or perlbench.
- Others can now run test inputs successfully
- Plan to port LnQ to QEMU 1.2 + LLVM 3.2.
- Correctness is far more important than performance!!!!!
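The precision issue mentioned in the note above (64-bit doubles standing in for 80-bit x87 registers) comes down to significand width: a double carries 53 significant bits, while the x87 extended format carries 64. The snippet below only demonstrates the double side of that gap; it does not emulate x87 arithmetic.

```python
import sys

DOUBLE_SIGNIFICAND_BITS = sys.float_info.mant_dig   # 53 for IEEE 754 doubles
X87_SIGNIFICAND_BITS = 64                            # x87 extended precision

def survives_double(n):
    """True if the integer n is exactly representable as a 64-bit double."""
    return int(float(n)) == n

needs_54_bits = 2**53 + 1   # fits in an 80-bit x87 register, not in a double
```

Any guest computation whose result depends on those extra bits can diverge when the emulator keeps x87 registers in doubles.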
Work log - ARM NEON Guest Support
- [TODO] ARM NEON Guest Support
- [TODO] Refine system mode experiment slide
- [TODO] Re-implement LnQ
- [TODO] Journal paper start
- [Q] Blog ?
- [Q] Plot tutorial
- [Q] x86_64 on ARM64, run user-mode in system mode, ARM64 emulation?
ARM NEON Guest Support:
- recompile SPEC with flags: -static -O3 -marm -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=softfp -mfpu=neon -ffast-math -ftree-vectorize -funroll-all-loops -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
- GCC configure :
- ../gcc-linaro-4.7-2012.05/configure --build=i686-build_pc-linux-gnu --host=i686-build_pc-linux-gnu --target=arm-linux-gnueabi --prefix=/home/tk/local/arm-toolchain --enable-languages=c,c++,fortran --with-arch=armv7-a --with-tune=cortex-a8 --with-fpu=neon --with-float=softfp --with-sysroot=/home/tk/local/arm-toolchain/.build/arm-linux-gnueabi/libc --with-pkgversion='crosstool-NG linaro-1.13.1-2012.05-20120523 - Linaro GCC 2012.05' --with-bugurl=https://bugs.launchpad.net/gcc-linaro --enable-__cxa_atexit --enable-libmudflap --enable-libgomp --enable-libssp --with-gmp=/home/tk/local/arm-toolchain/.build --with-mpfr=/home/tk/local/arm-toolchain/.build --with-mpc=/home/tk/local/arm-toolchain/.build --with-ppl=/home/tk/local/arm-toolchain/.build --with-cloog=/home/tk/local/arm-toolchain/.build --enable-cloog-backend=isl --with-libelf=/home/tk/local/arm-toolchain/.build --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-gold --with-local-prefix=/home/tk/local/arm-toolchain/arm-linux-gnueabihf/libc --enable-c99 --enable-long-long --with-mode=arm
- 'ptrdiff_t' does not name a type: after GCC 4.6, we need to include <cstddef> manually in source files.
- don't want to modify source files, so add ``#include <cstddef>'' in c++config.h
- boring...
- When compiling 464.h264ref, the assembler reports errors such as: ']' expected: vld4.i32 {d16, d18, d20, d22}, [sp:64]
- as from binutils 2.19 and 2.21 does not understand this because it uses double-spaced registers, whose type is double-precision floating point
- Solution: use binutils 2.23. However, now the linker ld hits an assertion error (WTF!).
- I don't know how to get rid of this assertion error, so I commented out the assertion in bfd/elf32-arm.c:11757
- the compiled executable can run successfully on pandaboard
- Run native with test inputs
- CINT: h264ref verification error
- 434.zeusmp, 447.dealII, 450.soplex, 459.GemsFDTD, 482.sphinx3 runtime error.
- 434.zeusmp: over 1G bss (0x45df2c7c bytes), fixed after add 4G swap
- 447.dealII: segfault; fixed after changing binutils from 2.19 to 2.23
- 482.sphinx3: program error, program misbehaved.
- Summary: all good except 482.sphinx3
Thursday, October 25, 2012
multi-byte no ops in x86
http://www.asmpedia.org/index.php?title=NOP
90 nop
6690 xchg ax,ax ; 66: switch to 16-bit operand 90: opcode
0f1f00 nop dword ptr [eax] ; 0f1f: 2-byte opcode 00: mod=00 reg=000 rm=000 [EAX]
0f1f4000 nop dword ptr [eax] ; 0f1f: 2-byte opcode 40: mod=01 reg=000 rm=000 [EAX+0x00]
0f1f440000 nop dword ptr [eax+eax] ; 0f1f: 2-byte opcode 44: mod=01 reg=000 rm=100 SIB + 0x00
660f1f440000 nop word ptr [eax+eax] ; 66: switch to 16-bit operand 0f1f: 2-byte opcode 44: mod=01 reg=000 rm=100 SIB + 0x00
0f1f8000000000 nop dword ptr [eax] ; 0f1f: 2-byte opcode 80: mod=10 reg=000 rm=000 [EAX+0x00000000]
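The mod/reg/rm annotations in the list above follow directly from the bit layout of the ModRM byte (mod in bits 7-6, reg in bits 5-3, rm in bits 2-0). The helper below decodes that byte and derives the total length of a 0f 1f NOP under 32-bit addressing. It is a sketch: it does not inspect the SIB byte, so the special case of a SIB base of 101 with mod=00 is not handled.

```python
def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

def bytes_after_modrm(modrm):
    """SIB plus displacement bytes that follow ModRM (32-bit addressing)."""
    mod, _, rm = decode_modrm(modrm)
    if mod == 0b11:
        return 0                      # register operand, nothing follows
    extra = 1 if rm == 0b100 else 0   # rm=100 means a SIB byte follows
    if mod == 0b01:
        extra += 1                    # 8-bit displacement
    elif mod == 0b10 or (mod == 0b00 and rm == 0b101):
        extra += 4                    # 32-bit displacement
    return extra

def long_nop_length(modrm):
    """Length of the encoding 0f 1f <modrm> ... without prefixes."""
    return 2 + 1 + bytes_after_modrm(modrm)

lengths = {hex(m): long_nop_length(m) for m in (0x00, 0x40, 0x44, 0x80)}
```

A 66 operand-size prefix adds one more byte, which gives the six-byte form in the list.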
Tuesday, October 23, 2012
ARM architecture slides and doc
Lessons from the ARM Architecture by Richard Grisenthwaite
http://www.eit.lth.se/fileadmin/eit/courses/eitf20/ARM_RG.pdf [backup]
ARMv8 Technology Preview
Talk :
http://armdevices.net/2011/11/05/armv8-technology-preview-a-highly-technical-presentation-video/
Slide:
http://www.arm.com/files/downloads/ARMv8_Architecture.pdf
ARMv8 reference manual
http://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf
Article: CPUs Have Been Doing GPU Computing Badly for Years
http://blogs.arm.com/multimedia/305-cpus-have-been-doing-gpu-computing-badly-for-years/
Wednesday, October 10, 2012
Cross compile LnQ: Building LnQ as an ARM executable on an x86 platform
Prerequisites:
- llvm-gcc-arm:
- ../configure --prefix=/home/tk/research/llvm-qemu/tool/llvm-gcc-4.2-2.9-arm --program-prefix=llvm- --enable-llvm=/home/tk/research/llvm-qemu/llvm/install/llvm-2.9-official --target=arm-none-linux-gnueabi --with-sysroot=/home/tk/research/llvm-qemu/tool/arm-2012.03/arm-none-linux-gnueabi/libc --enable-languages=c,c++
- Make sure that /home/tk/research/llvm-qemu/tool/arm-2012.03/bin in PATH
- DO USE official LLVM 2.9, NOT LnQ's LLVM 2.9.
- llvm-2.9 x86 version, LnQ's version
- host and target set to i686-pc-linux-gnu
- llvm-2.9 ARM version, LnQ's version
- host and target set to arm-none-linux-gnueabi
- ARM toolchain
- download from
Environment setting:
- LLVM_ARM=llvm-2.9-arm/bin
- LLVM=llvm-2.9/bin
- LLVM_GCC=llvm-gcc-4.2-2.9-arm/bin
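- Set PATH so that LLVM_ARM comes first, then LLVM, then LLVM_GCC.
- llvm-link and opt in LLVM_ARM must first be made non-executable.
- Reason: we need $LLVM_ARM/llvm-config to set LD_FLAGS, but we also need the two files $LLVM/llvm-link and $LLVM/opt. So both directories have to be on PATH.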
LnQ configure:
- Add --cross-prefix='arm-linux-gnueabi-' --cpu=armv7l
- configure --target-list=i386-linux-user --prefix=$INSTALL --enable-lnq --disable-strip --cross-prefix='arm-linux-gnueabi-' --cpu=armv7l
- wasted time chasing a ghost...
- Question: can the number of workers slow down Xalancbmk????
- train input, merge used
- #1: 334.756721
- #2: 342.477593
- #3: 376.607939
- #1: 348.298319
- #1: 333.039942
- #3: 357.604264
- #1: 331.753310
- #3: 359.777587
- Is the code generation time the cause? No!
- performance difference, trace, i386
- -3.37, perlbench
- -1.25, bzip2
- 4.82, gcc
- -1.64, mcf
- 0.00, gobmk
- -2.72,
- -2.54
- -6.59
- -6.90
- -4.65
- 2.50
- 3.63
- performance difference, region, i386
- -6.61, sjeng
- -6.40, h264ref
- -5.31, omnetpp
- -2.83, libquantum
- -2.30, bzip2
- -2.11, mcf
- -1.65, perlbench
- -1.49, gobmk
- -1.11, hmmer
- 7.11, gcc
- 9.26, astar
- 2.00, xalancbmk
- performance difference, trace, ARM
- 483.xalancbmk -12.53
- 471.omnetpp -9.11
- 456.hmmer -3.41
- 429.mcf -2.43
- 473.astar -2.02
- 403.gcc -1.33
- 445.gobmk -1.21
- 400.perlbench -1.04
- 464.h264ref 0.00
- 401.bzip2 2.22
- 458.sjeng 2.84
- 462.libquantum 7.46
- performance difference, region, ARM
- 471.omnetpp -16.53
- 445.gobmk -15.13
- 483.xalancbmk -12.62
- 458.sjeng -10.03
- 462.libquantum -2.49
- 473.astar -1.83
- 456.hmmer -1.70
- 401.bzip2 -0.47
- 403.gcc -0.16
- 429.mcf 0.00
- 400.perlbench 0.58
Friday, October 5, 2012
LaTeX Symbols
Script | Script | Script | Script | Script
---|---|---|---|---
\leq | \geq | \equiv | \models | \prec
\succ | \sim | \perp | \preceq | \succeq
\simeq | \mid | \ll | \gg | \asymp
\parallel | \subset | \supset | \approx | \bowtie
\subseteq | \supseteq | \cong | \sqsubset | \sqsupset
\neq | \smile | \sqsubseteq | \sqsupseteq | \doteq
\frown | \in | \ni | \notin | \propto
\vdash | \dashv | < | > | =
Script | Script | Script | Script
---|---|---|---
\pm | \cap | \diamond | \oplus
\mp | \cup | \bigtriangleup | \ominus
\times | \uplus | \bigtriangledown | \otimes
\div | \sqcap | \triangleleft | \oslash
\ast | \sqcup | \triangleright | \odot
\star | \vee | \bigcirc | \circ
\dagger | \wedge | \bullet | \setminus
\ddagger | \cdot | \wr | \amalg
Script | Script
---|---
\exists | \rightarrow or \to
\nexists | \leftarrow or \gets
\forall | \mapsto
\neg | \implies
\subset | \Rightarrow (preferred for implication)
\supset | \leftrightarrow
\in | \iff
\notin | \Leftrightarrow (preferred for equivalence (iff))
\ni | \top
\land | \bot
\lor | \emptyset and \varnothing
Script | Script | Script | Script
---|---|---|---
| | \| | / | \backslash
\{ | \} | \langle | \rangle
\uparrow | \Uparrow | \lceil | \rceil
\downarrow | \Downarrow | \lfloor | \rfloor
Script | Script
---|---
\Alpha and \alpha | \Nu and \nu
\Beta and \beta | \Xi and \xi
\Gamma and \gamma | \Omicron and \omicron
\Delta and \delta | \Pi, \pi and \varpi
\Epsilon, \epsilon and \varepsilon | \Rho, \rho and \varrho
\Zeta and \zeta | \Sigma, \sigma and \varsigma
\Eta and \eta | \Tau and \tau
\Theta, \theta and \vartheta | \Upsilon and \upsilon
\Iota and \iota | \Phi, \phi and \varphi
\Kappa, \kappa and \varkappa | \Chi and \chi
\Lambda and \lambda | \Psi and \psi
\Mu and \mu | \Omega and \omega
Script | Script | Script | Script | Script
---|---|---|---|---
\partial | \imath | \Re | \nabla | \aleph
\eth | \jmath | \Im | \Box | \beth
\hbar | \ell | \wp | \infty | \gimel
Script | Script | Script | Script
---|---|---|---
\sin | \arcsin | \sinh | \sec
\cos | \arccos | \cosh | \csc
\tan | \arctan | \tanh |
\cot | \arccot | \coth |
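A minimal sketch of how a few of the commands listed above are used in a document. It assumes the amsmath and amssymb packages are loaded (\implies comes from amsmath and \nexists from amssymb); the formulas themselves are illustrative only and are not taken from this log.

```latex
\documentclass{article}
\usepackage{amsmath}  % provides \implies
\usepackage{amssymb}  % provides \nexists and \varnothing
\begin{document}
% Set and logic notation
\[ \forall x \in A \cap B \implies x \in A \]
\[ A \subseteq B \iff \nexists\, x \in A \setminus B \]
% Delimiters and relations
\[ \lfloor x \rfloor \leq x \leq \lceil x \rceil \]
% Greek letters and trigonometric functions
\[ \sin^2 \theta + \cos^2 \theta = 1 \]
\end{document}
```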