Find the equivalent NEON instructions for the following SSE instructions:
PCMPEQBrr
PCMPEQDrr
PSLLDri
PSUBUSBrr
PUNPCKHBWrr
PUNPCKHWDrr
PUNPCKLBWrr
PUNPCKLDQrr
PUNPCKLQDQrr
PUNPCKLWDrr
Friday, November 30, 2012
Trace Mode, Ref Input
Trace Mode, CINT2006 reference input:
Benchmark        Ref. Time  Run Time  Ratio
400.perlbench    9770       6657      1.47   *
401.bzip2        9650       2623      VE
403.gcc          8050       6133      RE
429.mcf          9120       9.01      RE
445.gobmk        10490      10356     1.01   *
456.hmmer        9330       6044      1.54   *
458.sjeng        12100      10927     1.11   *
462.libquantum   20720      22428     0.924  *
464.h264ref      22130      11599     1.91   *
471.omnetpp      6250       204       RE
473.astar        7020       4235      1.66   *
483.xalancbmk    6900       6767      1.02   *
Failed to run 401, 403, 429, and 471.
429.mcf needs 839 MB, but the ARM board has only 913 MB, which is not enough memory.
After creating a 1 GB swap area, mcf runs successfully.
In top, only 17 MB of the swap area is in use.
Thursday, November 29, 2012
IBTC + Victim Performance Evaluation
Size   IBTC      IBTC+Victim
2^11   0.997770  0.999741
2^12   0.999387  0.999932
2^13   0.999725  0.999981
2^14   0.999857  0.999992
2^15   0.999887
2^16   0.999949
2^17   0.999955
2^18   0.999995
Block Mode
Performance (No inline)
IBTC(2^11)+Victim
T: 369.863281 0.545624 138.762726 0.000000 230.276672 0.278259
IBTC(2^18)
T: 375.984528 0.372223 138.313660 0.000000 237.024841 0.273804
Performance - Inline
IBTC(2^11)+Victim
T: 367.764587 0.378113 141.984161 0.000000 225.127594 0.274719
IBTC(2^18)
T: 373.440033 0.549896 145.442383 0.000000 227.171478 0.276276
IBTC(2^12)+Victim (make ibtc table 2^16 KB)
T: 4m20s
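For readers unfamiliar with the setup being measured, here is a minimal sketch of a direct-mapped IBTC backed by a small victim buffer. It is not the implementation used for the numbers above: the index function (low bits of the guest PC), the victim size, the oldest-first replacement and the `fake_translate` helper are all assumptions made for illustration.

```python
from collections import OrderedDict


class IBTC:
    """Direct-mapped indirect branch translation cache with an optional victim buffer."""

    def __init__(self, index_bits, victim_entries=0):
        self.mask = (1 << index_bits) - 1
        self.table = {}               # slot -> (guest_pc, host_addr)
        self.victim = OrderedDict()   # guest_pc -> host_addr, oldest first
        self.victim_entries = victim_entries
        self.lookups = 0
        self.hits = 0

    def lookup(self, guest_pc, translate):
        self.lookups += 1
        slot = guest_pc & self.mask   # assumed hash: low bits of the guest PC
        entry = self.table.get(slot)
        if entry is not None and entry[0] == guest_pc:
            self.hits += 1            # fast path: main table hit
            return entry[1]
        if guest_pc in self.victim:
            host_addr = self.victim.pop(guest_pc)
            self.hits += 1            # victim hit: promoted back into the table below
        else:
            host_addr = translate(guest_pc)   # slow path: full lookup
        if entry is not None and self.victim_entries > 0:
            self.victim[entry[0]] = entry[1]  # displaced entry moves to the victim buffer
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop the oldest victim
        self.table[slot] = (guest_pc, host_addr)
        return host_addr

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


def fake_translate(guest_pc):
    """Hypothetical stand-in for the slow-path lookup of translated code."""
    return guest_pc + 0x100000


def run(cache, guest_pcs):
    for pc in guest_pcs:
        cache.lookup(pc, fake_translate)
    return cache.hit_rate()


# Two branch targets that collide in a 2^11-entry table, taken alternately.
conflicting = [0x1000, 0x1800] * 5
plain_rate = run(IBTC(11), conflicting)
victim_rate = run(IBTC(11, victim_entries=4), conflicting)
```

In this toy pattern the plain table never hits because the two targets keep evicting each other, while the victim buffer catches every lookup after the first two. That is the kind of conflict miss a small victim buffer is meant to absorb without growing the main table.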
Hanging at SSH2_MSG_SERVICE_ACCEPT received
Question: The connection to the remote server is slow, and when run with -v, ssh hangs at SSH2_MSG_SERVICE_ACCEPT received.
Solution:
Edit /etc/ssh/sshd_config and add one line:
UseDNS no
Restart the ssh service and you are done.
Wednesday, November 28, 2012
work log
*** longjmp causes uninitialized stack frame ***
longjmp raises a corrupt-stack exception and execution aborts.
Maybe the CPU state content got wrongly overwritten.
The R7 base register is overwritten!
Fix: mark R7 as ReservedReg in ARMBaseRegisterInfo.cpp.
- mmap memory manager is broken; stop using it until we fix it.
Constant Pool related information and bugs
http://weblogs.java.net/blog/mlam/archive/2008/03/cvm_jit_constan.html
constant pool bug:
Trace fragments are usually much bigger than block fragments. The LLVM ARM JIT fails to place the constant pool within the range of load instructions that use an immediate offset (a 12-bit offset, so -4095 to +4095 bytes), which is referred to as the out-of-range bug.
My first thought was that the ARM JIT forgot to take the out-of-range case into consideration, but ARMConstantIslandPass does handle it. I then quickly found that the source of this bug was a wrongly calculated offset.
Why would this happen in trace mode? Because I added an intrinsic, BLOCKLINK, which takes 24 bytes, but I didn't update GetInstSizeInBytes() in ARMBaseInstrInfo. After I added this information to GetInstSizeInBytes(), the offset is calculated correctly.
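To see how an under-counted instruction size leads to an out-of-range constant pool, here is a small arithmetic sketch. Only the 24-byte BLOCKLINK size comes from the notes above. The fragment shape, the 4-byte size reported before the fix, and the placement of the pool right after the fragment are assumptions chosen for illustration.

```python
LDR_LITERAL_RANGE = 4095   # ARM-mode LDR (literal): 12-bit byte offset
PC_READ_AHEAD = 8          # in ARM state the PC reads as instruction address + 8


def pool_distance(instrs, load_index, size_of):
    """Distance from the load's PC to a pool placed right after the fragment."""
    offsets, pos = [], 0
    for name in instrs:
        offsets.append(pos)
        pos += size_of[name]
    return pos - (offsets[load_index] + PC_READ_AHEAD)


def in_range(distance):
    return abs(distance) <= LDR_LITERAL_RANGE


# Hypothetical trace fragment: one literal load, 40 BLOCKLINKs, 900 plain instructions.
fragment = ["ldr_literal"] + ["BLOCKLINK"] * 40 + ["alu"] * 900

stale_sizes = {"ldr_literal": 4, "BLOCKLINK": 4, "alu": 4}    # assumed size before the fix
fixed_sizes = {"ldr_literal": 4, "BLOCKLINK": 24, "alu": 4}   # 24 bytes, as noted above

estimated = pool_distance(fragment, 0, stale_sizes)   # what the size estimator believes
actual = pool_distance(fragment, 0, fixed_sizes)      # what the emitted code really needs
```

With these made-up numbers the estimator sees 3756 bytes and concludes the pool is reachable, while the real distance is 4556 bytes. Each BLOCKLINK hides 20 bytes, so the error grows with trace length, which fits the bug appearing in trace mode rather than block mode.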
Tuesday, November 27, 2012
Work Log
Initialize() -> StartAll() -> create queues and start each thread, QCond.Initialize() ->
Loop() -> TryGenerateTrace() -> QCond.Wait() - until start_=false
^|----------------------------------------------------|
Before Fork
StopAll() -> set start_ to false for all threads -> -> QCond.Destroy()
After Fork
StartAll()
After inserting tasks into the queue, call QCond.Wake().
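The lifecycle above (wait on a condition until woken, stop before fork, start again after fork) can be sketched with a standard condition variable. This is only an illustration in Python, not the actual QCond code: the class and method names are hypothetical, the task callables stand in for TryGenerateTrace(), and letting the worker drain pending tasks before it exits is a choice made for this sketch.

```python
import threading
from collections import deque


class TraceWorker:
    """One optimization thread that sleeps on a condition instead of polling."""

    def __init__(self):
        self.cond = threading.Condition()
        self.queue = deque()
        self.running = False      # plays the role of start_
        self.done = []
        self.thread = None

    def start(self):
        with self.cond:
            self.running = True
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def _loop(self):
        while True:
            with self.cond:
                # Sleep until there is work or a stop request (no busy polling).
                while self.running and not self.queue:
                    self.cond.wait()
                if not self.queue:        # stopped and nothing left to do
                    return
                task = self.queue.popleft()
            self.done.append(task())      # stand-in for TryGenerateTrace()

    def submit(self, task):
        with self.cond:
            self.queue.append(task)
            self.cond.notify()            # wake the worker after inserting a task

    def stop(self):
        with self.cond:
            self.running = False
            self.cond.notify_all()
        self.thread.join()                # worker drains pending tasks, then exits


worker = TraceWorker()
worker.start()
for n in range(3):
    worker.submit(lambda n=n: n * n)
worker.stop()      # e.g. before fork
worker.start()     # e.g. after fork
worker.submit(lambda: 7)
worker.stop()
```

Stopping and joining every worker before fork matters because only the forking thread exists in the child, so worker threads have to be created again afterwards.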
Friday, November 16, 2012
Related Works
[TODO: Add paper links to all related papers and top 10 to read ]
- IBM PowerVM Lx86 http://www.ibm.com/developerworks/linux/lx86/index.html
- PDF file: http://www.redbooks.ibm.com/redpapers/pdfs/redp4298.pdf
- FX!32
- Dynamo
- Advances and Future Challenges in Binary Translation and Optimization, PROCEEDINGS OF THE IEEE, VOL. 89, NO. 11, NOVEMBER 2001
- Design and Engineering of a Dynamic Binary Optimizer, PROCEEDINGS OF THE IEEE, VOL. 93, NO. 2, FEBRUARY 2005
- Precise Exception Semantics in Dynamic Compilation, CC '02: Proceedings of the 11th International Conference on Compiler Construction
- Transmeta
- UQBT, Walkabout
- DynamoRIO
- Persistent code cache
- PIN
- Valgrind
- StarDBT, HDTrans
- Dr. Memory
- QEMU
Friday, November 9, 2012
work log
- @20121109 - 10:50 AM, Run CINT2006, train input, ARM host, block mode.
- Expected results: perlbench and omnetpp fail; the others should be OK.
- waiting...
- 400.perlbench -- 1091 -- S
- 401.bzip2 -- 730 -- S
- 403.gcc -- 427 -- S
- 429.mcf -- 314 -- S
- 445.gobmk -- 3792 -- S
- 456.hmmer -- 786 -- S
- 458.sjeng -- 3158 -- S
- 462.libquantum -- 71.3 -- S
- 464.h264ref -- 1647 -- S
- 471.omnetpp NR
- 473.astar -- 1075 -- S
- 483.xalancbmk -- 2544 -- S
- Same as above except in trace mode.
- expected result: hmmer hang in
- 462.libquantum hang!
- =========================================
- 400.perlbench -- 892 -- S
- 401.bzip2 -- 656 -- S
- 403.gcc -- 363 -- S
- 429.mcf -- 303 -- S
- 445.gobmk -- 3022 VE
- 456.hmmer NR
- 458.sjeng -- 2378 -- S
- 462.libquantum -- 13802 RE
- 464.h264ref -- 250 RE
- 471.omnetpp -- 96.7 RE
- 473.astar -- 984 -- S
- 483.xalancbmk -- 1688 -- S
- =========================================
- @20121110 Run ref input
- Block mode.
- Perlbench hang! input splitmail; in I == E error.
- bzip2 miscompared
- gcc mis-compared
- mcf hang!
- -----------------------------------------------------------------------------------
- Error: 1x400.perlbench 1x401.bzip2 1x403.gcc 1x429.mcf 1x464.h264ref 1x471.omnetpp
- Success: 1x445.gobmk 1x456.hmmer 1x458.sjeng 1x462.libquantum 1x473.astar 1x483.xalancbmk
- -----------------------------------------------------------------------------------
- 400.perlbench 9770 42463 RE
- 401.bzip2 9650 3163 VE
- 403.gcc 8050 8501 RE
- 429.mcf 9120 26049 RE
- 445.gobmk 10490 19462 0.539 *
- 456.hmmer 9330 7932 1.18 *
- 458.sjeng 12100 24805 0.488 *
- 462.libquantum 20720 22558 0.918 *
- 464.h264ref 22130 1937 RE
- 471.omnetpp 6250 238 RE
- 473.astar 7020 5281 1.33 *
- 483.xalancbmk 6900 7353 0.938 *
Thursday, November 8, 2012
work log
- Optimization threads poll the task queue to look for tasks.
- This uses 17% CPU just continuously polling an empty task queue.
- Should switch to a condition-wait approach!
- 456.hmmer is trapped in an infinite loop when running in trace mode.
- Is it because of traces? Or is it due to the ``O0'' compiled code?
- When just executing ``O0'' code in block mode, hmmer completes successfully.
- So it is the traces' fault! NOT GOOD!
- Status of trace mode:
- 401.bzip2: OK, 142s
- 403.gcc: OK, 416s
- 429.mcf: OK, 56s
- 445.gobmk: OK, 804s
- 456.hmmer, NOT OK, infinite loop
- Due to generated traces.
- 458.sjeng -- 116 -- S
- 462.libquantum -- 14.0 -- S
- 464.h264ref -- 256 RE (SegFault)
- 473.astar -- 100 -- S
- 483.xalancbmk -- 171 -- S
- Debug 456.hmmer
- Check the MIs used by traces and compare them with those used in blocks.
- Failed! They are the same.
- Will debugging h264ref be slightly easier?
Wednesday, November 7, 2012
weblog
The Problem:
When code generation is set to llvm::CodeGenOpt::None, some executions segfault on the ARM host.
Reduced Test Case:
Found in the 483.xalancbmk benchmark:
movd %edi,%xmm1
pshufd $0x0,%xmm1,%xmm0
mov 0x24(%esp),%ebx
lea (%ebx,%ecx,4),%ecx
mov %ecx,0x14(%esp)
xor %ecx,%ecx
mov 0x14(%esp),%ebx
movdqa %xmm0,(%ebx)
add $0x1,%ecx
add $0x10,%ebx
cmp %ebp,%ecx
jb _end
Reason:
The generated ARM code contains the instruction:
vld1.64 {d0-d1}, [sp, :128]
which requires $sp to be 16-byte (128-bit) aligned.
But $sp is not 16-byte aligned!
Interestingly, executing this instruction did not raise any exception. Instead, the value of $sp changed, so every later instruction that accessed the stack caused a segfault.
Solution:
Make sure $sp is at least 32-byte aligned in the prologue.
Code in prologue generated by TCG ARM:
Before:
----------------------------------------------------------
push {r4, r5, r6, r8, r9, r10, r11, lr}
sub sp, sp, #128 ; 0x80 # reserve some space
bx r0 # go to code cache
pop {r4, r5, r6, r8, r9, r10, r11, pc}
----------------------------------------------------------
After:
----------------------------------------------------------
push {r4, r5, r6, r8, r9, r10, r11, lr}
st sp, [r7, xxx] # store stack pointer
sub sp, sp, #65536 ; 0x10000 # reserve some space
bic sp, sp, 0x1f # align to 32-byte
bx r0 # go to code cache
ld sp, [r7, xxx] # restore stack pointer
pop {r4, r5, r6, r8, r9, r10, r11, pc}
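The effect of the bic step can be checked with a few lines of arithmetic. The sketch below only models the stack-pointer updates of the new prologue. The starting $sp value is made up, and the function names are hypothetical.

```python
def align_down(addr, alignment):
    """Clear the low bits, like `bic sp, sp, #(alignment - 1)`."""
    assert alignment > 0 and alignment & (alignment - 1) == 0   # power of two only
    return addr & ~(alignment - 1)


def prologue_sp(sp, saved_regs=8, reserve=0x10000, alignment=32):
    """Model how the new prologue moves $sp; returns (saved_sp, aligned_sp)."""
    sp -= 4 * saved_regs          # push of eight 32-bit registers
    saved_sp = sp                 # value stored to [r7, xxx] and restored before pop
    sp -= reserve                 # sub sp, sp, #65536
    sp = align_down(sp, alignment)
    return saved_sp, sp


# Hypothetical entry value of $sp that is only 4-byte aligned.
saved_sp, aligned_sp = prologue_sp(0xBEFFE8C4)
```

Because the bic can move $sp by an amount that depends on its entry value, the epilogue cannot undo it with a fixed add, which is presumably why the original $sp is stored and later reloaded. A 32-byte aligned $sp also satisfies the 16-byte requirement of the :128 access.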
CP1