Friday, November 30, 2012

Worklog

Find equivalent NEON instructions for the following SSE instructions:
 PCMPEQBrr
 PCMPEQDrr
 PSLLDri
 PSUBUSBrr
 PUNPCKHBWrr
 PUNPCKHWDrr
 PUNPCKLBWrr
 PUNPCKLDQrr
 PUNPCKLQDQrr
 PUNPCKLWDrr

Trace Mode, Ref Input

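Trace Mode, CINT2006 reference input (columns: benchmark, reference time in s, run time in s, ratio, status; * = reported run, RE = runtime error, VE = validation error):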
400.perlbench    9770    6657         1.47  *
401.bzip2        9650    2623               VE
403.gcc          8050    6133               RE
429.mcf          9120       9.01            RE
445.gobmk       10490   10356         1.01  *
456.hmmer        9330    6044         1.54  *
458.sjeng       12100   10927         1.11  *
462.libquantum  20720   22428         0.924 *
464.h264ref     22130   11599         1.91  *
471.omnetpp      6250     204               RE
473.astar        7020    4235         1.66  *
483.xalancbmk    6900    6767         1.02  *
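
The ratio column in the table above appears to be the reference time divided by the measured run time, shown with three significant digits. A minimal sketch of that arithmetic, assuming this reading of the columns is right (the helper name `spec_ratio` is made up for illustration):

```python
def spec_ratio(ref_time_s, run_time_s):
    """Return ref_time / run_time rounded to three significant digits."""
    ratio = ref_time_s / run_time_s
    return float(f"{ratio:.3g}")

# (reference time, run time) pairs copied from the table above.
results = {
    "400.perlbench": (9770, 6657),
    "462.libquantum": (20720, 22428),
    "464.h264ref": (22130, 11599),
}
ratios = {name: spec_ratio(ref, run) for name, (ref, run) in results.items()}
for name, ratio in ratios.items():
    print(f"{name:16s} {ratio}")
```

A ratio above 1 means the translated run finished faster than the reference time; 462.libquantum is the only completed benchmark here that falls below 1.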

401, 403, 429, and 471 failed to run.
429.mcf needs 839 MB, but we only have 913 MB on the ARM board, which is not enough memory.
After creating a 1 GB swap, mcf runs successfully.
In top, only 17 MB of the swap area is in use.

Thursday, November 29, 2012

IBTC + Victim Performance Evaluation

     IBTC      IBTC+Victim
2^11 0.997770  0.999741
2^12 0.999387  0.999932
2^13 0.999725  0.999981
2^14 0.999857  0.999992
2^15 0.999887
2^16 0.999949
2^17 0.999955
2^18 0.999995
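
To make the comparison above easier to picture, here is a rough simulation of an indirect branch translation cache with a small victim buffer. It is only a sketch under assumptions: the table is taken to be direct-mapped and indexed by the low bits of the guest PC, the victim buffer is a small FIFO of evicted entries, and the figures above are read as hit rates. The class and method names are invented for illustration and are not the real implementation.

```python
from collections import OrderedDict

class IBTC:
    """Direct-mapped guest-PC -> host-address table with an optional victim buffer."""

    def __init__(self, index_bits, victim_entries=0):
        self.mask = (1 << index_bits) - 1
        self.table = {}               # slot -> (guest_pc, host_addr)
        self.victim = OrderedDict()   # guest_pc -> host_addr, oldest first
        self.victim_entries = victim_entries
        self.hits = 0
        self.lookups = 0

    def lookup(self, guest_pc):
        self.lookups += 1
        slot = guest_pc & self.mask
        entry = self.table.get(slot)
        if entry is not None and entry[0] == guest_pc:
            self.hits += 1
            return entry[1]
        if guest_pc in self.victim:
            # Victim hit: move the entry back into the main table.
            self.hits += 1
            host_addr = self.victim.pop(guest_pc)
            self._install(slot, guest_pc, host_addr)
            return host_addr
        return None

    def insert(self, guest_pc, host_addr):
        self._install(guest_pc & self.mask, guest_pc, host_addr)

    def _install(self, slot, guest_pc, host_addr):
        old = self.table.get(slot)
        if old is not None and old[0] != guest_pc and self.victim_entries:
            self.victim[old[0]] = old[1]          # keep the evicted entry
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop the oldest victim
        self.table[slot] = (guest_pc, host_addr)

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

def run(cache, guest_pcs, rounds):
    for _ in range(rounds):
        for pc in guest_pcs:
            if cache.lookup(pc) is None:
                cache.insert(pc, 0x80000000 + pc)  # hypothetical host address
    return cache.hit_rate()

# Two branch targets that map to the same slot of a 2^11-entry table.
conflicting = [0x1000, 0x1000 + (1 << 11)]
plain = run(IBTC(11), conflicting, 100)
with_victim = run(IBTC(11, victim_entries=4), conflicting, 100)
print(plain, with_victim)
```

In this toy workload the plain table thrashes on the two conflicting targets, while the victim buffer absorbs the conflict. That would be consistent with a small table plus victim buffer keeping up with a much larger plain table, as the numbers above suggest.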


Block Mode
Performance (No inline)
IBTC(2^11)+Victim
T: 369.863281   0.545624        138.762726      0.000000        230.276672      0.278259
IBTC(2^18)
T: 375.984528   0.372223        138.313660      0.000000        237.024841      0.273804

Performance - Inline
IBTC(2^11)+Victim
T: 367.764587   0.378113        141.984161      0.000000        225.127594      0.274719
IBTC(2^18)
T: 373.440033   0.549896        145.442383      0.000000        227.171478      0.276276
IBTC(2^12)+Victim (make ibtc table 2^16 KB)
T: 4m20s



Hanging at SSH2_MSG_SERVICE_ACCEPT received

Question: The connection to the remote server is slow, and with -v the client hangs at "SSH2_MSG_SERVICE_ACCEPT received".
Solution:
Edit /etc/ssh/sshd_config on the server and add one line:
UseDNS no

Restart the ssh service and it is done.

Wednesday, November 28, 2012

work log

 *** longjmp causes uninitialized stack frame ***

longjmp raises a corrupted-stack exception and execution aborts.
Maybe the CPU state contents were wrongly overwritten.
The R7 base register is overwritten!
Mark R7 as a reserved register (ReservedReg) in ARMBaseRegisterInfo.cpp.

  • The mmap memory manager is broken; stop using it until we fix it.

Constant Pool related information and bugs

http://weblogs.java.net/blog/mlam/archive/2008/03/cvm_jit_constan.html

Constant pool bug:
Trace fragments are usually much bigger than block fragments. The LLVM ARM JIT fails to place the constant pool within range of the load instructions that use an immediate offset (a 12-bit offset, ranging from -4095 to +4095 bytes relative to the PC). This is referred to as the out-of-range bug.
My first thought was that the ARM JIT forgot to take the out-of-range case into consideration, but ARMConstantIslandPass does handle it. I then found that the real source of the bug is a wrongly calculated offset.
Why would this happen in trace mode? Because I added an intrinsic, BLOCKLINK, which takes 24 bytes, but I didn't update GetInstSizeInBytes() in ARMBaseInstrInfo. After I added this information to GetInstSizeInBytes(), the offset is calculated correctly.
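
The effect of an under-reported instruction size can be sketched with plain arithmetic. The numbers below are invented for illustration: the fragment shape, the instruction counts, and the assumed 4-byte estimate for BLOCKLINK are all hypothetical. Only the 24-byte real size and the 12-bit load range come from the notes above.

```python
LDR_LITERAL_RANGE = 4095  # ARM-mode literal load: 12-bit immediate, +/- 4095 bytes

def layout(sizes):
    """Return (offset of each instruction, offset where the constant pool starts)."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append(pos)
        pos += size
    return offsets, pos

def in_range(load_offset, pool_offset):
    # In ARM mode the PC reads as the address of the instruction plus 8.
    return abs(pool_offset - (load_offset + 8)) <= LDR_LITERAL_RANGE

def fragment_sizes(blocklink_size, n_blocklinks=50, n_plain=800):
    # One literal load, then BLOCKLINK pseudo-instructions, then plain 4-byte instructions.
    return [4] + [blocklink_size] * n_blocklinks + [4] * n_plain

# What the placement pass believes if BLOCKLINK is assumed to be 4 bytes (hypothetical).
est_offsets, est_pool = layout(fragment_sizes(4))
# What is actually emitted once each BLOCKLINK expands to 24 bytes.
real_offsets, real_pool = layout(fragment_sizes(24))

estimated_ok = in_range(est_offsets[0], est_pool)
actually_ok = in_range(real_offsets[0], real_pool)
print(estimated_ok, actually_ok)
```

With these made-up counts the pass sees the pool about 3.4 KB away and leaves it alone, while the emitted code puts it about 4.4 KB away. That is the same kind of mismatch as the one described above, and reporting the true size makes the two layouts agree.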

Tuesday, November 27, 2012

Work Log

Initialize() -> StartAll() -> create queues and start each thread -> QCond.Initialize() ->
Loop() -> TryGenerateTrace() -> QCond.Wait() - until start_=false
          ^|----------------------------------------------------|

Before Fork
StopAll() -> set start_ to false for all threads -> -> QCond.Destroy()

After Fork
StartAll()

After inserting tasks into the queue, call QCond.Wake().
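
The flow above can be sketched with a condition variable guarding the task queue. This is a small model only, not the real code: `TraceWorker`, `submit`, and the lambda tasks are hypothetical stand-ins for the optimization thread, the queue insert plus QCond.Wake(), and TryGenerateTrace().

```python
import threading
from collections import deque

class TraceWorker:
    def __init__(self):
        self.queue = deque()
        self.cond = threading.Condition()   # plays the role of QCond
        self.started = False                # plays the role of start_
        self.thread = None
        self.done = []

    def start(self):                        # StartAll()
        with self.cond:
            self.started = True
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def _loop(self):                        # Loop()
        while True:
            with self.cond:
                # Sleep instead of polling while there is nothing to do.
                while self.started and not self.queue:
                    self.cond.wait()
                if not self.started:
                    return
                task = self.queue.popleft()
            self.done.append(task())        # run the task outside the lock

    def submit(self, task):                 # insert a task, then wake the worker
        with self.cond:
            self.queue.append(task)
            self.cond.notify()

    def stop(self):                         # StopAll(), called before fork
        with self.cond:
            self.started = False
            self.cond.notify_all()
        self.thread.join()

worker = TraceWorker()
worker.start()
worker.submit(lambda: "trace-1")
worker.stop()    # before fork: no worker thread is left running
worker.start()   # after fork: start again, pending tasks are still queued
worker.submit(lambda: "trace-2")
worker.stop()
```

Tasks that were still queued when the worker stopped stay in the queue and are picked up after the restart, so nothing is lost across the stop and start pair.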

http://weblogs.java.net/blog/mlam/archive/2008/03/cvm_jit_constan.html

Friday, November 16, 2012

Related Works

[TODO: Add links to all related papers and pick the top 10 to read]
  • IBM PowerVM Lx86 http://www.ibm.com/developerworks/linux/lx86/index.html
    • PDF file: http://www.redbooks.ibm.com/redpapers/pdfs/redp4298.pdf
  • FX!32 
  • Dynamo
  • Advances and Future Challenges in Binary Translation and Optimization, Proceedings of the IEEE, vol. 89, no. 11, November 2001
  • Design and Engineering of a Dynamic Binary Optimizer, Proceedings of the IEEE, vol. 93, no. 2, February 2005
  • Precise Exception Semantics in Dynamic Compilation, CC '02: Proceedings of the 11th International Conference on Compiler Construction
  • Transmeta
  • UQBT, Walkabout
  • DynamoRIO
    • Persistent code cache
  • Pin
  • Valgrind
  • StarDBT, HDTrans
  • Dr. Memory
  • QEMU


Friday, November 9, 2012

work log


  • @20121109 - 10:50 AM, Run CINT2006, train input, ARM host, block mode.
    • Expected results: perlbench and omnetpp fail; the others should be OK.
    • waiting...
    • 400.perlbench      --     1091           -- S
    • 401.bzip2          --      730           -- S                                  
    • 403.gcc            --      427           -- S                                  
    • 429.mcf            --      314           -- S                                  
    • 445.gobmk          --     3792           -- S                                  
    • 456.hmmer          --      786           -- S                                  
    • 458.sjeng          --     3158           -- S                                  
    • 462.libquantum     --       71.3         -- S                                  
    • 464.h264ref        --     1647           -- S                                  
    • 471.omnetpp                                 NR                                 
    • 473.astar          --     1075           -- S                                  
    • 483.xalancbmk      --     2544           -- S                                  
  • Same as above except in trace mode.
    • expected result: hmmer hang in 
    • 462.libquantum hangs!
    • =========================================
    • 400.perlbench      --      892           -- S
    • 401.bzip2          --      656           -- S
    • 403.gcc            --      363           -- S
    • 429.mcf            --      303           -- S
    • 445.gobmk          --     3022              VE
    • 456.hmmer                                   NR
    • 458.sjeng          --     2378           -- S
    • 462.libquantum     --    13802              RE
    • 464.h264ref        --      250              RE
    • 471.omnetpp        --       96.7            RE
    • 473.astar          --      984           -- S
    • 483.xalancbmk      --     1688           -- S
    • =========================================
  • @20121110 Run ref input
    • Block mode.
    • perlbench hangs! Input: splitmail; in I == E error.
    • bzip2 miscompared
    • gcc miscompared
    • mcf hangs!
    • -----------------------------------------------------------------------------------
    • Error: 1x400.perlbench 1x401.bzip2 1x403.gcc 1x429.mcf 1x464.h264ref 1x471.omnetpp
    • Success: 1x445.gobmk 1x456.hmmer 1x458.sjeng 1x462.libquantum 1x473.astar 1x483.xalancbmk
    • -----------------------------------------------------------------------------------
    • 400.perlbench    9770      42463            RE
    • 401.bzip2        9650       3163            VE
    • 403.gcc          8050       8501            RE
    • 429.mcf          9120      26049            RE
    • 445.gobmk       10490      19462      0.539 *
    • 456.hmmer        9330       7932      1.18  *
    • 458.sjeng       12100      24805      0.488 *
    • 462.libquantum  20720      22558      0.918 *
    • 464.h264ref     22130       1937            RE
    • 471.omnetpp      6250        238            RE
    • 473.astar        7020       5281      1.33  *
    • 483.xalancbmk    6900       7353      0.938 *

Thursday, November 8, 2012

work log


  • Optimization threads poll the task queue for new tasks.
    • Polling the empty task queue continuously uses 17% CPU.
    • Should change to a condition-wait approach!
  • 456.hmmer is trapped in an infinite loop when running in trace mode.
    • Is it because of traces? Or is it due to the ``O0'' compiled code?
    • When just executing the ``O0'' code in block mode, hmmer completes successfully.
    • So, it is the traces' fault!!! NOT GOOD!
  • Status of trace mode:
    • 401.bzip2: OK, 142s
    • 403.gcc: OK, 416s
    • 429.mcf: OK, 56s
    • 445.gobmk: OK, 804s
    • 456.hmmer, NOT OK, infinite loop
      • Due to generated traces.
    • 458.sjeng          --      116           -- S
    • 462.libquantum     --       14.0         -- S
    • 464.h264ref        --      256              RE (SegFault)
    • 473.astar          --      100           -- S
    • 483.xalancbmk      --      171           -- S
  • Debug 456.hmmer
    • Check the MIs used by traces and compare them with those used in blocks.
      • Fail! They are the same.
  • Will debugging h264ref be slightly easier?

Wednesday, November 7, 2012

weblog


The Problem:

With the optimization level set to llvm::CodeGenOpt::None, some executions segfault on the ARM host.

Reduced Test Case:

Found in the 483.xalancbmk benchmark:

    movd   %edi,%xmm1
    pshufd $0x0,%xmm1,%xmm0
    mov    0x24(%esp),%ebx
    lea    (%ebx,%ecx,4),%ecx
    mov    %ecx,0x14(%esp)
    xor    %ecx,%ecx
    mov    0x14(%esp),%ebx
    movdqa %xmm0,(%ebx)
    add    $0x1,%ecx
    add    $0x10,%ebx
    cmp    %ebp,%ecx
    jb     _end

Reason:

The generated ARM code contains the instruction:
vld1.64 {d0-d1}, [sp, :128]
which requires $sp to be 16-byte (128-bit) aligned.
BUT $sp is not 16-byte aligned!
The interesting thing is that executing this instruction did not throw any exception. Instead, the value of $sp changed! Therefore, any later instruction that accesses the stack causes a segfault.

Solution:

Make sure $sp is at least 32-byte aligned in the prologue.
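
Aligning the stack pointer down to a 32-byte boundary is just clearing its low five bits, which is what a `bic` with mask 0x1f does. A minimal sketch of that arithmetic (the starting stack pointer value is made up; `align_down` is a hypothetical helper):

```python
def align_down(addr, alignment):
    """Clear the low bits so addr becomes a multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return addr & ~(alignment - 1)

sp = 0xBEFFF6C4 - 0x10000         # hypothetical $sp after reserving 64 KB
aligned_sp = align_down(sp, 32)   # same effect as clearing bits 0..4 of $sp
print(hex(sp), hex(aligned_sp))
```

Because the stack grows downward, rounding down only makes the reserved area slightly larger (by at most 31 bytes), and a 32-byte aligned address also satisfies the 16-byte requirement of the `:128` alignment hint.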

Code in prologue generated by TCG ARM:

Before:
---------------------------------------------------------
push    {r4, r5, r6, r8, r9, r10, r11, lr}
sub     sp, sp, #128  ; 0x80 # reserve some space
bx      r0 # go to code cache
pop     {r4, r5, r6, r8, r9, r10, r11, pc}
---------------------------------------------------------
After:
---------------------------------------------------------
push    {r4, r5, r6, r8, r9, r10, r11, lr}
str     sp, [r7, xxx] # store stack pointer
sub     sp, sp, #65536  ; 0x10000 # reserve some space
bic     sp, sp, #0x1f  # align to 32 bytes
bx      r0 # go to code cache
ldr     sp, [r7, xxx] # restore stack pointer
pop     {r4, r5, r6, r8, r9, r10, r11, pc}
CP1
======================================================