Thursday, June 28, 2012

Think flow


  1. Implement Opt2 for qemu_st
  2. go home!

Work Log


  1. About zero-length arrays: http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
  2. https://wiki.linaro.org/PeterMaydell/QemuVersatileExpress Linaro QEMU V.Express support
    1. (IMG=vexpress.img ; if [ -e "$IMG" ] ; then sudo mount -o loop,offset="$(file "$IMG" | awk 'BEGIN { RS=";"; } /partition 2/ { print $7*512; }')" -t auto "$IMG" /mnt/mnt; else echo "$IMG not found"; fi )
  3. Linaro Android QEMU V.Express: https://wiki.linaro.org/KenWerner/Sandbox/AndroidQEMU
    1. vmlinuz and initrd.gz are inside uImage and uInitrd: use dd if=uImage skip=64 bs=1 to skip the 64-byte header and extract them
  4. Use reboot to shut down Android
  5. ARM-VExpress image: http://releases.linaro.org/12.05/ubuntu/vexpress/
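
For item 1 of the work log above, here is a minimal sketch of what a zero-length (or C99 flexible) array member is typically used for: a header and a variable-sized payload allocated as one block. The struct `line` and the helper `line_new` are illustrative names assumed for this example, not code from QEMU.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable-length record, used only for illustration. */
struct line {
    size_t length;
    char contents[];   /* C99 flexible array member; GNU C also accepts
                          `char contents[0];` as an extension */
};

/* Allocate the header and the payload in one block. */
static struct line *line_new(const char *text)
{
    size_t n;
    struct line *l;

    assert(text != NULL);
    n = strlen(text);
    l = malloc(sizeof(struct line) + n + 1);
    if (l == NULL) {
        return NULL;
    }
    l->length = n;
    memcpy(l->contents, text, n + 1);  /* copy including the trailing NUL */
    return l;
}
```

A caller would use `struct line *l = line_new("qemu");` and release the whole record with a single `free(l)`.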

Wednesday, June 27, 2012

Think Flow


  1. tcg_liveness_analysis: if the opcode is qemu_ld or qemu_st, set all globals alive.
  2. qemu_ld is OK now.
  3. restored to this morning's state
  4. qemu 1.01 can exit qemu on poweroff
  5. chaos
  6. qemu_st: fails with no output; trapped in some sort of loop?
  7. just forgot what I was going to do after viewing some web pages....

Friday, June 22, 2012

  • Change the location of SAVE_DIRTY_STATES
  • Change the conditional branch of tcg_out_tlb_load from 
  • It seems the PAGE FAULT is just a symptom of the wrong PATH being taken
    • WRONG! That was a mistake.
  • STRANGE THINGS HAPPEN, BE CAREFUL, FOCUS

Think flow


  1. I changed the conditional branch of load_tlb from JNE to JE, and the related code.
    1. Running this in the original QEMU still boots ARM-Linux, so far so good.
  2. I changed the location of SAVE_DIRTY_STATES; call this version QEMU_TK.
  3. QEMU_TK dies after the first page fault happens.
    1. Question: what is the difference between QEMU and QEMU_TK?
    2. difference means the state of 
  4. OK, we need to restore states back 

Thursday, June 21, 2012

Thinking Flow

study qemu code: tcg_reg_alloc_op(): 1708
  1. What is fixed_reg in TCGTemp?
    1. in tcg_global_reg_new_internal, fixed_reg is set to 1
    2. in tcg_global_mem_new_internal, fixed_reg is set to 0
    3. so it seems to indicate whether this temp is a register or not
  2. In TCGContext, what is reg_to_temp?
    1. in tcg_reg_alloc(), s->reg_to_temp[reg] tells whether the HOST register is mapped to any TCGTemp.
    2. So I think reg_to_temp[reg] is the temp that HOST register reg currently holds.
  3. What is val_type in TCGTemp?
    1. NOT CLEAR
    2. It seems to indicate the current type of this temp (e.g. TEMP_VAL_REG or TEMP_VAL_MEM).
    3. It is possible that ts->fixed_reg && ts->val_type == TEMP_VAL_MEM, or !ts->fixed_reg && ts->val_type == TEMP_VAL_REG.
  4. When is TCGArgDef args_ct set?


TCG: tcg_gen_code_common


In TCGContext:
/* liveness analysis */
uint16_t *op_dead_iargs;
/* for each operation, each bit tells if the corresponding input argument is dead */

what is tcg_op_defs

In: tcg_liveness_analysis, tcg/tcg.c: 1187
backward scan

NOTE: tcg_opc.h holds the definitions of the TCG opcodes (a.k.a. TCG IR)
So, remove TCG_OPF_CALL_CLOBBER from qemu_ld/st here

In tcg_liveness_analysis (tcg/tcg.c: 1292):
    } else if (def->flags & TCG_OPF_CALL_CLOBBER) {
        /* globals are live */
        memset(dead_temps, 0, s->nb_globals);
    }
Question: if we remove TCG_OPF_CALL_CLOBBER of qemu_ld/st, will this be a problem?
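
To make the question concrete, here is a toy backward liveness scan over one basic block, written as a sketch of the idea rather than QEMU's actual code. `ToyOp`, `toy_liveness` and `TOY_CALL_CLOBBER` are names assumed for illustration; `TOY_CALL_CLOBBER` stands in for TCG_OPF_CALL_CLOBBER, and the result array plays the role of op_dead_iargs (one bit per input argument).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_TEMPS    16
#define TOY_MAX_ARGS     4
#define TOY_CALL_CLOBBER 0x1   /* stands in for TCG_OPF_CALL_CLOBBER */

/* One toy IR operation: outputs and inputs are indices into the temp
   table; indices below nb_globals are globals. */
typedef struct {
    unsigned flags;
    int nb_oargs, nb_iargs;
    int oargs[TOY_MAX_ARGS];
    int iargs[TOY_MAX_ARGS];
} ToyOp;

/* Backward scan over one basic block. On return, bit i of dead_iargs[k]
   is set when input i of ops[k] is not needed after ops[k]. */
static void toy_liveness(const ToyOp *ops, int nb_ops,
                         int nb_globals, int nb_temps,
                         uint16_t *dead_iargs)
{
    uint8_t dead_temps[TOY_MAX_TEMPS];
    int k, i;

    assert(nb_globals >= 0 && nb_globals <= nb_temps);
    assert(nb_temps <= TOY_MAX_TEMPS);

    /* At the end of the block, globals are live and other temps are dead. */
    memset(dead_temps, 0, (size_t)nb_globals);
    memset(dead_temps + nb_globals, 1, (size_t)(nb_temps - nb_globals));

    for (k = nb_ops - 1; k >= 0; k--) {
        const ToyOp *op = &ops[k];

        /* Outputs are overwritten here, so they are dead before this op. */
        for (i = 0; i < op->nb_oargs; i++) {
            dead_temps[op->oargs[i]] = 1;
        }
        /* A clobbering op may read any global: force all globals live. */
        if (op->flags & TOY_CALL_CLOBBER) {
            memset(dead_temps, 0, (size_t)nb_globals);
        }
        /* Record which inputs die at this op, then mark them live. */
        dead_iargs[k] = 0;
        for (i = 0; i < op->nb_iargs; i++) {
            if (dead_temps[op->iargs[i]]) {
                dead_iargs[k] |= (uint16_t)(1u << i);
            }
            dead_temps[op->iargs[i]] = 0;
        }
    }
}
```

In this model, dropping the flag from a store-like op lets the scan treat a global as dead across that op whenever a later op overwrites it. That is only safe if the op really cannot observe the global, which is exactly what the question above asks.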

In: tcg_reg_alloc_op:


1708         if (def->flags & TCG_OPF_CALL_CLOBBER) {
1709             /* XXX: permit generic clobber register list ? */
1710             for(reg = 0; reg < TCG_TARGET_NB_REGS; reg++) {
1711                 if (tcg_regset_test_reg(tcg_target_call_clobber_regs, reg)) {
1712                     tcg_reg_free(s, reg);
1713                 }
1714             }
1715             /* XXX: for load/store we could do that only for the slow path
1716                (i.e. when a memory callback is called) */
1717          
1718             /* store globals and free associated registers (we assume the insn
1719                can modify any global. */
1720             save_globals(s, allocated_regs);
1721         }

Question: what does Marsalis Wallace look like ? or
What does tcg_reg_free do?

It loops over tcg_target_call_clobber_regs and,
if s->temps[reg]->mem_coherent is not true, stores reg back to env->temp_buf

Question: what does save_globals do?

  1. What does ``globals'' mean?
  2. In tcg/README, A TCG "global" is a variable which is live in all the functions (equivalent of a C global variable). They are defined before the functions defined. A TCG global can be a memory location (e.g. a QEMU CPU register), a fixed host register (e.g. the QEMU CPU state pointer) or a memory location which is stored in a register outside QEMU TBs  (not implemented yet).
  3. call temp_save to save temp
  4. In temp_save(), save temp to env->temp_buf
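
The notes above on fixed_reg, val_type, mem_coherent and save_globals can be tied together in a small model. This is a sketch under my own assumptions, not QEMU's implementation: `ToyTemp`, `toy_temp_save` and `toy_save_globals` are hypothetical names, and registers and memory slots are modelled as plain integers.

```c
#include <assert.h>
#include <stdint.h>

enum ToyValType { TOY_VAL_DEAD, TOY_VAL_REG, TOY_VAL_MEM };

/* Simplified stand-in for TCGTemp; only the fields discussed above. */
typedef struct {
    enum ToyValType val_type;  /* where the current value lives */
    int fixed_reg;             /* 1: permanently bound to a host register */
    int mem_coherent;          /* 1: memory copy equals the register copy */
    uint32_t reg_val;          /* models the host register contents */
    uint32_t mem_val;          /* models the memory slot of the temp */
} ToyTemp;

/* Write one temp back to memory if needed and release its register. */
static void toy_temp_save(ToyTemp *ts)
{
    if (ts->fixed_reg) {
        return;                        /* fixed registers are never spilled */
    }
    if (ts->val_type == TOY_VAL_REG) {
        if (!ts->mem_coherent) {
            ts->mem_val = ts->reg_val; /* store the dirty value */
        }
        ts->mem_coherent = 1;
        ts->val_type = TOY_VAL_MEM;
    }
}

/* Save every global; returns how many stores were actually needed. */
static int toy_save_globals(ToyTemp *temps, int nb_globals)
{
    int i, stores = 0;

    assert(nb_globals >= 0);
    for (i = 0; i < nb_globals; i++) {
        int dirty = temps[i].val_type == TOY_VAL_REG &&
                    !temps[i].fixed_reg && !temps[i].mem_coherent;
        toy_temp_save(&temps[i]);
        stores += dirty;
    }
    return stores;
}
```

The return value gives a rough feel for the cost: only globals that are in a register and not yet coherent with memory need a store.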
==================================================================
tcg_out_op() is called to generate code for the TCG opcode.
We are interested in tcg_out_qemu_ld/st

QUESTION:
Strangely enough, I cannot find the lines that save guest register states back to their canonical locations.
I only saw the save back to temp_buf at line 1708.
That is exactly the place.

==================================================================
Remove TCG_OPF_CALL_CLOBBER from qemu_ld
Move save_dirty_state to the TLB-miss path
The program fails when the first PAGE FAULT occurs.
Should compare REG contents between my version and the original version
==================================================================

Tuesday, June 19, 2012

QEMU ARM USE
When using -nographic, you cannot use Ctrl-Alt-2 to switch to the monitor screen, since there is no graphical window anymore.
Instead, use -curses;
then the monitor will show.

Alternatively, use -monitor stdio


How to change the runlevel through the kernel -append parameter

JUST ADD THE RUNLEVEL NUMBER
EXAMPLE:
 "root=/dev/sdb1 console=ttyAMA0 2"


HOW to mount qcow image used by QEMU
http://blog.loftninjas.org/2008/10/27/mounting-kvm-qcow2-qemu-disk-images/

Friday, June 8, 2012

Wednesday, June 6, 2012

ARM online reference site

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/Cjagdjbf.html

i7 currently RUNNING experiments:
TRACE_MERGE
TRACE
TRACE_NET_ORIG

Each configuration runs 4 benchmark sets: CINT-ARM, CINT-IA32, CFP-IA32, CFP_VECTOR-IA32
Each benchmark runs 5 times.
By this count there are 3 * 4 * 5 = 60 benchmark runs; the estimate below uses 120 runs.
Estimated time: 120 * 15000 sec = about 20 days
All runs will finish on 6/26!
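
The time estimate can be checked with a few lines of C. This is only a sketch: `estimate_days` is a hypothetical helper, and 15000 seconds per run is the figure taken from the note above. Both the 120-run figure and the plain 3 * 4 * 5 product can be plugged in.

```c
#include <assert.h>

#define SECONDS_PER_DAY 86400L

/* Total wall-clock days for `runs` runs of `secs_per_run` seconds each,
   assuming the runs execute back to back on one machine. */
static double estimate_days(long runs, long secs_per_run)
{
    assert(runs >= 0 && secs_per_run >= 0);
    return (double)(runs * secs_per_run) / SECONDS_PER_DAY;
}
```

For example, `estimate_days(120, 15000)` is 1,800,000 seconds divided by 86,400, a little under 21 days.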

Producing Wrong Data Without Doing Anything Obviously Wrong


LnQ Region Performance

r531 vs. r521

OMNETPP and XALANCBMK show performance drops of 15% and 8%, respectively.

SPECvirt_sc2010

SPECvirt_sc2010: SPEC's first benchmark addressing performance evaluation of datacenter servers used in virtualized server consolidation.





Tuesday, June 5, 2012

statically build OpenMP program

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39176#c7
When linking statically, we have to link pthread ourselves.
Add:
-Wl,--whole-archive -lpthread -Wl,--no-whole-archive
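
A small OpenMP program to try the static link on, as a sketch. The function `parallel_sum` and the file name below are assumptions for illustration; a plausible build line would be `gcc -fopenmp -static omp_sum.c -Wl,--whole-archive -lpthread -Wl,--no-whole-archive`, though the exact flag order may need adjusting for a given toolchain.

```c
#include <assert.h>

/* Sum 1..n with an OpenMP reduction. Without -fopenmp the pragma is
   ignored and the loop runs serially with the same result. */
static long parallel_sum(int n)
{
    long sum = 0;
    int i;

    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
```

Calling `parallel_sum(100)` from `main` and printing the result is enough to confirm that the statically linked binary starts its threads and exits cleanly.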

gcc sse builtin functions

http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/X86-Built_002din-Functions.html


     v8qi __builtin_ia32_paddb (v8qi, v8qi)
     v4hi __builtin_ia32_paddw (v4hi, v4hi)
     v2si __builtin_ia32_paddd (v2si, v2si)
     v8qi __builtin_ia32_psubb (v8qi, v8qi)
     v4hi __builtin_ia32_psubw (v4hi, v4hi)
     v2si __builtin_ia32_psubd (v2si, v2si)
     v8qi __builtin_ia32_paddsb (v8qi, v8qi)
     v4hi __builtin_ia32_paddsw (v4hi, v4hi)
     v8qi __builtin_ia32_psubsb (v8qi, v8qi)
     v4hi __builtin_ia32_psubsw (v4hi, v4hi)
     v8qi __builtin_ia32_paddusb (v8qi, v8qi)
     v4hi __builtin_ia32_paddusw (v4hi, v4hi)
     v8qi __builtin_ia32_psubusb (v8qi, v8qi)
     v4hi __builtin_ia32_psubusw (v4hi, v4hi)
     v4hi __builtin_ia32_pmullw (v4hi, v4hi)
     v4hi __builtin_ia32_pmulhw (v4hi, v4hi)
     di __builtin_ia32_pand (di, di)
     di __builtin_ia32_pandn (di,di)
     di __builtin_ia32_por (di, di)
     di __builtin_ia32_pxor (di, di)
     v8qi __builtin_ia32_pcmpeqb (v8qi, v8qi)
     v4hi __builtin_ia32_pcmpeqw (v4hi, v4hi)
     v2si __builtin_ia32_pcmpeqd (v2si, v2si)
     v8qi __builtin_ia32_pcmpgtb (v8qi, v8qi)
     v4hi __builtin_ia32_pcmpgtw (v4hi, v4hi)
     v2si __builtin_ia32_pcmpgtd (v2si, v2si)
     v8qi __builtin_ia32_punpckhbw (v8qi, v8qi)
     v4hi __builtin_ia32_punpckhwd (v4hi, v4hi)
     v2si __builtin_ia32_punpckhdq (v2si, v2si)
     v8qi __builtin_ia32_punpcklbw (v8qi, v8qi)
     v4hi __builtin_ia32_punpcklwd (v4hi, v4hi)
     v2si __builtin_ia32_punpckldq (v2si, v2si)
     v8qi __builtin_ia32_packsswb (v4hi, v4hi)
     v4hi __builtin_ia32_packssdw (v2si, v2si)
     v8qi __builtin_ia32_packuswb (v4hi, v4hi)
The following built-in functions are made available either with -msse, or with a combination of -m3dnow and -march=athlon. All of them generate the machine instruction that is part of the name.
     v4hi __builtin_ia32_pmulhuw (v4hi, v4hi)
     v8qi __builtin_ia32_pavgb (v8qi, v8qi)
     v4hi __builtin_ia32_pavgw (v4hi, v4hi)
     v4hi __builtin_ia32_psadbw (v8qi, v8qi)
     v8qi __builtin_ia32_pmaxub (v8qi, v8qi)
     v4hi __builtin_ia32_pmaxsw (v4hi, v4hi)
     v8qi __builtin_ia32_pminub (v8qi, v8qi)
     v4hi __builtin_ia32_pminsw (v4hi, v4hi)
     int __builtin_ia32_pextrw (v4hi, int)
     v4hi __builtin_ia32_pinsrw (v4hi, int, int)
     int __builtin_ia32_pmovmskb (v8qi)
     void __builtin_ia32_maskmovq (v8qi, v8qi, char *)
     void __builtin_ia32_movntq (di *, di)
     void __builtin_ia32_sfence (void)
The following built-in functions are available when -msse is used. All of them generate the machine instruction that is part of the name.
     int __builtin_ia32_comieq (v4sf, v4sf)
     int __builtin_ia32_comineq (v4sf, v4sf)
     int __builtin_ia32_comilt (v4sf, v4sf)
     int __builtin_ia32_comile (v4sf, v4sf)
     int __builtin_ia32_comigt (v4sf, v4sf)
     int __builtin_ia32_comige (v4sf, v4sf)
     int __builtin_ia32_ucomieq (v4sf, v4sf)
     int __builtin_ia32_ucomineq (v4sf, v4sf)
     int __builtin_ia32_ucomilt (v4sf, v4sf)
     int __builtin_ia32_ucomile (v4sf, v4sf)
     int __builtin_ia32_ucomigt (v4sf, v4sf)
     int __builtin_ia32_ucomige (v4sf, v4sf)
     v4sf __builtin_ia32_addps (v4sf, v4sf)
     v4sf __builtin_ia32_subps (v4sf, v4sf)
     v4sf __builtin_ia32_mulps (v4sf, v4sf)
     v4sf __builtin_ia32_divps (v4sf, v4sf)
     v4sf __builtin_ia32_addss (v4sf, v4sf)
     v4sf __builtin_ia32_subss (v4sf, v4sf)
     v4sf __builtin_ia32_mulss (v4sf, v4sf)
     v4sf __builtin_ia32_divss (v4sf, v4sf)
     v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
     v4si __builtin_ia32_cmpltps (v4sf, v4sf)
     v4si __builtin_ia32_cmpleps (v4sf, v4sf)
     v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
     v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
     v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
     v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
     v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
     v4si __builtin_ia32_cmpnleps (v4sf, v4sf)
     v4si __builtin_ia32_cmpngtps (v4sf, v4sf)
     v4si __builtin_ia32_cmpngeps (v4sf, v4sf)
     v4si __builtin_ia32_cmpordps (v4sf, v4sf)
     v4si __builtin_ia32_cmpeqss (v4sf, v4sf)
     v4si __builtin_ia32_cmpltss (v4sf, v4sf)
     v4si __builtin_ia32_cmpless (v4sf, v4sf)
     v4si __builtin_ia32_cmpunordss (v4sf, v4sf)
     v4si __builtin_ia32_cmpneqss (v4sf, v4sf)
     v4si __builtin_ia32_cmpnltss (v4sf, v4sf)
     v4si __builtin_ia32_cmpnless (v4sf, v4sf)
     v4si __builtin_ia32_cmpordss (v4sf, v4sf)
     v4sf __builtin_ia32_maxps (v4sf, v4sf)
     v4sf __builtin_ia32_maxss (v4sf, v4sf)
     v4sf __builtin_ia32_minps (v4sf, v4sf)
     v4sf __builtin_ia32_minss (v4sf, v4sf)
     v4sf __builtin_ia32_andps (v4sf, v4sf)
     v4sf __builtin_ia32_andnps (v4sf, v4sf)
     v4sf __builtin_ia32_orps (v4sf, v4sf)
     v4sf __builtin_ia32_xorps (v4sf, v4sf)
     v4sf __builtin_ia32_movss (v4sf, v4sf)
     v4sf __builtin_ia32_movhlps (v4sf, v4sf)
     v4sf __builtin_ia32_movlhps (v4sf, v4sf)
     v4sf __builtin_ia32_unpckhps (v4sf, v4sf)
     v4sf __builtin_ia32_unpcklps (v4sf, v4sf)
     v4sf __builtin_ia32_cvtpi2ps (v4sf, v2si)
     v4sf __builtin_ia32_cvtsi2ss (v4sf, int)
     v2si __builtin_ia32_cvtps2pi (v4sf)
     int __builtin_ia32_cvtss2si (v4sf)
     v2si __builtin_ia32_cvttps2pi (v4sf)
     int __builtin_ia32_cvttss2si (v4sf)
     v4sf __builtin_ia32_rcpps (v4sf)
     v4sf __builtin_ia32_rsqrtps (v4sf)
     v4sf __builtin_ia32_sqrtps (v4sf)
     v4sf __builtin_ia32_rcpss (v4sf)
     v4sf __builtin_ia32_rsqrtss (v4sf)
     v4sf __builtin_ia32_sqrtss (v4sf)
     v4sf __builtin_ia32_shufps (v4sf, v4sf, int)
     void __builtin_ia32_movntps (float *, v4sf)
     int __builtin_ia32_movmskps (v4sf)
The following built-in functions are available when -msse is used.
v4sf __builtin_ia32_loadaps (float *)
Generates the movaps machine instruction as a load from memory. 
void __builtin_ia32_storeaps (float *, v4sf)
Generates the movaps machine instruction as a store to memory. 
v4sf __builtin_ia32_loadups (float *)
Generates the movups machine instruction as a load from memory. 
void __builtin_ia32_storeups (float *, v4sf)
Generates the movups machine instruction as a store to memory. 
v4sf __builtin_ia32_loadss (float *)
Generates the movss machine instruction as a load from memory. 
void __builtin_ia32_storess (float *, v4sf)
Generates the movss machine instruction as a store to memory. 
v4sf __builtin_ia32_loadhps (v4sf, v2si *)
Generates the movhps machine instruction as a load from memory. 
v4sf __builtin_ia32_loadlps (v4sf, v2si *)
Generates the movlps machine instruction as a load from memory 
void __builtin_ia32_storehps (v4sf, v2si *)
Generates the movhps machine instruction as a store to memory. 
void __builtin_ia32_storelps (v4sf, v2si *)
Generates the movlps machine instruction as a store to memory.
The following built-in functions are available when -msse2 is used. All of them generate the machine instruction that is part of the name.
     int __builtin_ia32_comisdeq (v2df, v2df)
     int __builtin_ia32_comisdlt (v2df, v2df)
     int __builtin_ia32_comisdle (v2df, v2df)
     int __builtin_ia32_comisdgt (v2df, v2df)
     int __builtin_ia32_comisdge (v2df, v2df)
     int __builtin_ia32_comisdneq (v2df, v2df)
     int __builtin_ia32_ucomisdeq (v2df, v2df)
     int __builtin_ia32_ucomisdlt (v2df, v2df)
     int __builtin_ia32_ucomisdle (v2df, v2df)
     int __builtin_ia32_ucomisdgt (v2df, v2df)
     int __builtin_ia32_ucomisdge (v2df, v2df)
     int __builtin_ia32_ucomisdneq (v2df, v2df)
     v2df __builtin_ia32_cmpeqpd (v2df, v2df)
     v2df __builtin_ia32_cmpltpd (v2df, v2df)
     v2df __builtin_ia32_cmplepd (v2df, v2df)
     v2df __builtin_ia32_cmpgtpd (v2df, v2df)
     v2df __builtin_ia32_cmpgepd (v2df, v2df)
     v2df __builtin_ia32_cmpunordpd (v2df, v2df)
     v2df __builtin_ia32_cmpneqpd (v2df, v2df)
     v2df __builtin_ia32_cmpnltpd (v2df, v2df)
     v2df __builtin_ia32_cmpnlepd (v2df, v2df)
     v2df __builtin_ia32_cmpngtpd (v2df, v2df)
     v2df __builtin_ia32_cmpngepd (v2df, v2df)
     v2df __builtin_ia32_cmpordpd (v2df, v2df)
     v2df __builtin_ia32_cmpeqsd (v2df, v2df)
     v2df __builtin_ia32_cmpltsd (v2df, v2df)
     v2df __builtin_ia32_cmplesd (v2df, v2df)
     v2df __builtin_ia32_cmpunordsd (v2df, v2df)
     v2df __builtin_ia32_cmpneqsd (v2df, v2df)
     v2df __builtin_ia32_cmpnltsd (v2df, v2df)
     v2df __builtin_ia32_cmpnlesd (v2df, v2df)
     v2df __builtin_ia32_cmpordsd (v2df, v2df)
     v2di __builtin_ia32_paddq (v2di, v2di)
     v2di __builtin_ia32_psubq (v2di, v2di)
     v2df __builtin_ia32_addpd (v2df, v2df)
     v2df __builtin_ia32_subpd (v2df, v2df)
     v2df __builtin_ia32_mulpd (v2df, v2df)
     v2df __builtin_ia32_divpd (v2df, v2df)
     v2df __builtin_ia32_addsd (v2df, v2df)
     v2df __builtin_ia32_subsd (v2df, v2df)
     v2df __builtin_ia32_mulsd (v2df, v2df)
     v2df __builtin_ia32_divsd (v2df, v2df)
     v2df __builtin_ia32_minpd (v2df, v2df)
     v2df __builtin_ia32_maxpd (v2df, v2df)
     v2df __builtin_ia32_minsd (v2df, v2df)
     v2df __builtin_ia32_maxsd (v2df, v2df)
     v2df __builtin_ia32_andpd (v2df, v2df)
     v2df __builtin_ia32_andnpd (v2df, v2df)
     v2df __builtin_ia32_orpd (v2df, v2df)
     v2df __builtin_ia32_xorpd (v2df, v2df)
     v2df __builtin_ia32_movsd (v2df, v2df)
     v2df __builtin_ia32_unpckhpd (v2df, v2df)
     v2df __builtin_ia32_unpcklpd (v2df, v2df)
     v16qi __builtin_ia32_paddb128 (v16qi, v16qi)
     v8hi __builtin_ia32_paddw128 (v8hi, v8hi)
     v4si __builtin_ia32_paddd128 (v4si, v4si)
     v2di __builtin_ia32_paddq128 (v2di, v2di)
     v16qi __builtin_ia32_psubb128 (v16qi, v16qi)
     v8hi __builtin_ia32_psubw128 (v8hi, v8hi)
     v4si __builtin_ia32_psubd128 (v4si, v4si)
     v2di __builtin_ia32_psubq128 (v2di, v2di)
     v8hi __builtin_ia32_pmullw128 (v8hi, v8hi)
     v8hi __builtin_ia32_pmulhw128 (v8hi, v8hi)
     v2di __builtin_ia32_pand128 (v2di, v2di)
     v2di __builtin_ia32_pandn128 (v2di, v2di)
     v2di __builtin_ia32_por128 (v2di, v2di)
     v2di __builtin_ia32_pxor128 (v2di, v2di)
     v16qi __builtin_ia32_pavgb128 (v16qi, v16qi)
     v8hi __builtin_ia32_pavgw128 (v8hi, v8hi)
     v16qi __builtin_ia32_pcmpeqb128 (v16qi, v16qi)
     v8hi __builtin_ia32_pcmpeqw128 (v8hi, v8hi)
     v4si __builtin_ia32_pcmpeqd128 (v4si, v4si)
     v16qi __builtin_ia32_pcmpgtb128 (v16qi, v16qi)
     v8hi __builtin_ia32_pcmpgtw128 (v8hi, v8hi)
     v4si __builtin_ia32_pcmpgtd128 (v4si, v4si)
     v16qi __builtin_ia32_pmaxub128 (v16qi, v16qi)
     v8hi __builtin_ia32_pmaxsw128 (v8hi, v8hi)
     v16qi __builtin_ia32_pminub128 (v16qi, v16qi)
     v8hi __builtin_ia32_pminsw128 (v8hi, v8hi)
     v16qi __builtin_ia32_punpckhbw128 (v16qi, v16qi)
     v8hi __builtin_ia32_punpckhwd128 (v8hi, v8hi)
     v4si __builtin_ia32_punpckhdq128 (v4si, v4si)
     v2di __builtin_ia32_punpckhqdq128 (v2di, v2di)
     v16qi __builtin_ia32_punpcklbw128 (v16qi, v16qi)
     v8hi __builtin_ia32_punpcklwd128 (v8hi, v8hi)
     v4si __builtin_ia32_punpckldq128 (v4si, v4si)
     v2di __builtin_ia32_punpcklqdq128 (v2di, v2di)
     v16qi __builtin_ia32_packsswb128 (v16qi, v16qi)
     v8hi __builtin_ia32_packssdw128 (v8hi, v8hi)
     v16qi __builtin_ia32_packuswb128 (v16qi, v16qi)
     v8hi __builtin_ia32_pmulhuw128 (v8hi, v8hi)
     void __builtin_ia32_maskmovdqu (v16qi, v16qi)
     v2df __builtin_ia32_loadupd (double *)
     void __builtin_ia32_storeupd (double *, v2df)
     v2df __builtin_ia32_loadhpd (v2df, double *)
     v2df __builtin_ia32_loadlpd (v2df, double *)
     int __builtin_ia32_movmskpd (v2df)
     int __builtin_ia32_pmovmskb128 (v16qi)
     void __builtin_ia32_movnti (int *, int)
     void __builtin_ia32_movntpd (double *, v2df)
     void __builtin_ia32_movntdq (v2df *, v2df)
     v4si __builtin_ia32_pshufd (v4si, int)
     v8hi __builtin_ia32_pshuflw (v8hi, int)
     v8hi __builtin_ia32_pshufhw (v8hi, int)
     v2di __builtin_ia32_psadbw128 (v16qi, v16qi)
     v2df __builtin_ia32_sqrtpd (v2df)
     v2df __builtin_ia32_sqrtsd (v2df)
     v2df __builtin_ia32_shufpd (v2df, v2df, int)
     v2df __builtin_ia32_cvtdq2pd (v4si)
     v4sf __builtin_ia32_cvtdq2ps (v4si)
     v4si __builtin_ia32_cvtpd2dq (v2df)
     v2si __builtin_ia32_cvtpd2pi (v2df)
     v4sf __builtin_ia32_cvtpd2ps (v2df)
     v4si __builtin_ia32_cvttpd2dq (v2df)
     v2si __builtin_ia32_cvttpd2pi (v2df)
     v2df __builtin_ia32_cvtpi2pd (v2si)
     int __builtin_ia32_cvtsd2si (v2df)
     int __builtin_ia32_cvttsd2si (v2df)
     long long __builtin_ia32_cvtsd2si64 (v2df)
     long long __builtin_ia32_cvttsd2si64 (v2df)
     v4si __builtin_ia32_cvtps2dq (v4sf)
     v2df __builtin_ia32_cvtps2pd (v4sf)
     v4si __builtin_ia32_cvttps2dq (v4sf)
     v2df __builtin_ia32_cvtsi2sd (v2df, int)
     v2df __builtin_ia32_cvtsi642sd (v2df, long long)
     v4sf __builtin_ia32_cvtsd2ss (v4sf, v2df)
     v2df __builtin_ia32_cvtss2sd (v2df, v4sf)
     void __builtin_ia32_clflush (const void *)
     void __builtin_ia32_lfence (void)
     void __builtin_ia32_mfence (void)
     v16qi __builtin_ia32_loaddqu (const char *)
     void __builtin_ia32_storedqu (char *, v16qi)
     unsigned long long __builtin_ia32_pmuludq (v2si, v2si)
     v2di __builtin_ia32_pmuludq128 (v4si, v4si)
     v8hi __builtin_ia32_psllw128 (v8hi, v2di)
     v4si __builtin_ia32_pslld128 (v4si, v2di)
     v2di __builtin_ia32_psllq128 (v4si, v2di)
     v8hi __builtin_ia32_psrlw128 (v8hi, v2di)
     v4si __builtin_ia32_psrld128 (v4si, v2di)
     v2di __builtin_ia32_psrlq128 (v2di, v2di)
     v8hi __builtin_ia32_psraw128 (v8hi, v2di)
     v4si __builtin_ia32_psrad128 (v4si, v2di)
     v2di __builtin_ia32_pslldqi128 (v2di, int)
     v8hi __builtin_ia32_psllwi128 (v8hi, int)
     v4si __builtin_ia32_pslldi128 (v4si, int)
     v2di __builtin_ia32_psllqi128 (v2di, int)
     v2di __builtin_ia32_psrldqi128 (v2di, int)
     v8hi __builtin_ia32_psrlwi128 (v8hi, int)
     v4si __builtin_ia32_psrldi128 (v4si, int)
     v2di __builtin_ia32_psrlqi128 (v2di, int)
     v8hi __builtin_ia32_psrawi128 (v8hi, int)
     v4si __builtin_ia32_psradi128 (v4si, int)
     v4si __builtin_ia32_pmaddwd128 (v8hi, v8hi)
The following built-in functions are available when -msse3 is used. All of them generate the machine instruction that is part of the name.
     v2df __builtin_ia32_addsubpd (v2df, v2df)
     v4sf __builtin_ia32_addsubps (v4sf, v4sf)
     v2df __builtin_ia32_haddpd (v2df, v2df)
     v4sf __builtin_ia32_haddps (v4sf, v4sf)
     v2df __builtin_ia32_hsubpd (v2df, v2df)
     v4sf __builtin_ia32_hsubps (v4sf, v4sf)
     v16qi __builtin_ia32_lddqu (char const *)
     void __builtin_ia32_monitor (void *, unsigned int, unsigned int)
     v2df __builtin_ia32_movddup (v2df)
     v4sf __builtin_ia32_movshdup (v4sf)
     v4sf __builtin_ia32_movsldup (v4sf)
     void __builtin_ia32_mwait (unsigned int, unsigned int)
The following built-in functions are available when -msse3 is used.
v2df __builtin_ia32_loadddup (double const *)
Generates the movddup machine instruction as a load from memory.
The following built-in functions are available when -m3dnow is used. All of them generate the machine instruction that is part of the name.
     void __builtin_ia32_femms (void)
     v8qi __builtin_ia32_pavgusb (v8qi, v8qi)
     v2si __builtin_ia32_pf2id (v2sf)
     v2sf __builtin_ia32_pfacc (v2sf, v2sf)
     v2sf __builtin_ia32_pfadd (v2sf, v2sf)
     v2si __builtin_ia32_pfcmpeq (v2sf, v2sf)
     v2si __builtin_ia32_pfcmpge (v2sf, v2sf)
     v2si __builtin_ia32_pfcmpgt (v2sf, v2sf)
     v2sf __builtin_ia32_pfmax (v2sf, v2sf)
     v2sf __builtin_ia32_pfmin (v2sf, v2sf)
     v2sf __builtin_ia32_pfmul (v2sf, v2sf)
     v2sf __builtin_ia32_pfrcp (v2sf)
     v2sf __builtin_ia32_pfrcpit1 (v2sf, v2sf)
     v2sf __builtin_ia32_pfrcpit2 (v2sf, v2sf)
     v2sf __builtin_ia32_pfrsqrt (v2sf)
     v2sf __builtin_ia32_pfrsqrtit1 (v2sf, v2sf)
     v2sf __builtin_ia32_pfsub (v2sf, v2sf)
     v2sf __builtin_ia32_pfsubr (v2sf, v2sf)
     v2sf __builtin_ia32_pi2fd (v2si)
     v4hi __builtin_ia32_pmulhrw (v4hi, v4hi)
The following built-in functions are available when both -m3dnow and -march=athlon are used. All of them generate the machine instruction that is part of the name.
     v2si __builtin_ia32_pf2iw (v2sf)
     v2sf __builtin_ia32_pfnacc (v2sf, v2sf)
     v2sf __builtin_ia32_pfpnacc (v2sf, v2sf)
     v2sf __builtin_ia32_pi2fw (v2si)
     v2sf __builtin_ia32_pswapdsf (v2sf)
     v2si __builtin_ia32_pswapdsi (v2si)
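
The vector types in these signatures (v4sf, v2df, and so on) are not predefined; they have to be declared with the GCC vector extension. A minimal sketch, assuming a GCC-compatible compiler: it declares v4sf and adds two vectors with the generic `+` operator, which on x86 with -msse corresponds to what `__builtin_ia32_addps (v4sf, v4sf)` does. `add4` and `add4_arrays` are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>

/* Same spelling as the v4sf type in the signatures above:
   four floats packed into 16 bytes (GCC/Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise add, written with the generic `+` operator. */
static v4sf add4(v4sf a, v4sf b)
{
    return a + b;
}

/* Add two float[4] arrays through the vector type. */
static void add4_arrays(const float *x, const float *y, float *out)
{
    v4sf a, b, r;

    assert(x != NULL && y != NULL && out != NULL);
    memcpy(&a, x, sizeof a);   /* memcpy avoids alignment assumptions */
    memcpy(&b, y, sizeof b);
    r = add4(a, b);
    memcpy(out, &r, sizeof r);
}
```

Using the generic operator keeps the example portable across targets, while the x86-specific builtins in the list only compile when the matching -m option is enabled.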

Sunday, June 3, 2012

build parsec for ARM

reference document: http://www.cs.utexas.edu/~parsec_m5/TR-09-32.pdf
cross-compilation environment:
1. HOSTTYPE=arm
2. PATH=/path/to/fake/uname/bin:$PATH
content of /path/to/fake/uname/bin/uname:
===============================
$ cat ~/research/benchmarks/parsec-2.1-arm/fake-uname/uname
#!/bin/sh

/bin/uname "$@" | sed 's/i686/armv7l/g'
===============================
3. cross compilation tools: arm-linux-gnueabi-*
4. host machine is i686

Steps:
1. compile tools natively
$ parsecmgmt -a build -p tools
Note: for now, use native i686 compilation flags in gcc.bldconf

2. compile apps to ARM binary:
1. set BINARY_PREFIX options in gcc.bldconf
2.1 blackscholes : OK
2.2 bodytrack:
  2.2.1 In pkgs/apps/bodytrack/src/config.h.in, comment out #undef malloc
before change:
/* Define to rpl_malloc if the replacement function should be used. */
#undef malloc
after change:
/* Define to rpl_malloc if the replacement function should be used. */
//#undef malloc

  2.2.2 In  pkgs/apps/bodytrack/parsec/gcc-pthread.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb"

after:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb --build=i686-linux-gnu --host=arm-linux-gnueabi"

2.3: facesim: OK
2.4: ferret: depends on gsl and imagick, so build them first, see 2.5, and 2.6. OK
2.5: gsl: 
  2.5.1 In  pkgs/libs/gsl/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared"

after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --build=i686-linux-gnu --host=arm-linux-gnueabi"

2.6: imagick: In  pkgs/libs/imagick/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml"

after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml  --build=i686-linux-gnu --host=arm-linux-gnueabi"

2.7: freqmine: OK
2.8: raytrace: SKIP. In order to compile raytrace, libX11 must be cross-compiled which requires cross-compiling the following libraries:
libX11
libXmu
libXext
libxcb
xproto
xextproto
xtrans
libpthread_stubs
libXau
kbproto
inputproto
jpeg

2.9: swaptions: OK
2.10: fluidanimate: OK
2.11: vips: depends on glib and libxml2. libxml2 and vips only need to add --build and --host.
  2.11.1: remove -L${CC_HOME}/lib in config/gcc.bldconf
before:

export LDFLAGS="$STATIC -pthread -L${CC_HOME}/lib"


after:

export LDFLAGS="$STATIC -pthread"


2.12: glib: add --host and --build in pkgs/libs/glib/parsec/gcc.bldconf.
  2.12.1
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix"

after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix --build=i686-linux-gnu --host=arm-linux-gnueabi"
  2.12.2 In pkgs/libs/glib/src/configure, add the following lines at line 43:

ac_cv_func_posix_getpwuid_r=no
glib_cv_stack_grows=no
glib_cv_uscore=no

2.13: dedup: OK.  depends on ssl, see 2.14
2.14: ssl: OK.
  2.14.1 change gcc to arm-linux-gnueabi-gcc in pkgs/libs/ssl/src/Configure.pl line 323
before:

"linux-generic32","gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",

after:
"linux-generic32","arm-linux-gnueabi-gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",


  2.14.2 comment out line 975
before:
$cflags .= " -m32 ";

after:
#$cflags .= " -m32 ";

2.15: streamcluster: OK.
2.16: canneal: OK. need pkgs/kernels/canneal/src/atomic/arm/atomic.h.
  2.16.1: pkgs/kernels/canneal/src/atomic/atomic.h, add following lines:
before:

#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
#  include "alpha/atomic.h"
#else
#  error Architecture not supported by atomic.h
#endif

after:
#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
#  include "alpha/atomic.h"
#elif defined(__arm__) || defined(__arm) || defined(arm) || defined(__ARM__)
#  include "arm/atomic.h"
#else
#  error Architecture not supported by atomic.h
#endif
  2.16.2: put the ARM atomic.h in pkgs/kernels/canneal/src/atomic/arm/atomic.h
  2.16.3: add the following lines at line 49
before:
#ifndef _KERNEL
#include
#endif

#ifndef I32_bit
after:
#ifndef _KERNEL
#include
#endif

#define ARM_VECTORS_HIGH       0xffff0000U
#define ARM_TP_ADDRESS         (ARM_VECTORS_HIGH + 0x1000)
#define ARM_RAS_START          (ARM_TP_ADDRESS + 4)
#define ARM_RAS_END            (ARM_TP_ADDRESS + 8)

#ifndef I32_bit
  2.16.4: add the following line at line 353
before:
#define atomic_store_rel_ptr            atomic_store_ptr

after:
#define atomic_store_rel_ptr            atomic_store_ptr
#define atomic_load_acq_ptr             atomic_load_acq_long

Conclusion:
12 of 13 benchmarks built successfully.
Failed applications:
1. raytrace: depends on several libX libraries, which need to be cross-compiled for ARM.
Native run: ferret failed
canneal: segfault
dedup: malloc failure