- Implement Opt2 for qemu_st
- go home!
Thursday, June 28, 2012
Work Log
- About zero-length arrays: http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
- https://wiki.linaro.org/PeterMaydell/QemuVersatileExpress Linaro QEMU V.Express support
- (IMG=vexpress.img ; if [ -e "$IMG" ] ; then sudo mount -o loop,offset="$(file "$IMG" | awk 'BEGIN { RS=";"; } /partition 2/ { print $7*512; }')" -t auto "$IMG" /mnt/mnt; else echo "$IMG not found"; fi )
- Linaro Android QEMU V.Express: https://wiki.linaro.org/KenWerner/Sandbox/AndroidQEMU
- vmlinuz and initrd.gz are inside uImage and uInitrd; strip the 64-byte U-Boot header to extract them: dd if=uImage of=vmlinuz bs=1 skip=64 (same for uInitrd)
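The 64-byte skip can be sanity-checked without a real uImage (64 bytes is the U-Boot legacy header size; file names below are made up):

```shell
# Build a synthetic uImage: a 64-byte fake header followed by a payload.
tmp=$(mktemp -d)
head -c 64 /dev/zero > "$tmp/header"
printf 'KERNELDATA' > "$tmp/payload"
cat "$tmp/header" "$tmp/payload" > "$tmp/uImage"

# bs=1 makes skip count single bytes, so exactly 64 bytes are dropped.
dd if="$tmp/uImage" of="$tmp/vmlinuz" bs=1 skip=64 2>/dev/null

extracted=$(cat "$tmp/vmlinuz")
echo "$extracted"   # KERNELDATA
```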
- Use reboot to shut down Android
- ARM-VExpress image: http://releases.linaro.org/12.05/ubuntu/vexpress/
Wednesday, June 27, 2012
Think Flow
- In tcg_liveness_analysis, if the opcode is qemu_ld or qemu_st, mark all globals live.
- qemu_ld is OK now.
- restore to morning status
- QEMU 1.0.1 exits when the guest powers off
- chaos
- qemu_st fails: no output, trapped in some kind of loop?
- keep forgetting what I was going to do after viewing some web pages....
Friday, June 22, 2012
Think flow
- I modified the conditional branch of load_tlb from JNE to JE, plus the related code.
- Running the original QEMU still boots ARM Linux, so far so good.
- I moved SAVE_DIRTY_STATES to a new location; call that build QEMU_TK.
- QEMU_TK dies after the first page fault happens.
- Question: what's the difference between QEMU and QEMU_TK?
- the difference is in the machine state
- OK, we need to restore the states back
Thursday, June 21, 2012
Thinking Flow
study qemu code: tcg_reg_alloc_op(): 1708
- what is fixed_reg in TCGTemp?
- in tcg_global_reg_new_internal, fixed_reg is set to 1
- in tcg_global_mem_new_internal, fixed_reg is set to 0
- so it seems to indicate whether this temp lives in a fixed host register or not
- In TCGContext, what is reg_to_temp?
- in tcg_reg_alloc(), s->reg_to_temp[reg] decides whether the HOST register is mapped to any TCGTemp.
- So I think reg_to_temp[reg] tells which TCGTemp the HOST register reg currently holds.
- What is val_type in TCGTemp?
- NOT CLEAR
- It seems to indicate where the temp's current value lives.
- It is possible that ts->fixed_reg && ts->val_type == TEMP_VAL_MEM, or !ts->fixed_reg && ts->val_type == TEMP_VAL_REG.
- When is TCGArgDef args_ct set?
TCG: tcg_gen_code_common
In TCGContext:
/* liveness analysis */
uint16_t *op_dead_iargs; /* for each operation, each bit tells if the corresponding input argument is dead */
what is tcg_op_defs?
In: tcg_liveness_analysis, tcg/tcg.c: 1187
backward scan
NOTE: tcg_opc.h: definition of TCG opcodes (a.k.a TCG IR)
So, remove TCG_OPF_CALL_CLOBBER from qemu_ld/st here
In tcg_liveness_analysis:
1292 } else if (def->flags & TCG_OPF_CALL_CLOBBER) {
1293 /* globals are live */
1294 memset(dead_temps, 0, s->nb_globals);
1295 }
Question: if we remove TCG_OPF_CALL_CLOBBER of qemu_ld/st, will this be a problem?
In: tcg_reg_alloc_op:
1708 if (def->flags & TCG_OPF_CALL_CLOBBER) {
1709 /* XXX: permit generic clobber register list ? */
1710 for(reg = 0; reg < TCG_TARGET_NB_REGS; reg++) {
1711 if (tcg_regset_test_reg(tcg_target_call_clobber_regs, reg)) {
1712 tcg_reg_free(s, reg);
1713 }
1714 }
1715 /* XXX: for load/store we could do that only for the slow path
1716 (i.e. when a memory callback is called) */
1717
1718 /* store globals and free associated registers (we assume the insn
1719 can modify any global. */
1720 save_globals(s, allocated_regs);
1721 }
Question: what does Marsellus Wallace look like? or
What does tcg_reg_free do?
It loops over tcg_target_call_clobber_regs and,
if the temp currently held in reg is not mem_coherent, stores it back to env->temp_buf.
Question: what does save_globals do?
- What does ``globals'' mean?
- In tcg/README, A TCG "global" is a variable which is live in all the functions (equivalent of a C global variable). They are defined before the functions defined. A TCG global can be a memory location (e.g. a QEMU CPU register), a fixed host register (e.g. the QEMU CPU state pointer) or a memory location which is stored in a register outside QEMU TBs (not implemented yet).
- call temp_save to save temp
- In temp_save(), save temp to env->temp_buf
==================================================================
tcg_out_op() is called to generate code for the TCG opcode.
We are interested in tcg_out_qemu_ld/st
QUESTION:
Strangely enough, I cannot find the lines that save guest register states back to their canonical locations.
I only saw the save back to temp_buf at 1708.
That is exactly the place.
==================================================================
Remove TCG_OPF_CALL_CLOBBER from qemu_ld
move save_dirty_state to the TLB-miss path
the program fails when the first PAGE FAULT occurs.
should compare REG contents between my version and the original version
==================================================================
Wednesday, June 20, 2012
Tuesday, June 19, 2012
how to change the runlevel through an appended kernel parameter:
JUST ADD THE RUNLEVEL NUMBER
EXAMPLE:
"root=/dev/sdb1 console=/dev/ttyAMA0 2 "
HOW to mount a qcow image used by QEMU:
http://blog.loftninjas.org/2008/10/27/mounting-kvm-qcow2-qemu-disk-images/
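A qcow2 image cannot be loop-mounted directly; the linked post goes through qemu-nbd instead. A sketch of that route (device name, image name and mount point are assumptions, and the nbd steps need root, so they are left as comments), plus a runnable check of the partition-offset parsing used for raw images on 6/28, against a canned sample of `file` output:

```shell
# qcow2 route (requires root and qemu-utils; shown as comments only):
#   sudo modprobe nbd max_part=8
#   sudo qemu-nbd --connect=/dev/nbd0 image.qcow2
#   sudo mount /dev/nbd0p2 /mnt
#   sudo umount /mnt && sudo qemu-nbd --disconnect /dev/nbd0

# Raw-image route: byte offset of partition 2 from `file` output.
# In this sample record, field 7 is the start sector.
fileout='img: DOS/MBR boot sector; partition 1: ID=0xc, starthead 0, startsector 2048, 104448 sectors; partition 2: ID=0x83, starthead 0, startsector 106496, 3039232 sectors'
offset=$(echo "$fileout" | awk 'BEGIN { RS=";" } /partition 2/ { print $7*512 }')
echo "$offset"   # 106496 * 512 = 54525952
```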
Friday, June 8, 2012
Wednesday, June 6, 2012
i7 currently RUNNING experiments:
TRACE_MERGE
TRACE
TRACE_NET_ORIG
Each configuration runs 4 benchmark sets: CINT-ARM, CINT-IA32, CFP-IA32, CFP_VECTOR-IA32
Each benchmark runs 5 times.
There are about 120 benchmark runs to do (3 configurations, 4 benchmark sets, 5 repetitions each).
estimated time: 120 * 15000 sec ≈ 20 days
6/26 will finish all runs!
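The ~20-day figure checks out; converting seconds to days with an awk one-liner:

```shell
# 120 runs at ~15000 s each; 86400 s per day.
days=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 120 * 15000 / 86400 }')
echo "$days"   # 20.8
```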
Producing Wrong Data Without Doing Anything Obviously Wrong
SPECvirt_sc2010: SPEC's first benchmark addressing performance evaluation of datacenter servers used in virtualized server consolidation.
Tuesday, June 5, 2012
statically build OpenMP program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39176#c7
we have to link pthread ourselves.
Add
-Wl,--whole-archive -lpthread -Wl,--no-whole-archive
gcc sse builtin functions
http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/X86-Built_002din-Functions.html
v8qi __builtin_ia32_paddb (v8qi, v8qi) v4hi __builtin_ia32_paddw (v4hi, v4hi) v2si __builtin_ia32_paddd (v2si, v2si) v8qi __builtin_ia32_psubb (v8qi, v8qi) v4hi __builtin_ia32_psubw (v4hi, v4hi) v2si __builtin_ia32_psubd (v2si, v2si) v8qi __builtin_ia32_paddsb (v8qi, v8qi) v4hi __builtin_ia32_paddsw (v4hi, v4hi) v8qi __builtin_ia32_psubsb (v8qi, v8qi) v4hi __builtin_ia32_psubsw (v4hi, v4hi) v8qi __builtin_ia32_paddusb (v8qi, v8qi) v4hi __builtin_ia32_paddusw (v4hi, v4hi) v8qi __builtin_ia32_psubusb (v8qi, v8qi) v4hi __builtin_ia32_psubusw (v4hi, v4hi) v4hi __builtin_ia32_pmullw (v4hi, v4hi) v4hi __builtin_ia32_pmulhw (v4hi, v4hi) di __builtin_ia32_pand (di, di) di __builtin_ia32_pandn (di,di) di __builtin_ia32_por (di, di) di __builtin_ia32_pxor (di, di) v8qi __builtin_ia32_pcmpeqb (v8qi, v8qi) v4hi __builtin_ia32_pcmpeqw (v4hi, v4hi) v2si __builtin_ia32_pcmpeqd (v2si, v2si) v8qi __builtin_ia32_pcmpgtb (v8qi, v8qi) v4hi __builtin_ia32_pcmpgtw (v4hi, v4hi) v2si __builtin_ia32_pcmpgtd (v2si, v2si) v8qi __builtin_ia32_punpckhbw (v8qi, v8qi) v4hi __builtin_ia32_punpckhwd (v4hi, v4hi) v2si __builtin_ia32_punpckhdq (v2si, v2si) v8qi __builtin_ia32_punpcklbw (v8qi, v8qi) v4hi __builtin_ia32_punpcklwd (v4hi, v4hi) v2si __builtin_ia32_punpckldq (v2si, v2si) v8qi __builtin_ia32_packsswb (v4hi, v4hi) v4hi __builtin_ia32_packssdw (v2si, v2si) v8qi __builtin_ia32_packuswb (v4hi, v4hi)The following built-in functions are made available either with -msse, or with a combination of -m3dnow and -march=athlon. All of them generate the machine instruction that is part of the name.
v4hi __builtin_ia32_pmulhuw (v4hi, v4hi) v8qi __builtin_ia32_pavgb (v8qi, v8qi) v4hi __builtin_ia32_pavgw (v4hi, v4hi) v4hi __builtin_ia32_psadbw (v8qi, v8qi) v8qi __builtin_ia32_pmaxub (v8qi, v8qi) v4hi __builtin_ia32_pmaxsw (v4hi, v4hi) v8qi __builtin_ia32_pminub (v8qi, v8qi) v4hi __builtin_ia32_pminsw (v4hi, v4hi) int __builtin_ia32_pextrw (v4hi, int) v4hi __builtin_ia32_pinsrw (v4hi, int, int) int __builtin_ia32_pmovmskb (v8qi) void __builtin_ia32_maskmovq (v8qi, v8qi, char *) void __builtin_ia32_movntq (di *, di) void __builtin_ia32_sfence (void)The following built-in functions are available when -msse is used. All of them generate the machine instruction that is part of the name.
int __builtin_ia32_comieq (v4sf, v4sf) int __builtin_ia32_comineq (v4sf, v4sf) int __builtin_ia32_comilt (v4sf, v4sf) int __builtin_ia32_comile (v4sf, v4sf) int __builtin_ia32_comigt (v4sf, v4sf) int __builtin_ia32_comige (v4sf, v4sf) int __builtin_ia32_ucomieq (v4sf, v4sf) int __builtin_ia32_ucomineq (v4sf, v4sf) int __builtin_ia32_ucomilt (v4sf, v4sf) int __builtin_ia32_ucomile (v4sf, v4sf) int __builtin_ia32_ucomigt (v4sf, v4sf) int __builtin_ia32_ucomige (v4sf, v4sf) v4sf __builtin_ia32_addps (v4sf, v4sf) v4sf __builtin_ia32_subps (v4sf, v4sf) v4sf __builtin_ia32_mulps (v4sf, v4sf) v4sf __builtin_ia32_divps (v4sf, v4sf) v4sf __builtin_ia32_addss (v4sf, v4sf) v4sf __builtin_ia32_subss (v4sf, v4sf) v4sf __builtin_ia32_mulss (v4sf, v4sf) v4sf __builtin_ia32_divss (v4sf, v4sf) v4si __builtin_ia32_cmpeqps (v4sf, v4sf) v4si __builtin_ia32_cmpltps (v4sf, v4sf) v4si __builtin_ia32_cmpleps (v4sf, v4sf) v4si __builtin_ia32_cmpgtps (v4sf, v4sf) v4si __builtin_ia32_cmpgeps (v4sf, v4sf) v4si __builtin_ia32_cmpunordps (v4sf, v4sf) v4si __builtin_ia32_cmpneqps (v4sf, v4sf) v4si __builtin_ia32_cmpnltps (v4sf, v4sf) v4si __builtin_ia32_cmpnleps (v4sf, v4sf) v4si __builtin_ia32_cmpngtps (v4sf, v4sf) v4si __builtin_ia32_cmpngeps (v4sf, v4sf) v4si __builtin_ia32_cmpordps (v4sf, v4sf) v4si __builtin_ia32_cmpeqss (v4sf, v4sf) v4si __builtin_ia32_cmpltss (v4sf, v4sf) v4si __builtin_ia32_cmpless (v4sf, v4sf) v4si __builtin_ia32_cmpunordss (v4sf, v4sf) v4si __builtin_ia32_cmpneqss (v4sf, v4sf) v4si __builtin_ia32_cmpnlts (v4sf, v4sf) v4si __builtin_ia32_cmpnless (v4sf, v4sf) v4si __builtin_ia32_cmpordss (v4sf, v4sf) v4sf __builtin_ia32_maxps (v4sf, v4sf) v4sf __builtin_ia32_maxss (v4sf, v4sf) v4sf __builtin_ia32_minps (v4sf, v4sf) v4sf __builtin_ia32_minss (v4sf, v4sf) v4sf __builtin_ia32_andps (v4sf, v4sf) v4sf __builtin_ia32_andnps (v4sf, v4sf) v4sf __builtin_ia32_orps (v4sf, v4sf) v4sf __builtin_ia32_xorps (v4sf, v4sf) v4sf __builtin_ia32_movss (v4sf, v4sf) v4sf 
__builtin_ia32_movhlps (v4sf, v4sf) v4sf __builtin_ia32_movlhps (v4sf, v4sf) v4sf __builtin_ia32_unpckhps (v4sf, v4sf) v4sf __builtin_ia32_unpcklps (v4sf, v4sf) v4sf __builtin_ia32_cvtpi2ps (v4sf, v2si) v4sf __builtin_ia32_cvtsi2ss (v4sf, int) v2si __builtin_ia32_cvtps2pi (v4sf) int __builtin_ia32_cvtss2si (v4sf) v2si __builtin_ia32_cvttps2pi (v4sf) int __builtin_ia32_cvttss2si (v4sf) v4sf __builtin_ia32_rcpps (v4sf) v4sf __builtin_ia32_rsqrtps (v4sf) v4sf __builtin_ia32_sqrtps (v4sf) v4sf __builtin_ia32_rcpss (v4sf) v4sf __builtin_ia32_rsqrtss (v4sf) v4sf __builtin_ia32_sqrtss (v4sf) v4sf __builtin_ia32_shufps (v4sf, v4sf, int) void __builtin_ia32_movntps (float *, v4sf) int __builtin_ia32_movmskps (v4sf)The following built-in functions are available when -msse is used.
v4sf __builtin_ia32_loadaps (float *) - generates the movaps machine instruction as a load from memory.
void __builtin_ia32_storeaps (float *, v4sf) - generates the movaps machine instruction as a store to memory.
v4sf __builtin_ia32_loadups (float *) - generates the movups machine instruction as a load from memory.
void __builtin_ia32_storeups (float *, v4sf) - generates the movups machine instruction as a store to memory.
v4sf __builtin_ia32_loadsss (float *) - generates the movss machine instruction as a load from memory.
void __builtin_ia32_storess (float *, v4sf) - generates the movss machine instruction as a store to memory.
v4sf __builtin_ia32_loadhps (v4sf, v2si *) - generates the movhps machine instruction as a load from memory.
v4sf __builtin_ia32_loadlps (v4sf, v2si *) - generates the movlps machine instruction as a load from memory.
void __builtin_ia32_storehps (v4sf, v2si *) - generates the movhps machine instruction as a store to memory.
void __builtin_ia32_storelps (v4sf, v2si *) - generates the movlps machine instruction as a store to memory.
int __builtin_ia32_comisdeq (v2df, v2df) int __builtin_ia32_comisdlt (v2df, v2df) int __builtin_ia32_comisdle (v2df, v2df) int __builtin_ia32_comisdgt (v2df, v2df) int __builtin_ia32_comisdge (v2df, v2df) int __builtin_ia32_comisdneq (v2df, v2df) int __builtin_ia32_ucomisdeq (v2df, v2df) int __builtin_ia32_ucomisdlt (v2df, v2df) int __builtin_ia32_ucomisdle (v2df, v2df) int __builtin_ia32_ucomisdgt (v2df, v2df) int __builtin_ia32_ucomisdge (v2df, v2df) int __builtin_ia32_ucomisdneq (v2df, v2df) v2df __builtin_ia32_cmpeqpd (v2df, v2df) v2df __builtin_ia32_cmpltpd (v2df, v2df) v2df __builtin_ia32_cmplepd (v2df, v2df) v2df __builtin_ia32_cmpgtpd (v2df, v2df) v2df __builtin_ia32_cmpgepd (v2df, v2df) v2df __builtin_ia32_cmpunordpd (v2df, v2df) v2df __builtin_ia32_cmpneqpd (v2df, v2df) v2df __builtin_ia32_cmpnltpd (v2df, v2df) v2df __builtin_ia32_cmpnlepd (v2df, v2df) v2df __builtin_ia32_cmpngtpd (v2df, v2df) v2df __builtin_ia32_cmpngepd (v2df, v2df) v2df __builtin_ia32_cmpordpd (v2df, v2df) v2df __builtin_ia32_cmpeqsd (v2df, v2df) v2df __builtin_ia32_cmpltsd (v2df, v2df) v2df __builtin_ia32_cmplesd (v2df, v2df) v2df __builtin_ia32_cmpunordsd (v2df, v2df) v2df __builtin_ia32_cmpneqsd (v2df, v2df) v2df __builtin_ia32_cmpnltsd (v2df, v2df) v2df __builtin_ia32_cmpnlesd (v2df, v2df) v2df __builtin_ia32_cmpordsd (v2df, v2df) v2di __builtin_ia32_paddq (v2di, v2di) v2di __builtin_ia32_psubq (v2di, v2di) v2df __builtin_ia32_addpd (v2df, v2df) v2df __builtin_ia32_subpd (v2df, v2df) v2df __builtin_ia32_mulpd (v2df, v2df) v2df __builtin_ia32_divpd (v2df, v2df) v2df __builtin_ia32_addsd (v2df, v2df) v2df __builtin_ia32_subsd (v2df, v2df) v2df __builtin_ia32_mulsd (v2df, v2df) v2df __builtin_ia32_divsd (v2df, v2df) v2df __builtin_ia32_minpd (v2df, v2df) v2df __builtin_ia32_maxpd (v2df, v2df) v2df __builtin_ia32_minsd (v2df, v2df) v2df __builtin_ia32_maxsd (v2df, v2df) v2df __builtin_ia32_andpd (v2df, v2df) v2df __builtin_ia32_andnpd (v2df, v2df) v2df __builtin_ia32_orpd (v2df, v2df) 
v2df __builtin_ia32_xorpd (v2df, v2df) v2df __builtin_ia32_movsd (v2df, v2df) v2df __builtin_ia32_unpckhpd (v2df, v2df) v2df __builtin_ia32_unpcklpd (v2df, v2df) v16qi __builtin_ia32_paddb128 (v16qi, v16qi) v8hi __builtin_ia32_paddw128 (v8hi, v8hi) v4si __builtin_ia32_paddd128 (v4si, v4si) v2di __builtin_ia32_paddq128 (v2di, v2di) v16qi __builtin_ia32_psubb128 (v16qi, v16qi) v8hi __builtin_ia32_psubw128 (v8hi, v8hi) v4si __builtin_ia32_psubd128 (v4si, v4si) v2di __builtin_ia32_psubq128 (v2di, v2di) v8hi __builtin_ia32_pmullw128 (v8hi, v8hi) v8hi __builtin_ia32_pmulhw128 (v8hi, v8hi) v2di __builtin_ia32_pand128 (v2di, v2di) v2di __builtin_ia32_pandn128 (v2di, v2di) v2di __builtin_ia32_por128 (v2di, v2di) v2di __builtin_ia32_pxor128 (v2di, v2di) v16qi __builtin_ia32_pavgb128 (v16qi, v16qi) v8hi __builtin_ia32_pavgw128 (v8hi, v8hi) v16qi __builtin_ia32_pcmpeqb128 (v16qi, v16qi) v8hi __builtin_ia32_pcmpeqw128 (v8hi, v8hi) v4si __builtin_ia32_pcmpeqd128 (v4si, v4si) v16qi __builtin_ia32_pcmpgtb128 (v16qi, v16qi) v8hi __builtin_ia32_pcmpgtw128 (v8hi, v8hi) v4si __builtin_ia32_pcmpgtd128 (v4si, v4si) v16qi __builtin_ia32_pmaxub128 (v16qi, v16qi) v8hi __builtin_ia32_pmaxsw128 (v8hi, v8hi) v16qi __builtin_ia32_pminub128 (v16qi, v16qi) v8hi __builtin_ia32_pminsw128 (v8hi, v8hi) v16qi __builtin_ia32_punpckhbw128 (v16qi, v16qi) v8hi __builtin_ia32_punpckhwd128 (v8hi, v8hi) v4si __builtin_ia32_punpckhdq128 (v4si, v4si) v2di __builtin_ia32_punpckhqdq128 (v2di, v2di) v16qi __builtin_ia32_punpcklbw128 (v16qi, v16qi) v8hi __builtin_ia32_punpcklwd128 (v8hi, v8hi) v4si __builtin_ia32_punpckldq128 (v4si, v4si) v2di __builtin_ia32_punpcklqdq128 (v2di, v2di) v16qi __builtin_ia32_packsswb128 (v16qi, v16qi) v8hi __builtin_ia32_packssdw128 (v8hi, v8hi) v16qi __builtin_ia32_packuswb128 (v16qi, v16qi) v8hi __builtin_ia32_pmulhuw128 (v8hi, v8hi) void __builtin_ia32_maskmovdqu (v16qi, v16qi) v2df __builtin_ia32_loadupd (double *) void __builtin_ia32_storeupd (double *, v2df) v2df 
__builtin_ia32_loadhpd (v2df, double *) v2df __builtin_ia32_loadlpd (v2df, double *) int __builtin_ia32_movmskpd (v2df) int __builtin_ia32_pmovmskb128 (v16qi) void __builtin_ia32_movnti (int *, int) void __builtin_ia32_movntpd (double *, v2df) void __builtin_ia32_movntdq (v2df *, v2df) v4si __builtin_ia32_pshufd (v4si, int) v8hi __builtin_ia32_pshuflw (v8hi, int) v8hi __builtin_ia32_pshufhw (v8hi, int) v2di __builtin_ia32_psadbw128 (v16qi, v16qi) v2df __builtin_ia32_sqrtpd (v2df) v2df __builtin_ia32_sqrtsd (v2df) v2df __builtin_ia32_shufpd (v2df, v2df, int) v2df __builtin_ia32_cvtdq2pd (v4si) v4sf __builtin_ia32_cvtdq2ps (v4si) v4si __builtin_ia32_cvtpd2dq (v2df) v2si __builtin_ia32_cvtpd2pi (v2df) v4sf __builtin_ia32_cvtpd2ps (v2df) v4si __builtin_ia32_cvttpd2dq (v2df) v2si __builtin_ia32_cvttpd2pi (v2df) v2df __builtin_ia32_cvtpi2pd (v2si) int __builtin_ia32_cvtsd2si (v2df) int __builtin_ia32_cvttsd2si (v2df) long long __builtin_ia32_cvtsd2si64 (v2df) long long __builtin_ia32_cvttsd2si64 (v2df) v4si __builtin_ia32_cvtps2dq (v4sf) v2df __builtin_ia32_cvtps2pd (v4sf) v4si __builtin_ia32_cvttps2dq (v4sf) v2df __builtin_ia32_cvtsi2sd (v2df, int) v2df __builtin_ia32_cvtsi642sd (v2df, long long) v4sf __builtin_ia32_cvtsd2ss (v4sf, v2df) v2df __builtin_ia32_cvtss2sd (v2df, v4sf) void __builtin_ia32_clflush (const void *) void __builtin_ia32_lfence (void) void __builtin_ia32_mfence (void) v16qi __builtin_ia32_loaddqu (const char *) void __builtin_ia32_storedqu (char *, v16qi) unsigned long long __builtin_ia32_pmuludq (v2si, v2si) v2di __builtin_ia32_pmuludq128 (v4si, v4si) v8hi __builtin_ia32_psllw128 (v8hi, v2di) v4si __builtin_ia32_pslld128 (v4si, v2di) v2di __builtin_ia32_psllq128 (v4si, v2di) v8hi __builtin_ia32_psrlw128 (v8hi, v2di) v4si __builtin_ia32_psrld128 (v4si, v2di) v2di __builtin_ia32_psrlq128 (v2di, v2di) v8hi __builtin_ia32_psraw128 (v8hi, v2di) v4si __builtin_ia32_psrad128 (v4si, v2di) v2di __builtin_ia32_pslldqi128 (v2di, int) v8hi 
__builtin_ia32_psllwi128 (v8hi, int) v4si __builtin_ia32_pslldi128 (v4si, int) v2di __builtin_ia32_psllqi128 (v2di, int) v2di __builtin_ia32_psrldqi128 (v2di, int) v8hi __builtin_ia32_psrlwi128 (v8hi, int) v4si __builtin_ia32_psrldi128 (v4si, int) v2di __builtin_ia32_psrlqi128 (v2di, int) v8hi __builtin_ia32_psrawi128 (v8hi, int) v4si __builtin_ia32_psradi128 (v4si, int) v4si __builtin_ia32_pmaddwd128 (v8hi, v8hi)The following built-in functions are available when -msse3 is used. All of them generate the machine instruction that is part of the name.
v2df __builtin_ia32_addsubpd (v2df, v2df) v4sf __builtin_ia32_addsubps (v4sf, v4sf) v2df __builtin_ia32_haddpd (v2df, v2df) v4sf __builtin_ia32_haddps (v4sf, v4sf) v2df __builtin_ia32_hsubpd (v2df, v2df) v4sf __builtin_ia32_hsubps (v4sf, v4sf) v16qi __builtin_ia32_lddqu (char const *) void __builtin_ia32_monitor (void *, unsigned int, unsigned int) v2df __builtin_ia32_movddup (v2df) v4sf __builtin_ia32_movshdup (v4sf) v4sf __builtin_ia32_movsldup (v4sf) void __builtin_ia32_mwait (unsigned int, unsigned int)The following built-in functions are available when -msse3 is used.
v2df __builtin_ia32_loadddup (double const *) - generates the movddup machine instruction as a load from memory.
void __builtin_ia32_femms (void) v8qi __builtin_ia32_pavgusb (v8qi, v8qi) v2si __builtin_ia32_pf2id (v2sf) v2sf __builtin_ia32_pfacc (v2sf, v2sf) v2sf __builtin_ia32_pfadd (v2sf, v2sf) v2si __builtin_ia32_pfcmpeq (v2sf, v2sf) v2si __builtin_ia32_pfcmpge (v2sf, v2sf) v2si __builtin_ia32_pfcmpgt (v2sf, v2sf) v2sf __builtin_ia32_pfmax (v2sf, v2sf) v2sf __builtin_ia32_pfmin (v2sf, v2sf) v2sf __builtin_ia32_pfmul (v2sf, v2sf) v2sf __builtin_ia32_pfrcp (v2sf) v2sf __builtin_ia32_pfrcpit1 (v2sf, v2sf) v2sf __builtin_ia32_pfrcpit2 (v2sf, v2sf) v2sf __builtin_ia32_pfrsqrt (v2sf) v2sf __builtin_ia32_pfrsqrtit1 (v2sf, v2sf) v2sf __builtin_ia32_pfsub (v2sf, v2sf) v2sf __builtin_ia32_pfsubr (v2sf, v2sf) v2sf __builtin_ia32_pi2fd (v2si) v4hi __builtin_ia32_pmulhrw (v4hi, v4hi)The following built-in functions are available when both -m3dnow and -march=athlon are used. All of them generate the machine instruction that is part of the name.
v2si __builtin_ia32_pf2iw (v2sf) v2sf __builtin_ia32_pfnacc (v2sf, v2sf) v2sf __builtin_ia32_pfpnacc (v2sf, v2sf) v2sf __builtin_ia32_pi2fw (v2si) v2sf __builtin_ia32_pswapdsf (v2sf) v2si __builtin_ia32_pswapdsi (v2si)
Sunday, June 3, 2012
build parsec for ARM
reference document: http://www.cs.utexas.edu/~parsec_m5/TR-09-32.pdf
cross-compilation environment:
1. HOSTTYPE=arm
2. PATH=/path/to/fake/uname/bin:$PATH
content of /path/to/fake/uname/bin/uname:
===============================
$ cat ~/research/benchmarks/parsec-2.1-arm/fake-uname/uname
#!/bin/sh
/bin/uname $* | sed 's/i686/armv7l/g'
===============================
3. cross compilation tools: arm-linux-gnueabi-*
4. host machine is i686
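The fake-uname setup above, reproduced end to end in a scratch directory (paths are throwaway); anything that resolves uname through PATH now reports an ARM machine string:

```shell
dir=$(mktemp -d)
mkdir -p "$dir/bin"
cat > "$dir/bin/uname" <<'EOF'
#!/bin/sh
/bin/uname "$@" | sed 's/i686/armv7l/g'
EOF
chmod +x "$dir/bin/uname"
export HOSTTYPE=arm
export PATH="$dir/bin:$PATH"

# The substitution the wrapper performs, shown on a canned i686 string:
mapped=$(echo "Linux host 3.2.0 i686 GNU/Linux" | sed 's/i686/armv7l/g')
echo "$mapped"
```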
Steps:
1. compile tools natively
$ parsecmgmt -a build -p tools
Note: for now, use native i686 compilation flags in gcc.bldconf
2. compile apps to ARM binary:
1. set BINARY_PREFIX options in gcc.bldconf
2.1 blackscholes : OK
2.2 bodytrack:
2.2.1 In pkgs/apps/bodytrack/src/config.h.in, comment out #undef malloc
before change:
/* Define to rpl_malloc if the replacement function should be used. */
#undef malloc
after change:
/* Define to rpl_malloc if the replacement function should be used. */
//#undef malloc
2.2.2 In pkgs/apps/bodytrack/parsec/gcc-pthread.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.3: facesim: OK
2.4: ferret: depends on gsl and imagick, so build them first, see 2.5, and 2.6. OK
2.5: gsl:
2.5.1 In pkgs/libs/gsl/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.6: imagick: In pkgs/libs/imagick/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.7: freqmine: OK
2.8: raytrace: SKIP. In order to compile raytrace, libX11 must be cross-compiled which requires cross-compiling the following libraries:
libX11
libXmu
libXext
libxcb
xproto
xextproto
xtrans
libpthread_stubs
libXau
kbproto
inputproto
jpeg
2.9: swaptions: OK
2.10: fluidanimate: OK
2.11: vips: depends on glib and libxml2. libxml2 and vips only need to add --build and --host.
2.11.1: remove -L${CC_HOME}/lib in config/gcc.bldconf
before:
export LDFLAGS="$STATIC -pthread -L${CC_HOME}/lib"
after:
export LDFLAGS="$STATIC -pthread"
2.12: glib: add --host and --build in pkgs/libs/glib/parsec/gcc.bldconf.
2.12.1
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.12.2 In pkgs/libs/glib/src/configure, add the following lines at line 43:
ac_cv_func_posix_getpwuid_r=no
glib_cv_stack_grows=no
glib_cv_uscore=no
2.13: dedup: OK. depends on ssl, see 2.14
2.14: ssl: OK.
2.14.1 change gcc to arm-linux-gnueabi-gcc in pkgs/libs/ssl/src/Configure.pl line 323
before:
"linux-generic32","gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
after:
"linux-generic32","arm-linux-gnueabi-gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
2.14.2 comment out line 975
before:
$cflags .= " -m32 ";
after:
#$cflags .= " -m32 ";
2.15: streamcluster: OK.
2.16: canneal: OK. need pkgs/kernels/canneal/src/atomic/arm/atomic.h.
2.16.1: pkgs/kernels/canneal/src/atomic/atomic.h, add following lines:
before:
#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
# include "alpha/atomic.h"
#else
# error Architecture not supported by atomic.h
#endif
after:
#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
# include "alpha/atomic.h"
#elif defined(__arm__) || defined(__arm) || defined(arm) || defined(__ARM__)
# include "arm/atomic.h"
#else
# error Architecture not supported by atomic.h
#endif
2.16.2: download from ftp://ftp.tw.freebsd.org/pub/FreeBSD-current/src/sys/arm/include/atomic.h.
and put to pkgs/kernels/canneal/src/atomic/arm/atomic.h
2.16.3: add following lines at line 49
before:
#ifndef _KERNEL
#include
#endif
#ifndef I32_bit
after:
#ifndef _KERNEL
#include
#endif
#define ARM_VECTORS_HIGH 0xffff0000U
#define ARM_TP_ADDRESS (ARM_VECTORS_HIGH + 0x1000)
#define ARM_RAS_START (ARM_TP_ADDRESS + 4)
#define ARM_RAS_END (ARM_TP_ADDRESS + 8)
#ifndef I32_bit
2.16.4: add following lines at line 353
before:
#define atomic_store_rel_ptr atomic_store_ptr
after:
#define atomic_store_rel_ptr atomic_store_ptr
#define atomic_load_acq_ptr atomic_load_acq_long
conclusion:
12 of 13 benchmarks built successfully.
failed applications:
1. raytrace, depends on several libX libraries, which would need to be cross-compiled for ARM.
native run: ferret failed
canneal: segfault
dedup: malloc fail
Thursday, May 31, 2012
statically build PARSEC
1. cmake is difficult to build statically, and it is UN-NECESSARY since it is just a tool used to build several benchmarks.
So, to make life simpler, build cmake dynamically linked:
parsecmgmt -a build -p tools
2. Add -static to your CFLAGS and CXXFLAGS, then
type parsecmgmt -a build -p apps kernels
3. Three benchmarks are still dynamically linked because of libtool;
just link them manually:
go to the log directory and find the log file for the last build.
search for the strings "-o bodytrack", "-o facesim" and "-o vips" to locate the link commands,
then go to the right directory, statically link those benchmarks, and manually copy them to the install directory.
4. DONE
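Step 3 above, sketched on a fake log file (real logs live under the PARSEC log directory; the file name and link line here are made up):

```shell
logdir=$(mktemp -d)
cat > "$logdir/build.log" <<'EOF'
checking for pthread... yes
g++ -O3 -static -o bodytrack TrackingBenchmark.o -lpthread
make[1]: Leaving directory ...
EOF

# Locate the final link command so it can be re-run by hand with -static.
linkcmd=$(grep -- '-o bodytrack' "$logdir/build.log")
echo "$linkcmd"
```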
Sunday, May 27, 2012
SPEC OMP 2001 318.galgel fails to compile with gfortran 4.4
1. add -ffixed-form
2. /data/Benchmarks/SPEC/OMP2001_v3.2_x86/benchspec/OMPM2001/318.galgel_m/src/bifg21.f90
change
Poj2(NKY*(L-1)+M,1:K) = - MATMUL( LPOP(1:K,1:N), VI(K+1:K+N) )
to
Poj2(NKY*(L-1)+M,1:K) = - MATMUL( LPOP(1:K,1:N), VI(K+1:K+N))
remove the space before the closing parenthesis (likely it pushed the line past fixed-form column 72).
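My guess is the space mattered because fixed-form Fortran ignores everything past column 72, so the closing parenthesis fell off the line. A quick check for overlong lines in a fixed-form source (the sample line is synthetic):

```shell
src=$(mktemp)
# A 73-character line: its last character would be silently truncated.
printf '%073d\n' 0 > "$src"
overlong=$(awk 'length($0) > 72 { n++ } END { print n+0 }' "$src")
echo "$overlong"   # 1
```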
Monday, May 14, 2012
20120513 - 503
Error: 1x318.galgel_m 1x324.apsi_m 1x326.gafort_m
Success: 1x310.wupwise_m 1x312.swim_m 1x314.mgrid_m 1x316.applu_m 1x320.equake_m 1x328.fma3d_m 1x330.art_m 1x332.ammp_m
ARM SPEC regression test, 503 vs. 467
ARM
-14.08 445.gobmk
-6.85 456.hmmer
-10.92 458.sjeng
-3.10 471.omnetpp
-3.02 473.astar
-4.51 483.xalancbmk
4.63 403.gcc
-5.12 458.sjeng
-3.89 462.libquantum
-5.87 464.h264ref
-5.18 471.omnetpp
Saturday, May 12, 2012
Friday, May 11, 2012
Compiling 314.mgrid failed; resolved!
Compile 314.mgrid error:
==========================================================================
/usr/bin/gfortran-4.4.2 -fopenmp -O3 -m32 -march=prescott -mmmx -msse -msse2 -msse3 -msse4 -mfpmath=sse -fforce-addr -fivopts -fsee -ftree-vectorize -pipe mgrid.f -o mgrid
Error from make 'specmake build 2> make.err | tee make.out':
mgrid.f: In function 'resid':
mgrid.f:365: error: lastprivate variable 'i2' is private in outer context
mgrid.f:365: error: lastprivate variable 'i1' is private in outer context
mgrid.f: In function 'psinv':
mgrid.f:408: error: lastprivate variable 'i2' is private in outer context
mgrid.f:408: error: lastprivate variable 'i1' is private in outer context
Related Post:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33904
And in the following post, the OpenMP mailing list confirms that it is a bug in mgrid.f:
http://openmp.org/pipermail/omp/2007/001101.html
Bug description:
Hi!

Is

      SUBROUTINE foo(a, b, n)
      DOUBLE PRECISION a, b
      INTEGER*8 i1, i2, i3, n
      DIMENSION a(n,n,n), b(n,n,n)
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(I3)
!$OMP DO
!$OMP+LASTPRIVATE(I1,I2)
      DO i3 = 2, n-1, 1
      DO i2 = 2, n-1, 1
      DO i1 = 2, n-1, 1
      a(i1, i2, i3) = b(i1, i2, i3);
 600  CONTINUE
      ENDDO
      ENDDO
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      RETURN
      END

valid? My reading of the standard is it is not, because both I1 and I2 are sequential loop iterator vars in a parallel construct and as such should be predetermined private rather than implicitly determined shared (OpenMP 2.5, 2.8.1.1). It is not present in any of the clauses on the parallel construct which could possibly override it. 2.8.3.5 about the lastprivate clause in the first restriction says that the vars can't be private in the parallel region. Several other compilers accept this code though.

In OpenMP 3.0 draft the wording is even clearer, because it talks there about the loop iterators being predetermined private in a task region, and !$omp do doesn't create a new task region.

Or am I wrong with this?

Thanks.

Jakub
He is right!!!
Solution:
Replacing !$OMP+DEFAULT(SHARED) with !$OMP+SHARED(I1,I2) makes the code compile successfully with gfortran. Alternatively, keeping DEFAULT(SHARED) and fusing the OMP PARALLEL clause with the OMP DO clause (i.e. using OMP PARALLEL DO) also solves the problem.
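The first workaround can be scripted; a hedged sketch (the directive spelling is taken from the bug report above; check mgrid.f before applying this with sed -i):

```shell
# Replace the DEFAULT(SHARED) directive with an explicit SHARED(I1,I2)
# clause, as suggested in the solution above.
fix_omp() {
  sed 's/!\$OMP+DEFAULT(SHARED)/!$OMP+SHARED(I1,I2)/'
}

echo '!$OMP+DEFAULT(SHARED)' | fix_omp
```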
2012年5月8日 星期二
SPEC OMP 2001
Name Remarks
310.wupwise_m Quantum chromodynamics
312.swim_m Shallow water modeling
314.mgrid_m Multi-grid solver in 3D potential field
316.applu_m Parabolic/elliptic partial differential equations
318.galgel_m Fluid dynamics: analysis of oscillatory instability
320.equake_m Finite element simulation; earthquake modeling
324.apsi_m Solves problems regarding temperature, wind, velocity and distribution of pollutants
326.gafort_m Genetic algorithm
328.fma3d_m Finite element crash simulation
330.art_m Neural network simulation; adaptive resonance theory
332.ammp_m Computational Chemistry
http://www.spec.org/omp2001/docs/runspec.html
2012年4月27日 星期五
QEMU memory problem:
To simulate 32-bit in a 64-bit environment, we can use MAP_32BIT to force mmap to allocate memory addresses within 4G.
However, according to http://lxr.free-electrons.com/source/arch/x86/kernel/sys_x86_64.c, we can only get memory between 0x40000000 and 0x80000000.
atomic region
This project aims to efficiently handle guest instruction exception through atomic region (software approach).
Atomic region: blocked execution mode?
what is blocked execution mode
Installing Debian ARM under QEMU
In this post I will explain how to install Debian Armel under QEMU.
Well, there are various reasons to have an ARM Linux distro inside a virtual machine. One of them is to have a test environment to validate your programs before releasing to an embedded Linux (ARM) target.
Debian was chosen because it is the best-supported ARM distro and you will have an environment compatible with your embedded system (eglibc).
Hey! Ho! Let’s go:
- Download an ARM kernel and an initrd image from the debian.org FTP site (I chose the squeeze/testing flavor):
wget http://ftp.debian.org/debian/dists/testing/main/installer-armel/current/images/versatile/netboot/vmlinuz-2.6.32-5-versatile
wget http://ftp.debian.org/debian/dists/testing/main/installer-armel/current/images/versatile/netboot/initrd.gz
- Create a disk image (please create a raw disk! It will be useful in the future):
qemu-img create -f raw debian.img 10G
- Start Debian image using qemu:
qemu-system-arm -m 256 -M versatilepb -kernel vmlinuz-2.6.32-5-versatile -initrd initrd.gz -hda debian.img -append "root=/dev/ram"
- After installing Debian to the disk image, mount its contents:
sudo kpartx -av debian.img
sudo mount /dev/mapper/loop1p1 ./mnt/ -o loop
PS: I installed Debian with only one root partition (/dev/sda1 is my root filesystem). kpartx is needed because my disk image has 2 partitions (root filesystem and swap)
- Copy the initrd image and the kernel to your system (outside of mountpoint). These two files will be used to start our Debian installation:
cp ./mnt/boot/initrd.img . cp ./mnt/boot/vmlinuz-2.6.32-5-versatile .
- Now, we can start Debian image installed in the disk image:
qemu-system-arm -m 256 -M versatilepb -kernel vmlinuz-2.6.32-5-versatile -initrd initrd.img -hda debian.img -append "root=/dev/sda1"
In the next posts I will talk about ARM architecture, toolchains, cross compilers and embedded linux. ;)
Peter Maydell gave an explanation on 2010-11-29:
http://comments.gmane.org/gmane.comp.emulators.qemu/86388
‘versatilepb’ also supports only 256MB of RAM, for the
same reason (system registers starting at 0x10000000).
You might try one of the realview models, which have a
special case for putting more RAM at a high memory address.
PARSEC
NOT RUN Benchmark:
ferret: can build, can run
raytrace: cannot build, need 32bit libXmu libX11 libGL libGLU
vips: cannot build, python2.6 cannot agree with LONG_BIT:
use a list to
In file included from /usr/include/python2.6/Python.h:58,
from /local/tk/research/llvm-qemu/benchmarks/parsec-2.1/pkgs/libs/libxml2/src/python/libxml_wrap.h:2,
from /local/tk/research/llvm-qemu/benchmarks/parsec-2.1/pkgs/libs/libxml2/src/python/types.c:9:
/usr/include/python2.6/pyport.h:685:2: error: #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
========================================================
All 13 benchmarks except vips can build successfully
we have run 8 benchmarks; 4 benchmarks remain to try:
ferret, raytrace, x264, dedup
raytrace, x264, OK
ferret and dedup cannot run
2012年4月23日 星期一
2012年4月19日 星期四
helper functions
8 $adc_cc
1991 $add_cc
71 $clz
885 $cpsr_read
1547 $cpsr_write
773 $exception
21 $get_cp15
8 $get_user_reg
36 $sar
44 $sbc_cc
122 $set_cp15
29 $set_user_reg
479 $shl
62 $shl_cc
191 $shr
8 $shr_cc
32183 $sub_cc
1 $wfi
2012年3月11日 星期日
mibench http://www.eecs.umich.edu/mibench/
coremark http://www.coremark.org/home.php
i73,
TRACE_MERGE=1 TRACE_MERGE_USE_DBO=1 FUNCTION_OPT=0 TRACE_OPT=1 NUM_TRACE_WORKER=3 TRACE_MERGE_NUM_TARGETS=2
hmmer 1547 1254 -18.939883645766, really bad!
Wrong setting, indirect exit handling is used!
2012年3月9日 星期五
build 32-bit PARSEC binaries on 64-bit host
Here are the directions I use to build 32-bit PARSEC binaries on 64-bit
versions of Fedora/RHEL. I don't guarantee that these are the most
efficient directions for getting everything to compile properly, but
they work for me.
Note that I temporarily replace 'uname', which requires root access.
There were a number of scripts (I don't remember which ones) that use
uname to detect the architecture. If it returns x86_64, these scripts
will always attempt to compile the 64-bit versions of libraries (or
whatever). If you don't have root access, you'll have to find where
these checks are and replace them yourself. :)
1. Modify the GCC build config:
* Open ./config/gcc.bldconf
* Change CC_HOME="/n/fs/parsec/local/gcc-4.4.0-static" to CC_HOME="/usr"
* Change BINUTIL_HOME="/usr/local" to BINUTIL_HOME="/usr"
* Make sure GNUTOOL_HOME is set to ="/usr"
* Make sure BINARY_PREFIX is set to =""
* Add '-m32' to CFLAGS, CXXFLAGS, CPPFLAGS, CXXCPPFLAGS, and LDFLAGS
* Add in 'export INCLUDES="-m32"'
* Remove '-L${CC_HOME}/lib64' from LDFLAGS
2. Change the environment variable HOSTTYPE to i386
* In bash: 'HOSTTYPE=i386' 'export HOSTTYPE'
* In csh: 'setenv HOSTTYPE i386'
3. Make sure /usr/lib/libXmu.so and /usr/lib/libX11.so exist. If they don't:
* 'ln -s /usr/lib/libXmu.so.6 /usr/lib/libXmu.so'
* 'ln -s /usr/lib/libX11.so.6 /usr/lib/libX11.so'
4. Make a copy of uname
* 'sudo mv /bin/uname /bin/uname.orig'
5. Make a wrapper shell script to make uname return i686 instead of x86_64:
* Open a new file /bin/uname and add in:
#!/bin/sh
/bin/uname.orig $* | sed 's/x86_64/i686/g'
* 'sudo chmod a+x /bin/uname'
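The wrapper's filter can be checked without touching /bin/uname; the sed expression is the same one used in the script above (the sample input is just an illustration):

```shell
# Simulate the wrapper: whatever uname prints, x86_64 becomes i686.
echo 'Linux host 2.6.32 x86_64 GNU/Linux' | sed 's/x86_64/i686/g'
```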
6. Change ./pkgs/apps/facesim/src/TaskQ/lib/Makefile 'CXXARGS' to 'CXXFLAGS'
7. Modify ./pkgs/libs/ssl/src/Configure.pl
* Add '$cflags .= " -m32 ";' to line 976 (below the big list of "my" variable declarations).
8. Change ./pkgs/libs/mesa/src/configure line 4685 to 'asm_arch=x86'
9. Change ./pkgs/libs/glib/src/configure line 40390 to 'G_ATOMIC_I486'
10. Run compilation
* ./bin/parsecmgmt -a build -p all -c gcc
* Note that you'll need to do "./bin/parsecmgmt -a build -p freqmine
-c gcc-openmp" if you want freqmine to compile, as it doesn't use pthreads.
11. Return the original uname
* 'sudo mv /bin/uname.orig /bin/uname'
Good luck!
-Joe
Added by TK:
In parsec-2.1/pkgs/libs/ssl/parsec/gcc.bldconf, change line 20 to build_env="PATH=\"${PATH}\""
If something goes wrong, try not to use -jxxx when running make
2012年1月6日 星期五
2012 January 6
- bwaves: performance drops 10%~15% after adding the volatile modifier to load/store,
- the possible cause should be related to guest CPU FP register RLSO.
- both trace and procedure have the same effect.
- so far, we only see differences due to code motion between these two versions; perhaps we should look at the generated code.
- we have observed over 10% more memory loads for the volatile version
- it is difficult to find the exact structural difference.
- so, observe floating point operations: use FP_COMP_OPS_EXE:X87
- no difference in the number of floating point operations
- the increased memory operations should be the cause of the performance degradation with volatile.
- 20.1% performance degradation with volatile; 1390 -> 1661
- 11.86%, i.e. 305,180,492,006, extra memory loads
- 16.35%, i.e. 216,346,459,794, extra memory stores
- RUN both x86 CINT CFP, ARM CINT benchmarks again before
2011年11月16日 星期三
- vector?
- development in i722, gogogo
- vector state load/store elimination succeeded! need to clean up!
- Need to implement NEON guest instructions
- arm host?
- need more time
- development at haru
- partial behavior?
- why no effect? why good? why bad?
- seems no burst anymore; do research
- do it now 2011-11-21 14:46
- it seems partial inlining failed.
- astar is good
- hmmer is extremely bad! It is true!
- need to know which functions are partially inlined and see why they perform badly!!
- somehow, it degrades to trace mode. It runs mostly in traces but not in functions.
- side exit? NO! it's a stupid bug: I forgot to link "RETURN" back to the procedure
- using stack to find call path
- but we can find early exits in other benchmarks, such as perlbench.
- NEED to re-run ARM benchmarks
- debug tonto - segfault, should be easy to fix
- 11:28
- done!!!
- Sometimes it is difficult to say it's easy! when
- No! don't open stupid tab in browser!
2011年10月31日 星期一
Experiments
- i722 run NET Trace baseline, with workers 1 2 3, for x86 guest ISA
- i722 run NET Trace baseline with workers 12 3, for arm guest ISA
2011年10月27日 星期四
TODO
- remove unnecessary entries when there exists a trace that goes back to this entry point
- gather the number of functions translated for each benchmark;
- if there are multiple inputs, calculate the average
- use google doc to gather these data
- trace-guided layout has no influence on performance, why?
- maybe wrong benchmarks; I've tried gcc, which one should I try?
QUESTION:
- GCC: the trace method performs worse than the method mode; evidence shows that T-M used more than 4 times as many IBTC lookups as the method mode. WHY??? The problem may be in the transition between
- Running test inputs and output trace profile; see what we got...
- shit happens: how to stop the optimization thread when the execution thread logs out...
- Any IDEA?
- It is because there are too many traces/methods to build, so we still spend too much time on block fragments.
TODO:
1' prove that FP has been fixed
2' prove trace-guided layout works
3' prove partial inline works
4' prove chained partial inline works
Question:
gcc is strange: 1. is it trace-threshold-sensitive? 2. is it function-threshold-sensitive?
2011年8月9日 星期二
TFTP + NFS boot openrd-ultimate
1, install Debian in USB as described in http://www.cyrius.com/debian/kirkwood/openrd/install.html
2, after successful installation, log in to the Debian system
3, edit /etc/initramfs-tools/initramfs.conf, find "BOOT=local", change it to "BOOT=nfs";
4, issue "update-initramfs -u" to get new /boot/uImage and /boot/uInitrd.
5, shutdown the system, copy the base system to /opt/openrd/nfs, which is exported via NFS.
5.1, Edit /opt/openrd/nfs/etc/fstab to comment out all local mounts, since there is no local file system after NFS boot. Otherwise, Debian would try to mount the root file system again after NFS boot
6, copy /opt/openrd/nfs/boot/{uImage,uInitrd} to /opt/openrd/tftp/{uImage,uInitrd}
7, start Openrd-Ultimate, enter u-boot, set environments
set mainlineLinux yes
set arcNumber 2884
set ipaddr=192.168.1.2
set serverip=192.168.1.1
set console 'console=ttyS0,115200n8'
set nfs 'mtdparts=orion_nand:0x400000@0x100000(uImage),0x1fb00000@0x500000(rootfs) rw root=/dev/nfs rw nfsroot=192.168.1.1:/opt/openrd/nfs'
set ip 'ip=192.168.1.2:192.168.1.1:192.168.1.1:255.255.255.0:DB88FXX81:eth0:none'
set bootargs $(console) $(nfs) $(ip)
set bootcmd 'tftpboot 0x01100000 uInitrd; tftpboot 0x00800000 uImage; bootm 0x00800000 0x01100000'
saveenv
reset
8, After restart, you should be able to use tftp to load uImage and uInitrd, and mount the NFS directory /opt/openrd/nfs as the root filesystem
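Step 5.1 can be done with sed; a sketch (it assumes the local mounts in the exported tree's fstab are exactly the lines starting with /dev/):

```shell
# Comment out local /dev/ mounts in the NFS-exported fstab. Pipe the file
# through this filter, e.g.:
#   comment_local_mounts < /opt/openrd/nfs/etc/fstab
comment_local_mounts() {
  sed 's|^/dev/|#&|'
}

printf '/dev/sda1 / ext3 defaults 0 1\nproc /proc proc defaults 0 0\n' | comment_local_mounts
```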
2011年8月4日 星期四
libquantum problem
libquantum has performance lost
why?
Check ibtc and shack hit ratio
they are the same!
don't know
check all version!
2011年8月3日 星期三
segmentation fault:
1' where?
0x00007fffe0000158
0x00007fffe8000168
0x00007fffe8000078
0x00007fffe0000158
This address seems strange
jit_event_listener
try debug version
turn off AddShackPush in trace, wait and see...
still seg fault, re-implement
should always try both single thread version, multi-thread version
Reason:
timing plus an inappropriate block_map
when translating a block, the block info is inserted into block_map
at the beginning of the translation process, which violates the assumption of the block map.
We assume all infos in the block map are valid, which means they have the host code address where the translated code is located. However, at the beginning of translation, we don't yet know the exact address of the translated block. As a result, when the trace builder does AddShackPush, it finds the incomplete block in the block map and uses its address.
Therefore, we modified the code as follows:
1. block info is added immediately after the location of the translated block is known, which is in NotifyFunctionEmitted of jit_event_listener.h.
2. the query is delayed until we are going to patch the shack point.
However, in this experiment, I found the shadow stack is quite ineffective for performance;
plus it consumes more memory and makes the code complex. It seems more reasonable not to use shack when traces are available.
2011年7月11日 星期一
Debug cactusADM. Died in signal 9.
Cannot move on
Keep gen trace_
1. add print_pc
2. assert re-generated trace
3. possible reason:
somehow it traps within a block...., why?
self loop?
Fight!
Timing!
why....
a -> b -> c -> c -> c -> c -> c -> c ...
a->b->c->b->c->b->c->b->c->b
Possibility:
1' trace gen die. or no response.
which ca
Trace: Path/Cycle with duplicate nodes
CFG: Graph with unique nodes
Need convert Trace to Graph
2011年2月8日 星期二
2011年1月5日 星期三
2010年9月21日 星期二
2010年7月31日 星期六
2010年7月29日 星期四
TODO
1' Debug cc lazy; running perl fails; chop.t
2' Do block linking; let the jitter recompile code;
3' profile time; use the previous profile time framework; use ticks as time info
4' add entry code to profile block execution; can we have a generic framework for adding and retrieving
statistical info? how about using a MACRO, with a tag to indicate the type of statistic, such as invoke times,
execution time
2' do block linking; let the jitter recompile code at the same address.
2.1 Locate files where we should modify.
ExecutionEngine/JIT/JITEmitter.cpp
two files
add recompileFunction in JIT.cpp
should
================================
No use; in JITEmitter::startFunction(), CurBufferPtr will be reset;
Need to implement our own MemoryManager
I can assume that I am the only user of this jitter, so I can design a memory manager just for my own use.
Go home!
2010年7月26日 星期一
Fuck!
TODO:
1. cc-lazy still has trouble running perl, need debug!
2. do block linking
block linking:
1' add an exit block;
2' modify that exit block
change to 20100726
addPointerToBasicBlock doesn't work
it dies at DominatorPass, DFSPass,
I don't know why, but since that function cannot be used, I don't have any reason to use llvm-2.8svn
change back to llvm-2.7svn
use svn command:
svn update
svn merge -r HEAD:7381
svn commit -m "Roll back to llvm-2.7svn"
2010年7月21日 星期三
Hot trace model
today, you need to figure out the model of building a hot trace,
and you will have a meeting with Prof. Liu to discuss it.
First, collect Prof. Liu's posts on hot traces, quoted as follows.
maybe we can add possibility into this problem.
We encountered a hot-trace-finding problem last Thursday. (The discussion below does not include the possibility of conditional branches.) Basically
this problem asks for a good hot trace so that a large number of
execution sequences will remain in the hot trace. Formally we are
given a directed graph G (call graph?) and a set of simple paths P in
G. Each simple path is a sequence of at least two different nodes in
G so that consecutive nodes in the path have a directed edge in G. Note
that we assume that the nodes in a simple path must be different. I am
not sure if this is the case in our system but let us assume it for
the moment. Now we want to find another simple path h (this is the hot
trace we are looking for) so that at least k of the paths in P are
contained in H. A path p is contained in another path q if and only if
q is a substring of p. Now we can define the problem -- given G, P,
and k, is there a hot trace h that cotains at least k paths in P? We
will refere to this problem as k-HOT-TRACE.
It seems that k-HOT-TRACE is NP-complete by reducing from Hamiltonian
path. Given a graph H, we would like to ask whether there is a
Hamiltonian path in H; we simply transform the problem instance into a
k-HOT-TRACE problem instance. We simply use H as G, and let P be the
set of paths of two nodes, i.e., all edges in G, and set k to be the
number of nodes in G minus 1. As a result, if there is a solution for
the (n-1)-HOT-TRACE problem for G, there is a Hamiltonian path in H,
and vice versa.
Two followups may be possible. Anyone interested in these please come
talk to me.
There could be an efficient dynamic programming solution when G is a
tree, even when we further restrict the length of the hot trace.
This seems easy. Now the problem is that we limit the length of the
hot trace, and try to find the one that contains the maximum number of
paths. We define a function P(v, l) to be the number of paths that are
contained by the hot trace that ends at v and has length l. It is easy
to write down the recursive formula P(v, l) = P(w, l-1) + the
number of paths that end at v and have length no more than l, where w
is the parent of v. We have N times L cells to fill, where N is the
number of tree nodes and L is the maximum hot trace length allowed.
Each cell needs no more than \log n(v) operations where n(v) is the
number of paths that end at v. Roughly the total cost is no more than O(N L
\log(N)). Of course this is very rough and I need to work harder to
get a closer bound.
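Restating the quoted recurrence in symbols (w is the parent of v, P is the given path set, and the second term counts the given paths that end at v with length at most l):

```latex
P(v, \ell) \;=\; P(w, \ell-1) \;+\; \bigl|\{\, p \in P \;:\; p \text{ ends at } v,\ |p| \le \ell \,\}\bigr|
```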
There could be a dynamic programming solution when G is a series-parallel graph.
FROM TK:
I think the function P(v, l) should be defined as the MAXIMUM number of paths that are
contained by a hot trace that ends at v and has length l, since there is more
than one trace that ends at v and has length l.
Also, about the recursive function, it seems the formula only covers "one half" of the cases.
It only considers hot traces that pass through w and end at v.
However, there could be other ended-at-v-with-length-l hot traces that are contained only within the subtree rooted at v,
rather than passing through v's parent.
However, I still need some time to figure out the correct recursive function; come back later.
I write down the problem and recursive function for tree in http://www.iis.sinica.edu.tw/~tk/k-hot-path.pdf
Chun-Chen
NOTE: FROM ICE
In our system based on Qemu with LLVM, we generate the traces (i.e.
dynamic basic blocks)
which have only one entrance on the head of these blocks, so far.
Moreover, especially we want to translate a hot trace with the
optimization options of LLVM such that
the translated instructions may be reordered.
Consequently, the host code of the hot traces cannot have multiple
entrances, either.
In other words, different jump target addresses imply different traces.
Therefore, the same (static) basic blocks may be included in different
hot traces.
Furthermore, the assumption, "the nodes in a simple path must be
different", may not be suitable for our system.
Our approach prefers to construct the hot traces whose lengths are as
long as possible
since we believe that longer traces have more opportunities to be
optimized.
However, longer traces will incur more overhead to generate
optimized code.
We should limit the length of a hot trace.
Considering the definition of a hot trace, we should consider the
frequency of the execution.
We should not only find the longer hot traces on the CFG but also make
the hot code segments included in the hot traces.
Due to these reasons,
I think that the hot trace problem should be modified as follows to
match the practice of our system.
We still have a directed graph G.
A node in G represents a (static) basic block.
An edge connecting two nodes in G represents that one of these two nodes
may be executed after the other one is executed.
An edge in G has a weight to represent the frequency of the execution
between the two nodes connected by this edge.
We limit the lengths of hot traces under a given k to consider the
overhead of generating optimized code.
We want to find the minimum number of the hot traces with length
constraint to include the hot edges
where hot edges are the edges which have the weight larger than a
given threshold.
(Note that the same nodes can be included in different hot traces.)
If we remove the weights on the edges but give each node a weight to
represent the execution frequency of this node, instead,
we may have another similar version of this problem as follows.
We want to find the minimum number of the hot traces with length
constraint to include the hot nodes
where hot nodes are the nodes which have the weight larger than a
given threshold.
ice
Sorry that I do not understand what you are saying here.
Do you mean you want to find a minimum set of length-limited paths
that cover a set of specified (hot) edges?
I will refine the model according to the description of ice. Correct
me if I am wrong.
We will have weight information on edges and nodes, instead of a set
of paths as I described earlier. So we will be given a directed graph
G, a constant l, and a constant k. The edges of G are partitioned into
two subsets -- important edges and unimportant edges. We want to find
a set S of at most k directed paths where each of them is at most of
length l, and the paths in S cover all important edges. We will call
this problem edge-covering-path-set problem. Similarly we can define
node-covering-path-set problem, where we partition nodes in G into
important and unimportant ones, and define the problem similarly.
It is easy to see that node-covering-path-set is NPC, since we simply
make every node important, set k to 1 and l to n - 1, where n is the
number of nodes in G; then we have a Hamiltonian path problem. I am
wondering if edge-covering-path-set is also NPC.
These two problems on tree should be easy as long as the all edges are
directed away from the root. However, this is only theoretically
interesting since in practice the graph will not be a tree.
The edge-covering-path-set problem seems to be NPC as well. Here is
the proof. We will reduce from Hamiltonian path again. Given a
directed graph H as an instance of Hamiltonian path problem, we
construct G as follows. Every node in H splits into two nodes
called head and tail. We then add a directed edge (called an internal
edge) from head to tail. Now if there is an edge from v to w in H, we
put a directed edge from the tail of v to the head of w. Now we make
all internal edges "important", l to be 2n - 1, where n is the number
of nodes in H, and k = 1. It should be easy to see that a hot path in
G will be a Hamiltonian path in H, and vice versa.
We have three problems now -- information given as paths, on the nodes,
or on the edges. All of them are NPC, but all of them are solvable on
trees, with edges going away from the root. The path version can be
solved by dynamic programming, and the node and edge versions can be
solved by greedy methods.
2010年6月25日 星期五
2010年6月18日 星期五
2010年6月3日 星期四
build hadoop
2010/06/03 build hadoop, problem:
ivy-download:
/home/tk/Software/hadoop/hadoop-svn-0.20.2/build.xml:1630: java.net.SocketException: Network is unreachable
This post mentioned that:
Setting bindv6only to 0 in /etc/sysctl.d/bindv6only.conf on my debian squeeze installation seems to have fixed the problem. Sorry for littering the list.
This actually is a special issue in the Debian squeeze/sid version, as explained in another post.
So what to do is:
1. Setting bindv6only to 0 in /etc/sysctl.d/bindv6only.conf
2. restart procps: sudo invoke-rc.d procps restart
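The two steps above can be sketched as follows (the block writes to a temp dir for safety; point DEST at /etc/sysctl.d, as root, to apply for real; the key name net.ipv6.bindv6only is an assumption, so check your distro's docs):

```shell
# Step 1: write the sysctl override (demoed in a temp dir; use
# DEST=/etc/sysctl.d with root privileges on a real system).
DEST=$(mktemp -d)
printf 'net.ipv6.bindv6only = 0\n' > "$DEST/bindv6only.conf"
cat "$DEST/bindv6only.conf"
# Step 2 (real system only): sudo invoke-rc.d procps restart
```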
now we have a problem with the Java version
Tuesday, June 1, 2010
Tuesday, May 25, 2010
What's the major challenge of a retargetable binary translator?
I think the binary translator currently does not have enough tools to run optimizations.
What's the benefit of converting binary code to high-level intermediate code?
Now we have LLVM tools to manipulate binaries at runtime,
and to map guest registers to host registers at runtime.
gogogogogogogogogogogogogogogogogo
Monday, May 3, 2010
Saturday, April 24, 2010
Monday, April 12, 2010
QEMU in user mode uses a static code gen buffer; its size is 32 MB.
In ./exec.c line 408 it is mentioned:
"Currently it is not recommended to allocate big chunks of data in
user mode. It will change when a dedicated libc will be used"
But I don't know why. Why not use mmap as system mode does?
First set up an exit basic block, then remove the pass.
Now focus on indirect branches; what we need to know is what the address of the first argument is.
FP->getPassName()
lib/CodeGen/StackProtector.cpp
DefaultJITMemoryManager
lib/CodeGen/LLVMTargetMachine.cpp:
LLVMTargetMachine::addPassesToEmitMachineCode
LLVMTargetMachine::addCommonCodeGenPasses
=============================
Prevent spill register code gen.
Enable emitting the epilog for branches and indirect branches.
=============================
The error is here: 0x40095872: mov (%edx),%eax
%edx is the address of init_stack+c; the value in it
is already placed there in loader_exec!
After create_elf_tables()!
After loader_build_argptr()!
After put_user_ual(stringp, envp)!
Segment selector format:
| 15 ...... 3 | 2  | 1 0 |
| index       | TI | RPL |
- TI: Table indicator
  - 0 means the selector indexes into the GDT
  - 1 means the selector indexes into the LDT
- RPL: Privilege level. Linux uses only two privilege levels.
  - 0 means kernel
  - 3 means user
Examples:
- Kernel code segment
- TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
- User data segment
- TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])
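As a sanity check, the selector value can be computed from the three fields (a small illustrative sketch, not code from QEMU or Linux):

```python
def make_selector(index, ti, rpl):
    """Pack a segment selector: index in bits 15..3, TI in bit 2,
    RPL in bits 1..0."""
    assert ti in (0, 1) and 0 <= rpl <= 3
    return (index << 3) | (ti << 2) | rpl

# Kernel code segment: GDT[1], RPL 0 -> 0x08
assert make_selector(1, ti=0, rpl=0) == 0x08
# User data segment: LDT[2], RPL 3 -> 0x17
assert make_selector(2, ti=1, rpl=3) == 0x17
```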
Wednesday, April 7, 2010
Tuesday, April 6, 2010
What are the meanings of the CF, OF, and AF flags in EFLAGS?
CF: This flag indicates an overflow condition for unsigned-integer arithmetic.
Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise.
OF: This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.
Set if the integer result is too large a positive number or too small a negative number (excluding the sign bit) to fit in the destination operand; cleared otherwise.
AF: This flag is used in binary-coded decimal (BCD) arithmetic.
Set if an arithmetic operation generates a carry
or a borrow out of bit 3 of the result; cleared otherwise.
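For example, the three flags for an 8-bit addition can be derived as follows (an illustrative sketch of the definitions above, not how any emulator implements them):

```python
def add8_flags(a, b):
    """Add two 8-bit values and derive CF, OF, and AF."""
    r = (a + b) & 0xFF
    cf = (a + b) > 0xFF                     # carry out of bit 7 (unsigned overflow)
    # signed overflow: both operands share a sign that the result does not
    of = ((a ^ r) & (b ^ r) & 0x80) != 0
    af = ((a ^ b ^ r) & 0x10) != 0          # carry out of bit 3 (BCD)
    return r, cf, of, af

# 0x7F + 0x01 = 0x80: no unsigned carry, but signed overflow
# (127 + 1 does not fit in a signed byte), and a carry out of bit 3.
r, cf, of, af = add8_flags(0x7F, 0x01)
```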
Thursday, April 1, 2010
Friday, March 26, 2010
I summarize our discussion as follows.
As stated in the CACM article, the single master may pose some problems when a tremendous amount of data is considered.
For example, the master is unable to store all the metadata in the memory.
In addition, the master would become the bandwidth bottleneck because it has to handle a large number of connections.
Google now also wants to deal with this issue, and they propose to use multiple masters.
In their proposal, a file will be related to one of the masters in a static/unchangeable way.
Besides, it seems that they do not consider the load balance problem of masters in their design.
So, the following is our design, in which the load balance and bandwidth problems are simultaneously solved.
Assume that we have k masters, each of which takes care of certain chunkservers.
Each master stores k counting bloom filters, among which the i-th bloom filter represents the association between the i-th master and the file-chunk mappings it is in charge of.
Note that each master broadcasts its association to the other masters so that each master can renew the counting bloom filters it stores.
When a client wants to access a file, what it should do is to randomly pick a master and then to query the corresponding k counting bloom filters.
If no hit occurs in any of the filters, it means the file does not exist.
Otherwise, the client turns to ask the master that is in charge of the file the client would like to access.
Of course, there could be cases where more than one master appears responsible for the file the client would like to access, according to the responses of the filters.
This is due to the nature of bloom filters, and it can be easily mitigated and handled.
Increasing the filter size reduces this false-hit rate.
Simply asking those masters to confirm resolves the ambiguity.
As a whole, this design can solve the memory problem because k masters share the workload.
This design can solve the bandwidth problem because of the randomness in the choice made by the client.
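A minimal sketch of the lookup described above (hypothetical structure; a real counting Bloom filter would tune its size and number of hash functions to the expected load):

```python
import hashlib

class CountingBloomFilter:
    """A tiny counting Bloom filter: counters (instead of bits) allow
    removing an entry when a file's metadata is deleted or migrated."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.counts = [0] * size

    def _slots(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for s in self._slots(key):
            self.counts[s] += 1

    def remove(self, key):
        for s in self._slots(key):
            self.counts[s] -= 1

    def __contains__(self, key):
        return all(self.counts[s] > 0 for s in self._slots(key))

# Each master keeps one filter per master; a client queries any
# master's k filters to find which master owns a file.
filters = [CountingBloomFilter() for _ in range(3)]  # k = 3 masters
filters[1].add("/data/log.0001")
owners = [i for i, f in enumerate(filters) if "/data/log.0001" in f]
```

A hit in more than one filter is the false-positive case discussed above, resolved by asking the candidate masters directly.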
I think we can raise the following questions:
1. Why do the engineers at Google insist that, in the single-master case, all the metadata should be stored in memory on the master?
A: First, storing some of the metadata on disk may slow down the overall access speed. Second, doing so still cannot solve the bandwidth problem.
2. The CACM article briefs Google's plan for the multiple-master setting in GFS. Do you have any idea about their design? Besides, do you have any idea how to coordinate multiple masters?
A: (We can describe our idea above. I drew a conceptual picture. Perhaps we can attach this picture to the question-raising mail if you would like to.)
The following are some questions I personally want to ask:
1. Memory and bandwidth problems can be simultaneously solved in the multiple-master setting. However, the multiple-master setting has already been described, though very briefly, in the CACM article. So, I think what we should do is detail how the masters work using our idea?
2. Although the namespace partition file divides the metadata in a static way, we may still directly modify it so that, after a specific time, the metadata that belonged to master 2 belongs to master 3. So, maybe it is also dynamic?
Q1:
In the CACM article, they mention a "static namespace partition file" approach to enable multiple masters over a pool of chunkservers.
The description, however, is very rough and we are very interested in it.
So, we would like to describe that idea more clearly and identify its pros and cons in class.
Answer to Q1:
First, a static namespace partition file is just a mapping table of directory <-> master ID.
Clients will read this table to find the "right" master.
Masters write to this table on directory creation requests:
the corresponding master adds an entry to the table when a directory creation request arrives.
And we think the client chooses a master randomly to serve this request.
The content of this table must be consistent across all clients and masters.
Therefore, we think the reasonable approach is that only masters hold this table.
The scenario for a read request should be as follows.
1. Whenever a client sends a read request, it randomly chooses a master and sends the request to it.
2. That master looks up the table for the corresponding master and redirects the request to it.
3. The corresponding master serves the read request and sends the result back to the client.
Therefore, each master has a "DISPATCH" functionality, i.e., it dispatches requests to the right master according to the mapping table.
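The dispatch step could look roughly like this (hypothetical names and table contents; the real design is not public at this level of detail):

```python
# Each master holds a replica of the namespace partition table.
partition_table = {
    "/logs": 0,      # directory -> master ID
    "/photos": 1,
    "/backups": 2,
}

def dispatch(path, this_master):
    """Find the master responsible for the directory containing `path`
    and report whether the request must be forwarded."""
    directory = path.rsplit("/", 1)[0] or "/"
    owner = partition_table[directory]
    return owner, owner != this_master

# A request for /photos/cat.jpg arriving at master 0 is forwarded
# to master 1, which owns the /photos directory.
owner, forward = dispatch("/photos/cat.jpg", this_master=0)
```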
Things needed to be pointed out:
1. Although they call it a namespace partition "file", we don't think this information is stored in a file (on disk).
It should reside in the memories of the masters to keep responses fast. After all, every request must consult this table.
2. The granularity of partitioning is "directory". There are pros and cons here.
Pro 1: Large granularity results in less synchronization, so the synchronization overhead can be small.
Pro 2: This granularity simplifies the modification of the master's design.
Con 1: Large granularity may result in load imbalance, both in terms of memory usage and served requests.
Once a directory is created, all metadata of the files and chunks in that directory will be stored on one master,
and all requests for that metadata will be directed to that one master.
3. Since they allow multiple masters over a pool of chunkservers, each chunkserver must maintain another table mapping chunk ID to master ID,
so that the chunkserver knows which chunk should be reported to which master.
Concluding remarks:
The "static namespace partition file" approach for multiple masters is just a workaround approach for GFS.
The two pros let the multiple-master design be completed quickly and relieve the burden of the single-master design.
However, the con described above may be a problem.
So, what about a finer granularity: "FILE"?
We will address this problem in our project proposal.
Q2:
Why is the namespace partition file "STATIC" rather than "DYNAMIC"?
Again, we would like to describe what we have discussed in the class.
Answer to Q2:
First, we give our definition about STATIC and DYNAMIC:
"STATIC" means once the entry (directory to master ID) is added into the file, this mapping won't change in the future.
On the other hand, "DYNAMIC" means a directory, say "/A", can be mapped to master 1 at a time, and then mapped to master 2.
The reason not to allow dynamic:
If we allowed the mapping to change, we would have to support metadata migration, i.e., moving metadata from one master to another.
Since the granularity of partitioning is the directory, one can imagine that there will be LOTS of metadata to migrate when the mapping changes.
So, the overhead of migration should be large, since all requests to that directory must be suspended until the migration is done, and the mapping information on the chunkservers must be changed accordingly.
We believe the reason they don't allow dynamic is that they want to simplify the design, since this is just a workaround approach.