- Implement Opt2 for qemu_st
- go home!
Thursday, June 28, 2012
Work Log
- About zero-length arrays: http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
- https://wiki.linaro.org/PeterMaydell/QemuVersatileExpress Linaro QEMU V.Express support
- (IMG=vexpress.img ; if [ -e "$IMG" ] ; then sudo mount -o loop,offset="$(file "$IMG" | awk 'BEGIN { RS=";"; } /partition 2/ { print $7*512; }')" -t auto "$IMG" /mnt/mnt; else echo "$IMG not found"; fi )
- Linaro Android QEMU V.Express: https://wiki.linaro.org/KenWerner/Sandbox/AndroidQEMU
- vmlinuz and initrd.gz are inside uImage and uInitrd; strip the 64-byte U-Boot header to extract them: dd if=uImage of=vmlinuz bs=1 skip=64 (same for uInitrd)
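The 64-byte skip can be sanity-checked without a real uImage (64 bytes is the U-Boot legacy header size; file names below are made up):

```shell
# Build a synthetic uImage: a 64-byte fake header followed by a payload.
tmp=$(mktemp -d)
head -c 64 /dev/zero > "$tmp/header"
printf 'KERNELDATA' > "$tmp/payload"
cat "$tmp/header" "$tmp/payload" > "$tmp/uImage"

# bs=1 makes skip count single bytes, so exactly 64 bytes are dropped.
dd if="$tmp/uImage" of="$tmp/vmlinuz" bs=1 skip=64 2>/dev/null

extracted=$(cat "$tmp/vmlinuz")
echo "$extracted"   # KERNELDATA
```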
- Use reboot to shut down Android
- ARM-VExpress image: http://releases.linaro.org/12.05/ubuntu/vexpress/
Wednesday, June 27, 2012
Think Flow
- In tcg_liveness_analysis, if the opcode is qemu_ld or qemu_st, mark all globals live.
- qemu_ld is OK now.
- restore to morning status
- QEMU 1.0.1 exits when the guest powers off
- chaos
- qemu_st fails: no output, trapped in some kind of loop?
- keep forgetting what I was going to do after viewing some web pages....
Friday, June 22, 2012
Think flow
- I modified the conditional branch of load_tlb from JNE to JE, plus the related code.
- Running the original QEMU still boots ARM Linux, so far so good.
- I moved SAVE_DIRTY_STATES to a new location; call that build QEMU_TK.
- QEMU_TK dies after the first page fault happens.
- Question: what's the difference between QEMU and QEMU_TK?
- the difference is in the machine state
- OK, we need to restore the states back
Thursday, June 21, 2012
Thinking Flow
study qemu code: tcg_reg_alloc_op(): 1708
- what is fixed_reg in TCGTemp?
- in tcg_global_reg_new_internal, fixed_reg is set to 1
- in tcg_global_mem_new_internal, fixed_reg is set to 0
- so it seems to indicate whether this temp lives in a fixed host register or not
- In TCGContext, what is reg_to_temp?
- in tcg_reg_alloc(), s->reg_to_temp[reg] decides whether the HOST register is mapped to any TCGTemp.
- So I think reg_to_temp[reg] tells which TCGTemp the HOST register reg currently holds.
- What is val_type in TCGTemp?
- NOT CLEAR
- It seems to indicate where the temp's current value lives.
- It is possible that ts->fixed_reg && ts->val_type == TEMP_VAL_MEM, or !ts->fixed_reg && ts->val_type == TEMP_VAL_REG.
- When is TCGArgDef args_ct set?
TCG: tcg_gen_code_common
In TCGContext:
/* liveness analysis */
uint16_t *op_dead_iargs; /* for each operation, each bit tells if the corresponding input argument is dead */
what is tcg_op_defs?
In: tcg_liveness_analysis, tcg/tcg.c: 1187
backward scan
NOTE: tcg_opc.h: definition of TCG opcodes (a.k.a TCG IR)
So, remove TCG_OPF_CALL_CLOBBER from qemu_ld/st here
In tcg_liveness_analysis:
1292 } else if (def->flags & TCG_OPF_CALL_CLOBBER) {
1293 /* globals are live */
1294 memset(dead_temps, 0, s->nb_globals);
1295 }
Question: if we remove TCG_OPF_CALL_CLOBBER of qemu_ld/st, will this be a problem?
In: tcg_reg_alloc_op:
1708 if (def->flags & TCG_OPF_CALL_CLOBBER) {
1709 /* XXX: permit generic clobber register list ? */
1710 for(reg = 0; reg < TCG_TARGET_NB_REGS; reg++) {
1711 if (tcg_regset_test_reg(tcg_target_call_clobber_regs, reg)) {
1712 tcg_reg_free(s, reg);
1713 }
1714 }
1715 /* XXX: for load/store we could do that only for the slow path
1716 (i.e. when a memory callback is called) */
1717
1718 /* store globals and free associated registers (we assume the insn
1719 can modify any global. */
1720 save_globals(s, allocated_regs);
1721 }
Question: what does Marsellus Wallace look like? or
What does tcg_reg_free do?
It loops over tcg_target_call_clobber_regs and,
if the temp currently held in reg is not mem_coherent, stores it back to env->temp_buf.
Question: what does save_globals do?
- What does ``globals'' mean?
- In tcg/README, A TCG "global" is a variable which is live in all the functions (equivalent of a C global variable). They are defined before the functions defined. A TCG global can be a memory location (e.g. a QEMU CPU register), a fixed host register (e.g. the QEMU CPU state pointer) or a memory location which is stored in a register outside QEMU TBs (not implemented yet).
- call temp_save to save temp
- In temp_save(), save temp to env->temp_buf
==================================================================
tcg_out_op() is called to generate code for the TCG opcode.
We are interested in tcg_out_qemu_ld/st
QUESTION:
Strangely enough, I cannot find the lines that save guest register states back to their canonical locations.
I only saw the save back to temp_buf at 1708.
That is exactly the place.
==================================================================
Remove TCG_OPF_CALL_CLOBBER from qemu_ld
move save_dirty_state to the TLB-miss path
the program fails when the first PAGE FAULT occurs.
should compare REG contents between my version and the original version
==================================================================
Wednesday, June 20, 2012
Tuesday, June 19, 2012
how to change the runlevel through an appended kernel parameter:
JUST ADD THE RUNLEVEL NUMBER
EXAMPLE:
"root=/dev/sdb1 console=/dev/ttyAMA0 2 "
HOW to mount a qcow image used by QEMU:
http://blog.loftninjas.org/2008/10/27/mounting-kvm-qcow2-qemu-disk-images/
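A qcow2 image cannot be loop-mounted directly; the linked post goes through qemu-nbd instead. A sketch of that route (device name, image name and mount point are assumptions, and the nbd steps need root, so they are left as comments), plus a runnable check of the partition-offset parsing used for raw images on 6/28, against a canned sample of `file` output:

```shell
# qcow2 route (requires root and qemu-utils; shown as comments only):
#   sudo modprobe nbd max_part=8
#   sudo qemu-nbd --connect=/dev/nbd0 image.qcow2
#   sudo mount /dev/nbd0p2 /mnt
#   sudo umount /mnt && sudo qemu-nbd --disconnect /dev/nbd0

# Raw-image route: byte offset of partition 2 from `file` output.
# In this sample record, field 7 is the start sector.
fileout='img: DOS/MBR boot sector; partition 1: ID=0xc, starthead 0, startsector 2048, 104448 sectors; partition 2: ID=0x83, starthead 0, startsector 106496, 3039232 sectors'
offset=$(echo "$fileout" | awk 'BEGIN { RS=";" } /partition 2/ { print $7*512 }')
echo "$offset"   # 106496 * 512 = 54525952
```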
Friday, June 8, 2012
Wednesday, June 6, 2012
i7 currently RUNNING experiments:
TRACE_MERGE
TRACE
TRACE_NET_ORIG
Each configuration runs 4 benchmark sets: CINT-ARM, CINT-IA32, CFP-IA32, CFP_VECTOR-IA32
Each benchmark runs 5 times.
There are about 120 benchmark runs to do (3 configurations, 4 benchmark sets, 5 repetitions each).
estimated time: 120 * 15000 sec ≈ 20 days
6/26 will finish all runs!
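The ~20-day figure checks out; converting seconds to days with an awk one-liner:

```shell
# 120 runs at ~15000 s each; 86400 s per day.
days=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 120 * 15000 / 86400 }')
echo "$days"   # 20.8
```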
Producing Wrong Data Without Doing Anything Obviously Wrong
SPECvirt_sc2010: SPEC's first benchmark addressing performance evaluation of datacenter servers used in virtualized server consolidation.
Tuesday, June 5, 2012
statically build OpenMP program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39176#c7
we have to link pthread ourselves.
Add
-Wl,--whole-archive -lpthread -Wl,--no-whole-archive
gcc sse builtin functions
http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/X86-Built_002din-Functions.html
v8qi __builtin_ia32_paddb (v8qi, v8qi) v4hi __builtin_ia32_paddw (v4hi, v4hi) v2si __builtin_ia32_paddd (v2si, v2si) v8qi __builtin_ia32_psubb (v8qi, v8qi) v4hi __builtin_ia32_psubw (v4hi, v4hi) v2si __builtin_ia32_psubd (v2si, v2si) v8qi __builtin_ia32_paddsb (v8qi, v8qi) v4hi __builtin_ia32_paddsw (v4hi, v4hi) v8qi __builtin_ia32_psubsb (v8qi, v8qi) v4hi __builtin_ia32_psubsw (v4hi, v4hi) v8qi __builtin_ia32_paddusb (v8qi, v8qi) v4hi __builtin_ia32_paddusw (v4hi, v4hi) v8qi __builtin_ia32_psubusb (v8qi, v8qi) v4hi __builtin_ia32_psubusw (v4hi, v4hi) v4hi __builtin_ia32_pmullw (v4hi, v4hi) v4hi __builtin_ia32_pmulhw (v4hi, v4hi) di __builtin_ia32_pand (di, di) di __builtin_ia32_pandn (di,di) di __builtin_ia32_por (di, di) di __builtin_ia32_pxor (di, di) v8qi __builtin_ia32_pcmpeqb (v8qi, v8qi) v4hi __builtin_ia32_pcmpeqw (v4hi, v4hi) v2si __builtin_ia32_pcmpeqd (v2si, v2si) v8qi __builtin_ia32_pcmpgtb (v8qi, v8qi) v4hi __builtin_ia32_pcmpgtw (v4hi, v4hi) v2si __builtin_ia32_pcmpgtd (v2si, v2si) v8qi __builtin_ia32_punpckhbw (v8qi, v8qi) v4hi __builtin_ia32_punpckhwd (v4hi, v4hi) v2si __builtin_ia32_punpckhdq (v2si, v2si) v8qi __builtin_ia32_punpcklbw (v8qi, v8qi) v4hi __builtin_ia32_punpcklwd (v4hi, v4hi) v2si __builtin_ia32_punpckldq (v2si, v2si) v8qi __builtin_ia32_packsswb (v4hi, v4hi) v4hi __builtin_ia32_packssdw (v2si, v2si) v8qi __builtin_ia32_packuswb (v4hi, v4hi)The following built-in functions are made available either with -msse, or with a combination of -m3dnow and -march=athlon. All of them generate the machine instruction that is part of the name.
v4hi __builtin_ia32_pmulhuw (v4hi, v4hi) v8qi __builtin_ia32_pavgb (v8qi, v8qi) v4hi __builtin_ia32_pavgw (v4hi, v4hi) v4hi __builtin_ia32_psadbw (v8qi, v8qi) v8qi __builtin_ia32_pmaxub (v8qi, v8qi) v4hi __builtin_ia32_pmaxsw (v4hi, v4hi) v8qi __builtin_ia32_pminub (v8qi, v8qi) v4hi __builtin_ia32_pminsw (v4hi, v4hi) int __builtin_ia32_pextrw (v4hi, int) v4hi __builtin_ia32_pinsrw (v4hi, int, int) int __builtin_ia32_pmovmskb (v8qi) void __builtin_ia32_maskmovq (v8qi, v8qi, char *) void __builtin_ia32_movntq (di *, di) void __builtin_ia32_sfence (void)The following built-in functions are available when -msse is used. All of them generate the machine instruction that is part of the name.
int __builtin_ia32_comieq (v4sf, v4sf) int __builtin_ia32_comineq (v4sf, v4sf) int __builtin_ia32_comilt (v4sf, v4sf) int __builtin_ia32_comile (v4sf, v4sf) int __builtin_ia32_comigt (v4sf, v4sf) int __builtin_ia32_comige (v4sf, v4sf) int __builtin_ia32_ucomieq (v4sf, v4sf) int __builtin_ia32_ucomineq (v4sf, v4sf) int __builtin_ia32_ucomilt (v4sf, v4sf) int __builtin_ia32_ucomile (v4sf, v4sf) int __builtin_ia32_ucomigt (v4sf, v4sf) int __builtin_ia32_ucomige (v4sf, v4sf) v4sf __builtin_ia32_addps (v4sf, v4sf) v4sf __builtin_ia32_subps (v4sf, v4sf) v4sf __builtin_ia32_mulps (v4sf, v4sf) v4sf __builtin_ia32_divps (v4sf, v4sf) v4sf __builtin_ia32_addss (v4sf, v4sf) v4sf __builtin_ia32_subss (v4sf, v4sf) v4sf __builtin_ia32_mulss (v4sf, v4sf) v4sf __builtin_ia32_divss (v4sf, v4sf) v4si __builtin_ia32_cmpeqps (v4sf, v4sf) v4si __builtin_ia32_cmpltps (v4sf, v4sf) v4si __builtin_ia32_cmpleps (v4sf, v4sf) v4si __builtin_ia32_cmpgtps (v4sf, v4sf) v4si __builtin_ia32_cmpgeps (v4sf, v4sf) v4si __builtin_ia32_cmpunordps (v4sf, v4sf) v4si __builtin_ia32_cmpneqps (v4sf, v4sf) v4si __builtin_ia32_cmpnltps (v4sf, v4sf) v4si __builtin_ia32_cmpnleps (v4sf, v4sf) v4si __builtin_ia32_cmpngtps (v4sf, v4sf) v4si __builtin_ia32_cmpngeps (v4sf, v4sf) v4si __builtin_ia32_cmpordps (v4sf, v4sf) v4si __builtin_ia32_cmpeqss (v4sf, v4sf) v4si __builtin_ia32_cmpltss (v4sf, v4sf) v4si __builtin_ia32_cmpless (v4sf, v4sf) v4si __builtin_ia32_cmpunordss (v4sf, v4sf) v4si __builtin_ia32_cmpneqss (v4sf, v4sf) v4si __builtin_ia32_cmpnlts (v4sf, v4sf) v4si __builtin_ia32_cmpnless (v4sf, v4sf) v4si __builtin_ia32_cmpordss (v4sf, v4sf) v4sf __builtin_ia32_maxps (v4sf, v4sf) v4sf __builtin_ia32_maxss (v4sf, v4sf) v4sf __builtin_ia32_minps (v4sf, v4sf) v4sf __builtin_ia32_minss (v4sf, v4sf) v4sf __builtin_ia32_andps (v4sf, v4sf) v4sf __builtin_ia32_andnps (v4sf, v4sf) v4sf __builtin_ia32_orps (v4sf, v4sf) v4sf __builtin_ia32_xorps (v4sf, v4sf) v4sf __builtin_ia32_movss (v4sf, v4sf) v4sf 
__builtin_ia32_movhlps (v4sf, v4sf) v4sf __builtin_ia32_movlhps (v4sf, v4sf) v4sf __builtin_ia32_unpckhps (v4sf, v4sf) v4sf __builtin_ia32_unpcklps (v4sf, v4sf) v4sf __builtin_ia32_cvtpi2ps (v4sf, v2si) v4sf __builtin_ia32_cvtsi2ss (v4sf, int) v2si __builtin_ia32_cvtps2pi (v4sf) int __builtin_ia32_cvtss2si (v4sf) v2si __builtin_ia32_cvttps2pi (v4sf) int __builtin_ia32_cvttss2si (v4sf) v4sf __builtin_ia32_rcpps (v4sf) v4sf __builtin_ia32_rsqrtps (v4sf) v4sf __builtin_ia32_sqrtps (v4sf) v4sf __builtin_ia32_rcpss (v4sf) v4sf __builtin_ia32_rsqrtss (v4sf) v4sf __builtin_ia32_sqrtss (v4sf) v4sf __builtin_ia32_shufps (v4sf, v4sf, int) void __builtin_ia32_movntps (float *, v4sf) int __builtin_ia32_movmskps (v4sf)The following built-in functions are available when -msse is used.
v4sf __builtin_ia32_loadaps (float *) - generates the movaps machine instruction as a load from memory.
void __builtin_ia32_storeaps (float *, v4sf) - generates the movaps machine instruction as a store to memory.
v4sf __builtin_ia32_loadups (float *) - generates the movups machine instruction as a load from memory.
void __builtin_ia32_storeups (float *, v4sf) - generates the movups machine instruction as a store to memory.
v4sf __builtin_ia32_loadsss (float *) - generates the movss machine instruction as a load from memory.
void __builtin_ia32_storess (float *, v4sf) - generates the movss machine instruction as a store to memory.
v4sf __builtin_ia32_loadhps (v4sf, v2si *) - generates the movhps machine instruction as a load from memory.
v4sf __builtin_ia32_loadlps (v4sf, v2si *) - generates the movlps machine instruction as a load from memory.
void __builtin_ia32_storehps (v4sf, v2si *) - generates the movhps machine instruction as a store to memory.
void __builtin_ia32_storelps (v4sf, v2si *) - generates the movlps machine instruction as a store to memory.
int __builtin_ia32_comisdeq (v2df, v2df) int __builtin_ia32_comisdlt (v2df, v2df) int __builtin_ia32_comisdle (v2df, v2df) int __builtin_ia32_comisdgt (v2df, v2df) int __builtin_ia32_comisdge (v2df, v2df) int __builtin_ia32_comisdneq (v2df, v2df) int __builtin_ia32_ucomisdeq (v2df, v2df) int __builtin_ia32_ucomisdlt (v2df, v2df) int __builtin_ia32_ucomisdle (v2df, v2df) int __builtin_ia32_ucomisdgt (v2df, v2df) int __builtin_ia32_ucomisdge (v2df, v2df) int __builtin_ia32_ucomisdneq (v2df, v2df) v2df __builtin_ia32_cmpeqpd (v2df, v2df) v2df __builtin_ia32_cmpltpd (v2df, v2df) v2df __builtin_ia32_cmplepd (v2df, v2df) v2df __builtin_ia32_cmpgtpd (v2df, v2df) v2df __builtin_ia32_cmpgepd (v2df, v2df) v2df __builtin_ia32_cmpunordpd (v2df, v2df) v2df __builtin_ia32_cmpneqpd (v2df, v2df) v2df __builtin_ia32_cmpnltpd (v2df, v2df) v2df __builtin_ia32_cmpnlepd (v2df, v2df) v2df __builtin_ia32_cmpngtpd (v2df, v2df) v2df __builtin_ia32_cmpngepd (v2df, v2df) v2df __builtin_ia32_cmpordpd (v2df, v2df) v2df __builtin_ia32_cmpeqsd (v2df, v2df) v2df __builtin_ia32_cmpltsd (v2df, v2df) v2df __builtin_ia32_cmplesd (v2df, v2df) v2df __builtin_ia32_cmpunordsd (v2df, v2df) v2df __builtin_ia32_cmpneqsd (v2df, v2df) v2df __builtin_ia32_cmpnltsd (v2df, v2df) v2df __builtin_ia32_cmpnlesd (v2df, v2df) v2df __builtin_ia32_cmpordsd (v2df, v2df) v2di __builtin_ia32_paddq (v2di, v2di) v2di __builtin_ia32_psubq (v2di, v2di) v2df __builtin_ia32_addpd (v2df, v2df) v2df __builtin_ia32_subpd (v2df, v2df) v2df __builtin_ia32_mulpd (v2df, v2df) v2df __builtin_ia32_divpd (v2df, v2df) v2df __builtin_ia32_addsd (v2df, v2df) v2df __builtin_ia32_subsd (v2df, v2df) v2df __builtin_ia32_mulsd (v2df, v2df) v2df __builtin_ia32_divsd (v2df, v2df) v2df __builtin_ia32_minpd (v2df, v2df) v2df __builtin_ia32_maxpd (v2df, v2df) v2df __builtin_ia32_minsd (v2df, v2df) v2df __builtin_ia32_maxsd (v2df, v2df) v2df __builtin_ia32_andpd (v2df, v2df) v2df __builtin_ia32_andnpd (v2df, v2df) v2df __builtin_ia32_orpd (v2df, v2df) 
v2df __builtin_ia32_xorpd (v2df, v2df) v2df __builtin_ia32_movsd (v2df, v2df) v2df __builtin_ia32_unpckhpd (v2df, v2df) v2df __builtin_ia32_unpcklpd (v2df, v2df) v16qi __builtin_ia32_paddb128 (v16qi, v16qi) v8hi __builtin_ia32_paddw128 (v8hi, v8hi) v4si __builtin_ia32_paddd128 (v4si, v4si) v2di __builtin_ia32_paddq128 (v2di, v2di) v16qi __builtin_ia32_psubb128 (v16qi, v16qi) v8hi __builtin_ia32_psubw128 (v8hi, v8hi) v4si __builtin_ia32_psubd128 (v4si, v4si) v2di __builtin_ia32_psubq128 (v2di, v2di) v8hi __builtin_ia32_pmullw128 (v8hi, v8hi) v8hi __builtin_ia32_pmulhw128 (v8hi, v8hi) v2di __builtin_ia32_pand128 (v2di, v2di) v2di __builtin_ia32_pandn128 (v2di, v2di) v2di __builtin_ia32_por128 (v2di, v2di) v2di __builtin_ia32_pxor128 (v2di, v2di) v16qi __builtin_ia32_pavgb128 (v16qi, v16qi) v8hi __builtin_ia32_pavgw128 (v8hi, v8hi) v16qi __builtin_ia32_pcmpeqb128 (v16qi, v16qi) v8hi __builtin_ia32_pcmpeqw128 (v8hi, v8hi) v4si __builtin_ia32_pcmpeqd128 (v4si, v4si) v16qi __builtin_ia32_pcmpgtb128 (v16qi, v16qi) v8hi __builtin_ia32_pcmpgtw128 (v8hi, v8hi) v4si __builtin_ia32_pcmpgtd128 (v4si, v4si) v16qi __builtin_ia32_pmaxub128 (v16qi, v16qi) v8hi __builtin_ia32_pmaxsw128 (v8hi, v8hi) v16qi __builtin_ia32_pminub128 (v16qi, v16qi) v8hi __builtin_ia32_pminsw128 (v8hi, v8hi) v16qi __builtin_ia32_punpckhbw128 (v16qi, v16qi) v8hi __builtin_ia32_punpckhwd128 (v8hi, v8hi) v4si __builtin_ia32_punpckhdq128 (v4si, v4si) v2di __builtin_ia32_punpckhqdq128 (v2di, v2di) v16qi __builtin_ia32_punpcklbw128 (v16qi, v16qi) v8hi __builtin_ia32_punpcklwd128 (v8hi, v8hi) v4si __builtin_ia32_punpckldq128 (v4si, v4si) v2di __builtin_ia32_punpcklqdq128 (v2di, v2di) v16qi __builtin_ia32_packsswb128 (v16qi, v16qi) v8hi __builtin_ia32_packssdw128 (v8hi, v8hi) v16qi __builtin_ia32_packuswb128 (v16qi, v16qi) v8hi __builtin_ia32_pmulhuw128 (v8hi, v8hi) void __builtin_ia32_maskmovdqu (v16qi, v16qi) v2df __builtin_ia32_loadupd (double *) void __builtin_ia32_storeupd (double *, v2df) v2df 
__builtin_ia32_loadhpd (v2df, double *) v2df __builtin_ia32_loadlpd (v2df, double *) int __builtin_ia32_movmskpd (v2df) int __builtin_ia32_pmovmskb128 (v16qi) void __builtin_ia32_movnti (int *, int) void __builtin_ia32_movntpd (double *, v2df) void __builtin_ia32_movntdq (v2df *, v2df) v4si __builtin_ia32_pshufd (v4si, int) v8hi __builtin_ia32_pshuflw (v8hi, int) v8hi __builtin_ia32_pshufhw (v8hi, int) v2di __builtin_ia32_psadbw128 (v16qi, v16qi) v2df __builtin_ia32_sqrtpd (v2df) v2df __builtin_ia32_sqrtsd (v2df) v2df __builtin_ia32_shufpd (v2df, v2df, int) v2df __builtin_ia32_cvtdq2pd (v4si) v4sf __builtin_ia32_cvtdq2ps (v4si) v4si __builtin_ia32_cvtpd2dq (v2df) v2si __builtin_ia32_cvtpd2pi (v2df) v4sf __builtin_ia32_cvtpd2ps (v2df) v4si __builtin_ia32_cvttpd2dq (v2df) v2si __builtin_ia32_cvttpd2pi (v2df) v2df __builtin_ia32_cvtpi2pd (v2si) int __builtin_ia32_cvtsd2si (v2df) int __builtin_ia32_cvttsd2si (v2df) long long __builtin_ia32_cvtsd2si64 (v2df) long long __builtin_ia32_cvttsd2si64 (v2df) v4si __builtin_ia32_cvtps2dq (v4sf) v2df __builtin_ia32_cvtps2pd (v4sf) v4si __builtin_ia32_cvttps2dq (v4sf) v2df __builtin_ia32_cvtsi2sd (v2df, int) v2df __builtin_ia32_cvtsi642sd (v2df, long long) v4sf __builtin_ia32_cvtsd2ss (v4sf, v2df) v2df __builtin_ia32_cvtss2sd (v2df, v4sf) void __builtin_ia32_clflush (const void *) void __builtin_ia32_lfence (void) void __builtin_ia32_mfence (void) v16qi __builtin_ia32_loaddqu (const char *) void __builtin_ia32_storedqu (char *, v16qi) unsigned long long __builtin_ia32_pmuludq (v2si, v2si) v2di __builtin_ia32_pmuludq128 (v4si, v4si) v8hi __builtin_ia32_psllw128 (v8hi, v2di) v4si __builtin_ia32_pslld128 (v4si, v2di) v2di __builtin_ia32_psllq128 (v4si, v2di) v8hi __builtin_ia32_psrlw128 (v8hi, v2di) v4si __builtin_ia32_psrld128 (v4si, v2di) v2di __builtin_ia32_psrlq128 (v2di, v2di) v8hi __builtin_ia32_psraw128 (v8hi, v2di) v4si __builtin_ia32_psrad128 (v4si, v2di) v2di __builtin_ia32_pslldqi128 (v2di, int) v8hi 
__builtin_ia32_psllwi128 (v8hi, int) v4si __builtin_ia32_pslldi128 (v4si, int) v2di __builtin_ia32_psllqi128 (v2di, int) v2di __builtin_ia32_psrldqi128 (v2di, int) v8hi __builtin_ia32_psrlwi128 (v8hi, int) v4si __builtin_ia32_psrldi128 (v4si, int) v2di __builtin_ia32_psrlqi128 (v2di, int) v8hi __builtin_ia32_psrawi128 (v8hi, int) v4si __builtin_ia32_psradi128 (v4si, int) v4si __builtin_ia32_pmaddwd128 (v8hi, v8hi)The following built-in functions are available when -msse3 is used. All of them generate the machine instruction that is part of the name.
v2df __builtin_ia32_addsubpd (v2df, v2df) v4sf __builtin_ia32_addsubps (v4sf, v4sf) v2df __builtin_ia32_haddpd (v2df, v2df) v4sf __builtin_ia32_haddps (v4sf, v4sf) v2df __builtin_ia32_hsubpd (v2df, v2df) v4sf __builtin_ia32_hsubps (v4sf, v4sf) v16qi __builtin_ia32_lddqu (char const *) void __builtin_ia32_monitor (void *, unsigned int, unsigned int) v2df __builtin_ia32_movddup (v2df) v4sf __builtin_ia32_movshdup (v4sf) v4sf __builtin_ia32_movsldup (v4sf) void __builtin_ia32_mwait (unsigned int, unsigned int)The following built-in functions are available when -msse3 is used.
v2df __builtin_ia32_loadddup (double const *) - generates the movddup machine instruction as a load from memory.
void __builtin_ia32_femms (void) v8qi __builtin_ia32_pavgusb (v8qi, v8qi) v2si __builtin_ia32_pf2id (v2sf) v2sf __builtin_ia32_pfacc (v2sf, v2sf) v2sf __builtin_ia32_pfadd (v2sf, v2sf) v2si __builtin_ia32_pfcmpeq (v2sf, v2sf) v2si __builtin_ia32_pfcmpge (v2sf, v2sf) v2si __builtin_ia32_pfcmpgt (v2sf, v2sf) v2sf __builtin_ia32_pfmax (v2sf, v2sf) v2sf __builtin_ia32_pfmin (v2sf, v2sf) v2sf __builtin_ia32_pfmul (v2sf, v2sf) v2sf __builtin_ia32_pfrcp (v2sf) v2sf __builtin_ia32_pfrcpit1 (v2sf, v2sf) v2sf __builtin_ia32_pfrcpit2 (v2sf, v2sf) v2sf __builtin_ia32_pfrsqrt (v2sf) v2sf __builtin_ia32_pfrsqrtit1 (v2sf, v2sf) v2sf __builtin_ia32_pfsub (v2sf, v2sf) v2sf __builtin_ia32_pfsubr (v2sf, v2sf) v2sf __builtin_ia32_pi2fd (v2si) v4hi __builtin_ia32_pmulhrw (v4hi, v4hi)The following built-in functions are available when both -m3dnow and -march=athlon are used. All of them generate the machine instruction that is part of the name.
v2si __builtin_ia32_pf2iw (v2sf) v2sf __builtin_ia32_pfnacc (v2sf, v2sf) v2sf __builtin_ia32_pfpnacc (v2sf, v2sf) v2sf __builtin_ia32_pi2fw (v2si) v2sf __builtin_ia32_pswapdsf (v2sf) v2si __builtin_ia32_pswapdsi (v2si)
Sunday, June 3, 2012
build parsec for ARM
reference document: http://www.cs.utexas.edu/~parsec_m5/TR-09-32.pdf
cross-compilation environment:
1. HOSTTYPE=arm
2. PATH=/path/to/fake/uname/bin:$PATH
content of /path/to/fake/uname/bin/uname:
===============================
$ cat ~/research/benchmarks/parsec-2.1-arm/fake-uname/uname
#!/bin/sh
/bin/uname $* | sed 's/i686/armv7l/g'
===============================
3. cross compilation tools: arm-linux-gnueabi-*
4. host machine is i686
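The fake-uname setup above, reproduced end to end in a scratch directory (paths are throwaway); anything that resolves uname through PATH now reports an ARM machine string:

```shell
dir=$(mktemp -d)
mkdir -p "$dir/bin"
cat > "$dir/bin/uname" <<'EOF'
#!/bin/sh
/bin/uname "$@" | sed 's/i686/armv7l/g'
EOF
chmod +x "$dir/bin/uname"
export HOSTTYPE=arm
export PATH="$dir/bin:$PATH"

# The substitution the wrapper performs, shown on a canned i686 string:
mapped=$(echo "Linux host 3.2.0 i686 GNU/Linux" | sed 's/i686/armv7l/g')
echo "$mapped"
```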
Steps:
1. compile tools natively
$ parsecmgmt -a build -p tools
Note: for now, use native i686 compilation flags in gcc.bldconf
2. compile apps to ARM binary:
1. set BINARY_PREFIX options in gcc.bldconf
2.1 blackscholes : OK
2.2 bodytrack:
2.2.1 In pkgs/apps/bodytrack/src/config.h.in, comment out #undef malloc
before change:
/* Define to rpl_malloc if the replacement function should be used. */
#undef malloc
after change:
/* Define to rpl_malloc if the replacement function should be used. */
//#undef malloc
2.2.2 In pkgs/apps/bodytrack/parsec/gcc-pthread.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--enable-threads --disable-openmp --disable-tbb --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.3: facesim: OK
2.4: ferret: depends on gsl and imagick, so build them first, see 2.5, and 2.6. OK
2.5: gsl:
2.5.1 In pkgs/libs/gsl/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.6: imagick: In pkgs/libs/imagick/parsec/gcc.bldconf, add --host and --build.
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --without-perl --without-magick-plus-plus --without-bzlib --without-dps --without-djvu --without-fpx --without-gslib --without-jbig --with-jpeg --without-jp2 --without-tiff --without-wmf --without-zlib --without-x --without-fontconfig --without-freetype --without-lcms --without-png --without-gvc --without-openexr --without-rsvg --without-xml --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.7: freqmine: OK
2.8: raytrace: SKIP. In order to compile raytrace, libX11 must be cross-compiled which requires cross-compiling the following libraries:
libX11
libXmu
libXext
libxcb
xproto
xextproto
xtrans
libpthread_stubs
libXau
kbproto
inputproto
jpeg
2.9: swaptions: OK
2.10: fluidanimate: OK
2.11: vips: depends on glib and libxml2. libxml2 and vips only need to add --build and --host.
2.11.1: remove -L${CC_HOME}/lib in config/gcc.bldconf
before:
export LDFLAGS="$STATIC -pthread -L${CC_HOME}/lib"
after:
export LDFLAGS="$STATIC -pthread"
2.12: glib: add --host and --build in pkgs/libs/glib/parsec/gcc.bldconf.
2.12.1
before:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix"
after:
# Arguments to pass to the configure script, if it exists
build_conf="--disable-shared --enable-threads --with-threads=posix --build=i686-linux-gnu --host=arm-linux-gnueabi"
2.12.2 In pkgs/libs/glib/src/configure, add the following lines at line 43:
ac_cv_func_posix_getpwuid_r=no
glib_cv_stack_grows=no
glib_cv_uscore=no
2.13: dedup: OK. depends on ssl, see 2.14
2.14: ssl: OK.
2.14.1 change gcc to arm-linux-gnueabi-gcc in pkgs/libs/ssl/src/Configure.pl line 323
before:
"linux-generic32","gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
after:
"linux-generic32","arm-linux-gnueabi-gcc-:-DTERMIO -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
2.14.2 comment out line 975
before:
$cflags .= " -m32 ";
after:
#$cflags .= " -m32 ";
2.15: streamcluster: OK.
2.16: canneal: OK. need pkgs/kernels/canneal/src/atomic/arm/atomic.h.
2.16.1: pkgs/kernels/canneal/src/atomic/atomic.h, add following lines:
before:
#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
# include "alpha/atomic.h"
#else
# error Architecture not supported by atomic.h
#endif
after:
#elif defined(__alpha__) || defined(__alpha) || defined(alpha) || defined(__ALPHA__)
# include "alpha/atomic.h"
#elif defined(__arm__) || defined(__arm) || defined(arm) || defined(__ARM__)
# include "arm/atomic.h"
#else
# error Architecture not supported by atomic.h
#endif
2.16.2: download from ftp://ftp.tw.freebsd.org/pub/FreeBSD-current/src/sys/arm/include/atomic.h.
and put to pkgs/kernels/canneal/src/atomic/arm/atomic.h
2.16.3: add following lines at line 49
before:
#ifndef _KERNEL
#include
#endif
#ifndef I32_bit
after:
#ifndef _KERNEL
#include
#endif
#define ARM_VECTORS_HIGH 0xffff0000U
#define ARM_TP_ADDRESS (ARM_VECTORS_HIGH + 0x1000)
#define ARM_RAS_START (ARM_TP_ADDRESS + 4)
#define ARM_RAS_END (ARM_TP_ADDRESS + 8)
#ifndef I32_bit
2.16.4: add following lines at line 353
before:
#define atomic_store_rel_ptr atomic_store_ptr
after:
#define atomic_store_rel_ptr atomic_store_ptr
#define atomic_load_acq_ptr atomic_load_acq_long
conclusion:
12 of 13 benchmarks built successfully.
failed applications:
1. raytrace, depends on several libX libraries, which would need to be cross-compiled for ARM.
native run: ferret failed
canneal: segfault
dedup: malloc fail
Thursday, May 31, 2012
statically build PARSEC
1. cmake is difficult to build statically, and it is UN-NECESSARY since it is just a tool used to build several benchmarks.
So, to make life simpler, build cmake dynamically linked:
parsecmgmt -a build -p tools
2. Add -static to your CFLAGS and CXXFLAGS, then
type parsecmgmt -a build -p apps kernels
3. Three benchmarks are still dynamically linked because of libtool;
just link them manually:
go to the log directory and find the log file for the last build.
search for the strings "-o bodytrack", "-o facesim" and "-o vips" to locate the link commands,
then go to the right directory, statically link those benchmarks, and manually copy them to the install directory.
4. DONE
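Step 3 above, sketched on a fake log file (real logs live under the PARSEC log directory; the file name and link line here are made up):

```shell
logdir=$(mktemp -d)
cat > "$logdir/build.log" <<'EOF'
checking for pthread... yes
g++ -O3 -static -o bodytrack TrackingBenchmark.o -lpthread
make[1]: Leaving directory ...
EOF

# Locate the final link command so it can be re-run by hand with -static.
linkcmd=$(grep -- '-o bodytrack' "$logdir/build.log")
echo "$linkcmd"
```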
Sunday, May 27, 2012
SPEC OMP 2001 318.galgel fails to compile with gfortran 4.4
1. add -ffixed-form
2. /data/Benchmarks/SPEC/OMP2001_v3.2_x86/benchspec/OMPM2001/318.galgel_m/src/bifg21.f90
change
Poj2(NKY*(L-1)+M,1:K) = - MATMUL( LPOP(1:K,1:N), VI(K+1:K+N) )
to
Poj2(NKY*(L-1)+M,1:K) = - MATMUL( LPOP(1:K,1:N), VI(K+1:K+N))
remove the space before the closing parenthesis (likely it pushed the line past fixed-form column 72).
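My guess is the space mattered because fixed-form Fortran ignores everything past column 72, so the closing parenthesis fell off the line. A quick check for overlong lines in a fixed-form source (the sample line is synthetic):

```shell
src=$(mktemp)
# A 73-character line: its last character would be silently truncated.
printf '%073d\n' 0 > "$src"
overlong=$(awk 'length($0) > 72 { n++ } END { print n+0 }' "$src")
echo "$overlong"   # 1
```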
Monday, May 14, 2012
20120513 - 503
Error: 1x318.galgel_m 1x324.apsi_m 1x326.gafort_m
Success: 1x310.wupwise_m 1x312.swim_m 1x314.mgrid_m 1x316.applu_m 1x320.equake_m 1x328.fma3d_m 1x330.art_m 1x332.ammp_m
ARM SPEC regression test, 503 vs. 467
ARM
-14.08 445.gobmk
-6.85 456.hmmer
-10.92 458.sjeng
-3.10 471.omnetpp
-3.02 473.astar
-4.51 483.xalancbmk
4.63 403.gcc
-5.12 458.sjeng
-3.89 462.libquantum
-5.87 464.h264ref
-5.18 471.omnetpp
Saturday, May 12, 2012
Friday, May 11, 2012
Compiling 314.mgrid failed; resolved!
Compile 314.mgrid error:
==========================================================================
/usr/bin/gfortran-4.4.2 -fopenmp -O3 -m32 -march=prescott -mmmx -msse -msse2 -msse3 -msse4 -mfpmath=sse -fforce-addr -fivopts -fsee -ftree-vectorize -pipe mgrid.f -o mgrid
Error from make 'specmake build 2> make.err | tee make.out':
mgrid.f: In function 'resid':
mgrid.f:365: error: lastprivate variable 'i2' is private in outer context
mgrid.f:365: error: lastprivate variable 'i1' is private in outer context
mgrid.f: In function 'psinv':
mgrid.f:408: error: lastprivate variable 'i2' is private in outer context
mgrid.f:408: error: lastprivate variable 'i1' is private in outer context
Related Post:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33904
And in the following post, the OpenMP mailing list confirms that it is a bug in mgrid.f:
http://openmp.org/pipermail/omp/2007/001101.html
Bug description:
Hi!

Is

      SUBROUTINE foo(a, b, n)
      DOUBLE PRECISION a, b
      INTEGER*8 i1, i2, i3, n
      DIMENSION a(n,n,n), b(n,n,n)
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(I3)
!$OMP DO
!$OMP+LASTPRIVATE(I1,I2)
      DO i3 = 2, n-1, 1
      DO i2 = 2, n-1, 1
      DO i1 = 2, n-1, 1
      a(i1, i2, i3) = b(i1, i2, i3);
 600  CONTINUE
      ENDDO
      ENDDO
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      RETURN
      END

valid? My reading of the standard is it is not, because both I1 and I2 are sequential loop iterator vars in a parallel construct and as such should be predetermined private rather than implicitly determined shared (OpenMP 2.5, 2.8.1.1). It is not present in any of the clauses on the parallel construct which could possibly override it. 2.8.3.5 about the lastprivate clause in the first restriction says that the vars can't be private in the parallel region. Several other compilers accept this code though.

In OpenMP 3.0 draft the wording is even clearer, because it talks there about the loop iterators being predetermined private in a task region, and !$omp do doesn't create a new task region.

Or am I wrong with this?

Thanks.

Jakub
He is right!!!
Solution:
Replacing !$OMP+DEFAULT(SHARED) with !$OMP+SHARED(I1,I2) makes the code compile successfully with gfortran. Alternatively, keeping DEFAULT(SHARED) and fusing the OMP PARALLEL clause with the OMP DO clause (i.e. using OMP PARALLEL DO) also solves the problem.
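The first workaround can be scripted; a hedged sketch (the directive spelling is taken from the bug report above; check mgrid.f before applying this with sed -i):

```shell
# Replace the DEFAULT(SHARED) directive with an explicit SHARED(I1,I2)
# clause, as suggested in the solution above.
fix_omp() {
  sed 's/!\$OMP+DEFAULT(SHARED)/!$OMP+SHARED(I1,I2)/'
}

echo '!$OMP+DEFAULT(SHARED)' | fix_omp
```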
2012年5月8日 星期二
SPEC OMP 2001
Name Remarks
310.wupwise_m Quantum chromodynamics
312.swim_m Shallow water modeling
314.mgrid_m Multi-grid solver in 3D potential field
316.applu_m Parabolic/elliptic partial differential equations
318.galgel_m Fluid dynamics: analysis of oscillatory instability
320.equake_m Finite element simulation; earthquake modeling
324.apsi_m Solves problems regarding temperature, wind, velocity and distribution of pollutants
326.gafort_m Genetic algorithm
328.fma3d_m Finite element crash simulation
330.art_m Neural network simulation; adaptive resonance theory
332.ammp_m Computational Chemistry
http://www.spec.org/omp2001/docs/runspec.html
2012年4月27日 星期五
QEMU memory problem:
To simulate 32-bit in a 64-bit environment, we can use MAP_32BIT to force mmap to allocate memory addresses within 4G.
However, according to http://lxr.free-electrons.com/source/arch/x86/kernel/sys_x86_64.c, we can only get memory between 0x40000000 and 0x80000000.
atomic region
This project aims to efficiently handle guest instruction exception through atomic region (software approach).
Atomic region: blocked execution mode?
what is blocked execution mode
Installing Debian ARM under QEMU
In this post I will explain how to install Debian Armel under QEMU.
Well, there are various reasons to have an ARM Linux distro inside a virtual machine. One of them is to have a test environment to validate your programs before releasing to an embedded Linux (ARM) target.
Debian was chosen because it is the best-supported ARM distro and you will have an environment compatible with your embedded system (eglibc).
Hey! Ho! Let’s go:
- Download an ARM kernel and an initrd image from the debian.org FTP site (I chose the squeeze/testing flavor):
wget http://ftp.debian.org/debian/dists/testing/main/installer-armel/current/images/versatile/netboot/vmlinuz-2.6.32-5-versatile
wget http://ftp.debian.org/debian/dists/testing/main/installer-armel/current/images/versatile/netboot/initrd.gz
- Create a disk image (please create a raw disk! It will be useful in the future):
qemu-img create -f raw debian.img 10G
- Start Debian image using qemu:
qemu-system-arm -m 256 -M versatilepb -kernel vmlinuz-2.6.32-5-versatile -initrd initrd.gz -hda debian.img -append "root=/dev/ram"
- After installing Debian to the disk image, mount its contents:
sudo kpartx -av debian.img
sudo mount /dev/mapper/loop1p1 ./mnt/ -o loop
PS: I installed Debian with only one root partition (/dev/sda1 is my root filesystem). kpartx is needed because my disk image has 2 partitions (root filesystem and swap)
- Copy the initrd image and the kernel to your system (outside of mountpoint). These two files will be used to start our Debian installation:
cp ./mnt/boot/initrd.img . cp ./mnt/boot/vmlinuz-2.6.32-5-versatile .
- Now, we can start Debian image installed in the disk image:
qemu-system-arm -m 256 -M versatilepb -kernel vmlinuz-2.6.32-5-versatile -initrd initrd.img -hda debian.img -append "root=/dev/sda1"
In the next posts I will talk about ARM architecture, toolchains, cross compilers and embedded linux. ;)
Peter Maydell gave an explanation on 2010-11-29:
http://comments.gmane.org/gmane.comp.emulators.qemu/86388
‘versatilepb’ also supports only 256MB of RAM, for the
same reason (system registers starting at 0x10000000).
You might try one of the realview models, which have a
special case for putting more RAM at a high memory address.
PARSEC
NOT RUN Benchmark:
ferret: can build, can run
raytrace: cannot build, need 32bit libXmu libX11 libGL libGLU
vips: cannot build, python2.6 cannot agree with LONG_BIT:
use a list to
In file included from /usr/include/python2.6/Python.h:58,
from /local/tk/research/llvm-qemu/benchmarks/parsec-2.1/pkgs/libs/libxml2/src/python/libxml_wrap.h:2,
from /local/tk/research/llvm-qemu/benchmarks/parsec-2.1/pkgs/libs/libxml2/src/python/types.c:9:
/usr/include/python2.6/pyport.h:685:2: error: #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
========================================================
All 13 benchmarks except vips can build successfully
we have run 8 benchmarks; 4 benchmarks remain to try:
ferret, raytrace, x264, dedup
raytrace, x264, OK
ferret and dedup cannot run
2012年4月23日 星期一
2012年4月19日 星期四
helper functions
8 $adc_cc
1991 $add_cc
71 $clz
885 $cpsr_read
1547 $cpsr_write
773 $exception
21 $get_cp15
8 $get_user_reg
36 $sar
44 $sbc_cc
122 $set_cp15
29 $set_user_reg
479 $shl
62 $shl_cc
191 $shr
8 $shr_cc
32183 $sub_cc
1 $wfi
2012年3月11日 星期日
mibench http://www.eecs.umich.edu/mibench/
coremark http://www.coremark.org/home.php
i73,
TRACE_MERGE=1 TRACE_MERGE_USE_DBO=1 FUNCTION_OPT=0 TRACE_OPT=1 NUM_TRACE_WORKER=3 TRACE_MERGE_NUM_TARGETS=2
hmmer 1547 1254 -18.939883645766, really bad!
Wrong setting, indirect exit handling is used!
2012年3月9日 星期五
build 32-bit PARSEC binaries on 64-bit host
Here are the directions I use to build 32-bit PARSEC binaries on 64-bit
versions of Fedora/RHEL. I don't guarantee that these are the most
efficient directions for getting everything to compile properly, but
they work for me.
Note that I temporarily replace 'uname', which requires root access.
There were a number of scripts (I don't remember which ones) that use
uname to detect the architecture. If it returns x86_64, these scripts
will always attempt to compile the 64-bit versions of libraries (or
whatever). If you don't have root access, you'll have to find where
these checks are and replace them yourself. :)
1. Modify the GCC build config:
* Open ./config/gcc.bldconf
* Change CC_HOME="/n/fs/parsec/local/gcc-4.4.0-static" to CC_HOME="/usr"
* Change BINUTIL_HOME="/usr/local" to BINUTIL_HOME="/usr"
* Make sure GNUTOOL_HOME is set to ="/usr"
* Make sure BINARY_PREFIX is set to =""
* Add '-m32' to CFLAGS, CXXFLAGS, CPPFLAGS, CXXCPPFLAGS, and LDFLAGS
* Add in 'export INCLUDES="-m32"'
* Remove '-L${CC_HOME}/lib64' from LDFLAGS
2. Change the environment variable HOSTTYPE to i386
* In bash: 'HOSTTYPE=i386' 'export HOSTTYPE'
* In csh: 'setenv HOSTTYPE i386'
3. Make sure /usr/lib/libXmu.so and /usr/lib/libX11.so exist. If they don't:
* 'ln -s /usr/lib/libXmu.so.6 /usr/lib/libXmu.so'
* 'ln -s /usr/lib/libX11.so.6 /usr/lib/libX11.so'
4. Make a copy of uname
* 'sudo mv /bin/uname /bin/uname.orig'
5. Make a wrapper shell script to make uname return i686 instead of x86_64:
* Open a new file /bin/uname and add in:
#!/bin/sh
/bin/uname.orig $* | sed 's/x86_64/i686/g'
* 'sudo chmod a+x /bin/uname'
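The wrapper's filter can be checked without touching /bin/uname; the sed expression is the same one used in the script above (the sample input is just an illustration):

```shell
# Simulate the wrapper: whatever uname prints, x86_64 becomes i686.
echo 'Linux host 2.6.32 x86_64 GNU/Linux' | sed 's/x86_64/i686/g'
```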
6. Change ./pkgs/apps/facesim/src/TaskQ/lib/Makefile 'CXXARGS' to 'CXXFLAGS'
7. Modify ./pkgs/libs/ssl/src/Configure.pl
* Add '$cflags .= " -m32 ";' to line 976 (below the big list of "my" variable declarations).
8. Change ./pkgs/libs/mesa/src/configure line 4685 to 'asm_arch=x86'
9. Change ./pkgs/libs/glib/src/configure line 40390 to 'G_ATOMIC_I486'
10. Run compilation
* ./bin/parsecmgmt -a build -p all -c gcc
* Note that you'll need to do "./bin/parsecmgmt -a build -p freqmine
-c gcc-openmp" if you want freqmine to compile, as it doesn't use pthreads.
11. Return the original uname
* 'sudo mv /bin/uname.orig /bin/uname'
Good luck!
-Joe
Added by TK:
In parsec-2.1/pkgs/libs/ssl/parsec/gcc.bldconf, change line 20 to build_env="PATH=\"${PATH}\""
If something goes wrong, try not to use -jxxx when running make
2012年1月6日 星期五
2012 January 6
- bwaves: performance drops 10%~15% after adding the volatile modifier to load/store,
- the possible cause should be related to guest CPU FP register RLSO.
- both trace and procedure have the same effect.
- so far, we only see differences due to code motion between these two versions; perhaps we should look at the generated code.
- we have observed over 10% more memory loads for the volatile version
- it is difficult to find the exact structural difference.
- so, observe floating point operations: use FP_COMP_OPS_EXE:X87
- no difference in the number of floating point operations
- the increased memory operations should be the cause of the performance degradation with volatile.
- 20.1% performance degradation with volatile; 1390 -> 1661
- 11.86%, i.e. 305,180,492,006, extra memory loads
- 16.35%, i.e. 216,346,459,794, extra memory stores
- RUN both x86 CINT CFP, ARM CINT benchmarks again before
2011年11月16日 星期三
- vector?
- development in i722, gogogo
- vector state load/store elimination succeeded! need to clean up!
- Need to implement NEON guest instructions
- arm host?
- need more time
- development at haru
- partial behavior?
- why no effect? why good? why bad?
- seems no burst anymore; do research
- do it now 2011-11-21 14:46
- it seems partial inlining failed.
- astar is good
- hmmer is extremely bad! It is true!
- need to know which functions are partially inlined and see why they perform badly!!
- somehow, it degrades to trace mode. It runs mostly in traces but not in functions.
- side exit? NO! it's a stupid bug: I forgot to link "RETURN" back to the procedure
- using stack to find call path
- but we can find early exits in other benchmarks, such as perlbench.
- NEED to re-run ARM benchmarks
- debug tonto - segfault, should be easy to fix
- 11:28
- done!!!
- Sometimes it is difficult to say it's easy! when
- No! don't open stupid tab in browser!
2011年10月31日 星期一
Experiments
- i722 run NET Trace baseline, with workers 1 2 3, for x86 guest ISA
- i722 run NET Trace baseline with workers 12 3, for arm guest ISA
2011年10月27日 星期四
TODO
- remove unnecessary entries when there exists a trace that goes back to this entry point
- gather the number of functions translated for each benchmark;
- if there are multiple inputs, calculate the average
- use google doc to gather these data
- trace-guided layout has no influence on performance, why?
- maybe wrong benchmarks; I've tried gcc, which one should I try?
QUESTION:
- GCC: the trace method performs worse than the method mode; evidence shows that T-M used more than 4 times as many IBTC lookups as the method mode. WHY??? The problem may be in the transition between
- Running test inputs and output trace profile; see what we got...
- shit happens: how to stop the optimization thread when the execution thread logs out...
- Any IDEA?
- It is because there are too many traces/methods to build, so we still spend too much time on block fragments.
TODO:
1' prove that FP has been fixed
2' prove trace-guided layout works
3' prove partial inline works
4' prove chained partial inline works
Question:
gcc is strange: 1. is it trace-threshold-sensitive? 2. is it function-threshold-sensitive?
2011年8月9日 星期二
TFTP + NFS boot openrd-ultimate
1, install Debian in USB as described in http://www.cyrius.com/debian/kirkwood/openrd/install.html
2, after successful installation, log in to the Debian system
3, edit /etc/initramfs-tools/initramfs.conf, find "BOOT=local", change it to "BOOT=nfs";
4, issue "update-initramfs -u" to get new /boot/uImage and /boot/uInitrd.
5, shutdown the system, copy the base system to /opt/openrd/nfs, which is exported via NFS.
5.1, Edit /opt/openrd/nfs/etc/fstab to comment out all local mounts, since there is no local file system after NFS boot. Otherwise, Debian would try to mount the root file system again after NFS boot
6, copy /opt/openrd/nfs/boot/{uImage,uInitrd} to /opt/openrd/tftp/{uImage,uInitrd}
7, start Openrd-Ultimate, enter u-boot, set environments
set mainlineLinux yes
set arcNumber 2884
set ipaddr=192.168.1.2
set serverip=192.168.1.1
set console 'console=ttyS0,115200n8'
set nfs 'mtdparts=orion_nand:0x400000@0x100000(uImage),0x1fb00000@0x500000(rootfs) rw root=/dev/nfs rw nfsroot=192.168.1.1:/opt/openrd/nfs'
set ip 'ip=192.168.1.2:192.168.1.1:192.168.1.1:255.255.255.0:DB88FXX81:eth0:none'
set bootargs $(console) $(nfs) $(ip)
set bootcmd 'tftpboot 0x01100000 uInitrd; tftpboot 0x00800000 uImage; bootm 0x00800000 0x01100000'
saveenv
reset
8, After restart, you should be able to use tftp to load uImage and uInitrd, and mount the NFS directory /opt/openrd/nfs as the root filesystem
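Step 5.1 can be done with sed; a sketch (it assumes the local mounts in the exported tree's fstab are exactly the lines starting with /dev/):

```shell
# Comment out local /dev/ mounts in the NFS-exported fstab. Pipe the file
# through this filter, e.g.:
#   comment_local_mounts < /opt/openrd/nfs/etc/fstab
comment_local_mounts() {
  sed 's|^/dev/|#&|'
}

printf '/dev/sda1 / ext3 defaults 0 1\nproc /proc proc defaults 0 0\n' | comment_local_mounts
```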
2011年8月4日 星期四
libquantum problem
libquantum has performance lost
why?
Check ibtc and shack hit ratio
they are the same!
don't know
check all version!
2011年8月3日 星期三
segmentation fault:
1' where?
0x00007fffe0000158
0x00007fffe8000168
0x00007fffe8000078
0x00007fffe0000158
This address seems strange
jit_event_listener
try debug version
turn off AddShackPush in trace, wait and see...
still seg fault, re-implement
should always try both single thread version, multi-thread version
Reason:
timing plus an inappropriate block_map
when translating a block, the block info is inserted into block_map
at the beginning of the translation process, which violates the assumption of the block map.
We assume all infos in the block map are valid, which means they have the host code address where the translated code is located. However, at the beginning of translation, we don't yet know the exact address of the translated block. As a result, when the trace builder does AddShackPush, it finds the incomplete block in the block map and uses its address.
Therefore, we modified the code as follows:
1. block info is added immediately after the location of the translated block is known, which is in NotifyFunctionEmitted of jit_event_listener.h.
2. the query is delayed until we are going to patch the shack point.
However, in this experiment, I found the shadow stack is quite ineffective for performance;
plus it consumes more memory and makes the code complex. It seems more reasonable not to use shack when traces are available.
2011年7月11日 星期一
Debug cactusADM. Died in signal 9.
Cannot move on
Keep gen trace_
1. add print_pc
2. assert re-generated trace
3. possible reason:
somehow it traps within a block...., why?
self loop?
Fight!
Timing!
why....
a -> b -> c -> c -> c -> c -> c -> c ...
a->b->c->b->c->b->c->b->c->b
Possibility:
1' trace gen die. or no response.
which ca
Trace: Path/Cycle with duplicate nodes
CFG: Graph with unique nodes
Need convert Trace to Graph
2011年2月8日 星期二
2011年1月5日 星期三
2010年9月21日 星期二
2010年7月31日 星期六
2010年7月29日 星期四
TODO
1' Debug cc lazy; running perl fails; chop.t
2' Do block linking; let the jitter recompile code;
3' profile time; use the previous profile time framework; use ticks as time info
4' add entry code to profile block execution; can we have a generic framework for adding and retrieving
statistical info? how about using a MACRO, with a tag to indicate the type of statistic, such as invoke times,
execution time
2' do block linking; let the jitter recompile code at the same address.
2.1 Locate files where we should modify.
ExecutionEngine/JIT/JITEmitter.cpp
two files
add recompileFunction in JIT.cpp
should
================================
No use; in JITEmitter::startFunction(), CurBufferPtr will be reset;
Need to implement our own MemoryManager
I can assume that I am the only user of this jitter, so I can design a memory manager just for my own use.
Go home!
2010年7月26日 星期一
Fuck!
TODO:
1. cc-lazy still has trouble running perl, need debug!
2. do block linking
block linking:
1' add an exit block;
2' modify that exit block
change to 20100726
addPointerToBasicBlock doesn't work
it dies at DominatorPass, DFSPass,
I don't know why, but since that function cannot be used, I don't have any reason to use llvm-2.8svn
change back to llvm-2.7svn
use svn command:
svn update
svn merge -r HEAD:7381
svn commit -m "Roll back to llvm-2.7svn"
2010年7月21日 星期三
Hot trace model
today, you need to figure out the model of building a hot trace,
and you will have a meeting with Prof. Liu to discuss it.
First, collect Prof. Liu's posts on hot traces, quoted as follows.
maybe we can add possibility into this problem.
We encountered a hot-trace-finding problem last Thursday. (The discussion below does not include the possibility of conditional branches.) Basically
this problem asks for a good hot trace so that a large number of
execution sequences will remain in the hot trace. Formally we are
given a directed graph G (call graph?) and a set of simple paths P in
G. Each simple path is a sequence of at least two different nodes in
G so that consecutive nodes in the path have a directed edge in G. Note
that we assume that the nodes in a simple path must be different. I am
not sure if this is the case in our system but let us assume it for
the moment. Now we want to find another simple path h (this is the hot
trace we are looking for) so that at least k of the paths in P are
contained in H. A path p is contained in another path q if and only if
q is a substring of p. Now we can define the problem -- given G, P,
and k, is there a hot trace h that cotains at least k paths in P? We
will refere to this problem as k-HOT-TRACE.
It seems that k-HOT-TRACE is NP-complete by reducing from Hamiltonian
path. Given a graph H, we would like to ask whether there is a
Hamiltonian path in H; we simply transform the problem instance into a
k-HOT-TRACE problem instance. We simply use H as G, and let P be the
set of paths of two nodes, i.e., all edges in G, and set k to be the
number of nodes in G minus 1. As a result, if there is a solution for
the (n-1)-HOT-TRACE problem for G, there is a Hamiltonian path in H,
and vice versa.
Two followups may be possible. Anyone interested in these please come
talk to me.
There could be an efficient dynamic programming solution when G is a
tree, even when we further restrict the length of the hot trace.
This seems easy. Now the problem is that we limit the length of the
hot trace, and try to find the one that contains the maximum number of
paths. We define a function P(v, l) to be the number of paths that are
contained by the hot trace that ends at v and has length l. It is easy
to write down the recursive formula P(v, l) = P(w, l-1) + the
number of paths that end at v and have length no more than l, where w
is the parent of v. We have N times L cells to fill, where N is the
number of tree nodes and L is the maximum hot trace length allowed.
Each cell needs no more than \log n(v) operations where n(v) is the
number of paths that end at v. Roughly the total cost is no more than O(N L
\log(N)). Of course this is very rough and I need to work harder to
get a closer bound.
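Restating the quoted recurrence in symbols (w is the parent of v, P is the given path set, and the second term counts the given paths that end at v with length at most l):

```latex
P(v, \ell) \;=\; P(w, \ell-1) \;+\; \bigl|\{\, p \in P \;:\; p \text{ ends at } v,\ |p| \le \ell \,\}\bigr|
```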
There could be a dynamic programming solution when G is a series-parallel graph.
FROM TK:
I think the function P(v, l) should be defined as the MAXIMUM number of paths that are
contained by a hot trace that ends at v and has length l, since there is more
than one trace that ends at v and has length l.
Also, about the recursive function, it seems the formula only covers "one half" of the cases.
It only considers hot traces that pass through w and end at v.
However, there could be other ended-at-v-with-length-l hot traces that are contained only within the subtree rooted at v,
rather than passing through v's parent.
However, I still need some time to figure out the correct recursive function; come back later.
I write down the problem and recursive function for tree in http://www.iis.sinica.edu.tw/~tk/k-hot-path.pdf
Chun-Chen
NOTE: FROM ICE
In our system based on Qemu with LLVM, we generate the traces (i.e.
dynamic basic blocks)
which have only one entrance on the head of these blocks, so far.
Moreover, especially we want to translate a hot trace with the
optimization options of LLVM such that
the translated instructions may be reordered.
Consequently, the host code of the hot traces cannot have multiple
entrances, either.
In other words, different jump target addresses imply different traces.
Therefore, the same (static) basic blocks may be included in different
hot traces.
Furthermore, the assumption, "the nodes in a simple path must be
different", may not be suitable for our system.
Our approach prefers to construct the hot traces whose lengths are as
long as possible
since we believe that longer traces have more opportunities to be
optimized.
However, longer traces will incur more overhead to generate
optimized code.
We should limit the length of a hot trace.
Considering the definition of a hot trace, we should consider the
frequency of the execution.
We should not only find the longer hot traces on the CFG but also make
the hot code segments included in the hot traces.
Due to these reasons,
I think that the hot trace problem should be modified as follows to
match the practice of our system.
We still have a directed graph G.
A node in G represents a (static) basic block.
An edge connecting two nodes in G represents that one of these two nodes
may be executed after the other one is executed.
An edge in G has a weight to represent the frequency of the execution
between the two nodes connected by this edge.
We limit the lengths of hot traces under a given k to consider the
overhead of generating optimized code.
We want to find the minimum number of the hot traces with length
constraint to include the hot edges
where hot edges are the edges which have the weight larger than a
given threshold.
(Note that the same nodes can be included in different hot traces.)
If we remove the weights on the edges but give each node a weight to
represent the execution frequency of this node, instead,
we may have another similar version of this problem as follows.
We want to find the minimum number of the hot traces with length
constraint to include the hot nodes
where hot nodes are the nodes which have the weight larger than a
given threshold.
ice
Sorry that I do not understand what you are saying here.
Do you mean you want to find a minimum set of length-limited paths
that cover a set of specified (hot) edges?
I will refine the model according to the description of ice. Correct
me if I am wrong.
We will have weight information on edges and nodes, instead of a set
of paths as I described earlier. So we will be given a directed graph
G, a constant l, and a constant k. The edges of G are partitioned into
two subsets -- important edges and unimportant edges. We want to find
a set S of at most k directed paths where each of them is at most of
length l, and the paths in S cover all important edges. We will call
this problem edge-covering-path-set problem. Similarly we can define
node-covering-path-set problem, where we partition nodes in G into
important and unimportant ones, and define the problem similarly.
It is easy to see that node-covering-path-set is NPC, since we simply
make every node important, set k to 1 and l to n - 1, where n is the
number of nodes in G; then we have a Hamiltonian path problem. I am
wondering if edge-covering-path-set is also NPC.
These two problems on tree should be easy as long as the all edges are
directed away from the root. However, this is only theoretically
interesting since in practice the graph will not be a tree.
The edge-covering-path-set problem seems to be NPC as well. Here is
the proof. We will reduce from Hamiltonian path again. Given a
directed graph H as an instance of Hamiltonian path problem, we
construct G as follows. Every node in H splits into two nodes
called head and tail. We then add a directed edge (called an internal
edge) from head to tail. Now if there is an edge from v to w in H, we
put a directed edge from the tail of v to the head of w. Now we make
all internal edges "important", l to be 2n - 1, where n is the number
of nodes in H, and k = 1. It should be easy to see that a hot path in
G will be a Hamiltonian path in H, and vice versa.
We have three problems now -- information given as paths, on the nodes,
or on the edges. All of them are NPC, but all of them are solvable on
trees, with edges going away from the root. The path version can be
solved by dynamic programming, and the node and edge versions can be
solved by greedy methods.
2010年6月25日 星期五
2010年6月18日 星期五
2010年6月3日 星期四
build hadoop
2010/06/03 build hadoop, problem:
ivy-download:
/home/tk/Software/hadoop/hadoop-svn-0.20.2/build.xml:1630: java.net.SocketException: Network is unreachable
This post mentioned that:
Setting bindv6only to 0 in /etc/sysctl.d/bindv6only.conf on my debian squeeze installation seems to have fixed the problem. Sorry for littering the list.
This actually is a special issue in the Debian squeeze/sid version, as explained in another post.
So what to do is:
1. Setting bindv6only to 0 in /etc/sysctl.d/bindv6only.conf
2. restart procps: sudo invoke-rc.d procps restart
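The two steps above can be sketched as follows (the block writes to a temp dir for safety; point DEST at /etc/sysctl.d, as root, to apply for real; the key name net.ipv6.bindv6only is an assumption, so check your distro's docs):

```shell
# Step 1: write the sysctl override (demoed in a temp dir; use
# DEST=/etc/sysctl.d with root privileges on a real system).
DEST=$(mktemp -d)
printf 'net.ipv6.bindv6only = 0\n' > "$DEST/bindv6only.conf"
cat "$DEST/bindv6only.conf"
# Step 2 (real system only): sudo invoke-rc.d procps restart
```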
now we have a problem with the Java version
Tuesday, June 1, 2010
Tuesday, May 25, 2010
What's the major challenge of a retargetable binary translator?
I think the binary translator currently does not have enough tools to run optimizations.
What's the benefit of converting binary code to high-level intermediate code?
Now we have LLVM tools to manipulate binaries at runtime,
and to map guest registers to host registers at runtime.
gogogogogogogogogogogogogogogogogo
Monday, May 3, 2010
Saturday, April 24, 2010
Monday, April 12, 2010
QEMU in user mode uses a static code gen buffer; its size is 32 MB.
In ./exec.c line 408 it is mentioned:
"Currently it is not recommended to allocate big chunks of data in
user mode. It will change when a dedicated libc will be used"
But I don't know why. Why not use mmap as system mode does?
First set up an exit basic block, then remove the pass.
Now focus on indirect branches; what we need to know is what the address of the first argument is.
FP->getPassName()
lib/CodeGen/StackProtector.cpp
DefaultJITMemoryManager
lib/CodeGen/LLVMTargetMachine.cpp:
LLVMTargetMachine::addPassesToEmitMachineCode
LLVMTargetMachine::addCommonCodeGenPasses
=============================
Prevent spill register code gen.
Enable emitting the epilog for branches and indirect branches.
=============================
The error is here: 0x40095872: mov (%edx),%eax
%edx is the address of init_stack+c; the value in it
is already placed there in loader_exec!
After create_elf_tables()!
After loader_build_argptr()!
After put_user_ual(stringp, envp)!
Segment selector format:
| 15 ...... 3 | 2  | 1 0 |
| index       | TI | RPL |
- TI: Table indicator
  - 0 means the selector indexes into the GDT
  - 1 means the selector indexes into the LDT
- RPL: Privilege level. Linux uses only two privilege levels.
  - 0 means kernel
  - 3 means user
Examples:
- Kernel code segment
- TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
- User data segment
- TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])
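As a sanity check, the selector value can be computed from the three fields (a small illustrative sketch, not code from QEMU or Linux):

```python
def make_selector(index, ti, rpl):
    """Pack a segment selector: index in bits 15..3, TI in bit 2,
    RPL in bits 1..0."""
    assert ti in (0, 1) and 0 <= rpl <= 3
    return (index << 3) | (ti << 2) | rpl

# Kernel code segment: GDT[1], RPL 0 -> 0x08
assert make_selector(1, ti=0, rpl=0) == 0x08
# User data segment: LDT[2], RPL 3 -> 0x17
assert make_selector(2, ti=1, rpl=3) == 0x17
```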
Wednesday, April 7, 2010
Tuesday, April 6, 2010
What are the meanings of the CF, OF, and AF flags in EFLAGS?
CF: This flag indicates an overflow condition for unsigned-integer arithmetic.
Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise.
OF: This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.
Set if the integer result is too large a positive number or too small a negative number (excluding the sign bit) to fit in the destination operand; cleared otherwise.
AF: This flag is used in binary-coded decimal (BCD) arithmetic.
Set if an arithmetic operation generates a carry
or a borrow out of bit 3 of the result; cleared otherwise.
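For example, the three flags for an 8-bit addition can be derived as follows (an illustrative sketch of the definitions above, not how any emulator implements them):

```python
def add8_flags(a, b):
    """Add two 8-bit values and derive CF, OF, and AF."""
    r = (a + b) & 0xFF
    cf = (a + b) > 0xFF                     # carry out of bit 7 (unsigned overflow)
    # signed overflow: both operands share a sign that the result does not
    of = ((a ^ r) & (b ^ r) & 0x80) != 0
    af = ((a ^ b ^ r) & 0x10) != 0          # carry out of bit 3 (BCD)
    return r, cf, of, af

# 0x7F + 0x01 = 0x80: no unsigned carry, but signed overflow
# (127 + 1 does not fit in a signed byte), and a carry out of bit 3.
r, cf, of, af = add8_flags(0x7F, 0x01)
```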
Thursday, April 1, 2010
Friday, March 26, 2010
I summarize our discussion as follows.
As stated in the CACM article, the single master may pose some problems when a tremendous amount of data is considered.
For example, the master is unable to store all the metadata in the memory.
In addition, the master would become the bandwidth bottleneck because it has to handle a large number of connections.
Google now also wants to deal with this issue, and they propose to use multiple masters.
In their proposal, a file will be related to one of the masters in a static/unchangeable way.
Besides, it seems that they do not consider the load balance problem of masters in their design.
So, the following is our design, in which the load balance and bandwidth problems are simultaneously solved.
Assume that we have k masters, each of which takes care of certain chunkservers.
Each master stores k counting bloom filters, among which the i-th bloom filter represents the association between the i-th master and the file-chunk mappings it is in charge of.
Note that each master broadcasts its association to the other masters so that each master can renew the counting bloom filters it stores.
When a client wants to access a file, what it should do is to randomly pick a master and then to query the corresponding k counting bloom filters.
If no hit occurs in any of the filters, it means the file does not exist.
Otherwise, the client turns to ask the master that is in charge of the file the client would like to access.
Of course, there could be cases where more than one master appears responsible for the file the client would like to access, according to the responses of the filters.
This is due to the nature of bloom filters, and it can be easily mitigated and handled.
Increasing the filter size reduces this false-hit rate.
Simply asking those masters to confirm resolves the ambiguity.
As a whole, this design can solve the memory problem because k masters share the workload.
This design can solve the bandwidth problem because of the randomness in the choice made by the client.
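A minimal sketch of the lookup described above (hypothetical structure; a real counting Bloom filter would tune its size and number of hash functions to the expected load):

```python
import hashlib

class CountingBloomFilter:
    """A tiny counting Bloom filter: counters (instead of bits) allow
    removing an entry when a file's metadata is deleted or migrated."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.counts = [0] * size

    def _slots(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for s in self._slots(key):
            self.counts[s] += 1

    def remove(self, key):
        for s in self._slots(key):
            self.counts[s] -= 1

    def __contains__(self, key):
        return all(self.counts[s] > 0 for s in self._slots(key))

# Each master keeps one filter per master; a client queries any
# master's k filters to find which master owns a file.
filters = [CountingBloomFilter() for _ in range(3)]  # k = 3 masters
filters[1].add("/data/log.0001")
owners = [i for i, f in enumerate(filters) if "/data/log.0001" in f]
```

A hit in more than one filter is the false-positive case discussed above, resolved by asking the candidate masters directly.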
I think we can raise the following questions:
1. Why do the engineers at Google insist that, in the single-master case, all the metadata should be stored in memory on the master?
A: First, storing some of the metadata on disk may slow down the overall access speed. Second, doing so still cannot solve the bandwidth problem.
2. The CACM article briefs Google's plan for the multiple-master setting in GFS. Do you have any idea about their design? Besides, do you have any idea how to coordinate multiple masters?
A: (We can describe our idea above. I drew a conceptual picture. Perhaps we can attach this picture to the question-raising mail if you would like to.)
The following are some questions I personally want to ask:
1. Memory and bandwidth problems can be simultaneously solved in the multiple-master setting. However, the multiple-master setting has already been described, though very briefly, in the CACM article. So, I think what we should do is detail how the masters work using our idea?
2. Although the namespace partition file divides the metadata in a static way, we may still directly modify it so that, after a specific time, the metadata that belonged to master 2 belongs to master 3. So, maybe it is also dynamic?
Q1:
In the CACM article, they mention a "static namespace partition file" approach to enable multiple masters over a pool of chunkservers.
The description, however, is very rough and we are very interested in it.
So, we would like to describe that idea more clearly and identify its pros and cons in class.
Answer to Q1:
First, a static namespace partition file is just a mapping table of directory <-> master ID.
Clients will read this table to find the "right" master.
Masters write to this table on directory creation requests:
the corresponding master adds an entry to the table when a directory creation request arrives.
And we think the client chooses a master randomly to serve this request.
The content of this table must be consistent across all clients and masters.
Therefore, we think the reasonable approach is that only masters hold this table.
The scenario for a read request should be as follows.
1. Whenever a client sends a read request, it randomly chooses a master and sends the request to it.
2. That master looks up the table for the corresponding master and redirects the request to it.
3. The corresponding master serves the read request and sends the result back to the client.
Therefore, each master has a "DISPATCH" functionality, i.e., it dispatches requests to the right master according to the mapping table.
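The dispatch step could look roughly like this (hypothetical names and table contents; the real design is not public at this level of detail):

```python
# Each master holds a replica of the namespace partition table.
partition_table = {
    "/logs": 0,      # directory -> master ID
    "/photos": 1,
    "/backups": 2,
}

def dispatch(path, this_master):
    """Find the master responsible for the directory containing `path`
    and report whether the request must be forwarded."""
    directory = path.rsplit("/", 1)[0] or "/"
    owner = partition_table[directory]
    return owner, owner != this_master

# A request for /photos/cat.jpg arriving at master 0 is forwarded
# to master 1, which owns the /photos directory.
owner, forward = dispatch("/photos/cat.jpg", this_master=0)
```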
Things needed to be pointed out:
1. Although they call it a namespace partition "file", we don't think this information is stored in a file (on disk).
It should reside in the memories of the masters to keep responses fast. After all, every request must consult this table.
2. The granularity of partitioning is "directory". There are pros and cons here.
Pro 1: Large granularity results in less synchronization, so the synchronization overhead can be small.
Pro 2: This granularity simplifies the modification of the master's design.
Con 1: Large granularity may result in load imbalance, both in terms of memory usage and served requests.
Once a directory is created, all metadata of the files and chunks in that directory will be stored on one master,
and all requests for that metadata will be directed to that one master.
3. Since they allow multiple masters over a pool of chunkservers, each chunkserver must maintain another table mapping chunk ID to master ID,
so that the chunkserver knows which chunk should be reported to which master.
Concluding remarks:
The "static namespace partition file" approach for multiple masters is just a workaround approach for GFS.
The two pros let the multiple-master design be completed quickly and relieve the burden of the single-master design.
However, the con described above may be a problem.
So, what about a finer granularity: "FILE"?
We will address this problem in our project proposal.
Q2:
Why is the namespace partition file "STATIC" rather than "DYNAMIC"?
Again, we would like to describe what we have discussed in the class.
Answer to Q2:
First, we give our definition about STATIC and DYNAMIC:
"STATIC" means once the entry (directory to master ID) is added into the file, this mapping won't change in the future.
On the other hand, "DYNAMIC" means a directory, say "/A", can be mapped to master 1 at a time, and then mapped to master 2.
The reason not to allow dynamic:
If we allowed the mapping to change, we would have to support metadata migration, i.e., moving metadata from one master to another.
Since the granularity of partitioning is the directory, one can imagine that there will be LOTS of metadata to migrate when the mapping changes.
So, the overhead of migration should be large, since all requests to that directory must be suspended until the migration is done, and the mapping information on the chunkservers must be changed accordingly.
We believe the reason they don't allow dynamic is that they want to simplify the design, since this is just a workaround approach.