Friday, March 26, 2010

I summarize our discussion as follows.
As stated in the CACM article, the single master may pose problems when a tremendous amount of data is involved.
For example, the master may be unable to store all the metadata in memory.
In addition, the master can become a bandwidth bottleneck because it has to handle a large number of connections.
Google also wants to address this issue now, and they propose using multiple masters.
In their proposal, a file is assigned to one of the masters in a static, unchangeable way.

Besides, it seems that their design does not consider load balancing among the masters.

So, the following is our design, which solves the load-balancing and bandwidth problems simultaneously.
Assume that we have k masters, each of which takes care of a certain set of chunkservers.

Each master stores k counting Bloom filters, where the i-th filter represents the file-to-chunk mappings that the i-th master is in charge of.

Note that each master broadcasts its associations to the other masters so that every master can refresh the counting Bloom filters it stores.

When a client wants to access a file, all it has to do is randomly pick a master and query that master's k counting Bloom filters.

If no hit occurs in any of the filters, the file does not exist.

Otherwise, the client turns to the master that is in charge of the file it wants to access.

Of course, according to the filters' responses, more than one master may appear to be responsible for the file the client wants.

This is due to the nature of Bloom filters, and it can easily be mitigated and handled.

Increasing the filter size reduces the rate of such false hits.

Simply asking the candidate masters to double-check resolves the ambiguity.

As a whole, this design solves the memory problem because the k masters share the workload.

It solves the bandwidth problem because of the randomness in the clients' choice of master.

I think we can raise the following questions:

1. Why do the Google engineers insist that, in the single-master case, all metadata be stored in the master's memory?
A: First, storing some of the metadata on disk may slow down overall access. Second, doing so still cannot solve the bandwidth problem.

2. The CACM article briefly describes Google's plan for a multiple-master setting in GFS. Do you have any idea about their design? Also, do you have any idea how to coordinate multiple masters?
A: (We can describe our idea above. I drew a conceptual picture. Perhaps we can attach it to the question-raising mail if you would like to.)

The following are some questions I personally want to ask:
1. Memory and bandwidth problems can be solved simultaneously in the multiple-master setting. However, that setting has already been described in the CACM article, though very briefly. So I think what we should do is detail how the masters work, using our idea?

2. Although the namespace partition file partitions the metadata statically, we may still modify the file directly so that, after a specific point in time, the metadata that belonged to master 2 belongs to master 3. So maybe it is also dynamic?

Q1:
In the CACM article, they mention a "static namespace partition file" approach to enable multiple masters over a pool of chunkservers.
The description, however, is very rough, and we are very interested in it.
So, we would like to describe that idea more clearly and identify its pros and cons in class.

Answer to Q1:

First, a static namespace partition file is just a
mapping table of directory <-> master ID.
Clients read this table to find the "right" master.
A master writes to this table when it handles a directory-creation request.


The corresponding master adds an entry to this table
when a directory-creation request arrives.
And we think the client chooses a master randomly to serve this request.

The content of this table must be consistent across all clients and masters.
Therefore, we think the reasonable approach is that only the masters hold this table.
The scenario for a read request should then be as follows.
1. Whenever a client sends a read request, it randomly chooses a master and sends the request to it.
2. That master looks up the table for the responsible master and redirects the request to it.
3. The responsible master serves the read request and sends the result back to the client.
Therefore, each master has a "DISPATCH" functionality, i.e., it dispatches requests to the right master according to the mapping table.


Things that need to be pointed out:
1. Although they call it a namespace partition "file", we don't think this information is stored in a file (on disk).
It should reside in the masters' memories to keep responses fast. After all, every request must consult this table.

2. The granularity of the partition is a "directory". There are pros and cons here.
Pro 1: Coarse granularity means less synchronization, so the synchronization overhead can be small.
Pro 2: This granularity simplifies the modification of the master's design.

Con 1: Coarse granularity may result in load imbalance, both in memory usage and in served requests.
Once a directory is created, all metadata of the files and chunks in that directory
is stored on one master,
and all requests for
that metadata are directed to that one master.

3.
Since they allow multiple masters over a pool of chunkservers, each chunkserver must maintain another table mapping chunk ID to master ID,
so that the chunkserver knows which master each chunk should be reported to.

Concluding remarks:
The "static namespace partition file" approach to multiple masters is just a workaround for GFS.
The two pros above let the multiple-master design be finished quickly and relieve the burden of the single-master design.
However, the con described above may be a problem.
So, what about a finer granularity: "FILE"?
We will address this problem in our project proposal.




Q2:
Why is the namespace partition file "STATIC" rather than "DYNAMIC"?
Again, we would like to describe what we have discussed in class.

Answer to Q2:

First, we give our definitions of STATIC and DYNAMIC:
"STATIC" means that once an entry (directory -> master ID) is added to the file, the mapping never changes.
"DYNAMIC", on the other hand, means that a directory, say "/A", can be mapped to master 1 at one time and to master 2 later.

The reason dynamic mapping is not allowed:
If we allowed the mapping to change, we would have to support metadata migration, i.e., moving metadata from one master to another.
Since the partition granularity is a directory, one can imagine that a LOT of metadata would need to migrate whenever a mapping changed.
The migration overhead would be large: all requests to that directory must be suspended until the migration is done, and the chunkservers' mapping information must be changed accordingly.
We believe the reason they don't allow dynamic mapping is that they want to keep the design simple, since this is just a workaround.





























Thursday, March 25, 2010

Master problems in GFS:
1' memory usage problem
2' bandwidth bottleneck
3' single point of failure

Suppose we use two metadata servers (MSes).
We would like to have:
1' load balance: the metadata is evenly distributed among the MSes

Approach:
Use a dispatcher: each query request is sent to a dispatcher, which decides which MS
is going to serve the request.

But then we have a single-dispatcher problem.
So our approach is to embed the dispatch function into each MS.

A user who wants to create a file sends the creation request to an arbitrary MS.
That MS first decides which MS is going to serve the request,
using one mysterious (yet-to-be-designed) algorithm. Say it decides MS1 should serve it.
So it redirects the creation request to MS1.

As long as we can distribute creation requests evenly, we can achieve load balance.

Wednesday, March 24, 2010

I need:
1. All the instructions used by each SPEC CPU app.

Can the translation function for an ordinary instruction avoid handling the condition codes?
Is that possible? This needs more thought.

Monday, March 22, 2010

What else do I need?
What I need is:
1. A test program.

2. A list of the opcodes and operand values it uses.

3. The opcodes at which qemu stops translating, and how qemu handles them.

4. Is the current code-gen overhead high?

5. How do I make sure a translation function is written correctly?

exec.c:910
tb_link_phy_page??????

Today's progress:
1. Finish the qemu results for the test case.
2. Verify the correctness of the qemu calls completed so far!
How do I verify correctness?
My original idea was to use qemu's output as the ground truth,
i.e., compare the instructions after translation.
But with that method, anything that touches memory is bound to hit a segmentation fault.
That should be fine, right?

Another issue concerns the log file: exec.c has logfilename, and main.c has DEFAULT_LOG_FILE.
I need to be able to specify the file name.

When is the log file opened?
I added a logfile option, but it must come after -d. It must!

There are 58 instructions among the qemu stop conditions:

0059 addl $1000, %eax
0065 addl $1, %eax
0066 addl %gs:0, %edx
0272 call *%gs:16
0273 call *%eax
0277 call -100149
0467 cmpsb
0768 imull -4740(%ebp)
0809 int $128
0912 jae 1
0913 jae -135
0914 ja 1
0915 ja -1163
0916 jbe 100
0917 jbe -1281
0918 jb 10
0919 jb -156
0920 jcxz 44
0921 je 1
0922 je -10119
0923 jge 107
0924 jge -1081
0925 jg 100
0926 jg -1047
0927 jle 10
0928 jle -1177
0929 jl 10
0930 jl -1285
0931 jmpl *100(%ebx)
0932 jmpl *%eax
0936 jmp 10
0937 jmp -1007
0938 jne 10
0939 jne -1012
0943 jnp -213
0944 jns 10
0945 jns -1546
0948 jp 13
0949 jp -196
0950 js 10
0951 js -1081
0987 leal (%eax,%eax), %ecx
1274 movl %eax, %gs:(%edx)
1280 movl %gs:(%eax), %eax
1281 movl %eax, %ebx
1371 movsb
1372 movsl
1424 movzwl (%eax), %eax
1469 fmuls (%eax,%ebx)
1982 ret
1983 ret $12
2059 sarl %eax
2158 shll $4, -116(%ebp)
2197 shrl $11, %edx
2264 stosb
2265 stosl
2302 subl (%eax), %ecx
2303 subl %eax, %ebx

Condition codes are a challenge.

Monday, March 15, 2010

The problem I'm facing now is debugging:
I need to know which step went wrong.
Changing
#LDFLAGS+=-Wl,-shared
in Makefile.target to
LDFLAGS+=-Wl,-T,$(SRC_PATH)/$(ARCH).ld
avoids producing a PIE binary, so gdb can be used.

Another problem I'm hitting now:
the address for jumping back to qemu is wrong!
That is, what I get from qemu is 0x62933f88,
but the address in the call instruction generated via llvm is 0xdb24137.
And that value keeps changing!

something wrong with the "inttoptr" instruction

What is going on here?

0x764, 1
28(Reg)

0x501, 2
28(Reg) 45(Reg)

0x4ff, 2
38(Reg) 5(Imm)

0x4ff, 2
27(Reg) 1(Imm)

0x501, 2
29(Reg) 38(Reg)

0x329, 1
128(Imm)

====================
Now it gets interesting: I found that `int` is handled by calling helper_raise_interrupt
rather than by jumping back.
So, do I just need to call it?

I called it, but the jump doesn't go through…

Question: how do we link?

It works now, oh yeah!
Lessons to remember:
1. Pay attention to how qemu handles terminating conditions.
For example, for int 0x80, qemu handles it by calling helper_raise_interrupt( 80, 02 ).
You only find this out by looking with gdb.
2. Handling of the generated code. LLVM's CodeEmitter emits the code into memory it manages itself;
this part still needs work, e.g., can we specify the memory location?

The switch statement and the headers of the llvm translation functions can now be generated.


Monday, March 8, 2010

3/8-3/12

=======3/8========
The current problem is how to interpret the opcodes.

The second question is: why is the compiled qemu-i386 a PIE? And why does gdb think it is?

What am I trying to do right now?
????????
What exactly is the instruction ID?

=================

push %ebp
0x764, 1
28(Reg)

mov %esp,%ebp
0x501, 2
28(Reg) 45(Reg)

sub $0x8,%esp
0x8fd, 3
45(Reg) 45(Reg) 8(Imm)

movl $0x5,-0x4(%ebp)
0x4f9, 6
28(Reg) 1(Imm) 0(Reg) -4(Imm) 0(Reg) 5(Imm)

mov -0x4(%ebp),%eax
0x500, 6
27(Reg) 28(Reg) 1(Imm) 0(Reg) -4(Imm) 0(Reg)

mov %eax,(%esp)
0x4fa, 6
45(Reg) 1(Imm) 0(Reg) 0(Imm) 0(Reg) 27(Reg)

call 8048080
0x115, 1
-56(Imm)

===========================notes==========
1' the generated function contains a push at the very beginning, remove it!!!
2'

=======================================
use size_t as the type for memory-address sizes

32-bit vs. 64-bit, this is really a problem





Friday, March 5, 2010

ToDo

TODO:
1. take the course.
2. read papers
3. PP2010 judge system, worksheet items
4. Money: Sun (10000), Wang (5400), Bao (5400), Fang (6500), Xin (9000).

Wednesday, March 3, 2010

Mongrel Cluster Setting Steps

http://azureusonrails.rubyforge.org/wiki/wiki.pl?Install/Mongrel_Cluster_With_Apache_2.2

3/3

Today I want to read a paper; which one would be good -
Design and Engineering of a Dynamic Binary Optimizer.

Read the openmp worksheet, then think about how to grade it with the judge system.

The judge system is ready,
at grid01.csie.ntu.edu.tw/~pp2010/judgegirl

Monday, March 1, 2010

3/1 - 3/5

The second week after, keep it up! Go!
Come on, you can do it.

Also, qemu doesn't disassemble any more code after the interrupt. Why?
Because the syscall is TARGET_NR_exit, it calls _exit directly,
so the code after it never returns.

First: I need to find out which ops are my targets.

First, decide which pages to print.
I need the instruction-format pages, and

REX prefixes are instruction-prefix bytes used in 64-bit mode. They do the following:
• Specify GPRs and SSE registers.
• Specify 64-bit operand size.
• Specify extended control registers.

The instructions to handle are:

83 c4 08 add $0x8,%esp

83 is the opcode byte; in this immediate group, the Reg field below selects ADD
c4 is the ModR/M byte: 11 000 100 (2+3+3)
MOD: 11
Reg: 000
R/M: 100
08 is the immediate 0x8: 0000 1000


83 ec 08 sub $0x8,%esp

e8 c8 ff ff ff call 8048080

cd 80 int $0x80

5d pop %ebp

55 push %ebp

c3 ret

89 04 24 mov %eax,(%esp)
89 45 f8 mov %eax,-0x8(%ebp)
89 45 fc mov %eax,-0x4(%ebp)
89 cb mov %ecx,%ebx
89 e5 mov %esp,%ebp
8b 45 fc mov -0x4(%ebp),%eax
8b 4d fc mov -0x4(%ebp),%ecx
b8 01 00 00 00 mov $0x1,%eax
c7 45 fc 05 00 00 00 movl $0x5,-0x4(%ebp)

============3/2===============
A separate disassembler has to be written for each one.

Why am I so anxious?
Calm down and think carefully. "Namo Amitabha, Namo Amitabha"... and I'm starving.
Let me just write the translator in with it!
Rewrite cpu_gen_code!!

tb->tb_jmp_offset: used to jump to other places in the code cache,
./tcg/i386/tcg-target.c:873

tb->tb_next_offset: used to jump to other places in the code cache,
./tcg/i386/tcg-target.c:881:
s->tb_next_offset[args[0]] = s->code_ptr - s->code_buf;

These two shouldn't concern us; we won't use them.

Damn, damn, damn!
I just looked at what tb_next_offset is for. It is not unused:
uint16_t tb_next_offset[2]; /* offset of original jump target */
uint16_t tb_jmp_offset[4]; /* offset of jump instruction */
unsigned long tb_next[2]; /* address of jump generated code */


tb_jmp_offset is used at ./exec-all.h:233:
offset = tb->tb_jmp_offset[n],

This is inside tb_set_jmp_target:
tb_set_jmp_target(TranslationBlock *tb, int n, unsigned long addr)
It first takes the n-th tb_jmp_offset from tb and stores at that spot the displacement relative to addr:
jmp_addr = tb->tc_ptr + offset;
*(uint32_t *)jmp_addr = addr - (jmp_addr + 4);

And tb_set_jmp_target is called from tb_add_jump:
tb_add_jump(TranslationBlock *tb, int n, TranslationBlock *tb_next)
It first checks that tb->jmp_next[n] is 0; if so, it calls tb_set_jmp_target,
then sets tb->jmp_next[n] = tb_next->jmp_first;
then sets tb_next->jmp_first = (TranslationBlock *)((long)(tb) | (n));

There's an assumption here I didn't know about: it seems to use the last two bits of the pointer.
But how does it know that pointers to a TranslationBlock never use their last two bits?
Look at where the pointer comes from: tb_alloc. Nothing special there.
Is it that the last two bits of every such pointer are unused?
Yes, yes: the last two bits of these pointers are always zero, because the
TranslationBlocks are at least 4-byte aligned (on a 32-bit machine).

cpu_exec calls tb_add_jump:
tb_add_jump((TranslationBlock *)(next_tb & ~3), next_tb & 3, tb);
The call above means:
if next_tb->jmp_next[n] is 0, where n is "next_tb & 3",
then store the distance between tb->tc_ptr and (next_tb->tc_ptr + next_tb->tb_jmp_offset[n])
at the location pointed to by next_tb->tc_ptr + next_tb->tb_jmp_offset[n].


真的會用到嗎?
會!做實驗了。
從source code document中提到關於 jmp_next:
/* list of TBs jumping to this one. This is a circular list using
the two least significant bits of the pointers to tell what is
the next pointer: 0 = jmp_next[0], 1 = jmp_next[1], 2 =
jmp_first */

What does the epilogue do?
It restores the stack and pops the saved registers.
Where does the return value go?
Never mind, skip the return value for now.
From wikipedia:
http://en.wikipedia.org/wiki/X86_calling_conventions
the returned value is stored in the EAX register

Let me figure out what to do next:
1.

=============3/3=================
1. Both the prologue and the epilogue can be used,
but tb_ret_addr has to be passed through the translator.

On the other hand, think about whether we could write our own prologue and epilogue.
Would they be written inside the translation function? Probably not; how would you do the linking then?
We would also have to skip over the prologue and epilogue.

The test program has two basic blocks:
55 push %ebp
89 e5 mov %esp,%ebp
83 ec 08 sub $0x8,%esp
c7 45 fc 05 00 00 00 movl $0x5,-0x4(%ebp)
8b 45 fc mov -0x4(%ebp),%eax
89 04 24 mov %eax,(%esp)
e8 c8 ff ff ff call 8048080

55 push %ebp
89 e5 mov %esp,%ebp
83 ec 08 sub $0x8,%esp
8b 45 08 mov 0x8(%ebp),%eax
89 45 fc mov %eax,-0x4(%ebp)
8b 4d fc mov -0x4(%ebp),%ecx
b8 01 00 00 00 mov $0x1,%eax
89 cb mov %ecx,%ebx
cd 80 int $0x80

The instructions to handle are:

83 c4 08 add $0x8,%esp

83 is the opcode
08 is the immediate value 0x8
c4 is the ModR/M byte: 11 000 100 (2+3+3)
MOD: 11
Reg: 000
R/M: 100

83 ec 08 sub $0x8,%esp

e8 c8 ff ff ff call 8048080

cd 80 int $0x80

5d pop %ebp

55 push %ebp

c3 ret

89 04 24 mov %eax,(%esp)
89 45 f8 mov %eax,-0x8(%ebp)
89 45 fc mov %eax,-0x4(%ebp)
89 cb mov %ecx,%ebx
89 e5 mov %esp,%ebp

8b 45 fc mov -0x4(%ebp),%eax
8b 4d fc mov -0x4(%ebp),%ecx

b8 01 00 00 00 mov $0x1,%eax
c7 45 fc 05 00 00 00 movl $0x5,-0x4(%ebp)

in translate-all.c, line 88

CPUX86State is huge: 28536 bytes. A page is only 4K = 4096 bytes, so it occupies 7 pages!

==================================

Get the disassembler working first.