2012年7月26日 星期四

Work Flow


  • measure the cycle count of the TMC fast path
  • measure TLB misses, broken down by the classic 3C model (see the sketch after this list):
    • conflict misses
    • compulsory misses
    • capacity misses
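
A standalone way to get the 3C breakdown, since perf counters only give totals. This is a minimal sketch driven by a guest address trace, not QEMU code: the direct-mapped model mirrors the soft TLB (TLB_BITS/PAGE_BITS are assumptions), a first touch counts as compulsory, and a fully associative LRU model of the same capacity separates conflict from capacity misses.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TLB_BITS  8                    /* 2^8 entries, the current baseline */
#define TLB_SIZE  (1 << TLB_BITS)
#define PAGE_BITS 12                   /* 4 KB guest pages */

static uint64_t dm_tag[TLB_SIZE];      /* direct-mapped model of the soft TLB */
static uint64_t fa_tag[TLB_SIZE];      /* fully associative LRU, same capacity */
static int fa_used;

static uint64_t n_compulsory, n_capacity, n_conflict;

/* Probe the fully associative model; returns hit/miss and keeps the
 * array in MRU-first order (index 0 = most recently used). */
static bool fa_access(uint64_t page)
{
    for (int i = 0; i < fa_used; i++) {
        if (fa_tag[i] == page) {       /* hit: move to front */
            memmove(&fa_tag[1], &fa_tag[0], i * sizeof(fa_tag[0]));
            fa_tag[0] = page;
            return true;
        }
    }
    if (fa_used < TLB_SIZE) {          /* miss: insert at front, dropping */
        fa_used++;                     /* the LRU tail once full */
    }
    memmove(&fa_tag[1], &fa_tag[0], (fa_used - 1) * sizeof(fa_tag[0]));
    fa_tag[0] = page;
    return false;
}

/* Classify one guest access; seen_before = page touched earlier in the run. */
static void tlb_access(uint64_t vaddr, bool seen_before)
{
    uint64_t page = vaddr >> PAGE_BITS;
    int idx = page & (TLB_SIZE - 1);
    bool fa_hit = fa_access(page);     /* must update LRU on every access */

    if (dm_tag[idx] == page + 1) {     /* tag is page+1 so 0 means "empty" */
        return;                        /* hit in the direct-mapped table */
    }
    if (!seen_before) {
        n_compulsory++;                /* first touch of this page */
    } else if (fa_hit) {
        n_conflict++;                  /* full associativity would have hit */
    } else {
        n_capacity++;                  /* misses even when fully associative */
    }
    dm_tag[idx] = page + 1;            /* refill */
}

int main(void)
{
    static bool seen[1 << 20];         /* bounded page universe for the demo */
    uint64_t trace[] = { 0x1000, 0x2000, 0x1000, 0x101000, 0x1000 };

    for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
        uint64_t page = trace[i] >> PAGE_BITS;
        tlb_access(trace[i], seen[page]);
        seen[page] = true;
    }
    printf("compulsory=%llu capacity=%llu conflict=%llu\n",
           (unsigned long long)n_compulsory,
           (unsigned long long)n_capacity,
           (unsigned long long)n_conflict);
    return 0;
}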
=======================================================================
  • Optimization 2 seems useless
    • no performance gain in my implementation.
  • Large TLB table: grow from 2^8 to 2^16 entries (size arithmetic sketched below)
    • 2^11 gives the best gain, at about 64K bytes per table.
    • performance drops beyond 2^12; at 2^15 and 2^16 GCC loses about 40%.
    • why?
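
For reference, the table size is a compile-time constant (CPU_TLB_BITS in cpu-defs.h), so each step above is a rebuild. A sketch of the size arithmetic, assuming a 32-bit guest on a 64-bit host; the struct is an approximation of CPUTLBEntry padded to 32 bytes, not a verbatim copy:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t target_ulong;       /* 32-bit guest (ARMv7) assumed */

/* Approximation of QEMU's CPUTLBEntry: three tags plus the host addend,
 * padded to 32 bytes (1 << CPU_TLB_ENTRY_BITS) on a 64-bit host. */
typedef struct CPUTLBEntry {
    target_ulong addr_read;          /* page VA if reads may take the fast path */
    target_ulong addr_write;         /* ... likewise for writes ... */
    target_ulong addr_code;          /* ... and for instruction fetches */
    uintptr_t addend;                /* guest VA + addend = host address */
    uint8_t dummy[8];                /* pad the entry to 32 bytes */
} CPUTLBEntry;

int main(void)
{
    /* Footprint of one MMU mode's tlb_table per CPU_TLB_BITS setting. */
    for (int bits = 8; bits <= 16; bits++) {
        size_t entries = (size_t)1 << bits;
        printf("CPU_TLB_BITS=%2d -> %6zu entries, %8zu bytes per mode\n",
               bits, entries, entries * sizeof(CPUTLBEntry));
    }
    return 0;
}

2^11 x 32 bytes = 64 KB per MMU mode, matching the sweet spot above; at 2^16 each mode costs 2 MB, which would explain both the cache pressure and the much more expensive memset on every tlb_flush.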
========================================================================
Enable Run-time Optimizations on Cross-ISA System

  • Cross-Page Block Linking:
    • other approaches?
      • reverse physical page mapping
      • check all conditions before linking
  • Large TLB Size
    • system time increases along with table size
    • probably related to the cost of tlb_flush
    • use a structure-of-arrays layout for the TLB: TBD
  • Combine a big Lv2 cache with a victim cache (see the sketch after this list)
    • hash table of linked lists
    • LRU replacement
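
A minimal sketch of the Lv2 structure, under names of my own (L2TLB, l2_lookup, l2_insert are not QEMU identifiers): each bucket chain is kept MRU-first, so lookup does move-to-front and LRU replacement is just "recycle the tail".

#include <stdint.h>
#include <stdlib.h>

#define L2_BUCKETS  1024             /* power of two for cheap masking */
#define BUCKET_CAP  8                /* bounded chains keep probes short */

typedef struct L2Entry {
    uint64_t page;                   /* guest virtual page number */
    uintptr_t addend;                /* guest VA + addend = host address */
    struct L2Entry *next;
} L2Entry;

typedef struct {
    L2Entry *head[L2_BUCKETS];       /* each chain is MRU-first */
    int len[L2_BUCKETS];
} L2TLB;

static unsigned l2_hash(uint64_t page)
{
    return (unsigned)(page ^ (page >> 10)) & (L2_BUCKETS - 1);
}

/* Lookup: a hit is moved to the bucket head, which is all the LRU
 * bookkeeping this scheme needs. */
L2Entry *l2_lookup(L2TLB *t, uint64_t page)
{
    unsigned h = l2_hash(page);
    L2Entry **pp = &t->head[h];

    for (L2Entry *e = *pp; e; pp = &e->next, e = e->next) {
        if (e->page == page) {
            *pp = e->next;           /* unlink ... */
            e->next = t->head[h];    /* ... and reinsert as MRU */
            t->head[h] = e;
            return e;
        }
    }
    return NULL;
}

/* Insert a refilled translation, evicting the bucket's LRU tail when full. */
void l2_insert(L2TLB *t, uint64_t page, uintptr_t addend)
{
    unsigned h = l2_hash(page);
    L2Entry *e;

    if (t->len[h] == BUCKET_CAP) {
        L2Entry **pp = &t->head[h];
        while ((*pp)->next) {        /* walk to the LRU tail */
            pp = &(*pp)->next;
        }
        e = *pp;                     /* recycle the evicted node */
        *pp = NULL;
        t->len[h]--;
    } else {
        e = malloc(sizeof(*e));
    }
    e->page = page;
    e->addend = addend;
    e->next = t->head[h];
    t->head[h] = e;
    t->len[h]++;
}

Capping the chains at BUCKET_CAP keeps the worst-case probe cost bounded, which matters since this sits on the miss path of the first-level TLB.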
========================================================================
  • question: is opt5 broken?
    • ???
  • Run experiments quickly
  • Experiment data must be turned into charts and saved on the spot; otherwise it is just garbage.
========================================================================
tlb_table initialization problem:
  • First initialized by memset in main->vexpress_a9_init->vexpress_common_init->cpu_init->cpu_arm_init->arm_cpu_reset->tlb_flush
  • Then initialized by memset again in main->qemu_system_reset->do_cpu_reset->arm_cpu_reset
  • Then initialized by memset AGAIN in main->qemu_system_reset->do_cpu_reset->arm_cpu_reset->tlb_flush

2012年7月24日 星期二

Work Flow


  • opt0: rearrange assembly instructions
  • opt1: sink all miss blocks
  • opt2: move redundant dirty stores to slow path
  • opt3: victim tlb cache
    • could have variations
  • opt4: cross-page block linking
  • opt5: indirect branch target caching
  • opt6: enlarge TLB table size
  • opt7: TLB mini buffer to reduce fast-path cycles; probably a failure
=============================================================
  • performance reduction?
    • baseline performance?
  • code_read TLB access is defined in exec-all.h
    • quick thought:
    • split code accesses from data accesses, analogous to an i-Cache and a d-Cache
    • not worth it: too much work, too little gain.
  • move redundant stores to the miss block (see the sketch after this list):
    • restore the clobber flag for qemu_ld and qemu_st,
    • so that before each qemu_ld/st, dirty states are stored back to memory,
    • and we only keep the stores that are NOT redundant.
    • This holds for globals; temporaries only ever need to be stored in the miss block,
    • so only global variables need further analysis.
  • DO IT!
  • Optimization 2 boot test:
    • opt2: OK
    • opt2+opt4: OK
    • opt1+opt2+opt4: OK
    • opt1+opt2+opt3+opt4: OK, not so sure...
    • opt1+opt2+opt3+opt4+opt5: NOT OK
    • opt3: OK
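
For the record, the shape of the load path that opt1/opt2 target, rendered as C instead of emitted TCG code. slow_ldl_mmu and spill_dirty_globals stand in for the real slow-path helper and the spill code, so this is a sketch of the intended control flow only:

#include <stdint.h>

#define TLB_BITS   8
#define PAGE_BITS  12
#define PAGE_MASK  (~(uint32_t)((1u << PAGE_BITS) - 1))

typedef struct {
    uint32_t addr_read;          /* page VA if fast-path reads are allowed */
    uintptr_t addend;            /* guest VA + addend = host address */
} TLBEntry;

extern TLBEntry tlb_table[1 << TLB_BITS];

uint32_t slow_ldl_mmu(uint32_t vaddr);      /* full page-table walk */
void spill_dirty_globals(void);             /* write cached globals back */

uint32_t qemu_ldl(uint32_t vaddr)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);
    TLBEntry *e = &tlb_table[idx];

    /* Fast path: one compare + one host load. No spill stores here --
     * removing them from the hot trace is exactly what opt2 does. */
    if ((vaddr & PAGE_MASK) == e->addr_read) {
        return *(uint32_t *)((uintptr_t)vaddr + e->addend);
    }

    /* Miss block: in the generated code this is "sunk" after the hot
     * trace (opt1). Dirty guest globals are written back only here,
     * because only the slow path can fault and observe them. */
    spill_dirty_globals();
    return slow_ldl_mmu(vaddr);
}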

2012年7月20日 星期五

Status of PARSEC ARMv7 Native Run


  • dedup: memory allocation fail
  • canneal: segmentation fault
  • ferret: abort due to assertion
  • facesim: no test input
SIM-SMALL
  • blackscholes: 0m3.598s
  • bodytrack: 0m15.082s
  • facesim: 6m55.221s
  • ferret: abort due to assertion
  • fluid: 0m15.290s
  • freqmine: 0m4.358s
  • swaptions: 0m11.446s
  • vips: 0m20.788s
  • x264: 0m5.126s
  • canneal: segfault
  • dedup: memory allocation fail
  • streamcluster: 0m23.788s
SIM-MEDIUM
  • blackscholes:0m14.384s
  • bodytrack:0m50.297s
  • facesim:6m55.629s
  • ferret: abort due to assertion
  • fluid:0m35.486s
  • freqmine: 0m13.768s
  • swaptions:0m45.786s
  • vips:1m1.796s
  • x264:0m33.920s
  • canneal: segfault
  • dedup: memory allocation fail
  • streamcluster:1m41.766s

Victim Cache


  • when a TLB lookup misses, let E_old be the TLB entry about to be evicted and E_new the entry installed by the refill; E_old is saved into the victim cache so that a later access to its page can be served by swapping entries instead of a full page-table walk (see the sketch below).
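
A minimal sketch of that swap protocol, with an 8-entry FIFO victim buffer; the sizes and the victim_tlb_* names are mine, not QEMU's:

#include <stdbool.h>
#include <stdint.h>

#define TLB_BITS     8
#define PAGE_BITS    12
#define PAGE_MASK    (~(uint32_t)((1u << PAGE_BITS) - 1))
#define VICTIM_SIZE  8

typedef struct {
    uint32_t addr_read;          /* page VA; should be initialized to -1 */
    uintptr_t addend;            /* guest VA + addend = host address */
} TLBEntry;

static TLBEntry tlb_table[1 << TLB_BITS];
static TLBEntry victim[VICTIM_SIZE];
static unsigned victim_next;     /* FIFO replacement cursor */

/* Called on a main-TLB miss; returns true if the victim buffer had the
 * page, in which case E_old and the victim entry are swapped. */
bool victim_tlb_hit(uint32_t vaddr)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);
    uint32_t page = vaddr & PAGE_MASK;

    for (unsigned i = 0; i < VICTIM_SIZE; i++) {
        if (victim[i].addr_read == page) {
            TLBEntry e_old = tlb_table[idx];   /* entry being displaced */
            tlb_table[idx] = victim[i];        /* promote the victim */
            victim[i] = e_old;                 /* demote E_old in place */
            return true;
        }
    }
    return false;
}

/* Called from the refill path: E_old is displaced by E_new, so keep it
 * in the victim buffer instead of dropping it. */
void victim_tlb_fill(uint32_t vaddr, TLBEntry e_new)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);

    victim[victim_next] = tlb_table[idx];      /* save E_old */
    victim_next = (victim_next + 1) % VICTIM_SIZE;
    tlb_table[idx] = e_new;
}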

2012年7月19日 星期四

Indirect Branch Target Caching


  • IBTC (see the sketch after this list)
  • Difference between the median ratio and the average ratio:
  • Median above average: perlbench, bzip2, sjeng, h264
  • Median below average: omnetpp
  • Neutral: gcc, mcf, gobmk, hmmer, libquantum, astar, xalancbmk
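
For reference, the IBTC itself is just a direct-mapped guest-PC -> host-code table probed before the full TB lookup. A sketch under simplifying assumptions: system mode would also need cs_base/flags (or an ASID) in the tag, and tb_find_slow stands in for the real lookup-and-translate path. The flush hook matters; stale entries surviving a code-cache flush would look exactly like the opt5 boot breakage noted earlier.

#include <stdint.h>

#define IBTC_BITS 10
#define IBTC_SIZE (1 << IBTC_BITS)

typedef struct {
    uint32_t guest_pc;           /* tag */
    void *host_code;             /* translated code for that pc */
} IBTCEntry;

static IBTCEntry ibtc[IBTC_SIZE];

void *tb_find_slow(uint32_t pc);     /* existing lookup + translate path */

/* Resolve an indirect branch target; >> 2 because ARM instructions are
 * word-aligned. */
void *ibtc_lookup(uint32_t guest_pc)
{
    IBTCEntry *e = &ibtc[(guest_pc >> 2) & (IBTC_SIZE - 1)];

    if (e->guest_pc == guest_pc && e->host_code) {
        return e->host_code;             /* hit: chain directly */
    }
    e->guest_pc = guest_pc;              /* miss: refill the slot */
    e->host_code = tb_find_slow(guest_pc);
    return e->host_code;
}

/* Must run whenever translations are invalidated, or the table can hand
 * back stale host code. */
void ibtc_flush(void)
{
    for (int i = 0; i < IBTC_SIZE; i++) {
        ibtc[i].host_code = 0;
    }
}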

2012年7月17日 星期二

Original NET, WRF

glibc error (double free), caused by translated segments
  • which one?

2012年7月12日 星期四

Block Link Across Pages


  • Which instruction causes problems when blocks are linked across pages? (see the sketch below)
    • Line 8007: YES /* branch (and link) */
    • TLB entry?
    • 803b368c
  • sigfd_handler()
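
One way to state the hazard: a patched direct jump bakes in the target's host code address, which stays valid only while the guest page still maps to the physical page the target was translated from. A sketch of a link-time check along the reverse-mapping line from the July 26 notes; cross_page_link_ok is illustrative, and get_page_addr_code is the real QEMU helper with its signature simplified. Any later remap of the page must still unlink the jump.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t target_ulong;
typedef uintptr_t tb_page_addr_t;

#define TARGET_PAGE_BITS 12

typedef struct TranslationBlock {
    target_ulong pc;                 /* guest VA of the block */
    tb_page_addr_t page_addr[2];     /* physical page(s) it was built from */
} TranslationBlock;

/* Current guest-VA -> code-page translation (signature simplified). */
tb_page_addr_t get_page_addr_code(target_ulong vaddr);

/* Linking within one page is always safe: remapping that page already
 * invalidates both blocks. Across pages, only link while the target page
 * still resolves to the physical page recorded at translation time;
 * otherwise fall back to the unlinked (lookup) exit. */
bool cross_page_link_ok(const TranslationBlock *tb,
                        const TranslationBlock *tb_next)
{
    if ((tb->pc >> TARGET_PAGE_BITS) == (tb_next->pc >> TARGET_PAGE_BITS)) {
        return true;                          /* same-page link */
    }
    return get_page_addr_code(tb_next->pc) == tb_next->page_addr[0];
}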

2012年7月11日 星期三

check cross-page block-linking

TranslationBlock has four members related to physical pages:
  1. struct TranslationBlock *phys_hash_next     /* next matching tb for physical address. */
    1. tb_phys_hash[] -> the so-called global TB mapping table, indexed by physical address bits [2:17]
    2. phys_hash_next: the next TB in this entry's linked list
  2. struct TranslationBlock *page_next[2]; /* first and second physical page containing code. The lower bit of the pointer tells the index in page_next[] */
    1. defined in tb_alloc_page(), which is called from tb_link_page(), which is called from tb_gen_code().
    2. page_next links the TranslationBlocks that live in the SAME page.
    3. PageDesc *p->first_tb points to the LAST TB in this page (the most recently encountered one).
    4. Why two page_next[] slots? Because a TB can span two pages, so it needs one link for the first page and another for the second.
    5. Question: can multiple TBs in the same page each use both page_next slots? YES, when translation blocks overlap.

  3. tb_page_addr_t page_addr[2]
    1. the addresses of the two pages. Stupid question: can one TB span two different physical pages? YES: even though the virtual pages are consecutive, the physical pages backing them may not be.

    /* Circular List of TBs jumping to this one. This is a circular list using
       the two least significant bits of the pointers to tell what is
       the next pointer: 0 = jmp_next[0], 1 = jmp_next[1], 2 =
       jmp_first (tb itself)*/
  4. struct TranslationBlock *jmp_next[2];
     struct TranslationBlock *jmp_first;
    1. Initialized in tb_link_page(), which is called from tb_gen_code(); initially jmp_first points to the TB itself with the low bits set to 2.
    2. tb_add_jump(tb, n, tb_next) installs a direct jump from exit n of tb to tb_next: tb[n] ----> tb_next.
    3. It sets tb->jmp_next[n] to the old tb_next->jmp_first,
    4. then sets tb_next->jmp_first to tb with the low bits set to n.
    5. Illustration: the original diagram was not preserved; the code sketch below covers the same bookkeeping.

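Since the diagram is gone, the same bookkeeping as code. This follows tb_add_jump() in exec-all.h closely, with the struct trimmed to the three relevant fields:

#include <stdint.h>
#include <stddef.h>

typedef struct TranslationBlock {
    struct TranslationBlock *jmp_next[2];  /* tagged: continues the ring */
    struct TranslationBlock *jmp_first;    /* tagged head of the incoming ring */
    uint8_t *tc_ptr;                       /* host code of this block */
} TranslationBlock;

void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr);

/* In tb_link_page() the ring starts out empty: jmp_first points back at
 * the TB itself with tag 2. */
void tb_ring_init(TranslationBlock *tb)
{
    tb->jmp_next[0] = NULL;
    tb->jmp_next[1] = NULL;
    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2);
}

/* Exit n of tb now jumps directly to tb_next (cf. tb_add_jump()). */
void tb_add_jump(TranslationBlock *tb, int n, TranslationBlock *tb_next)
{
    if (!tb->jmp_next[n]) {
        /* patch the generated code so exit n lands in tb_next's code */
        tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc_ptr);

        /* splice (tb, exit n) onto the front of tb_next's incoming ring:
         * jmp_next[n] inherits the old jmp_first, and jmp_first becomes
         * "tb, continue at jmp_next[n]" via the low-bit tag. */
        tb->jmp_next[n] = tb_next->jmp_first;
        tb_next->jmp_first = (TranslationBlock *)((uintptr_t)tb | (uintptr_t)n);
    }
}

/* To walk the ring (as tb_jmp_remove() does): mask the tag off to get the
 * TB, and use the tag to pick which field continues the list; tag 2 means
 * the circle is closed. */
TranslationBlock *ring_tb(TranslationBlock *tagged)
{
    return (TranslationBlock *)((uintptr_t)tagged & ~(uintptr_t)3);
}

unsigned ring_tag(TranslationBlock *tagged)
{
    return (unsigned)((uintptr_t)tagged & 3);
}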

=====================================================================

tb_remove(ptb, tb, next_offset)

  • Remove tb from the physical-hash TB linked list.
  • called in tb_phys_invalidate()

tb_page_remove(ptb, tb)

  • Remove tb from the page TB linked list.
  • called in tb_phys_invalidate()

tb_jmp_remove(tb, n)

  • Remove tb from the TB jump circular list.
  • called in tb_phys_invalidate()

tb_phys_invalidate(tb, page_addr)

  • called from
    • tb_invalidate_phys_page_range(start, end, is_cpu_write_access)
    • tb_invalidate_phys_page(addr, pc, puc)
      • Only in USER MODE.
    • check_watchpoint(offset, len_mask, flags)
    • cpu_io_recompile(env, retaddr)

tb_invalidate_phys_page_range(start, end, is_cpu_write_access)

  • called from 
    • tb_invalidate_phys_range(start, end, is_cpu_write_access)
      • only called in linux-user/mmap.c
    • tb_invalidate_phys_page_fast(start, len)
      • uses code_bitmap to quickly check whether any TB falls in this range.
      • called from notdirty_mem_write(opaque, ram_addr, val, size)
    • tb_invalidate_phys_addr(addr)
      • only if TARGET_HAS_ICE
    • cpu_physical_memory_rw(addr, buf, len, is_write)
    • cpu_physical_memory_unmap(buffer, len, is_write, access_len)
    • stl_phys_notdirty(addr, val)
      • tb_invalidate_phys_page_range is called if in_migration
    • stl_phys_internal(addr, val, endian)
    • stw_phys_internal(addr, val, endian)
      • called from stw_phys(), stw_le_phys() /* little endian */, stw_be_phys()
===============================================================

2012年7月5日 星期四

Ubuntu boot order/sequence management: Upstart

https://help.ubuntu.com/community/UpstartHowto

Sink-Optimization Implementation Problem


Using over 5 GB of memory? What the fuck?
A fixed array of 256 MissBlockInfo per TranslationBlock is far too much!
Allocating just enough MissBlockInfo per TranslationBlock fixed the problem (see the estimate below).
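
A back-of-envelope check of the blow-up. The MissBlockInfo layout below is a guess (the real one is in my patches), but any record around 24 bytes gives the same order of magnitude:

#include <stdio.h>
#include <stdint.h>

typedef struct MissBlockInfo {     /* hypothetical layout */
    uint8_t *fast_path_ptr;        /* patch site in the hot trace */
    uint8_t *miss_block_ptr;       /* sunk slow-path code */
    uint64_t flags;
} MissBlockInfo;

int main(void)
{
    size_t num_tbs = (size_t)1 << 20;  /* ~1M TBs in a big code cache */
    size_t fixed   = 256 * sizeof(MissBlockInfo) * num_tbs;
    size_t typical = 8 * sizeof(MissBlockInfo) * num_tbs;  /* "just enough" */

    printf("fixed 256 slots/TB : %zu MB\n", fixed >> 20);   /* ~6144 MB */
    printf("~8 used slots/TB   : %zu MB\n", typical >> 20); /* ~192 MB */
    return 0;
}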




BR_MISP_EXEC

Mispredicted Branch instructions executed, speculative and retired, by type 


http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/ug_docs/reference/pmn/events/br_misp_exec.html


2012年7月2日 星期一

Note


  • Sink Slow Blocks has bugs?
    • boot output differs from original
    • cpu_restore_count for cpu_restore_state seems to be normal
    • problem: some errors during boot trigger the failed-boot delay, which forces execution to sleep
  • Rewrite Optimization 1
    • simplify it
  • Problem should be cpu_restore_state
    • not exactly!
  • New phenomenon: the original is now as slow as opt1 and opt2 (maybe).
======================================================
  • Debug sinking of slow blocks
    1. individual tests: qemu_ld32, qemu_ld16[us], qemu_ld8[us], qemu_ld64
    2. rewrite!
  • The approach
    • keep two copies each of tlb_load, qemu_ld, and qemu_st
======================================================