2012年7月26日 星期四

Work Flow


  • measure the cycle count of the TMC fast path
  • measure TLB misses, broken down by the classic 3C model (see the sketch after this list):
    • conflict misses
    • compulsory misses
    • capacity misses
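
A standalone way to get the 3C breakdown, since perf counters only give totals. This is a minimal sketch driven by a guest address trace, not QEMU code: the direct-mapped model mirrors the soft TLB (TLB_BITS/PAGE_BITS are assumptions), a first touch counts as compulsory, and a fully associative LRU model of the same capacity separates conflict from capacity misses.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TLB_BITS  8                    /* 2^8 entries, the current baseline */
#define TLB_SIZE  (1 << TLB_BITS)
#define PAGE_BITS 12                   /* 4 KB guest pages */

static uint64_t dm_tag[TLB_SIZE];      /* direct-mapped model of the soft TLB */
static uint64_t fa_tag[TLB_SIZE];      /* fully associative LRU, same capacity */
static int fa_used;

static uint64_t n_compulsory, n_capacity, n_conflict;

/* Probe the fully associative model; returns hit/miss and keeps the
 * array in MRU-first order (index 0 = most recently used). */
static bool fa_access(uint64_t page)
{
    for (int i = 0; i < fa_used; i++) {
        if (fa_tag[i] == page) {       /* hit: move to front */
            memmove(&fa_tag[1], &fa_tag[0], i * sizeof(fa_tag[0]));
            fa_tag[0] = page;
            return true;
        }
    }
    if (fa_used < TLB_SIZE) {          /* miss: insert at front, dropping */
        fa_used++;                     /* the LRU tail once full */
    }
    memmove(&fa_tag[1], &fa_tag[0], (fa_used - 1) * sizeof(fa_tag[0]));
    fa_tag[0] = page;
    return false;
}

/* Classify one guest access; seen_before = page touched earlier in the run. */
static void tlb_access(uint64_t vaddr, bool seen_before)
{
    uint64_t page = vaddr >> PAGE_BITS;
    int idx = page & (TLB_SIZE - 1);
    bool fa_hit = fa_access(page);     /* must update LRU on every access */

    if (dm_tag[idx] == page + 1) {     /* tag is page+1 so 0 means "empty" */
        return;                        /* hit in the direct-mapped table */
    }
    if (!seen_before) {
        n_compulsory++;                /* first touch of this page */
    } else if (fa_hit) {
        n_conflict++;                  /* full associativity would have hit */
    } else {
        n_capacity++;                  /* misses even when fully associative */
    }
    dm_tag[idx] = page + 1;            /* refill */
}

int main(void)
{
    static bool seen[1 << 20];         /* bounded page universe for the demo */
    uint64_t trace[] = { 0x1000, 0x2000, 0x1000, 0x101000, 0x1000 };

    for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
        uint64_t page = trace[i] >> PAGE_BITS;
        tlb_access(trace[i], seen[page]);
        seen[page] = true;
    }
    printf("compulsory=%llu capacity=%llu conflict=%llu\n",
           (unsigned long long)n_compulsory,
           (unsigned long long)n_capacity,
           (unsigned long long)n_conflict);
    return 0;
}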
=======================================================================
  • Optimization 2 seems useless
    • no performance gain in my implementation.
  • Large TLB table: grow from 2^8 to 2^16 entries (size arithmetic sketched below)
    • 2^11 gives the best gain, at about 64K bytes per table.
    • performance drops beyond 2^12; at 2^15 and 2^16 GCC loses about 40%.
    • why?
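
For reference, the table size is a compile-time constant (CPU_TLB_BITS in cpu-defs.h), so each step above is a rebuild. A sketch of the size arithmetic, assuming a 32-bit guest on a 64-bit host; the struct is an approximation of CPUTLBEntry padded to 32 bytes, not a verbatim copy:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t target_ulong;       /* 32-bit guest (ARMv7) assumed */

/* Approximation of QEMU's CPUTLBEntry: three tags plus the host addend,
 * padded to 32 bytes (1 << CPU_TLB_ENTRY_BITS) on a 64-bit host. */
typedef struct CPUTLBEntry {
    target_ulong addr_read;          /* page VA if reads may take the fast path */
    target_ulong addr_write;         /* ... likewise for writes ... */
    target_ulong addr_code;          /* ... and for instruction fetches */
    uintptr_t addend;                /* guest VA + addend = host address */
    uint8_t dummy[8];                /* pad the entry to 32 bytes */
} CPUTLBEntry;

int main(void)
{
    /* Footprint of one MMU mode's tlb_table per CPU_TLB_BITS setting. */
    for (int bits = 8; bits <= 16; bits++) {
        size_t entries = (size_t)1 << bits;
        printf("CPU_TLB_BITS=%2d -> %6zu entries, %8zu bytes per mode\n",
               bits, entries, entries * sizeof(CPUTLBEntry));
    }
    return 0;
}

2^11 x 32 bytes = 64 KB per MMU mode, matching the sweet spot above; at 2^16 each mode costs 2 MB, which would explain both the cache pressure and the much more expensive memset on every tlb_flush.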
========================================================================
Enable Run-time Optimizations on Cross-ISA System

  • Cross-Page Block Linking:
    • other approaches?
      • reverse physical page mapping
      • check all conditions before linking
  • Large TLB Size
    • system time increases along with table size
    • probably related to the cost of tlb_flush
    • use a structure-of-arrays layout for the TLB: TBD
  • Combine a big Lv2 cache with a victim cache (see the sketch after this list)
    • hash table of linked lists
    • LRU replacement
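
A minimal sketch of the Lv2 structure, under names of my own (L2TLB, l2_lookup, l2_insert are not QEMU identifiers): each bucket chain is kept MRU-first, so lookup does move-to-front and LRU replacement is just "recycle the tail".

#include <stdint.h>
#include <stdlib.h>

#define L2_BUCKETS  1024             /* power of two for cheap masking */
#define BUCKET_CAP  8                /* bounded chains keep probes short */

typedef struct L2Entry {
    uint64_t page;                   /* guest virtual page number */
    uintptr_t addend;                /* guest VA + addend = host address */
    struct L2Entry *next;
} L2Entry;

typedef struct {
    L2Entry *head[L2_BUCKETS];       /* each chain is MRU-first */
    int len[L2_BUCKETS];
} L2TLB;

static unsigned l2_hash(uint64_t page)
{
    return (unsigned)(page ^ (page >> 10)) & (L2_BUCKETS - 1);
}

/* Lookup: a hit is moved to the bucket head, which is all the LRU
 * bookkeeping this scheme needs. */
L2Entry *l2_lookup(L2TLB *t, uint64_t page)
{
    unsigned h = l2_hash(page);
    L2Entry **pp = &t->head[h];

    for (L2Entry *e = *pp; e; pp = &e->next, e = e->next) {
        if (e->page == page) {
            *pp = e->next;           /* unlink ... */
            e->next = t->head[h];    /* ... and reinsert as MRU */
            t->head[h] = e;
            return e;
        }
    }
    return NULL;
}

/* Insert a refilled translation, evicting the bucket's LRU tail when full. */
void l2_insert(L2TLB *t, uint64_t page, uintptr_t addend)
{
    unsigned h = l2_hash(page);
    L2Entry *e;

    if (t->len[h] == BUCKET_CAP) {
        L2Entry **pp = &t->head[h];
        while ((*pp)->next) {        /* walk to the LRU tail */
            pp = &(*pp)->next;
        }
        e = *pp;                     /* recycle the evicted node */
        *pp = NULL;
        t->len[h]--;
    } else {
        e = malloc(sizeof(*e));
    }
    e->page = page;
    e->addend = addend;
    e->next = t->head[h];
    t->head[h] = e;
    t->len[h]++;
}

Capping the chains at BUCKET_CAP keeps the worst-case probe cost bounded, which matters since this sits on the miss path of the first-level TLB.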
========================================================================
  • question: is opt5 broken?
    • ???
  • Run experiments quickly
  • Experiment data must be turned into charts and saved on the spot; otherwise it is just garbage.
========================================================================
tlb_table initialization problem:
  • First initialized by memset in main->vexpress_a9_init->vexpress_common_init->cpu_init->cpu_arm_init->arm_cpu_reset->tlb_flush
  • Then initialized by memset again in main->qemu_system_reset->do_cpu_reset->arm_cpu_reset
  • Then initialized by memset AGAIN in main->qemu_system_reset->do_cpu_reset->arm_cpu_reset->tlb_flush

2012年7月24日 星期二

Work Flow


  • opt0: rearrange assembly instructions
  • opt1: sink all miss blocks
  • opt2: move redundant dirty stores to slow path
  • opt3: victim tlb cache
    • could have variations
  • opt4: cross-page block linking
  • opt5: indirect branch target caching
  • opt6: enlarge TLB table size
  • opt7: TLB mini buffer to reduce fast-path cycles; probably a failure
=============================================================
  • performance reduction?
    • baseline performance?
  • code_read TLB access is defined in exec-all.h
    • quick thought:
    • split code accesses from data accesses, analogous to an i-Cache and a d-Cache
    • not worth it: too much work, too little gain.
  • move redundant stores to the miss block (see the sketch after this list):
    • restore the clobber flag for qemu_ld and qemu_st,
    • so that before each qemu_ld/st, dirty states are stored back to memory,
    • and we only keep the stores that are NOT redundant.
    • This holds for globals; temporaries only ever need to be stored in the miss block,
    • so only global variables need further analysis.
  • DO IT!
  • Optimization 2 boot test:
    • opt2: OK
    • opt2+opt4: OK
    • opt1+opt2+opt4: OK
    • opt1+opt2+opt3+opt4: OK, not so sure...
    • opt1+opt2+opt3+opt4+opt5: NOT OK
    • opt3: OK
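
For the record, the shape of the load path that opt1/opt2 target, rendered as C instead of emitted TCG code. slow_ldl_mmu and spill_dirty_globals stand in for the real slow-path helper and the spill code, so this is a sketch of the intended control flow only:

#include <stdint.h>

#define TLB_BITS   8
#define PAGE_BITS  12
#define PAGE_MASK  (~(uint32_t)((1u << PAGE_BITS) - 1))

typedef struct {
    uint32_t addr_read;          /* page VA if fast-path reads are allowed */
    uintptr_t addend;            /* guest VA + addend = host address */
} TLBEntry;

extern TLBEntry tlb_table[1 << TLB_BITS];

uint32_t slow_ldl_mmu(uint32_t vaddr);      /* full page-table walk */
void spill_dirty_globals(void);             /* write cached globals back */

uint32_t qemu_ldl(uint32_t vaddr)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);
    TLBEntry *e = &tlb_table[idx];

    /* Fast path: one compare + one host load. No spill stores here --
     * removing them from the hot trace is exactly what opt2 does. */
    if ((vaddr & PAGE_MASK) == e->addr_read) {
        return *(uint32_t *)((uintptr_t)vaddr + e->addend);
    }

    /* Miss block: in the generated code this is "sunk" after the hot
     * trace (opt1). Dirty guest globals are written back only here,
     * because only the slow path can fault and observe them. */
    spill_dirty_globals();
    return slow_ldl_mmu(vaddr);
}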

2012年7月20日 星期五

Status of PARSEC ARMv7 Native Run


  • dedup: memory allocation fail
  • canneal: segmentation fault
  • ferret: abort due to assertion
  • facesim: no test input
SIM-SMALL
  • blackscholes: 0m3.598s
  • bodytrack: 0m15.082s
  • facesim: 6m55.221s
  • ferret: abort due to assertion
  • fluid: 0m15.290s
  • freqmine: 0m4.358s
  • swaptions: 0m11.446s
  • vips: 0m20.788s
  • x264: 0m5.126s
  • canneal: segfault
  • dedup: memory allocation fail
  • streamcluster: 0m23.788s
SIM-MEDIUM
  • blackscholes:0m14.384s
  • bodytrack:0m50.297s
  • facesim:6m55.629s
  • ferret: abort due to assertion
  • fluid:0m35.486s
  • freqmine: 0m13.768s
  • swaptions:0m45.786s
  • vips:1m1.796s
  • x264:0m33.920s
  • canneal: segfault
  • dedup: memory allocation fail
  • streamcluster:1m41.766s

Victim Cache


  • when a TLB lookup misses, let E_old be the TLB entry about to be evicted and E_new the entry installed by the refill; E_old is saved into the victim cache so that a later access to its page can be served by swapping entries instead of a full page-table walk (see the sketch below).
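
A minimal sketch of that swap protocol, with an 8-entry FIFO victim buffer; the sizes and the victim_tlb_* names are mine, not QEMU's:

#include <stdbool.h>
#include <stdint.h>

#define TLB_BITS     8
#define PAGE_BITS    12
#define PAGE_MASK    (~(uint32_t)((1u << PAGE_BITS) - 1))
#define VICTIM_SIZE  8

typedef struct {
    uint32_t addr_read;          /* page VA; should be initialized to -1 */
    uintptr_t addend;            /* guest VA + addend = host address */
} TLBEntry;

static TLBEntry tlb_table[1 << TLB_BITS];
static TLBEntry victim[VICTIM_SIZE];
static unsigned victim_next;     /* FIFO replacement cursor */

/* Called on a main-TLB miss; returns true if the victim buffer had the
 * page, in which case E_old and the victim entry are swapped. */
bool victim_tlb_hit(uint32_t vaddr)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);
    uint32_t page = vaddr & PAGE_MASK;

    for (unsigned i = 0; i < VICTIM_SIZE; i++) {
        if (victim[i].addr_read == page) {
            TLBEntry e_old = tlb_table[idx];   /* entry being displaced */
            tlb_table[idx] = victim[i];        /* promote the victim */
            victim[i] = e_old;                 /* demote E_old in place */
            return true;
        }
    }
    return false;
}

/* Called from the refill path: E_old is displaced by E_new, so keep it
 * in the victim buffer instead of dropping it. */
void victim_tlb_fill(uint32_t vaddr, TLBEntry e_new)
{
    unsigned idx = (vaddr >> PAGE_BITS) & ((1u << TLB_BITS) - 1);

    victim[victim_next] = tlb_table[idx];      /* save E_old */
    victim_next = (victim_next + 1) % VICTIM_SIZE;
    tlb_table[idx] = e_new;
}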

2012年7月19日 星期四

Indirect Branch Target Caching


  • IBTC (see the sketch after this list)
  • Difference between the median ratio and the average ratio:
  • Median above average: perlbench, bzip2, sjeng, h264
  • Median below average: omnetpp
  • Neutral: gcc, mcf, gobmk, hmmer, libquantum, astar, xalancbmk
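
For reference, the IBTC itself is just a direct-mapped guest-PC -> host-code table probed before the full TB lookup. A sketch under simplifying assumptions: system mode would also need cs_base/flags (or an ASID) in the tag, and tb_find_slow stands in for the real lookup-and-translate path. The flush hook matters; stale entries surviving a code-cache flush would look exactly like the opt5 boot breakage noted earlier.

#include <stdint.h>

#define IBTC_BITS 10
#define IBTC_SIZE (1 << IBTC_BITS)

typedef struct {
    uint32_t guest_pc;           /* tag */
    void *host_code;             /* translated code for that pc */
} IBTCEntry;

static IBTCEntry ibtc[IBTC_SIZE];

void *tb_find_slow(uint32_t pc);     /* existing lookup + translate path */

/* Resolve an indirect branch target; >> 2 because ARM instructions are
 * word-aligned. */
void *ibtc_lookup(uint32_t guest_pc)
{
    IBTCEntry *e = &ibtc[(guest_pc >> 2) & (IBTC_SIZE - 1)];

    if (e->guest_pc == guest_pc && e->host_code) {
        return e->host_code;             /* hit: chain directly */
    }
    e->guest_pc = guest_pc;              /* miss: refill the slot */
    e->host_code = tb_find_slow(guest_pc);
    return e->host_code;
}

/* Must run whenever translations are invalidated, or the table can hand
 * back stale host code. */
void ibtc_flush(void)
{
    for (int i = 0; i < IBTC_SIZE; i++) {
        ibtc[i].host_code = 0;
    }
}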

2012年7月17日 星期二

Original NET, WRF

glibc error (double free), caused by translated segments
  • which one?

2012年7月12日 星期四

Block Link Across Pages


  • Which instruction causes problems when blocks are linked across pages? (see the sketch below)
    • Line 8007: YES /* branch (and link) */
    • TLB entry?
    • 803b368c
  • sigfd_handler()
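
One way to state the hazard: a patched direct jump bakes in the target's host code address, which stays valid only while the guest page still maps to the physical page the target was translated from. A sketch of a link-time check along the reverse-mapping line from the July 26 notes; cross_page_link_ok is illustrative, and get_page_addr_code is the real QEMU helper with its signature simplified. Any later remap of the page must still unlink the jump.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t target_ulong;
typedef uintptr_t tb_page_addr_t;

#define TARGET_PAGE_BITS 12

typedef struct TranslationBlock {
    target_ulong pc;                 /* guest VA of the block */
    tb_page_addr_t page_addr[2];     /* physical page(s) it was built from */
} TranslationBlock;

/* Current guest-VA -> code-page translation (signature simplified). */
tb_page_addr_t get_page_addr_code(target_ulong vaddr);

/* Linking within one page is always safe: remapping that page already
 * invalidates both blocks. Across pages, only link while the target page
 * still resolves to the physical page recorded at translation time;
 * otherwise fall back to the unlinked (lookup) exit. */
bool cross_page_link_ok(const TranslationBlock *tb,
                        const TranslationBlock *tb_next)
{
    if ((tb->pc >> TARGET_PAGE_BITS) == (tb_next->pc >> TARGET_PAGE_BITS)) {
        return true;                          /* same-page link */
    }
    return get_page_addr_code(tb_next->pc) == tb_next->page_addr[0];
}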

2012年7月11日 星期三

check cross-page block-linking

TranslationBlock has four members related to physical pages:
  1. struct TranslationBlock *phys_hash_next     /* next matching tb for physical address. */
    1. tb_phys_hash[] -> the so-called global TB mapping table, indexed by physical address bits [2:17]
    2. phys_hash_next: the next TB in this entry's linked list
  2. struct TranslationBlock *page_next[2]; /* first and second physical page containing code. The lower bit of the pointer tells the index in page_next[] */
    1. defined in tb_alloc_page(), which is called from tb_link_page(), which is called from tb_gen_code().
    2. page_next links the TranslationBlocks that live in the SAME page.
    3. PageDesc *p->first_tb points to the LAST TB in this page (the most recently encountered one).
    4. Why two page_next[] slots? Because a TB can span two pages, so it needs one link for the first page and another for the second.
    5. Question: can multiple TBs in the same page each use both page_next slots? YES, when translation blocks overlap.

  3. tb_page_addr_t page_addr[2]
    1. the addresses of the two pages. Stupid question: can one TB span two different physical pages? YES: even though the virtual pages are consecutive, the physical pages backing them may not be.

    /* Circular List of TBs jumping to this one. This is a circular list using
       the two least significant bits of the pointers to tell what is
       the next pointer: 0 = jmp_next[0], 1 = jmp_next[1], 2 =
       jmp_first (tb itself)*/
  4. struct TranslationBlock *jmp_next[2];
     struct TranslationBlock *jmp_first;
    1. Initialized in tb_link_page(), which is called from tb_gen_code(); initially jmp_first points to the TB itself with the low bits set to 2.
    2. tb_add_jump(tb, n, tb_next) installs a direct jump from exit n of tb to tb_next: tb[n] ----> tb_next.
    3. It sets tb->jmp_next[n] to the old tb_next->jmp_first,
    4. then sets tb_next->jmp_first to tb with the low bits set to n.
    5. Illustration: the original diagram was not preserved; the code sketch below covers the same bookkeeping.

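Since the diagram is gone, the same bookkeeping as code. This follows tb_add_jump() in exec-all.h closely, with the struct trimmed to the three relevant fields:

#include <stdint.h>
#include <stddef.h>

typedef struct TranslationBlock {
    struct TranslationBlock *jmp_next[2];  /* tagged: continues the ring */
    struct TranslationBlock *jmp_first;    /* tagged head of the incoming ring */
    uint8_t *tc_ptr;                       /* host code of this block */
} TranslationBlock;

void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr);

/* In tb_link_page() the ring starts out empty: jmp_first points back at
 * the TB itself with tag 2. */
void tb_ring_init(TranslationBlock *tb)
{
    tb->jmp_next[0] = NULL;
    tb->jmp_next[1] = NULL;
    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2);
}

/* Exit n of tb now jumps directly to tb_next (cf. tb_add_jump()). */
void tb_add_jump(TranslationBlock *tb, int n, TranslationBlock *tb_next)
{
    if (!tb->jmp_next[n]) {
        /* patch the generated code so exit n lands in tb_next's code */
        tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc_ptr);

        /* splice (tb, exit n) onto the front of tb_next's incoming ring:
         * jmp_next[n] inherits the old jmp_first, and jmp_first becomes
         * "tb, continue at jmp_next[n]" via the low-bit tag. */
        tb->jmp_next[n] = tb_next->jmp_first;
        tb_next->jmp_first = (TranslationBlock *)((uintptr_t)tb | (uintptr_t)n);
    }
}

/* To walk the ring (as tb_jmp_remove() does): mask the tag off to get the
 * TB, and use the tag to pick which field continues the list; tag 2 means
 * the circle is closed. */
TranslationBlock *ring_tb(TranslationBlock *tagged)
{
    return (TranslationBlock *)((uintptr_t)tagged & ~(uintptr_t)3);
}

unsigned ring_tag(TranslationBlock *tagged)
{
    return (unsigned)((uintptr_t)tagged & 3);
}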

=====================================================================

tb_remove(ptb, tb, next_offset)

  • Remove tb from the physical-hash TB linked list.
  • called in tb_phys_invalidate()

tb_page_remove(ptb, tb)

  • Remove tb from the page TB linked list.
  • called in tb_phys_invalidate()

tb_jmp_remove(tb, n)

  • Remove tb from the TB jump circular list.
  • called in tb_phys_invalidate()

tb_phys_invalidate(tb, page_addr)

  • called from
    • tb_invalidate_phys_page_range(start, end, is_cpu_write_access)
    • tb_invalidate_phys_page(addr, pc, puc)
      • Only in USER MODE.
    • check_watchpoint(offset, len_mask, flags)
    • cpu_io_recompile(env, retaddr)

tb_invalidate_phys_page_range(start, end, is_cpu_write_access)

  • called from 
    • tb_invalidate_phys_range(start, end, is_cpu_write_access)
      • only called in linux-user/mmap.c
    • tb_invalidate_phys_page_fast(start, len)
      • uses code_bitmap to quickly check whether any TB falls in this range.
      • called from notdirty_mem_write(opaque, ram_addr, val, size)
    • tb_invalidate_phys_addr(addr)
      • only if TARGET_HAS_ICE
    • cpu_physical_memory_rw(addr, buf, len, is_write)
    • cpu_physical_memory_unmap(buffer, len, is_write, access_len)
    • stl_phys_notdirty(addr, val)
      • tb_invalidate_phys_page_range is called if in_migration
    • stl_phys_internal(addr, val, endian)
    • stw_phys_internal(addr, val, endian)
      • called from stw_phys(), stw_le_phys() /* little endian */, stw_be_phys()
===============================================================

2012年7月5日 星期四

Ubuntu boot order/sequence management: Upstart

https://help.ubuntu.com/community/UpstartHowto

Sink-Optimization Implementation Problem


Using over 5 GB of memory? What the fuck?
A fixed array of 256 MissBlockInfo per TranslationBlock is far too much!
Allocating just enough MissBlockInfo per TranslationBlock fixed the problem (see the estimate below).
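
A back-of-envelope check of the blow-up. The MissBlockInfo layout below is a guess (the real one is in my patches), but any record around 24 bytes gives the same order of magnitude:

#include <stdio.h>
#include <stdint.h>

typedef struct MissBlockInfo {     /* hypothetical layout */
    uint8_t *fast_path_ptr;        /* patch site in the hot trace */
    uint8_t *miss_block_ptr;       /* sunk slow-path code */
    uint64_t flags;
} MissBlockInfo;

int main(void)
{
    size_t num_tbs = (size_t)1 << 20;  /* ~1M TBs in a big code cache */
    size_t fixed   = 256 * sizeof(MissBlockInfo) * num_tbs;
    size_t typical = 8 * sizeof(MissBlockInfo) * num_tbs;  /* "just enough" */

    printf("fixed 256 slots/TB : %zu MB\n", fixed >> 20);   /* ~6144 MB */
    printf("~8 used slots/TB   : %zu MB\n", typical >> 20); /* ~192 MB */
    return 0;
}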




BR_MISP_EXEC

Mispredicted Branch instructions executed, speculative and retired, by type 


http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/ug_docs/reference/pmn/events/br_misp_exec.html


2012年7月2日 星期一

Note


  • Sink Slow Blocks has bugs?
    • boot output differs from original
    • cpu_restore_count for cpu_restore_state seems to be normal
    • problem: some errors during boot trigger the failed-boot delay, which forces execution to sleep
  • Rewrite Optimization 1
    • simplify it
  • Problem should be cpu_restore_state
    • not exactly!
  • New phenomenon: the original is now as slow as opt1 and opt2 (maybe).
======================================================
  • Debug sinking of slow blocks
    1. individual tests: qemu_ld32, qemu_ld16[us], qemu_ld8[us], qemu_ld64
    2. rewrite!
  • The approach
    • keep two copies each of tlb_load, qemu_ld, and qemu_st
======================================================