2010年7月21日 星期三

Hot trace model

today, you need to figure out the model of building a hot trace
and you will have a meeting with Prof. Liu to discuss this.

First, collect the posts of Prof. Liu for hot trace, qutoe as follows.
We encountered a hot-trace-finding problem last Thursday. Basically
this problem asks for a good hot trace so that a large number of
execution sequences will remain in the hot trace. Formally we are
given a directed graph G (call graph?) and a set of simple paths P in
G. Each simple path is a sequence of at least two different nodes in
G so that consecutive nodes in the path has a directed edge in G. Note
that we assume that the nodes in a simple path must be different. I am
not sure if this is the case in our system but let us assume it for
the moment. Now we want to find another simple path h (this is the hot
trace we are looking for) so that at least k of the paths in P are
contained in H. A path p is contained in another path q if and only if
q is a substring of p. Now we can define the problem -- given G, P,
and k, is there a hot trace h that cotains at least k paths in P? We
will refere to this problem as k-HOT-TRACE.

It seems that k-HOT-TRACE is NP-complete by reducing from Hamitonian
path. Given a graph H we would like to ask whethere there is a
Hamitonian path in H, we simple transform the priblem instance into a
k-HOT-TRACE problem instance. We simply use H as G, and let P be the
set of path of two nodes, i.e., all edges in G, and set k to be the
number of nodes in G minus 1. As a result if there is a solution for
the (n-1)-HOT-TRACE problem for G, there is a Hamitonian path in H,
and vice versa.

Two followups may be possible. Anyone interested in these please come
talk to me.

There could be an efficient dynamic programming solution when G is a
tree, even when we further restrict the length of the hot trace.

This seems easy. Now the problem is that we limit the length of the
hot trace, and try to find the one that contains the maximum number of
paths. We define a function P(v, l) to be the number of paths that are
contained by the hot trace that ends at v and has length l. It is easy
to write down the recursive formula that P(v, l) = P(w, l-1) + the
number of path that ends at v nad has length no more than l, where w
is the parent of v. We have N times L cells to fill, where N is the
number of tree nodes and L is the maximum hot trace length allowed.
Each cell needs no more than \log n(v) operations where n(v) is the
number of paths end at v. Roughly the total cost is no more than O(N L
\log(N)). Of course this is very rough and I need to work harder to
get a closer bound.

There could be a dynamic programming solution when G is a serial-
parallel graph.

FROM TK:
I think the function P(v, l) should be defined as the MAXIMUM number of paths that are
contained by a hot trace that ends at v and has length l since there are more
than one traces that are end at v and have length l.

Also, about the recursive function, it seems the formula only contains "one half" cases.
It only considers hot traces that pass through w and end at v.
However, there could be other ended-at-v-with-length-l hot traces that are contained only within the subtree rooted at v,
rather passing through v's parent.

However, I still need some time to figure the correct recursive function. come back later.

I write down the problem and recursive function for tree in http://www.iis.sinica.edu.tw/~tk/k-hot-path.pdf
Chun-Chen


NOTE: FROM ICE
In our system based on Qemu with LLVM, we generate the traces(i.e.
dynamic basic blocks)
which have only one entrance on the head of these blocks, so far.
Moreover, especially we want to translate a hot trace with the
optimization options of LLVM such that
the translated instructions may be reordered.
Consequently, the host code of the hot traces cannot have multiple
entrances, either.
In other word, different jump target addresses imply different traces.
Therefore, the same (static) basic blocks may be included in different
hot traces.
Furthermore, the assumption, "the nodes in a simple path must be
different", may not be suitable for our system.

Our approach prefers to construct the hot traces whose lengths are as
long as possible
since we believe that longer traces have more opportunities to be
optimized.
However, the longer traces will incur the more overhead to generate
optimized code.
We should limit the length of a hot trace.

Considering definition of a hot trace, we should consider the
frequency of the execution.
We should not only find the longer hot traces on the CFG but also make
the hot code segments included in the hot traces.

Due to these reasons,
I think that the hot trace problem should be modified as follows to
match the practice of our system.

We still have a directed graph G.
A node in G represents a (static) basic block.
An edge connecting two node in G represents that one of these two node
may be executed after the other one is executed.
An edge in G has a weight to represent the frequency of the execution
between the two nodes connected by this edge.

We limit the lengths of hot traces under a given k to consider the
overhead of generate optimized code.
We want to find the minimum number of the hot traces with length
constraint to include the hot edges
where hot edges are the edges which have the weight larger than a
given threshold.
(Note that the same nodes can be included in different hot traces.)

If we remove the weights on the edges but give each node a weight to
represent the execution frequency of this node, instead,
we may have another similar version of this problem as follows.

We want to find the minimum number of the hot traces with length
constraint to include the hot nodes
where hot nodes are the nodes which have the weight larger than a
given threshold.

ice

Sorry that I do not understand what you are saying here.
Do you mean you want to find a minimum set of length-limited paths
that cover a set of specified (hot) edges?

I will refine the model according to the description of ice. Correct
me if I am wrong.

We will have weight information on edges and nodes, instead of a set
of paths as I described earlier. So we will be given a directed graph
G, a constant l, and a constant k. The edges of G are partitioned into
two subsets -- important edges and unimportant edges. We want to find
a set S of at most k directed paths where each of them is at most of
length l, and the paths in S cover all important edges. We will call
this problem edge-covering-path-set problem. Similarly we can define
node-covering-path-set problem, where we partition nodes in G into
important and unimportant ones, and define the problem similarly.

It is easy to see that node-covering-path-set is NPC, since we simply
make every node important, set k to 1 and l to n - 1, where n is the
number of nodes in G, then we have a Hamiltonian path problem. I are
wondering if edge-covering-path-set is also NPC.

These two problems on tree should be easy as long as the all edges are
directed away from the root. However, this is only theoretically
interesting since in practice the graph will not be a tree.

The edge-covering-path-set problem seems to be NPC as well. Here is
the proof. We will reduce from Hamiltonian path again. Given a
directed graph H as an instance of Hamiltonian path problem, we
construct G as follows. Every node in H node splits into two nodes
called head and tail. We then add an directed edge (called internal
edge) from head to tail. Now if there is an edge from v to w in H, we
put a directed edge from the tail of v to the head of w. Now we make
all internal edges "important", l to be 2n - 1, where n is the number
of nodes in H, and k = 1. It should be easy to see that a hot path in
G will be a Hamiltonian path in H, and vice versa.

We have three problems now -- information given as paths, on the node,
or on the edges. All of them are NPC, but all of them are sovlable on
trees, with edgings going away from the root. The path version can be
solved by a dynamic programming, and the node and edge version can be
solved by greedy methods.


The above discussion did not include possibility of conditional branches.
maybe we can add possibility into this problem.

沒有留言:

張貼留言