-*- Mode: markdown; -*-
- Extensive MemGaze cleanup and consolidation.
  - Rework the tools' user interface.
  - Document, organize, and clean up the drivers.
  - Create a basic test suite and examples.
- MemGaze now supports analysis of loads only, or of loads and stores. The `.binanlys` data generated by `memgaze-inst` contains the access type (load vs. store).
- MemGaze now supports analysis of applications that use multiple load modules (DSOs). That is, multiple DSOs can be instrumented, traced, and analyzed. To support this, trace collection and analysis retain the mapping between trace entries and the responsible load module.
- `memgaze-run` can be invoked with redirection operators.
  - Won't work: invoke memgaze-run with `<` redirection (form 1). This form redirects to memgaze-run, not to perf (memgaze-run never sees the '<' argument):
    `./memgaze-run ... -- ./a.out < file`
  - Won't work: invoke with the `<` quoted (form 2). The redirect loses its meaning:
    `./memgaze-run ... -- ./a.out \< file`
  - Current solution: invoke as (1), read STDIN into a temporary file, and invoke perf as:
    `perf record ... < tmp`
  - Better solution: invoke as (1), detect data on STDIN, and pass the resulting file descriptor directly to perf:
    `perf record ... < 0`
  - Better solution: invoke as (2) but use `eval <args>`.
- Limitation: `memgaze-inst` and non-contiguous functions, i.e., functions whose code is spread over multiple segments in the binary. We have corrected most, but not all, problems when `memgaze-inst` encounters a non-contiguous function (e.g., from hot-cold region layout). DynInst represents a function as a set of regions, possibly with gaps. Initially our CFG builder failed for non-contiguous functions due to a problem in the Tarjan interval analysis.
  - In our first version, we set the function end address to the maximum among the DynInst regions. However, that could pull in unrelated code, which produced a garbage/confusing CFG and data flow.
  - Our next version selected the first contiguous region as the function bounds and ignored the rest. This was good for ignoring uninteresting exception-handling code.
  - Our current version still uses a single begin/end address per function, but it uses a small threshold to expand the function across small gaps. This ensures that we continue analyzing across nops or small regions of padding (e.g., XSBench). See the sketch after this list.
  - Long term: use DynInst for control and data flow.
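A minimal sketch of the gap-threshold bounds computation described above, assuming function regions arrive as half-open [start, end) byte ranges. The `Region` struct, the `functionBounds()` name, and the 64-byte threshold are illustrative placeholders, not MemGaze's or DynInst's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the code regions DynInst reports for a function:
// half-open [start, end) address ranges, possibly separated by gaps.
struct Region { uint64_t start, end; };

// Compute a single [begin, end) bound for the function.  Regions separated by
// at most gapThreshold bytes (padding, nops) are merged so analysis continues
// across them; a region past a larger gap (e.g., a relocated cold section) is
// ignored, matching the "single begin/end + small threshold" approach above.
std::pair<uint64_t, uint64_t>
functionBounds(std::vector<Region> regions, uint64_t gapThreshold = 64) {
  if (regions.empty()) return {0, 0};
  std::sort(regions.begin(), regions.end(),
            [](const Region& a, const Region& b) { return a.start < b.start; });
  uint64_t begin = regions.front().start;
  uint64_t end   = regions.front().end;
  for (size_t i = 1; i < regions.size(); ++i) {
    if (regions[i].start <= end + gapThreshold)
      end = std::max(end, regions[i].end);  // small gap: extend the bounds
    else
      break;                                // large gap: stop at the first hole
  }
  return {begin, end};
}
```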
- Corrected a problem in `memgaze-inst` where the source-code mapping's line numbers were off. We now capture DynInst's final values for the .dyninst section.
- The memory bloat of `memgaze-analyze` has been corrected. The trace's memory bloat was roughly 25x over the text trace file (220 MB -> 5 GB).
  - Problem 1: The trace was a vector with each entry `new`ed; each entry was a fully connected graph of vertex objects holding { access-time, address, instruction, cpu }.
  - Solution for v1: Make a vector of trace accesses, where each access has tuple fields (see the sketch after this list).
    - Savings/entry: 22 bytes + ~200 string bytes (primarily in strings)
      - 12 pointers: 4 objects x 3 pointers
      - 2 bytes: 'cpuid' is uint16
      - 4 bytes: 'int rDist' is not needed
      - 2 bytes: load class is a uint16
      - 2 bytes: extra-frame-loads is uint16
      - ~50-100 bytes: string for func-name; should be an id into a string map
      - ~100 bytes: load-module string; should be an id (uint16) into a string vector
  - Problem 2: Computation of the footprint metrics uses intermediate data structures (primarily unique-address sets) that are both copied and not properly freed. Tangled mess.
  - Solution for v2: Free intermediate address sets when possible.
  - Memory usage analysis:
    - v0: original (200 MB trace file, 4.6M accesses, 450 accesses/sample); bad trace format
    - v1: sane trace representation; but the metrics are currently a map (should be a vector!) and the footprint address sets are both copied and not deleted (should be neither!)
    - v2: corrects v1's problems

    | version | htop | in-use bytes | allocs | frees | total bytes |
    |---------|------|--------------|--------|-------|-------------|
    | v0 | 6 GB | 4.5 GB | 402,337,069 | 344,264,091 | 26,771,447,937 |
    | v1 | 4 GB | 3.9 GB | 125,180,603 | 57,382,481 | 8,513,580,436 |
    | v2 | 2.2 GB | 1.8 GB | 128,684,796 | 95,264,329 | 8,746,743,884 |

    v1 breakdown: 800 MB for the trace (new representation, primarily due to strings); +600 MB for single and intra-sample windows (with metric-calculation data); +2.6 GB for the entire window tree (with metric-calculation data).
  - There are still some leaks, but this is much better:
    - the execution interval tree uses too much memory
    - interior nodes preserve only final summary metrics, not intermediate data
    - sample nodes have trace pointers; for any interior node, find the trace
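A minimal sketch of the v1-style trace representation described above: one flat record per access with small integer fields, and string-table ids in place of per-entry function/load-module strings. Field names, widths, and the `StringTable` helper are illustrative assumptions, not MemGaze's actual definitions.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string table: store each function / load-module name once and
// refer to it from every access by a small id.
struct StringTable {
  std::vector<std::string> names;
  std::unordered_map<std::string, uint16_t> ids;
  uint16_t intern(const std::string& s) {
    auto it = ids.find(s);
    if (it != ids.end()) return it->second;
    uint16_t id = static_cast<uint16_t>(names.size());
    names.push_back(s);
    ids.emplace(s, id);
    return id;
  }
};

// One trace access as a flat tuple: no per-entry heap graph of vertex
// objects, no owned strings, and the small fields use 16-bit integers.
struct TraceAccess {
  uint64_t time;            // access time
  uint64_t addr;            // effective address
  uint64_t insnAddr;        // instruction address
  uint16_t cpuId;           // 'cpuid' fits in uint16
  uint16_t loadClass;       // load class fits in uint16
  uint16_t extraFrameLoads; // fits in uint16
  uint16_t funcId;          // id into StringTable instead of a func-name string
  uint16_t lmId;            // id into StringTable instead of a load-module string
};

// The trace itself is a plain vector of accesses rather than a vector of
// pointers to individually new'ed, fully connected graph entries.
using Trace = std::vector<TraceAccess>;
```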
- Ozgur Kilic's annotations:
  - FIXME:dyninst -> changes to use DynInst
  - FIXME:amd -> commented out for AMD instructions
  - FIXME:instruction -> changes for new instruction types
  - FIXME:BETTER -> things that can be updated for a better approach
  - FIXME:NEWBUILD -> changes made for the Spack build
  - FIXME:latency -> possible error for latency
  - FIXME:unkown -> I don't remember why I did that
  - FIXME:tallent -> not from Ozgur
  - FIXME:old -> not from Ozgur
  - FIXME:deprecated -> not from Ozgur
- Ozgur Kilic's changes for the Spack build:
  - load_module.C:1146:11: error: ‘class BPatch’ has no member named ‘setRelocateJumpTable’
    `bpatch.setRelocateJumpTable(true);` — TODO/FIXME: I commented out that line for now.
  - src/Scheduler/Makefile (lines 36-39):
    DYNINST_CXXFLAGS = \
      -I$(DYNINST_INC) \
      -I$(BOOST_INC) \
      -I$(TBB_INC)
  - /files0/kili337/TestBed/memgaze/memgaze/bin-anlys/src/common/source_file_mapping_binutils.C:
    - 168:9: note: suggested alternative: ‘bfd_set_section_flags’
      `if ((bfd_get_section_flags (abfd, section) & SEC_ALLOC) == 0)`
    - 175:10: note: suggested alternative: ‘bfd_set_section_vma’
      `vma = bfd_get_section_vma (abfd, section);`
    - 176:11: note: suggested alternative: ‘bfd_set_section_size’
      `size = bfd_get_section_size (section);`
    - 217:19: error: ‘bfd_get_section_vma’ was not declared in this scope
      `addrtype vma = bfd_get_section_vma (info->abfd, section);`
  - src/common/InstructionDecoder-xed-iclass.h:
    line 25: `//FIXME:NEWBUILD case XED_ICLASS_PFCPIT1: // 3DNOW`
    line 38: `//FIXME:NEWBUILD case XED_ICLASS_PFSQRT: // 3DNOW`
- Initial support for new Xed instructions (see MIAMI-NW/Pin 3.x support). [[TODO]] Properly model instructions, esp. LOCK instructions.
- Initial support for decoding with either Xed or DynInst:
  - SystemSpecific/x86_xed_decoding.C -> InstructionDecoder-xed.cpp
  - SystemSpecific/IB_x86_xed.C -> InstructionDecoder-xed-iclass.h
- Simplify the make system (now that the PIN-based tools are gone):
  - miami.config
  - src/make.rules [removed]
  - src/Scheduler/makefile.pin -> Makefile [renamed]
- Replace use of PIN with Xed (supports newer GCCs and C++ RTTI):
  - Use the argp option parser (instead of PIN's).
  - Remove tools that depend on PIN.
  - miami.config.sample
  - miami.config
  - src/make.rules
  - src/Scheduler/makefile.pin [removed]
  - src/tools/pin_config [removed]
  - src/{CFGtool, CacheSim, MemReuse, StreamSim} [removed]
- Use DynInst to load/process the binary instead of PIN. LoadModule/Routine/instructions use data from DynInst.
  - src/Scheduler/load-module.C
  - src/Scheduler/routine.C
- Changes NRT made to Xia's code to incorporate 'external4' into 'MemGaze/bin-anlys':
- Reinstated the MIAMI driver.
- Disabled Xia's code.
- Although we will eventually discard it, we should use XED as a validation tool.
- I updated the instruction translation to print the input instruction using both XED's and DynInst's decoders. The decodings should align.
- We should check the output translation in two ways: first using MIAMI's original XED-based translator, and then using our DynInst-based translator. I've updated the driver accordingly. (A sketch of the cross-decoder check is below.)
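A hedged sketch of the XED-vs-DynInst cross-check mentioned above: decode the same instruction with both decoders and flag any disagreement. The two decode helpers are stubs; the real code would call XED's and DynInst's decoders, and exact string equality of their formatted output is an assumption here.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Stub: the real code would format the XED decode of the instruction at pc.
std::string decodeWithXed(uint64_t pc) { return "insn@" + std::to_string(pc); }

// Stub: the real code would format the DynInst decode of the instruction at pc.
std::string decodeWithDyninst(uint64_t pc) { return "insn@" + std::to_string(pc); }

// Validation hook: print/compare both decodings; they should align.
bool decodersAlign(uint64_t pc) {
  const std::string xed = decodeWithXed(pc);
  const std::string dyn = decodeWithDyninst(pc);
  if (xed != dyn)
    std::cerr << "decoder mismatch @0x" << std::hex << pc << std::dec
              << ": xed='" << xed << "' dyninst='" << dyn << "'\n";
  return xed == dyn;
}
```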
- Xia's code is slow for a couple of reasons:
  - For one routine, it searches through all functions: O(|functions|).
  - For one instruction, the translation is O(|functions| * |blocks| * |insn-in-block|). This holds even for the simplest instruction (e.g., a nop) without registers.

  I removed the last term (|insn-in-block|) by fixing a decoding bug in isaXlate_getDyninstInsn(). The code had scanned the instructions in a basic block but stopped just before the requested instruction address. (A sketch of the corrected scan is below.)
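A minimal sketch of the corrected per-block scan; the instruction record and the `findInsnAt` name are placeholders, not the actual isaXlate_getDyninstInsn() code. It only shows the shape of the fix: the loop must include the instruction that starts at the requested address instead of stopping just before it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical decoded-instruction record for one basic block.
struct Insn { uint64_t addr; unsigned length; };

// Scan a block's instructions for the one at 'pc'.  The buggy version
// stopped just before the requested address, so the lookup failed even
// though the instruction was in the block; including the == case fixes it.
const Insn* findInsnAt(const std::vector<Insn>& blockInsns, uint64_t pc) {
  for (const Insn& in : blockInsns) {
    if (in.addr == pc) return &in;  // the requested instruction
    if (in.addr > pc) break;        // past it; 'pc' is not in this block
  }
  return nullptr;
}
```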
- Xia's code had static data structures. That makes the code hard to understand and unsafe for threads. Also, some of the data structures were computed twice: once when building the CFG and again when initializing instruction translation for the routine. For example, `bpatch.openBinary()` was called twice for the same routine [in create_loadModule() and isaXlate_init()], creating two copies of the BPatch_image and vector<BPatch_function*>. I consolidated the CFG and instruction-translation code to avoid this and to make the code easier to understand. For now I have partially cleaned up the way the static data is used so that, e.g., lm_func2blockMap is not computed multiple times (once for each routine); see the sketch below.
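A minimal sketch of the consolidation idea: build the per-load-module state once and share it between CFG construction and instruction translation, instead of re-opening the binary per routine. The `LoadModuleCtx` / `LoadModule` names and fields are hypothetical; the real code holds DynInst's BPatch_image and the lm_func2blockMap.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-load-module state that used to be rebuilt twice
// (once in create_loadModule() and again in isaXlate_init()).
struct LoadModuleCtx {
  std::string imageHandle;  // stand-in for the BPatch_image from openBinary()
  std::map<std::string, std::vector<uint64_t>> func2blockMap;  // stand-in for lm_func2blockMap
};

// Hypothetical LoadModule that owns its context, so CFG building and
// instruction translation share one copy and nothing lives in static data.
class LoadModule {
public:
  explicit LoadModule(std::string path) : path_(std::move(path)) {}

  LoadModuleCtx& ctx() {
    if (!ctx_) {
      ctx_ = std::make_unique<LoadModuleCtx>();
      ctx_->imageHandle = path_;  // stand-in for bpatch.openBinary(path_)
      // ... populate ctx_->func2blockMap once from the parsed CFG ...
    }
    return *ctx_;
  }

private:
  std::string path_;
  std::unique_ptr<LoadModuleCtx> ctx_;
};
```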
- BUGS:
  - The DynInst Function/BasicBlock context should be part of the MIAMI classes, e.g., Routine and CFG.
  - Complete the translation from DynInst::Instruction -> MIAMI::Instruction.
  - Some of the static data structures are never cleared. For example, `func_blockVec` is never cleared. It seems this may have created an ever-expanding worklist for get_instructions_address_from_block().
  - Memory leak in `Routine::decode_instructions_for_block()` ...
- ~huxi333/palm/trunk/external: Corresponds to MIAMI-v1 (just debug output)
  - src/CFGtool/cfgtool_dynamic.C
  - src/CFGtool/cfgtool_static.C
  - src/Scheduler/DGBuilder.C
  - src/Scheduler/MiamiDriver.C
  - src/Scheduler/load_module.C
  - src/Scheduler/routine.C
  - src/common/PrivateCFG.h
  - src/common/SystemSpecific/x86_xed_decoding.C
- ~huxi333/palm/trunk/external4: Incorporated partially into MemGaze/bin-anlys
  - MIAMI/src/Scheduler/Report: Summary of work
  - MIAMI/src/Scheduler/dyninst_*: New files
  - Control-flow changes in the driver:
    - src/Scheduler/MiamiDriver.C
    - src/Scheduler/load-module.C
    - src/Scheduler/routine.C
    - src/Scheduler/DGBuilder.C
  - Test for duplicates:
    - src/OAUtils/BaseGraph.C
  - Replace `dynamic_cast` with `static_cast`:
    - src/CFGtool/CFG.h
    - src/MemReuse/CFG.h
    - src/OAUtils/DGraph.h
    - src/Scheduler/PatternGraph.h
  - Debug output:
    - src/CFGtool/cfg_data.C
    - src/CFGtool/routine.C
    - src/CFGtool/cfgtool_static.C
    - src/Scheduler/XML_output.C
    - src/Scheduler/schedtool.C
  - No effect / buggy:
    - src/Scheduler/DGBuilder.h
    - src/Scheduler/SchedDG.C
- Environment: ~huxi333/.bashrc, ~huxi333/pkg
- Others:
  - ~huxi333/palm/trunk/external2: Not useful; not yet compiled.
  - ~huxi333/palm/trunk/external3: Not useful. Attempted to work on 'CFGTool'.
- Open the 'cfgprof' profile: MIAMI_Driver::Initialize()
- Instruction path analysis using the new basic-block profiling (paths are reconstructed per routine; callsites connect inter-procedural paths):
  - Routine::myConstructPaths [routine.C]
    -> SchedDG::SchedDG()
    -> MIAMI_DG::DGBuilder(Routine)
    -> DGBuilder::build_graph()
  - Routine::constructPaths()
    -> DGBuilder::computeMemoryInformationForPath()
    -> SchedDG::find_memory_parallelism()
- Calls [some] to instruction decoding:
  - main() [schedtool.C]
    -> MIAMI_Driver::LoadImage()
    -> LoadModule::analyzeRoutines()
    -> Routine::main_analysis()
    -> Routine::build_paths_for_interval()
    -> Routine::decode_instructions_for_block()
    -> isaXlate_insn() / decode_instruction_at_pc()  <==
  - Routine::build_paths_for_interval()
    -> Routine::constructPaths()
    -> DGBuilder::DGBuilder()
    -> DGBuilder::build_graph()
    -> DGBuilder::build_node_for_instruction()
    -> isaXlate_insn() / decode_instruction_at_pc()
- Some debug tracing:
  - src/Scheduler/debug_scheduler.h
- XED-based decoding implementation:
  - common/instruction_decoding.h
  - common/SystemSpecific/x86_xed_decoding.C
- Scheduling...
  - Scheduler/SchedDG.{h,C}: scheduler analysis on the MIAMI IR
  - Scheduler/DGBuilder.{h,C}: dependence-graph builder; extends SchedDG
  - MIAMI's DGBuilder takes a CFG::Block of raw data and decodes it. (This seems to be the wrong order.)
  - Gabriel: "The DGBuilder takes as input a vector of (MIAMI) CFG::Node elements (the basic blocks of the path), decodes them to the MIAMI IR, and builds register/memory/control dependencies on them. All the analysis is then done on the IR itself, so I feel that it is possible to interface with another tool if you write another DG builder that takes as input your representation of blocks and converts them to the MIAMI IR."
  - Two basic approaches for interfacing with DynInst; favor (b). See the sketch after this list.
    a) Use DynInst CFG/instructions; build an interface around them to enable the MIAMI dependence-graph builder.
    b) Create MIAMI CFG/instructions from DynInst CFG/instructions.
  - common/instr_info.H: information for an instruction
  - common/instr_bins.H: IB = instruction bin
  - common/instruction_decoding.C: instruction-decoding interface
  - Scheduler/GenericInstruction.h
  - Any definition files for other microarchitectures?
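A hedged skeleton of approach (b) above, under the assumption that the conversion takes DynInst-side blocks for one path and materializes MIAMI-side nodes the existing DGBuilder can consume. The `DyninstBlock` / `MiamiNode` / `toMiamiPath` names are placeholders, not the real DynInst or MIAMI classes.

```cpp
#include <cstdint>
#include <vector>

// Placeholder stand-ins for the two sides of the conversion.  The real types
// would be DynInst (ParseAPI) blocks/instructions on one side and MIAMI's
// CFG nodes / instruction IR on the other.
struct DyninstBlock { uint64_t start, end; };
struct MiamiNode    { uint64_t start, end; /* decoded MIAMI IR would live here */ };

// Approach (b): walk the DynInst blocks of one path and create MIAMI nodes
// from them, so the unmodified DGBuilder can keep operating on MIAMI IR.
std::vector<MiamiNode> toMiamiPath(const std::vector<DyninstBlock>& path) {
  std::vector<MiamiNode> out;
  out.reserve(path.size());
  for (const DyninstBlock& b : path) {
    // Decode b's instructions to MIAMI IR here (e.g., via the DynInst-based
    // translator), then attach them to the new node.
    out.push_back({b.start, b.end});
  }
  return out;
}
```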
- In routine.C:
  - line 1706: `MIAMI_DG::DGBuilder *sch = NULL;`
  - line 1726: `sch = new MIAMI_DG::DGBuilder(this, pathId,`
  - line 1753: `MIAMI_DG::schedule_result_t res = sch->myComputeScheduleLatency()`
- In Scheduler/SchedDG.C:
  - line 10306: `retValues ret = myMinSchedulingLengthDueToDependencies(memLatency, cpuLatency);`
  - line 11615: `teit->sink()->myComputePathToLeaf()`
  - line 11897: `SchedDG::Node::myComputePathToLeaf`
  - myComputePathToLeaf is a recursive function that visits all edges via the outgoing-edge iterator (a sketch follows below).
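A hedged sketch of the recursive shape described for myComputePathToLeaf: follow every outgoing edge and recurse into its sink, here accumulating the longest latency down to a leaf as an assumed objective. The node/edge types and the latency bookkeeping are placeholders, not SchedDG's actual classes.

```cpp
#include <algorithm>
#include <vector>

// Placeholder node/edge shapes; SchedDG's real classes differ.
struct Node;
struct Edge { Node* sink; int latency; };
struct Node { std::vector<Edge> outgoing; };

// Recurse over every outgoing edge (the "outgoing-edge iterator") and return
// the longest latency of any path from this node down to a leaf.
// (Assumes the dependence graph is acyclic.)
int computePathToLeaf(const Node& n) {
  int best = 0;
  for (const Edge& e : n.outgoing)
    best = std::max(best, e.latency + computePathToLeaf(*e.sink));
  return best;
}
```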
- In Scheduler/MiamiDriver.C:
  - line 141: `MIAMI_Driver::Initialize(MiamiOptions *_mo, int _pid)`
  - line 153: `fd = fopen(mo->cfg_file.c_str(), "rb");`
  - The fd is then used across several functions. The flow is as follows:
- In Scheduler/schedtool.C:
  - line 309: `MIAMI::mdriver.LoadImage()`
- In Scheduler/MiamiDriver.C:
  - line 457: `MIAMI_Driver::LoadImage(`
  - line 516: `newimg->loadFromFile(fd, false);`
- In Scheduler/load_module.C:
  - line 59: `LoadModule::loadFromFile(FILE *fd, bool parse_routines)`
  - line 142: `LoadModule::loadRoutineData(FILE *fd)`
  - line 185: `LoadModule::loadOneRoutine(FILE *fd, uint32_t r)`
- In Scheduler/routine.C:
  - line 73: `Routine::loadCFGFromFile(FILE *fd)`
- In Scheduler/CFG.C:
  - line 142: `CFG::loadFromFile()`