-*-Mode: markdown;-*-

$Id$


TODO


Detailed Notes

  • Extensive MemGaze cleanup and consolidation.

    • Rework tools' user interface.
    • Document, organize, and clean up drivers
    • Create basic test suite and examples
  • MemGaze now supports analysis of loads only or of both loads and stores. The .binanlys data generated by memgaze-inst records each access's type (load vs. store).

  • MemGaze now supports analysis of applications that use multiple load modules (DSOs). That is, multiple DSOs can be instrumented, traced, and analyzed. To support this, trace collection and analysis retain the mapping between trace entries and the responsible load module.

  • memgaze-run can be invoked with redirection operators.

    1. Won't work: invoke memgaze-run with < redirection. The shell applies the redirect to memgaze-run, not to perf, and memgaze-run never sees the '<' argument:
       ./memgaze-run ... -- ./a.out < file

    2. Won't work: invoke with < quoted. The redirect loses its meaning and is passed as a literal argument:
       ./memgaze-run ... -- ./a.out \< file

    3. Current solution: invoke as (1), read STDIN into a temporary file (tmp), and invoke perf as:
       perf record ... < tmp

    4. Better solution: invoke as (1), detect input on STDIN, and pass the resulting file descriptor directly to perf:
       perf record ... < 0

    5. Better solution: invoke as (2) but use eval <args>
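
The difference between options 3 and 4 can be sketched in Python. This is only an illustration under assumptions: the function names are hypothetical, the child command stands in for `perf record ...`, and nothing here is memgaze-run's actual implementation.

```python
import shutil
import subprocess
import sys
import tempfile


def run_with_spooled_stdin(cmd, stdin_stream=None):
    """Option 3: copy the wrapper's stdin into a temp file and give the child that file."""
    source = sys.stdin.buffer if stdin_stream is None else stdin_stream
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(source, tmp)  # binary copy of everything on stdin
        tmp.seek(0)                      # flush and rewind so the child reads from the start
        result = subprocess.run(cmd, stdin=tmp, capture_output=True, check=True)
    return result.stdout


def run_with_inherited_stdin(cmd, stdin_stream=None):
    """Option 4: hand the wrapper's own stdin descriptor to the child, with no copy."""
    source = sys.stdin if stdin_stream is None else stdin_stream
    result = subprocess.run(cmd, stdin=source, capture_output=True, check=True)
    return result.stdout
```

Option 4 avoids the temporary file entirely, which matters when the redirected input is large; option 3 is simpler to reason about because the child always sees a regular, seekable file.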

  • Limitation: memgaze-inst and non-contiguous functions, i.e., functions whose code is spread over multiple segments in the binary.

    We have corrected most but not all problems that arise when memgaze-inst encounters a non-contiguous function (e.g., one produced by hot-cold region layout). DynInst represents a function as a set of regions, possibly with gaps. Initially, our CFG builder failed for non-contiguous functions due to a problem in Tarjan interval analysis.

    In our first version, we set the function end address to the maximum end address among the DynInst regions. However, this could pull in unrelated code, which produced garbage or confusing CFGs and data flow.

    Our next version selected the first contiguous region as the function bounds and ignored the rest. This was good for ignoring uninteresting exception handling.

    Our current version still uses a single begin/end address for a function. However, it uses a small threshold to expand the function across small gaps. This ensures that we continue analyzing across nops or small regions of padding (e.g., XSBench).

    Long term: Use DynInst for control and data flow.
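
A minimal Python sketch of the gap-threshold idea described above. The function name, the region representation (half-open `(start, end)` address pairs), and the default threshold of 16 bytes are all assumptions for illustration; memgaze-inst's actual threshold and data structures are not stated in these notes.

```python
def function_bounds(regions, max_gap=16):
    """Return a single (begin, end) for a function made of several regions.

    Starts from the lowest region and extends the end address across any gap
    of at most max_gap bytes. Stops at the first larger gap, so distant code
    (e.g., a cold region) is left out.
    """
    if not regions:
        raise ValueError("a function needs at least one region")
    ordered = sorted(regions)
    begin, end = ordered[0]
    for start, stop in ordered[1:]:
        if start - end > max_gap:
            break  # gap too large: treat the rest as unrelated code
        end = max(end, stop)
    return begin, end
```

With `max_gap=0` this reduces to the "first contiguous region" policy, and with a very large `max_gap` it reduces to the "maximum end address" policy, so the threshold interpolates between the two earlier versions.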

  • Corrected a problem in memgaze-inst that caused the source-code mapping's line numbers to be off. We now capture DynInst's final values for the .dyninst section.

  • The memory bloat of memgaze-analyze has been corrected.

    Correct trace's memory bloat: 25x over text trace file? (220 MB -> 5 GB)
    - Problem 1: Trace was a vector in which each entry was allocated with new; each entry was a fully connected graph with vertex objects of { access-time, address, instruction, cpu }
    
    - Solution for v1: Make a vector of trace accesses, where each access has tuple fields. 
      - Savings/entry: 22 bytes + 200 string bytes (primarily in strings)
        - 12 pointers: 4 objects x 3 pointers
        - 2 bytes: 'cpuid' is uint16
        - 4 bytes: 'int rDist' is not needed
        - 2 bytes: load class is a uint16
        - 2 bytes: extra-frame-loads is uint16
        - ~50-100/bytes: string for func-name; should be id in string map
        - ~100/bytes: load module string; should be id (uint16) in string vector
    - Problem 2: Computation of footprint metrics uses intermediate data structures (primarily unique address sets) that are both copied and not properly freed. Tangled mess.
    
    - Solution for v2: Free intermediate address sets when possible
    
    - Memory usage analysis:
      - v0: orig (200 MB trace file, 4.6M access, 450 access/sample)
        bad trace format
      - v1: sane trace representation
         metrics currently a map --> vector!
         footprint address sets are both copied and not deleted --> neither!
      - v2: correct v1 problems
    
      v0: htop 6 GB; in-use bytes: 4.5 GB; allocs 402,337,069; frees: 344,264,091; total bytes 26,771,447,937
    
      v1: htop 4 GB; in-use bytes: 3.9 GB; allocs 125,180,603; frees: 57,382,481; total bytes: 8,513,580,436
      -  800 MB: trace (new) (primarily due to strings)
      - +600 MB: single and intra-sample windows (w/ metric calc data)
      - +2.6 GB: entire window tree (w/ metric calc data)
      
      v2: htop 2.2 GB; in-use bytes: 1.8 GB; allocs 128,684,796; frees: 95,264,329; total bytes: 8,746,743,884
        There are still some leaks, but this is much better
    
    
    - execution interval tree uses too much memory
      - interior nodes preserve only final summary metrics, not intermediate data
      - sample nodes have trace pointers; for any interior node, find trace
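
The v1 idea (fixed-width fields per access, with function and load-module names replaced by ids into string tables) can be illustrated with a short Python sketch. The field widths, field order, and the names `StringTable`, `CompactTrace`, and `RECORD` are assumptions chosen for illustration; they are not memgaze-analyze's actual layout.

```python
import struct

# Hypothetical record: time, address, instruction (uint64 each);
# cpuid, load class, extra-frame-loads (uint16 each);
# function-name id (uint32); load-module id (uint16). No padding ("<").
RECORD = struct.Struct("<QQQHHHIH")


class StringTable:
    """Stores each distinct string once and hands out small integer ids."""

    def __init__(self):
        self._ids = {}
        self.strings = []

    def intern(self, text):
        idx = self._ids.get(text)
        if idx is None:
            idx = len(self.strings)
            self._ids[text] = idx
            self.strings.append(text)
        return idx


class CompactTrace:
    """Trace kept as one contiguous byte buffer of fixed-size records."""

    def __init__(self):
        self.buf = bytearray()
        self.funcs = StringTable()
        self.modules = StringTable()

    def append(self, time, addr, insn, cpu, load_class, extra_loads, func, module):
        self.buf += RECORD.pack(time, addr, insn, cpu, load_class, extra_loads,
                                self.funcs.intern(func), self.modules.intern(module))

    def __len__(self):
        return len(self.buf) // RECORD.size

    def get(self, i):
        t, a, ins, cpu, lc, xl, f, m = RECORD.unpack_from(self.buf, i * RECORD.size)
        return (t, a, ins, cpu, lc, xl, self.funcs.strings[f], self.modules.strings[m])
```

Under these assumed widths each access costs `RECORD.size` bytes regardless of how long the function or module name is, and there are no per-entry allocations or pointers.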
    

MemGaze/bin-anlys: Changes from MIAMI-NW (newest first)

  • Ozgur Kilic's annotations:

    • FIXME:dyninst -> changes to use dyninst

    • FIXME:amd -> commented out for AMD instructions

    • FIXME:instruction -> changes for new instruction types

    • FIXME:BETTER -> things that can be updated with a better approach

    • FIXME:NEWBUILD -> changes made for spack build

    • FIXME:latency -> possible error for latency

    • FIXME:unkown -> I don't remember why I did that

    • FIXME:tallent -> not from Ozgur

    • FIXME:old -> not from Ozgur

    • FIXME:deprecated -> not from Ozgur

  • Ozgur Kilic's changes for Spack build:

    load_module.C:1146:11: error: ‘class BPatch’ has no member named ‘setRelocateJumpTable’
    #    bpatch.setRelocateJumpTable(true);
    #TODO FIXME I commented out that line for now
    
    #src/Scheduler/Makefile
    36 DYNINST_CXXFLAGS = \
    37         -I$(DYNINST_INC) \
    38         -I$(BOOST_INC) \
    39         -I$(TBB_INC)
    
    /files0/kili337/TestBed/memgaze/memgaze/bin-anlys/src/common/source_file_mapping_binutils.C:168:9: note: suggested alternative: ‘bfd_set_section_flags’
        if ((bfd_get_section_flags (abfd, section) & SEC_ALLOC) == 0)
    
    /files0/kili337/TestBed/memgaze/memgaze/bin-anlys/src/common/source_file_mapping_binutils.C:175:10: note: suggested alternative: ‘bfd_set_section_vma’
        vma = bfd_get_section_vma (abfd, section);
    
    /files0/kili337/TestBed/memgaze/memgaze/bin-anlys/src/common/source_file_mapping_binutils.C:176:11: note: suggested alternative: ‘bfd_set_section_size’
        size = bfd_get_section_size (section);
    
    /files0/kili337/TestBed/memgaze/memgaze/bin-anlys/src/common/source_file_mapping_binutils.C:217:19: error: ‘bfd_get_section_vma’ was not declared in this scope
        addrtype vma = bfd_get_section_vma (info->abfd, section);  
    
    src/common/InstructionDecoder-xed-iclass.h
    25 //FIXME:NEWBUILD    case XED_ICLASS_PFCPIT1:            // 3DNOW
    38 //FIXME:NEWBUILD    case XED_ICLASS_PFSQRT:             // 3DNOW 
    
  • Initial support for new Xed instructions (see MIAMI-NW/Pin 3.x support). [[TODO]] Properly model instructions, esp. LOCK instructions.

  • Initial support for decoding with either Xed or DynInst

    • SystemSpecific/x86_xed_decoding.C -> InstructionDecoder-xed.cpp
    • SystemSpecific/IB_x86_xed.C -> InstructionDecoder-xed-iclass.h

  • Simplify make system (now that PIN-based tools are gone)

    • miami.config
    • src/make.rules [removed]
    • src/Scheduler/makefile.pin -> Makefile [renamed]
  • Replace use of PIN with Xed (support newer GCCs and C++ RTTI)

    • Use argp option parser (instead of PIN's)
    • Remove tools that depend on PIN.
    • miami.config.sample
    • miami.config
    • src/make.rules
    • src/Scheduler/makefile.pin [removed]
    • src/tools/pin_config [removed]
    • src/{CFGtool, CacheSim, MemReuse, StreamSim} [removed]
  • Use DynInst to load/process binary instead of PIN. LoadModule/Routine/instructions use data from DynInst.

    • src/Scheduler/load-module.C
    • src/Scheduler/routine.C

MemGaze/bin-anlys: Translate DynInst IR to MIAMI IR

  • Changes NRT made to Xia's code to incorporate 'external4' into 'MemGaze/bin-anlys'

    • Reinstated MIAMI driver.

    • Disable Xia's code.

    • Although we will eventually discard it, we should use XED as a validation tool.

      • I updated the instruction translation to print the input instruction using XED's and DynInst's decoders. The decoding should align.

      • We should check the output translation in two ways: first with MIAMI's original XED-based translator, and then with our DynInst-based translator. I've updated the driver accordingly.

    • Xia's code is slow for a couple reasons:

      • For one routine, search through all functions: O(|functions|)...
      • For one instruction, the translation is O(|functions| * |blocks| * |insn-in-block|). This holds even for the simplest instruction (e.g., a nop) without registers.

      I removed the last term (|insn-in-block|) by fixing a decoding bug in isaXlate_getDyninstInsn(). The code had scanned the instructions in a basic block but stopped just before the requested instruction address.

    • Xia's code had static data structures. They make the code hard to understand and are unsafe for threads. Also, it turns out that some of the data structures were computed twice, once when building the CFG and again when initializing instruction translation for the routine. For example, bpatch.openBinary() was done twice on the same routine, creating two versions of BPatch_image and vector<BPatch_function*>. [create_loadModule() and isaXlate_init()]

      I consolidated the CFG and Instruction translation code to avoid this and make the code easier to understand.

      For now I have partially cleaned the way the static data is used so that, e.g., lm_func2blockMap is not computed multiple times (once for each routine).
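
The remaining lookup cost per instruction could in principle be reduced by building an address index once per load module and locating a block by binary search. The Python sketch below only illustrates that idea: `BlockIndex` and `find` are hypothetical names, blocks are assumed to be non-overlapping half-open ranges `[start, end)`, and this is not how the MemGaze code is currently organized.

```python
import bisect


class BlockIndex:
    """Built once per load module; maps an instruction address to its basic block."""

    def __init__(self, blocks):
        # blocks: iterable of (start, end, payload) with half-open [start, end)
        self._blocks = sorted(blocks, key=lambda b: b[0])
        self._starts = [b[0] for b in self._blocks]

    def find(self, addr):
        # Rightmost block whose start is <= addr, found in O(log |blocks|).
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            start, end, payload = self._blocks[i]
            if start <= addr < end:
                return payload
        return None
```

Keeping the index as a member of the load-module object, rather than as static data, would also address the thread-safety concern noted above.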

  • BUGS:

    • The DynInst Function/BasicBlock context should be part of MIAMI classes, e.g. Routine and CFG.

    • Complete translation from DynInst::Instruction -> MIAMI::Instruction.

    • Some of the static data structures are never cleared. For example, func_blockVec is never cleared. It seems this may have created an ever-expanding worklist for get_instructions_address_from_block().

    • Memory leak in Routine::decode_instructions_for_block()...

Xia's efforts on SeaPearl

  • ~huxi333/palm/trunk/external: Corresponds to MIAMI-v1 (just debug output)

    • src/CFGtool/cfgtool_dynamic.C

    • src/CFGtool/cfgtool_static.C

    • src/Scheduler/DGBuilder.C

    • src/Scheduler/MiamiDriver.C

    • src/Scheduler/load_module.C

    • src/Scheduler/routine.C

    • src/common/PrivateCFG.h

    • src/common/SystemSpecific/x86_xed_decoding.C

  • ~huxi333/palm/trunk/external4: Incorporated partially into MemGaze/bin-anlys

    • MIAMI/src/Scheduler/Report : Summary of work
    • MIAMI/src/Scheduler/dyninst_* : New files
    • Control flow changes in driver

      • src/Scheduler/MiamiDriver.C
      • src/Scheduler/load-module.C
      • src/Scheduler/routine.C
      • src/Scheduler/DGBuilder.C
    • Test for duplicates

      • src/OAUtils/BaseGraph.C
    • Replace dynamic_cast with static_cast

      • src/CFGtool/CFG.h
      • src/MemReuse/CFG.h
      • src/OAUtils/DGraph.h
      • src/Scheduler/PatternGraph.h
    • Debug output

      • src/CFGtool/cfg_data.C
      • src/CFGtool/routine.C
      • src/CFGtool/cfgtool_static.C
      • src/Scheduler/XML_output.C
      • src/Scheduler/schedtool.C
    • No effects/buggy

      • src/Scheduler/DGBuilder.h
      • src/Scheduler/SchedDG.C
  • Environment: ~huxi333/.bashrc ~huxi333/pkg

  • Others

    • ~huxi333/palm/trunk/external2: Not useful; not yet compiled.
    • ~huxi333/palm/trunk/external3: Not useful. Attempted to work on 'CFGTool'


MIAMI-NW structure

  • Open 'cfgprof' profile: MIAMI_Driver::Initialize()

  • Instruction path analysis using new basic-block profiling. (Paths are reconstructed per routine. Callsites connect inter-procedural paths.)

    Routine::myConstructPaths [routine.C] -> SchedDG::SchedDG() -> MIAMI_DG::DGBuilder(Routine) -> DGBuilder::build_graph()

    Routine::constructPaths() -> DGBuilder::computeMemoryInformationForPath() -> SchedDG::find_memory_parallelism()

  • Calls [some] to instruction decoding:

    main() [schedtool.C]
    -> MIAMI_Driver::LoadImage()
    -> LoadModule::analyzeRoutines()
    -> Routine::main_analysis()
    -> Routine::build_paths_for_interval
    -> Routine::decode_instructions_for_block()
    -> isaXlate_insn() / decode_instruction_at_pc() <==
    -> Routine::build_paths_for_interval()
    -> Routine::constructPaths()
    -> DGBuilder::DGBuilder()
    -> DGBuilder::build_graph()
    -> DGBuilder::build_node_for_instruction()
    -> isaXlate_insn() / decode_instruction_at_pc()

  • Some debug tracing:

    • src/Scheduler/debug_scheduler.h
  • XED-based decoding implementation

    • common/instruction_decoding.h
    • common/SystemSpecific/x86_xed_decoding.C
  • Scheduling...

    • Scheduler/SchedDG.{h,C}: scheduler analysis on the MIAMI IR
    • Scheduler/DGBuilder.{h,C}: dependence graph builder; extends SchedDG

    MIAMI's DGBuilder takes a CFG::Block of raw data and decodes it. (This seems to be the wrong order.)

    Gabriel: "The DGBuilder takes as input a vector of (MIAMI) CFG::Node elements (the basic blocks of the path), decodes them to the MIAMI IR, and builds register/memory/control dependencies on them. All the analysis is then done on the IR itself, so I feel that it is possible to interface with another tool if you write another DG builder that takes as input your representation of blocks and converts them to the MIAMI IR."

    Two basic approaches for interfacing with DynInst. Favor (b).

    a) Use DynInst CFG/instructions. Build interface around them to enable MIAMI dependence-graph builder.
    b) Create MIAMI CFG/instructions from DynInst CFG/instructions.

    • common/instr_info.H: information for an instruction
    • common/instr_bins.H: IB = instruction-bin
    • common/instruction_decoding.C: instruction decoding interface
    • Scheduler/GenericInstruction.h

    Any definition files for other microarchitectures?


MIAMI-NW structure (Ozgur's notes)

Dependency analysis

In routine.C:

line:1706  MIAMI_DG::DGBuilder *sch = NULL;
line:1726     sch = new MIAMI_DG::DGBuilder(this, pathId,
line:1753        MIAMI_DG::schedule_result_t res = sch->myComputeScheduleLatency()

In Scheduler/SchedDG.C:

line:10306    retValues ret =  myMinSchedulingLengthDueToDependencies(memLatency, cpuLatency);
line:11615             teit->sink()->myComputePathToLeaf()
line:11897 SchedDG::Node::myComputePathToLeaf
myComputePathToLeaf is a recursive function that visits every edge yielded by the outgoing-edge iterator.
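
The recursion pattern behind myComputePathToLeaf can be sketched in Python. The notes do not say what the function accumulates, so this sketch assumes it computes the longest latency from a node to any leaf of an acyclic dependence graph. The name `path_to_leaf`, the edge representation, and the memoization are all illustrative assumptions rather than MIAMI's code.

```python
def path_to_leaf(node, out_edges, memo=None):
    """Longest total latency from node to any leaf of an acyclic graph.

    out_edges maps a node to a list of (sink, latency) pairs.
    memo caches results so each node is expanded only once.
    """
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    best = 0  # a leaf has no outgoing edges, so its path length is 0
    for sink, latency in out_edges.get(node, ()):
        best = max(best, latency + path_to_leaf(sink, out_edges, memo))
    memo[node] = best
    return best
```

Without the cache, a node reachable along many paths is revisited once per path, which can make a recursive edge walk exponential on graphs with heavy fan-in.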

Reading CFG file

In Scheduler/MiamiDriver.C

line: 141 MIAMI_Driver::Initialize(MiamiOptions *_mo, int _pid)
line:153     fd = fopen(mo->cfg_file.c_str(), "rb");

The same fd is then passed through the functions below. The flow is as follows:

In Scheduler/schedtool.C

line: 309        MIAMI::mdriver.LoadImage()

In Scheduler/MiamiDriver.C

line: 457 MIAMI_Driver::LoadImage(
line: 516       newimg->loadFromFile(fd, false);

In Scheduler/load_module.C

line: 59 LoadModule::loadFromFile(FILE *fd, bool parse_routines)
line: 142 LoadModule::loadRoutineData(FILE *fd)
line: 185 LoadModule::loadOneRoutine(FILE *fd, uint32_t r)

In Scheduler/routine.C

line: 73 Routine::loadCFGFromFile(FILE *fd)

In Scheduler/CFG.C

line: 142 CFG::loadFromFile()
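
The pattern in this flow, one open stream consumed in order by nested loaders, can be sketched in Python. The binary layout below (a routine count, then per routine a block count followed by start/end address pairs) is invented purely for illustration and is not the real cfgprof file format; the function names only loosely mirror the C++ call chain.

```python
import io
import struct


def read_u32(f):
    """Read one little-endian uint32 and advance the shared stream."""
    return struct.unpack("<I", f.read(4))[0]


def load_cfg(f):
    """Innermost loader (cf. CFG::loadFromFile): block count, then (start, end) pairs."""
    n_blocks = read_u32(f)
    return [struct.unpack("<QQ", f.read(16)) for _ in range(n_blocks)]


def load_routine(f):
    """Per-routine loader (cf. Routine::loadCFGFromFile): delegates to load_cfg."""
    return {"cfg": load_cfg(f)}


def load_module(f):
    """Outer loader (cf. LoadModule::loadFromFile): routine count, then each routine."""
    return [load_routine(f) for _ in range(read_u32(f))]
```

Because every loader advances the same stream position, each one must read exactly the bytes that belong to it; reading too few or too many in one function corrupts everything parsed after it.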