Prototype(symbolization): Add symbolization in Pyroscope read path #3799

Open · wants to merge 58 commits into base: main

Conversation

marcsanmi
Contributor

@marcsanmi marcsanmi commented Dec 20, 2024

Context

This PR introduces a comprehensive implementation of DWARF symbolization for unsymbolized profiles in the Pyroscope read path. It enables automatic symbolization of profiles for non-customer code (primarily open-source libraries and binaries) where symbol information isn't available at collection time.

Symbolization

  • DWARF parsing: Optimized parsing of debug information with minimal memory overhead
  • Comprehensive symbol resolution: Support for function names, file paths, and line numbers
  • Inline function resolution: Proper handling of inlined functions for accurate stack traces
  • Address-based lookup: Fast address-to-symbol mapping with optimized data structures
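As a rough illustration of the address-based lookup bullet above, function ranges sorted by start address can be resolved with a binary search. This is a minimal sketch with hypothetical types; the PR's actual data structures (lidia tables) differ.

```go
package main

import (
	"fmt"
	"sort"
)

// funcRange is an illustrative stand-in for a symbol table entry:
// a half-open address range [start, end) mapped to a function name.
type funcRange struct {
	start, end uint64
	name       string
}

// resolve finds the function covering addr via binary search.
// ranges must be sorted by start address.
func resolve(ranges []funcRange, addr uint64) (string, bool) {
	// First index whose start is strictly greater than addr;
	// the candidate range is the one just before it.
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].start > addr })
	if i == 0 {
		return "", false
	}
	r := ranges[i-1]
	if addr >= r.start && addr < r.end {
		return r.name, true
	}
	return "", false
}

func main() {
	ranges := []funcRange{
		{0x1000, 0x1100, "main"},
		{0x1100, 0x1300, "runtime.mallocgc"},
	}
	name, ok := resolve(ranges, 0x1180)
	fmt.Println(name, ok) // runtime.mallocgc true
}
```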

Multi-level Caching

  • In-memory symbol cache: LRU cache for frequently accessed symbols
  • Object storage for debug files: Persistent storage of debug files with a configurable object storage backend
  • Configurable TTL: Control over cache expiration for both memory and storage caches

Integration Points

  • Read path symbolization: Automatically symbolize profiles at query time
  • Remote debug info fetching: Integration with debuginfod for symbol discovery from public servers

Configuration Example

symbolizer:
  enabled: true
  debuginfod_url: "https://debuginfod.elfutils.org"
  in_memory_symbol_cache_size: 100000         # Symbol cache in memory (entries)
  in_memory_debuginfo_cache_size: 2147483648  # Debug info cache in memory (bytes)
  persistent_debuginfo_store:                 # Debug info in persistent storage
    enabled: true
    max_age: 168h
    storage:                                  # Storage backend configuration
      backend: s3
      s3:
        bucket_name: debug-symbols-bucket
        endpoint: s3.amazonaws.com
        access_key_id: ${S3_ACCESS_KEY}
        secret_access_key: ${S3_SECRET_KEY}

Collaborator

@korniltsev korniltsev left a comment

It would be nice to have some benchmarks of symbolizing different amounts of locations and different file sizes; I think it can help us pick the right place and architecture for this.

@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch from efdde88 to 6b009d3 Compare January 16, 2025 12:15
Collaborator

@kolesnikovae kolesnikovae left a comment

Good work, Marc! I'm excited to see some experimental results 🚀

I think we can implement a slightly more optimized version for production use:

sequenceDiagram
    autonumber

    participant QF as Query Frontend
    participant M  as Metastore
    participant QB as Query Backend
    participant SYM as Symbolizer

    QF ->>+M: Query Metadata
    Note left of M: Build identifiers are returned<br> along with the metadata records
    M ->>-QF: 

    par
        QF ->>+SYM: Request for symbolication
        Note left of SYM: Prepare symbols for<br>the objects requested
    and
        QF ->>+QB: Data retrieval and aggregation
        Note left of QB: The main data path<br>Might be serverless
    end

    QB ->>-QF: Data in pprof format
    Note over QF: Because of the truncation,<br> only a limited set of locations<br>make it here (16K by default) 

    QF --)SYM: Location addresses
    
    SYM ->>-QF: Symbols
    
    QF ->>QF: Flame graph rendering

Even without a parallel pipeline and dedicated symbolication service, we could implement something like this:

sequenceDiagram
    autonumber

    participant QF as Query Frontend
    participant M  as Metastore
    participant QB as Query Backend
    participant SYM as Symbols

    QF ->>+M: Query Metadata
    Note left of M: No build identifiers are returned
    M ->>-QF: 

    QF ->>+QB: Data retrieval and aggregation
    Note left of QB: The main data path<br>Might be serverless

    QB ->>-QF: Data in pprof format
    Note over QF: Because of the truncation,<br> only a limited set of locations<br>make it here (16K by default)

    QF ->>+SYM: Fetch symbols
    SYM ->>-QF: Symbols
    Note over QF: In terms of the added latency,<br>this approach is not worse than<br>block level symbolication
    
    QF ->>QF: Flame graph rendering

I think we should avoid symbolization at the block level if the symbols are not already present in the block itself. Otherwise, this approach leads to excessive processing, increased latency, and higher resource usage. Please keep in mind that a query may span many thousands of blocks.

I won't delve too deeply into how we fetch and process ELF/DWARF files, but I strongly doubt we can bypass the need for an intermediate representation optimized for our access patterns. Additionally, we need a solution to prevent concurrent access to the debuginfod service.

@korniltsev
Collaborator

I have not looked into the code yet, but I've tried to run it locally, and it looks like it's trying to load a lot of unnecessary debug files.

I ran the eBPF profiler with no on-target symbolization, and also ran a simple python -m http.server to mock debuginfod responses.

I then queried only one executable: process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="unknown", process_executable_path="/home/korniltsev/.cache/JetBrains/IntelliJIdea2024.2/tmp/GoLand/___go_build_go_opentelemetry_io_ebpf_profiler"}

I see 268 GET requests, with 13 requests to "GET /buildid/fbce2598b34f1cf8d0c899f34c2218864e1da6d1/debuginfo HTTP/1.1" 200 - (which is the profiler binary I put into the mock server for testing), and a bunch of 404s which I assume are build IDs for the files in the other processes that the query does not target.

Other than that it works \M/ Can't wait to run it in dev.


@marcsanmi marcsanmi changed the title POC feat(symbolization): Add DWARF symbolization with debuginfod support Prototype(symbolization): Add symbolization for unsymbolized profiles in Pyroscope read path Jan 19, 2025
@marcsanmi marcsanmi changed the title Prototype(symbolization): Add symbolization for unsymbolized profiles in Pyroscope read path Prototype(symbolization): Add symbolization in Pyroscope read path Jan 19, 2025
@liaol

liaol commented Feb 18, 2025

Hi @marcsanmi, when can this PR be merged?
Thanks

@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch 2 times, most recently from 2937519 to 0b8a289 Compare February 19, 2025 17:29
@marcsanmi
Contributor Author

Hi @liaol,
It's still going to take a little while :)

@marcsanmi marcsanmi requested a review from korniltsev February 20, 2025 07:59
@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch 2 times, most recently from 7c2ab09 to 87a481c Compare March 3, 2025 15:34
@marcsanmi
Contributor Author

marcsanmi commented Mar 3, 2025

I've created this diagram to outline the current symbolization architecture:

flowchart TD
    A[SymbolizePprof] --> B{Group by Mapping}
    B --> C[Symbolize Request]
    
    C --> D{Check Symbol Cache}
    
    subgraph "Symbol Cache Layer (LRU, in-memory)"
        D -->|Cache Hit| E[Return Cached Symbols]
        D -->|Cache Miss| F
    end
    
    F{Check Debug Info Cache} 
    
    subgraph "Debug Info Cache Layer (Ristretto, in-memory)"
        F -->|Cache Hit| G[Read from Debug Info Cache]
        F -->|Cache Miss| H
    end
    
    subgraph "Persistent Storage Layer"
        H{Check Object Store}
        H -->|Cache Hit| I[Read from Object Store]
        H -->|Cache Miss| J[Fetch from Debuginfod]
    end
    
    I --> K[Store in Debug Info Cache]
    J --> L[Store in Debug Info Cache]
    J --> M[Store in Object Store]
    
    G --> N[Parse ELF/DWARF]
    K --> N
    L --> N
    
    subgraph "DWARF Resolution Layer"
        N --> O[Resolve Addresses]
        O --> P{Check Address Map}
        P -->|Map Hit| Q[Return from Map]
        P -->|Map Miss| R[Parse DWARF Data]
        R --> S[Build Lookup Tables]
        S --> T[Store in Address Map]
        T --> U[Return Symbols]
        Q --> U
    end
    
    U --> V[Update Symbol Cache]
    V --> W[Return to Caller]
    E --> W

@kolesnikovae
Collaborator

kolesnikovae commented Mar 4, 2025

I might be missing some details, but I have doubts about the cache hierarchy.

Now it looks like we have: symbols_cache -> object_store -> in_memory_object_store (ristretto) -> debuginfod.

As far as I understand, we're going to read from object_store even if there's just a single unresolved address.

I expected to see: symbols_cache -> in_memory_object_store (ristretto) -> object_store -> debuginfod.

Could you please elaborate on the decision?

@marcsanmi
Contributor Author

I might be missing some details, but I have doubts about the cache hierarchy.

You're right @kolesnikovae. I've just realized the problem is that the ristretto cache is coupled inside the debuginfod client. I'll decouple it and place it at the symbolizer level. That way, we'll be able to have the following path:

symbols_cache -> in_memory_object_store (ristretto) -> object_store -> debuginfod
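The tiered lookup above can be sketched as a simple fall-through chain: each tier is consulted in order and the first hit wins. Types and tier contents here are illustrative, not the PR's real code.

```go
package main

import (
	"errors"
	"fmt"
)

// tier is an illustrative cache level in the chain:
// symbols_cache -> in_memory_object_store -> object_store -> debuginfod.
type tier struct {
	name string
	get  func(buildID string) ([]byte, error)
}

var errMiss = errors.New("cache miss")

// fetchDebugInfo walks the tiers in order and returns the first hit,
// along with the name of the tier that served it.
func fetchDebugInfo(tiers []tier, buildID string) (string, []byte, error) {
	for _, t := range tiers {
		if data, err := t.get(buildID); err == nil {
			return t.name, data, nil
		}
	}
	return "", nil, fmt.Errorf("build ID %s not found in any tier", buildID)
}

func main() {
	tiers := []tier{
		{"symbols_cache", func(string) ([]byte, error) { return nil, errMiss }},
		{"in_memory_object_store", func(string) ([]byte, error) { return nil, errMiss }},
		{"object_store", func(string) ([]byte, error) { return []byte("debug data"), nil }},
		{"debuginfod", func(string) ([]byte, error) { return []byte("debug data"), nil }},
	}
	src, _, err := fetchDebugInfo(tiers, "fbce2598")
	fmt.Println(src, err) // object_store <nil>
}
```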

@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch 2 times, most recently from fd39d9e to 65fb599 Compare March 5, 2025 15:49
@marcsanmi marcsanmi requested a review from kolesnikovae March 5, 2025 16:52
@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch from 65fb599 to 10f52ea Compare March 17, 2025 08:52
@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch from 829f81e to e66f976 Compare May 8, 2025 06:19
@marcsanmi marcsanmi requested a review from korniltsev May 8, 2025 12:07
start := time.Now()
status := statusSuccess
defer func() {
s.metrics.profileSymbolization.WithLabelValues(status).Observe(time.Since(start).Seconds())
Collaborator

This metric is touched twice: once for the whole pprof and once for the per-mapping request. I suggest we remove the latter or create a separate metric for it.

statusCode, isHTTPError := isHTTPStatusError(err)

if errors.As(err, &bnfErr) || (isHTTPError && statusCode == http.StatusNotFound) {
s.metrics.debuginfodRequestDuration.WithLabelValues(statusErrorNotFound).Observe(time.Since(debuginfodStart).Seconds())
Collaborator

Please take another look at this metric; something is off here. It looks like error statuses are counted multiple times, in the DebugInfodClient and in the symbolizer. Can we remove all usages of the debuginfod metrics from the symbolizer?

debugReader, err := s.fetchFromDebuginfod(ctx, buildID)
if err != nil {
var bnfErr buildIDNotFoundError
if errors.As(err, &bnfErr) {
Collaborator

What is the purpose of this type assertion? Both branches are the same and return err. Is this needed? Can we just return err?

lidiaBytes, err := s.getLidiaBytes(ctx, req.buildID)
if err != nil {
var bnfErr buildIDNotFoundError
if errors.As(err, &bnfErr) {
Collaborator

I suggest we do not fail the query if there are any errors during symbolization. Let's treat all the errors the same - createNotFoundSymbolz

lidiaReader := NewReaderAtCloser(lidiaBytes)
table, err = lidia.OpenReader(lidiaReader, lidia.WithCRC())
if err != nil {
s.metrics.debugSymbolResolution.WithLabelValues("lidia_error").Observe(0)
Collaborator

I suggest we do not fail the query if there are any errors during symbolization. Let's treat all the errors the same - createNotFoundSymbolz

}

for mappingID, locations := range locationsByMapping {
if err := s.symbolizeLocationsForMapping(ctx, profile, mappingID, locations); err != nil {
Collaborator

I suggest we do not fail the query if there are any errors during symbolization. Let's treat all the errors the same - createNotFoundSymbolz

type location struct {
address uint64
lines []lidia.SourceInfoFrame
mapping *pprof.Mapping
Collaborator

This field is unused; let's remove it.

pprof "github.com/google/pprof/profile"
)

type locToSymbolize struct {
Collaborator

I think we can replace locToSymbolize with just *googlev1.Location, because loc.Id == idx + 1.


var decompressed bytes.Buffer
if _, err := decompressed.ReadFrom(gr); err != nil {
gr.Close()
Collaborator

Can we use defer gr.Close()?


var decompressed bytes.Buffer
if _, err := decompressed.ReadFrom(zr); err != nil {
zr.Close()
Collaborator

Can we use defer zr.Close()?

// detectCompression reads the beginning of the input to determine if it's compressed,
// and if so, returns a ReaderAt that decompresses the data.
func detectCompression(r io.Reader) (io.ReaderAt, error) {
br := bufio.NewReader(r)
Collaborator

Let's accept []byte as an argument to this function and not use bufio; we already have everything in memory, so there's no need to buffer. Let's also return []byte here instead of ReaderAt; it makes it a bit more visible that we are reading and decompressing everything fully in memory.

return lidiaBytes, nil
}

level.Error(s.logger).Log(
Collaborator

nit: this is not an error if it is "not found"

}

// updateProfileWithSymbols updates the profile with symbolization results
func (s *Symbolizer) updateProfileWithSymbols(profile *googlev1.Profile, mapping *googlev1.Mapping, locs []locToSymbolize, symLocs []*location) {
Collaborator

Can we reorganize the code a bit to call this function once per pprof instead of once per mapping? Otherwise we create all the maps multiple times.

Id: funcID,
Name: nameIdx,
Filename: filenameIdx,
StartLine: int64(line.LineNumber),
Collaborator

Let's remove this line. On the left we have the start of the function, and on the right we have the middle of the function. Also, the line number is not included in the key, and we don't really have line numbers at the moment.

Collaborator

We also don't really have filenames yet. I suggest we remove filename here as well.

nameIdx, filenameIdx int64
}
funcMap := make(map[funcKey]uint64)
maxFuncID := uint64(0)
Collaborator

Can we do maxFuncId := len(profile.Function) + 1 here? I know I told you some time ago that it's not a valid approach in general, but it is valid for pprof files returned from our pprof queries, so it should be fine?
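The suggestion, sketched with minimal stand-in types (the real ones come from the generated googlev1 / google/pprof packages): when function IDs are dense and 1-based, as in pprof files returned from these queries, the next free ID is simply the slice length plus one, with no need to scan for the maximum.

```go
package main

import "fmt"

// Illustrative stand-ins for the pprof profile types.
type Function struct{ ID uint64 }
type Profile struct{ Function []*Function }

// nextFuncID assumes IDs are dense and 1-based, so the next
// free ID is len+1 rather than max(ID)+1.
func nextFuncID(p *Profile) uint64 {
	return uint64(len(p.Function)) + 1
}

func main() {
	p := &Profile{Function: []*Function{{ID: 1}, {ID: 2}, {ID: 3}}}
	fmt.Println(nextFuncID(p)) // 4
}
```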


// fetchLidiaFromObjectStore retrieves Lidia data from the object store
func (s *Symbolizer) fetchLidiaFromObjectStore(ctx context.Context, buildID string) ([]byte, error) {
objstoreReader, err := s.store.Get(ctx, buildID)
Collaborator

Can we avoid store.Get invocations if we already know debuginfod returned NotFound? It would probably require pulling the cache from debuginfod into the symbolizer.

return false
}

return len(slices.Collect(metadata.FindDatasets(block, matcher))) > 0
Collaborator

@korniltsev korniltsev May 9, 2025

@aleks-p Can you help me verify my assumption here please.

  1. Blocks contain multiple datasets and so a block may have both unsymbolized and symbolized datasets.
  2. I believe the MetadataQueryService.QueryMetadata does not return all the datasets present in the block, but only the ones that matched the query. I've drawn this conclusion from reading this code:
    if matches, ok = m.CollectMatches(matches, ds.Labels); ok {

If my understanding is correct, then this hasUnsymbolizedProfiles check is fine. If the second assumption is not correct, then this check would degrade queries for symbolized datasets.

I think everything is good here, just want another 👀 to double-check.
