Is it possible to return raw file offsets from within the tar? #162

Open
jeroen opened this issue Jan 4, 2024 · 6 comments · May be fixed by #170

Comments

@jeroen

jeroen commented Jan 4, 2024

I would like to generate an index of a tar file with the start and end offset of each file in the tarball, such that I can mmap or extract a single file later on. Is this possible with tar-stream?

The documentation of headers only mentions the size of each file, but I would also need the offset within the tar.

From hacking it looks like the global property extract._buffer.shifted contains what I need but this is mostly a guess. It would be nice if the header callback could include the offset property for each entry.
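For background, a tar archive is a sequence of 512-byte blocks: each entry has a header block followed by its data, padded up to the next block boundary. The sketch below derives offsets from sizes alone. The function name `offsetsFromSizes` is made up for illustration, and it assumes every entry has exactly one header block (no pax or GNU extension records), which does not hold for every archive. That is why an offset reported by the parser itself would be more reliable.

```javascript
// Sketch: derive data offsets for a plain tar archive from entry sizes.
// ASSUMPTION: one 512-byte header block per entry, no extension records.
const BLOCK = 512

function offsetsFromSizes (entries) {
  let pos = 0 // position of the next header block
  return entries.map(({ name, size }) => {
    const start = pos + BLOCK // data begins right after the header block
    const end = start + size  // exclusive end of the entry's bytes
    // Data is padded with zeros up to the next 512-byte boundary
    pos = start + Math.ceil(size / BLOCK) * BLOCK
    return { name, start, end }
  })
}

console.log(offsetsFromSizes([
  { name: 'a.txt', size: 10 },
  { name: 'b.bin', size: 1024 },
  { name: 'empty', size: 0 }
]))
```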

@jeroen
Author

jeroen commented Jan 4, 2024

Here is what I puzzled together. Does this seem right? Is there a more efficient way to do this:

const fs = require('fs')
const tar = require('tar-stream')
const gunzip = require('gunzip-maybe');

function tar_index(path){
  const input = fs.createReadStream(path);
  let output = [];
  return new Promise(function(resolve, reject) {

    function process_entry(header, stream, next) {
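      // _buffer.shifted is a private, undocumented counter; assumed here to be
      // the number of bytes consumed so far, i.e. where this entry's data starts
      var offset = extract._buffer.shifted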
      output.push({
        name: header.name,
        offset: offset,
        size: header.size
      });
      stream.on('end', function () {
        next() // ready for next file
      })
      stream.on('error', reject);
      stream.resume();
    }

    function finish_stream(){
      resolve(output);
    }

    var extract = tar.extract({allowUnknownFormat: true})
      .on('entry', process_entry).on('finish', finish_stream).on('error', reject)
    input.pipe(gunzip()).pipe(extract);
  }).finally(function(){
    input.destroy();
  });
}

tar_index('myfile.tar.gz').then(console.log)
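One caveat about using such an index later: because `gunzip()` sits in front of `extract`, the offsets refer to positions in the decompressed tar stream, not in the `.tar.gz` file on disk. A rough sketch of reading a single entry back by offset, assuming the archive has already been decompressed to a plain `.tar` file, could look like this. The helper name `readRange` is hypothetical, and the demo uses a small stand-in file rather than a real tarball.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Read `size` bytes starting at byte `offset` from an uncompressed file.
function readRange (file, offset, size) {
  const fd = fs.openSync(file, 'r')
  try {
    const buf = Buffer.alloc(size)
    let done = 0
    // readSync may return fewer bytes than requested, so loop until full
    while (done < size) {
      const n = fs.readSync(fd, buf, done, size - done, offset + done)
      if (n === 0) throw new Error('unexpected end of file')
      done += n
    }
    return buf
  } finally {
    fs.closeSync(fd)
  }
}

// Demo on a stand-in file: 8 bytes of "header", then 11 bytes of "data"
const demo = path.join(os.tmpdir(), `range-demo-${process.pid}.bin`)
fs.writeFileSync(demo, 'HEADER--hello world--PADDING')
console.log(readRange(demo, 8, 11).toString())
fs.unlinkSync(demo)
```

With an index entry `{ name, offset, size }` from `tar_index`, the call would be `readRange('myfile.tar', entry.offset, entry.size)`.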

@mafintosh
Owner

Should be easy to add yea. Feel free to PR that

@jeroen
Author

jeroen commented Aug 7, 2024

@mafintosh sorry to ping you again, I am trying to understand this code.

Am I correct that the extract._buffer.shifted global state contains the total offset within the tar file of where the file starts, at the time the entry callback is triggered?

@mafintosh
Owner

I would track it independently, seems much simpler. i.e. a property that tracks byteOffset that is updated every time we eat from the buffer. Then add that to the header object we emit
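The idea can be illustrated with a toy buffer queue. This is only a sketch of the pattern, not tar-stream's code: `CountingQueue`, `eat`, and the shape of the `header` object are hypothetical names chosen for the example.

```javascript
// Hypothetical stand-in for a parser's internal buffer queue.
class CountingQueue {
  constructor () {
    this.chunks = []
    this.byteOffset = 0 // total bytes handed to the parser so far
  }

  push (chunk) {
    this.chunks.push(chunk)
  }

  // Consume up to `size` bytes from the front chunk and advance the counter.
  eat (size) {
    const head = this.chunks[0]
    if (!head) return null
    const out = head.subarray(0, size)
    if (out.byteLength === head.byteLength) this.chunks.shift()
    else this.chunks[0] = head.subarray(size)
    this.byteOffset += out.byteLength
    return out
  }
}

const q = new CountingQueue()
q.push(Buffer.alloc(512))        // pretend header block
q.push(Buffer.from('file data')) // pretend entry body

q.eat(512) // the parser consumes the header block
// Snapshot the counter onto the header before emitting it
const header = { name: 'a.txt', size: 9, byteOffset: q.byteOffset }
console.log(header)
```

Copying the counter onto the header at emit time means consumers get a stable value per entry and do not need to read private parser state from inside the callback.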

@jeroen
Author

jeroen commented Aug 7, 2024

I'll give that another try.

FWIW, the goal is to mmap the tar file in WASM using the emscripten packaging format. So we need the server to generate an index for the tar that looks like so: https://jeroen.r-universe.dev/bin/emscripten/contrib/4.4/jsonlite_1.8.9.js.metadata
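As a sketch of that last step, index entries of the form `{ name, offset, size }` could be mapped to a metadata object. The field names below (`files`, `filename`, `start`, `end`, `remote_package_size`) and the leading slash on `filename` are assumptions about the emscripten metadata layout and should be checked against the linked example. `toMetadata` is a made-up helper name.

```javascript
// Sketch: convert a tar index into an emscripten-style metadata object.
// ASSUMPTION: the field names and path convention below match what the
// emscripten loader expects; verify against a real .js.metadata file.
function toMetadata (index, packageSize) {
  return {
    files: index.map(({ name, offset, size }) => ({
      filename: '/' + name.replace(/^\/+/, ''), // assumed absolute path
      start: offset,                            // first byte of the file data
      end: offset + size                        // exclusive end
    })),
    remote_package_size: packageSize
  }
}

const metadata = toMetadata(
  [{ name: 'pkg/DESCRIPTION', offset: 512, size: 300 }],
  10240
)
console.log(JSON.stringify(metadata))
```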

@jeroen
Author

jeroen commented Aug 7, 2024

I would track it independently, seems much simpler. i.e. a property that tracks byteOffset that is updated every time we eat from the buffer. Then add that to the header object we emit

Are you suggesting we only need to add a line such as byteOffset += size; to _next()? I am trying to understand how this byteOffset would differ from the existing shifted.

tar-stream/extract.js

Lines 44 to 62 in 126968f

  _next (size) {
    const buf = this.queue.peek()
    const rem = buf.byteLength - this._offset
    if (size >= rem) {
      const sub = this._offset ? buf.subarray(this._offset, buf.byteLength) : buf
      this.queue.shift()
      this._offset = 0
      this.buffered -= rem
      this.shifted += rem
      return sub
    }
    this.buffered -= size
    this.shifted += size
    return buf.subarray(this._offset, (this._offset += size))
  }
}
