nondeterministic behavior when fed lots of files quickly #8

rdpoor · 2015-05-08T10:00:51Z

When I call `pdf_text_extract()` with one file at a time, all works well. But when I hit it with a directory of pdf files (still one call per file, just in quick succession), I get non-deterministic behavior where the resulting text is truncated.

Following is an example. Notice that calling reportOneFile(...) correctly processes and reports on individual files. But reportDirectory(...) reports that the length of the processed text is either 0, 8191, or (once in a while) the correct length. Subsequent calls to reportDirectory() produce different results:

$ node
> var TestParser = require('./lib/test-parser')
undefined
> tp = new TestParser()
{}
> tp.reportOneFile('test/pdf-test/0908050036_91406141321_20140625.pdf')
undefined
> 8005 'test/pdf-test/0908050036_91406141321_20140625.pdf'
> tp.reportOneFile('test/pdf-test/0908050036_91405131615_20140522.pdf')
undefined
> 7999 'test/pdf-test/0908050036_91405131615_20140522.pdf'
> tp.reportDirectory('test/pdf-test')
undefined
> 0 'test/pdf-test/0908050036_91406141321_20140625.pdf'
0 'test/pdf-test/0908050036_91405131615_20140522.pdf'
0 'test/pdf-test/0908050036_91404126594_20140423.pdf'
0 'test/pdf-test/0908050036_91401135121_20140123.pdf'
0 'test/pdf-test/0908050036_91312128215_20131226.pdf'
0 'test/pdf-test/0908050036_91311122264_20131120.pdf'
0 'test/pdf-test/0908050036_9131064218_20131011.pdf'
8191 'test/pdf-test/0908050036_91503129909_20150323.pdf'
8191 'test/pdf-test/0908050036_91502129494_20150220.pdf'
8191 'test/pdf-test/0908050036_91501125167_20150122.pdf'
8191 'test/pdf-test/0908050036_91412121290_20141222.pdf'
8191 'test/pdf-test/0908050036_91411119510_20141120.pdf'
8191 'test/pdf-test/0908050036_91410143534_20141022.pdf'
8191 'test/pdf-test/0908050036_91409124480_20140923.pdf'
8191 'test/pdf-test/0908050036_91408133729_20140822.pdf'
8191 'test/pdf-test/0908050036_91407135090_20140724.pdf'
8191 'test/pdf-test/0908050036_91403128149_20140321.pdf'
8191 'test/pdf-test/0908050036_91402117758_20140220.pdf'
0 'test/pdf-test/0908050036_91310129406_20131022.pdf'
8876 'test/pdf-test/0908050036_91308125534_20130822.pdf'
8122 'test/pdf-test/0908050036_91307116578_20130724.pdf'
8128 'test/pdf-test/0908050036_91306120163_20130624.pdf'
8122 'test/pdf-test/0908050036_91305124314_20130523.pdf'
8122 'test/pdf-test/0908050036_91304127419_20130424.pdf'

undefined

Here is the code in its entirety:

// File: test-parser.js
"use strict";

var fs = require('fs');
var path = require('path');  // needed by path.extname() in extractDirectory()

function TestParser() {
}

// Extract the text in a pdf file as a single page.
TestParser.prototype.extractFile = function(pdf_filename, callback) {
    var pdf_text_extract = require('pdf-text-extract')
    pdf_text_extract(pdf_filename, function extractionComplete(err, pages) {
        if (err) {
            callback(err)
            return;
        }
        callback(null, pages.join(''), pdf_filename)
    });
}

// Call extractFile() on each file named *.pdf in the given directory.
TestParser.prototype.extractDirectory = function(directory, callback) {
    var pdfs = fs.readdirSync(directory).filter(function(element) {
        return path.extname(element) === '.pdf'
    });
    for(var i=pdfs.length-1; i>=0; --i) {
        var filename = directory + "/" + pdfs[i];
        this.extractFile(filename, callback);
    }
}

TestParser.prototype.reportOneFile = function(pdf_filename) {
    this.extractFile(pdf_filename, this.reportLength);
}

TestParser.prototype.reportDirectory = function(directory) {
    this.extractDirectory(directory, this.reportLength);
}

TestParser.prototype.reportLength = function(err, text, filename) {
    if (err) {
        console.error(err);
        return;
    }
    console.error(text.length, filename);
}

module.exports = TestParser;

P.S.: Regrettably I cannot post the .pdfs themselves in a gist as they contain personal customer data. But I've verified this fails on other directories full of pdfs. If you want a gist, let me know.


tpreusse · 2015-05-21T13:54:50Z

Most likely related to the parallel extraction problems: #7

Although you trigger one file after another, the extractions still run asynchronously in parallel:
`this.extractFile(filename, callback);` spawns a child process and returns immediately, without waiting for the extraction to finish.

Try:

npm install git+ssh://[email protected]:tpreusse/pdf-text-extract.git#fix-parallel
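
A caller-side workaround is to start each extraction only after the previous one has finished. The following is a minimal sketch, not part of pdf-text-extract: `extractSequentially` is a hypothetical helper, and it assumes only that the extractor takes `(filename, callback)` with a Node-style `callback(err, text)`, as `extractFile` in the report does.

```javascript
"use strict";

// Hypothetical helper: run `extractFile(filename, cb)` on each filename,
// strictly one at a time. `done(err, results)` is called once, either on
// the first error or after every file has been processed.
function extractSequentially(filenames, extractFile, done) {
    var results = [];
    function next(i) {
        if (i >= filenames.length) {
            done(null, results);
            return;
        }
        extractFile(filenames[i], function(err, text) {
            if (err) {
                done(err);
                return;
            }
            results.push({ filename: filenames[i], length: text.length });
            next(i + 1);  // start the next file only after this one finished
        });
    }
    next(0);
}
```

With the `TestParser` from the report, this could be driven by passing `tp.extractFile.bind(tp)` as the extractor and the list of full pdf paths as `filenames`.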

nisaacson · 2015-09-27T14:32:36Z

@rdpoor are you still seeing this issue in the latest version 1.3.1?

rdpoor · 2015-09-27T18:38:56Z

Sorry to report that I switched away from pdf-text-extract, so I cannot say whether 1.3.1 fixes the original issue. But I have provided a test case... :)

whilei · 2017-07-17T21:29:56Z

Possibly related behavior after roughly the first 600 files:

~/Documents/pa-en-export master *% ⟠ node index.js
/Users/ia/Documents/pa-en-export/node_modules/pdf-text-extract/index.js:120
  child.stdout.setEncoding('utf8')
              ^

TypeError: Cannot read property 'setEncoding' of undefined
    at streamResults (/Users/ia/Documents/pa-en-export/node_modules/pdf-text-extract/index.js:120:15)
    at pdfTextExtract (/Users/ia/Documents/pa-en-export/node_modules/pdf-text-extract/index.js:109:5)
    at files.forEach.file (/Users/ia/Documents/pa-en-export/index.js:21:9)
    at Array.forEach (native)
    at fs.readdir (/Users/ia/Documents/pa-en-export/index.js:18:11)
    at FSReqWrap.oncomplete (fs.js:123:15)
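
The trace shows `child.stdout` being undefined inside pdf-text-extract when it is called from `files.forEach`. One possible cause, which is an assumption and not confirmed in this thread, is that starting hundreds of child processes at once exhausts a system resource, so the child's stdio streams are never set up. If that is what happens, capping the number of extractions in flight may avoid it. A minimal sketch, where `eachLimit` is a hypothetical helper and `worker(item, cb)` stands in for any Node-style wrapper around the extractor:

```javascript
"use strict";

// Hypothetical helper: call `worker(item, cb)` for every item, keeping at
// most `limit` calls in flight (limit must be >= 1). `done(err)` is called
// exactly once: on the first error, or after all items have completed.
function eachLimit(items, limit, worker, done) {
    var nextIndex = 0;
    var active = 0;
    var finished = false;

    function launch() {
        if (finished) {
            return;
        }
        if (nextIndex >= items.length && active === 0) {
            finished = true;
            done(null);
            return;
        }
        // Fill free slots until the limit is reached or items run out.
        while (!finished && active < limit && nextIndex < items.length) {
            active++;
            worker(items[nextIndex++], function(err) {
                active--;
                if (finished) {
                    return;
                }
                if (err) {
                    finished = true;
                    done(err);
                    return;
                }
                launch();
            });
        }
    }
    launch();
}
```

Replacing the `forEach` loop with something like `eachLimit(files, 4, extractOne, onAllDone)` would keep only a handful of child processes alive at a time; the limit of 4 is arbitrary.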
