printing of large dataset fails #431

Open
behrica opened this issue Oct 28, 2024 · 2 comments

Comments

@behrica
Contributor

behrica commented Oct 28, 2024

I made a big dataset of shape:
[4 3724776]

and printing even a small part of it in the REPL fails:

(ds/head ds)
; Error printing return value (ArithmeticException) at java.lang.Math/toIntExact (Math.java:1074).
; integer overflow
clj꞉text-perf꞉> 
clojure.main/repl (main.clj:442)
clojure.main/repl (main.clj:459)
clojure.main/repl (main.clj:368)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:84)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:56)
nrepl.middleware.interruptible-eval/interruptible-eval (interruptible_eval.clj:152)
nrepl.middleware.session/session-exec (session.clj:218)
nrepl.middleware.session/session-exec (session.clj:217)
java.lang.Thread/run (Thread.java:829)
@behrica
Contributor Author

behrica commented Oct 28, 2024

Maybe I should mention that the dataset is really big: it has about 3.7 million rows of text,
roughly 8 GB in total.

I loaded it successfully using the "mmap" feature, so by doing:

(ds/->dataset "bigdata/repeatedAbstrcats_3.7m_.txt"
              {:text-temp-dir "/tmp/xxx"
               :file-type :tsv
               :header-row? false})

@behrica
Contributor Author

behrica commented Oct 28, 2024

Restricting it to 1 million rows made it work.
In this case the mmap file on disk was about 2 GB.

I have worked with TMD datasets with far more rows before, so the row count is not the problem.

Not sure whether it is the mmap support or the strings being printed.

It seems to be a printing-related issue. I can do operations on the dataset, just not print any result of them.
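
For reference, the exception text in the trace is what `java.lang.Math/toIntExact` produces whenever a `long` does not fit in a 32-bit `int`. The sketch below reproduces only that JVM behavior; it uses no tech.ml.dataset calls, the helper name `narrow-to-int` is made up for illustration, and it does not show where in the printing path the narrowing happens. That a roughly 2 GB mmap file printed fine while the roughly 8 GB one did not is consistent with a byte offset or length crossing `Integer/MAX_VALUE` (2147483647), but that is an assumption, not something confirmed here.

```clojure
;; Illustration only: plain JVM interop, no tech.ml.dataset calls.
;; `narrow-to-int` is a hypothetical helper, not part of any library.
(defn narrow-to-int
  "Returns [:ok n] when the long fits in an int, else [:overflow message]."
  [^long n]
  (try
    [:ok (Math/toIntExact n)]
    (catch ArithmeticException e
      [:overflow (.getMessage e)])))

;; A value below Integer/MAX_VALUE narrows without error.
(narrow-to-int 2000000000)

;; One past the int range throws inside toIntExact, caught above.
(narrow-to-int (inc (long Integer/MAX_VALUE)))

;; 8 GiB expressed as a byte count, far past the int range.
(narrow-to-int (* 8 1024 1024 1024))
```

The last two calls return `[:overflow "integer overflow"]`, the same message shown in the REPL error above.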
