Performance improvement for transposing data #2

Open · kylebarron opened this issue Oct 31, 2018 · 46 comments

@kylebarron
Contributor

There's a tutorial about this on the Arrow C++ documentation:
https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html

From Arrow to row-wise is the second half of the document.
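For anyone skimming this later, here's a minimal Python/pyarrow sketch of the same column-to-row idea that tutorial describes (the plugin itself is C++; the toy table and column names below are made up):

```python
import pyarrow as pa

# Toy table standing in for data read from a Parquet file.
table = pa.table({
    "id": [1, 2, 3],
    "price": [9.5, 3.25, 7.0],
    "name": ["a", "b", "c"],
})

# Pull each column out once, then walk the rows. This is the
# column-to-row transposition the Arrow tutorial is talking about.
columns = [col.to_pylist() for col in table.columns]
for row in zip(*columns):
    print(row)  # one tuple per row: (id, price, name)
```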

@mcaceresb
Owner

I think the trick will be to loop through Stata row-wise. At the moment I loop through Stata column-wise. It sounds like the performance loss in Arrow will be smaller than the loss in Stata.

@kylebarron
Contributor Author

The Stata methods aren't threadsafe, right? So it has to be single threaded?

@mcaceresb
Owner

I think it can be multi-threaded, but I don't think it actually improved performance when I tested it (though I might not have done it right). I think what might benefit quite a bit from multi-threading is the read/write from/to parquet files, not Stata memory.

@kylebarron
Contributor Author

I wouldn't be surprised if that is multi-threaded by default

@kylebarron
Contributor Author

At least in Python, it reads Parquet files multi-threaded by default.

Sometime soon I'd like to try to go through your code more closely.
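For reference, in pyarrow the threading is controlled by `use_threads`, which defaults to `True` (placeholder file name below):

```python
import pyarrow.parquet as pq

# use_threads defaults to True, so column decoding runs on multiple cores.
table = pq.read_table("data.parquet")

# Single-threaded read, for an apples-to-apples timing comparison.
table_single = pq.read_table("data.parquet", use_threads=False)
```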

@mcaceresb
Owner

mcaceresb commented Oct 31, 2018

I think this way of structuring it might be faster:

  1. Read parquet file into Arrow table. Multi-threaded; should be fast.
  2. Arrow table into Stata memory, looping through Stata and the table in row-order. Might be slow.

Then the converse for write:

  1. Read data in memory to Arrow table looping through Stata and the table in row-order. Might be slow.
  2. Write arrow table to parquet file; should be fast.

At the moment, this reads the parquet file in column order and saves to Stata on the fly in column order as well. For writing, it reads the data in memory into a parquet table but, again, it loops through Stata in column order.
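A rough pyarrow sketch of the write path's step 2, assuming step 1 has already copied the Stata columns into plain Python lists (the variable names and file name here are made up; the actual plugin does this in C++):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Pretend these buffers were filled by looping over Stata memory (step 1).
ids = [1, 2, 3]
prices = [9.5, 3.25, 7.0]

# Step 2: wrap the buffers in an Arrow table and let pyarrow handle
# the Parquet encoding and compression.
table = pa.table({"id": ids, "price": prices})
pq.write_table(table, "out.parquet")
```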

@kylebarron
Contributor Author

Yes, I agree with all of that.

@mcaceresb
Owner

I'm doing some benchmarking. Writing 10M rows and 3 variables once the data is in an Arrow table takes 0.5s. Looping over Stata as it is at the moment also takes 0.5s.

Writing that to a .dta file takes 0.2s.

@kylebarron
Contributor Author

Even without further speed improvements, this package would be extremely helpful for anybody who uses Stata and {Python,R,Spark} (though R support for Parquet is still kinda limited), because it would mean that Stata could read binary data exported from one of those platforms.

@mcaceresb
Owner

I wonder if it's not multi-threaded.

I would like to cut processing time in half, ideally. I think that's plausible, but I doubt it can ever be faster than reading/writing .dta files directly (otherwise, what would be the point of .dta files? I imagine there is no looping over entries in that case and Stata just writes the data in bulk).

@kylebarron
Contributor Author

> I doubt it can ever be faster than reading/writing .dta files directly

You're comparing reading the entire .dta file into Stata with the entire .parquet file... That's not necessarily the right comparison. Reading the first column of the first row group in Parquet is extremely fast. Doing

use col1 in 1/1000 using file.dta

is sometimes extremely slow. I originally was frustrated because when you do

use in 1 using file.dta

it has to load the entire file just to read the first row of the data!

So if there are huge (~500GB) files that can be split into say ~20 row groups, that's something that Parquet could excel at.
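As an illustration, this is roughly what the selective read looks like in pyarrow (placeholder path and column name); only the pages for that column in that row group come off disk:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")

# Reads just col1's pages from the first row group, regardless of how
# large the whole file is.
first_chunk = pf.read_row_group(0, columns=["col1"])
print(first_chunk.num_rows)
```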

@mcaceresb
Owner

mcaceresb commented Oct 31, 2018

Nicely enough, it takes literally a third of the time (one col vs 3)

@kylebarron
Contributor Author

Haha yeah, that's been my experience as well. Generally it's linear in the number of columns you read. And since a ton of data analysis only cares about a few columns out of a dataset of 300, the columnar file format can really make a difference.

@mcaceresb
Owner

It sounds like individual columns can be chunked. I think I can only implement the solution suggested in the Apache docs if the number of chunks and each chunk size is the same for every column.

I suppose most flat data would be like that, though. Need to check first and fall back to out-of-order traversal if each column is not stored in the same way.
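In rough terms the check-and-fall-back idea might look something like this (hypothetical helper names, sketched against the Python API for brevity; the plugin would do the equivalent in C++):

```python
import pyarrow as pa

def uniform_chunking(table: pa.Table) -> bool:
    """True when every column is split into chunks with identical lengths."""
    layouts = {tuple(len(c) for c in col.chunks) for col in table.columns}
    return len(layouts) <= 1

def normalize(table: pa.Table) -> pa.Table:
    """Fallback: collapse each column into one contiguous chunk."""
    return table.combine_chunks()
```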

@kylebarron
Contributor Author

kylebarron commented Nov 1, 2018

I think that each column can only be chunked inside a row group. So if the first row group is 10,000 rows, then there won't be any chunks smaller than that for the first 10,000 rows. I'm not sure if that sentence makes sense
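You can see those boundaries in the file metadata; a quick pyarrow sketch (placeholder path):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("file.parquet").metadata
print(meta.num_row_groups)
for i in range(meta.num_row_groups):
    # Every column in row group i holds exactly this many values, and a
    # read typically hands back at least one chunk per row group.
    print(i, meta.row_group(i).num_rows)
```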

@mcaceresb
Owner

I think a row group is only relevant when reading the file from disk, not when iterating over the table already in memory.

@mcaceresb
Owner

I've been trying this out on the server on modestly large data that I've been using for a project (a few GiB) and the compression is amazing! Performance for traversing several variables in Stata in column order is pretty poor, though, especially if there are a ton of strings.

I won't spend any time optimizing the row vs column-order thing until we figure out how the Java version fares, but it's pretty cool to see a fairly complicated 21GiB file come down to 5GiB.

@kylebarron
Contributor Author

Yes, the compression is amazing. Using Parquet files with something like Dask or Spark makes it feasible to do computation on 20 GB files on a laptop.

@kylebarron
Contributor Author

Just bumping this in case you had any great discovery in the last few months.

Since you're still on my DUA, can you try this command:

parquet use /disk/agebulk3/medicare.work/doyle-DUA51929/barronk-dua51929/raw/pq_from_spark/100pct/med/med2014.parquet

That parquet directory is 2.6GB, but the command has been running for 11 minutes and hasn't finished...

It would be really awesome if you had some way of making a progress bar.

@mcaceresb
Owner

Yup. Had this in the back of my mind. Don't think it'd take too long. Format ideas?

Reading [###        ] X% (obs i / N; group r / R)

?

@kylebarron
Contributor Author

Yeah that seems great

@mcaceresb
Owner

linesize is a problem ):

@kylebarron
Contributor Author

why?

@mcaceresb
Owner

When I try to print the timer, it often gets broken up by the linesize, so it looks all wrong.

I also can't get the formatting to work right. In my tests Stata prints at the end of the program, not as the program is executing. I suspect it's waiting for a newline...

@kylebarron
Contributor Author

Weird. Btw did you try this?

parquet use /disk/agebulk3/medicare.work/doyle-DUA51929/barronk-dua51929/raw/pq_from_spark/100pct/med/med2014.parquet

@mcaceresb
Owner

I'm just going to have it report every 30 seconds or something like that.

@mcaceresb
Owner

parquet desc seems to be working. Takes 3 seconds for me, so that's nice. Allocating the memory for the target data has been running for a minute or so, tho. It hasn't even started importing the data...

@kylebarron
Contributor Author

It ended up taking 26.5 minutes for me:
[screenshot omitted]

@mcaceresb
Owner

Wow. So, basically, it's Stata that choked? Mmm... Is there a way to add 381 variables that doesn't take 23 minutes? I think it might be somewhat faster if I tell Mata not to initialize the variables (or maybe Mata performance deteriorates with this many variables? I tested it and it was faster than looping gen...)

@kylebarron
Contributor Author

I don't know... This was basically the first real-world file I've tried to read into Stata with this

@mcaceresb
Owner

Ah! I have it. Maybe it's faster to allocate 1 observation and then set obs to _N.

@kylebarron
Contributor Author

Weird

@mcaceresb
Owner

Great! With 20M obs, allocating 8 variables and then the observations takes 0.5s, vs 2.5s the way parquet currently does it.

@kylebarron
Contributor Author

Great. Let me know when to test

@mcaceresb
Owner

        Reading: 6.0%, 30.0s (rg 9 / 127 > col 3 / 381 > obs 1055035 / 16482565)
        Reading: 12.0%, 60.0s (rg 17 / 127 > col 46 / 381 > obs 2173492 / 16482565)
        Reading: 19.0%, 90.0s (rg 25 / 127 > col 373 / 381 > obs 3195623 / 16482565)
        Reading: 25.0%, 120.0s (rg 33 / 127 > col 317 / 381 > obs 4297575 / 16482565)
        Reading: 31.0%, 150.0s (rg 42 / 127 > col 35 / 381 > obs 5459327 / 16482565)
        Reading: 39.0%, 180.0s (rg 51 / 127 > col 2 / 381 > obs 6607437 / 16482565)
        Reading: 45.0%, 210.0s (rg 59 / 127 > col 122 / 381 > obs 7580063 / 16482565)
        Reading: 51.0%, 240.0s (rg 67 / 127 > col 102 / 381 > obs 8630077 / 16482565)
        Reading: 56.0%, 270.0s (rg 73 / 127 > col 129 / 381 > obs 9401371 / 16482565)
        Reading: 60.0%, 300.0s (rg 78 / 127 > col 104 / 381 > obs 10130308 / 16482565)
        Reading: 63.0%, 330.0s (rg 83 / 127 > col 108 / 381 > obs 10823961 / 16482565)
        Reading: 68.0%, 360.0s (rg 88 / 127 > col 176 / 381 > obs 11459102 / 16482565)
        Reading: 72.0%, 390.0s (rg 93 / 127 > col 337 / 381 > obs 12099446 / 16482565)
        Reading: 76.0%, 420.0s (rg 98 / 127 > col 372 / 381 > obs 12680918 / 16482565)
        Reading: 80.0%, 450.0s (rg 103 / 127 > col 372 / 381 > obs 13385985 / 16482565)
        Reading: 84.0%, 480.0s (rg 109 / 127 > col 39 / 381 > obs 14202073 / 16482565)
        Reading: 88.0%, 510.0s (rg 114 / 127 > col 61 / 381 > obs 14767487 / 16482565)
        Reading: 92.0%, 540.0s (rg 119 / 127 > col 68 / 381 > obs 15451973 / 16482565)
        Reading: 97.0%, 570.0s (rg 125 / 127 > col 218 / 381 > obs 16243190 / 16482565)
Warning: 2417652156 NaN values in string variables coerced to blanks ('').
Read data from disk (576.97 sec).
r; t=627.12 14:41:53

10 minutes for me. I wonder if the progress code has slowed down the plugin? Although I do have someone else using 8 cores on the server I'm on... While it does print mid-execution, it appears that Stata doesn't print it right away anyway ): Since it might be pointless, should I just take it out?

@kylebarron
Contributor Author

How long is the delay until it's printed?

@mcaceresb
Owner

It seems to print every 4 lines or something like that.

@mcaceresb
Owner

I can push what I have for you to test if you want?

@kylebarron
Contributor Author

Sure

@mcaceresb
Owner

Up.

@kylebarron
Contributor Author

When printing size on disk / size in memory, I'd recommend a human-readable number instead of millions or billions of bytes.
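Something along these lines would do it (a hypothetical helper, not anything in the plugin):

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count with binary units (KiB, MiB, ...)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_size(21_474_836_480))  # 20.0 GiB
```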

@kylebarron
Contributor Author

How does this compare time-wise to Stata reading a 20GB file?

@mcaceresb
Owner

Stata's not that slow reading an entire dataset, afaik. Not sure how that compares to native parquet performance, tho (16GiB)

.  use /disk/aging/medicare/data/100pct/med/2014/med2014.dta, clear
r; t=23.95 15:18:28

@kylebarron
Contributor Author

So it's still 10 minutes for parquet vs 24 seconds for Stata format?

@mcaceresb
Owner

Took 3 min in Python:

>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> from time import time
>>> 
>>> f = '/disk/agebulk3/medicare.work/doyle-DUA51929/barronk-dua51929/raw/pq_from_spark/100pct/med/med2014.parquet/'
>>> t = time()
>>> dataset = pq.ParquetDataset(f)
>>> print(time() - t)
1.6530742645263672
>>> table   = dataset.read()
>>> print(time() - t)
68.58256483078003
>>> df      = table.to_pandas()
>>> print(time() - t)
168.9156460762024

It took me 10 min, but it took you 4 min (and the rest was the inefficient Stata memory alloc). I don't think performance is that bad, all things considered, but it could definitely be better. Besides, the point is to benchmark reading only a subset of all columns, right?

@kylebarron
Contributor Author

Pyarrow is also multithreaded, I'm pretty sure. But yes, most of the time a user would only be loading a subset of columns.
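For completeness, a column-subset read in pyarrow looks like this (placeholder path and column names); the cost scales with the columns you ask for rather than the whole file:

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "file.parquet",            # placeholder path
    columns=["col1", "col2"],  # only these columns are decoded
)
df = table.to_pandas()
```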
