
Route choice additions #508

Merged
merged 36 commits into AequilibraE:route_choice from route_choice_saving on Feb 28, 2024

Conversation

Jake-Moss
Contributor

Migrated from Jake-Moss#1

I don't think the approach I currently have is the way to go but it functions as a POC. I'll revamp it tomorrow utilising the table more now that I have a better understanding.

Ideally we'd be able to

  • Apply partitioning to the data; reading one origin shouldn't require reading all the origins,
  • but we should retain locality; reading an origin should (optionally) read in all the relevant destinations.
  • And store varying destinations; not all origins will have the same destinations.

Currently, the approach I had can't be partitioned, although I think it does satisfy the other two.

To allow partitioning I think I'll have to move the origin index up to the main table; this would also let me drop the column index field and the struct entirely. But I think this would force the use of string keys for the columns in the table.
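For illustration, a minimal sketch (not the current implementation) of how hive partitioning keyed on the origin could give us those properties, assuming pyarrow's dataset API; the column names and paths are placeholders:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Illustrative schema: one row per OD pair, a route stored as a list of link ids.
table = pa.table({
    "origin_id": [1, 1, 2],
    "destination_id": [10, 11, 10],
    "route": [[4, 7, 9], [4, 8], [2, 3]],
})

# Partitioning on origin_id gives one directory per origin.
ds.write_dataset(table, "route_choice_ds", format="parquet",
                 partitioning=["origin_id"], partitioning_flavor="hive")

# Reading a single origin only touches that origin's files (partition pruning),
# but still returns every destination stored for it.
dataset = ds.dataset("route_choice_ds", format="parquet", partitioning="hive")
one_origin = dataset.to_table(filter=ds.field("origin_id") == 1)
```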

@Jake-Moss
Contributor Author

Jake-Moss commented Feb 7, 2024

I think a summary of my reading today would be useful. As I see it we have 3 real approaches to this.

  1. Retain the whole route choice set in memory until we are done with computation. Then use the multithreaded writing options of Arrow. This could be either from Python or from Cython using the C++ interface instead.
    • Implementation-wise this would be the most straightforward; the compute and dumping stages would be nicely separated and we'd be able to leverage just about any file format or compression scheme Arrow and Parquet have to offer.
    • A major disadvantage would be the memory requirements. Not only would we have to hold everything in memory at once, but roughly 2x that: Arrow object construction is by copy only (the objects are immutable, and copying is one way that immutability is guaranteed). Doing this from Python would also require first constructing Python objects from the C++ data structures, then copying them again to make the Arrow objects.
    • This is definitely the most accessible and flexible option, even if we end up using the C++ interface to construct the Arrow objects before writing.
  2. Utilise the Parquet FileWriter to incrementally write one table at a time. It has more advanced options, but those really get into the weeds.
    • We'd have to create multiple tables, but that should be fine. If it functions the same as the Python API then it'll read back as one table.
    • The C++ API does have an option to manually create and manage RowGroups through the NewRowGroup function; this, however, requires manual sizing of the RowGroups and really gets into the weeds of the Parquet format. It would require careful reading of the Parquet format spec and some statistical analysis of our data in order not to badly degrade the resulting files.
    • Either of these would allow us to use a worker thread that consumes from a thread-safe queue of results to write. We'd be able to limit the number of results in memory at once, dynamically pause calculations to account for the speed of writing, and utilise threads, compression, and anything else Parquet has to offer.
  3. Ditch Parquet in favour of Arrow IPC. This is a pretty big change from what we discussed and has some caveats.
    • Arguably the largest is that it's not Parquet. We'd lose partitioning (though it's replaceable in a slightly less nice manner), high-efficiency compression, and long-term stability. Arrow IPC is designed for streaming data structures between Arrow processes in a standard, (near-)zero-decoding format; it's exactly the same as how the data is represented in memory.
    • We would have some compression via IpcWriteOptions, although the FAQ ("What is the difference between Apache Arrow and Apache Parquet?") notes that the Arrow IPC format is not meant for storage, and while it can be compressed, it's nowhere near as compact as Parquet can be.
    • As Arrow IPC is not designed for storage, it's also not as stable as Parquet. Apache states that while compatibility between versions is supported, it's not a priority.
    • We'd also need to change how things are conceptually stored, ditching the Table format in favour of RecordBatches. This isn't a real issue and arguably makes more sense given our data.
    • We would, however, be able to take the same approach as 2. and have a worker thread consuming a queue of results to write. We'd have the same benefits of multithreaded IO and limited memory usage.
    • We also wouldn't lose the benefits of partitioning. The file would become one large file; however, it appears we would be able to read in a lookup table for our origin ids and then read in a particular index. How this is implemented I'm not sure, but I don't think it would read the whole dataset and only then keep the requested part.

Overall option 3 isn't great; it's just here for comparison. It would be much more appealing if we needed to stream this data over a (fast) network or operate directly on the resulting table. In terms of implementation ease, 1 would by far be the quickest and easiest, however it comes with the memory issues. I think option 2 is overall the best bet, though it does have more moving parts; a rough sketch of how it might look from Python is at the end of this comment.

Optimistically I think 1 could be done in a day, with 2 taking maybe (?) 2-3 days, accounting for the idiosyncrasies of Cython, C++, and Arrow.
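For reference, a rough sketch of what option 2 might look like from Python with pyarrow's ParquetWriter; compute_route_choice_batches() and the column names are placeholders, not anything implemented here:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("origin_id", pa.uint32()),
    ("destination_id", pa.uint32()),
    ("route", pa.list_(pa.uint32())),  # link ids of one route (exact nesting is illustrative)
])

# Each write_table() call appends new row group(s); the file still reads back
# as a single table, mirroring the incremental behaviour of the C++ FileWriter.
with pq.ParquetWriter("route_choice.parquet", schema, compression="zstd") as writer:
    for batch in compute_route_choice_batches():  # hypothetical generator yielding column dicts
        writer.write_table(pa.Table.from_pydict(batch, schema=schema))
```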

@pedrocamargo
Contributor

EXCELLENT summary, @Jake-Moss. I am inclined to say that option 2 is indeed the ideal, but there is no absolute need for the saving of path files to be incredibly fast when performing assignment.

Maybe writing choice sets to be used later would be a welcome option, @jamiecook.

@Jake-Moss
Contributor Author

I've got an idea for using 2 in a much simpler and easier-to-reason-about manner that I'm working on now. We can just batch the computation: instead of using a complicated worker-thread approach to save while generating, I think we could just compute a portion of the results, switch to dumping them, then switch back to computation. It wouldn't be as fancy or fast, but it would be much easier to implement, particularly as a first attempt.
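A minimal sketch of that alternation; compute_route_choice() and flush_to_disk() are placeholders for the compute kernel and whatever writer we settle on:

```python
results = []
for origin in origins:                      # placeholder iterable of origin ids
    results.append(compute_route_choice(origin))
    if len(results) >= max_origins_in_memory:
        flush_to_disk(results)              # dump this portion before resuming computation
        results.clear()                     # keeps the in-memory footprint bounded
if results:
    flush_to_disk(results)                  # flush whatever is left over
```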

@janzill
Contributor

janzill commented Feb 8, 2024

Nice summary @Jake-Moss. Batching sounds reasonable; another approach could potentially be to not care about the number of files and related RowGroups while processing data, but instead add a post-processing step at the end.

@Jake-Moss
Contributor Author

I've just pushed the first form of real saving; test compatibility will come in a few moments.

I've managed to implement all the desired features with one real catch: writing to disk requires the GIL. It is, however, multithreaded on pyarrow's end.

I attempted to keep all the disk writing and data transformation in Cython, interoperating with the PyArrow Cython and C++ APIs; however, I've found the Cython API is incredibly undocumented and fragile. Unfortunately, working with the C++ API has its own caveats and difficulties. I wasn't able to decipher it; perhaps someone with more C++ experience might have been able to, but I'm not sure the investment would be worth the rewards. I don't expect having to reacquire the GIL after each batch to have a severe impact on runtime.
In my explorations I've found,

  • multiple ways to crash the Cython compiler related to missing constructors,
  • order-dependent cimports, and
  • issues related to non-existent const attributes where Cython gets confused

Due to the inflexibilities of the API, or rather my inability to bend it to my will, I've changed the data model slightly. Instead of having a row per origin id, which has a MapArray of destination ids to lists, we now have a row per OD pair with the list as a top-level value. I think this is both more flexible from a user perspective and easier to reason about. AFAIK this is just an overall better way to store things.

A small implementation detail to be aware of is that the batching for the disk writing requires that a whole partition be written at once. That is, all OD pairs with a specific origin must be computed and dumped to disk at once, otherwise the previous results will be overwritten. This isn't really an issue and is handled already, but it's worth noting that the batching is as granular as it previously was. It also means that attempting to append to an existing dataset using an existing origin value will overwrite the stored results (sketched below, after the feature list).

The current implementation allows for,

  1. in memory construction only, returning a Pyarrow table,
  2. immediate disk dumping (batched per origin), returning None,
  3. hive dataset partitioning (keyed on origins), and
  4. multithreaded reads, writes, and compute. The current serial choke point is the Pyarrow data structure construction.
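For illustration only, a sketch of per-origin batched dumping into a hive-partitioned dataset with pyarrow; table_for_origin() is a placeholder, and the existing_data_behavior value shown is one way the overwrite behaviour described above can arise:

```python
import pyarrow.dataset as ds

for origin in origin_batch:                        # all OD pairs for one origin at a time
    ds.write_dataset(
        table_for_origin(origin),                  # pyarrow Table with every row for this origin
        "route_choice_ds",
        format="parquet",
        partitioning=["origin_id"],
        partitioning_flavor="hive",
        # Matching partition files are deleted before writing, so a whole origin
        # must be dumped in one go or earlier results for it are lost.
        existing_data_behavior="delete_matching",
    )
```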

In what I consider a bad move, the PyPI wheel installation of pyarrow requires modifying the installation environment to create symlinks for the linker.
@Jake-Moss
Contributor Author

The code is unprofiled; I'm not sure if this is the best approach, but it works well.

It works by stacking all the paths in a route set into a big vector, then sorting it. By sorting, all the links become grouped and we can just count their occurrences. This has the added benefit of the resulting frequency arrays being sorted, so we can bisect them later. Generally this has really simple memory accesses and is easy to read. A sketch of this approach follows below.

Another possible implementation might be to sort each path individually, then walk and merge them all (not adding duplicates). This is trickier and requires a lot of bookkeeping to walk n arrays correctly; the upside is lower memory usage provided we sort in place, and if not then it should be the same.
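A small numpy sketch of the sort-and-count idea (the real code lives in Cython; names here are illustrative):

```python
import numpy as np

def link_frequencies(route_set):
    """Count link occurrences across a route set by stacking and sorting."""
    stacked = np.concatenate(route_set)   # every link from every path in one big vector
    stacked.sort()                        # identical link ids become adjacent
    # Count the runs of each link id; the result stays sorted, so it can be bisected later.
    links, counts = np.unique(stacked, return_counts=True)
    return links, counts

links, counts = link_frequencies([np.array([1, 4, 7]),
                                  np.array([1, 2, 7]),
                                  np.array([4, 7, 9])])
# links  -> [1 2 4 7 9]
# counts -> [2 1 2 3 1]
```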
@Jake-Moss changed the title from "Saving route choice results" to "Route choice additions" on Feb 15, 2024
@Jake-Moss
Contributor Author

I've done some profiling on the frequency calculation code I pushed yesterday and I believe it's fast enough that it's not worth optimising further. On the QLD sample data it makes up about 0.1% of the runtime as reported by perf at 2000 Hz. Path finding is still the most time-consuming part.

@Jake-Moss
Contributor Author

Regarding the CI, I attempted to fix it locally but had no luck. It's all build issues related to gcc failing to find libarrow. Pyarrow is meant to ship with this, but with an ABI tag. The python -c "import pyarrow; pyarrow.create_library_symlinks()" call is supposed to fix this but isn't. A recommendation from an existing issue is to build against the conda version of pyarrow, which is not useful. https://github.com/apache/arrow/issues/22388

@Jake-Moss
Contributor Author

Despite CI failures, all tests pass locally with 78.80% coverage

Jake-Moss and others added 15 commits February 16, 2024 14:09
Moves the computation of the path size logit inside the main multithreaded loop; this lets us batch them as well and
dump them to disk along with the rest of the tables. Enable multithreading by default. Catch a handful of memory leaks.
* OSM Geo-processing using GeoPandas

* Processing with chunking

* pandas deprecations

* improves gridding function

* adds network cleaning

* Allows for use of polygons with rings inside

* adjusts types

* Update pyproject.toml

* Removes legacy code

* updates versions

---------

Co-authored-by: pveigadecamargo <[email protected]>
Co-authored-by: Jake Moss <[email protected]>
commit 9337fb611463606b0ca89d1270210fa7fdf46714, c77d5e6, 9dc3650, b4b945c, e0c32f2, eedb859, d0def11 (Jake-Moss): .

commit 3f2c01b (Jake-Moss): I give up

commit a6b2a8e (Jake Moss, merge): Merge branch 'develop' into pedro/ci_test

commit 0548be6 (Jake-Moss): Macos test

commit a10c791 (Jake-Moss): Maybe fix MacOS again

commit cad0579 (Jake-Moss): Maybe MacOS fix

commit 24de8f7 (Jake-Moss): Hopefully fix CI

commit 66ffcc3, f934c34, 9f2ca20, ba0d882, dd6723f, f7ae37e, 0f954e4, 3dafd88, ebe3a19, daf48a5, 3df73da (pveigadecamargo): .

commit 315cbce, d58e7cc (Pedro Camargo): Update pyproject.toml

commit 9f4413e (pveigadecamargo, merge): Merge branch 'develop' of github.com:AequilibraE/aequilibrae into pedro/ci_test

commit 9448e76, c9f2aaa, b2d5d3d, 2458f9a, e9e660b (pveigadecamargo): adds emulation

commit f112262 (pveigadecamargo): Add ARM architectures for Linux and mac

commit bae7d0d, 293457a, 06b6a44, 0ebce09 (pveigadecamargo): tests cibuildwheels
@Jake-Moss
Contributor Author

Closes #510, closes #488

@Jake-Moss marked this pull request as ready for review on February 28, 2024 00:55
@Jake-Moss
Contributor Author

Merging this to make a clean slate for new API and assignment changes

@Jake-Moss merged commit 0230ae7 into AequilibraE:route_choice on Feb 28, 2024
19 checks passed
@Jake-Moss deleted the route_choice_saving branch on February 28, 2024 00:59