
Route choice additions #508

Merged
merged 36 commits into AequilibraE:route_choice from route_choice_saving on Feb 28, 2024

Conversation

Jake-Moss
Contributor

Migrated from Jake-Moss#1

I don't think the approach I currently have is the way to go but it functions as a POC. I'll revamp it tomorrow utilising the table more now that I have a better understanding.

Ideally we'd be able to

  • Apply partitioning to the data; reading one origin shouldn't require reading all the origins,
  • but we should retain locality; reading an origin should (optionally) read in all the relevant destinations.
  • And store varying destinations; not all origins will have the same destinations.

Currently, the approach I had can't be partitioned, although I think it does satisfy the other two.

To allow partitioning I think I'll have to move the origin index up to the main table; this would also let me drop the column index field and the struct entirely. But I think this would force the use of string keys for the columns in the table.
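For illustration, a minimal sketch (not the current implementation) of how hive partitioning keyed on the origin could give us those properties, assuming pyarrow's dataset API; the column names and paths are placeholders:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Illustrative schema: one row per OD pair, a route stored as a list of link ids.
table = pa.table({
    "origin_id": [1, 1, 2],
    "destination_id": [10, 11, 10],
    "route": [[4, 7, 9], [4, 8], [2, 3]],
})

# Partitioning on origin_id gives one directory per origin.
ds.write_dataset(table, "route_choice_ds", format="parquet",
                 partitioning=["origin_id"], partitioning_flavor="hive")

# Reading a single origin only touches that origin's files (partition pruning),
# but still returns every destination stored for it.
dataset = ds.dataset("route_choice_ds", format="parquet", partitioning="hive")
one_origin = dataset.to_table(filter=ds.field("origin_id") == 1)
```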

@Jake-Moss
Contributor Author

Jake-Moss commented Feb 7, 2024

I think a summary of my reading today would be useful. As I see it we have 3 real approaches to this.

  1. Retain the whole route choice set in memory until we are done with computation. Then use the multithreaded writing options of Arrow. This could be either from Python or from Cython using the C++ interface instead.
    • Implementation-wise this would be the most straightforward; the compute and dumping stages would be nicely separated and we'd be able to leverage just about any file format or compression scheme Arrow and Parquet have to offer.
    • A major disadvantage would be the memory requirements. Not only would we have to hold everything in memory at once, but roughly 2x that: Arrow object construction is by copy only (the objects are immutable, and copying is one way that immutability is guaranteed). Doing this from Python would also require first constructing Python objects from the C++ data structures, then copying them again to make the Arrow objects.
    • This is definitely the most accessible and flexible option, even if we end up using the C++ interface to construct the Arrow objects before writing.
  2. Utilise the Parquet FileWriter to incrementally write one table at a time. It has more advanced options, but those really get into the weeds.
    • We'd have to create multiple tables, but that should be fine. If it functions the same as the Python API then it'll read back as one table.
    • The C++ API does have an option to manually create and manage RowGroups through the NewRowGroup function; this, however, requires manual sizing of the RowGroups and really gets into the weeds of the Parquet format. It would require careful reading of the Parquet format spec and some statistical analysis of our data in order not to badly degrade the resulting files.
    • Either of these would allow us to use a worker thread that consumes from a thread-safe queue of results to write. We'd be able to limit the number of results in memory at once, dynamically pause calculations to account for the speed of writing, and utilise threads, compression, and anything else Parquet has to offer.
  3. Ditch Parquet in favour of Arrow IPC. This is a pretty big change from what we discussed and has some caveats.
    • Arguably the largest is that it's not Parquet. We'd lose partitioning (though it's replaceable in a slightly less nice manner), high-efficiency compression, and long-term stability. Arrow IPC is designed for streaming data structures between Arrow processes in a standard, (near-)zero-decoding format; it's exactly the same as how the data is represented in memory.
    • We would have some compression via IpcWriteOptions, although the FAQ ("What is the difference between Apache Arrow and Apache Parquet?") notes that the Arrow IPC format is not meant for storage, and while it can be compressed, it's nowhere near as compact as Parquet can be.
    • As Arrow IPC is not designed for storage, it's also not as stable as Parquet. Apache states that while compatibility between versions is supported, it's not a priority.
    • We'd also need to change how things are conceptually stored, ditching the Table format in favour of RecordBatches. This isn't a real issue and arguably makes more sense given our data.
    • We would, however, be able to take the same approach as 2. and have a worker thread consuming a queue of results to write. We'd have the same benefits of multithreaded IO and limited memory usage.
    • We also wouldn't lose the benefits of partitioning. The file would become one large file; however, it appears we would be able to read in a lookup table for our origin ids and then read in a particular index. How this is implemented I'm not sure, but I don't think it would read the whole dataset and only then keep the requested part.

Overall option 3 isn't great; it's just here for comparison. It would be much more appealing if we needed to stream this data over a (fast) network or operate directly on the resulting table. In terms of implementation ease, 1 would by far be the quickest and easiest, however it comes with the memory issues. I think option 2 is overall the best bet, though it does have more moving parts; a rough sketch of how it might look from Python is at the end of this comment.

Optimistically I think 1 could be done in a day, with 2 taking maybe (?) 2-3 days, accounting for the idiosyncrasies of Cython, C++, and Arrow.
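For reference, a rough sketch of what option 2 might look like from Python with pyarrow's ParquetWriter; compute_route_choice_batches() and the column names are placeholders, not anything implemented here:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("origin_id", pa.uint32()),
    ("destination_id", pa.uint32()),
    ("route", pa.list_(pa.uint32())),  # link ids of one route (exact nesting is illustrative)
])

# Each write_table() call appends new row group(s); the file still reads back
# as a single table, mirroring the incremental behaviour of the C++ FileWriter.
with pq.ParquetWriter("route_choice.parquet", schema, compression="zstd") as writer:
    for batch in compute_route_choice_batches():  # hypothetical generator yielding column dicts
        writer.write_table(pa.Table.from_pydict(batch, schema=schema))
```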

@pedrocamargo
Contributor

EXCELLENT summary, @Jake-Moss. I am inclined to say that option 2 is indeed the ideal, but there is no absolute need for the saving of path files to be incredibly fast when performing assignment.

Maybe writing choice sets to be used later would be a welcome option, @jamiecook.

@Jake-Moss
Contributor Author

I've got an idea for using 2 in a much simpler and easier-to-reason-about manner that I'm working on now. We can just batch the computation: instead of using a complicated worker-thread approach to save while generating, I think we could just compute a portion of the results, switch to dumping them, then switch back to computation. It wouldn't be as fancy or fast, but it would be much easier to implement, particularly as a first attempt.
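A minimal sketch of that alternation; compute_route_choice() and flush_to_disk() are placeholders for the compute kernel and whatever writer we settle on:

```python
results = []
for origin in origins:                      # placeholder iterable of origin ids
    results.append(compute_route_choice(origin))
    if len(results) >= max_origins_in_memory:
        flush_to_disk(results)              # dump this portion before resuming computation
        results.clear()                     # keeps the in-memory footprint bounded
if results:
    flush_to_disk(results)                  # flush whatever is left over
```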

@janzill
Contributor

janzill commented Feb 8, 2024

Nice summary @Jake-Moss. Batching sounds reasonable; another approach could potentially be to not care about the number of files and related RowGroups while processing data, but instead add a post-processing step at the end.

@Jake-Moss
Contributor Author

I've just pushed the first form of real saving; test compatibility will come in a few moments.

I've managed to implement all the desired features with one real catch: writing to disk requires the GIL. It is, however, multithreaded on pyarrow's end.

I attempted to keep all the disk writing and data transformation in Cython, interoperating with the PyArrow Cython and C++ APIs; however, I've found the Cython API is incredibly undocumented and fragile. Unfortunately, working with the C++ API has its own caveats and difficulties. I wasn't able to decipher it; perhaps someone with more C++ experience might have been able to, but I'm not sure the investment would be worth the rewards. I don't expect having to reacquire the GIL after each batch to have a severe impact on runtime.
In my explorations I've found,

  • multiple ways to crash the Cython compiler related to missing constructors,
  • order-dependent cimports, and
  • issues related to non-existent const attributes where Cython gets confused

Due to the inflexibilities of the API, or rather my inability to bend it to my will, I've changed the data model slightly. Instead of having a row per origin id, which has a MapArray of destination ids to lists, we now have a row per OD pair with the list as a top-level value. I think this is both more flexible from a user perspective and easier to reason about. AFAIK this is just an overall better way to store things.

A small implementation detail to be aware of is that the batching for the disk writing requires that a whole partition be written at once. That is, all OD pairs with a specific origin must be computed and dumped to disk at once, otherwise the previous results will be overwritten. This isn't really an issue and is handled already, but it's worth noting that the batching is as granular as it previously was. It also means that attempting to append to an existing dataset using an existing origin value will overwrite the stored results (sketched below, after the feature list).

The current implementation allows for,

  1. in memory construction only, returning a Pyarrow table,
  2. immediate disk dumping (batched per origin), returning None,
  3. hive dataset partitioning (keyed on origins), and
  4. multithreaded reads, writes, and compute. The current serial choke point is the Pyarrow data structure construction.
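For illustration only, a sketch of per-origin batched dumping into a hive-partitioned dataset with pyarrow; table_for_origin() is a placeholder, and the existing_data_behavior value shown is one way the overwrite behaviour described above can arise:

```python
import pyarrow.dataset as ds

for origin in origin_batch:                        # all OD pairs for one origin at a time
    ds.write_dataset(
        table_for_origin(origin),                  # pyarrow Table with every row for this origin
        "route_choice_ds",
        format="parquet",
        partitioning=["origin_id"],
        partitioning_flavor="hive",
        # Matching partition files are deleted before writing, so a whole origin
        # must be dumped in one go or earlier results for it are lost.
        existing_data_behavior="delete_matching",
    )
```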

In what I consider a bad move, the PyPI wheel installation of pyarrow requires modifying the installation environment to create symlinks for the linker.
@Jake-Moss
Contributor Author

The code is unprofiled; I'm not sure if this is the best approach, but it works well.

It works by stacking all the paths in a route set into a big vector, then sorting it. By sorting, all the links become grouped and we can just count their occurrences. This has the added benefit of the resulting frequency arrays being sorted, so we can bisect them later. Generally this has really simple memory accesses and is easy to read. A sketch of this approach follows below.

Another possible implementation might be to sort each path individually, then walk and merge them all (not adding duplicates). This is trickier and requires a lot of bookkeeping to walk n arrays correctly; the upside is lower memory usage provided we sort in place, and if not then it should be the same.
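A small numpy sketch of the sort-and-count idea (the real code lives in Cython; names here are illustrative):

```python
import numpy as np

def link_frequencies(route_set):
    """Count link occurrences across a route set by stacking and sorting."""
    stacked = np.concatenate(route_set)   # every link from every path in one big vector
    stacked.sort()                        # identical link ids become adjacent
    # Count the runs of each link id; the result stays sorted, so it can be bisected later.
    links, counts = np.unique(stacked, return_counts=True)
    return links, counts

links, counts = link_frequencies([np.array([1, 4, 7]),
                                  np.array([1, 2, 7]),
                                  np.array([4, 7, 9])])
# links  -> [1 2 4 7 9]
# counts -> [2 1 2 3 1]
```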
@Jake-Moss changed the title from "Saving route choice results" to "Route choice additions" on Feb 15, 2024
@Jake-Moss
Contributor Author

I've done some profiling on the frequency calculation code I pushed yesterday and I believe it's fast enough that it's not worth optimising further. On the QLD sample data it makes up about 0.1% of the runtime as reported by perf at 2000 Hz. Path finding is still the most time-consuming part.

@Jake-Moss
Contributor Author

Regarding the CI, I attempted to fix it locally but had no luck. It's all build issues related to gcc failing to find libarrow. Pyarrow is meant to ship with this, but with an ABI tag. The python -c "import pyarrow; pyarrow.create_library_symlinks()" call is supposed to fix this but isn't. A recommendation from an existing issue is to build against the conda version of pyarrow, which is not useful. https://github.com/apache/arrow/issues/22388

@Jake-Moss
Contributor Author

Despite CI failures, all tests pass locally with 78.80% coverage

Jake-Moss and others added 15 commits February 16, 2024 14:09
Moves the computation of the path size logit inside the main multithreaded loop; this lets us batch them as well and
dump them to disk along with the rest of the tables. Enable multithreading by default. Catch a handful of memory leaks.
* OSM Geo-processing using GeoPandas

* Processing with chunking

* pandas deprecations

* improves gridding function

* adds network cleaning

* Allows for use of polygons with rings inside

* adjusts types

* Update pyproject.toml

* Removes legacy code

* updates versions

---------

Co-authored-by: pveigadecamargo <[email protected]>
Co-authored-by: Jake Moss <[email protected]>
commit 9337fb611463606b0ca89d1270210fa7fdf46714, c77d5e6, 9dc3650, b4b945c, e0c32f2, eedb859, d0def11 (Jake-Moss): .

commit 3f2c01b (Jake-Moss): I give up

commit a6b2a8e (Jake Moss, merge): Merge branch 'develop' into pedro/ci_test

commit 0548be6 (Jake-Moss): Macos test

commit a10c791 (Jake-Moss): Maybe fix MacOS again

commit cad0579 (Jake-Moss): Maybe MacOS fix

commit 24de8f7 (Jake-Moss): Hopefully fix CI

commit 66ffcc3, f934c34, 9f2ca20, ba0d882, dd6723f, f7ae37e, 0f954e4, 3dafd88, ebe3a19, daf48a5, 3df73da (pveigadecamargo): .

commit 315cbce, d58e7cc (Pedro Camargo): Update pyproject.toml

commit 9f4413e (pveigadecamargo, merge): Merge branch 'develop' of github.com:AequilibraE/aequilibrae into pedro/ci_test

commit 9448e76, c9f2aaa, b2d5d3d, 2458f9a, e9e660b (pveigadecamargo): adds emulation

commit f112262 (pveigadecamargo): Add ARM architectures for Linux and mac

commit bae7d0d, 293457a, 06b6a44, 0ebce09 (pveigadecamargo): tests cibuildwheels
@Jake-Moss
Contributor Author

Closes #510, closes #488

@Jake-Moss marked this pull request as ready for review on February 28, 2024 00:55
@Jake-Moss
Contributor Author

Merging this to make a clean slate for new API and assignment changes

@Jake-Moss merged commit 0230ae7 into AequilibraE:route_choice on Feb 28, 2024
19 checks passed
@Jake-Moss deleted the route_choice_saving branch on February 28, 2024 00:59