-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] Convert GeoJson Features to GeoArrow (Draft) #2889
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to handle converting properties to arrow as well? That requires handling at least a few different property types.
Vector as ArrowVector, | ||
makeBuilder | ||
} from 'apache-arrow'; | ||
import { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe clearer to use import type
if (I think) these are only used as types
let geometryType = ''; | ||
for (let i = 0; i < features.length; i++) { | ||
const {geometry} = features[i]; | ||
if (i === 0) { | ||
geometryType = geometry.type; | ||
} | ||
if (geometry.type.includes('Multi')) { | ||
geometryType = geometry.type; | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead I'd suggest building a Set
of geometry types instead, because you probably want to error if you see both Point
and LineString
geometries in a single feature collection (at least until you want to support mixed-geometry arrays, which have a decent amount of complexity).
// make geoarrow builder | ||
const builder = makeBuilder({ | ||
type: pointType, | ||
nullValues: [null] | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fwiw I've never tried the Arrow JS builders.
Like how the geojson-binary geojson conversion is implemented in loaders, in my own work I try to do a two-pass approach where I first count the number of coordinates and thus infer the sizes of buffers needed, and then in a second step allocate the exact size of buffers once and fill them with coordinates.
In your case if you wanted to go that route you could count the number of coordinates you'll have, instantiate a single Float64Array
and set the values with coordinates. For nested types you'd have to manage building your own offset buffers as well, which may be a bit more complex than you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While your two-pass approach is surely more optimized It would be nice to build a baseline with the builders, as this presumably lets us work with more sanely with complex data types, that can have many index columns.
We can add such optimizations later.
// build geometry vector | ||
const geometry = builder.finish().toVector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ever want to handle attributes in the future, ideally each column will have identical chunking. Whereas when you let Arrow JS handle the builders, you don't necessarily have control over the chunking of each column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I was assuming that we if we build the columns in parallel, we can choose to stop "building" after a certain number of rows and then create a RecordBatch from the "partial" vectors for each column, and then start building new colums/vectors again, rinse and repeat?
export function geojsonLineStringToArrow(name: string, lines: Feature[]): GeoArrowReturnType { | ||
// get dimension from the first line | ||
const dimension = (lines[0].geometry as LineString).coordinates[0].length; | ||
const pointFieldName = dimension === 2 ? 'xy' : 'xyz'; | ||
|
||
// get line type | ||
const nullable = false; | ||
const coordType = new ArrowFloat(); | ||
const pointType = new FixedSizeList( | ||
dimension, | ||
new ArrowField(pointFieldName, coordType, nullable) | ||
); | ||
const lineType = new ArrowList(new ArrowField('points', pointType, nullable)); | ||
|
||
// create a field | ||
const metaData: Map<string, string> = new Map([['ARROW:extension:name', 'geoarrow.linestring']]); | ||
const field = new ArrowField(name, lineType, nullable, metaData); | ||
|
||
// make geoarrow builder | ||
const builder = makeBuilder({ | ||
type: lineType, | ||
nullValues: [null] | ||
}); | ||
|
||
for (let i = 0; i < lines.length; i++) { | ||
const coords = (lines[i].geometry as LineString).coordinates; | ||
// @ts-ignore | ||
builder.append(coords); | ||
} | ||
|
||
const geometry = builder.finish().toVector(); | ||
|
||
return { | ||
field, | ||
geometry | ||
}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like each of these 6 functions are exactly the same except for the creation of the types. You might be able to deduplicate the second half (filling) of each of these functions.
Thank you very much, @kylebarron! Great comments. I will address them. Yes, this PR is an initial work of converting other format e.g. GeoJSON/Csv to arrow, so I plan to handle attributes in the future. Do you know if there is already pure JS library doing this? Another question: when you mention: Thanks again! |
https://github.com/geoarrow/geoarrow-js/blob/main/src/algorithm/coords.ts has an example of creating arrays from raw Float64Arrays and Int32Arrays for offsets (though offsets there are not modified). You can look at the overloads on |
Signed-off-by: Xun Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great start. I'd love to use it for adding arrow output from GeoJSONLoader and CSVLoader etc. For that we also need to be able to create at least a subset of normal / property columns (numbers, booleans, strings, maybe dates).
@@ -0,0 +1,375 @@ | |||
import { | |||
Float64 as ArrowFloat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this renaming tends to get excessive, in other files I have been breaking my own rule (not to use wildcard imports) and used import * as arrow
and then referenced as arrow.Float
. I think this makes the code quite clear.
/** | ||
* Arrow Field and geometry vector | ||
*/ | ||
export type GeoArrowReturnType = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems more general, overall I would like us to have support for all column types not just GeoArrow geometry:
export type GeoArrowReturnType = { | |
export type ArrowColumn = { |
break; | ||
} | ||
} | ||
if (geometryType === 'Point') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I find switch
statments much easier to read.
if (i === 0) { | ||
geometryType = geometry.type; | ||
} | ||
if (geometry.type.includes('Multi')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more readable?
if (geometry.type.includes('Multi')) { | |
if (geometry.type.startsWith('Multi')) { |
name: string, | ||
points: Array<Feature | GeoJsonObject> | ||
): GeoArrowReturnType { | ||
const firstPoint = isFeature(points[0]) ? points[0].geometry : points[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do? Handle lists of just geometries? Do we need such dynamic support? It makes the code a little harder to follow.
// make geoarrow builder | ||
const builder = makeBuilder({ | ||
type: pointType, | ||
nullValues: [null] | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While your two-pass approach is surely more optimized It would be nice to build a baseline with the builders, as this presumably lets us work with more sanely with complex data types, that can have many index columns.
We can add such optimizations later.
// build geometry vector | ||
const geometry = builder.finish().toVector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I was assuming that we if we build the columns in parallel, we can choose to stop "building" after a certain number of rows and then create a RecordBatch from the "partial" vectors for each column, and then start building new colums/vectors again, rinse and repeat?
This WIP PR is trying to add functions to convert GeoJson Features to GeoArrow. These functions could be used as:
arrow-table
loader's option in other loaders e.g. GeoJson/Csv to load spatial data into Arrow memoryI am not confident with the implementation in this PR (missing documentation in js apache-arrow). Please help to take a look if you are interested. Thank you!