Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(connect): printSchema #3617

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

feat(connect): printSchema #3617

wants to merge 11 commits into from

Conversation

andrewgazelka
Copy link
Contributor

@andrewgazelka andrewgazelka commented Dec 19, 2024

TODO

  • should we reuse TreeDisplay?
  • remove unwraps

pub trait TreeDisplay {

  • should we make our own?

  • Example own impl that would need to be tested (don't look at seriously!)

pub fn to_tree_string(schema: &Schema) -> eyre::Result<String> {
    let mut output = String::new();
    // Start with root
    writeln!(&mut output, "root")?;
    // Now print each top-level field
    for (name, field) in &schema.fields {
        print_field(&mut output, name, &field.dtype, /*nullable*/ true, 1)?;
    }
    Ok(output)
}

// A helper function to print a field at a given level of indentation.
// level=1 means a single " |-- " prefix, level=2 means
// " |    |-- " and so on, mimicking Spark's indentation style.
fn print_field(
    w: &mut String, 
    field_name: &str, 
    dtype: &DataType, 
    nullable: bool, 
    level: usize
) -> eyre::Result<()> {
    // Construct the prefix for indentation.
    // Spark indentation levels:
    // level 1:  " |-- "
    // level 2:  " |    |-- "
    // level n:  " |" followed by (4*(n-1)) spaces + "-- "
    let indent = if level == 1 {
        format!(" |-- ")
    } else {
        let spaces = " ".repeat(4*(level-1));
        format!(" |{}-- ", spaces)
    };

    // Get a user-friendly string for dtype
    let dtype_str = type_to_string(dtype);

    writeln!(
        w,
        "{}{}: {} (nullable = {})",
        indent, field_name, dtype_str, nullable
    )?;

    // If the dtype is a struct, we must print its child fields with increased indentation.
    if let DataType::Struct(fields) = dtype {
        for field in fields {
            print_field(w, &field.name, &field.dtype, true, level + 1)?;
        }
    }

    Ok(())
}

fn type_to_string(dtype: &DataType) -> String {
    // We want a nice, human-readable type string.
    // Spark generally prints something like "integer", "string", etc.
    // We'll follow a similar style here:
    match dtype {
        DataType::Null => "null".to_string(),
        DataType::Boolean => "boolean".to_string(),
        DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64
        | DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64 => "integer".to_string(), // Spark doesn't differentiate sizes
        DataType::Float32 | DataType::Float64 => "double".to_string(), // Spark calls all floats double for printing
        DataType::Decimal128(_, _) => "decimal".to_string(),
        DataType::Timestamp(_, _) => "timestamp".to_string(),
        DataType::Date => "date".to_string(),
        DataType::Time(_) => "time".to_string(),
        DataType::Duration(_) => "duration".to_string(),
        DataType::Interval => "interval".to_string(),
        DataType::Binary => "binary".to_string(),
        DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
        DataType::Utf8 => "string".to_string(),
        DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays
        DataType::List(_) => "array".to_string(),
        DataType::Struct(_) => "struct".to_string(),
        DataType::Map { .. } => "map".to_string(),
        DataType::Extension(_, _, _) => "extension".to_string(),
        DataType::Embedding(_, _) => "embedding".to_string(),
        DataType::Image(_) => "image".to_string(),
        DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
        DataType::Tensor(_) => "tensor".to_string(),
        DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
        DataType::SparseTensor(_) => "sparse_tensor".to_string(),
        DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),
        #[cfg(feature = "python")]
        DataType::Python => "python_object".to_string(),
        DataType::Unknown => "unknown".to_string(),
    }
}

@github-actions github-actions bot added the feat label Dec 19, 2024
Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

codspeed-hq bot commented Dec 19, 2024

CodSpeed Performance Report

Merging #3617 will improve performances by 77.66%

Comparing andrew/print-schema (f4a3051) with main (e59581c)

Summary

⚡ 2 improvements
✅ 25 untouched benchmarks

Benchmarks breakdown

Benchmark main andrew/print-schema Change
test_iter_rows_first_row[100 Small Files] 209.1 ms 117.7 ms +77.66%
test_show[100 Small Files] 23.8 ms 15.6 ms +52.82%

@andrewgazelka andrewgazelka marked this pull request as ready for review December 19, 2024 16:31
Copy link

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 86.83386% with 42 lines in your changes missing coverage. Please review.

Project coverage is 77.65%. Comparing base (e59581c) to head (f4a3051).

Files with missing lines Patch % Lines
src/daft-connect/src/display.rs 87.63% 35 Missing ⚠️
src/daft-connect/src/lib.rs 80.00% 7 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3617      +/-   ##
==========================================
- Coverage   77.99%   77.65%   -0.34%     
==========================================
  Files         720      721       +1     
  Lines       88794    91520    +2726     
==========================================
+ Hits        69252    71070    +1818     
- Misses      19542    20450     +908     
Files with missing lines Coverage Δ
src/daft-connect/src/translation/schema.rs 100.00% <100.00%> (ø)
src/daft-connect/src/lib.rs 65.01% <80.00%> (+2.13%) ⬆️
src/daft-connect/src/display.rs 87.63% <87.63%> (ø)

... and 14 files with indirect coverage changes

tests/connect/test_print_schema.py Outdated Show resolved Hide resolved
src/daft-connect/src/display.rs Outdated Show resolved Hide resolved
src/daft-connect/src/display.rs Outdated Show resolved Hide resolved
src/daft-connect/src/display.rs Outdated Show resolved Hide resolved
Comment on lines 81 to 85
DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
DataType::Tensor(_) => "tensor".to_string(),
DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
DataType::SparseTensor(_) => "sparse_tensor".to_string(),
DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think these exist in spark (along with unsized ints). We should check if spark connect has a standard around extension or user defined types. If they don't I'd at least want something in the display to indicate that these are not native spark types, but in fact daft datatypes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmmmm https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/UserDefinedType.html

can people even define UDTs outside of Java? I don't see a pyspark example

src/daft-connect/src/display.rs Outdated Show resolved Hide resolved
src/daft-connect/src/display.rs Outdated Show resolved Hide resolved
DataType::Binary => "binary".to_string(),
DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
DataType::Utf8 => "string".to_string(),
DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would represent this as a custom type similar to the other non native dtypes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is no standard thoughts on something like daft[fixed_size_list]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since there seems to be no standard, I'd prefer to separate them into 2 categories.

arrow native datatypes:

for arrow native datatypes such as unsigned integers,fsl, etc, lets go with arrow.<datatype> such as

  • u64 -> arrow.uint64,
  • fsl(u8, 1) -> arrow.fixed_size_list(1)\n --- element: arrow.u8

(this resembles how arrow does extension types)

custom daft datatypes:

for non arrow native ones such as SparseTensor, Image, and so on, let's prefix them with daft.

  • image -> daft.image(<image_mode>) ex: daft.image(RGB)
  • sparsetensor(u8) -> daft.sparse_tensor\n --- element: arrow.u8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be done

Comment on lines +101 to +121
DataType::FixedSizeBinary(_) => "arrow.fixed_size_binary".to_string(),
DataType::Utf8 => "string".to_string(),
DataType::FixedSizeList(_, _) => "arrow.fixed_size_list".to_string(),
DataType::List(_) => "arrow.list".to_string(),
DataType::Struct(_) => "struct".to_string(),
DataType::Map { .. } => "map".to_string(),
DataType::Extension(_, _, _) => "daft.extension".to_string(),
DataType::Embedding(_, _) => "daft.embedding".to_string(),
DataType::Image(_) => "daft.image".to_string(),
DataType::FixedShapeImage(_, _, _) => "daft.fixed_shape_image".to_string(),
DataType::Tensor(_) => "daft.tensor".to_string(),
DataType::FixedShapeTensor(_, _) => "daft.fixed_shape_tensor".to_string(),
DataType::SparseTensor(_) => "daft.sparse_tensor".to_string(),
DataType::FixedShapeSparseTensor(_, _) => "daft.fixed_shape_sparse_tensor".to_string(),
#[cfg(feature = "python")]
DataType::Python => "daft.python".to_string(),
DataType::Unknown => "unknown".to_string(),
DataType::UInt8 => "arrow.ubyte".to_string(),
DataType::UInt16 => "arrow.ushort".to_string(),
DataType::UInt32 => "arrow.uint".to_string(),
DataType::UInt64 => "arrow.ulong".to_string(),
Copy link
Contributor

@universalmind303 universalmind303 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I was unclear in my previous comment, but this is still not right.

arrow types should just be called what they are

        DataType::UInt8 => "arrow.uint8".to_string(),
        DataType::UInt16 => "arrow.uint16".to_string(),
        DataType::UInt32 => "arrow.uint32".to_string(),
        DataType::UInt64 => "arrow.uint64".to_string(),

and nested datatypes should match how spark does them
for example, lists have the inner rendered as "element"

data = [{"a": [1,2,3], "b": "hello"}]
spark.createDataFrame(data).printSchema()
root
 |-- a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- b: string (nullable = true)

and for structs Struct{ints: i64, strings: utf8}

root
 |-- struct: struct (nullable = true)
 |    |-- ints: integer (nullable = true)
 |    |-- strings: string (nullable = true)

We'll also want to capture the parameters on them such as FixedSizeList(Int64, 1)

root
 |-- a: arrow.fixed_size_list (size = 1, nullable = true)
 |    |-- element: long (containsNull = true)

or on Image(ImageMode::RGB)

root
 |-- a: daft.image (mode = RGB, nullable = true)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants