Many datasets are too small to justify a Spark cluster and the operational overhead of running a distributed system. TrivialDB lets you process such datasets on a single node through a familiar DataFrame API. It is built on Apache Calcite and Apache Arrow and aims to replace expensive Spark clusters for these workloads while keeping data processing efficient.
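- Printing the logical query plan (custom logical plan)
    import java.util.Arrays;

    // ctx is assumed to be an already-created context
    DataFrame df = ctx.inMemory()
        .project(Arrays.asList(
            col("id"),
            col("first_name"),
            col("last_name"),
            col("salary"),
            col("salary")))
        .filter(col("id").gt(litLong(0L)));
    System.out.println(df.logicalPlan().pretty());
    /* Example of the pretty-printed format. This sample was captured from a
     * query that filters on state = 'CO', so it will not match the query
     * above line for line:
     *   Projection: #id, #first_name, #last_name, #salary, #salary
     *   Selection: #state = 'CO'
     *   Scan: ; projection=None
     */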
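To make the plan printout easier to picture, here is a minimal, self-contained sketch of how a tree of plan operators can be rendered as indented text. Every name in it (`PlanSketch`, `PlanNode`, `ScanNode`, `SelectionNode`, `ProjectionNode`, the `employee` source) is hypothetical and chosen for illustration; it is not TrivialDB's implementation and its output format is only modeled on the sample above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: these are hypothetical stand-ins, not TrivialDB classes.
final class PlanSketch {

    /** A logical operator: a one-line description plus its input operators. */
    interface PlanNode {
        String describe();
        List<PlanNode> children();
    }

    static final class ScanNode implements PlanNode {
        private final String source;
        ScanNode(String source) { this.source = source; }
        public String describe() { return "Scan: " + source + "; projection=None"; }
        public List<PlanNode> children() { return List.of(); }
    }

    static final class ProjectionNode implements PlanNode {
        private final PlanNode input;
        private final List<String> columns;
        ProjectionNode(PlanNode input, List<String> columns) {
            this.input = input;
            this.columns = columns;
        }
        public String describe() {
            // Column references are rendered with a leading '#'.
            return "Projection: " + columns.stream()
                    .map(c -> "#" + c)
                    .collect(Collectors.joining(", "));
        }
        public List<PlanNode> children() { return List.of(input); }
    }

    static final class SelectionNode implements PlanNode {
        private final PlanNode input;
        private final String predicate;
        SelectionNode(PlanNode input, String predicate) {
            this.input = input;
            this.predicate = predicate;
        }
        public String describe() { return "Selection: " + predicate; }
        public List<PlanNode> children() { return List.of(input); }
    }

    /** Renders the plan root first, indenting each input by two spaces per level. */
    static String pretty(PlanNode root) {
        StringBuilder sb = new StringBuilder();
        append(root, 0, sb);
        return sb.toString();
    }

    private static void append(PlanNode node, int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.describe()).append('\n');
        for (PlanNode child : node.children()) {
            append(child, depth + 1, sb);
        }
    }

    public static void main(String[] args) {
        // project(...) followed by filter(...): the filter wraps the projection.
        PlanNode plan = new SelectionNode(
                new ProjectionNode(new ScanNode("employee"), List.of("id", "salary")),
                "#id > 0");
        System.out.print(pretty(plan));
        // Prints:
        // Selection: #id > 0
        //   Projection: #id, #salary
        //     Scan: employee; projection=None
    }
}
```

In this sketch the operator applied last becomes the root of the tree, which is why the order of `project` and `filter` calls determines which line appears first.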
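- Running queries
    // Build the plan with the DataFrame API; nothing runs until execute() is called
    DataFrame df = ctx.inMemory()
        .filter(new Eq(new Column("id"), new LiteralInt(1)));
    // Executing the logical plan returns the result as record batches
    Iterable<RecordBatch> results = ctx.execute(df.logicalPlan());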
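As a rough illustration of what an equality filter such as the one above has to do when it runs, the sketch below applies `column == value` to a small columnar batch. It assumes a deliberately simplified batch made of named `long[]` columns; `FilterSketch`, `SimpleBatch`, and `filterEq` are hypothetical names, and this is neither TrivialDB's `RecordBatch` nor an Apache Arrow type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a toy columnar batch, not TrivialDB's or Arrow's.
final class FilterSketch {

    /** Hypothetical batch: each column is a long[] of equal length. */
    static final class SimpleBatch {
        final Map<String, long[]> columns;
        SimpleBatch(Map<String, long[]> columns) { this.columns = columns; }
        int rowCount() {
            return columns.isEmpty() ? 0 : columns.values().iterator().next().length;
        }
    }

    /** Returns a new batch containing only the rows where column == value. */
    static SimpleBatch filterEq(SimpleBatch batch, String column, long value) {
        long[] keys = batch.columns.get(column);
        if (keys == null) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        // Pass 1: evaluate the predicate once and remember the matching row indices.
        List<Integer> selected = new ArrayList<>();
        for (int row = 0; row < keys.length; row++) {
            if (keys[row] == value) {
                selected.add(row);
            }
        }
        // Pass 2: copy the selected rows out of every column, one column at a time.
        Map<String, long[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> entry : batch.columns.entrySet()) {
            long[] src = entry.getValue();
            long[] dst = new long[selected.size()];
            for (int i = 0; i < dst.length; i++) {
                dst[i] = src[selected.get(i)];
            }
            out.put(entry.getKey(), dst);
        }
        return new SimpleBatch(out);
    }

    public static void main(String[] args) {
        Map<String, long[]> cols = new LinkedHashMap<>();
        cols.put("id", new long[] {1, 2, 1});
        cols.put("salary", new long[] {100, 200, 300});
        SimpleBatch result = filterEq(new SimpleBatch(cols), "id", 1L);
        System.out.println(result.rowCount() + " rows, salaries "
                + Arrays.toString(result.columns.get("salary")));
        // Prints: 2 rows, salaries [100, 300]
    }
}
```

Evaluating the predicate once and then copying column by column keeps each inner loop over a single contiguous array, which is the usual reason for processing data in columnar batches rather than row by row.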
- Testing integration with Apache Calcite and removing the custom logical plan classes
- Benchmarking TrivialDB's query engine against custom CSV/Parquet file processing
- Adding support for aggregations and joins
- Adding a network layer that accepts SQL queries
- Adding support for distributed query plans and extending the query engine to run distributed