missing jar files #100

Open
ami07 opened this issue May 20, 2019 · 11 comments
@ami07

ami07 commented May 20, 2019

Hi,
I am trying to generate spark code. However, it returns an error:
[info] Loading project definition from /home/ec2-user/dbtoaster/dbtoaster-backend/project
[info] Set current project to dbtoaster (in build file:/home/ec2-user/dbtoaster/dbtoaster-backend/)
[info] Updating {file:/home/ec2-user/dbtoaster/dbtoaster-backend/}lms...
[info] Resolving EPFL#lms_2.11;0.3-SNAPSHOT ...
[warn] module not found: EPFL#lms_2.11;0.3-SNAPSHOT
[warn] ==== local: tried
[warn] /home/ec2-user/.ivy2/local/EPFL/lms_2.11/0.3-SNAPSHOT/ivys/ivy.xml
[warn] ==== local-preloaded-ivy: tried
[warn] /home/ec2-user/.sbt/preloaded/EPFL/lms_2.11/0.3-SNAPSHOT/ivys/ivy.xml
[warn] ==== local-preloaded: tried
[warn] file:////home/ec2-user/.sbt/preloaded/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[warn] ==== sonatype-snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/EPFL/lms_2.11/0.3-SNAPSHOT/lms_2.11-0.3-SNAPSHOT.pom
[info] Resolving jline#jline;2.12 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: EPFL#lms_2.11;0.3-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] EPFL:lms_2.11:0.3-SNAPSHOT (/home/ec2-user/dbtoaster/dbtoaster-backend/ddbtoaster/lms/build.sbt#L10-36)
[warn] +- ch.epfl.data:dbtoaster-lms_2.11:3.0
sbt.ResolveException: unresolved dependency: EPFL#lms_2.11;0.3-SNAPSHOT: not found

So sbt cannot resolve the dependency declared in build.sbt: "EPFL" %% "lms" % "0.3-SNAPSHOT".
How can I fix this?
P.S. When I run the unit tests, they also fail with the same error.
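For reference, the relevant declaration is presumably something along these lines (a sketch; the actual ddbtoaster/lms/build.sbt may differ):

// build.sbt (sketch): the LMS dependency that sbt fails to resolve.
// 0.3-SNAPSHOT is not published on Maven Central or Sonatype snapshots,
// which is why every resolver in the log above comes up empty.
libraryDependencies += "EPFL" %% "lms" % "0.3-SNAPSHOT"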

@losmi83
Contributor

losmi83 commented May 20, 2019

You can try this:

git clone https://github.com/epfldata/lms.git
cd lms
git checkout booster-develop-0.3
git apply compiler_patch.txt (attached, adds support for Scala 2.11)
sbt publish-local

Now you should be able to generate code in dbtoaster-backend. If not, let me know.

Note: You can generate Spark code only for TPC-H queries.

compiler_patch.txt

@ami07
Author

ami07 commented May 20, 2019

Thank you very much.
I am now able to generate code for some of the example queries.

For a simplified TPCH query that I created (joins lineitem, ), I got this error:
Fatal error: exception Failure("Missing partitioning information for COUNTLINEITEM1_DELTA")

One other issue is confusing me. For the generated file, step 4 in the README is: "Compile the generated Spark program for the target execution environment."
Does that mean I should create a new project with a build.sbt that includes the jar files from the lms directory you shared above, or that I should copy the generated code somewhere into dbtoaster-backend/ddbtoaster/lms/ and use the build.sbt there?
Thanks

@losmi83
Contributor

losmi83 commented May 21, 2019

For a simplified TPCH query that I created (joins lineitem, ), I got this error:
Fatal error: exception Failure("Missing partitioning information for COUNTLINEITEM1_DELTA")

This version does not support custom queries over the TPC-H schema. I have just updated the frontend to allow that.

cd dbtoaster-a5
git pull
make

You should be able to generate code for custom TPC-H queries.

One other issue is confusing me. For the generated file, step 4 in the README is: "Compile the generated Spark program for the target execution environment."
Does that mean I should create a new project with a build.sbt that includes the jar files from the lms directory you shared above, or that I should copy the generated code somewhere into dbtoaster-backend/ddbtoaster/lms/ and use the build.sbt there?

To compile the generated code, you need to include the Spark jar files and the DBToaster runtime jar files. The latter you can find in the distribution under dbtoaster/lib/dbt_scala -- or, even better, run sbt release and you will find the latest DBToaster jar files under ddbtoaster/release/lib/dbt_scala.
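A minimal sketch of such a project, assuming you copy the DBToaster runtime jars into the project's lib/ directory and let the cluster provide Spark (the Scala and Spark versions below are illustrative):

// build.sbt (sketch) for a standalone project around the generated program
scalaVersion := "2.11.12"

// Spark is provided by the cluster at run time
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided"
)

// the DBToaster runtime jars copied into lib/ are picked up by sbt
// automatically as unmanaged dependencies

The generated .scala file then goes under src/main/scala, and sbt package produces the jar you submit to the cluster.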

@ami07
Author

ami07 commented May 30, 2019

Thank you very much, Milos, for your reply.
I am finally able to compile the code. Since the unit tests invoked by run_spark_weak_experiments.sh and run_spark_strong_experiments.sh (which I found in dbtoaster-backend/ddbtoaster/scripts) do not work, I packaged the generated code into a jar file and am using spark-submit to execute it on the cluster.
The issue I am facing right now: from the Spark scripts, I understand that the arguments used when calling the Spark code are: -xvm -xsc -p 2 -s 1 -w 0 -d 500gb -O3 --batch -xbs 10000000 -x -F HEURISTICS-DECOMPOSE-OVER-TABLES
I cannot find where these are explained. The error I am getting right now is related to the paths set in the configuration file.
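For completeness, I am launching the packaged program roughly like this (the master/deploy settings, class name, and jar name are placeholders; the trailing arguments are the ones from the scripts):

spark-submit --master yarn --deploy-mode cluster \
  --class <GeneratedQueryClass> generated-query.jar \
  -xvm -xsc -p 2 -s 1 -w 0 -d 500gb -O3 --batch -xbs 10000000 -x -F HEURISTICS-DECOMPOSE-OVER-TABLES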

I tried to create a conf directory (similar to the one found in dbtoaster-backend/ddbtoaster/spark/conf), add a configuration file with those paths changed, and then package it into the jar. However, it seems that the configuration parameters are set somewhere else.

I thought I would ask before changing the generated code to hard-code where the conf file is located.
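For illustration, what I had in mind is something along these lines (the file path and property keys are placeholders, not the actual spark.config keys):

import java.io.FileInputStream
import java.util.Properties

// read the configuration from an explicit path instead of the copy
// packaged inside the jar (path and keys below are placeholders)
val props = new Properties()
props.load(new FileInputStream("/path/to/spark.config"))
val dataPath   = props.getProperty("data.path")
val outputPath = props.getProperty("output.path")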

@losmi83
Contributor

losmi83 commented May 30, 2019

See ddbtoaster/spark/conf/spark.config for the various Spark configuration parameters.

@ami07
Author

ami07 commented May 30, 2019

Yes, I created a similar configuration file in my Spark project and changed the values of the configuration parameters (the paths) to where my data and outputs should be. However, for some reason, when I execute the jar file I built, it still assumes the default values from spark.config. So I assume these are also set in one of the DBToaster libraries I include when packaging my jar file.
I think I am on the right track, so I will find a way to read the correct conf file.
Thanks

@ami07
Author

ami07 commented Jun 17, 2019

Hi,
I have changed the generated code to read the conf parameters from a custom function, since the one referenced in the DBToaster libraries kept crashing. The Spark code now seems to run; however, the jobs that actually process the data fail.
I am attaching screenshots from the query execution (the query joins 3 TPC-H tables). Any advice about what could be causing the error or what I might be doing wrong? The job details do not tell much, only referring to an empty queue.
Thanks
DBToasterSparkAllJobs
DBToasterSparkFailingJob9
Fq4_spark - Details for Job 9_failed.pdf

@losmi83
Contributor

losmi83 commented Jun 25, 2019

Hard to tell. You could try reshuffling your input so that each partition is non-empty.

@ami07
Author

ami07 commented Jun 25, 2019

Thanks, Milos, for your reply. I understand that the data are just the CSV files generated by the TPC-H data generator. Or am I missing something?
I made that assumption because of this line in the README file in experiments/datasets: "Put your datasets here (e.g., 1GB/lineitem.csv) "

@losmi83
Contributor

losmi83 commented Jun 25, 2019

Yes, the input is the standard TPC-H files. If I remember correctly, the code expects the input to be randomly distributed across all nodes to avoid data skew. This would explain your error -- it might be that your data is stored in one partition while the others are empty. I would suggest calling rdd.repartition after loading your input.
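For example, something roughly like this (the path and partition count are illustrative):

// spread the rows of a TPC-H input file across all executors so that
// no partition ends up empty
val lineitem = sc.textFile("hdfs:///datasets/1GB/lineitem.csv")
                 .repartition(sc.defaultParallelism)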

@ami07
Author

ami07 commented Jun 25, 2019

I am storing the data in HDFS, so it should already be partitioned among the nodes of the cluster. I will try rdd.repartition.
Thanks
