Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge develop into cascading3 #1762

Merged
merged 104 commits into from
Jan 25, 2018
Merged

Conversation

piyushnarang
Copy link
Collaborator

Putting this up to have CI kick off tests so that we can move on to fixing tests and cleaning up stuff I might have messed up as the merge was pretty large.

Joe Nievelt and others added 30 commits May 14, 2015 13:18
Conflicts:
	CHANGES.md
	README.md
	project/Build.scala
	scalding-core/src/main/scala/com/twitter/package.scala
	tutorial/execution-tutorial/ExecutionTutorial.scala
	version.sbt
* Set up repl tutorial

* Move wonderland tutorial under tutorial/
* Clean up temporary files created by forceToDiskExecution

* Try disabling travis caching

* Revert "Try disabling travis caching"

This reverts commit b9a04b6.

* Forward filesToCleanup to parent in cleanCache

* consistency

* synchronize in addFilesToCleanup
…empty-dir

Revert FileSource support for empty directories
…es. (twitter#1627)

* TypedText - Added versions of dailyPrefixSuffix for TSV and CSV sources.

* TypedText - Refactored functions for loading files with different types of delimiters.
…witter#1626)

* Allow supplying better error messages and chained exceptions in InvalidSourceTap / InvalidSourceException

* Add 2ndary constructor to InvalidSourceException
Bump chill version to 0.8.3 and fix started to fail `ConfigTest` because `void` is serializable by default in Kryo now.
…ache_execution

Added API for Execution/Config to work with DistributedCache
…_vkvs

Use JobConf input files list for input size computation used by ReducersEstimators
Using MAP_OUTPUT_BYTES instead of FILE_BYTES_READ in RatioBasedEstimator
…ator

Fixed reduce estimator for paths with a glob pattern
Expand libjars globs in ScaldingShell to match the behavior of Tool.
Piyush Narang and others added 7 commits June 23, 2017 16:07
This PR cherry-picks changes to CHANGES.md file made during 0.17.2 release. Also this PR includes mima check related changes, but disables mima check because it's intended that binary compatibility can be broken (and it is) on develop branch.
…ph-lib

Add summingbird graph library as a zero-dep module
@CLAassistant
Copy link

CLAassistant commented Dec 27, 2017

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@johnynek
Copy link
Collaborator

looks like the hadoop tests has one test failure:

 Buffer(HadoopFlowStep[name: (1/3)], HadoopFlowStep[name: (2/3) output2], HadoopFlowStep[name: (3/3) output]) had size 3 instead of expected size 4 (PlatformTest.scala:654)

Thanks a ton for working on this. Would love to get this merged and move to cascading 3.

@piyushnarang
Copy link
Collaborator Author

Yeah there's an issue in the hadoop test and some failures in the estimator-tests. I think I might have missed something in the merge as there were a bunch of reducer & memory estimator changes that went into develop. I'm working on this on the side so if you're urgently blocked on this, I wouldn't want to hold you up so I can grant push access to my branch (or you can fork this one).

@piyushnarang
Copy link
Collaborator Author

Actually @johnynek looking at the job that fails:

class MergeTwoSinksForceToDiskTypedJob(args: Args) extends Job(args) {
  import TinyJoinAndMergeJob._

  val input1 = TypedPipe.from(joinInput1)
  val input2 = TypedPipe.from(joinInput2)

  val merged = (input1 ++ input2).asKeys.group.size.map { case (k, v) => (k, v.toInt) }

  merged
    .forceToDisk
    .write(output)

  merged
    .write(output2)
}

The check seems to be failing on the fact that we are expecting 4 steps (we get 3 though). Tracking back the original bug - cwensel/cascading#52 it was more regarding the fact that the Cascading3 planner failed on this code path. I'm not sure if the 3 steps is a bad thing?

@johnynek
Copy link
Collaborator

@piyushnarang I agree. I think you can change the test to be <= 4 or == 3.

I think 3 is better right? 1 Map-only job for the second part, and 1 MR job for the forced part, and 1 Map only for the force -> write. Seems like 3 is what I would have expected.

@piyushnarang
Copy link
Collaborator Author

Yeah that's what I was thinking too. Ok, I switched this out. We have some failures in the estimators left and then we should be good to go (famous last words) :-)

@johnynek
Copy link
Collaborator

very exciting.

We are somehow hitting constant problems in planning now as Stripe. Many jobs take hours to plan and some have been stuck in planning for over 36 hours until we killed them.

I am hoping that cascading 3 can fix this issue for us.

@piyushnarang
Copy link
Collaborator Author

hey @johnynek - so our tests are green now. There's a lot of changes here as we've diverged a fair bit. I've taken a pass to sanity check but I might have missed something :-). Looking at the diff against develop (develop...piyushnarang:cascading3) it largely looks like only cascading3 changes there.

@piyushnarang piyushnarang changed the title WIP: Merge develop into cascading3 Merge develop into cascading3 Jan 24, 2018
@johnynek
Copy link
Collaborator

minor issues:

develop...piyushnarang:cascading3#diff-0336b62cdc4bda6c450ac5c6a3e3fc52R273

I don't see why the recursion on the head uses the descriptions but not the tail. Why the change with respect to the current code?

@johnynek
Copy link
Collaborator

I think we should go ahead and merge this after @piyushnarang comments on my above question.

I reviewed the diff. Looks expected. The only issue, I guess is some template taps were removed. I hope that is not a major downstream pain.

Note: this merge is into the twitter:cascading3 branch, not develop.

@pankajroark can someone at Twitter kick-off a test run of a version of scalding against that branch? If it looks good, we should merge. We are really struggling with planner runtime issues at Stripe currently and I am hopeful the new planner will help. We are creating very big jobs (since our data is smaller than yours, people can write much more complex jobs that still complete in reasonable time), so we have hundreds of steps.

@johnynek
Copy link
Collaborator

i'm going to merge this into cascading3 branch and try to fix up with a followup PR changing the build to point to https for more repos.

@johnynek johnynek merged commit a01c942 into twitter:cascading3 Jan 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.