Adding second moment of values per key for Typed-API reduce operations #1279

Open
wants to merge 10 commits into
base: develop

oeddyo
Contributor

@oeddyo oeddyo commented May 7, 2015

proof of concept for #1068

@oeddyo
Contributor Author

oeddyo commented May 7, 2015

DO NOT MERGE YET.

@oeddyo
Contributor Author

oeddyo commented May 16, 2015

@johnynek Hi Oscar. Thanks for walking me through the code today!

I forgot to bring up one problem, which I have noted in the code.

If I do the following:

  var numValuesPerKey = 0L

  val resIter = reduceFnSer.get(key, values)
  while (resIter.hasNext) {
    val tup = Tuple.size(1)
    val t2 = resIter.next

    numValuesPerKey += 1L

    tup.set(0, t2)
    oc.add(tup)
  }
  val valueCountSum = numValuesPerKey
  println("value count = " + numValuesPerKey)

For the test, it prints

    value count = 1
    value count = 1
    value count = 1

The output should actually include "value count = 2" for key 1 (please see the test ReduceValueCounterTest for details).

I have the test in

branch: exie/1068
test-only com.twitter.scalding.ReduceValueCounterTest

It should be easy to reproduce: just uncomment the block and comment out the block under it (Operations.scala, lines 509-524).

@oeddyo
Contributor Author

oeddyo commented May 18, 2015

The cause is most likely that:

  val resIter = reduceFnSer.get(key, caches.toIterator)
  while (resIter.hasNext) {
    val tup = Tuple.size(1)
    val t2 = resIter.next

    tup.set(0, t2)
    oc.add(tup)
  }

is iterating over the reduced result, so the loop runs once per reduced result (that is, per key) rather than once per input value. Unfortunately, that means we can't use a var in this loop to count how many values are associated with each key.
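
To illustrate the point, here is a minimal sketch, assuming a hypothetical reduce function that folds all values for a key into a single result. The names `OutputCountDemo`, `sumReduce` and `countOutputs` are invented for this example and do not come from the PR. A counter incremented inside the output loop only sees what the reduce function emits, not what it consumed:

```scala
object OutputCountDemo {
  // Hypothetical stand-in for the reduce function: consumes every value
  // for the key and emits exactly one result.
  def sumReduce(key: Int, values: Iterator[Int]): Iterator[Int] =
    Iterator.single(values.sum)

  // Mirrors the loop in the comment above: the counter is incremented once
  // per element of the *result* iterator.
  def countOutputs(key: Int, values: Iterator[Int]): Long = {
    var numSeen = 0L
    val resIter = sumReduce(key, values)
    while (resIter.hasNext) {
      resIter.next()
      numSeen += 1L
    }
    numSeen
  }
}
```

With two input values for a key, `OutputCountDemo.countOutputs(1, Iterator(10, 20))` still yields 1, because the reduce function emitted a single result.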

@@ -491,13 +497,23 @@ package com.twitter.scalding {
.asScala
.map(_.getObject(0).asInstanceOf[V])

val caches = values.toList
Collaborator

So, we can't do this (it would force everything into memory), but what about something like this:

class CountingIterator[T](wraps: Iterator[T]) extends Iterator[T] {
  private[this] var nextCalls = 0
  def hasNext = wraps.hasNext
  def next = { nextCalls += 1; wraps.next }
  def seen: Int = nextCalls
}

Then we could wrap values with this and call .seen at the end to see how many values went in?
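
A self-contained sketch of how that wrapper might be used, under some assumptions: `sumReduce` and `CountingIteratorDemo` are hypothetical names made up for this example, and the counter is widened to `Long` here (the snippet above uses `Int`) purely as a precaution for very large groups. Note that `seen` only reflects values the reduce function actually pulled, so it has to be read after the output iterator has been drained.

```scala
// Wraps an iterator and counts how many elements have been pulled through it,
// without buffering anything in memory.
class CountingIterator[T](wraps: Iterator[T]) extends Iterator[T] {
  private[this] var nextCalls = 0L
  def hasNext: Boolean = wraps.hasNext
  def next(): T = { nextCalls += 1L; wraps.next() }
  def seen: Long = nextCalls
}

object CountingIteratorDemo {
  // Hypothetical reduce function: consumes all values, emits one result.
  def sumReduce(key: Int, values: Iterator[Int]): Iterator[Int] =
    Iterator.single(values.sum)

  // Returns the reduced output together with the number of input values seen.
  def reduceAndCount(key: Int, raw: Iterator[Int]): (List[Int], Long) = {
    val counted = new CountingIterator(raw)
    val out = sumReduce(key, counted).toList // drain the output first
    (out, counted.seen)                      // then read the input count
  }
}
```

If the reduce function stops early (for example, it only takes the first value), `seen` reports the number of values consumed rather than the total number available for the key.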

Contributor Author

good point

@oeddyo
Contributor Author

oeddyo commented May 20, 2015

@johnynek Does this look good?


val numValuesPerKey = values.seen

flowProcess.increment(SkewMonitorCounters.KeyCount, SkewMonitorCounters.KeyCount, 1L)
Collaborator

I don't think we want the group and counter to be the same string. We want the group to be something like "scalding debug" and the counter value is fine.
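
A rough sketch of the naming being suggested, using a hypothetical in-memory stand-in rather than the real `flowProcess` object. `InMemoryCounters`, `SkewCounterNames` and the `"KeyCount"` string are assumptions for illustration only; the group string is the one proposed above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a counter sink addressed by (group, counter).
// This is NOT the Cascading FlowProcess API, just an illustration.
class InMemoryCounters {
  private val counts =
    mutable.Map.empty[(String, String), Long].withDefaultValue(0L)

  def increment(group: String, counter: String, amount: Long): Unit =
    counts((group, counter)) += amount

  def get(group: String, counter: String): Long = counts((group, counter))
}

object SkewCounterNames {
  val Group = "scalding debug" // one shared group for all debug counters
  val KeyCount = "KeyCount"    // assumed counter name, distinct from the group
}
```

Keeping a single group constant means every skew-related counter would show up under the same heading, with only the counter name varying.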

@johnynek changed the title from "Proof of concept" to "Adding second moment of values per key for Typed-API reduce operations" on May 21, 2015
@oeddyo
Contributor Author

oeddyo commented Jun 4, 2015

After testing it a few more times, I can confirm it's a bug. Here's how to reproduce it:

Check out the code above and uncomment line 523 in scalding/scalding-core/src/main/scala/com/twitter/scalding/Operations.scala

Then, in sbt, run

test-only com.twitter.scalding.ReduceValueCounterTest

It prints a line (corresponding to the code in CoreTest.scala, line 1837):
PRINTING KEY AND GROUP! 0

But if you use the same group name as the key name, it gives
PRINTING KEY AND GROUP! 3

@CLAassistant

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
