Skip to content

Added test and removed dead code for Sanity Checker dealing with maps with same key #153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 16, 2018

Conversation

leahmcguire
Copy link
Collaborator

Related issues
#151

Describe the proposed solution
Looks like this error was caused by some dead code in sanity checker. I removed this code and added a test for maps with the same keys.

@leahmcguire leahmcguire requested a review from tovbinm as a code owner October 8, 2018 19:55
@leahmcguire leahmcguire requested a review from Jauntbox October 8, 2018 19:55
val features = Seq(id, target, textMap1, textMap2, doubleMap).transmogrify()
val checked = targetResponse.sanityCheck(features)
val output = new OpWorkflow().setResultFeatures(checked).transform(mapDataFrame)
output.show()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. please add some actual checks. perhaps check metadata produced?
  2. please remove output.show()


nullGroups.groupBy(_._1).foreach {
case (group, cols) =>
require(cols.length == 1, s"Vector column $group has multiple null indicator fields: $cols")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are removing this check, also remove this ignored test -

ignore should "Group groupings separately for transformations computed on same feature" in {

@codecov
Copy link

codecov bot commented Oct 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@b1aec92). Click here to learn what that means.
The diff coverage is 95.34%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #153   +/-   ##
=========================================
  Coverage          ?   86.36%           
=========================================
  Files             ?      299           
  Lines             ?     9749           
  Branches          ?      551           
=========================================
  Hits              ?     8420           
  Misses            ?     1329           
  Partials          ?        0
Impacted Files Coverage Δ
...sforce/op/utils/spark/OpVectorColumnMetadata.scala 75.55% <0%> (ø)
...rce/op/stages/impl/preparators/SanityChecker.scala 91.82% <97.61%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1aec92...44c14bb. Read the comment docs.

grouping <- col.grouping
} yield (grouping, (col, col.index))

nullGroups.groupBy(_._1).foreach {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should take out the whole check for multiple null indicators right now?

Afaik, we still have an issue when a single parent feature is vectorized with two different unary stages (eg. if you were to vectorize a real with both the normal RealVectorizer and also bucketize). SanityChecker will still group these into a single contingency matrix since their grouping will just be the parent feature name, so I think we should leave this check in until we fix that issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really think it is better to fail the whole flow when that happens? I would be better to automatically drop one than fail the flow...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even have some incorrect statistics (in my opinion)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we at least keep it in as a warning we log so that we can monitor when it's happening?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jauntbox I fixed it so that it will remove duplicates from the categorical calculations

* Get the feature grouping qualified by the parent feature name
* @return Optional string of feature grouping
*/
def featureGroup(): Option[String] = grouping.map(g => s"${parentFeatureName.mkString}${g}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont you want to do something similar to s"${parentFeatureName.mkString("_")}${grouping.map("_" + _).getOrElse("")}" (same as we do in makeColName() function)

import com.salesforce.op.stages.impl.feature.{HashSpaceStrategy, RealNNVectorizer, SmartTextMapVectorizer}
import com.salesforce.op.test.{OpEstimatorSpec, TestFeatureBuilder, TestSparkContext}
import com.salesforce.op.utils.json.JsonUtils
=======
Copy link
Collaborator

@tovbinm tovbinm Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire merge conflict?

* Get the feature grouping qualified by the parent feature name
* @return Optional string of feature grouping
*/
def featureGroup(): Option[String] = grouping.map(g => s"${parentFeatureName.mkString("_")}_$g")
Copy link
Collaborator

@tovbinm tovbinm Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire repeating my previous comment here - dont you want to do something similar to s"${parentFeatureName.mkString("_")}${grouping.map("_" + _).getOrElse("")}" (same as we do in makeColName() function)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no because before calling that get there is a check on whether this is a None which determines if the categorical stats are computed

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@leahmcguire leahmcguire merged commit a35108b into master Oct 16, 2018
@leahmcguire leahmcguire deleted the lm/sanityFix branch October 16, 2018 17:20
ericwayman pushed a commit that referenced this pull request Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants