-
Notifications
You must be signed in to change notification settings - Fork 397
Added test and removed dead code for Sanity Checker dealing with maps with same key #153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
val features = Seq(id, target, textMap1, textMap2, doubleMap).transmogrify() | ||
val checked = targetResponse.sanityCheck(features) | ||
val output = new OpWorkflow().setResultFeatures(checked).transform(mapDataFrame) | ||
output.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- please add some actual checks. perhaps check metadata produced?
- please remove
output.show()
|
||
nullGroups.groupBy(_._1).foreach { | ||
case (group, cols) => | ||
require(cols.length == 1, s"Vector column $group has multiple null indicator fields: $cols") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are removing this check, also remove this ignored test -
TransmogrifAI/core/src/test/scala/com/salesforce/op/stages/impl/preparators/BadFeatureZooTest.scala
Line 107 in f37113b
ignore should "Group groupings separately for transformations computed on same feature" in { |
Codecov Report
@@ Coverage Diff @@
## master #153 +/- ##
=========================================
Coverage ? 86.36%
=========================================
Files ? 299
Lines ? 9749
Branches ? 551
=========================================
Hits ? 8420
Misses ? 1329
Partials ? 0
Continue to review full report at Codecov.
|
grouping <- col.grouping | ||
} yield (grouping, (col, col.index)) | ||
|
||
nullGroups.groupBy(_._1).foreach { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should take out the whole check for multiple null indicators right now?
Afaik, we still have an issue when a single parent feature is vectorized with two different unary stages (eg. if you were to vectorize a real with both the normal RealVectorizer
and also bucketize). SanityChecker will still group these into a single contingency matrix since their grouping will just be the parent feature name, so I think we should leave this check in until we fix that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really think it is better to fail the whole flow when that happens? I would be better to automatically drop one than fail the flow...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even have some incorrect statistics (in my opinion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we at least keep it in as a warning we log so that we can monitor when it's happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jauntbox I fixed it so that it will remove duplicates from the categorical calculations
* Get the feature grouping qualified by the parent feature name | ||
* @return Optional string of feature grouping | ||
*/ | ||
def featureGroup(): Option[String] = grouping.map(g => s"${parentFeatureName.mkString}${g}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont you want to do something similar to s"${parentFeatureName.mkString("_")}${grouping.map("_" + _).getOrElse("")}"
(same as we do in makeColName()
function)
import com.salesforce.op.stages.impl.feature.{HashSpaceStrategy, RealNNVectorizer, SmartTextMapVectorizer} | ||
import com.salesforce.op.test.{OpEstimatorSpec, TestFeatureBuilder, TestSparkContext} | ||
import com.salesforce.op.utils.json.JsonUtils | ||
======= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire merge conflict?
* Get the feature grouping qualified by the parent feature name | ||
* @return Optional string of feature grouping | ||
*/ | ||
def featureGroup(): Option[String] = grouping.map(g => s"${parentFeatureName.mkString("_")}_$g") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire repeating my previous comment here - dont you want to do something similar to s"${parentFeatureName.mkString("_")}${grouping.map("_" + _).getOrElse("")}"
(same as we do in makeColName()
function)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially because you do call .get
here - https://github.com/salesforce/TransmogrifAI/pull/153/files#diff-55c6247553ff142cb1657c1bac3c728fR459
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no because before calling that get there is a check on whether this is a None which determines if the categorical stats are computed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Related issues
#151
Describe the proposed solution
Looks like this error was caused by some dead code in sanity checker. I removed this code and added a test for maps with the same keys.