Commit 5158d97

Updated Readme with additional context/details
1 parent cf26228 commit 5158d97

1 file changed

+80
-11
lines changed

README.md

@@ -1,17 +1,86 @@
1-
jaglion
1+
Data Anonymizing in Hadoop using Pig
22
=======
3+
Jaglion provides data anonymizing capabilities as Pig UDFs. They are useful when PII data from an on-premise Hadoop cluster needs to be moved to a cloud-based service for processing (e.g. machine learning, Monte Carlo modelling, ...). One possible approach would be to simply remove all PII-related data from the dataset. However, this is not feasible for machine learning, because the predictive models and their predictions relate to individual customers.
34

4-
Tools for doing hybrid cloud hadoop jobs with Azure and Cloudera.
5+
Jaglion instead lets us embed anonymizing and de-anonymizing in an on-premise Pig data transformation job, ensuring that no PII goes off-premise.
6+
Here is an example that anonymizes data (in this case the ownerId) as part of a Pig script:
57

6-
How to build the User Defined Function (UDF) Jar
8+
A = LOAD 'data/xyz_device' using PigStorage(';') AS (
9+
ownerId: chararray,
10+
specNr: chararray,
11+
senderId: chararray,
12+
deviceData: chararray);
13+
B = FOREACH A GENERATE jaglion.ANONYMIZE(ownerId), specNr, senderId, deviceData;
14+
jaglion.WASBSTORE B INTO 'data/devices/anonym/xyz_device' USING PigStorage (';');
715

8-
# Compile
9-
javac -classpath "\`hbase classpath\`/usr/lib/pig/pig.jar:/usr/lib/hadoop/lib/commons-code-1.4.jar" ANONYMIZE.java DEANONYMIZE.java
16+
And here is an example that de-anonymizes data:
1017

11-
# Jar
12-
mkdir jaglion
13-
mv ANONYMIZE.class DEANONYMIZE.class jaglion
14-
jar -cf ../bin/jaglion.jar jaglion
18+
A = jaglion.WASBLOAD 'data/results/anonym/xyz_result' using PigStorage(';') AS (
19+
ownerId: chararray,
20+
resultData: chararray);
21+
B = FOREACH A GENERATE jaglion.DEANONYMIZE(ownerId), resultData;
22+
STORE B INTO 'data/xyz_result' USING PigStorage (';');
1523

16-
# Execute local test
17-
java -classpath "\`hbase classpath\`/usr/lib/pig/pig.jar:/usr/lib/hadoop/lib/commons-code-1.4.jar" org.apache.pig.Main -x local test.pig
24+
The anonymizing function uses HBase as the persistent key/value store to store and retrieve the PII data correlation. This HBase instance is part of the on-premise Hadoop cluster.
25+
26+
There are two correlation modes for different levels of privacy:
27+
28+
**High privacy mode**
29+
30+
Each call to the ANONYMIZE function returns a unique id. Even if the customerId is the same, two calls return two different anonymized ids. This mode ensures that off-premise data can’t be correlated across the anonymized dimension.
31+
32+
ANONYMIZE(customerId) != ANONYMIZE(customerId)
33+
34+
High privacy mode can be used for independent jobs such as pattern recognition, large-scale data transformation, or model calculations. However, it is not useful for jobs that gain insight by correlating multiple dependent data items (such as machine learning).
35+
36+
**Medium privacy mode**
37+
38+
The ANONYMIZE function always returns the same anonymized id for the same input. This mode allows off-premise correlation across identical anonymized ids.
39+
40+
ANONYMIZE(customerId) == ANONYMIZE(customerId)
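
The difference between the two modes can be illustrated with a small in-memory sketch. It is not the Jaglion implementation: plain `HashMap`s stand in for the on-premise HBase table, random UUIDs stand in for whatever id scheme the UDF actually uses, and the class and method names (`CorrelationModesSketch`, `anonymizeHigh`, `anonymizeMedium`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** In-memory illustration of the two correlation modes (assumption: not the Jaglion code). */
final class CorrelationModesSketch {
    // Stand-in for the on-premise HBase table: anonymized id -> original value.
    private final Map<String, String> idToOriginal = new HashMap<>();
    // Forward index used only by medium privacy mode: original value -> anonymized id.
    private final Map<String, String> originalToId = new HashMap<>();

    /** High privacy: every call mints a fresh id, even for a repeated input. */
    String anonymizeHigh(String value) {
        String id = UUID.randomUUID().toString();
        idToOriginal.put(id, value);
        return id;
    }

    /** Medium privacy: the first call mints an id, later calls reuse it. */
    String anonymizeMedium(String value) {
        return originalToId.computeIfAbsent(value, v -> {
            String id = UUID.randomUUID().toString();
            idToOriginal.put(id, v);
            return id;
        });
    }

    public static void main(String[] args) {
        CorrelationModesSketch store = new CorrelationModesSketch();
        String customerId = "customer-42"; // made-up sample value
        System.out.println("high privacy ids equal:   "
                + store.anonymizeHigh(customerId).equals(store.anonymizeHigh(customerId)));
        System.out.println("medium privacy ids equal: "
                + store.anonymizeMedium(customerId).equals(store.anonymizeMedium(customerId)));
    }
}
```

In both modes the sketch records the id-to-original mapping, which is what a later de-anonymizing step would need. Only medium privacy mode keeps the forward index that makes repeated inputs map to the same id.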
41+
42+
## How to build the User Defined Function (UDF) Jar
43+
44+
**Compile**
45+
46+
`javac -classpath "\`hbase classpath\`/usr/lib/pig/pig.jar:/usr/lib/hadoop/lib/commons-code-1.4.jar" ANONYMIZE.java DEANONYMIZE.java`
47+
48+
**Jar**
49+
50+
mkdir jaglion
51+
mv ANONYMIZE.class DEANONYMIZE.class jaglion
52+
jar -cf ../bin/jaglion.jar jaglion
53+
54+
## How to use the UDFs in Pig
55+
To use the UDFs, the Java package and its dependencies need to be registered first:
56+
57+
REGISTER /usr/lib/zookeeper/zookeeper.jar;
58+
REGISTER /usr/lib/hbase/hbase-client.jar;
59+
REGISTER /usr/lib/hbase/hbase-common.jar;
60+
REGISTER /usr/lib/hbase/hbase-protocol.jar;
61+
REGISTER /usr/lib/hbase/hbase-hadoop-compat.jar;
62+
REGISTER /usr/lib/hbase/lib/htrace-core.jar;
63+
64+
REGISTER bin/jaglion.jar;
65+
66+
The two UDFs can now be used within Pig. Assume the data was loaded into A with a statement similar to:
67+
68+
A = LOAD 'testdata';
69+
Once A is loaded, the statement below anonymizes the first column in A using medium privacy (the same value will always generate the same anonymized value):
70+
71+
B = FOREACH A GENERATE jaglion.ANONYMIZE($0, 0);
72+
73+
The following statement uses high privacy and generates a unique anonymized value for each value in A, regardless of whether the source values are the same:
74+
75+
C = FOREACH A GENERATE jaglion.ANONYMIZE($0, 1);
76+
To de-anonymize, we simply call the DEANONYMIZE function:
77+
78+
D = FOREACH B GENERATE jaglion.DEANONYMIZE($0);
79+
E = FOREACH C GENERATE jaglion.DEANONYMIZE($0);
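
To make the expected behaviour of the relations A through E concrete, here is a hedged, self-contained sketch of the round trip in plain Java. It assumes the second argument means 0 for medium and 1 for high privacy, as in the statements above. Everything else is hypothetical: the class `RoundTripSketch`, its helper methods, the in-memory maps that replace HBase, and the sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** In-memory round trip mirroring the Pig statements (assumption: not the Jaglion code). */
final class RoundTripSketch {
    static final int MEDIUM = 0; // same input -> same id, as in ANONYMIZE($0, 0)
    static final int HIGH = 1;   // fresh id on every call, as in ANONYMIZE($0, 1)

    // Stand-ins for the HBase lookups; a real deployment keeps these on premise.
    private final Map<String, String> idToOriginal = new HashMap<>();
    private final Map<String, String> originalToId = new HashMap<>();

    String anonymize(String value, int mode) {
        if (mode == MEDIUM && originalToId.containsKey(value)) {
            return originalToId.get(value); // reuse the id issued earlier
        }
        String id = UUID.randomUUID().toString();
        idToOriginal.put(id, value);
        if (mode == MEDIUM) {
            originalToId.put(value, id);
        }
        return id;
    }

    /** Reverse lookup; returns null for an id this store never issued. */
    String deanonymize(String id) {
        return idToOriginal.get(id);
    }

    /** Applies anonymize to each row, like FOREACH ... GENERATE over one column. */
    List<String> anonymizeAll(List<String> column, int mode) {
        List<String> out = new ArrayList<>();
        for (String value : column) {
            out.add(anonymize(value, mode));
        }
        return out;
    }

    /** Applies deanonymize to each row. */
    List<String> deanonymizeAll(List<String> ids) {
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            out.add(deanonymize(id));
        }
        return out;
    }

    public static void main(String[] args) {
        RoundTripSketch store = new RoundTripSketch();
        List<String> a = Arrays.asList("alice", "bob", "alice"); // made-up sample column
        List<String> b = store.anonymizeAll(a, MEDIUM);
        List<String> c = store.anonymizeAll(a, HIGH);
        System.out.println("B repeats id for repeated value: " + b.get(0).equals(b.get(2)));
        System.out.println("C repeats id for repeated value: " + c.get(0).equals(c.get(2)));
        System.out.println("D restores A: " + store.deanonymizeAll(b).equals(a));
        System.out.println("E restores A: " + store.deanonymizeAll(c).equals(a));
    }
}
```

Under these assumptions, both D and E recover the original column, while only B carries the same id for the two identical source values.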
80+
81+
## Execute local test
82+
1. Copy the file "testdata" to HDFS
83+
2. Copy "jaglion.jar" into the directory referenced by the REGISTER statement in the Pig script (e.g. bin)
84+
3. Run the local test (specifying the "hbase classpath")
85+
86+
java -classpath "\`hbase classpath\`/usr/lib/pig/pig.jar:/usr/lib/hadoop/lib/commons-code-1.4.jar" org.apache.pig.Main -x local test.pig
