Commit b3c3831

v1.1.8
- Accept command-line argument
- Change name to "excel_anonymizer.py"
- Publish to PyPI
- Other small improvements
1 parent 24a9b56 commit b3c3831

5 files changed, +212 -113 lines changed

.gitignore

+4 -1
@@ -152,4 +152,7 @@ cython_debug/
 #.idea/
 
 # Anonymized Excel Output
-anonymized_personal_information.xlsx
+personal_information-anonymized.xlsx
+
+# My Upload to PyPI Shortcut
+upload.bat

Anonymize_Excel.py

-96
This file was deleted.

README.md

+11 -16
@@ -1,11 +1,11 @@
-# Anonymize_Excel
+# Excel Anonymizer
 A Python script that anonymizes an Excel file and synthesizes new data in its place.
 
 ![Excel_Anonymized_Demo](https://github.com/Welding-Torch/Anonymize_Excel/assets/46340124/78b03e03-bad0-4cb0-9b84-46e3197e9344)
 _Convert your sheets with sensitive data into anonymized data._
 
-## What is Anonymize_Excel.py
-Anonymize_Excel.py is a python script that helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization for private entities in text such as credit card numbers, names, locations, phone numbers, email address, date/time, with more entities to come.
+## What is Excel Anonymizer
+Excel Anonymizer is a python script that helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization for private entities in text such as credit card numbers, names, locations, phone numbers, email address, date/time, with more entities to come.
 
 ## Use case
 Data anonymization is crucial because it helps protect privacy and maintain confidentiality. If data is not anonymized, sensitive information such as names, addresses, contact numbers, or other identifiers linked to specific individuals could potentially be learned and misused. Hence, by obscuring or removing this personally identifiable information (PII), data can be used freely without compromising individuals’ privacy rights or breaching data protection laws and regulations.
@@ -15,31 +15,26 @@ Anonymization consists of two steps:
 1. Identification: Identify all data fields that contain personally identifiable information (PII).
 2. Replacement: Replace all PIIs with pseudo values that do not reveal any personal information about the individual but can be used for reference.
 
-Anonymize_Excel.py uses Microsoft Presidio together with Faker framework for anonymization purposes.
+Excel Anonymizer uses Microsoft Presidio together with Faker framework for anonymization purposes.
 
 ## Quickstart
-1. Clone the repository
+1. Install Excel Anonymizer
 ```
-git clone https://github.com/Welding-Torch/Anonymize_Excel.git
+pip install excel-anonymizer
 ```
+> Note: Spacy will install a Natural Language Processing package on the first run (587.7MB).
 
-2. Install the requirements
+2. Download personal_information.xlsx from this repository, and then type
 ```
-pip install presidio_analyzer
-pip install presidio_anonymizer
-python -m spacy download en_core_web_lg
-```
-3. Run the demo
-```
-python Anonymize_Excel.py
+excel-anon personal_information.xlsx
 ```
 
 That's it!
 
 ## Usage
-To use Anonymize_Excel.py with your Excel file, modify line 8 in the program.
+To use Excel Anonymizer with your Excel file, simply input the file.
 ```
-df = pd.read_excel("your_excel_sheet_here.xlsx")
+excel-anon your_excel_file_here.xlsx
 ```
 
 ## Author
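
The README's two anonymization steps (identification, then replacement) can be sketched without any third-party dependency. This is only an illustration of the flow, not the tool's implementation: the regular expressions in `PATTERNS` are hypothetical stand-ins for Presidio's recognizers, and `placeholder` is a hypothetical stand-in for the Faker-generated values. Overlapping matches are not handled.

```python
import re

# Hypothetical stand-in recognizers; the real tool delegates this step to Presidio.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def identify(text):
    """Step 1: return sorted (start, end, entity_type) spans for every match."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def replace(text, spans, make_fake):
    """Step 2: substitute each span, right to left so earlier offsets stay valid."""
    for start, end, entity_type in reversed(spans):
        text = text[:start] + make_fake(entity_type) + text[end:]
    return text


def placeholder(entity_type):
    """Stand-in for a Faker call: emit a tag instead of synthetic data."""
    return f"<{entity_type}>"


original = "Contact jane@example.com or 555-123-4567."
anonymized = replace(original, identify(original), placeholder)
print(anonymized)  # Contact <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, which matters because the replacement values rarely have the same length as the originals.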

excel_anonymizer.py

+142
@@ -0,0 +1,142 @@
+'''
+Filename: excel_anonymizer.py
+Author: Siddharth Bhatia
+'''
+
+import argparse
+import logging
+import logging.config
+
+import pandas as pd
+from presidio_analyzer import AnalyzerEngine
+from presidio_anonymizer import AnonymizerEngine
+from presidio_anonymizer.entities.engine import OperatorConfig
+from faker import Faker
+
+def main():
+    """Just a main function needed to publish this to PyPI"""
+
+    # Disable loggers from all imported modules
+    logging.config.dictConfig({
+        'version': 1,
+        'disable_existing_loggers': True,
+    })
+
+    # Initialize parser
+    parser = argparse.ArgumentParser(
+        prog='excel_anonymizer.py',
+        description='Anonymizes an Excel file and \
+        synthesizes new data in its place.',
+        epilog='Made by Siddharth Bhatia')
+
+    # Take file as input
+    parser.add_argument('filename', help="your excel file here")
+    parser.add_argument('-v', '--verbose',
+                        action='store_true')
+
+    # Read arguments from command line
+    args = parser.parse_args()
+
+    filename = args.filename
+
+    if args.verbose is True:
+        logging.basicConfig(format="%(message)s", level=logging.INFO)
+        logging.info("Verbose output.")
+
+    def log(string):
+        """Make function for logging."""
+        if args.verbose is True:
+            logging.info(string)
+
+    df = pd.read_excel(f"{filename}")
+    log(df)
+    log("")
+
+    # Column values to list, which I will use at the end
+    columns_ordered_list = df.columns.values.tolist()
+    log(f"Columns: {columns_ordered_list}")
+    log("")
+
+    # Initialize an empty dictionary to store cell locations and values
+    cell_data = {}
+
+    # Iterate over every cell
+    for index, row in df.iterrows():
+        for column in df.columns:
+            cell_value = row[column]
+            cell_location = (index, column)
+            cell_data[cell_location] = cell_value
+
+    # log the list of cell values
+    log(f"Cell Data: {cell_data}")
+    log("")
+    log("###")
+
+    # Presidio code begins here
+    analyzer = AnalyzerEngine()
+    anonymizer = AnonymizerEngine()
+
+    # Faker code begins here
+    fake = Faker()
+
+    # Faker Custom Operators
+    fake_operators = {
+        "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
+        "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
+        "LOCATION": OperatorConfig("custom", {"lambda": lambda x: str(fake.country())}),
+        "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
+        "DATE_TIME": OperatorConfig("custom", {"lambda": lambda x: str(fake.date_time())}),
+        "CREDIT_CARD": OperatorConfig("custom", {"lambda": lambda x: fake.credit_card_number()}),
+        "US_BANK_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.credit_card_number()}),
+        #"DEFAULT": OperatorConfig(operator_name="mask",
+        #                          params={'chars_to_mask': 10,
+        #                                  'masking_char': '*',
+        #                                  'from_end': False}),
+    }
+
+    fake = Faker(locale="en_IN")
+
+    for location, entity in cell_data.items():
+        # log every cell with its location
+        # log(cell, cell_data[cell])
+        log(entity)
+
+        # Analyze + anonymize it
+        analyzer_results = analyzer.analyze(text=str(entity), language="en")
+        log(analyzer_results)
+
+        anonymized_results = anonymizer.anonymize(
+            text=str(entity),
+            analyzer_results=analyzer_results,
+            operators=fake_operators,
+        )
+
+        log(f"text: {anonymized_results.text}")
+        log("")
+        # then return it to the dictionary
+        cell_data[location] = anonymized_results.text
+        log("---")
+
+    # log(cell_data)
+    # OUTPUT: {(0, 'Name'): '<PERSON>', (0, 'Phone Number'): '<PHONE_NUMBER>',
+    # (1, 'Name'): '<PERSON>', (1, 'Phone Number'): '<PHONE_NUMBER>'}
+
+    data = {}
+    columns = list(set(column for _, column in cell_data))
+    for (index, column), value in cell_data.items():
+        data.setdefault(index, [None] * len(columns))
+        data[index][columns_ordered_list.index(column)] = value
+    anonymized_df = pd.DataFrame.from_dict(data, columns=columns_ordered_list, orient="index")
+    log(anonymized_df)
+
+    filename = filename[:-len(".xlsx")] if filename.endswith(".xlsx") else filename
+    anonymized_df.to_excel(
+        f"{filename}-anonymized.xlsx",
+        # Don't save the auto-generated numeric index
+        index=False
+    )
+
+    print(f"Output generated: {filename}-anonymized.xlsx")
+
+if __name__ == "__main__":
+    main()
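
When deriving the output name from the input name, it is worth remembering that `str.rstrip` removes a set of trailing characters rather than a literal suffix, so an input such as `sales.xlsx` loses more than its extension. A minimal sketch of a suffix-safe alternative, assuming `pathlib` is acceptable here; the helper `anonymized_name` is hypothetical and not part of this commit. Unlike a check for `.xlsx` specifically, `Path.stem` drops whatever the final extension is.

```python
from pathlib import Path


def anonymized_name(filename):
    """Return '<stem>-anonymized.xlsx' in the same directory as the input file."""
    path = Path(filename)
    return str(path.with_name(f"{path.stem}-anonymized.xlsx"))


# str.rstrip treats its argument as a set of characters, not a suffix:
stripped = "sales.xlsx".rstrip(".xlsx")
print(stripped)                                      # sale
print(anonymized_name("sales.xlsx"))                 # sales-anonymized.xlsx
print(anonymized_name("personal_information.xlsx"))  # personal_information-anonymized.xlsx
```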

pyproject.toml

+55
@@ -0,0 +1,55 @@
+[build-system]
+requires = ["setuptools>=61.2.0", "wheel", "setuptools_scm[toml]>=3.4.3"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "excel_anonymizer"
+authors = [{name = "Siddharth Bhatia"}]
+description = "Anonymizes an Excel file and synthesizes new data in its place"
+readme = "README.md"
+classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Environment :: Console",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Education",
+    "Intended Audience :: End Users/Desktop",
+    "Intended Audience :: Information Technology",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+    "Operating System :: Unix",
+    "Operating System :: POSIX :: Linux",
+    "Operating System :: MacOS :: MacOS X",
+    "Operating System :: Microsoft :: Windows",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Topic :: Office/Business",
+    "Topic :: Utilities",
+    "Topic :: Office/Business :: Financial :: Spreadsheet",
+]
+dependencies = [
+    "presidio_analyzer",
+    "presidio_anonymizer",
+    "pandas",
+    "pyarrow",
+    "faker",
+    "openpyxl",
+    "en_core_web_lg",
+]
+
+#dynamic = ["version"]
+version = "1.1.7"
+
+[project.scripts]
+excel-anonymizer = "excel_anonymizer:main"
+excel-anon = "excel_anonymizer:main"
+
+[tool.setuptools]
+py-modules = ["excel_anonymizer"]
+include-package-data = false
+
+[tool.setuptools_scm]
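
The `[project.scripts]` table maps each command name to a `module:function` target, and the generated launcher calls that function with no arguments. A common way to keep such an entry point testable is to let it accept an optional argument list. The sketch below assumes that pattern; `build_parser` and the `argv` parameter are hypothetical and do not appear in this commit, and the body only echoes the parsed values instead of anonymizing anything.

```python
import argparse


def build_parser():
    """Mirror the two arguments the commit's CLI accepts."""
    parser = argparse.ArgumentParser(
        prog="excel-anon",
        description="Anonymizes an Excel file and synthesizes new data in its place.",
    )
    parser.add_argument("filename", help="your excel file here")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


def main(argv=None):
    """Entry point; with argv=None argparse falls back to sys.argv[1:]."""
    args = build_parser().parse_args(argv)
    # Real work would happen here; this sketch only reports what was parsed.
    print(f"file={args.filename} verbose={args.verbose}")
    return 0


# Calling the entry point directly, as a test might:
exit_code = main(["personal_information.xlsx", "-v"])
```

Because the launcher passes nothing, `argv` stays `None` in normal use and the command-line behaviour is unchanged.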
