
Commit 1499a14

Merge pull request #103 from UAL-RE/102-feature-check-if-bag-exist-in-ap-trust-prior-to-bagging
Feat: Check if item version is already preserved before bagging (Issue #102)
2 parents: 2813c16 + 937fd45

File tree: 7 files changed, +580 / -45 lines

.env.sample.ini (9 additions, 0 deletions)

@@ -5,6 +5,15 @@ retries = 3
 retries_wait = 10
 institution = 1077
 
+[aptrust_api]
+url =
+user =
+token =
+items_per_page =
+alt_identifier_starts_with =
+retries = 3
+retries_wait = 10
 [system]
 preservation_storage_location =
 logs_location =
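
For reference, a filled-in [aptrust_api] section might look like the following. All values are placeholders, not a real endpoint or credentials; the url in particular should come from your APTrust member account documentation.

[aptrust_api]
; placeholder member API endpoint, including the version segment
url = https://repo.aptrust.org/member-api/v3
; placeholder account email and secret token
user = preservation@example.edu
token = 0000000000000000000000000000000000000000
items_per_page = 100
; prefix used when building alternate identifiers for lookups
alt_identifier_starts_with = figshare
retries = 3
retries_wait = 10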

Config.py (3 additions, 0 deletions)

@@ -9,6 +9,9 @@ def __init__(self, fileName):
     def figshare_config(self):
         return self.config['figshare_api']
 
+    def aptrust_config(self):
+        return self.config['aptrust_api']
+
     def system_config(self):
         return self.config['system']
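
A minimal usage sketch for the new accessor, assuming Config wraps Python's configparser and is constructed with the path to your env file (the constructor signature __init__(self, fileName) is visible in the hunk header; the file name '.env.ini' is just whatever you named your copy):

from Config import Config

# Load the environment file created from .env.sample.ini.
config_obj = Config('.env.ini')

# New in this commit: returns the [aptrust_api] section as a mapping.
aptrust = config_obj.aptrust_config()
print(aptrust['url'], aptrust['retries'])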

README.md (15 additions, 6 deletions)

@@ -22,14 +22,23 @@ ReBACH is run via the command line as outlined in the 'How to Run' section of th
 ## How to run:
 - Copy the .env.sample.ini file and give it a name of your choice (e.g. .env.ini).
 - Fill out the .env.ini file (IMPORTANT: Make sure not to commit this file to Github)
-  - url - required: The figshare API url
-  - token - required: Your auth token to your organization's API
-  - retries - required: Number of times the script should retry API or file system calls if it is unable to connect. Defaults to 3
-  - retries_wait - required: Number of seconds the script should wait between call retries if it is unable to connect. Defaults to 10
-  - institution - required: The Figshare Institution ID for your organization
+  - figshare_api
+    - url - required: The figshare API url
+    - token - required: Your auth token to your organization's API
+    - retries - required: Number of times the script should retry API or file system calls if it is unable to connect. Defaults to 3
+    - retries_wait - required: Number of seconds the script should wait between call retries if it is unable to connect. Defaults to 10
+    - institution - required: The Figshare Institution ID for your organization
+  - aptrust_api
+    - url - required: The AP Trust member API url, including the version
+    - user - required: Your user email address on AP Trust
+    - token - required: Your user secret token on AP Trust
+    - items_per_page - Maximum number of objects to be returned per page by the API
+    - alt_identifier_starts_with - Prefix for the alternate identifier in AP Trust
+    - retries - required: Number of times the script should retry API or file system calls if it is unable to connect. Defaults to 3
+    - retries_wait - required: Number of seconds the script should wait between call retries if it is unable to connect. Defaults to 10
   - preservation_storage_location - required: The file system location where the preservation folders/packages should be created
   - logs_location - required: The file system location where logs should be created. This value will override the one in `bagger/config/default.toml` when bagger is used for post-processing (see post_process_script_command setting below).
-  - additional_precentage_required - required: How much extra space the preservation storage location should have in order to handle files as a percent. This percent is applied to the total storage needed for all files. I.e. if the value of this field is 10 and the amount of storage needed for files is 1 GB, the script will make sure that the preservation storage location has at least 1.1 GB free. Defaults to 10
+  - additional_percentage_required - required: How much extra space the preservation storage location should have in order to handle files as a percent. This percent is applied to the total storage needed for all files. I.e. if the value of this field is 10 and the amount of storage needed for files is 1 GB, the script will make sure that the preservation storage location has at least 1.1 GB free. Defaults to 10
   - pre_process_script_command - optional: The terminal command (including arguments) to invoke a script to be run BEFORE the files are copied and logic applied to the preservation storage (note: this action is not currently implemented)
   - post_process_script_command - required: Specifies the method of performing post-processing steps. This can take only two values: the string 'Bagger', or the path to an external script. If the value is set to 'Bagger', the post-processing steps will consist of running the internal `bagger` module. If the value is set to a path to an external script, the post-processing steps will be executed by invoking the external script through the function 'post_process_script_function'. The post-processing steps are executed AFTER the files are copied and logic applied to the preservation storage.
   - curation_storage_location - required: The file system location where the Curation files reside
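
To show how these aptrust_api settings fit together, here is a rough sketch of the kind of preservation check this feature performs against the AP Trust member API. The /objects path, the alt_identifier query parameter, and the X-Pharos-API-User / X-Pharos-API-Key header names are assumptions based on AP Trust API conventions, not code taken from this commit:

import requests

def is_version_preserved(aptrust_conf, bag_name):
    """Hypothetical helper: return True if an object whose alternate
    identifier matches the configured prefix plus bag_name already
    exists in AP Trust."""
    # Header names are an assumption; verify them against the AP Trust
    # member API documentation for your registry version.
    headers = {
        'X-Pharos-API-User': aptrust_conf['user'],
        'X-Pharos-API-Key': aptrust_conf['token'],
    }
    params = {
        # The query parameter name is an assumption.
        'alt_identifier': f"{aptrust_conf['alt_identifier_starts_with']}{bag_name}",
        'per_page': aptrust_conf['items_per_page'],
    }
    response = requests.get(f"{aptrust_conf['url']}/objects",
                            headers=headers, params=params, timeout=30)
    response.raise_for_status()
    # Registry-style APIs typically return paged results under 'results'.
    return len(response.json().get('results', [])) > 0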

app.py (76 additions, 25 deletions)

@@ -142,15 +142,17 @@ def main():
     get_args()
     config, log = main()
 
-    log.write_log_in_file('info',
-                          "Fetching articles...",
-                          True)
+    log.write_log_in_file('info', " ", True)
+    log.write_log_in_file('info', "------- Fetching articles -------", True)
     article_obj = Article(config, log, args.ids)
-    article_data = article_obj.get_articles()
+    article_data, already_preserved_counts_dict = article_obj.get_articles()
 
+    already_preserved_articles_count = len(already_preserved_counts_dict['already_preserved_article_ids'])
+    already_preserved_versions_count = already_preserved_counts_dict['already_preserved_versions']
     published_articles_count = 0
     published_articles_versions_count = 0
     published_unpublished_count = 0
+
     for i, (k, v) in enumerate(article_data.items()):
         published_unpublished_count += 1
         if len(v) > 0:

@@ -159,14 +161,12 @@ def main():
 
     log.write_log_in_file('info', "Fetched: "
                           + f"Total articles: {published_unpublished_count}, "
-                          + f"Published articles: {published_articles_count}, "
-                          + f"Published article versions: {published_articles_versions_count}",
+                          + f"Published articles: {published_articles_count + already_preserved_articles_count}, "
+                          + f"Published article versions: {published_articles_versions_count + already_preserved_versions_count}",
                           True)
     print(" ")
 
-    log.write_log_in_file('info',
-                          "Fetching collections...",
-                          True)
+    log.write_log_in_file('info', "------- Fetching collections -------", True)
     collection_obj = Collection(config, log, args.ids)
     collection_data = collection_obj.get_collections()
 

@@ -181,52 +181,103 @@ def main():
     print(" ")
 
     # Start articles processing after completing fetching data from API
-    processed_articles_versions_count = article_obj.process_articles(article_data)
+    processed_articles_versions_count, ap_trust_preserved_article_version_count, wasabi_preserved_versions \
+        = article_obj.process_articles(article_data)
 
     # Start collections processing after completing fetching data from API and articles processing.
-    processed_collections_versions_count = collection_obj.process_collections(collection_data)
+    processed_collections_versions_count, already_preserved_collections_counts = collection_obj.process_collections(collection_data)
+    already_preserved_collections = len(already_preserved_collections_counts['already_preserved_collection_ids'])
+    already_preserved_collection_versions = already_preserved_collections_counts['already_preserved_versions']
+    preserved_collection_versions_in_wasabi = already_preserved_collections_counts['wasabi_preserved_versions']
+    preserved_collection_versions_in_ap_trust = already_preserved_collections_counts['ap_trust_preserved_versions']
+
+    log.write_log_in_file('info', ' ', True)
+    log.write_log_in_file('info', '------- Summary -------', True)
+    log.write_log_in_file('info',
+                          f"Total articles: \t\t\t\t\t\t\t\t\t{published_unpublished_count}",
+                          True)
 
-    log.write_log_in_file('info', '------- Summary -------')
     log.write_log_in_file('info',
-                          "Total articles/published articles: \t\t\t\t\t\t"
-                          + f'{published_unpublished_count} / {published_articles_count}',
+                          "Total published articles/article versions: \t\t\t\t\t"
+                          + f'{published_articles_count + already_preserved_articles_count} / '
+                          + f'{published_articles_versions_count + already_preserved_versions_count}',
                           True)
+
     log.write_log_in_file('info',
-                          "Total processed articles bags already in preservation storage: \t\t\t"
-                          + f'{article_obj.processor.duplicate_bag_in_preservation_storage_count}',
+                          "Total count of already preserved (skipped) articles / article versions: \t\t"
+                          + f'{already_preserved_articles_count} / {already_preserved_versions_count}',
                           True)
+
+    if article_obj.processor.duplicate_bag_in_preservation_storage_count > 0:
+        log.write_log_in_file('warning',
+                              f'Bagger found {article_obj.processor.duplicate_bag_in_preservation_storage_count} duplicate article(s)',
+                              True)
+
     log.write_log_in_file('info',
-                          "Total articles versions matched/published: \t\t\t\t\t"  # todo: exclude already-preserved bags from processing
+                          "Total articles versions matched/published (unskipped): \t\t\t\t"
                           + f'{article_obj.no_matched} / {published_articles_versions_count}',
                           True)
     log.write_log_in_file('info',
                           "Total articles versions processed/matched: \t\t\t\t\t"
                           + f'{processed_articles_versions_count} / {article_obj.no_matched}',
                           True)
+    log.write_log_in_file('info',
+                          "Total count of already preserved article versions in preservation final remote storage: \t\t"
+                          + f'{ap_trust_preserved_article_version_count}',
+                          True)
+    log.write_log_in_file('info',
+                          "Total count of already preserved article versions in preservation staging remote storage: \t"
+                          + f'{wasabi_preserved_versions}',
+                          True)
+
     log.write_log_in_file('info',
                           "Total articles versions unmatched (published-matched): \t\t\t\t"
                           + f'{article_obj.no_unmatched}',
                           True)
     log.write_log_in_file('info',
-                          "Total processed articles bags successfully preserved \t\t\t\t"
+                          "Total processed articles bags successfully preserved: \t\t\t\t"
                           + f'{article_obj.processor.bag_preserved_count}',
                           True)
+
+    log.write_log_in_file('info', "", True)
+    log.write_log_in_file('info',
+                          "Total collections: \t\t\t\t\t\t\t\t"
+                          + f'{collections_count}',
+                          True)
+    log.write_log_in_file('info',
+                          "Total published collections / collection versions: \t\t\t\t"
+                          + f'{collections_count} / {collections_versions_count}',
+                          True)
+
     log.write_log_in_file('info',
-                          "Total collections/published collections: \t\t\t\t\t\t"
-                          + f'{collections_count} / {collections_count}',
+                          "Total count of already preserved (skipped) collections / collection versions: \t"
+                          + f'{already_preserved_collections} / {already_preserved_collection_versions}',
                           True)
+
     log.write_log_in_file('info',
                           "Total collections versions processed/published: \t\t\t\t\t"
-                          + f'{processed_collections_versions_count} / {collections_versions_count}',
+                          + f'{processed_collections_versions_count} / {collections_versions_count - already_preserved_collection_versions}',
                           True)
+
+    if collection_obj.processor.duplicate_bag_in_preservation_storage_count > 0:
+        log.write_log_in_file('warning',
+                              f'Bagger found {collection_obj.processor.duplicate_bag_in_preservation_storage_count} duplicate collection(s)',
+                              True)
+
+    log.write_log_in_file('info',
+                          "Total count of already preserved collection versions in preservation final remote storage: \t"
+                          + f'{preserved_collection_versions_in_ap_trust}',
+                          True)
+
     log.write_log_in_file('info',
-                          "Total collections already preserved: \t\t\t\t\t\t"
-                          + f'{collection_obj.processor.duplicate_bag_in_preservation_storage_count}',
+                          "Total count of already preserved collection versions in preservation staging remote storage: \t"
+                          + f'{preserved_collection_versions_in_wasabi}',
                           True)
 
-    if processed_articles_versions_count != published_articles_versions_count or processed_collections_versions_count != collections_versions_count:
+    if processed_articles_versions_count != published_articles_versions_count or \
+            processed_collections_versions_count != (collections_versions_count - already_preserved_collection_versions):
         log.write_log_in_file('warning',
-                              'The number of articles versions or collections versions sucessfully processed is different'
+                              'The number of articles versions or collections versions successfully processed is different'
                               + ' than the number fetched. Check the log for details.', True)
 
     log.write_log_in_file('info',
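
The structure of the dicts returned by get_articles() and process_collections() is not shown in this diff; based on the keys accessed above, a plausible shape (with invented example values) is:

# Hypothetical illustration of the count dicts unpacked in the summary.
already_preserved_counts_dict = {
    'already_preserved_article_ids': [101, 102, 103],  # articles skipped entirely
    'already_preserved_versions': 5,                   # versions skipped across all articles
}

already_preserved_collections_counts = {
    'already_preserved_collection_ids': [11, 12],
    'already_preserved_versions': 3,
    'wasabi_preserved_versions': 2,     # found in staging remote storage (Wasabi)
    'ap_trust_preserved_versions': 1,   # found in final remote storage (AP Trust)
}

# Derived exactly as in the diff above:
already_preserved_articles_count = len(already_preserved_counts_dict['already_preserved_article_ids'])
already_preserved_versions_count = already_preserved_counts_dict['already_preserved_versions']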
