
Commit c39222d

koppor and subhramit authored
Fix XMP import (#12833)
* Fix level for PicaXmlParser
* Use withers
* Use "withField"
* Add debub statement
* Modernize test
* Add 2024_SPLC_Becker.pdf
* Reuse object
* Use well-known array list
* Refactor xmpUtilReader to avoid double-loading of the PDF
* Fail faster
* Simplify test code
* Reorder for better readability
* Fix casing
* Fix comments in XmpUtilReaderTest
* Revert "Reorder for better readability"
  This reverts commit 7f398ee.
* Make code more readable
* Add support for UnknownField{name='rights'}
* WIP: Fix of XmpUtilReader
* Add skeletton for test
* Introduce constant
* Map more fields
* Fix links
* Add links to related classes
* Add initial Markdown doc
* Remove unnecessary method
* Revert "WIP: Fix of XmpUtilReader"
  This reverts commit 85cf9f1.
* Try to cache DOM parser
* Merge also merges type
* Remove "doi:" prefix at identifier
* Return one entry only
* Try to have org.jabref.logic.importer.fileformat.pdf.PdfMergeMetadataImporterTest#importRelativizesFilePath working
* Fix tabs
* Update CHANGELOG.md
  Co-authored-by: Subhramit Basu <[email protected]>
* Fix path relativation
* Use withers
* Use "List.of"
* Add comment on testing
* Re-use merge logig of BibEntry
* Add support of merging file field
* Streamline code in PdfMergeMetadataImporter
* Adapt test to real data
* Refine JavaDoc
* Fix issue in field writing
* Introduce BibEntryCompare
* Ease test
* Add month handling during loading of XMP data
* Add refined JavaDoc
* Fix checkstyle
* Remvoe "// TODO"

---------

Co-authored-by: Subhramit Basu <[email protected]>
1 parent 4605b5f commit c39222d

24 files changed: +518 −239 lines

CHANGELOG.md

+2

```diff
@@ -25,6 +25,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 - We added a feature for enabling drag-and-drop of files into groups [#12540](https://github.com/JabRef/jabref/issues/12540)
 - We added support for reordering keywords via drag and drop, automatic alphabetical ordering, and improved pasting and editing functionalities in the keyword editor. [#10984](https://github.com/JabRef/jabref/issues/10984)
 - We added a new functionality where author names having multiple spaces in-between will be considered as separate user block as it does for " and ". [#12701](https://github.com/JabRef/jabref/issues/12701)
+- We enhanced support for parsing XMP metadata from PDF files. [#12829](https://github.com/JabRef/jabref/issues/12829)
 - We added a "Preview" header in the JStyles tab in the "Select style" dialog, to make it consistent with the CSL styles tab. [#12838](https://github.com/JabRef/jabref/pull/12838)
 
 ### Changed
@@ -86,6 +87,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 - We fixed an issue where duplicate items cannot be removed correctly when merging groups or keywords. [#12585](https://github.com/JabRef/jabref/issues/12585)
 - We fixed an issue where JabRef displayed an incorrect deletion notification when canceling entry deletion [#12645](https://github.com/JabRef/jabref/issues/12645)
 - We fixed an issue where JabRef displayed an incorrect deletion notification when canceling entry deletion. [#12645](https://github.com/JabRef/jabref/issues/12645)
+- We fixed an issue where JabRef wrote wrong field names into the PDF. [#12833](https://github.com/JabRef/jabref/pulls/12833)
 - We fixed an issue where an exception would occur when running abbreviate journals for multiple entries. [#12634](https://github.com/JabRef/jabref/issues/12634)
 - We fixed an issue where JabRef displayed dropdown triangle in wrong place in "Search for unlinked local files" dialog [#12713](https://github.com/JabRef/jabref/issues/12713)
 - We fixed an issue where JabRef would not open if an invalid external journal abbreviation path was encountered. [#12776](https://github.com/JabRef/jabref/issues/12776)
```

docs/code-howtos/testing.md

+16

````diff
@@ -150,6 +150,22 @@ Fetcher tests can be run locally by executing the Gradle task `fetcherTest`. Thi
 
 Alternatively, if one is using IntelliJ, this can also be done by double-clicking the `fetcherTest` task under the `other` group in the Gradle Tool window (`JabRef > Tasks > other > fetcherTest`).
 
+## "No matching tests found"
+
+If the output is "No matching tests found", the wrong test category was used.
+
+Check the "Run/Debug Configurations".
+
+Example:
+
+```gradle
+:databaseTest --tests "org.jabref.logic.importer.fileformat.pdf.PdfMergeMetadataImporterTest.pdfMetadataExtractedFrom2024SPLCBecker"
+```
+
+This tells Gradle that `PdfMergeMetadataImporterTest` should be executed as a database test.
+However, it is marked as `@FetcherTest`.
+Thus, change `:databaseTest` to `:fetcherTest` to get the test running.
+
 ## Advanced testing and further reading
 
 On top of basic unit testing, there are more ways to test a software:
````
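To make the fix from the added section concrete: assuming the test class really is annotated with `@FetcherTest`, the corrected invocation keeps the same `--tests` filter but runs it under the `fetcherTest` task (a sketch of the adjusted configuration, not output from this commit):

```gradle
:fetcherTest --tests "org.jabref.logic.importer.fileformat.pdf.PdfMergeMetadataImporterTest.pdfMetadataExtractedFrom2024SPLCBecker"
```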

docs/code-howtos/xmp-parsing.md

+45 (new file)

---
parent: Code Howtos
---

# XMP Parsing

Example XMP metadata from a PDF file (`src/test/resources/org/jabref/logic/importer/fileformat/pdf/2024_SPLC_Becker.pdf`):

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:identifier>doi:10.1145/3646548.3672587</dc:identifier>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:prism="http://prismstandard.org/namespaces/basic/2.1/">
      <prism:doi>10.1145/3646548.3672587</prism:doi>
      <prism:url>https://doi.org/10.1145/3646548.3672587</prism:url>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:crossmark="http://crossref.org/crossmark/1.0/">
      <crossmark:MajorVersionDate>2024-09-02</crossmark:MajorVersionDate>
      <crossmark:CrossmarkDomainExclusive>true</crossmark:CrossmarkDomainExclusive>
      <crossmark:CrossMarkDomains>
        <rdf:Seq>
          <rdf:li>dl.acm.org</rdf:li>
        </rdf:Seq>
      </crossmark:CrossMarkDomains>
      <crossmark:DOI>10.1145/3646548.3672587</crossmark:DOI>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
      <pdfx:CrossMarkDomains>
        <rdf:Seq>
          <rdf:li>dl.acm.org</rdf:li>
        </rdf:Seq>
      </pdfx:CrossMarkDomains>
      <pdfx:CrossmarkDomainExclusive>true</pdfx:CrossmarkDomainExclusive>
      <pdfx:doi>10.1145/3646548.3672587</pdfx:doi>
      <pdfx:CrossmarkMajorVersionDate>2024-09-02</pdfx:CrossmarkMajorVersionDate>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```

`org.apache.xmpbox.xml.DomXmpParser` cannot ignore unknown namespaces. Therefore, we need to extract the known elements.
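The extraction step described above can be sketched with plain JDK DOM classes. This is a minimal illustration, not JabRef's actual implementation: the class name `KnownNamespaceFilter`, the helper `keepKnownDescriptions`, and the namespace set are invented for the example; the real logic lives around `XmpUtilReader`. The idea is to drop every `rdf:Description` that does not declare a namespace the downstream parser understands, before handing the DOM to `DomXmpParser`-style code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

public class KnownNamespaceFilter {

    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical set of namespaces the downstream XMP parser understands
    private static final Set<String> KNOWN_NAMESPACES = Set.of(
            "http://purl.org/dc/elements/1.1/");

    // Removes every rdf:Description that declares no known namespace
    public static Document keepKnownDescriptions(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            // Copy to a list first: removing while iterating a live NodeList skips elements
            NodeList descriptions = document.getElementsByTagNameNS(RDF_NS, "Description");
            List<Element> toCheck = new ArrayList<>();
            for (int i = 0; i < descriptions.getLength(); i++) {
                toCheck.add((Element) descriptions.item(i));
            }
            for (Element description : toCheck) {
                if (!declaresKnownNamespace(description)) {
                    description.getParentNode().removeChild(description);
                }
            }
            return document;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP", e);
        }
    }

    private static boolean declaresKnownNamespace(Element element) {
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            // xmlns:... attributes live in the XMLNS namespace; their value is the declared URI
            if ("http://www.w3.org/2000/xmlns/".equals(attributes.item(i).getNamespaceURI())
                    && KNOWN_NAMESPACES.contains(attributes.item(i).getNodeValue())) {
                return true;
            }
        }
        return false;
    }
}
```

Applied to the example metadata above, such a filter would keep the `dc:` description and drop the `prism`, `crossmark`, and `pdfx` ones (given the illustrative namespace set used here).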
src/main/java/org/jabref/logic/bibtex/comparator/BibEntryCompare.java

+66 (new file)

```java
package org.jabref.logic.bibtex.comparator;

import java.util.Collection;
import java.util.Collections;
import java.util.SequencedSet;

import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.Field;

public class BibEntryCompare {
    public enum Result { SUBSET, EQUAL, SUPERSET, DISJUNCT, DISJUNCT_OR_EQUAL_FIELDS, DIFFERENT }

    /**
     * @return the relation of {@code first} to {@code second}, e.g., if first is a subset of second, then Result.SUBSET is returned.
     */
    public static Result compareEntries(BibEntry first, BibEntry second) {
        if (first.equals(second)) {
            return Result.EQUAL;
        }

        SequencedSet<Field> fieldsFirst = first.getFields();
        SequencedSet<Field> secondFields = second.getFields();

        if (fieldsFirst.containsAll(secondFields)) {
            if (isSubSet(second, first)) {
                return Result.SUPERSET;
            }
            return Result.DIFFERENT;
        }

        if (secondFields.containsAll(fieldsFirst)) {
            if (isSubSet(first, second)) {
                return Result.SUBSET;
            }
            return Result.DIFFERENT;
        }

        if (Collections.disjoint(fieldsFirst, secondFields)) {
            return Result.DISJUNCT;
        }

        fieldsFirst.retainAll(secondFields);
        if (isSubSet(first, second, fieldsFirst)) {
            return Result.DISJUNCT_OR_EQUAL_FIELDS;
        }

        return Result.DIFFERENT;
    }

    private static boolean isSubSet(BibEntry candidateSubSet, BibEntry candidateSuperSet) {
        return isSubSet(candidateSubSet, candidateSuperSet, candidateSubSet.getFields());
    }

    private static boolean isSubSet(BibEntry candidateSubSet, BibEntry candidateSuperSet, Collection<Field> fields) {
        for (Field field : fields) {
            String subValue = candidateSubSet.getField(field).get();
            boolean isEqualValue = candidateSuperSet.getField(field)
                                                    .filter(superValue -> superValue.equals(subValue))
                                                    .isPresent();
            if (!isEqualValue) {
                return false;
            }
        }
        return true;
    }
}
```
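The comparison semantics of `BibEntryCompare` can be illustrated with a self-contained model that uses plain string maps instead of JabRef's `BibEntry`. The class name `FieldMapCompare` is invented for this sketch; it follows the same decision ladder (equal, superset, subset, disjoint fields, shared fields that agree, otherwise different):

```java
import java.util.Collections;
import java.util.Map;

public class FieldMapCompare {
    public enum Result { SUBSET, EQUAL, SUPERSET, DISJUNCT, DISJUNCT_OR_EQUAL_FIELDS, DIFFERENT }

    // Same decision ladder as BibEntryCompare, on Map<String, String> "entries"
    public static Result compare(Map<String, String> first, Map<String, String> second) {
        if (first.equals(second)) {
            return Result.EQUAL;
        }
        if (first.entrySet().containsAll(second.entrySet())) {
            return Result.SUPERSET; // all of second's field/value pairs appear in first
        }
        if (second.entrySet().containsAll(first.entrySet())) {
            return Result.SUBSET;
        }
        if (Collections.disjoint(first.keySet(), second.keySet())) {
            return Result.DISJUNCT; // no shared field at all
        }
        // Entries overlap: check whether all shared fields carry equal values
        boolean sharedAgree = first.entrySet().stream()
                .filter(entry -> second.containsKey(entry.getKey()))
                .allMatch(entry -> entry.getValue().equals(second.get(entry.getKey())));
        return sharedAgree ? Result.DISJUNCT_OR_EQUAL_FIELDS : Result.DIFFERENT;
    }
}
```

For example, `{author=Knuth, year=1974}` compared with `{author=Knuth}` yields `SUPERSET`, while comparing it with `{author=Knuth, title=TAOCP}` yields `DISJUNCT_OR_EQUAL_FIELDS`, because the only shared field agrees.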

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

+58 −64

```diff
@@ -3,15 +3,18 @@
 import java.io.IOException;
 import java.nio.file.Path;
 import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.List;
 import java.util.Objects;
 import java.util.Optional;
 import java.util.Set;
 import java.util.stream.Stream;
 
 import org.jabref.logic.FilePreferences;
+import org.jabref.logic.cleanup.RelativePathsCleanup;
 import org.jabref.logic.importer.EntryBasedFetcher;
 import org.jabref.logic.importer.FetcherException;
+import org.jabref.logic.importer.IdBasedFetcher;
 import org.jabref.logic.importer.ImportFormatPreferences;
 import org.jabref.logic.importer.ParseException;
 import org.jabref.logic.importer.ParserResult;
@@ -24,14 +27,12 @@
 import org.jabref.logic.importer.fileformat.pdf.PdfImporter;
 import org.jabref.logic.importer.fileformat.pdf.PdfVerbatimBibtexImporter;
 import org.jabref.logic.importer.fileformat.pdf.PdfXmpImporter;
-import org.jabref.logic.importer.util.FileFieldParser;
 import org.jabref.logic.l10n.Localization;
 import org.jabref.logic.util.StandardFileType;
 import org.jabref.logic.util.io.FileUtil;
 import org.jabref.model.database.BibDatabaseContext;
 import org.jabref.model.entry.BibEntry;
 import org.jabref.model.entry.LinkedFile;
-import org.jabref.model.entry.field.Field;
 import org.jabref.model.entry.field.StandardField;
 
 import org.apache.pdfbox.pdmodel.PDDocument;
@@ -48,12 +49,13 @@ public class PdfMergeMetadataImporter extends PdfImporter {
 
     private static final Logger LOGGER = LoggerFactory.getLogger(PdfMergeMetadataImporter.class);
 
-    private final ImportFormatPreferences importFormatPreferences;
     private final List<PdfImporter> metadataImporters;
 
-    public PdfMergeMetadataImporter(ImportFormatPreferences importFormatPreferences) {
-        this.importFormatPreferences = importFormatPreferences;
+    private final DoiFetcher doiFetcher;
+    private final ArXivFetcher arXivFetcher;
+    private final IsbnFetcher isbnFetcher;
 
+    public PdfMergeMetadataImporter(ImportFormatPreferences importFormatPreferences) {
         // TODO: Evaluate priorities of these {@link PdfBibExtractor}s.
         this.metadataImporters = new ArrayList<>(List.of(
                 new PdfVerbatimBibtexImporter(importFormatPreferences),
@@ -65,6 +67,12 @@ public PdfMergeMetadataImporter(ImportFormatPreferences importFormatPreferences)
         if (importFormatPreferences.grobidPreferences().isGrobidEnabled()) {
             this.metadataImporters.add(2, new PdfGrobidImporter(importFormatPreferences));
         }
+        doiFetcher = new DoiFetcher(importFormatPreferences);
+        arXivFetcher = new ArXivFetcher(importFormatPreferences);
+
+        isbnFetcher = new IsbnFetcher(importFormatPreferences);
+        // .addRetryFetcher(new EbookDeIsbnFetcher(importFormatPreferences))
+        // .addRetryFetcher(new DoiToBibtexConverterComIsbnFetcher(importFormatPreferences))
     }
 
     /**
@@ -99,6 +107,7 @@ private List<BibEntry> extractCandidatesFromPdf(Path filePath, PDDocument docume
         for (PdfImporter metadataImporter : metadataImporters) {
             try {
                 List<BibEntry> extractedEntries = metadataImporter.importDatabase(filePath, document);
+                LOGGER.debug("Importer {} extracted {}", metadataImporter.getName(), extractedEntries);
                 candidates.addAll(extractedEntries);
             } catch (Exception e) {
                 LOGGER.error("Got an exception while importing PDF file", e);
@@ -111,43 +120,17 @@ private List<BibEntry> fetchIdsOfCandidates(List<BibEntry> candidates) {
     private List<BibEntry> fetchIdsOfCandidates(List<BibEntry> candidates) {
         List<BibEntry> fetchedCandidates = new ArrayList<>();
 
+        // Collects Ids already looked for - to avoid multiple calls for one id
+        final Set<String> fetchedIds = new HashSet<>();
+
         for (BibEntry candidate : candidates) {
-            Optional<String> doi = candidate.getField(StandardField.DOI);
-            if (doi.isPresent()) {
-                try {
-                    new DoiFetcher(importFormatPreferences)
-                            .performSearchById(doi.get())
-                            .ifPresent(fetchedCandidates::add);
-                } catch (FetcherException e) {
-                    LOGGER.error("Fetching failed for DOI \"{}\".", doi, e);
-                }
-            }
+            fetchData(candidate, StandardField.DOI, doiFetcher, fetchedIds, fetchedCandidates);
 
-            Optional<String> eprint = candidate.getField(StandardField.EPRINT);
-            if (eprint.isPresent()) {
-                // This code assumes that `eprint` field refers to an arXiv preprint, which is not correct.
-                // One should also check if `archivePrefix` is equal to `arXiv`, and handle other cases too.
-                try {
-                    new ArXivFetcher(importFormatPreferences)
-                            .performSearchById(eprint.get())
-                            .ifPresent(fetchedCandidates::add);
-                } catch (FetcherException e) {
-                    LOGGER.error("Fetching failed for arXiv ID \"{}\".", eprint.get(), e);
-                }
-            }
+            // This code assumes that `eprint` field refers to an arXiv preprint, which is not correct.
+            // One should also check if `archivePrefix` is equal to `arXiv`, and handle other cases too.
+            fetchData(candidate, StandardField.EPRINT, arXivFetcher, fetchedIds, fetchedCandidates);
 
-            Optional<String> isbn = candidate.getField(StandardField.ISBN);
-            if (isbn.isPresent()) {
-                try {
-                    new IsbnFetcher(importFormatPreferences)
-                            // .addRetryFetcher(new EbookDeIsbnFetcher(importFormatPreferences))
-                            // .addRetryFetcher(new DoiToBibtexConverterComIsbnFetcher(importFormatPreferences))
-                            .performSearchById(isbn.get())
-                            .ifPresent(fetchedCandidates::add);
-                } catch (FetcherException e) {
-                    LOGGER.error("Fetching failed for ISBN \"{}\".", isbn.get(), e);
-                }
-            }
+            fetchData(candidate, StandardField.ISBN, isbnFetcher, fetchedIds, fetchedCandidates);
 
             // TODO: Handle URLs too.
             // However, it may have problems if URL refers to the same identifier in DOI, ISBN, or arXiv.
@@ -156,41 +139,52 @@ private List<BibEntry> fetchIdsOfCandidates(List<BibEntry> candidates) {
         return fetchedCandidates;
     }
 
-    private static BibEntry mergeCandidates(Stream<BibEntry> candidates) {
-        BibEntry entry = new BibEntry();
-
-        // Functional style is used here instead of imperative like in `extractCandidatesFromPdf` or `fetchIdsOfCandidates`,
-        // because they have checked exceptions.
+    /**
+     * @param candidate The BibEntry to look for the field
+     * @param field The field to look for
+     * @param fetcher The fetcher to use
+     * @param fetchedIds The already fetched ids (will be updated)
+     * @param fetchedCandidates New candidates (will be updated)
+     */
+    private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher fetcher, Set<String> fetchedIds, List<BibEntry> fetchedCandidates) {
+        candidate.getField(field)
+                 .filter(id -> !fetchedIds.contains(id))
+                 .ifPresent(id -> {
+                     fetchedIds.add(id);
+                     try {
+                         fetcher.performSearchById(id)
+                                .ifPresent(fetchedCandidates::add);
+                     } catch (FetcherException e) {
+                         LOGGER.error("Fetching failed for id \"{}\".", id, e);
+                     }
+                 });
+    }
 
-        candidates.forEach(candidate -> {
-            if (BibEntry.DEFAULT_TYPE.equals(entry.getType())) {
-                entry.setType(candidate.getType());
-            }
+    private static BibEntry mergeCandidates(Stream<BibEntry> candidates) {
+        final BibEntry entry = new BibEntry();
+        candidates.forEach(entry::mergeWith);
 
-            Set<Field> presentFields = entry.getFields();
-
-            candidate
-                    .getFieldMap()
-                    .entrySet()
-                    .stream()
-                    // Don't merge FILE fields that point to a stored file as we set that to filePath anyway.
-                    // Nevertheless, retain online links.
-                    .filter(fieldEntry ->
-                            !(StandardField.FILE == fieldEntry.getKey()
-                                    && FileFieldParser.parse(fieldEntry.getValue()).stream().noneMatch(LinkedFile::isOnlineLink)))
-                    // Only overwrite non-present fields
-                    .filter(fieldEntry -> !presentFields.contains(fieldEntry.getKey()))
-                    .forEach(fieldEntry -> entry.setField(fieldEntry.getKey(), fieldEntry.getValue()));
-        });
+        // Retain online links only
+        List<LinkedFile> onlineLinks = entry.getFiles().stream().filter(LinkedFile::isOnlineLink).toList();
+        entry.clearField(StandardField.FILE);
+        entry.addFiles(onlineLinks);
 
         return entry;
     }
 
+    /**
+     * Imports the BibTeX data from the given PDF file and relativizes the paths of each linked file based on the context and the file preferences.
+     */
     public ParserResult importDatabase(Path filePath, BibDatabaseContext context, FilePreferences filePreferences) throws IOException {
         Objects.requireNonNull(context);
         Objects.requireNonNull(filePreferences);
 
-        return importDatabase(filePath);
+        ParserResult parserResult = importDatabase(filePath);
+
+        RelativePathsCleanup relativePathsCleanup = new RelativePathsCleanup(context, filePreferences);
+        parserResult.getDatabase().getEntries().forEach(entry -> relativePathsCleanup.cleanup(entry));
+
+        return parserResult;
     }
 
     @Override
```
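The deduplication pattern introduced by `fetchData` above can be shown standalone. This sketch models a fetcher as a plain function and is not JabRef code; the names `DedupFetch` and `fetchOnce` are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

public class DedupFetch {
    // Tracks ids already looked up, so each id triggers at most one fetch
    private final Set<String> fetchedIds = new HashSet<>();
    private final List<String> results = new ArrayList<>();

    // "fetcher" stands in for IdBasedFetcher#performSearchById
    public void fetchOnce(Optional<String> id, Function<String, Optional<String>> fetcher) {
        id.filter(value -> !fetchedIds.contains(value))
          .ifPresent(value -> {
              fetchedIds.add(value);
              fetcher.apply(value).ifPresent(results::add);
          });
    }

    public List<String> results() {
        return results;
    }
}
```

Calling `fetchOnce` twice with the same id performs only one lookup, which is exactly what the `fetchedIds` set in the importer achieves when several extracted candidates carry the same DOI.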

src/main/java/org/jabref/logic/importer/fileformat/PicaXmlParser.java

+1 −1

```diff
@@ -98,7 +98,7 @@ private BibEntry parseEntry(Element e) {
         List<Element> datafields = getChildren("datafield", e);
         for (Element datafield : datafields) {
             String tag = datafield.getAttribute("tag");
-            LOGGER.debug("tag: {}", tag);
+            LOGGER.trace("tag: {}", tag);
 
             // genre/type of the entry https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=0500&katalog=Standard
             if ("002@".equals(tag)) {
```

src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfXmpImporter.java

+3 −1

```diff
@@ -17,13 +17,15 @@
 public class PdfXmpImporter extends PdfImporter {
 
     private final XmpPreferences xmpPreferences;
+    private final XmpUtilReader xmpUtilReader;
 
     public PdfXmpImporter(XmpPreferences xmpPreferences) {
         this.xmpPreferences = xmpPreferences;
+        xmpUtilReader = new XmpUtilReader();
     }
 
     public List<BibEntry> importDatabase(Path filePath, PDDocument document) throws IOException {
-        return new XmpUtilReader().readXmp(filePath, xmpPreferences);
+        return xmpUtilReader.readXmp(filePath, document, xmpPreferences);
     }
 
     @Override
```
