Commit 6be4f20

Merge pull request #11054 from QualitativeDataRepository/IQSS/10108-StataMimeTypeRefinementForDIrectUpload

IQSS/10108: Stata mimetype refinement for direct upload

2 parents: 18a837d + df068fa

File tree: 7 files changed (+362 / -86 lines)

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+The version of Stata files is now detected during S3 direct upload (as it was for normal uploads), allowing ingest of Stata 14 and 15 files that have been uploaded directly. See [the guides](https://dataverse-guide--11054.org.readthedocs.build/en/11054/developers/big-data-support.html#features-that-are-disabled-if-s3-direct-upload-is-enabled), #10108, and #11054.
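For context on what this release note describes: Stata 13+ `.dta` files begin with an XML-like header that carries the release number, so the version can be recovered from just the first few dozen bytes of the file. A minimal sketch of such a check (illustrative only; the class and method names are hypothetical and the MIME type strings are assumptions, not taken from the Dataverse code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: refine a generic Stata MIME type into a version-specific
// one by inspecting the leading bytes of a .dta file.
public class StataVersionSniffer {

    // Stata 13+ files start with "<stata_dta><header><release>NNN</release>...".
    public static String sniff(byte[] head) {
        String s = new String(head, StandardCharsets.US_ASCII);
        int i = s.indexOf("<release>");
        int j = s.indexOf("</release>");
        if (s.startsWith("<stata_dta>") && i >= 0 && j > i) {
            switch (s.substring(i + 9, j)) {
                case "117": return "application/x-stata-13";
                case "118": return "application/x-stata-14"; // also produced by Stata 15
                case "119": return "application/x-stata-15"; // large-variable variant
                default:    return "application/x-stata";
            }
        }
        // Older (pre-Stata-13) files are binary; leave them at the generic type here.
        return "application/x-stata";
    }

    public static void main(String[] args) {
        byte[] head = "<stata_dta><header><release>118</release>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head)); // application/x-stata-14
    }
}
```

Because the release number sits this close to the start of the file, the check works on a small byte range fetched from remote storage, which is what makes it viable during direct upload.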

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 1 addition & 1 deletion

@@ -3872,7 +3872,7 @@ The fully expanded example above (without environment variables) looks like this
 Currently the following methods are used to detect file types:

 - The file type detected by the browser (or sent via API).
-- Custom code that reads the first few bytes. As explained at :ref:`s3-direct-upload-features-disabled`, this method of file type detection is not utilized during direct upload to S3, since by nature of direct upload Dataverse never sees the contents of the file. However, this code is utilized when the "redetect" API is used.
+- Custom code that reads the first few bytes. As explained at :ref:`s3-direct-upload-features-disabled`, most of these methods are not utilized during direct upload to S3, since by nature of direct upload Dataverse never sees the contents of the file. However, this code is utilized when the "redetect" API is used.
 - JHOVE: https://jhove.openpreservation.org . Note that the same applies about direct upload to S3 and the "redetect" API.
 - The file extension (e.g. ".ipynb") is used, defined in a file called ``MimeTypeDetectionByFileExtension.properties``.
 - The file name (e.g. "Dockerfile") is used, defined in a file called ``MimeTypeDetectionByFileName.properties``.
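The last two detection methods in the list above are simple lookups in properties files keyed by extension or file name. A minimal sketch of the extension-based lookup (the class name and the sample entries are illustrative, not the actual contents of ``MimeTypeDetectionByFileExtension.properties``):

```java
import java.util.Properties;

// Sketch of extension-based MIME detection: a properties file maps lowercase
// file extensions to MIME types; the lookup key is whatever follows the last dot.
public class ExtensionMimeLookup {

    public static String byExtension(Properties map, String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to look up (e.g. "Dockerfile")
        }
        return map.getProperty(fileName.substring(dot + 1).toLowerCase());
    }

    public static void main(String[] args) {
        Properties map = new Properties();
        map.setProperty("ipynb", "application/x-ipynb+json"); // sample entry
        map.setProperty("dta", "application/x-stata");        // sample entry
        System.out.println(byExtension(map, "analysis.ipynb"));
    }
}
```

Note that these lookups need only the file name, which is why they keep working during direct upload while the content-based checks do not.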

doc/sphinx-guides/source/developers/big-data-support.rst

Lines changed: 1 addition & 1 deletion

@@ -44,7 +44,7 @@ Features that are Disabled if S3 Direct Upload is Enabled
 The following features are disabled when S3 direct upload is enabled.

 - Unzipping of zip files. (See :ref:`compressed-files`.)
-- Detection of file type based on JHOVE and custom code that reads the first few bytes. (See :ref:`redetect-file-type`.)
+- Detection of file type based on JHOVE and custom code that reads the first few bytes, except for the refinement of Stata file types to include the version. (See :ref:`redetect-file-type`.)
 - Extraction of metadata from FITS files. (See :ref:`fits`.)
 - Creation of NcML auxiliary files (See :ref:`netcdf-and-hdf5`.)
 - Extraction of a geospatial bounding box from NetCDF and HDF5 files (see :ref:`netcdf-and-hdf5`) unless :ref:`dataverse.netcdf.geo-extract-s3-direct-upload` is set to true.
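The Stata refinement can survive direct upload because it needs only the first few dozen bytes of the file, which can be fetched with a ranged read against the storage location instead of downloading the whole object. A hedged sketch of building such a request (the URL and class name are hypothetical, and this is a generic HTTP illustration, not the Dataverse StorageIO code, which goes through its storage driver abstraction):

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustration of a ranged read: request only the leading n bytes of a remote
// object via an HTTP Range header (RFC 7233), enough for a magic-number check.
public class RangedHeadRead {

    public static HttpRequest firstBytes(String url, int n) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Range", "bytes=0-" + (n - 1)) // byte ranges are inclusive
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // 42 bytes is roughly what a Stata header check needs.
        HttpRequest req = firstBytes("https://example.com/bucket/file.dta", 42);
        System.out.println(req.headers().firstValue("Range").orElse(""));
    }
}
```

The request is built but not sent here; sending it against S3 would return a 206 Partial Content response containing just the requested prefix.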

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/CreateNewDataFilesCommand.java

Lines changed: 27 additions & 13 deletions

@@ -53,7 +53,7 @@
 import static edu.harvard.iq.dataverse.util.FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT;
 import static edu.harvard.iq.dataverse.util.FileUtil.createIngestFailureReport;
 import static edu.harvard.iq.dataverse.util.FileUtil.determineFileType;
-import static edu.harvard.iq.dataverse.util.FileUtil.determineFileTypeByNameAndExtension;
+import static edu.harvard.iq.dataverse.util.FileUtil.determineRemoteFileType;
 import static edu.harvard.iq.dataverse.util.FileUtil.getFilesTempDirectory;
 import static edu.harvard.iq.dataverse.util.FileUtil.saveInputStreamInTempFile;
 import static edu.harvard.iq.dataverse.util.FileUtil.useRecognizedType;

@@ -574,6 +574,8 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
         } else {
             // Direct upload.

+            finalType = StringUtils.isBlank(suppliedContentType) ? FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT : suppliedContentType;
+
             // Since this is a direct upload, and therefore no temp file associated
             // with it, we may, OR MAY NOT know the size of the file. If this is
             // a direct upload via the UI, the page must have already looked up

@@ -593,18 +595,6 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
                 }
             }

-            // Default to suppliedContentType if set or the overall undetermined default if a contenttype isn't supplied
-            finalType = StringUtils.isBlank(suppliedContentType) ? FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT : suppliedContentType;
-            String type = determineFileTypeByNameAndExtension(fileName);
-            if (!StringUtils.isBlank(type)) {
-                // Use rules for deciding when to trust browser supplied type
-                if (useRecognizedType(finalType, type)) {
-                    finalType = type;
-                }
-                logger.fine("Supplied type: " + suppliedContentType + ", finalType: " + finalType);
-            }
-
         }

         // Finally, if none of the special cases above were applicable (or

@@ -635,6 +625,30 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
         DataFile datafile = FileUtil.createSingleDataFile(version, newFile, newStorageIdentifier, fileName, finalType, newCheckSumType, newCheckSum);

         if (datafile != null) {
+            if (newStorageIdentifier != null) {
+                // Direct upload case
+                // Improve the MIMEType
+                // Need the owner for the StorageIO class to get the file/S3 path from the
+                // storageIdentifier
+                // Currently owner is null, but using this flag will avoid making changes here
+                // if that isn't true in the future
+                boolean ownerSet = datafile.getOwner() != null;
+                if (!ownerSet) {
+                    datafile.setOwner(version.getDataset());
+                }
+                String type = determineRemoteFileType(datafile, fileName);
+                if (!StringUtils.isBlank(type)) {
+                    // Use rules for deciding when to trust browser supplied type
+                    if (useRecognizedType(finalType, type)) {
+                        datafile.setContentType(type);
+                    }
+                    logger.fine("Supplied type: " + suppliedContentType + ", finalType: " + finalType);
+                }
+                // Avoid changing
+                if (!ownerSet) {
+                    datafile.setOwner(null);
+                }
+            }

             if (warningMessage != null) {
                 createIngestFailureReport(datafile, warningMessage);
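The diff above keeps `FileUtil.useRecognizedType(finalType, type)` as the gatekeeper for when a freshly detected type may override the client-supplied one. A simplified stand-in for that rule, to illustrate the idea (this is not the real method, which handles more cases; the constant and the refinement heuristic here are assumptions):

```java
// Toy version of a "when to trust the detected type" rule: prefer the detected
// type when the supplied one is missing or generic, or when the detected type
// merely refines the supplied one (e.g. x-stata -> x-stata-14).
public class TypeTrust {

    static final String UNDETERMINED = "application/octet-stream";

    public static boolean useRecognizedType(String suppliedType, String recognizedType) {
        if (suppliedType == null || suppliedType.isBlank() || suppliedType.equals(UNDETERMINED)) {
            return true; // nothing trustworthy was supplied
        }
        return recognizedType.startsWith(suppliedType); // detected type is a refinement
    }

    public static void main(String[] args) {
        System.out.println(useRecognizedType("application/x-stata", "application/x-stata-14"));
    }
}
```

This is why the Stata refinement in the direct-upload branch is safe: replacing `application/x-stata` with `application/x-stata-14` only adds information, never contradicts the supplied type.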

src/main/java/edu/harvard/iq/dataverse/ingest/IngestableDataChecker.java

Lines changed: 90 additions & 66 deletions

@@ -143,13 +143,29 @@ public String[] getTestFormatSet() {
         return this.testFormatSet;
     }

+    /* ToDo
+     * Rather than making these tests just methods, perhaps they could be implemented as
+     * classes inheriting a common interface. In addition to the existing test*format methods,
+     * the interface could include a method indicating whether the test requires
+     * the whole file or, if not, how many bytes are needed. That would make it easier to
+     * decide whether to use the test on direct/remote uploads (where retrieving a big file
+     * may not be worth it, but retrieving the 42 bytes needed for a Stata check or the
+     * ~491 bytes needed for a POR check could be).
+     *
+     * Could also add a method to indicate which mimetypes the test can identify/refine, which
+     * might make it possible to replace FileUtil.useRecognizedType(String, String) at some point.
+     *
+     * It might also make sense to make this interface broader than just the current ingestable types,
+     * e.g. to support the NetCDF, graphML and other checks in the same framework. (Some of these
+     * might only support using a file rather than a bytebuffer though.)
+     */
+
     // test methods start here ------------------------------------------------
     /**
      * test this byte buffer against SPSS-SAV spec
      */
-    public String testSAVformat(MappedByteBuffer buff) {
+    public String testSAVformat(ByteBuffer buff) {
         String result = null;
         buff.rewind();
         boolean DEBUG = false;

@@ -192,7 +208,7 @@ public String testSAVformat(ByteBuffer buff) {
      * test this byte buffer against STATA DTA spec
      */
-    public String testDTAformat(MappedByteBuffer buff) {
+    public String testDTAformat(ByteBuffer buff) {
         String result = null;
         buff.rewind();
         boolean DEBUG = false;

@@ -311,7 +327,7 @@ public String testDTAformat(ByteBuffer buff) {
      * test this byte buffer against SAS Transport (XPT) spec
      */
-    public String testXPTformat(MappedByteBuffer buff) {
+    public String testXPTformat(ByteBuffer buff) {
         String result = null;
         buff.rewind();
         boolean DEBUG = false;

@@ -359,7 +375,7 @@ public String testXPTformat(ByteBuffer buff) {
      * test this byte buffer against SPSS Portable (POR) spec
      */
-    public String testPORformat(MappedByteBuffer buff) {
+    public String testPORformat(ByteBuffer buff) {
         String result = null;
         buff.rewind();
         boolean DEBUG = false;

@@ -525,7 +541,7 @@ public String testPORformat(ByteBuffer buff) {
      * test this byte buffer against R data file
      */
-    public String testRDAformat(MappedByteBuffer buff) {
+    public String testRDAformat(ByteBuffer buff) {
         String result = null;
         buff.rewind();

@@ -607,11 +623,10 @@ public String testRDAformat(ByteBuffer buff) {

     // public instance methods ------------------------------------------------
     public String detectTabularDataFormat(File fh) {
-        boolean DEBUG = false;
-        String readableFormatType = null;
+
         FileChannel srcChannel = null;
         FileInputStream inp = null;
-
+
         try {
             // set-up a FileChannel instance for a given file object
             inp = new FileInputStream(fh);

@@ -621,63 +636,7 @@ public String detectTabularDataFormat(File fh) {

             // create a read-only MappedByteBuffer
             MappedByteBuffer buff = srcChannel.map(FileChannel.MapMode.READ_ONLY, 0, buffer_size);
-
-            //this.printHexDump(buff, "hex dump of the byte-buffer");
-
-            buff.rewind();
-            dbgLog.fine("before the for loop");
-            for (String fmt : this.getTestFormatSet()) {
-
-                // get a test method
-                Method mthd = testMethods.get(fmt);
-                //dbgLog.info("mthd: " + mthd.getName());
-
-                try {
-                    // invoke this method
-                    Object retobj = mthd.invoke(this, buff);
-                    String result = (String) retobj;
-
-                    if (result != null) {
-                        dbgLog.fine("result for (" + fmt + ")=" + result);
-                        if (DEBUG) {
-                            out.println("result for (" + fmt + ")=" + result);
-                        }
-                        if (readableFileTypes.contains(result)) {
-                            readableFormatType = result;
-                        }
-                        dbgLog.fine("readableFormatType=" + readableFormatType);
-                    } else {
-                        dbgLog.fine("null was returned for " + fmt + " test");
-                        if (DEBUG) {
-                            out.println("null was returned for " + fmt + " test");
-                        }
-                    }
-                } catch (InvocationTargetException e) {
-                    Throwable cause = e.getCause();
-                    // added null check because of "homemade.zip" from https://redmine.hmdc.harvard.edu/issues/3273
-                    if (cause.getMessage() != null) {
-                        err.format(cause.getMessage());
-                        e.printStackTrace();
-                    } else {
-                        dbgLog.info("cause.getMessage() was null for " + e);
-                        e.printStackTrace();
-                    }
-                } catch (IllegalAccessException e) {
-                    e.printStackTrace();
-                } catch (BufferUnderflowException e) {
-                    dbgLog.info("BufferUnderflowException " + e);
-                    e.printStackTrace();
-                }
-
-                if (readableFormatType != null) {
-                    break;
-                }
-            }
-
-            // help garbage-collect the mapped buffer sooner, to avoid the jvm
-            // holding onto the underlying file unnecessarily:
-            buff = null;
-
+            return detectTabularDataFormat(buff);
         } catch (FileNotFoundException fe) {
             dbgLog.fine("exception detected: file was not found");
             fe.printStackTrace();

@@ -688,8 +647,73 @@ public String detectTabularDataFormat(File fh) {
             IOUtils.closeQuietly(srcChannel);
             IOUtils.closeQuietly(inp);
         }
+        return null;
+    }
+
+    public String detectTabularDataFormat(ByteBuffer buff) {
+        boolean DEBUG = false;
+        String readableFormatType = null;
+
+        // this.printHexDump(buff, "hex dump of the byte-buffer");
+
+        buff.rewind();
+        dbgLog.fine("before the for loop");
+        for (String fmt : this.getTestFormatSet()) {
+
+            // get a test method
+            Method mthd = testMethods.get(fmt);
+            // dbgLog.info("mthd: " + mthd.getName());
+
+            try {
+                // invoke this method
+                Object retobj = mthd.invoke(this, buff);
+                String result = (String) retobj;
+
+                if (result != null) {
+                    dbgLog.fine("result for (" + fmt + ")=" + result);
+                    if (DEBUG) {
+                        out.println("result for (" + fmt + ")=" + result);
+                    }
+                    if (readableFileTypes.contains(result)) {
+                        readableFormatType = result;
+                    }
+                    dbgLog.fine("readableFormatType=" + readableFormatType);
+                } else {
+                    dbgLog.fine("null was returned for " + fmt + " test");
+                    if (DEBUG) {
+                        out.println("null was returned for " + fmt + " test");
+                    }
+                }
+            } catch (InvocationTargetException e) {
+                Throwable cause = e.getCause();
+                // added null check because of "homemade.zip" from
+                // https://redmine.hmdc.harvard.edu/issues/3273
+                if (cause.getMessage() != null) {
+                    err.format(cause.getMessage());
+                    e.printStackTrace();
+                } else {
+                    dbgLog.info("cause.getMessage() was null for " + e);
+                    e.printStackTrace();
+                }
+            } catch (IllegalAccessException e) {
+                e.printStackTrace();
+            } catch (BufferUnderflowException e) {
+                dbgLog.info("BufferUnderflowException " + e);
+                e.printStackTrace();
+            }
+
+            if (readableFormatType != null) {
+                break;
+            }
+        }
+
+        // help garbage-collect the mapped buffer sooner, to avoid the jvm
+        // holding onto the underlying file unnecessarily:
+        buff = null;
+
         return readableFormatType;
     }
+
     /**
      * identify the first 5 bytes

@@ -737,7 +761,7 @@ private long getBufferSize(FileChannel fileChannel) {
         return BUFFER_SIZE;
     }

-    private int getGzipBufferSize(MappedByteBuffer buff) {
+    private int getGzipBufferSize(ByteBuffer buff) {
         int GZIP_BUFFER_SIZE = 120;
         /*
         note:

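The net effect of the `MappedByteBuffer` to `ByteBuffer` change above is that the format tests no longer require a memory-mapped local file: any `ByteBuffer`, including one wrapping a few bytes fetched from remote storage, can be dispatched through the same checks. A toy illustration of that dispatch pattern (the registered test lambda is a stand-in for `testDTAformat` and friends, not the real check):

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative: run a set of format tests over a plain ByteBuffer, the way the
// refactored detectTabularDataFormat(ByteBuffer) iterates its test methods.
public class BufferFormatChecks {

    static final Map<String, Function<ByteBuffer, String>> TESTS = new LinkedHashMap<>();
    static {
        // Toy "DTA" test: recognize the Stata 13+ header prefix.
        TESTS.put("dta", buf -> {
            buf.rewind();
            byte[] head = new byte[Math.min(11, buf.remaining())];
            buf.get(head);
            return new String(head).equals("<stata_dta>") ? "application/x-stata" : null;
        });
    }

    // Run each registered test until one recognizes the buffer, else null.
    public static String detect(ByteBuffer buf) {
        for (Function<ByteBuffer, String> test : TESTS.values()) {
            String result = test.apply(buf);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect(ByteBuffer.wrap("<stata_dta><header>".getBytes())));
    }
}
```

`ByteBuffer.wrap(bytes)` is the key: a heap buffer over a short byte range behaves the same as a mapped buffer for these checks, which is what lets the direct-upload path reuse them.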