| 1 | +# Study Data Export |
| 2 | + |
| 3 | +This package contains the code for exporting study data from the database to files. The export process involves three steps: |
| 4 | +1. Retrieving the study data from the database. |
| 5 | +2. Transforming the data into a suitable format for export. |
| 6 | +3. Writing the transformed data to a file. |
| 7 | + |
| 8 | +The implementation has minimal dependencies on the rest of the codebase, which keeps it lightweight, performant, and easy to move to a separate web application if needed. |
| 9 | +To reduce memory usage, the export streams data on both the read and the write side. On the database side, the code uses a cursor to read data in chunks, and on the web controller side, it uses a streaming response to write data in chunks. |
| 10 | +This allows the code to handle large datasets without running out of memory. |
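The streaming idea described above can be illustrated with a minimal sketch. It is not the package's actual implementation: the class `StreamingExportSketch` and the method `writeTsv` are hypothetical names, a plain `Iterator` stands in for the database cursor, and a `Writer` stands in for the streaming HTTP response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;

/** Hypothetical sketch: rows are written as they are read, never collected into one big buffer. */
final class StreamingExportSketch {

    /**
     * Writes each row as a tab-separated line as soon as it is pulled from the cursor.
     * The iterator is an assumed stand-in for a database cursor, so only one row
     * is held by this method at any time. Returns the number of rows written.
     */
    static long writeTsv(Iterator<String[]> cursor, Writer out) {
        long count = 0;
        try {
            while (cursor.hasNext()) {
                out.write(String.join("\t", cursor.next()));
                out.write('\n');
                count++;
            }
            out.flush();
        } catch (IOException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Because the method only ever holds the current row, its memory use stays roughly constant regardless of how many rows the cursor yields.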
| 11 | + |
| 12 | +## Usage |
| 13 | + |
| 14 | +Set `feature.study.export` to `true` in the application properties file to enable the dynamic study export mode. |
| 15 | +This mode lets users export a study via the `/export/study/{studyId}.zip` link. |
| 16 | + |
| 17 | +## 10-minute timeout |
| 18 | + |
| 19 | +The export process is designed to complete within 10 minutes. If the export takes longer than that, it will be terminated. This is to ensure that the export process does not block the server for too long and to prevent resource exhaustion. |
| 20 | +If you want to increase the timeout, you can set the `feature.study.export.timeout_ms` property in the application properties file. The value is in milliseconds, and the default value is `600000` (10 minutes). |
| 21 | +Setting it to `-1` will disable the timeout and allow the export process to run indefinitely. However, this is not recommended as it can lead to resource exhaustion and performance issues. |
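The timeout rules above can be summarized in a small sketch. It only illustrates the documented semantics (default of `600000` ms, `-1` disables the limit); the class `ExportTimeoutSketch` and the method `isTimedOut` are hypothetical, and how the real code measures elapsed time and terminates the export is not shown here.

```java
/** Hypothetical sketch of the documented timeout semantics. */
final class ExportTimeoutSketch {

    /** Documented default for feature.study.export.timeout_ms (10 minutes). */
    static final long DEFAULT_TIMEOUT_MS = 600_000L;

    /** Documented value that disables the timeout. */
    static final long NO_TIMEOUT = -1L;

    /**
     * Returns true when an export that has been running for elapsedMs
     * should be terminated under the given timeout setting.
     * Assumption: only -1 disables the check; the source does not say
     * how other negative values are treated.
     */
    static boolean isTimedOut(long elapsedMs, long timeoutMs) {
        if (timeoutMs == NO_TIMEOUT) {
            return false; // timeout disabled, export may run indefinitely
        }
        return elapsedMs > timeoutMs;
    }
}
```

For example, `isTimedOut(600_001L, DEFAULT_TIMEOUT_MS)` is `true`, while any elapsed time combined with `NO_TIMEOUT` is `false`.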
| 22 | + |
| 23 | +## Supported Formats |
| 24 | + |
| 25 | +The table below shows which data types are supported for export: |
| 26 | + |
| 27 | +| GENETIC_ALTERATION_TYPE | DATATYPE | SUPPORTED | |
| 28 | +|---------------------------------------------------------|---|---| |
| 29 | +| CANCER_TYPE | CANCER_TYPE | Yes | |
| 30 | +| CLINICAL | PATIENT_ATTRIBUTES | Yes | |
| 31 | +| CLINICAL | SAMPLE_ATTRIBUTES | Yes | |
| 32 | +| CLINICAL | TIMELINE | Yes | |
| 33 | +| PROTEIN_LEVEL | LOG2-VALUE | Yes | |
| 34 | +| PROTEIN_LEVEL | Z-SCORE | Yes | |
| 35 | +| PROTEIN_LEVEL | CONTINUOUS | Yes | |
| 36 | +| COPY_NUMBER_ALTERATION | DISCRETE | Yes | |
| 37 | +| COPY_NUMBER_ALTERATION | CONTINUOUS | Yes | |
| 38 | +| COPY_NUMBER_ALTERATION | DISCRETE_LONG | No | |
| 39 | +| COPY_NUMBER_ALTERATION | LOG2-VALUE | Yes | |
| 40 | +| COPY_NUMBER_ALTERATION | SEG | Yes | |
| 41 | +| MRNA_EXPRESSION | CONTINUOUS | Yes | |
| 42 | +| MRNA_EXPRESSION | Z-SCORE | Yes | |
| 43 | +| MRNA_EXPRESSION | DISCRETE | Yes | |
| 44 | +| MUTATION_EXTENDED | MAF | Yes | |
| 45 | +| MUTATION_UNCALLED | MAF | Yes | |
| 46 | +| METHYLATION | CONTINUOUS | Yes | |
| 47 | +| GENE_PANEL_MATRIX | GENE_PANEL_MATRIX | Yes | |
| 48 | +| STRUCTURAL_VARIANT | SV | Yes | |
| 49 | +| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | LIMIT-VALUE | Yes | |
| 50 | +| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | BINARY | Yes | |
| 51 | +| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | CATEGORICAL | Yes | |
| 52 | +| Cancer study meta file | | Yes | |
| 53 | +| Case lists | | Yes | |
| 54 | +| GISTIC_GENES_AMP | Q-VALUE | No | |
| 55 | +| GISTIC_GENES_DEL | Q-VALUE | No | |
| 56 | +| MUTSIG | Q-VALUE | No | |
| 57 | +| GENESET_SCORE | GSVA-SCORE | No | |
| 58 | +| GENESET_SCORE | P-VALUE | No | |
| 59 | +| Study tags | | No | |
| 60 | +| Resource Definition | | No | |
| 61 | +| Study Resource | | No | |
| 62 | +| Patient Resource | | No | |
| 63 | +| Sample Resource | | No | |
| 64 | + |
| 65 | +### namespaces meta property is not supported |
| 66 | + |
| 67 | +Mutation, CNA, and SV data have a `namespaces` meta property that provides a way to load arbitrary data into cBioPortal. |
| 68 | +Exporting this data is not currently supported. It can be added later if needed. |
| 69 | + |
| 70 | +## Caveats |
| 71 | + |
| 72 | +The exported study data files won't look exactly the same as the original study data files. |
| 73 | +## What's lost in translation? |
| 74 | +- If your data includes `Hugo_Symbol` but not `Entrez_Gene_Id`, cBioPortal will try to find the matching gene using its database. As a result, the exported data might include `Hugo_Symbol` values that weren’t in your original file. These could be related gene names that replace gene aliases found in your data. |
| 75 | + - The export always adds both `Hugo_Symbol` and `Entrez_Gene_Id` with complete values, even if the original file had only one column or was missing some values. |
| 76 | +- The cBioPortal loader filters out certain mutations (e.g. non-coding mutations), so the exported MAF file may not include all mutations from the original file. |
| 77 | +- The exported files will not keep the original file names; file names are generated based on the data type. |
| 78 | +- `TIMELINE` data will be exported as one file per `EVENT_TYPE`, regardless of how the original files were structured. |
| 79 | + - If `STYLE_COLOR` or `STYLE_SHAPE` columns are present in the timeline data, events with no value in those columns will get the default values: |
| 80 | + - `STYLE_COLOR` will be set to `#1f77b4` (light blue). |
| 81 | + - `STYLE_SHAPE` will be set to `circle`. |
| 82 | + - These values are used by default by cBioPortal to render the timeline events in the UI. |
| 83 | +- `DISCRETE_LONG` will not be exported as such, because nothing in the database marks the data as long-format. Instead, it will be exported as `DISCRETE`. |
| 84 | +- The `HGVSp_Short` column of the MAF file will be computed from `mutation_event`.`PROTEIN_CHANGE` by adding the `p.` prefix (unless the value is `MUTATED`). |
| 85 | + - The protein change may have been read from `Amino_Acid_Change` as a fallback field in the original files, and there is no way of knowing which field it was originally parsed from. |
| 86 | + - Because `Amino_Acid_Change` can contain values that are not valid HGVSp, the exported `HGVSp_Short` might not be valid HGVSp either. This should not prevent you from loading the file into cBioPortal and getting the protein change parsed correctly. |
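The `HGVSp_Short` rule in the last caveat can be sketched as follows. This is an illustration only: the class `HgvspShortSketch` and the method `toHgvspShort` are hypothetical, and returning `MUTATED`, empty, and `null` values unchanged is an assumption, since the text above only says the prefix is not added for `MUTATED`.

```java
/** Hypothetical sketch of deriving HGVSp_Short from a stored protein change. */
final class HgvspShortSketch {

    /**
     * Adds the "p." prefix to a protein change such as "V600E".
     * Assumption: "MUTATED", empty, and null inputs are returned unchanged;
     * the real exporter may handle these cases differently.
     * No validation is done, so a non-HGVSp input yields a non-HGVSp output.
     */
    static String toHgvspShort(String proteinChange) {
        if (proteinChange == null || proteinChange.isEmpty() || "MUTATED".equals(proteinChange)) {
            return proteinChange;
        }
        return "p." + proteinChange;
    }
}
```

For instance, `toHgvspShort("V600E")` yields `p.V600E`, while `toHgvspShort("MUTATED")` stays `MUTATED` in this sketch.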