Commit 54d9853

Implement RFC95 Generating Study data files (#11482)
- Enable HTTP streaming by eliminating unnecessary response copying
  This enables streaming for Study Data Export (RFC95). Unlike wrapping HTTP requests in `ContentCachingRequestWrapper` to allow multiple request body reads, wrapping responses in `ContentCachingResponseWrapper` is not necessary and not used anywhere. The application caching behaviour (with the `@Cacheable` annotation) does not depend on either of these.
- Port clinical export
  - Introduce long table
  - Convert records back to POJOs. Record fields are mapped by position with MyBatis
  - Write tests for clinical data table
  - Make conditional Java Config
- Enable genetic profile and MAF data export
- Enable case list export
- Port POC unit tests. Make them pass
- Port Export integration smoke test
- Port import-export functionality test
- Enable dynamic study export mode on CI for running tests
- Fix import export study data differences
- Refactor the code
- Fix order of columns for the clinical file
- Refactor exporters layer
- Make test pass
- Return 404 HTTP status when no study exists
- Make sure the rows are ordered by sample/patient IDs
  We do a defensive assumption check to make sure clinical file rows will be formed correctly
- Check clinical attributes for duplicates
- Clean code from done TODO comments
- Run study export in low priority custom thread pool
- Specify the type of MAF exporter
- Should we maybe corrupt file if download fails?
- Remove unneeded comments
- Make sure input stream closes
- Add support of MRNA Expression data type export
- Add support for generic data types
- Add README to the package
- Add MRNA and Generic Assay data type to the test study
- Remove patient level generic assay data to fix the build
- Fix special attribute values not showing for the first line
  toRow() gave different results on the first and subsequent calls. The first calls were made for getting the header only.
- Skip generic properties with mismatching id
  They were blocking the follow-up rows
- Fix skipping rows in generic assay data
  MyBatis does not like it if a result does not have <id ...>. If it is not specified, it picks the first field as a key. Rows that would have the same key will be skipped!
- Update import/export test study to new fixes
- Fix metadata tests
- Fix unit tests
- Improve test coverage for all exporters
- Expand MRNA export support to Z-SCORE and DISCRETE
- Support export of cancer types
  We export the cancer type of the study with all its parents
- Do not export patient level generic assay data
  We are not going to support patient level data export. Without this fix the code exported a sample header for patient level data, which would fail during the load to cBioPortal
- Fix number of columns for cancer type file
- Export clinical timeline aka events
- Support protein level data export
- Support more generic assay data types
- Add support for exporting mutation uncalled data type
- Add CNA continuous and log2 data export
- Lower case p in phosphosite to mark it as such
- Support methylation data type export
- Support CNA discrete data type
- Support export of CNA Segment data
- Support Structural Variant Data Export
- Support exporting gene panel matrix data
- Remove pipe output as not reliable
- Use forward only cursors where possible for memory optimisation
- Corrupt zip file intentionally in case of exception
  We don't want to do a partial export in case of exception
- Warn about incorrect format for phosphoprotein, not crash
- Provide a way to increase timeout for async requests
  Increase it to 10 minutes by default.
- Fix SonarQube reported issues
- Add README.txt file to the exported study data
  Warn that not all data types might be exported
- Write NA instead of blank string for absent gene panel
  Validator complains about blank strings
- Move code to fail zip streaming to the factory
- Add possibility to export study under alternative study id
- Enable filtering exported data by sample id
- Move pre-authorization check to the service layer
  Preparation to support export of virtual studies
- Move check for presence of study into the service
  As a preparation to support Virtual Study export
- Enable downloading virtual studies
- Use CI session service instead of default remote one
  Fixing the test for study import export on the build
- Improve splitting gene name to hugo symbol and phosphosite
  Phosphosite can contain an underscore in its id
- Override name, description, pmid and cancer type of virtual study that have one physical study in the definition
- Test Virtual Study download that is defined with multiple Materialised Studies
- Refresh dynamic Virtual Studies before export
- Add dynamic_study_export_mode property
- Improve performance by parsing comma-separated values once, not in the loop!
  It becomes very slow when the number of samples is significant
- Improve start-stop time logging around performance critical code
  It helped to identify some performance issues
- Document export request timeout
- Restore HGVSp_Short from PROTEIN_CHANGE
- Export tumor seq allele 1 and 2 from the right db field
- Account for blank strings in protein change
- Export a file per clinical timeline event type
- Apply spotless reformatting of the code
- Export data with default style color and shape if attributes present but there is no value
  The loader does not like empty values for these columns.
- Change timeline circle color to the default used by the timeline widget
- Substitute with default value even when nullable or blank value specified
- Update study_es_0_import_export with multiple timeline files
  A timeline file per EVENT_TYPE
- Update import export study to test HGVSp_Short, Tumor_Seq_Allele1 and Tumor_Seq_Allele2 parsing
- Expand unit test to check defaults for STYLE_COLOR and STYLE_SHAPE
- Document supported file formats by export
- Add Caveats section to the export README
- Add notes about timeline export particularities to README
- Document export particularities of Hugo_Symbol and Entrez_Gene_Id columns
- Add note about mutations filtering during the upload
- Blank out negative Entrez IDs
- Mention that we do not support namespaces meta tag for mutation, CNA and SV
- Fix order of custom timeline attributes so import/export test can pass after sorting rows and applying diff command to test equality
- Blank out negative Entrez IDs in SV data type
- Fix removing last row while exporting gene panel matrix
  Was happening only if there is only one panel on the last row
- Rename zip output stream writer from factory to service and move it to services package
- Remove unused local variable
- Rename Virtual Study aware service. Add Decorator to the name
  To document the pattern in the name
- Break down long VS export function to smaller ones
- Add javadoc to mappers of export functionality
- Use read-only transaction for export functionality
  It makes the code faster and ensures nothing gets changed accidentally
- Do not write Virtual Study definition file if user has no access to it
- Fix order of samples when exporting data for sample ids e.g. for Virtual Studies
  Previously the order of samples in the header changed arbitrarily!
- Introduce repositories layer
- Do not prepend number of samples to the meta description
  We decided that mentioning it in the meta study file is enough
- Export mutation strand as was imported
- Apply spotless style corrections after conflict resolution in session service classes
- Rename feature flag from dynamic_study_export_mode to feature.study.export
  There is a discussion to have feature.<highlevel>.<featureName> format for all feature flags
- Fix static and dynamic virtual study tests
- Break down long method in timeline exporter
- Remove commented code in export config
- Define and throw export exceptions with informative message
- Create and use DESCRIPTION constant instead of repeating it 3 times
- Make abstract exporter constructors protected
- Organise constructors and closing methods
- Reduce cognitive complexity of genetic alteration tsv exporter
- Refactor write tsv data method
- Add nested comment explaining empty methods
- Remove obsolete TODOs
- Refactor meta description update and path formation
- Remove use of stream peek in VS service
- Fix SonarQube reported issues for tests
- Update study_es_0_import_export to be consistent with study_es_0
- Add single-study virtual study integration test
- Add multi-study virtual study integration test
- Fix multi virtual study test
- Fix multi virtual study test
- Fix multi virtual study test
- Do not re-load test studies to test VS export
  Rely on data load in previous steps
- Remove unnecessary reloading of gene panels and gene sets data
1 parent d07c959 commit 54d9853

191 files changed: +12067 -218 lines changed


.github/workflows/integration-test.yml

Lines changed: 8 additions & 1 deletion
@@ -45,7 +45,9 @@ jobs:
  cat $PORTAL_SOURCE_DIR/src/main/resources/application.properties | \
  sed 's|spring.datasource.url=.*|spring.datasource.url=jdbc:mysql://cbioportal-database:3306/cbioportal?useSSL=false|' | \
  sed 's|spring.datasource.username=.*|spring.datasource.username=cbio_user|' | \
- sed 's|spring.datasource.password=.*|spring.datasource.password=somepassword|' \
+ sed 's|spring.datasource.password=.*|spring.datasource.password=somepassword|' | \
+ sed 's|session.service.url=.*|session.service.url=http://cbioportal-session:5001/api/sessions/my_portal/|' | \
+ sed 's|feature.study.export=.*|feature.study.export=true|' \
  > application.properties
  - name: 'Dump Properties'
  working-directory: ./cbioportal-docker-compose
@@ -71,6 +73,11 @@ jobs:
  working-directory: ./cbioportal-docker-compose
  run: |
  $PORTAL_SOURCE_DIR/test/integration/test_load_study.sh
+ - name: 'TEST - Import and Export of study_es_0_import_export'
+ if: steps.startup.conclusion == 'success'
+ working-directory: ./cbioportal-docker-compose
+ run: |
+ $PORTAL_SOURCE_DIR/test/integration/test_import_export.sh
  - name: 'TEST - Add OncoKB annotations to study'
  if: steps.startup.conclusion == 'success'
  working-directory: ./cbioportal-docker-compose

pom.xml

Lines changed: 4 additions & 0 deletions
@@ -426,6 +426,10 @@
  <version>1.5.1</version>
  <scope>provided</scope>
  </dependency>
+ <dependency>
+ <groupId>com.zaxxer</groupId>
+ <artifactId>HikariCP</artifactId>
+ </dependency>
  </dependencies>

  <dependencyManagement>
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Study Data Export

This package contains the code for exporting study data from the database to files. The export process involves several steps:
1. Retrieving the study data from the database.
2. Transforming the data into a suitable format for export.
3. Writing the transformed data to a file.

The implementation has minimal dependencies on the rest of the code, so that it stays lightweight, performant and easy to move to a separate web application if needed.
To keep RAM usage low, the code uses a streaming approach to read and write data. On the database side, it uses a cursor to read data in chunks, and on the web controller side, it uses a streaming response to write data in chunks.
This allows the code to handle large datasets without running out of memory.
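
The streaming idea described above can be sketched with the JDK alone. This is a minimal illustration, not the package's actual exporter: the class name `StreamingExportSketch`, the method `writeTsvEntry` and the entry name `data_clinical_patient.txt` are hypothetical, the `Iterator` stands in for a database cursor, and the `ByteArrayOutputStream` stands in for the HTTP response stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: rows are pulled one at a time and written straight to the zip stream,
// so memory use does not grow with the number of rows.
class StreamingExportSketch {

    /** Writes each row as a tab-separated line of one zip entry and returns the row count. */
    static long writeTsvEntry(ZipOutputStream zip, String entryName, Iterator<String[]> rows)
            throws IOException {
        zip.putNextEntry(new ZipEntry(entryName));
        long count = 0;
        while (rows.hasNext()) {
            // Only the current row is held in memory.
            String line = String.join("\t", rows.next()) + "\n";
            zip.write(line.getBytes(StandardCharsets.UTF_8));
            count++;
        }
        zip.closeEntry();
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a database cursor (assumption for illustration).
        Iterator<String[]> rows = List.of(
                new String[] {"PATIENT_ID", "AGE"},
                new String[] {"P1", "61"}).iterator();
        // Stand-in for the HTTP response output stream (assumption for illustration).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(sink)) {
            long written = writeTsvEntry(zip, "data_clinical_patient.txt", rows);
            System.out.println(written + " rows written");
        }
    }
}
```

In a web controller the sink would be the response output stream, so bytes reach the client while later rows are still being read.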

## Usage

Set `feature.study.export` to `true` in the application properties file to enable the dynamic study export mode.
In this mode, a study can be exported via the `/export/study/{studyId}.zip` link.
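
A client-side download of the export link might look like the sketch below. It uses the JDK 11+ `java.net.http.HttpClient`; the class `ExportDownloadSketch`, the base URL `http://localhost:8080` and the assumption that the study ID is URL-safe are illustrative, and authentication is not covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical client sketch for the /export/study/{studyId}.zip link.
class ExportDownloadSketch {

    /** Builds the export URI; assumes studyId contains only URL-safe characters. */
    static URI exportUri(String baseUrl, String studyId) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return URI.create(base + "/export/study/" + studyId + ".zip");
    }

    /** Streams the zip to the target file; fails on any non-200 status (e.g. 404 for an unknown study). */
    static Path download(String baseUrl, String studyId, Path target)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(exportUri(baseUrl, studyId)).GET().build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            if (response.statusCode() != 200) {
                throw new IOException("Export failed with HTTP status " + response.statusCode());
            }
            Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

For example, `ExportDownloadSketch.download("http://localhost:8080", "study_es_0", Path.of("study_es_0.zip"))` would save the archive locally. Since the commit message mentions that the zip is corrupted intentionally when an export fails midway, a client may also want to verify that the archive opens before using it.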

## 10 minute timeout

By default, an export request is terminated if it takes longer than 10 minutes. This ensures that the export process does not block the server for too long and prevents resource exhaustion.
To change the timeout, set the `feature.study.export.timeout_ms` property in the application properties file. The value is in milliseconds, and the default value is `600000` (10 minutes).
Setting it to `-1` disables the timeout and allows the export process to run indefinitely. However, this is not recommended, as it can lead to resource exhaustion and performance issues.
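
The timeout semantics above can be summarised in a few lines of Java. This is only a sketch of the documented rules, not how the application actually binds the property: `ExportTimeoutSketch` and `resolveTimeout` are hypothetical names, and rejecting negative values other than `-1` is an assumption of this example.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Properties;

// Hypothetical sketch of the documented timeout rules.
class ExportTimeoutSketch {
    static final String TIMEOUT_PROPERTY = "feature.study.export.timeout_ms";
    static final long DEFAULT_TIMEOUT_MS = 600_000L; // 10 minutes, the documented default

    /** Returns the configured timeout, or an empty Optional when -1 disables it. */
    static Optional<Duration> resolveTimeout(Properties props) {
        String raw = props.getProperty(TIMEOUT_PROPERTY, Long.toString(DEFAULT_TIMEOUT_MS)).trim();
        long ms = Long.parseLong(raw);
        if (ms == -1L) {
            return Optional.empty(); // no timeout: the export may run indefinitely
        }
        if (ms < 0) {
            // Assumption of this sketch: other negative values are treated as a configuration error.
            throw new IllegalArgumentException(TIMEOUT_PROPERTY + " must be -1 or non-negative: " + ms);
        }
        return Optional.of(Duration.ofMillis(ms));
    }
}
```

With no property set, `resolveTimeout` yields 10 minutes; a value of `1800000` yields 30 minutes.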

## Supported Formats

The following formats are supported for export:

| GENETIC_ALTERATION_TYPE | DATATYPE | SUPPORTED |
|---|---|---|
| CANCER_TYPE | CANCER_TYPE | Yes |
| CLINICAL | PATIENT_ATTRIBUTES | Yes |
| CLINICAL | SAMPLE_ATTRIBUTES | Yes |
| CLINICAL | TIMELINE | Yes |
| PROTEIN_LEVEL | LOG2-VALUE | Yes |
| PROTEIN_LEVEL | Z-SCORE | Yes |
| PROTEIN_LEVEL | CONTINUOUS | Yes |
| COPY_NUMBER_ALTERATION | DISCRETE | Yes |
| COPY_NUMBER_ALTERATION | CONTINUOUS | Yes |
| COPY_NUMBER_ALTERATION | DISCRETE_LONG | No |
| COPY_NUMBER_ALTERATION | LOG2-VALUE | Yes |
| COPY_NUMBER_ALTERATION | SEG | Yes |
| MRNA_EXPRESSION | CONTINUOUS | Yes |
| MRNA_EXPRESSION | Z-SCORE | Yes |
| MRNA_EXPRESSION | DISCRETE | Yes |
| MUTATION_EXTENDED | MAF | Yes |
| MUTATION_UNCALLED | MAF | Yes |
| METHYLATION | CONTINUOUS | Yes |
| GENE_PANEL_MATRIX | GENE_PANEL_MATRIX | Yes |
| STRUCTURAL_VARIANT | SV | Yes |
| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | LIMIT-VALUE | Yes |
| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | BINARY | Yes |
| GENERIC_ASSAY (sample level only, PATIENT_LEVEL: false) | CATEGORICAL | Yes |
| Cancer study meta file | | Yes |
| Case lists | | Yes |
| GISTIC_GENES_AMP | Q-VALUE | No |
| GISTIC_GENES_DEL | Q-VALUE | No |
| MUTSIG | Q-VALUE | No |
| GENESET_SCORE | GSVA-SCORE | No |
| GENESET_SCORE | P-VALUE | No |
| Study tags | | No |
| Resource Definition | | No |
| Study Resource | | No |
| Patient Resource | | No |
| Sample Resource | | No |

### `namespaces` meta property is not supported

Mutation, CNA and SV data have a `namespaces` meta property that provides a way to load arbitrary data into cBioPortal.
Exporting this data is not supported at the moment. It can be added later if needed.

## Caveats

The exported study data files won't look exactly the same as the original study data files.

## What's lost in translation?

- If your data includes `Hugo_Symbol` but not `Entrez_Gene_Id`, cBioPortal will try to find the matching gene using its database. As a result, the exported data might include `Hugo_Symbol` values that weren’t in your original file; these could be related gene names that replace gene aliases found in your data.
- The export always adds both `Hugo_Symbol` and `Entrez_Gene_Id` with complete values, even if the original file had only one column or was missing some values.
- The cBioPortal loader filters out certain mutations (e.g. non-coding mutations), so the exported MAF file may not include all mutations from the original file.
- The exported files will not keep the original file names; the file names are generated based on the data type.
- `TIMELINE` data is exported as one file per `EVENT_TYPE`, regardless of how the original files were structured.
- If `STYLE_COLOR` or `STYLE_SHAPE` columns are present in the timeline data, events that have no value for them get the default values:
  - `STYLE_COLOR` will be set to `#1f77b4` (light blue).
  - `STYLE_SHAPE` will be set to `circle`.
  - These values are used by default by cBioPortal to render the timeline events in the UI.
- `DISCRETE_LONG` will not be exported as such, because nothing in the database marks the data as long. Instead, it will be exported as `DISCRETE`.
- `HGVSp_Short` of the MAF file will be computed from `mutation_event`.`PROTEIN_CHANGE` by adding the `p.` prefix (if it's not `MUTATED`).
  - The protein change could have been read from `Amino_Acid_Change` as a fallback field in the original files, but there is no way of knowing where the protein change was originally parsed from.
  - As `Amino_Acid_Change` can contain a value that is not valid HGVSp, you might end up with an `HGVSp_Short` that is not a valid HGVSp value. However, this should not stop you from loading the file into cBioPortal and getting the protein change parsed correctly.
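
Two of the caveats above, the timeline style defaults and the `HGVSp_Short` derivation, can be sketched as small helpers. The class and method names are hypothetical and this is not the exporter's actual code; in particular, returning an empty string for a blank or `MUTATED` protein change is an assumption, since the README only says that the `p.` prefix is not added in the `MUTATED` case.

```java
// Hypothetical helpers illustrating two export caveats.
class ExportCaveatsSketch {
    static final String DEFAULT_STYLE_COLOR = "#1f77b4"; // default stated in the README
    static final String DEFAULT_STYLE_SHAPE = "circle";  // default stated in the README

    /** Substitutes the default when the stored value is null or blank. */
    static String valueOrDefault(String value, String defaultValue) {
        return (value == null || value.isBlank()) ? defaultValue : value;
    }

    /**
     * Derives HGVSp_Short from PROTEIN_CHANGE by adding the "p." prefix.
     * Assumption of this sketch: null, blank and MUTATED values yield an empty string.
     */
    static String toHgvspShort(String proteinChange) {
        if (proteinChange == null || proteinChange.isBlank()
                || "MUTATED".equals(proteinChange.trim())) {
            return "";
        }
        return "p." + proteinChange.trim();
    }
}
```

For example, `toHgvspShort("V600E")` gives `p.V600E`, and `valueOrDefault(null, DEFAULT_STYLE_SHAPE)` gives `circle`.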
