Commit 88df644

Merge pull request #91 from itrujnara/dev
Address comments from Austyn's review
2 parents fb4359a + 3dbb5a9 commit 88df644

44 files changed, 142 additions and 316 deletions

CITATIONS.md

Lines changed: 7 additions & 17 deletions
@@ -34,31 +34,21 @@

 > Huang H, McGarvey PB, Suzek BE, Mazumder R, Zhang J, Chen Y, Wu CH. A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics. 2011 Apr 15;27(8):1190-1. doi: 10.1093/bioinformatics/btr101. PMID: 21478197; PMCID: PMC3072559.

-- [AlphaFold](https://deepmind.google/technologies/alphafold)
+- [Diamond](https://github.com/bbuchfink/diamond)

-> Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
+> Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x

-- [AlphaFold Database](https://alphafold.ebi.ac.uk)
+- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/)

-> Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D439–D444, https://doi.org/10.1093/nar/gkab1061
+> O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45

-- [T-COFFEE](https://tcoffee.org)
+- [Ensembl](https://www.ensembl.org)

-> Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17. doi: 10.1006/jmbi.2000.4042. PMID: 10964570.
-
-- [IQTREE](https://iqtree.org)
-
-> B.Q. Minh, H.A. Schmidt, O. Chernomor, D. Schrempf, M.D. Woodhams, A. von Haeseler, R. Lanfear (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534. https://doi.org/10.1093/molbev/msaa015
-
-> D.T. Hoang, O. Chernomor, A. von Haeseler, B.Q. Minh, L.S. Vinh (2018) UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol., 35:518–522. https://doi.org/10.1093/molbev/msx281
-
-- [FastME](https://atgc-montpellier.fr/fastme/)
-
-> Vincent Lefort, Richard Desper, Olivier Gascuel, FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program, Molecular Biology and Evolution, Volume 32, Issue 10, October 2015, Pages 2798–2800, https://doi.org/10.1093/molbev/msv150
+> Sarah C Dyer, Olanrewaju Austine-Orimoloye, Andrey G Azov, Matthieu Barba, If Barnes, Vianey Paola Barrera-Enriquez, Arne Becker, Ruth Bennett, Martin Beracochea, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Lahcen I Campbell, Manuel Carbajo Martinez, Mehrnaz Charkhchi, Lucas A Cortes, Claire Davidson, Sukanya Denni, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Oluwadamilare Falola, Reham Fatima, Thiago Genez, Jose Gonzalez Martinez, Tatiana Gurbich, Matthew Hardy, Zoe Hollis, Toby Hunt, Mike Kay, Vinay Kaykala, Diana Lemos, Disha Lodha, Nourhen Mathlouthi, Gabriela Alejandra Merino, Ryan Merritt, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, José G Pérez-Silva, Malcolm Perry, Ivana Piližota, Daniel Poppleton, Irina Prosovetskaia, Shriya Raj, Ahamed Imran Abdul Salam, Shradha Saraf, Nuno Saraiva-Agostinho, Swati Sinha, Botond Sipos, Vasily Sitnik, Emily Steed, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Francesca Floriana Tricomi, Ian Tsang, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Natalie L Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Jitender Cheema, Jorge Batista da Rocha, Nishadi H De Silva, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jon Keatley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, Guy Naamati, John Tate, Stephen J Trevanion, Andrea Winterbottom, Bethany Flint, Adam Frankish, Sarah E Hunt, Robert D Finn, Mallory A Freeberg, Peter W Harrison, Fergal J Martin, and Andrew D Yates. Ensembl 2025. Nucleic Acids Res. 2025, 53(D1):D948–D957. PMID: 39656687

 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

-> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

 ## Software packaging/containerisation tools

bin/clustal2fasta.py

Lines changed: 0 additions & 31 deletions
This file was deleted.

bin/clustal2phylip.py

Lines changed: 0 additions & 31 deletions
This file was deleted.

bin/csv_adorn.py

Lines changed: 4 additions & 3 deletions
@@ -3,13 +3,14 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Convert a list of IDs into a CSV file with a header.
+
+This is required for csv merge to work."""
+
 import sys


 def csv_adorn(path: str, header: str) -> None:
-    """
-    Convert a list of IDs into a CSV file with a header. Used for later table merge.
-    """
     print(f"id,{header}")
     with open(path) as f:
         any_data = False
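
For orientation, the behaviour described by the new module docstring can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual body: the hunk only shows the `id,{header}` header line and the `any_data` flag, so the helper name `adorn_lines` and the `1` written in the second column are hypothetical.

```python
from typing import Iterable


def adorn_lines(ids: Iterable[str], header: str) -> list[str]:
    """Return CSV lines: a two-column header, then one row per non-empty ID.

    Hypothetical helper. The "1" marker in the second column is an
    assumption; the real script may write a different value.
    """
    rows = [f"id,{header}"]
    for raw in ids:
        identifier = raw.strip()
        if identifier:  # skip blank lines in the input list
            rows.append(f"{identifier},1")
    return rows


if __name__ == "__main__":
    # Example IDs are placeholders chosen for illustration only.
    print("\n".join(adorn_lines(["P05067", "", "Q9Y6K9\n"], "oma")))
```

Giving every per-database ID list the same `id` column is what lets a later CSV merge join the tables on that key.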

bin/ensembl2uniprot.py

Lines changed: 3 additions & 3 deletions
@@ -3,15 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Convert Ensembl IDs to UniProt IDs using the UniProt mapping API."""
+
 import sys

 from utils import check_id_mapping_results_ready, safe_get, safe_post


 def ensembl2uniprot(ensembl_ids: list[str]) -> list[str]:
-    """
-    Convert a list of Ensembl IDs to UniProt IDs using the UniProt mapping API.
-    """
+    """Convert a list of Ensembl IDs to UniProt IDs using the UniProt mapping API."""
     if len(ensembl_ids) == 0:
         return []

bin/fetch_afdb_structures.py

Lines changed: 0 additions & 58 deletions
This file was deleted.

bin/fetch_ensembl_idmap.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch Ensembl species identifiers and their NCBI taxon IDs from the Ensembl API."""
+
 import requests


bin/fetch_ensembl_sequences.py

Lines changed: 4 additions & 0 deletions
@@ -3,12 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch protein sequences from Ensembl using the Ensembl REST API."""
+
 import csv
 import sys

 from utils import list_to_file, safe_post, SequenceInfo, split_ids

 def fetch_slice(ids: list[str], idmap: dict[str,str]) -> list[SequenceInfo]:
+    """Fetch taxon IDs and sequences for given protein IDs from Ensembl."""
     hits = {}
     # fetch taxon information
     payload = {"ids": ids}
@@ -43,6 +46,7 @@ def fetch_slice(ids: list[str], idmap: dict[str,str]) -> list[SequenceInfo]:


 def fetch_ensembl(ids: list[str], idmap_path: str) -> list[SequenceInfo]:
+    """Fetch taxon IDs and sequences for given protein IDs from Ensembl in slices of 100."""
     taxon_map = {}
     with open(idmap_path) as f:
         for it in csv.reader(f):

bin/fetch_inspector_group.py

Lines changed: 3 additions & 3 deletions
@@ -3,15 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch orthologs for a given UniProt ID from the OrthoInspector database."""
+
 import sys

 from utils import safe_get


 def fetch_inspector_by_id(uniprot_id: str, db_id: str = "Eukaryota2019") -> None:
-    """
-    Fetch orthologs for a given UniProt ID from the OrthoInspector database.
-    """
+    """Fetch orthologs for a given UniProt ID from the OrthoInspector database."""
     url = f"https://lbgi.fr/api/orthoinspector/{db_id}/protein/{uniprot_id}/orthologs"
     res = safe_get(url)

bin/fetch_oma_by_sequence.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch OMA entry for a given protein sequence from the OMA browser API."""
+
 import sys
 from warnings import warn

bin/fetch_oma_group.py

Lines changed: 2 additions & 3 deletions
@@ -3,15 +3,14 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch members of an OMA group by ID."""
+
 import sys
 from warnings import warn
 from utils import safe_get


 def main() -> None:
-    """
-    Fetch members of an OMA group by ID.
-    """
     if len(sys.argv) < 2:
         raise ValueError("Too few arguments. Usage: fetch_oma_group_by_id.py <id>")

bin/fetch_oma_groupid.py

Lines changed: 2 additions & 3 deletions
@@ -3,16 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Get OMA group ID from a UniProt ID."""
+
 import sys
 from warnings import warn

 from utils import safe_get


 def main() -> None:
-    """
-    Get OMA group ID from a UniProt ID.
-    """
     if len(sys.argv) < 2:
         raise ValueError("Not enough arguments. Usage: fetch_oma_groupid.py <filename>")

bin/fetch_oma_sequences.py

Lines changed: 4 additions & 3 deletions
@@ -3,15 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch protein sequences from the OMA database using the OMA REST API."""
+
 import sys

 from utils import list_to_file, safe_post, SequenceInfo, split_ids


 def fetch_slice(ids: list[str]) -> list[SequenceInfo]:
-    """
-    Fetch sequences for given UniProt IDs from the OMA database.
-    """
+    """Fetch sequences for given UniProt IDs from the OMA database."""
     payload = {"ids": ids}

     res = safe_post("https://omabrowser.org/api/protein/bulk_retrieve/", json=payload)
@@ -31,6 +31,7 @@ def fetch_slice(ids: list[str]) -> list[SequenceInfo]:


 def fetch_seqs_oma(ids: list[str]) -> list[SequenceInfo]:
+    """Fetch sequences for given UniProt IDs from the OMA database in slices of 100."""
     seqs = []
     for s in split_ids(ids, 100):
         seqs = seqs + fetch_slice(s)
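
The "slices of 100" pattern named in the new docstrings can be sketched as below. The body of `utils.split_ids` is not part of this diff, so `split_ids_sketch` is a hypothetical stand-in that assumes it yields consecutive chunks of at most `size` IDs. `fetch_in_slices` is likewise an illustrative wrapper, not a function from the repository.

```python
from typing import Callable, Iterator


def split_ids_sketch(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` IDs.

    Hypothetical stand-in for utils.split_ids, whose real body is not shown.
    """
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_slices(
    ids: list[str],
    fetch_slice: Callable[[list[str]], list],
    size: int = 100,
) -> list:
    """Call `fetch_slice` once per chunk and concatenate the results in order."""
    results: list = []
    for chunk in split_ids_sketch(ids, size):
        results.extend(fetch_slice(chunk))
    return results


if __name__ == "__main__":
    # A fake fetcher stands in for the network call so the sketch runs offline.
    demo_ids = [f"ID{i}" for i in range(250)]
    fetched = fetch_in_slices(demo_ids, lambda chunk: [i.lower() for i in chunk])
    print(len(fetched))
```

Chunking keeps each bulk request at a bounded size, presumably because the remote APIs limit how many IDs one request may carry. Using `extend` also avoids rebuilding the result list on every iteration, unlike `seqs = seqs + fetch_slice(s)`.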

bin/fetch_oma_taxid_by_id.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch OMA taxon ID by UniProt ID."""
+
 import sys
 from warnings import warn

bin/fetch_panther_group.py

Lines changed: 2 additions & 3 deletions
@@ -3,16 +3,15 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch members of a Panther group by ID."""
+
 import sys
 from warnings import warn

 from utils import safe_get


 def main() -> None:
-    """
-    Fetch members of a Panther group by ID.
-    """
     if len(sys.argv) < 3:
         raise ValueError("Too few arguments. Usage: fetch_panther_group.py <id> <organism>")

bin/fetch_refseq_sequences.py

Lines changed: 7 additions & 0 deletions
@@ -3,6 +3,8 @@
 # Written by Igor Trujnara, released under the MIT license
 # See https://opensource.org/license/mit for details

+"""Fetch protein sequences from the RefSeq database using the NCBI eutils API."""
+
 import sys
 from xml.dom import minidom

@@ -11,21 +13,25 @@


 def get_taxid(node: minidom.Element) -> str:
+    """Extract the taxid from the XML object."""
     taxid = node.getElementsByTagName("TSeq_taxid")[0].firstChild.wholeText
     return taxid


 def get_sequence(node: minidom.Element) -> str:
+    """Extract the sequence from the XML object."""
     seq = node.getElementsByTagName("TSeq_sequence")[0].firstChild.wholeText
     return seq


 def get_prot_id(node: minidom.Element) -> str:
+    """Extract the protein ID from the XML object."""
     prot_id = node.getElementsByTagName("TSeq_accver")[0].firstChild.wholeText.split(".")[0]
     return prot_id


 def fetch_slice(ids: list[str], db: str = "protein") -> list[SequenceInfo]:
+    """Fetch sequences for given protein IDs from the RefSeq database."""
     id_string = ",".join(ids)
     fasta = Entrez.efetch(db=db, id=id_string, rettype="fasta", retmode="xml")
     seqs = minidom.parse(fasta).getElementsByTagName("TSeq")
@@ -35,6 +41,7 @@ def fetch_slice(ids: list[str], db: str = "protein") -> list[SequenceInfo]:


 def fetch_sequences(ids: list[str], db: str = "protein") -> list[SequenceInfo]:
+    """Fetch sequences for given protein IDs from the RefSeq database in slices of 100."""
     seqs = []
     for s in split_ids(ids, 100):
         seqs += fetch_slice(s, db)
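
What the three newly documented extractor helpers do can be illustrated offline with a small hand-written XML fragment. The sample document, its accession `NP_000000.1`, and the helper names `text_of` and `parse_tseq` are made up for this sketch. Only the tag names (`TSeq`, `TSeq_accver`, `TSeq_taxid`, `TSeq_sequence`) and the `.split(".")[0]` version stripping come from the diff.

```python
from xml.dom import minidom

# Hand-written sample shaped like the response the script parses; all values
# are placeholders, not real records.
SAMPLE_XML = """<TSeqSet>
  <TSeq>
    <TSeq_accver>NP_000000.1</TSeq_accver>
    <TSeq_taxid>9606</TSeq_taxid>
    <TSeq_sequence>MKTAYIAKQR</TSeq_sequence>
  </TSeq>
</TSeqSet>"""


def text_of(node: minidom.Element, tag: str) -> str:
    """Return the text of the first descendant element named `tag`."""
    return node.getElementsByTagName(tag)[0].firstChild.wholeText


def parse_tseq(xml_text: str) -> list[dict[str, str]]:
    """Extract protein ID (version suffix dropped), taxid and sequence per TSeq."""
    doc = minidom.parseString(xml_text)
    records = []
    for node in doc.getElementsByTagName("TSeq"):
        records.append(
            {
                "prot_id": text_of(node, "TSeq_accver").split(".")[0],
                "taxid": text_of(node, "TSeq_taxid"),
                "sequence": text_of(node, "TSeq_sequence"),
            }
        )
    return records


if __name__ == "__main__":
    for record in parse_tseq(SAMPLE_XML):
        print(record)
```

The sketch returns plain dictionaries rather than the repository's `SequenceInfo` type, whose definition is not shown in this diff.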
