Question about sgt_constructor (stuck at 100% w/o error) - mafft/dvtditr #135

Open
Edouard94 opened this issue May 2, 2025 · 11 comments
@Edouard94

Dear @robert-ervin-jones,

I would like some insight into my latest PhyloFisher run, specifically the sgt_constructor.py step.

This is the code I have run so far (for this run, I copied a custom database folder from another user into my own user directory):

# Make config file
cat <<EOF > config.ini
[PATHS]
database_folder = /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/database/
input_file = /home/Shared/MicrosporidiaInInsectGenomes/input_metadata_final_2025.tsv
orthomcl = /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/database/orthomcl
color_conf = /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/database/tree_colors.tsv
EOF

# Run PhyloFisher
# Working directory: /home/Shared/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025

conda activate fisher

## Copy /home/Shared/PhyloFisher_MicrosporidiainInsectGenomes/ in /home/edouard/

## Reset database
backup_restoration.py -d ~/PhyloFisher_MicrosporidiainInsectGenomes/database --restore 1

## Build the database (run script in ~/PhyloFisher_MicrosporidiainInsectGenomes/database)
build_database.py -t 20

## Collect putative homologs from input taxa 
nohup fisher.py -t 18 > fisher_out.log 2>&1 &

## Produce preliminary statistics about newly input data
informant.py -i fisher_out_Mar.30.2025/

## Collect taxa, and homologs for gene tree construction
working_dataset_constructor.py -i fisher_out_Mar.30.2025/

## Construct gene trees
nohup sgt_constructor.py -i working_dataset_constructor_out_Mar.31.2025 -t 18 > sgt_out.log 2>&1 &

The sgt_constructor.py step seems to take a long time and I just wanted to check if this was normal.

These were the last lines of my sgt_out.log:

[Wed Apr 30 07:51:35 2025]
Finished job 529.
955 of 960 steps (99%) done
Select jobs to execute...

[Wed Apr 30 07:51:35 2025]
rule length_filter_bmge:
    input: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/length_filtration/bmge/srprNP586132.pre_bmge
    output: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/length_filtration/bmge/srprNP586132.bmge
    log: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/logs/length_filter_bmge/srprNP586132.log
    jobid: 528
    wildcards: gene=srprNP586132

Activating conda environment: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/.snakemake/conda/bf565f4a0de62ac5a13b463e75fe6256
[Wed Apr 30 07:51:36 2025]
Finished job 528.
956 of 960 steps (100%) done

No new steps have been logged since Wed Apr 30, but a mafft/dvtditr command is still running in the background:

/home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/.snakemake/conda/d702536c412493b8b3d4fedd22532d8d/libexec/mafft/dvtditr -W 0.00001 -E 0.0 -s 0.6 -C 1 -t 0 -F -l 2.7 -z 50 -b 62 -: -f -1.53 -Q 100.0 -h 0 -I 16 -X 0.1 -p BAATARI2 -K 0

Is this normal behaviour for sgt_constructor?

So far I have these output files in my sgt folder:

prequal/
logs/
length_filtration/

Here is the full sgt_constructor.py log: sgt_out.log

Thank you for your insights on this, Robert.

Best wishes,
Edouard

@robert-ervin-jones
Member

Hi @Edouard94,

Thank you for providing this well-documented information. What happens when you resubmit the job? It's strange that 956 of 960 steps completed but there are no errors in the log file. Rerunning should at least show which genes are problematic and at which step; without error messages, that is hard to tease out of the current log.

Best,
Robert

@Edouard94
Author

Edouard94 commented May 7, 2025

Hi @robert-ervin-jones,

Thanks for your quick response.

After running the sgt_constructor script twice, it seems to stop at the gene srprNP586132 both times.

Run 1:

[Fri Apr  4 21:53:35 2025]
rule length_filter_bmge:
    input: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Mar.31.2025/length_filtration/bmge/srprNP586132.pre_bmge
    output: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Mar.31.2025/length_filtration/bmge/srprNP586132.bmge
    log: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Mar.31.2025/logs/length_filter_bmge/srprNP586132.log
    jobid: 702
    wildcards: gene=srprNP586132

Activating conda environment: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/.snakemake/conda/bf565f4a0de62ac5a13b463e75fe6256
[Fri Apr  4 21:53:36 2025]
Finished job 702.
956 of 960 steps (100%) done

Run 2:

[Wed Apr 30 07:51:35 2025]
rule length_filter_bmge:
    input: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/length_filtration/bmge/srprNP586132.pre_bmge
    output: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/length_filtration/bmge/srprNP586132.bmge
    log: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/sgt_constructor_out_Apr.25.2025/logs/length_filter_bmge/srprNP586132.log
    jobid: 528
    wildcards: gene=srprNP586132

Activating conda environment: /home/edouard/PhyloFisher_MicrosporidiainInsectGenomes/WGS_TSA_March2025/.snakemake/conda/bf565f4a0de62ac5a13b463e75fe6256
[Wed Apr 30 07:51:36 2025]
Finished job 528.
956 of 960 steps (100%) done

The mafft/dvtditr process from my last run (run 2, started on the 30th of April) is still running in the background.

I am attaching the results of the last run here (link expires in 3 days): https://we.tl/t-R4mk7eOSmq

And here is the orthologue (database) FASTA file for this specific gene: srprNP586132.txt

Let me know if I can share other files.

Thanks again for your help!

@Edouard94 Edouard94 changed the title Question about sgt_constructor - mafft/dvtditr Question about sgt_constructor (stops at 100%) - mafft/dvtditr May 7, 2025
@Edouard94 Edouard94 changed the title Question about sgt_constructor (stops at 100%) - mafft/dvtditr Question about sgt_constructor (stops at 100% w/o error) - mafft/dvtditr May 7, 2025
@robert-ervin-jones
Member

Hi @Edouard94,

Can you send me the output from your job scheduler?

@Edouard94
Author

Hi Robert,

Do you mean this one:
sgt_out.log

@robert-ervin-jones
Member

Yes. Thank you! I think what is happening is that you still have an instance running since you used nohup. I think srprNP586132 is fine. There must be one gene still trying to get through mafft.

@Edouard94
Author

Edouard94 commented May 7, 2025

Yes, I think you are right. Would that also explain the mafft/dvtditr command running in the background (for days)?

Should I try to run the script without nohup?

@robert-ervin-jones
Member

I don't think that's necessary. I think you will just need to wait for it to finish. It is not uncommon for mafft to take a really long time on files with many and/or long sequences.
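
For anyone who wants to check what a long-running dvtditr process is working on while waiting, here is a minimal, Linux-only sketch that prints a process's command line and its parent's command line via /proc. The function name describe_proc is made up for illustration. It assumes the parent is the mafft wrapper script and that its arguments name the input file, which may not hold for every setup.

```shell
# Linux-only sketch: describe a running process using /proc.
# Prints the process's own command line, then its parent's command line.
# describe_proc is a hypothetical helper, not part of mafft or PhyloFisher.
describe_proc() {
    pid="$1"
    [ -r "/proc/$pid/status" ] || { echo "no such process: $pid" >&2; return 1; }
    # /proc/PID/cmdline is NUL-separated; turn the separators into spaces.
    printf 'process %s: %s\n' "$pid" "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
    # The PPid field of /proc/PID/status holds the parent's process ID.
    ppid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status")
    printf 'parent  %s: %s\n' "$ppid" "$(tr '\0' ' ' < "/proc/$ppid/cmdline")"
}
```

It would be called as `describe_proc <PID>`, with the PID taken from `top` or from `pgrep -f dvtditr`.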

@Edouard94
Author

Ok good to know, I will update you when the mafft/dvtditr command comes to an end!

Thank you for your help Robert.

@robert-ervin-jones
Member

No problem! Just for your reference, it seems to be rpo-CNP585937 that is still in mafft.
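
A rough sketch of how a still-running job can be picked out of a Snakemake log such as sgt_out.log: collect every jobid that was announced, then drop those that have a matching "Finished job N." line. It assumes the log layout quoted earlier in this thread (a "rule NAME:" header followed by indented jobid and wildcards lines). unfinished_jobs is a hypothetical helper name, not part of PhyloFisher or Snakemake.

```shell
# List Snakemake jobs that were started but never reported as finished.
# Output: one line per job, "<jobid>\t<rule>\t<wildcards>", sorted by jobid.
unfinished_jobs() {
    awk '
        /^(rule|localrule|checkpoint) / {
            name = $2; sub(/:$/, "", name)    # start of a new job block
            id = ""; wc = "-"
            next
        }
        /^[ \t]+jobid:/ {
            id = $2
            rule[id] = name                   # remember which rule this job runs
            wcs[id] = wc
            next
        }
        /^[ \t]+wildcards:/ {
            wc = $0; sub(/^[ \t]+wildcards:[ \t]*/, "", wc)
            if (id != "") wcs[id] = wc        # e.g. gene=srprNP586132
            next
        }
        /^Finished job [0-9]+\.$/ {
            done_id = $3; sub(/\.$/, "", done_id)
            delete rule[done_id]              # finished, so no longer pending
            next
        }
        END {
            for (j in rule) printf "%s\t%s\t%s\n", j, rule[j], wcs[j]
        }
    ' "$1" | sort -n
}
```

Running `unfinished_jobs sgt_out.log` would then print only the jobs that have a start entry but no finish entry, together with their gene wildcard.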

@Edouard94
Author

Edouard94 commented May 7, 2025

Ah nice, did you find that out from the log?

It is the biggest ortholog file in the database, so it makes sense. And maybe the most prevalent gene in the input proteomes as well?

@Edouard94 Edouard94 changed the title Question about sgt_constructor (stops at 100% w/o error) - mafft/dvtditr Question about sgt_constructor (stuck at 100% w/o error) - mafft/dvtditr May 7, 2025
@robert-ervin-jones
Member

I did. It could be. I didn't check to see how many sequences there were in that file.
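
Counting the sequences in each file is a quick way to check this. The sketch below is one way to do it, assuming plain FASTA input in which every record starts with a ">" header line. fasta_sizes is an illustrative name, and the path in the usage line is a placeholder.

```shell
# Print "<sequence count>\t<total residues>\t<file>" for each FASTA file given,
# with the file holding the most sequences listed first.
fasta_sizes() {
    for f in "$@"; do
        awk -v file="$f" '
            /^>/ { n++; next }                          # header line: one more sequence
            { gsub(/[ \t\r]/, ""); len += length($0) }  # sequence line: add its residues
            END { printf "%d\t%d\t%s\n", n, len, file }
        ' "$f"
    done | sort -k1,1nr
}
```

For example, `fasta_sizes path/to/fasta/files/*` would show at a glance whether one gene's file is far larger than the rest.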
