conkit-plot peval fails to match sequences #96
Comments
Hi @sadiogo, I think the problem here might be that you are providing an MSA file (conkit.jones) instead of a FASTA file with the sequence of the protein of interest. Have you tried:
Where |
Hi @FilomenoSanchez, I did try that. I get the following traceback when I use a single fasta sequence (ungapped) :
When I use a single fasta sequence, but gapped, I get the same By the way, I provided my own alignment when I used conkit-predict to generate the .mat and .rr files. Does this affect anything? |
The traceback you have shown in your last comment occurs when the residue numbering in the contact map and the sequence do not match, particularly when the sequence in the FASTA file is shorter than it should be. For example, the sequence in the FASTA file has 100 residues but the residue numbers in the contact map go all the way up to 200. I think this might have occurred because you provided a gapped alignment as input for
|
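To make the mismatch described above concrete, here is a minimal sketch in plain Python (not ConKit code). The helper name `contacts_outside_sequence` and the toy data are made up for illustration, and it assumes contacts are 1-based residue pairs.

```python
def contacts_outside_sequence(contacts, sequence):
    """Return the contact pairs that refer to residues beyond the sequence.

    contacts: iterable of (res1, res2) pairs, numbered from 1 (assumption).
    sequence: ungapped one-letter sequence string.
    """
    length = len(sequence)
    return [
        (i, j)
        for i, j in contacts
        if not (1 <= i <= length and 1 <= j <= length)
    ]


sequence = "MKTAYIAKQR"               # toy sequence with 10 residues
contacts = [(1, 5), (2, 9), (3, 12)]  # residue 12 does not exist in the sequence
print(contacts_outside_sequence(contacts, sequence))  # [(3, 12)]
```

If this kind of check returns anything, the sequence file and the contact map were most likely not numbered against the same sequence.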
Thanks, that worked! In retrospect, I had a gapped fasta alignment which I converted to a3m using |
I am not sure whether the A3M format specification explicitly prohibits gaps in the first sequence of the alignment. In any case, it is not trivial to remove gaps from the template sequence when converting from FASTA to A3M, as this would require recalculating the alignment with all the sequences in the MSA all over again (AFAIK), and that is well beyond conkit's original purpose. I think the only realistic fix here would be to make conkit interpret gaps in the FASTA sequence just as if they were unknown residues. |
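As a rough illustration of the two options discussed here, the sketch below contrasts replacing gaps with an unknown-residue symbol against stripping them. This is plain Python, not ConKit's behaviour; the function names and the choice of `X` as the unknown symbol are assumptions for the example.

```python
GAP_CHARS = "-."  # characters treated as gaps in this sketch


def gaps_to_unknown(sequence, unknown="X"):
    """Replace every gap with an unknown-residue symbol (length is preserved)."""
    return "".join(unknown if c in GAP_CHARS else c for c in sequence)


def strip_gaps(sequence):
    """Remove gaps entirely (the sequence gets shorter, so numbering shifts)."""
    return "".join(c for c in sequence if c not in GAP_CHARS)


gapped = "MK--TAYI"
print(gaps_to_unknown(gapped))  # MKXXTAYI
print(strip_gaps(gapped))       # MKTAYI
```

Replacing keeps the residue numbering aligned with the alignment columns, whereas stripping changes the length and therefore the numbering of everything after the first gap.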
My knowledge on a3m files is entirely based on my experience using the MPI Bioinformatics Toolkit. From what I've seen, the first sequence is always gapless (which is propitious, since it is generally the query sequence). In any case, they have a service there called |
Yes indeed it seems that the A3M includes a gapless query sequence (my knowledge about this format is also limited). The |
One last question, @FilomenoSanchez. When plotting the reference structure to assess false positive contacts, the structure being added has to satisfy one or both of these conditions (?):
I think these questions also apply to ConPlot? (very nice server, btw) |
In the case of ConPlot:
You would need a PDB with the following residue numbers:
However, a PDB with missing residues 2-3 would still be valid, as long as the residue numbering is consistent:
Also please note that ConPlot cannot interpret gaps in the FASTA sequence. Conkit is a bit more clever dealing with gaps in PDB files, but I would generally recommend following the logic explained here when working with any of these two tools. |
Okay, so from what you told me, this should work(?):
PDB:
Which is equivalent to saying:
|
Yes, this should work (if you have noticed it doesn't work in ConPlot, please open an issue at rigdenlab/conplot). Is the residue sequence mismatch intentional (residue 1 is MET in the FASTA and ILE in PDB)? The residue names are parsed from the FASTA sequence and are used in the tooltip display that appears when you hover over contacts and the diagonal of the plot (in ConPlot), so that information won't match the PDB (however this input will still produce a contact map). |
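For anyone wanting to spot this kind of residue-name mismatch before plotting, here is a small sketch in plain Python. It is not how ConKit or ConPlot do it; the helper `find_residue_mismatches` and the toy inputs are hypothetical, and the PDB side is assumed to have been reduced to a mapping of residue number to three-letter name.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def find_residue_mismatches(sequence, pdb_residues):
    """Compare a FASTA sequence with PDB residue names position by position.

    pdb_residues: dict mapping residue number (1-based) to three-letter name.
    Returns a list of (number, fasta_letter, pdb_letter) for every mismatch.
    Residues numbered outside the sequence are reported with fasta_letter None.
    """
    mismatches = []
    for number in sorted(pdb_residues):
        pdb_letter = THREE_TO_ONE.get(pdb_residues[number].upper(), "X")
        if not 1 <= number <= len(sequence):
            mismatches.append((number, None, pdb_letter))
        elif sequence[number - 1] != pdb_letter:
            mismatches.append((number, sequence[number - 1], pdb_letter))
    return mismatches


# Residue 1 is MET in the FASTA but ILE in the structure, as in the case above.
print(find_residue_mismatches("MKT", {1: "ILE", 2: "LYS", 3: "THR"}))
# [(1, 'M', 'I')]
```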
Yes, it's intentional. I was simulating the situation wherein I have predicted a contact map from an alignment of homologous sequences and want to compare it with the contact map of a distant homologous structure (e.g., <30% identity). In this case, I just need to trim any eventual insertions that the template structure may have (be they at the N- or C-terminus, or within the structure). The a3m format is especially good for detecting these insertions because they will appear as lowercase letters in the alignment between the query sequence and the template structure. |
It didn't work in either Conkit or ConPlot; I've opened a separate issue in the ConPlot project.
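To show what trimming lowercase insertions from an a3m row could look like, here is a short sketch in plain Python. The function name `split_a3m_row` and the example row are invented for illustration; it only assumes the usual a3m convention that lowercase letters are insertions relative to the query and that uppercase letters and `-` occupy query columns.

```python
def split_a3m_row(row):
    """Split one a3m-formatted sequence into match columns and insertions.

    Returns (match_columns, insertions), where match_columns has one
    character per query position and insertions is a list of
    (query_positions_before, inserted_letters) tuples.
    """
    match = []
    insertions = []
    current = []
    for char in row:
        if char.islower():
            current.append(char)  # insertion relative to the query
            continue
        if current:
            insertions.append((len(match), "".join(current)))
            current = []
        match.append(char)        # uppercase residue or '-' deletion
    if current:
        insertions.append((len(match), "".join(current)))
    return "".join(match), insertions


# Toy template row aligned to a 6-residue query.
print(split_a3m_row("MKaaT-YIgg"))
# ('MKT-YI', [(2, 'aa'), (6, 'gg')])
```

The first element has the same length as the query, so it can be compared column by column; the second lists the residues that would have to be trimmed from the template structure.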
That tries to simulate the following situation:
PDB:
Which is equivalent to saying:
I get the following traceback:
|
You would usually get this error when there are multiple residues sharing the same residue number in the PDB. In this case there are several residues with repeated numbers (38, 95, 172, 173, 186 and 221) which are causing the problem. They have been added using insertion codes, but conkit does not support this. What I would suggest here is that you re-assign the residue numbers in the PDB file so that it matches the sequence in the FASTA file and there are no repeated residue numbers. For the record, input files can be found here: rigdenlab/conplot#139 |
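A minimal sketch of the renumbering suggested above, in plain Python rather than ConKit. It relies on the fixed PDB columns for chain ID, residue number and insertion code, and it assumes a single chain with no missing residues; if the structure has gaps, the numbers would have to come from an alignment to the FASTA sequence instead. The helper names are hypothetical.

```python
def atom_line(serial, res_name, res_seq, i_code=" "):
    """Build a minimal CA-only ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:>5}  CA  {res_name} A{res_seq:>4}{i_code}   "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"
    )


def renumber_residues(pdb_lines, start=1):
    """Renumber residues sequentially and drop insertion codes."""
    renumbered = []
    previous_key = None
    number = start - 1
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Columns 22, 23-26 and 27: chain ID, residue number, insertion code.
            key = (line[21], line[22:26], line[26])
            if key != previous_key:
                number += 1
                previous_key = key
            line = line[:22] + f"{number:>4}" + " " + line[27:]
        renumbered.append(line)
    return renumbered


lines = [
    atom_line(1, "ALA", 38),
    atom_line(2, "GLY", 38, "A"),  # same number, distinguished by insertion code
    atom_line(3, "SER", 39),
]
for line in renumber_residues(lines):
    print(line[22:27])
```

After renumbering, the three toy residues are numbered 1, 2 and 3 with blank insertion codes, so no two residues share a number.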
It worked in ConPlot, but in Conkit I still get a traceback:
It appears I've gone somewhere forbidden lol. For plotting purposes, ConPlot solved my problem. But I'm insisting on using Conkit because I want to run the |
This is a bit strange, I wasn't able to reproduce this error. I am sending attached the data (data.zip) I used and the plot I created when I ran the following command:
I am using conkit 0.12.0 on python 3.8. What version of conkit and python are you using? You can check the exact conkit version if you open a python terminal and type the following:
    import conkit
    conkit.__version__
|
I reproduced your call using those files and got the same error. |
I've installed python 3.8 and attempted to install conkit, but I get the following error:
|
You are missing some packages, you can fix that using |
Worked! But for some reason I had to use I reran
and still get the weird traceback:
|
Could you try using the latest commit instead of version 0.12? You can replace your current version with the latest commit doing the following (you might need to create the directory
|
I had to do Thanks @FilomenoSanchez ! |
Excellent! Not sure what caused the error but I suppose it will get fixed with the next version release. |
Ah! Now I tried running:
and I got the traceback again:
|
This time I was able to reproduce the error, I will need some time to figure out what is going on here... I'll come back to you when I have something. |
I believe the
and was able to get a precision score at least (although the score seems a bit too 'precise' given the plot):
|
This is a bug, I located the source of the problem and I will be pushing a fix at some point later today. Once I do this you will be able to fix this by pulling the latest commit to the repository you cloned on your machine. If you don't want to wait until I push the fix, you can fix this in your local repository with the following changes:
diff --git a/conkit/command_line/conkit_plot.py b/conkit/command_line/conkit_plot.py
index 40bdc95..14f5f66 100644
--- a/conkit/command_line/conkit_plot.py
+++ b/conkit/command_line/conkit_plot.py
@@ -520,6 +520,8 @@ def main(argv=None):
         else:
             pdb = conkit.io.read(args.pdbfile, "pdb")[0]
+        pdb.sequence = seq
+        pdb.set_sequence_register()
         pdb = pdb.as_contactmap()
         con_matched = con.match(pdb, renumber=True, remove_unmatched=True)
|
I suspect the
diff --git a/conkit/command_line/conkit_precision.py b/conkit/command_line/conkit_precision.py
index c27b548..fd94ebf 100644
--- a/conkit/command_line/conkit_precision.py
+++ b/conkit/command_line/conkit_precision.py
@@ -79,7 +79,11 @@ def main():
         pdb = conkit.io.read(args.pdbfile, args.pdbformat)[args.pdbchain]
     else:
         pdb = conkit.io.read(args.pdbfile, args.pdbformat)[0]
+
     seq = conkit.io.read(args.seqfile, args.seqformat)[0]
+    pdb.sequence = seq
+    pdb.set_sequence_register()
+    pdb = pdb.as_contactmap()
     con = conkit.io.read(args.confile, args.conformat)[0]
     con.sequence = seq
|
I will try modifying those. But before I do, I've come up on another issue. Instead of plotting the query sequence against a distant homologous structure, I attempted to plot it against a structure model (thus, same sequence as query) built by using the distant structure as template. I ran the following:
and I got the following traceback:
The same thing happens when using I've attached those files should you want to test it yourself. |
You are not specifying the correct formats, the contact predictions that you are trying to use are in the CASP MODE 2 format (these are actually inter-residue distance predictions). Also the sequence is in FASTA format not A3M. The command should be:
|
I had tried using the
But when I upload using the CASPRR-MODE 1 format, it works. Because of this I didn't attempt using caspmode2 in Conkit. Give it a try yourself by uploading that distance file to ConPlot with caspmode2. As for the fasta and a3m formats, they are essentially the same if you have just one sequence in the file, so the a3m tag works too. You've used it yourself for a fasta file lol:
|
Yes the sequence format does not matter in this case. Regarding your input, the problem here is that in your predictions there are inter-residue distances that have probabilities higher than 1. ConPlot has stricter user-input sanity checks than Conkit, which is why the former complains and the latter doesn't. For example, line no. 4 in
Distance bin |
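A sanity check of the kind described above could look roughly like the sketch below. This is not ConPlot's actual implementation; it assumes each data line consists of two residue numbers followed only by probabilities, and the example lines are made up rather than taken from the file in question.

```python
def find_invalid_probabilities(lines):
    """Return (line_number, value) for every probability outside [0, 1].

    Assumes data lines look like 'i j p1 p2 ...' (two integer residue
    numbers followed by probabilities); all other lines are skipped.
    """
    invalid = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # header or footer record, not a data line
        for field in fields[2:]:
            value = float(field)
            if not 0.0 <= value <= 1.0:
                invalid.append((line_number, value))
    return invalid


example = [
    "PFRMAT RR",
    "1 9 0.95 0.50 0.45",
    "2 10 1.30 0.20 0.10",  # 1.30 is not a valid probability
    "END",
]
print(find_invalid_probabilities(example))  # [(3, 1.3)]
```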
That's strange. So the CASPRR-MODE1 in ConPlot accepts that anyway? That file was generated by Raptor Contact Predict, and they specifically say in here: Contact result file |
Yes ConPlot only has this sanity check when using CASPRR-MODE2 for two reasons:
Regarding your file, it seems that it doesn't follow CASP ROLL as the authors claim:
Which clearly doesn't match this file:
|
Yeah, makes no sense at all. They must've performed some type of normalization on the probabilities, just can't figure what it was. Or maybe they forgot to apply the Softmax function; by applying it you make Do you reckon this affects the Precision evaluation? I've changed the files as you showed me. Now when I run:
I get:
which makes a lot more sense than the previous 0.941441. I've also managed to produce the precision evaluation plots. |
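For reference, precision here means the fraction of predicted contacts that are also present in the reference structure. The sketch below is a plain-Python illustration of that definition with made-up contacts; it does not reproduce the exact filtering or cutoffs that the ConKit script applies.

```python
def precision(predicted, reference):
    """Fraction of predicted contacts that also occur in the reference.

    Both arguments are iterables of (res1, res2) pairs; pair order is ignored.
    """
    predicted_set = {tuple(sorted(pair)) for pair in predicted}
    reference_set = {tuple(sorted(pair)) for pair in reference}
    if not predicted_set:
        return 0.0
    true_positives = len(predicted_set & reference_set)
    return true_positives / len(predicted_set)


predicted = [(1, 5), (2, 9), (3, 12), (10, 4)]
reference = [(5, 1), (2, 9), (4, 10), (7, 20)]
print(precision(predicted, reference))  # 0.75
```

Because the score depends on matching residue numbers between the two maps, any numbering offset between prediction and structure changes the result even when the underlying contacts are the same.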
Probably this file format issue doesn't affect |
Same is true for plotting then, since the |
If they forgot to apply the softmax function as you suggest, then I think |
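For readers unfamiliar with it, the softmax function mentioned in this exchange turns a list of raw scores into values between 0 and 1 that sum to 1. The sketch below is a generic, numerically stable version in plain Python; whether the prediction server actually skipped this step is only a guess made in the thread.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


raw_scores = [2.0, 1.0, 0.1]  # toy values, e.g. unnormalised bin scores
probabilities = softmax(raw_scores)
print(probabilities)
print(sum(probabilities))
```

After this normalisation no single value can exceed 1, which is the property the stricter format check relies on.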
Also, I would suggest trRosetta (https://yanglab.nankai.edu.cn/trRosetta/). It lets you provide an MSA as input, and from my personal experience it gives more accurate results than Raptor-X. I also never had issues with file formatting when using the trRosetta server. Both ConKit and ConPlot are compatible with the output created with trRosetta (the format is |
Nice one! I'm trying that now. I typically use trRosetta for homology modeling, but since they have options "for not using templates and homologous sequences", I can also use them for the contact maps :) |
Some feedback on trRosetta contact maps: the accuracy/precision skyrocketed! I found two setbacks though; first, I can't actually open the file to see what's in it. How do you do it? And second, the file is gigantic (200 MB in my case) and I can't open it in the ConPlot server because it takes too long. |
Yes, from my personal experience trRosetta produces better results than Raptor-X. If you want even more accurate results you could give AlphaFold a try, which also produces distance predictions compatible with ConKit and ConPlot. You will need to install it and run it on your own machine; it is also available in some Google Colab notebooks, but those don't give back the distance predictions, only the protein model. I'm not sure whether it accepts MSAs as input or what homology modelling options it has, but it is worth a try. Regarding your questions:
|
I use AlphaFold quite a lot by means of the ColabFold notebooks, which allow adjusting many of the input parameters. The output provided there is also much better, and includes a .raw file that appears to be a contact prediction. In any case, the problem is AlphaFold is "too" precise. First, because it adds homologous sequences to the input MSA (in the advanced notebook it is possible to avoid this, but I haven't actually converted the msa.pickle output to fasta to confirm that the sequences are unchanged; I wrote a script for that but can't find it anymore). Second, and most importantly, AlphaFold updates the MSA (and the distogram) during every iteration through the neural network's transformer (the 'Evoformer'). Basically, it optimizes the MSA to produce the best contact prediction possible. I don't want this; I want the contact prediction to be based entirely upon the original shallow MSA, without any modifications to its columns. So I guess trRosetta is better for my current purpose. As for the npz file, I will try to convert it with conkit and see how that goes! Thanks for all the help and suggestions! |
I just merged the bug fix, you should be able to pull the latest commit and use |
If you give me the exact instructions for merging, I'll do it. That way I'll also fix the other minor commits you made. |
Go to the directory where you cloned the repo (so for example |
Good thing I pulled the commit, because I bumped into an error that got fixed after that. I've managed to convert the npz to psicov, caspmode2 and casprr, but I get a traceback when trying to convert into ccmpred:
I am now running conplot locally and things are running smoothly. However, when I try setting L/0 to show all contacts, I get a completely black plot (I've attached the image). Is that how it should work? A minor detail about ConPlot: if I accidentally try to upload a file with the wrong format, I can't upload the correct file after I receive the error message. To be able to upload, I have to change the file format, click upload, close the upload window, change the file format back to the one I wanted and click upload. A question: how do I cite conkit? I've only found your paper on conplot. |
Hi, the ccmpred error that you found seems to be another bug. I just opened an issue for it. The reason is that this is a very old parser that got written years ago when conkit was still python2 & python3 compatible (now it's only py3) and we required a few input checks that are no longer needed. It seems this part of the code got forgotten when we were changing things. Just as before, I will be fixing the code later today, but in the meanwhile you can do the following:
diff --git a/conkit/io/ccmpred.py b/conkit/io/ccmpred.py
index 4291489..d61c44c 100644
--- a/conkit/io/ccmpred.py
+++ b/conkit/io/ccmpred.py
@@ -120,13 +120,7 @@ class CCMpredParser(ContactFileParser):
         ------
         :exc:`RuntimeError`
             More than one contact map in the hierarchy
-        :exc:`TypeError`
-            Python3 requires f_handle to be in `wb` or `ab` mode
-
         """
-        # Python3 support requires bytes mode
-        if sys.version_info.major == 3 and not (f_handle.mode == "wb" or f_handle.mode == "ab"):
-            raise TypeError("Python3 requires f_handle to be in 'wb' or 'ab' mode")
         # Double check the type of hierarchy and reconstruct if necessary
         contact_file = self._reconstruct(hierarchy)
Removing those lines should be fine, as I said the check is no longer needed. Regarding conplot, the behaviour that you observe for L/0 is completely normal. When you set the L factor to 0, all the information in the input file gets displayed. In a trrosetta file (and also ccmpred file) the information for all the possible residue pairs is included, even if |
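The L-factor behaviour described above can be sketched as follows. This is an illustration in plain Python, not ConPlot's code; the function name, the toy contacts and the rounding of L divided by the factor are all assumptions.

```python
def select_top_contacts(contacts, sequence_length, l_factor):
    """Keep the top L/l_factor contacts by score; a factor of 0 keeps all.

    contacts: list of (res1, res2, score) tuples.
    """
    ranked = sorted(contacts, key=lambda contact: contact[2], reverse=True)
    if l_factor == 0:
        return ranked  # no cutoff: every residue pair in the file is shown
    return ranked[: int(sequence_length / l_factor)]


contacts = [(1, 5, 0.9), (2, 9, 0.1), (3, 12, 0.7), (4, 10, 0.3)]
print(select_top_contacts(contacts, sequence_length=4, l_factor=2))
# [(1, 5, 0.9), (3, 12, 0.7)]
print(len(select_top_contacts(contacts, sequence_length=4, l_factor=0)))  # 4
```

With a file that lists every possible residue pair, keeping all of them fills the whole map, which is consistent with the solid black plot reported above.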
The citation for conkit is here: https://doi.org/10.1093/bioinformatics/btx148 |
You were right, they were similar but not identical.
I had a problem like this with javascript and ruby on rails once. Took me a whole day to fix it. It appears trivial but is actually a lot of trouble for a minor user experience benefit. |
General Information
Example
A minimal example to reproduce the error:
The .jones, .rr and .mat files were generated by the conkit-predict script, which worked fine. I used them to exclude the possibility that my original files were the problem; my original goal was to use a reference structure to determine which contacts were true positives, but I was getting a traceback. So I tried just plotting without the reference structure and I got the same traceback.
Traceback
The Python traceback