Pairk - tutorial

introduction

Pairk is a Python package for quantifying the conservation of motifs within intrinsically disordered regions (IDRs). It will run on any input protein sequences that you provide but it is designed for disordered regions, where multiple sequence alignments are particularly difficult to interpret and unreliable

the pairk method can be split into 2 steps, where each step has configurable options:

Step 1: pairwise k-mer alignment - aligns all of the k-mers in the query sequence to all of the k-mers in the homolog sequences. For each query k-mer, the best matching k-mer from each homolog is selected and recorded. The results are returned in a PairkAln object which provides methods for accessing the results and writing/reading the results to/from files. - The way in which the best matching query k-mer - homolog k-mer match is selected is adjustable. The currently implemented methods use either scoring matrices to score the match or use residue embedding distances from the ESM2 protein language model to score the match. Both methods are described in detail below

Step 2: k-mer conservation - calculates conservation scores for each query k-mer using the results of Step 1. The conservation score results are stored in a PairkConservation object which also provides methods for plotting the results as well as writing/reading the results to/from files. - The function used to score the conservation is adjustable.

make sure that pairk is installed before running this notebook (see the README.md file for installation instructions)

import

Most of the functionality of pairk is accessed through the main pairk module, which can be imported directly.

import pairk

load example data

pairk comes with some example data that we can use to test the methods. The example data is stored in the pairk.example1 object.

ex1 = pairk.example1

This example is imported as a python object and holds data that is compatible with the main pairk methods

for example, we can access the IDR sequences in a dictionary with ex1.idr_dict or the query id with ex1.query_id

for id, seq in ex1.idr_dict.items():
    print(id, seq)

9606_0:00294e TNLGTVNAAAPAQPSTGPKTGTTQPNGQIPQATHSVSAVLQEAQRHAETSKDKKPALGNHHDPAVPRAPHAPKSSLPPPPPVRRSSDTSGSPATPLKAKGTGGGGLPAPPDDFLPPPPPPPPLDDPELPPPPPDFMEPPPDFVPPPPPSYAGIAGSELPPPPPPPPAPAPAPVPDSARPPPAVAKRPPVPPKRQENPGHPGGAGGGEQDFMSDLMKALQKKRGNVS
9793_0:005123 TNVGTGNAATPAPPSTGLKTGTAQANGQIPQAAHSVSTVLNEADRQVDTPKDKKPALSNHDPGTPRAQHLPKSSLPPPPPVRRSSDTSSSPVMPAKGAAGGLPPLLDDSLPPPPPPPPLEDDELPPPPPDFDDAPPNFVPPPPPWDAGASLPPPPPPPPPALALAPEATKPSPVVAKRPPVPPKRQENPAPASGGGGGEQDFMSDLMKALQKKRGNVA
1706337_0:000fc7 TNLGTVNAAPPAPSSTGVKTGTTQANGQIPQAAHSMSTVLGEAQRQVETTKDKKSGLGSHDPGAPRAQTLPKSSLPPPPPVRRSSEVGCGSPGTSPKVKGAAAGFPAPPHDLLPPPPPPPPLEDDELPPPPPDFSDAPPDFVPPPPPPSFAGDAGSSLPPPPPPPALAPEAAKPTPVVVKRPPAPPKRQANPGPPGGGGGEQDFMSDLMKALQKKRSNMP
51337_0:001b5a TNLGTVNTATPAQPSTGFKTGSSQPNGQIPQTIPSVSAGLQEAQRHETIKDKKPSLSSTEPGAPRDPPGARSSLPPPPPPVRRSSDTCARAASPFPAPPDDLPPPPPPPPLEDPAMLPPPPALPEPPPDCVPPPPPPPGPGPQPARPSPGAGRRPPVPPKRQENPGLPSAGAGGEQDFMSDLMKALQKRGHMP
9568_0:004ae1 TNLGTVNAAAPAQPSTGPINGTAQPNGQMPQAAHSVSAVLQEAQRHAETSKVKPARPINGTAQPNGQMPQAAHSVSAVLQEAQRHAETSKRPSPAVAKRPPMPPKRHENPGTPSGAGGGEQDFMSDLMKALQKKRGNVS
43346_0:004190 TNLGTVNAVPPAPPSTGVKTGTTQANGQLPQATQSMNTALGEDWRQLETTKDKKPGPGLGSHDPGAPRAQPLPKSSLPPPPPPVRRSSDVGGAPPPSFAEDLPPPPPPPPALAPESVRTPPVVVKRPPPPPKRQENPGPPGGGGGEQDFMSDLMKALQKKRGNVS
885580_0:00488c TNLGTVNTAAAAQPPAGLRTGTSQPNGQLPRDAPCLSDALRESQRQADTSKVGPEGQGGHELPDRPKAAFQASTALAASPYSPKH
10181_0:00305d TNLGTVSTATAAQPPTGLRTSTTQHNGQLPQAAPHLSAVLQEAQRQAEASKVGPEGNRPKLQQISAPSPPSKPEAVFWHLLLFPASENPPYCNFT
1415580_0:000900 PNLSKVDPTVTARSSSSGAVQANGQIHQNVIPVISTNPEAFKRAEDKKPNVGRKPDQAQLQPVPSSNHQQSPKVALHASKIPPPAPARISSQAYSSALTLPSNVKNVNANVLLPPPPSPSPPPPDAFPLPPPCNNDLPPPPDDFYDPPPDFLPPPPPCFATGDRAQLPPGPPLPPPPPSSNQPKPFMKKPVPLPPKRQDITSLHSEQPSLAGPTPVGGGGGQPDFMSDLMKALQKKRGSTS
61221_0:00105a AKSGNTESIVAVGTSAKGSTHANGQVPQSITLSKTDSSETGKSAEMPKVKKPDNTADSIQAPTPNTPQLKHQRKAGGSHSAPPMPPQRVSSAVTAPLQLPTNAEGKGKVCPSDAAEFPPPPESMLPPPELEDLPLPPPPPPEYFESPPDFIPPPPPSCAVAVSAGAPPLPPPPPSASLPRMPLSIKKKPPPPPRRQEESAGQAGLPKPSAPPPKTETAGQGDFMSDLMKALEKKRGATS
7897_0:0033c5 ANFAKVQSQNTTASSTGSVQANGHAGQVTPVSVSFSEAWKRGDVGKEKQSNDGQDNLPPPPPPPPPPMQGFMNEAFPPPPSLPPIASGSLPPPLRASASAPAPPPISNNFPPPLDELSPPPDDFDFPEPPPDFLPPPPTVSASGVPPPPPPPPPPPAPTAASQPTPLPKKSVPPRRQENTTLSQPRGGGGGGQPDFMSDLTKALQKKRGNAS
8407_0:002bff SSLSSSTTPTGAPQGNGQISQNVANVSSSFSDAWKRGETAKDKQQPTEVRKPEQKISLPPSTKQPPPAPVRRPSNAHVVGTPPLPIKAKPVTSNMPPPPPPAEASQWGDDFLPPPPPPELLDTPPNFLPPPPPSFNSESDYPAPPQFTNVGSAGGPPPPPPPPPPPPAALSPKSAPPQLPVKKLPPKPPMRRDSTGQRPNQQNSLMTNGGGAGGQPDFMSDLMSALQKKRSTTT
173247_0:004550 MNRSTPSSSNPSTPSPTIKAKTPNQANGHAPKPQPTAPDSMDFGNFPPPPSADILPPPPPDPAFPPPPPSLPAKSSSRPVAPQHKLPANFPPPPMAMDNLPPPPLPPPIDDSPEAPPDFLPPPPPAAGFGSLPPPPLSMNSLPPPPHFGGMDQSLPPPPPDPEFLPPPPPEPVFTGAGAPPPPPPPPPPPPAQAAAVPRAPVRPSGSVRKVPPAPPKRTTPSLQVGGGGGGGDFMSELMVAMQKKRGDH
30732_0:0046dd TNRSTPSSSNPSTPSPTVKAKAQSQANGHAPKPQPGPVSQDFGHLPPPPPPCPNDDLPPPPPDPVFPPPPPPLAAKRSPKTAGRSQHPQGNFPPPPPEMDHLPPPPPMEESPPDFLPPPPPMNSLPHPPPPPASFGGVDHSLPPPPPDPEFLPPPPPDPQVTGGGGPPPPPPPPPPPPPASAPAPRGALRPTGSAKKMPPAPPKRTTPVMGGGGGGGGGGGGDFMSELMKAMQKKRSDQ
241271_0:0048e4 TNRTPVSSNQSTPSPTIKAKSPNQANGHAAKPQPGPESQDFGNIPPPPPPPPPPMTGFLPPPPPDPVLPPPPPLLAAKSPKPSPPQRNLPTNFPPPAMDNLPPPPPPPMDDSFEDPPDFLPPPPPAAGFGSLPPPPPPVNSFPPPPPSAGFGGMGQSLPPPPPDPGFLPPPPPQPMFTGGGTIPPPPPPPPPPTAAPRAPVRPTGSVKKAPPAPPKRTTPSLHGGGGGGGGGGGDFMSELMMAMNKKRGTT
8103_0:0045e4 TNRSTPSSSNPSTPSPPIKAKSPGQANGHALKPEPGPVAQDFGHVPPPPPADILPPPADILPPPPPQTFLPPPPPPLAAKSSSKPSLPQRHLPTNFPPPPPAMINLPRPPQPPPTDDASEAPPDFLPPPPPAAGFSPLFPPPPPLNALPLPPPPVSFRVEDRSLPPPPPDPGFLPPPPPMFTGAGAPPPPPPPPPPPRVAVRPAGSVKKRPPAVPKRTTPSLRGGGGGDFMSELALAMNKKRSAH
56723_0:00152f TNRSTPSSSSSSTPSPTIKAKSPSQANGHAPKPQPGPVSQDFGNVPPPPPPMANILPPPRPDAFLPPAPPPLARKNSAKPPPPQRHLPTNFPPPPPAMDNLPPPPPPPPMDDALEAPPDFLPPPPPAAGFGSLPPPPPPSNSFPPPPPPGSFGSMGQSLPPPPPDPGFLPPPPPQPVFTGAGAPPPPPPPPPPPTAAAAPRAPVRPSGSVKKIPPATPKRTTPSLQGGGGGGGGGGGGGGDFMSELMLAMNKKRST
210632_0:004c0c ANRSTSSSSNSSTPSPTIKAKSSSQANGIFPKPGPAPQDFGDLPPPPPLAANILPPPPPEPGLPPPPPPPPPQAAKGSAKPAPPKRQMPANFPPPPTAMDNLPPPPPPPPIDNSEAPPDFLPPPPPASGFGSFPPPPPLNSLPPPPRPGGFGGMDQSLPPPPPDPEFLPPPPPPPQAVFTGGGAPPPPPPPPPPPAAAAPSTAIPRVGLRPAGSLKKLPPAPPKRTTPSMQGSGGGGGGDFMSELMLAMQKKRGDHP
31033_0:00264e TNCSKPSSDGPPTPPTTIKASPANGHVPKPPPGAAPQDVFPPPPPPMDILPPPPPDPAFPPPPPPLMAKRSPKPSAGHRQAPGDHLPPPPLAPPHDDASEDPPDFLPPPPPSFDSLPPPPPGMSAFPPPPPLLGFSETSQPLPPPPPDPELLLPPPPASMISTGAGAPPPPPPPPPAAAASPRPAPTASGSVRKRPPAPPKRTTPALHGSGGGAGEGAGGGDFMSELMKAMNKKRADHS
63155_0:004c86 TNRSTPSSSNQSTPSPTVKAKSPNQANGHPPKPQPGPISQAPFPPPPLAEVLPPPPPDPVLPPPPPMPAKSSAKPSPPKRQQQSNFPPPPPELDNLPPPPPPPPTDDTAEAPPDFLPPPPPAVGFGSLPPPPPSFGGVGQSLPPPPPDPQSLPPPPPDPVFIGAGAPPPPPPPPPPPAPGAPVTTLRPAVRPSGSLKKVPPAPPQRNTPSVSGGGGGGGGDFMSELMLAMQKKRGAQ
7994_0:004d71 TSMNRQPSPSTSNTSTPSPTPKAKTANGHASQPRSETVPKAPSNQSAFPPPPPPADFLPPPPPDPTLPPPPPPPPALPVKKESNPPRSAPQRSQPAFPPPPPAMDFSLPPPPPPSDDLEMPPDFLPPPPPAPGGFMGGDLLPPPPPEPFHAPLPPPPAAFHPPPAVHPPPQATGGDLPPPPPPPPPPPPAPAAFHQTPSVRKVGPPPPKRTTPSLAAPSGGDFMSELMLAMNKKRGGQ
109280_0:00369f ASCSVTSGQKQSQSQVANGHANNTSPSPLPPPPPPLGEDLPPPPPPPPQLGKTLPPAPPPLGATLPTPPPPPGGTPPPPPPPRRNPPPYPRHLPHISELYPLRRLLLLYQLPTVHSLVLLFQPNLPPNPSPNHDVQRPISRCLPGPQITFPPPPPPPVDDSPPDFLPPPPPAANFGSHPPPPPPVKTLPPPPPHMKTLPPARLSFKSTNLPPPPPDPGFLPPPLTGVPPPPPPPPPPPPTTAAAGPRRAPVRPSGSLKKMPPPPPKRSTPSLHGRRDGDRGDGDGGGGGDFMSELMRAMQKKRDPH
150288_0:004e5a HPPDRCSTSSDNLNSQIGQSAPTPDECIEQDEPPPDFIPPPPPGYMAIL

print(ex1.query_id)

9606_0:00294e

Step 1: pairwise k-mer alignment

there are 2 main ways to run the k-mer alignment step:

Scoring matrix alignment
- These methods use a scoring matrix to score the query k-mer to homolog k-mer matches and select the best scoring match from each homolog.
Embedding distance alignment
- This method uses the Euclidean distance between the query k-mer residue embeddings from a protein large language model (such as ESM2) and homolog k-mer residue embeddings and selects the lowest distance match from each homolog.

We will only go through one of the scoring matrix alignment methods in this tutorial. They are all described in detail in PairK alignment - in depth

k-mer alignment - `pairk.pairk_alignment()`

an exhaustive comparison of all k-mers in the query sequence with all k-mers in the homologs. Uses a scoring matrix to score the query k-mer to homolog k-mer matches and select the best scoring match from each homolog.

The pairk.pairk_alignment() function takes the following arguments:

idr_dict: a dictionary of IDR sequences, where the keys are the sequence ids and the values are the sequences. Includes the query sequence (the sequence to split into k-mers and align with the homologs).
query_id: a query sequence id (the sequence to split into k-mers and align with the homologs). This id should be present in idr_dict.
k: the length of the k-mers

These inputs are provided by the user. There are helper functions in the pairk library that might help with this (see functions in pairk.utilities).

aln_results = pairk.pairk_alignment(
    idr_dict=ex1.idr_dict,
    query_id=ex1.query_id,
    k=5,
)

To specify the scoring matrix used, you can pass the name of the matrix to the matrix_name argument.

To see the available matrices, use the pairk.print_available_matrices() function.

pairk.print_available_matrices()

biopython-builtin matrices (aligner compatible):
BENNER22
BENNER6
BENNER74
BLASTN
BLASTP
BLOSUM45
BLOSUM50
BLOSUM62
BLOSUM80
BLOSUM90
DAYHOFF
FENG
GENETIC
GONNET1992
HOXD70
JOHNSON
JONES
LEVIN
MCLACHLAN
MDM78
MEGABLAST
NUC.4.4
PAM250
PAM30
PAM70
RAO
RISLER
SCHNEIDER
STR
TRANS

other matrices:
grantham_similarity_normx100_aligner_compatible
BLOSUM62
EDSSMat50
grantham
grantham_similarity_norm

k-mer alignment results - `pairk.PairkAln`

The results of all of the pairwise k-mer alignment methods are returned as a pairk.PairkAln object.

The actual “alignments” are stored as matrices in the pairk.PairkAln object. The main matrices are:

orthokmer_matrix - the best matching k-mers from each homolog for each query k-mer
position_matrix - the positions of the best matching k-mers in the homologs
score_matrix - the scores of the best matching k-mers

Each matrix is a pandas DataFrame where the index is the start position of the k-mer in the query sequence. The columns are the query k-mers + the homolog sequence ids.

The pairk.PairkAln object has some useful methods for accessing the data. For example, you can get the best matching k-mers for a query k-mer by its position in the query sequence using the pairk.get_pseudo_alignment() method (or by directly accessing the dataframes). You can also plot the matrices as heatmaps, save the results to a json file, and load the results from that file

example: accessing the DataFrames from the pairk.PairkAln object directly

aln_results.score_matrix

	query_kmer	9793_0:005123	1706337_0:000fc7	51337_0:001b5a	9568_0:004ae1	43346_0:004190	885580_0:00488c	10181_0:00305d	1415580_0:000900	61221_0:00105a	...	30732_0:0046dd	241271_0:0048e4	8103_0:0045e4	56723_0:00152f	210632_0:004c0c	31033_0:00264e	63155_0:004c86	7994_0:004d71	109280_0:00369f	150288_0:004e5a
0	TNLGT	22.0	28.0	28.0	28.0	28.0	28.0	28.0	10.0	9.0	...	13.0	12.0	13.0	13.0	9.0	8.0	13.0	8.0	14.0	9.0
1	NLGTV	16.0	29.0	29.0	29.0	29.0	29.0	29.0	16.0	8.0	...	11.0	13.0	11.0	13.0	7.0	7.0	11.0	6.0	14.0	10.0
2	LGTVN	16.0	29.0	29.0	29.0	29.0	29.0	23.0	10.0	7.0	...	11.0	8.0	9.0	10.0	7.0	8.0	10.0	8.0	16.0	4.0
3	GTVNA	19.0	26.0	23.0	26.0	26.0	23.0	17.0	15.0	10.0	...	11.0	9.0	10.0	8.0	8.0	9.0	11.0	9.0	9.0	6.0
4	TVNAA	18.0	25.0	22.0	25.0	22.0	22.0	16.0	14.0	11.0	...	11.0	9.0	8.0	12.0	8.0	11.0	11.0	10.0	11.0	5.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
217	LQKKR	31.0	31.0	19.0	31.0	31.0	3.0	9.0	31.0	24.0	...	26.0	19.0	19.0	19.0	26.0	19.0	26.0	19.0	26.0	0.0
218	QKKRG	29.0	23.0	16.0	29.0	29.0	5.0	3.0	29.0	22.0	...	23.0	22.0	16.0	16.0	29.0	17.0	29.0	22.0	23.0	-1.0
219	KKRGN	29.0	23.0	18.0	29.0	29.0	2.0	2.0	23.0	21.0	...	17.0	23.0	15.0	17.0	23.0	18.0	21.0	22.0	14.0	-1.0
220	KRGNV	29.0	19.0	20.0	29.0	29.0	8.0	8.0	17.0	15.0	...	9.0	17.0	8.0	11.0	14.0	9.0	13.0	14.0	11.0	2.0
221	RGNVS	23.0	12.0	13.0	27.0	27.0	9.0	13.0	15.0	13.0	...	10.0	11.0	8.0	13.0	7.0	9.0	7.0	9.0	8.0	9.0

222 rows × 23 columns

example: access the best matching k-mers for the query k-mer at position 4:

print(aln_results.orthokmer_matrix.loc[4])

query_kmer          TVNAA
9793_0:005123       TGNAA
1706337_0:000fc7    TVNAA
51337_0:001b5a      TVNTA
9568_0:004ae1       TVNAA
43346_0:004190      TVNAV
885580_0:00488c     TVNTA
10181_0:00305d      TVSTA
1415580_0:000900    NVNAN
61221_0:00105a      AVSAG
7897_0:0033c5       TVSAS
8407_0:002bff       SQNVA
173247_0:004550     TPNQA
30732_0:0046dd      TVKAK
241271_0:0048e4     PVNSF
8103_0:0045e4       PLNAL
56723_0:00152f      TAAAA
210632_0:004c0c     TIKAK
31033_0:00264e      TIKAS
63155_0:004c86      TVKAK
7994_0:004d71       TSNTS
109280_0:00369f     TTAAA
150288_0:004e5a     NLNSQ
Name: 4, dtype: object

example: access the best matching k-mers for the query k-mer at position 4 using the get_pseudo_alignment method. (the returned list includes the query k-mer sequence)

aln_results.get_pseudo_alignment(4)

['TVNAA',
 'TGNAA',
 'TVNAA',
 'TVNTA',
 'TVNAA',
 'TVNAV',
 'TVNTA',
 'TVSTA',
 'NVNAN',
 'AVSAG',
 'TVSAS',
 'SQNVA',
 'TPNQA',
 'TVKAK',
 'PVNSF',
 'PLNAL',
 'TAAAA',
 'TIKAK',
 'TIKAS',
 'TVKAK',
 'TSNTS',
 'TTAAA',
 'NLNSQ']

you can search for a specific kmer to get its positions. You can then use the positions to query the matrices.

aln_results.find_query_kmer_positions('LPPPP')

[75, 113, 127, 157]

aln_results.get_pseudo_alignment(75)

['LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'PPMPP',
 'LPPPP',
 'LPDRP',
 'APSPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'IPPPP']

aln_results.orthokmer_matrix.loc[[75, 113, 127, 157]].T

	75	113	127	157
query_kmer	LPPPP	LPPPP	LPPPP	LPPPP
9793_0:005123	LPPPP	LPPPP	LPPPP	LPPPP
1706337_0:000fc7	LPPPP	LPPPP	LPPPP	LPPPP
51337_0:001b5a	LPPPP	LPPPP	LPPPP	LPPPP
9568_0:004ae1	PPMPP	PPMPP	PPMPP	PPMPP
43346_0:004190	LPPPP	LPPPP	LPPPP	LPPPP
885580_0:00488c	LPDRP	LPDRP	LPDRP	LPDRP
10181_0:00305d	APSPP	APSPP	APSPP	APSPP
1415580_0:000900	LPPPP	LPPPP	LPPPP	LPPPP
61221_0:00105a	LPPPP	LPPPP	LPPPP	LPPPP
7897_0:0033c5	LPPPP	LPPPP	LPPPP	LPPPP
8407_0:002bff	LPPPP	LPPPP	LPPPP	LPPPP
173247_0:004550	LPPPP	LPPPP	LPPPP	LPPPP
30732_0:0046dd	LPPPP	LPPPP	LPPPP	LPPPP
241271_0:0048e4	LPPPP	LPPPP	LPPPP	LPPPP
8103_0:0045e4	LPPPP	LPPPP	LPPPP	LPPPP
56723_0:00152f	LPPPP	LPPPP	LPPPP	LPPPP
210632_0:004c0c	LPPPP	LPPPP	LPPPP	LPPPP
31033_0:00264e	LPPPP	LPPPP	LPPPP	LPPPP
63155_0:004c86	LPPPP	LPPPP	LPPPP	LPPPP
7994_0:004d71	LPPPP	LPPPP	LPPPP	LPPPP
109280_0:00369f	LPPPP	LPPPP	LPPPP	LPPPP
150288_0:004e5a	IPPPP	IPPPP	IPPPP	IPPPP

Note

the k-mers are defined by position rather than sequence. You could easily make a variant of this method that uses the unique sequences instead. It would make the method slightly faster. The reason that I didn’t do this is because I wanted to mimic the LLM embedding version of Pairk, where identical k-mers have different embeddings and thus are treated as different k-mers.Inclusion of duplicate k-mers does alter the final z-scores, so it’s something to be aware of.

example: plot a heatmap of the matrices

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
aln_results.plot_position_heatmap(ax)
ax.xaxis.set_visible(False)

../_images/pairk_tutorial_simplified_62_0.png

example: save the results to a file using write_to_file and load them back into python using from_file:

aln_results.write_to_file('./aln_results.json')
aln_results = pairk.PairkAln.from_file('./aln_results.json')
print(aln_results)

PairkAln object for 222 query k-mers
query sequence: TNLGTVNAAAPAQPSTGPKTGTTQPNGQIPQATHSVSAVLQEAQRHAETSKDKKPALGNHHDPAVPRAPHAPKSSLPPPPPVRRSSDTSGSPATPLKAKGTGGGGLPAPPDDFLPPPPPPPPLDDPELPPPPPDFMEPPPDFVPPPPPSYAGIAGSELPPPPPPPPAPAPAPVPDSARPPPAVAKRPPVPPKRQENPGHPGGAGGGEQDFMSDLMKALQKKRGNVS
k-mer length: 5

Step 2: k-mer conservation

In this step, the query k-mer and the best matching homolog k-mers are treated as a gapless multiple sequence alignment with ‘k’ columns, which we call a “pseudo-MSA”. Column-wise conservation scores are calculated for each position in each pseudo-MSA. All of the conservation scores are then converted to z-scores to give the relative conservation of each k-mer position compared to the rest of the query IDR. The conservation score results are stored in a PairkConservation object which also provides methods for plotting the results and reading/writing the results from/to files.

calculate k-mer conservation - `pairk.calculate_conservation()`

the main method for Step 2 is the pairk.calculate_conservation() method. It simply takes the pairk.PairkAln object as input, along with a columnwise conservation scoring function and returns a pairk.PairkConservation object.

The columnwise conservation scoring function can be any function that takes a string of residues (a column of an alignment) as an input and returns a float (conservation score). You can use custom functions here, but pairk comes with a few built-in functions from Capra and Singh 2007 (DOI: 10.1093/bioinformatics/btm270) available in the pairk.capra_singh_functions module. The pairk.capra_singh_functions.property_entropy() is the default function used by pairk.calculate_conservation().

Example

aln_results = pairk.pairk_alignment(
    idr_dict=ex1.idr_dict,
    query_id=ex1.query_id,
    k=5,
    matrix_name="EDSSMat50"
)
conservation_results = pairk.calculate_conservation(
    aln_results,
)

example usage: using a different conservation scoring function:

from pairk import capra_singh_functions
column = 'NNNNNNNNNKNSNNNNNNNNSSN'
print(capra_singh_functions.shannon_entropy(column))

0.8161170118989496

aln_results = pairk.pairk_alignment(
    idr_dict=ex1.idr_dict,
    query_id=ex1.query_id,
    k=5,
)
conservation_results = pairk.calculate_conservation(
    pairk_aln_results=aln_results,
    score_func=capra_singh_functions.shannon_entropy
)

k-mer conservation results - `pairk.PairkConservation`

The pairk.calculate_conservation() method returns a pairk.PairkConservation object.

The returned pairk.PairkConservation object has matrices with similar structure as pairk.PairkAln object matrices, except that they are numpy arrays instead of pandas dataframes.

orthokmer_arr - the best matching k-mers from each homolog for each query k-mer - analogous to the orthokmer_matrix in the pairk.PairkAln object
score_arr - the conservation scores for each position in the pseudo-MSA of each query k-mer
z_score_arr - the conservation score z-scores for each position in the pseudo-MSA of each query k-mer

If n is the number of k-mers in the query sequence and m is the number of homologs (including the query sequence), the matrices will have the dimensions:

orthokmer_arr: (n, m)
score_arr: (n, k)
z_score_arr: (n, k)

accessing the results

The row index of the arrays correspond to the starting position of the query k-mer in the query IDR.

For example, to access the conservation scores for the k-mer at position 4 in the query IDR, you would access the 4th row of the arrays: .score_arr[4, :].

k_mer_position = 4
print(f"query k-mer at position {k_mer_position}: {conservation_results.orthokmer_arr[k_mer_position, 0]}")
print(f"pseudo-MSA for the query k-mer at position {k_mer_position} (including the query k-mer): {conservation_results.orthokmer_arr[k_mer_position, :]}")
print(f"scores for each position of the k-mer at position {k_mer_position}:")
print(conservation_results.score_arr[k_mer_position, :])
print(f"z scores for each position of the k-mer at position {k_mer_position}:")
print(conservation_results.z_score_arr[k_mer_position, :])

query k-mer at position 4: TVNAA
pseudo-MSA for the query k-mer at position 4 (including the query k-mer): ['TVNAA' 'TGNAA' 'TVNAA' 'TVNTA' 'TVNAA' 'TVNAV' 'TVNTA' 'TVSTA' 'NVNAN'
 'AVSAG' 'TVSAS' 'SQNVA' 'TPNQA' 'TVKAK' 'PVNSF' 'PLNAL' 'TAAAA' 'TIKAK'
 'TIKAS' 'TVKAK' 'TSNTS' 'TTAAA' 'NLNSQ']
scores for each position of the k-mer at position 4:
[0.69754401 0.48589838 0.6438038  0.64920359 0.44093655]
z scores for each position of the k-mer at position 4:
[ 0.17610372 -1.01357184 -0.12597394 -0.09562132 -1.26630561]

plotting the results

There are several plotting functions available from the pairk.PairkConservation object shown below.

example usage: plotting background score distributions

ax = conservation_results.plot_background_distribution()

../_images/pairk_tutorial_simplified_87_0.png

example usage: plotting conservation scores

fig, ax = plt.subplots(figsize=(7,1.5))
conservation_results.plot_score_barplot(k_mer_position, score_type='score', ax=ax)
ax.set_title('conservation scores')
fig, ax = plt.subplots(figsize=(7,1.5))
conservation_results.plot_score_barplot(k_mer_position, score_type='z_score', ax=ax)
_ = ax.set_title('conservation z-scores')

../_images/pairk_tutorial_simplified_89_1.png

../_images/pairk_tutorial_simplified_89_2.png

example usage: display a pseudo-MSA as a sequence logo

fig, ax = plt.subplots(figsize=(7,1.5))
_ = conservation_results.plot_sequence_logo(k_mer_position, ax=ax)

../_images/pairk_tutorial_simplified_91_1.png

fig, axd = conservation_results.plot_conservation_mosaic(
    position=0,
    score_type='z_score',
    figsize=(8, 3)
)

../_images/pairk_tutorial_simplified_92_0.png

here is an annotated version of the mosaic plot:

getting the average conservation score for a query k-mer

You can use the pairk.PairkConservation.get_average_score() function to get the average conservation score for a k-mer position.

example: get the average conservation score for the query k-mer at position 4

conservation_results.get_average_score(4, score_type='z_score')

-0.4650737963634926

The pairk.PairkConservation.get_average_score() function takes a position_mask as an optional argument that will only consider the conservation scores for the positions in the mask when calculating the average score. This is useful if you want to exclude certain positions from the average score calculation.

example: get the average conservation score for the query k-mer at position 0, but only consider the conservation scores for positions 1, and 3 within the k-mer

position_mask = [0, 1, 0, 1, 0]
conservation_results.get_average_score(0, score_type='z_score', position_mask=position_mask)

0.197148283025409

You could also do a weighted average from manually extracted conservation scores.

example: get the weighted average conservation score for the query k-mer at position 0 (using some arbitrary weights)

import numpy as np
np.average(conservation_results.z_score_arr[0, :], weights=[0.1, 1, 0.5, 1, 10])

0.31155698039226354

Writing and Reading Results from Files

You can save the results to a file with pairk.PairkConservation.write_results_to_file() and load them back in with pairk.PairkConservation.read_results_from_file().

example usage: save the results to a file and load them back in

_=conservation_results.write_results_to_file('./conservation_results.npz')

conservation_results= pairk.PairkConservation.read_results_from_file('./conservation_results.npz')

advanced customization

Step 1 and Step 2 can be modified, but this requires a bit of knowledge of the inner workings of the pairk package and the source code would have to be modified directly and probably installed in editable mode (see Installation_). If you want to modify the package, here’s roughly how I organized the code in the pairk directory (the source code is available on github. Below file paths are relative to the github repo root directory:

pairk/backend/ - the main code for the package is located here. The main pairwise k-mer alignment and k-mer conservation functions are defined in files within this directory.
pairk/__init__.py - User-facing functions are imported into the main pairk/__init__.py file so that they are accessible when the package is imported. I think it also simplifies the import statements for users. Use this init file to find where pairk’s main functions are defined within the directory structure if you want to modify any of the functions above. You could also modify the init file to make any new functions you create easy to access.
pairk/data/ - data installed along with the package is stored here. This includes the scoring matrices and example data. The scoring matrices are stored in the pairk/data/matrices/ folder.

The easiest customization to make would be to add a new scoring matrix. To do this, you would add a new matrix file to the pairk/data/matrices/ folder. The tools should be able to automatically find the matrix file and add it to the available scoring matrices. It will be named after the name of the file. Use pairk.print_available_matrices() to confirm (make sure you’ve installed pairk as an editable install for changes to take affect). You could then use the new matrix in relevant methods by passing the matrix name as an argument. If this doesn’t work, you may need to modify the code that reads the matrices in the pairk/backend/tools/matrices.py file.

!tree ../pairk/ -I __pycache__

../pairk/
├── __init__.py
├── _version.py
├── backend
│   ├── __init__.py
│   ├── conservation
│   │   ├── __init__.py
│   │   ├── capra_singh_functions
│   │   │   ├── __init__.py
│   │   │   └── capra_singh_2007_scores.py
│   │   └── kmer_conservation.py
│   ├── exceptions.py
│   ├── kmer_alignment
│   │   ├── __init__.py
│   │   ├── esm_embedding_distance.py
│   │   ├── needleman_tools.py
│   │   ├── scoring_matrix.py
│   │   └── scoring_matrix_needleman.py
│   └── tools
│       ├── __init__.py
│       ├── esm_tools.py
│       ├── matrices.py
│       ├── pairwise_tools.py
│       ├── plotting_tools.py
│       ├── pssms.py
│       └── sequence_utils.py
├── data
│   ├── README.md
│   ├── __init__.py
│   ├── example_alignment_9606_0_00294e-idraln-555_to_971-idr-440_to_665.fasta
│   ├── matrices
│   │   ├── BLOSUM62
│   │   ├── EDSSMat50
│   │   ├── __init__.py
│   │   ├── grantham.csv
│   │   ├── grantham_similarity_norm.csv
│   │   └── grantham_similarity_normx100_aligner_compatible
│   └── pairk_plotstyle.mplstyle
├── examples.py
├── py.typed
├── single_kmer
│   └── __init__.py
├── tests
│   ├── __init__.py
│   └── test_pairk.py
└── utilities.py

10 directories, 36 files