PairK alignment - in depth

there are 2 main ways to run the k-mer alignment step:

  • Scoring matrix alignment

    • These methods use a scoring matrix to score the query k-mer to homolog k-mer matches and select the best scoring match from each homolog.

  • Embedding distance alignment

    • This method uses the Euclidean distance between the query k-mer residue embeddings from a protein large language model (such as ESM2) and homolog k-mer residue embeddings and selects the lowest distance match from each homolog.

The results from these methods are stored in a pairk.PairkAln. Each of these methods is described in more detail below.

Scoring matrix alignment

There are 2 implementations of the scoring matrix method:

  • pairk.pairk_alignment - the original implementation. This is a bit slow because it does an exhaustive comparison of all k-mers in the query sequence with all k-mers in the homologs. This method is mainly included because it is the most flexible and would be the easiest to modify if you wanted to develop a new scoring scheme to compare k-mers.

  • pairk.pairk_alignment_needleman - a faster implementation that uses the Needleman-Wunsch algorithm (as implemented in Biopython) to align the k-mers. This is faster and should yield the same results.

inputs common to both functions are:

  • idr_dict: a dictionary of IDR sequences, where the keys are the sequence ids and the values are the sequences. Includes the query sequence (the sequence to split into k-mers and align with the homologs).

  • query_id: a query sequence id (the sequence to split into k-mers and align with the homologs). This id should be present in idr_dict.

  • k: the length of the k-mers

These inputs can be generated many ways, and there are helper functions in the pairk library to help with this (see functions in pairk.utilities).

Scoring matrix - pairk.pairk_alignment()

aln_results = pairk.pairk_alignment(
    idr_dict=ex1.idr_dict,
    query_id=ex1.query_id,
    k=5,
)

To specify the scoring matrix used, you can pass the name of the matrix to the matrix_name argument.

To see the available matrices, use the pairk.print_available_matrices() function.

pairk.print_available_matrices()
biopython-builtin matrices (aligner compatible):
BENNER22
BENNER6
BENNER74
BLASTN
BLASTP
BLOSUM45
BLOSUM50
BLOSUM62
BLOSUM80
BLOSUM90
DAYHOFF
FENG
GENETIC
GONNET1992
HOXD70
JOHNSON
JONES
LEVIN
MCLACHLAN
MDM78
MEGABLAST
NUC.4.4
PAM250
PAM30
PAM70
RAO
RISLER
SCHNEIDER
STR
TRANS

other matrices:
grantham_similarity_normx100_aligner_compatible
BLOSUM62
EDSSMat50
grantham
grantham_similarity_norm

Scoring matrix - pairk.pairk_alignment_needleman()

This method returns the same results as pairk.pairk_alignment(), but it is faster.

The difference is that the pairk.pairk_alignment_needleman() method uses the Needleman-Wunsch algorithm to align the k-mers, while the pairk.pairk_alignment() method uses a scoring matrix to exhaustively score the k-mer matches. pairk.pairk_alignment_needleman() ensures that the alignment is gapless by using an extremely high gap opening and extension penalty (-1000000). This will ensure that the alignment is gapless, unless you use a really unusual scoring matrix with very high scores.

This method takes similar arguments as pairk.pairk_alignment(), accept that the pairk.pairk_alignment_needleman() method takes an optional aligner argument. This allows you to create the aligner before calling the method, which is useful if you want to do multiprocessing, so that you’re not creating a new aligner for each process. I’ve found that if you create a new aligner for each process, the memory usage gets very high.

The aligner object can be created via the pairk.create_aligner() function. This function takes the name of the scoring matrix as an argument and returns the aligner object. If you don’t pass the aligner argument to the pairk.pairk_alignment_needleman() method, it will create a new aligner using the matrix_name argument. This is fine if you’re not doing multiprocessing. If you are doing multiprocessing, I would suggest creating the aligner before calling the method. If the aligner argument is passed, the matrix_name argument is ignored.

aligner = pairk.make_aligner('EDSSMat50')
aln_results_needleman = pairk.pairk_alignment_needleman(
    idr_dict=ex1.idr_dict,
    query_id=ex1.query_id,
    k=5,
    aligner=aligner
)

results are the same as the pairk.pairk_alignment method in this case

(aln_results.position_matrix == aln_results_needleman.position_matrix).all().all()
True

Embedding distance alignment

This method uses the Euclidean distance between the query k-mer residue embeddings and homolog k-mer residue embeddings and selects the lowest distance match from each homolog. For each homolog, it calculates the distance between the query k-mer and each k-mer in the homolog. It then selects the k-mer with the lowest distance as the best match.

embedding distance - pairk.pairk_alignment_embedding_distance()

This method generates residue embeddings using the ESM2 protein large language model unless residue embeddings are provided directly by the user.

Because residue embeddings are used, the inputs are slightly different than the previous methods.

The inputs are:

  • full_length_sequence_dict: a dictionary of full-length sequences, where the keys are the sequence ids and the values are the sequences. This is used to generate the embeddings.

  • idr_position_map: a dictionary where the keys are the full-length sequence ids and the values are the start and end positions of the IDR in the full-length sequence (using python indexing). This is used to slice out the IDR embeddings/sequences from the full-length embeddings/sequences.

  • query_id: a query sequence id (the sequence to split into k-mers and align with the homologs). This id should be present in idr_position_map and full_length_sequence_dict.

  • k - the length of the k-mers

  • mod - a pairk.ESM_Model object. This is the ESM2 model used to generate the embeddings. The code for the ESM2 embeddings is adapted from the kibby conservation tool link DOI: 10.1093/bib/bbac599

  • device - whether to use cuda or your cpu for pytorch, should be either “cpu” or “cuda”. (default is “cuda”). If “cuda” fails, it will default to “cpu”. This argument is passed to the pairk.ESM_Model.encode method

Full length sequences (full_length_sequence_dict) are required to generate the embeddings because each embedding is dependent upon the neighboring residues. The embeddings for just an IDR are different than the embeddings for full length sequences. Thus, the full length embeddings are gathered first, and then the IDR embeddings are sliced out for the k-mer alignment.

The idr_position_map is used to slice out the IDR embeddings, and there must be IDR positions for each sequence in the input sequence set.

Our example has both of those

ex1.full_length_dict
{'9606_0:00294e': 'MGESSEDIDQMFSTLLGEMDLLTQSLGVDTLPPPDPNPPRAEFNYSVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQKESLQNQHHSASLQASIFSGAASLGYGTNVAATGISQYEDDLPPPPADPVLDLPLPPPPPEPLSQEEEEAQAKADKIKLALEKLKEAKVKKLVVKVHMNDNSTKSLMVDERQLARDVLDNLFEKTHCDCNVDWCLYEIYPELQIERFFEDHENVVEVLSDWTRDTENKILFLEKEEKYAVFKNPQNFYLDNRGKKESKETNEKMNAKNKESLLEESFCGTSIIVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGTQHKMKYKAPTDYCFVLKHPQIQKESQYIKYLCCDDTRTLNQWVMGIRIAKYGKTLYDNYQRAVAKAGLASRWTNLGTVNAAAPAQPSTGPKTGTTQPNGQIPQATHSVSAVLQEAQRHAETSKDKKPALGNHHDPAVPRAPHAPKSSLPPPPPVRRSSDTSGSPATPLKAKGTGGGGLPAPPDDFLPPPPPPPPLDDPELPPPPPDFMEPPPDFVPPPPPSYAGIAGSELPPPPPPPPAPAPAPVPDSARPPPAVAKRPPVPPKRQENPGHPGGAGGGEQDFMSDLMKALQKKRGNVS',
 '9793_0:005123': 'MGESNEDIDQMFSHLLGEMDLLTKSLGVDTLPPPDPKPPRAEFNFSVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQRESSQDQLHSASLEKSNFSGAASLGYGADMAIMSTSQYGDELPPPPADPMLDLPLPPPPPEPLSQEEQEAAAKADKIKLALEKLKEAKVKKLVVKVHMNDNSTKSLMVDERQLARDILDNLFEKTHCDCNIDWCLYEIYPELQIERFFEDHENVVEVLSDWTRDTENKVVFLEKEEKYAVFKNPQNFYLDNKGKKESKETNGKMNAKNKESLLEESFCGTSIIVPELEGALYLKEDGKKSWKKRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGTQCKMRYKAPTDYCFVLKHPQIQKESQYIKYLCCDDARTLNQWVTGIRIAKYGKTLYDNYQRAMARAGLASRWTNVGTGNAATPAPPSTGLKTGTAQANGQIPQAAHSVSTVLNEADRQVDTPKDKKPALSNHDPGTPRAQHLPKSSLPPPPPVRRSSDTSSSPVMPAKGAAGGLPPLLDDSLPPPPPPPPLEDDELPPPPPDFDDAPPNFVPPPPPWDAGASLPPPPPPPPPALALAPEATKPSPVVAKRPPVPPKRQENPAPASGGGGGEQDFMSDLMKALQKKRGNVA',
 '1706337_0:000fc7': 'MGDPGACRLDELMRVSTMGESNEDIDQMFSNLLGEMDLLTQSLGVDTVPPPEPKPPRAEFNYSVGFKDLNESLNALEDQDLDALMADLVADIREAEQRTIQAQKESSQSRPHSASLDIPSFSGAASLGYTANVAAPSINQYEDDLPPPPADPMLDLPLPPPTPEPLSQEEEEAQAKADKIKLALEKLKEAKVKKLVVKVHMNDNSTKSLMVDERQLARDVLDNLFEKTHCDCNVDWCLYEIYPELQIERFFEDHENVVEILSDWTRDTENKVLFLEKEEKYAVFKNPQNFYLDNRGKKESKETNEKMNAKNKESLLEESFCGTSIIVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGIQCKMKYKAPTDYCFVLKHPQIQKESQYIKYLCCDDARMLNQWVTGIRIAKYGKTLYDSYQRAMARAGLASRWTNLGTVNAAPPAPSSTGVKTGTTQANGQIPQAAHSMSTVLGEAQRQVETTKDKKSGLGSHDPGAPRAQTLPKSSLPPPPPVRRSSEVGCGSPGTSPKVKGAAAGFPAPPHDLLPPPPPPPPLEDDELPPPPPDFSDAPPDFVPPPPPPSFAGDAGSSLPPPPPPPALAPEAAKPTPVVVKRPPAPPKRQANPGPPGGGGGEQDFMSDLMKALQKKRSNMP',
 '51337_0:001b5a': 'MGESNEDIDQMFSTLLGEMDLLTQSLGVDTLPPPDPKPPRAEFNYSVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQKESFQNQSHFAPPETSAHSSAACHGDAAHAASITISQCEGDLPPPPADPVLDLPLPPPPPEPLSQEEEEALAKADKIKLALEKLKEAKVKKLVVKVHMFDNSTKSLMVDERQLARDVLDNLFEKTHCDCSVDWCLYETYPELQIERFFEDHENVVEVLSDWTRDTENKVLFLKKEEKYAVFKNPQNFYLDNKGKKENKETSEKMNAKNKEYLLEESFCGTSVLVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGIQCKMKYKAPTDYCFVLKHPQIQKESQYIRHLCCDDAHTLHQWVMGIRIAKYGKTLYDNYQRAVARAGLASRWTNLGTVNTATPAQPSTGFKTGSSQPNGQIPQTIPSVSAGLQEAQRHETIKDKKPSLSSTEPGAPRDPPGARSSLPPPPPPVRRSSDTCARAASPFPAPPDDLPPPPPPPPLEDPAMLPPPPALPEPPPDCVPPPPPPPGPGPQPARPSPGAGRRPPVPPKRQENPGLPSAGAGGEQDFMSDLMKALQKRGHMP',
 '9568_0:004ae1': 'MGESSEDIDQMFSTLLGEMDLLTQSLGVDTLPPPDPNPPRAEFNYSVGFKDLNESLNALEDQDLEALMADLVADISEAEQRTIQAQKESSQNQHHSASLQASNFSGAAPLGHGTNVAATGISQYEDDLPPPPADPVLDLPLPPPPPEPLSQEEEEAQAKADKIKLALEKLKEAKVKKLVVKVHMDDNSTKSLMVDERQLARDVLDNLFEKTHCDCNVDWCLYEIYPELQIEKESVKSVHVRYNLIRGKVSSCKVVPQNFYLDNRGKKESKETNEKMNAKNKESLLEESFCGTSIIVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGTQHKMKYKAPTDYCFVLKHPQIQKESQYIKYLCCDDARTLNQWVMGIRIAKYGKTLYDNYQRAVAKAGLASRWTNLGTVNAAAPAQPSTGPINGTAQPNGQMPQAAHSVSAVLQEAQRHAETSKVKPARPINGTAQPNGQMPQAAHSVSAVLQEAQRHAETSKRPSPAVAKRPPMPPKRHENPGTPSGAGGGEQDFMSDLMKALQKKRGNVS',
 '43346_0:004190': 'MDVQAEVNLPLHKLFFLTYLPIDGPFRGKKEMGQLSPRTRLLLDCLFLDFFLFQMGESNEDIDQMFSNLLGEMDLLTQSLGVDILPPPDPKPPRAEFNYSVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQKESPQKQSPSASLCVPSFSDTASLGYGANVAAPSQYDDDLPPPPADPMLDLPLPPPPPGPISQEEEEAQAKADKIKLALEKLKEAKVKKLVVKILMNDNSSKSLMVDERQLARDVLDNLFEKTHCDCNVDWCLYEIYPELQIERFFEDHENVVEILSDWTRDTENKLLFLEKEEKYAVFKNPQNFYLDNKGKKENKETNEKMNAKNKESLLEESFCGTSVIVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGIQCKMKYKAPTDYCFVLKHPQIQKESQYIKYLCCDDARALSQWVTGIRIAKYGKTLYDNYQRAMARAGLASRWTNLGTVNAVPPAPPSTGVKTGTTQANGQLPQATQSMNTALGEDWRQLETTKDKKPGPGLGSHDPGAPRAQPLPKSSLPPPPPPVRRSSDVGGAPPPSFAEDLPPPPPPPPALAPESVRTPPVVVKRPPPPPKRQENPGPPGGGGGEQDFMSDLMKALQKKRGNVS',
 '885580_0:00488c': 'MGESHDDIDQMFSTLLGEMDLLTQSLGVDTLPPPAPEPPRAEFNYTVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQRESCQSQNHASSLGASDCGGVTSVGYAANVTAVGISPYEDDLPPPPDDPMLDLPPPPPPPEPLSKEEEEAQAKADKIKLALEKLKEAKVKKLVVKVHMNDNSTKSLMVDERQLARDVLDNLFEKTHCDCSVDWCLYEIYPELQIERFFEDHENVVEVLSDWTRDTENKVLFLEKEEKYAVFKNPQNFYLDNKGKKEVKQTKEKMNAKTKESLLEESFCGTSVIVPELEGALYLKEDGKKAWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNVYYGVQCRVKYKAPSEHCFVLKHPQIQKESQYIKYLCCDDARTLHQWVTGIRVAKYGKTLYENYQRAVARAGLASRWTNLGTVNTAAAAQPPAGLRTGTSQPNGQLPRDAPCLSDALRESQRQADTSKVGPEGQGGHELPDRPKAAFQASTALAASPYSPKH',
 '10181_0:00305d': 'MGQSHHLFTWTIARESLLSSPLHPSLFSCPMQLALGRITRQSYQMGESNDDIDQMFSTLLGEMDLLTQSLGVDTLPPPDPDPPRPEFNYTVGFKDLNESLNALEDQDLDALMADLVADISEAEQRTIQAQKESSQSQNHAASLEASDCSGDASVGSGANVTAVNISQYEDELPPPPADPMLDLPPPPPPPEPLSKEEEEAQAKADKIRLALEKLKEAKVKKLVVKVHMNDNSTKSLMVDERQLARDVLDNLFEKTHCDCSVDWCLYEIYPELQIERFFEDHENVVEVLSDWTRDTENKVLFLEKEEKYAVFKNPQNFYLDNKGKKEGKKTNEKMNAKNKESLLEESFCGTSIIVPELEGALYLKEDGKKAWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNIYYGVQCKMKYKAPTDHCFVLKHPQIQKESQYIKYLCCDDARTLNQWVTGIRIAKYGKTLHENYQRAVARAGLASRWTNLGTVSTATAAQPPTGLRTSTTQHNGQLPQAAPHLSAVLQEAQRQAEASKVGPEGNRPKLQQISAPSPPSKPEAVFWHLLLFPASENPPYCNFT',
 '1415580_0:000900': 'MEQACDDIDQMFSDLLGEMDLLTQSLGVETIPPPSPKAPNTEFNFSVGFKDLNESLNALEDNDLDALMADLMADINETEKKTFQAQKPPSSSQRSTFTDPEPGFSIAASFDYQSNIPAAYTQDFEDYLPPLPLPPKLDLALPLPPPEPSEPLSKEELESKAKTDKIKLALEKMKEAKVKKLVIKVLMDDDSSKTLMVDERQTARDILDNLFEKTHCDCNIDWCLYEVYQELQIERWFEDHENIVDALSGWTRDSENKMLFLQKKEKYAVFKNPQNFYLAKKGKDAGKEMNEKNKESLLKESFCGTSVIVPELEGALYLKEDGKKSWKKRYFLLRASGIYYVPKGKTKTSRDLACFIQFDNVNIYYGTQYKVKFKAPTDHCFVLKHPQIQKESQYIKYLCCDDSWTLHQWVTGIRIAKFGKTLYDNYRFAVQQMGLASRWPNLSKVDPTVTARSSSSGAVQANGQIHQNVIPVISTNPEAFKRAEDKKPNVGRKPDQAQLQPVPSSNHQQSPKVALHASKIPPPAPARISSQAYSSALTLPSNVKNVNANVLLPPPPSPSPPPPDAFPLPPPCNNDLPPPPDDFYDPPPDFLPPPPPCFATGDRAQLPPGPPLPPPPPSSNQPKPFMKKPVPLPPKRQDITSLHSEQPSLAGPTPVGGGGGQPDFMSDLMKALQKKRGSTS',
 '61221_0:00105a': 'MEEACEDIDQMFSDLLGEMDLLTQSLGVETLPPPPTKASSDEFNFAVGFKDLNESLNALEDTDLDALMADLVADISEVERSTLQEPKDASRYQQVGTMQPSADAGASYGCRTDLSNIANAHVDSGLPPPSVELDFDLPPPPPSTPPAPPSELLTKEEQETQAKADKIKLALEKLKEAKVKKLVIKVHMNDNSTKSLMVDERQVARDVLDNLFEKTHCDCNIDWCLYEMCPELQIERFFEDHENVIGVLSDWTRDSENKLLFLEKSEKYAVFKNPQNFYLSNKGKNEIKVMNEKSKESLLEESFCGTSVIVPEVEGALYLKEDGKKSWKKRYFLLRASGIYYVPKGKTKTSRDLACFIQFENVNVYYGIQYKMKYKAPTDHCFVLKHPQIQRESQYIKYLCCEDQQALHHWVTGIRIAKYGKTLYDNYIRAAHKAGLASRLAKSGNTESIVAVGTSAKGSTHANGQVPQSITLSKTDSSETGKSAEMPKVKKPDNTADSIQAPTPNTPQLKHQRKAGGSHSAPPMPPQRVSSAVTAPLQLPTNAEGKGKVCPSDAAEFPPPPESMLPPPELEDLPLPPPPPPEYFESPPDFIPPPPPSCAVAVSAGAPPLPPPPPSASLPRMPLSIKKKPPPPPRRQEESAGQAGLPKPSAPPPKTETAGQGDFMSDLMKALEKKRGATS',
 '7897_0:0033c5': 'MEQACDDIDAMFSDLLGEMDMLTQSLEVESLHPAVPPALNTDFSFAVGFKDLNESLSALEDTDLDALMADLVADINKVEVETSKGHNSALQDSALPPPPSEDVGVIASSIYIPAFPGSTAVNTGYFSEPLPPPPPLPRPPPDADILLPSKPSETLTQEEIEAKAKADKIKDALEKLKEAKVKKLVIKVHMNDESSKTLMVDERQPVREILDNLFEKTHCDCCVDWCLYEINPDLQIERFFEDHENLVEILLHWTRDSENKILFVEHKEKYAVFRNPQNYYLAKKGKGAEKDMKEKMKESLLEESFCGTSVIVPELEGALYLKEDGKKSWKRRYFLLRASGIYYVPKGKTKTSRDLACFIQFDNVNVYYGMQYKVKYKAPTDHCFILKHPQIQKESQYIKYLCCDNQWTLYQWVTGIRIAKYGKTLYDNYKIAIQKAGLASRWANFAKVQSQNTTASSTGSVQANGHAGQVTPVSVSFSEAWKRGDVGKEKQSNDGQDNLPPPPPPPPPPMQGFMNEAFPPPPSLPPIASGSLPPPLRASASAPAPPPISNNFPPPLDELSPPPDDFDFPEPPPDFLPPPPTVSASGVPPPPPPPPPPPAPTAASQPTPLPKKSVPPRRQENTTLSQPRGGGGGGQPDFMSDLTKALQKKRGNAS',
 '8407_0:002bff': 'MEQTCEEIDEMFSNLLGEMDLLTQSLAVESPPPPTTTKAGTDGNVLFGFKDFDSLNTLEDNDLDALMAELVADCTDAEMKVNNQISNSVAFSSNVDSFAYNIPDFTHDLPTGSTDDLSFLPPPPPSEWEMDLPPPPPDPEEPTEKDALESRDKVDKIMLALDKMKEAKVKKLIVKIHMTDNSTKTLMVDERQTVRDILDNVFEKTHCDCTIEWCLYEENPDLQIERFFEDHENIVEILSDWTRDSENKIRFLKKNEKYAVFKNPQNFYMARKGSADMKDMNEKSKVSLLEQSFCAASVVVPDLEGAIYLKEDGKKSWKKRYFLLRASGIYYVPKGKTKTSRDLACFIQFDNVNVYYGMQYKVRYKAPTDHCFVLKHPQIQKESQYIKYLCCDEPWVLHQWVTGIRIAKYGKVLYENYKSAIHKAGLASRWSSLSSSTTPTGAPQGNGQISQNVANVSSSFSDAWKRGETAKDKQQPTEVRKPEQKISLPPSTKQPPPAPVRRPSNAHVVGTPPLPIKAKPVTSNMPPPPPPAEASQWGDDFLPPPPPPELLDTPPNFLPPPPPSFNSESDYPAPPQFTNVGSAGGPPPPPPPPPPPPAALSPKSAPPQLPVKKLPPKPPMRRDSTGQRPNQQNSLMTNGGGAGGQPDFMSDLMSALQKKRSTTT',
 '173247_0:004550': 'MEDIDAMFSDLLGEMDLLTQSLGQETVPPEALPPTNQEVNYSIGFTDLNESLHELEDNDLDALMADLVADLNATEEKLAAEIKGLKTPSPTTSDLPPPPKGLSLHPPSPHPLPASPASSTSSSVSTPASSAASSLPPPPPQNAKPTKEEIEAQMKADKIKLALEKLKEAKVKKLVVKVLMNDGSSKTLMVDERQNVREVLDNLFEKTHCDCNVDWSLCETNPELQLDRTFEDHENLVEPLSAWMRDSENQVLFQERGEKYEVFKNPQNFYLWKKDKRALKDIKDKDKEILIQENFCGTSIIVPDLEGVLHLKEDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFLQFDNVNVYYGKDFKTKYKAPTDFCFVLKHPQIQKESQYIKYLCCDDARTMNLWVTGIRIAKYGVSLYENYKTAEKKAAVNSVWMNRSTPSSSNPSTPSPTIKAKTPNQANGHAPKPQPTAPDSMDFGNFPPPPSADILPPPPPDPAFPPPPPSLPAKSSSRPVAPQHKLPANFPPPPMAMDNLPPPPLPPPIDDSPEAPPDFLPPPPPAAGFGSLPPPPLSMNSLPPPPHFGGMDQSLPPPPPDPEFLPPPPPEPVFTGAGAPPPPPPPPPPPPAQAAAVPRAPVRPSGSVRKVPPAPPKRTTPSLQVGGGGGGGDFMSELMVAMQKKRGDH',
 '30732_0:0046dd': 'MEDIDAMFSDLLGEMNLLTQSLGQEAAPAADPPTSTKEVNFSIGFSDLNASLNELEDKDLDDLMSDLMADLNATEEKLAAELQSLEAPPPPDLPPPPKGLIAPAAAAPTSPASPASVCTPSSTATSPLPAPPPQSVKPSKEDLEAQLKAEKIKLALEKLKEAKVKKLVVKVLMNDGSSKTLMVDERQTVREVLDNLFEKTHCDCNVDWSLRETNHELQLERTFEDHENLVEPLSAWTRDSENKVLFLERGEKYEVFKNPQNFYLWKKDKKALKEIKDKDKEILIKENFCGTSTIVPDLESVLHLREDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNINVYYGNDFKTKYKAPTDFCFVLKHPQIQKESQYIKYLCCDDAWTMNLWVTGIRIAKYGSVLYENFKTAEKKAVVSSAWTNRSTPSSSNPSTPSPTVKAKAQSQANGHAPKPQPGPVSQDFGHLPPPPPPCPNDDLPPPPPDPVFPPPPPPLAAKRSPKTAGRSQHPQGNFPPPPPEMDHLPPPPPMEESPPDFLPPPPPMNSLPHPPPPPASFGGVDHSLPPPPPDPEFLPPPPPDPQVTGGGGPPPPPPPPPPPPPASAPAPRGALRPTGSAKKMPPAPPKRTTPVMGGGGGGGGGGGGDFMSELMKAMQKKRSDQ',
 '241271_0:0048e4': 'MDDIDAMFSEMLGEMDLLTQSLDSSLGPETLPPEPLPSTNKEVNYSFGFTDLNASLHELEDNDLDALMADLVADLSATEEKLAAQIEDLKMPSPPPSDLPAPPVGLSTHPTSSIASPTSPASSTSSNVSTPASSSTSPLPPPPPQAAKPTMEEIEAQMKADKIKLALEKLKEAKVKKLVVKVLLNDGSTKTLMVDERQSVREVLDNLFEKTHCDCNVDWSLCETNPELNLERTFEDHENLVEPLLAWTRDSENKILFQERPGKNEVFKNPQNFYLWKKDKRALKEIKDRDKELLVQENFCGTSIIVPDLEAVLYLKEDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFIQFDNVNVYYGKDFKSKYKAPTDFCFVLKHPQIQKESQYIKYLCCDDAWTMNLWVTGIRIAKYGVSLYENFKAAEKKAANSVWTNRTPVSSNQSTPSPTIKAKSPNQANGHAAKPQPGPESQDFGNIPPPPPPPPPPMTGFLPPPPPDPVLPPPPPLLAAKSPKPSPPQRNLPTNFPPPAMDNLPPPPPPPMDDSFEDPPDFLPPPPPAAGFGSLPPPPPPVNSFPPPPPSAGFGGMGQSLPPPPPDPGFLPPPPPQPMFTGGGTIPPPPPPPPPPTAAPRAPVRPTGSVKKAPPAPPKRTTPSLHGGGGGGGGGGGDFMSELMMAMNKKRGTT',
 '8103_0:0045e4': 'MDDIDAMFSDLLGEMDLLTQSLGQETVPPASLPSTNEEVNLSIGFTDLNESLNELEDTDLDALMADLMADLNATEEKLAAQIEDLKVPSPPPSDLQPPPKGLSIRPVSSLASPTSPASSTGSNVSTSATSPLPPPPPQAAKPTKEEIEAQIKADKIKLALEKLKEAKVKKLVVKVLMNDGSAKTLMVDERQTVREVLDNMFEKTHCDCNVDWSLCETNPELQLERAFEDHENLVEPLLAWTRDSENKVLFQERGEKYEVFKNPQNFYLWKKDMRALKDIKDQDKELLIQENFCGTSIIVPDLEGVMHLKEDGKKSWKPRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNLNIYYGKDFKGKYKAPTDFCFLLKHPQIQKESQYIKYLCCDDAGSMNLWVTGIRIAKYGASLYDNYKTAEKKAAVSSVWTNRSTPSSSNPSTPSPPIKAKSPGQANGHALKPEPGPVAQDFGHVPPPPPADILPPPADILPPPPPQTFLPPPPPPLAAKSSSKPSLPQRHLPTNFPPPPPAMINLPRPPQPPPTDDASEAPPDFLPPPPPAAGFSPLFPPPPPLNALPLPPPPVSFRVEDRSLPPPPPDPGFLPPPPPMFTGAGAPPPPPPPPPPPRVAVRPAGSVKKRPPAVPKRTTPSLRGGGGGDFMSELALAMNKKRSAH',
 '56723_0:00152f': 'MDDIDAMFTDLLGEMDLLTQSLDQPTVVPEPLPSTTEMNYSIGFTDLNESLHELEDHDLDALMADLVADINATEEKLTAQMKDQKVPPPPSSDLPAPPKGLSTYSASSIASPTSPASSTGSNVSTPASSSASPLPPPPPQSAKPTMEEIEAQMKADKIKLALEKLKEAKVKKLVVKVLLTDGSSKTLMVDERQNVREVLDNLFEKTHCDCNVDWSLCETNYELNLERIFEDHENLVEPLLAWTRDTENKVLFQERTEKNDMFRNPQNFYLWKKDKKALKEIKDRDKELLVQENFCGTSVIVPDLEAVLYLKEDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNMNVYYGKDFKTRYKAPTDFCFVLKHPQIQKDSQYIKYLCCDDAWTMNLWVTGIRIAKYGASLYENFKAAEKAAVSSVWTNRSTPSSSSSSTPSPTIKAKSPSQANGHAPKPQPGPVSQDFGNVPPPPPPMANILPPPRPDAFLPPAPPPLARKNSAKPPPPQRHLPTNFPPPPPAMDNLPPPPPPPPMDDALEAPPDFLPPPPPAAGFGSLPPPPPPSNSFPPPPPPGSFGSMGQSLPPPPPDPGFLPPPPPQPVFTGAGAPPPPPPPPPPPTAAAAPRAPVRPSGSVKKIPPATPKRTTPSLQGGGGGGGGGGGGGGDFMSELMLAMNKKRST',
 '210632_0:004c0c': 'MVDIDAMFSDLLGEMDLLTQSLEQEVAPPKSLPALPSADKEVNFSIGFSDLNESLGELEDNDLDALMADLMADLNATEEKLAAEIEELKVPSLPPANLPAPKNLSIHPASSITSPPSASPAGSSSTLASSSLPPPPPQSVKPTTEEMEAQMKADKIKLALEKLKEAKVKKLVVKVMMNDGSSKTLMVDERQNVREVLDNMFEKTHCDCNVDWSLCETNPELQLERAFEDHENLVDLLSTWMRDSENKVLFQERKEKNEVFKNPQNFYLWKKDKKALKDIKDKDKGLLIQENFCGTSIIVPDLEGILHLKEDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNVNVYYSKDYKSKYKAPTDFCFVLKHPQIQKESQYIKYLCCDDAWTLNLWVTGIRIAKYGVALHENFKTAEQKAATSSAWANRSTSSSSNSSTPSPTIKAKSSSQANGIFPKPGPAPQDFGDLPPPPPLAANILPPPPPEPGLPPPPPPPPPQAAKGSAKPAPPKRQMPANFPPPPTAMDNLPPPPPPPPIDNSEAPPDFLPPPPPASGFGSFPPPPPLNSLPPPPRPGGFGGMDQSLPPPPPDPEFLPPPPPPPQAVFTGGGAPPPPPPPPPPPAAAAPSTAIPRVGLRPAGSLKKLPPAPPKRTTPSMQGSGGGGGGDFMSELMLAMQKKRGDHPPAVLASGT',
 '31033_0:00264e': 'MKMDDIDAMFSEMLGEMDLLTKSLDQEMAPPDAPPSTSEEVSFSIGFPDLNESLQELEDSDLDALMADLVADLNATEQKLAAEIEDLKVPPPPQPHLPPKSRGAVSTSSSCSPSPASSATSPLPVPPPQSVKPSMEEIEAQMKADKIKLALEKLKEAKVKKLVVKVLLNDGSSKTLMVDERQSVRDVLDNLFEKTHCDCNVDWSLCETNAELQLERTFEDHENLVEPLLAWTRDSQNKVFFQERPEKNEVFKNPQNFYLWKKDKKTLQAIKDKDKEILIKENFCGTSIIVPDLEAVLHLREDGKKSWKQRLFQLRASGLYYVPKGKTKSSRDLICFVQFDNLNVYYGKDFRSKYKAPTEFCFVLKHPQIQKESQYIKYLCCDDAWTMNLWVAGIRIAKYGTALHQNYQTALRKAAVTSAWTNCSKPSSDGPPTPPTTIKASPANGHVPKPPPGAAPQDVFPPPPPPMDILPPPPPDPAFPPPPPPLMAKRSPKPSAGHRQAPGDHLPPPPLAPPHDDASEDPPDFLPPPPPSFDSLPPPPPGMSAFPPPPPLLGFSETSQPLPPPPPDPELLLPPPPASMISTGAGAPPPPPPPPPAAAASPRPAPTASGSVRKRPPAPPKRTTPALHGSGGGAGEGAGGGDFMSELMKAMNKKRADHS',
 '63155_0:004c86': 'MDDIDAMFSAMLGEMDLLTQSLGEEKAHPEPHPSSDKQVNFSIGFTDLNESLHELEDNDLDMLMADLMADLNATEEKLAAEIHGLKEPPQPKPDPLPLPRGSSNAPVSEHILPASSGGSGSSNVSTPAPSAACSLPPPPPQCVKPTMDDIEAQAKADKIKLALEKLKEAKVKKLVVKVLMNDGSSKTLMVDERQNVREVLDNMFEKTHCDCNVNWSLCETNPELQLERAFEEHENLVESLSAWIRDSENKVLFQERPEKYEVFKNPQNFYLWKKDKKTLKDIKDKDKELLIQENFCGTSIIVPDLEGVLHLKEDGKKSWKQRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNMNVYYGKDFKTKHKAPTDFCFVLKHPQIQKDSQYIKYLCCDDAWTMNLWVTGIRIAKYGASLYENYKTAEIKGSNSMWTNRSTPSSSNQSTPSPTVKAKSPNQANGHPPKPQPGPISQAPFPPPPLAEVLPPPPPDPVLPPPPPMPAKSSAKPSPPKRQQQSNFPPPPPELDNLPPPPPPPPTDDTAEAPPDFLPPPPPAVGFGSLPPPPPSFGGVGQSLPPPPPDPQSLPPPPPDPVFIGAGAPPPPPPPPPPPAPGAPVTTLRPAVRPSGSLKKVPPAPPQRNTPSVSGGGGGGGGDFMSELMLAMQKKRGAQ',
 '7994_0:004d71': 'MDDIDAMFTDMLEEMDLLTQSLGAEATEPTPPSKSSSSSSFNSMPEMSNFSIGFTDLNASLNELEDNDLDSLMADLVADLNATEELFAAEKGGVKEPRPPPAVTVPAVHFGSAAPIAPAAPTPSKPKNDVTSCPPAGNTQSLPPPPPASTRPSTDDPEAQKAEKIKLALEKLKEAKVKKLIVKVEITDGSSKTLMVDERQTVRDVMDNLFEKTHCDGNVDWCLCETNPELQTERGFEDHENLVEPLSAWTRDSENKVLFHERKDKYEVFKNPQNFYMWKKDKKSLMDMKDKDKELLLEENFCGTSVIVPDLEGMMYLKEEGKKSWKQRYFLLRASGLYYLPKGKTKSSKDMVCLVQFDNMNVYYCSEYKTKYKAPTDYCFILKHPQIQKESQYIKYLCCDDKWTMTLWVTGIRIAKYGKTMYENYKTAARKGSSLSAVWTSMNRQPSPSTSNTSTPSPTPKAKTANGHASQPRSETVPKAPSNQSAFPPPPPPADFLPPPPPDPTLPPPPPPPPALPVKKESNPPRSAPQRSQPAFPPPPPAMDFSLPPPPPPSDDLEMPPDFLPPPPPAPGGFMGGDLLPPPPPEPFHAPLPPPPAAFHPPPAVHPPPQATGGDLPPPPPPPPPPPPAPAAFHQTPSVRKVGPPPPKRTTPSLAAPSGGDFMSELMLAMNKKRGGQ',
 '109280_0:00369f': 'MDDIDAMFSDLLGEMDLLTQSLGQEQAPPPSSPPEAEQEVNLSIGFTDLNASLNELEDNDLDALMADLVADLNATEEKLAAEIESLKEPQPEPLPPPSVGPPSSSPPLSSDSSTTFPSSTLPPPPPQSSKPTMEEIEAQIKADKIKLALEKLKEAKVKKLVVKVCMNDGSSKTLMVDERQNVREVLDNLCEKTHCDYNVDWSLCETNPELQLERTFEDHEHLVEPLLAWTRDSENQILFQESSDKYEVFKNPQKFYLWKKDKKVLKDMKDKDKEILIKENFCGTSIMVPDLEGVLHLKEDGKKSWKPRLFQLRASGIYYVPKGKTKSSRDLVCFVQFDNVNVYYGKDFRAKHKAPSDFCFVLKHPQIQKDSQYIKYLCCDDAWTMNLWVTGIRIAKYGSNLYENFKTAEKKAAVGSAWASCSVTSGQKQSQSQVANGHANNTSPSPLPPPPPPLGEDLPPPPPPPPQLGKTLPPAPPPLGATLPTPPPPPGGTPPPPPPPRRNPPPYPRHLPHISELYPLRRLLLLYQLPTVHSLVLLFQPNLPPNPSPNHDVQRPISRCLPGPQITFPPPPPPPVDDSPPDFLPPPPPAANFGSHPPPPPPVKTLPPPPPHMKTLPPARLSFKSTNLPPPPPDPGFLPPPLTGVPPPPPPPPPPPPTTAAAGPRRAPVRPSGSLKKMPPPPPKRSTPSLHGRRDGDRGDGDGGGGGDFMSELMRAMQKKRDPH',
 '150288_0:004e5a': 'MDDIDVMFSHLLEEMDDLTQSLVQSADTAADAQTNSSGASDLNEPLNNLDKSESDHLLAQPEETLPVDNQTAPSDPPLPSASSVTLASPTTLAMLKLEPMEVQNESPAPKLTMPQTANNEKPPQTIIKVWMSDGSTKTLMVEGTQTVRDVLDKLFEKTYCDCATEWSLCEINQELHVERILEDHECFVESLSMWSSVTDNKLYFLKRPQKYVIFTQPQFFYMWKRSSLKAISEQEQQLLLKENFGGLTAVVPDLEGWLYLKDDGRKVWKPRYFVLRASGLYYVPKGKTKSSSDLACFVRFEQVNVYSADGHRIRYRAPTDYCFVLKHPCIQKESQYVKFLCCENEDTVLLWVNSIRIAKYGTVLYENYKTALKRAQHPPDRCSTSSDNLNSQIGQSAPTPDECIEQDEPPPDFIPPPPPGYMAIL'}
ex1.idr_position_map
{'9606_0:00294e': [440, 665],
 '9793_0:005123': [440, 657],
 '1706337_0:000fc7': [457, 676],
 '51337_0:001b5a': [440, 632],
 '9568_0:004ae1': [426, 564],
 '43346_0:004190': [492, 656],
 '885580_0:00488c': [440, 524],
 '10181_0:00305d': [484, 578],
 '1415580_0:000900': [439, 679],
 '61221_0:00105a': [440, 678],
 '7897_0:0033c5': [442, 653],
 '8407_0:002bff': [430, 663],
 '173247_0:004550': [432, 680],
 '30732_0:0046dd': [425, 663],
 '241271_0:0048e4': [435, 685],
 '8103_0:0045e4': [429, 673],
 '56723_0:00152f': [430, 685],
 '210632_0:004c0c': [429, 685],
 '31033_0:00264e': [420, 658],
 '63155_0:004c86': [431, 667],
 '7994_0:004d71': [439, 676],
 '109280_0:00369f': [418, 723],
 '150288_0:004e5a': [376, 424]}

The mod input is required so that you can preload the ESM model before running the method. You preload the ESM model with pairk.ESM_Model()

you can use pre-generated embeddings by providing them in a dictionary format to the precomputed_embeddings argument. The keys should be the sequence ids and the values should be full length sequence embedding tensors. For each sequence, the tensor.shape[0] should be equal to l+2, where l is the length of the full length sequence. The +2 is for the start and end tokens. If you provide precomputed embeddings, the mod and device arguments are ignored.

The pairk.pairk_alignment_embedding_distance method returns a PairkAln object, just like the previous methods

example usage: loading the ESM2 model and running the method

mod = pairk.ESM_Model(threads=4)
aln_results_embedding = pairk.pairk_alignment_embedding_distance(
    full_length_sequence_dict=ex1.full_length_dict,
    idr_position_map=ex1.idr_position_map,
    query_id=ex1.query_id,
    k=5,
    mod=mod,
    device="cpu"
)

k-mer alignment results

The results of the above pairwise k-mer alignment methods are returned as a pairk.PairkAln object.

The actual “alignments” are stored as matrices in the pairk.PairkAln object. The main matrices are:

  • orthokmer_matrix - the best matching k-mers from each homolog for each query k-mer

  • position_matrix - the positions of the best matching k-mers in the homologs

  • score_matrix - the scores of the best matching k-mers

Each matrix is a pandas DataFrame where the index is the start position of the k-mer in the query sequence. The columns are the query k-mers + the homolog sequence ids.

The pairk.PairkAln object has some useful methods for accessing the data. For example, you can get the best matching k-mers for a query k-mer by its position in the query sequence using the pairk.get_pseudo_alignment() method (or by directly accessing the dataframes). You can also plot the matrices as heatmaps, save the results to a json file, and load the results from that file

example: accessing the DataFrames from the pairk.PairkAln object directly

aln_results.score_matrix
query_kmer 9793_0:005123 1706337_0:000fc7 51337_0:001b5a 9568_0:004ae1 43346_0:004190 885580_0:00488c 10181_0:00305d 1415580_0:000900 61221_0:00105a ... 30732_0:0046dd 241271_0:0048e4 8103_0:0045e4 56723_0:00152f 210632_0:004c0c 31033_0:00264e 63155_0:004c86 7994_0:004d71 109280_0:00369f 150288_0:004e5a
0 TNLGT 22.0 28.0 28.0 28.0 28.0 28.0 28.0 10.0 9.0 ... 13.0 12.0 13.0 13.0 9.0 8.0 13.0 8.0 14.0 9.0
1 NLGTV 16.0 29.0 29.0 29.0 29.0 29.0 29.0 16.0 8.0 ... 11.0 13.0 11.0 13.0 7.0 7.0 11.0 6.0 14.0 10.0
2 LGTVN 16.0 29.0 29.0 29.0 29.0 29.0 23.0 10.0 7.0 ... 11.0 8.0 9.0 10.0 7.0 8.0 10.0 8.0 16.0 4.0
3 GTVNA 19.0 26.0 23.0 26.0 26.0 23.0 17.0 15.0 10.0 ... 11.0 9.0 10.0 8.0 8.0 9.0 11.0 9.0 9.0 6.0
4 TVNAA 18.0 25.0 22.0 25.0 22.0 22.0 16.0 14.0 11.0 ... 11.0 9.0 8.0 12.0 8.0 11.0 11.0 10.0 11.0 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
217 LQKKR 31.0 31.0 19.0 31.0 31.0 3.0 9.0 31.0 24.0 ... 26.0 19.0 19.0 19.0 26.0 19.0 26.0 19.0 26.0 0.0
218 QKKRG 29.0 23.0 16.0 29.0 29.0 5.0 3.0 29.0 22.0 ... 23.0 22.0 16.0 16.0 29.0 17.0 29.0 22.0 23.0 -1.0
219 KKRGN 29.0 23.0 18.0 29.0 29.0 2.0 2.0 23.0 21.0 ... 17.0 23.0 15.0 17.0 23.0 18.0 21.0 22.0 14.0 -1.0
220 KRGNV 29.0 19.0 20.0 29.0 29.0 8.0 8.0 17.0 15.0 ... 9.0 17.0 8.0 11.0 14.0 9.0 13.0 14.0 11.0 2.0
221 RGNVS 23.0 12.0 13.0 27.0 27.0 9.0 13.0 15.0 13.0 ... 10.0 11.0 8.0 13.0 7.0 9.0 7.0 9.0 8.0 9.0

222 rows × 23 columns

example: access the best matching k-mers for the query k-mer at position 4:

print(aln_results.orthokmer_matrix.loc[4])
query_kmer          TVNAA
9793_0:005123       TGNAA
1706337_0:000fc7    TVNAA
51337_0:001b5a      TVNTA
9568_0:004ae1       TVNAA
43346_0:004190      TVNAV
885580_0:00488c     TVNTA
10181_0:00305d      TVSTA
1415580_0:000900    NVNAN
61221_0:00105a      AVSAG
7897_0:0033c5       TVSAS
8407_0:002bff       SQNVA
173247_0:004550     TPNQA
30732_0:0046dd      TVKAK
241271_0:0048e4     PVNSF
8103_0:0045e4       PLNAL
56723_0:00152f      TAAAA
210632_0:004c0c     TIKAK
31033_0:00264e      TIKAS
63155_0:004c86      TVKAK
7994_0:004d71       TSNTS
109280_0:00369f     TTAAA
150288_0:004e5a     NLNSQ
Name: 4, dtype: object

example: access the best matching k-mers for the query k-mer at position 4 using the get_pseudo_alignment method. (the returned list includes the query k-mer sequence)

aln_results.get_pseudo_alignment(4)
['TVNAA',
 'TGNAA',
 'TVNAA',
 'TVNTA',
 'TVNAA',
 'TVNAV',
 'TVNTA',
 'TVSTA',
 'NVNAN',
 'AVSAG',
 'TVSAS',
 'SQNVA',
 'TPNQA',
 'TVKAK',
 'PVNSF',
 'PLNAL',
 'TAAAA',
 'TIKAK',
 'TIKAS',
 'TVKAK',
 'TSNTS',
 'TTAAA',
 'NLNSQ']

you can search for a specific kmer to get its positions. You can then use the positions to query the matrices.

aln_results.find_query_kmer_positions('LPPPP')
[75, 113, 127, 157]
aln_results.get_pseudo_alignment(75)
['LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'PPMPP',
 'LPPPP',
 'LPDRP',
 'APSPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'LPPPP',
 'IPPPP']
aln_results.orthokmer_matrix.loc[[75, 113, 127, 157]].T
75 113 127 157
query_kmer LPPPP LPPPP LPPPP LPPPP
9793_0:005123 LPPPP LPPPP LPPPP LPPPP
1706337_0:000fc7 LPPPP LPPPP LPPPP LPPPP
51337_0:001b5a LPPPP LPPPP LPPPP LPPPP
9568_0:004ae1 PPMPP PPMPP PPMPP PPMPP
43346_0:004190 LPPPP LPPPP LPPPP LPPPP
885580_0:00488c LPDRP LPDRP LPDRP LPDRP
10181_0:00305d APSPP APSPP APSPP APSPP
1415580_0:000900 LPPPP LPPPP LPPPP LPPPP
61221_0:00105a LPPPP LPPPP LPPPP LPPPP
7897_0:0033c5 LPPPP LPPPP LPPPP LPPPP
8407_0:002bff LPPPP LPPPP LPPPP LPPPP
173247_0:004550 LPPPP LPPPP LPPPP LPPPP
30732_0:0046dd LPPPP LPPPP LPPPP LPPPP
241271_0:0048e4 LPPPP LPPPP LPPPP LPPPP
8103_0:0045e4 LPPPP LPPPP LPPPP LPPPP
56723_0:00152f LPPPP LPPPP LPPPP LPPPP
210632_0:004c0c LPPPP LPPPP LPPPP LPPPP
31033_0:00264e LPPPP LPPPP LPPPP LPPPP
63155_0:004c86 LPPPP LPPPP LPPPP LPPPP
7994_0:004d71 LPPPP LPPPP LPPPP LPPPP
109280_0:00369f LPPPP LPPPP LPPPP LPPPP
150288_0:004e5a IPPPP IPPPP IPPPP IPPPP

Note

the k-mers are defined by position rather than sequence. You could easily make a variant of this method that uses the unique sequences instead. It would make the method slightly faster. The reason that I didn’t do this is because I wanted to mimic the LLM embedding version of Pairk, where identical k-mers have different embeddings and thus are treated as different k-mers.Inclusion of duplicate k-mers does alter the final z-scores, so it’s something to be aware of.

example: plot a heatmap of the matrices

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
aln_results.plot_position_heatmap(ax)
ax.xaxis.set_visible(False)
_images/pairk_tutorial_simplified_62_0.png

example: save the results to a file using write_to_file and load them back into python using from_file:

aln_results.write_to_file('./aln_results.json')
aln_results = pairk.PairkAln.from_file('./aln_results.json')
print(aln_results)
PairkAln object for 222 query k-mers
query sequence: TNLGTVNAAAPAQPSTGPKTGTTQPNGQIPQATHSVSAVLQEAQRHAETSKDKKPALGNHHDPAVPRAPHAPKSSLPPPPPVRRSSDTSGSPATPLKAKGTGGGGLPAPPDDFLPPPPPPPPLDDPELPPPPPDFMEPPPDFVPPPPPSYAGIAGSELPPPPPPPPAPAPAPVPDSARPPPAVAKRPPVPPKRQENPGHPGGAGGGEQDFMSDLMKALQKKRGNVS
k-mer length: 5