API Documentation
PairK convenience functions
- class pairk.FastaImporter(fasta_path: str | Path)[source]
import fasta file and return seqrecord objects in various formats
- Parameters:
fasta_path (str | Path) – file path to fasta file
- import_as_alignment() MultipleSeqAlignment[source]
return multiple sequence alignment object
- Returns:
multiple sequence alignment object
- Return type:
Align.MultipleSeqAlignment
- import_as_dict() dict[str, SeqRecord][source]
return dictionary of SeqRecord objects for each sequence in the fasta file
- Returns:
dictionary of SeqRecord objects, keys are the sequence ids and values are the SeqRecord objects
- Return type:
dict[str, SeqRecord]
pairwise k-mer alignment
- pairk.pairk_alignment(idr_dict: dict[str, str], query_id: str, k: int, matrix_name: str = 'EDSSMat50') PairkAln[source]
run pairwise k-mer alignment method using an exhaustive comparison of k-mers. Each query k-mer is scored against each ortholog k-mer to find the best matching ortholog k-mer in each ortholog. If an ortholog IDR is shorter than the k-mer, a string of “-” characters (“-”*k) is assigned as the best matching ortholog k-mer for that ortholog.
Note: if there are multiple top-scoring matches, only one is returned.
- Parameters:
idr_dict (dict[str, str]) – input sequences in dictionary format with the key being the sequence id and the value being the sequence as a string
query_id (str) – the id of the query sequence within the idr_dict dictionary
k (int) – the length of the k-mers to use for the alignment
matrix_name (str, optional) – The name of the scoring matrix to use in the algorithm, by default “EDSSMat50”. The available matrices can be viewed with the function pairk.print_available_matrices().
- Returns:
an object containing the alignment results. See the pairk.PairkAln class for more information.
- Return type:
PairkAln
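The exhaustive comparison described for pairk.pairk_alignment can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (kmers, identity_score, best_ortholog_kmers) are hypothetical, and a toy identity count stands in for a substitution matrix such as EDSSMat50.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every overlapping k-mer in seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity_score(a: str, b: str) -> int:
    """Toy score: number of identical positions (assumption: stands in for a real matrix)."""
    return sum(x == y for x, y in zip(a, b))


def best_ortholog_kmers(idr_dict: dict[str, str], query_id: str, k: int) -> dict[int, dict[str, str]]:
    """For each query k-mer start position, pick the best-scoring k-mer in each ortholog."""
    query = idr_dict[query_id]
    results: dict[int, dict[str, str]] = {}
    for pos, q_kmer in enumerate(kmers(query, k)):
        row = {"query_kmer": q_kmer}
        for seq_id, seq in idr_dict.items():
            if seq_id == query_id:
                continue
            candidates = kmers(seq, k)
            if not candidates:
                # ortholog IDR shorter than k: fill with gap characters
                row[seq_id] = "-" * k
            else:
                # max() keeps the first of several top-scoring candidates
                row[seq_id] = max(candidates, key=lambda c: identity_score(q_kmer, c))
        results[pos] = row
    return results


idrs = {"query": "PPLPKR", "orth1": "AAPPLPRR", "orth2": "PL"}
table = best_ortholog_kmers(idrs, "query", k=4)
for position, row in table.items():
    print(position, row)
```

Each row is keyed by the query k-mer start position and holds the query k-mer plus one best match per ortholog, which mirrors the layout described for the PairkAln dataframes.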
- pairk.pairk_alignment_needleman(idr_dict: dict[str, str], query_id: str, k: int, aligner: PairwiseAligner | None = None, matrix_name: str = 'EDSSMat50') PairkAln[source]
run pairwise k-mer alignment method using the Needleman-Wunsch algorithm as implemented in Biopython. Each query k-mer is scored against each ortholog k-mer to find the best matching ortholog k-mer in each ortholog. If an ortholog IDR is shorter than the k-mer, a string of “-” characters (“-”*k) is assigned as the best matching ortholog k-mer for that ortholog.
Note: if there are multiple top-scoring matches, only one is returned.
- Parameters:
idr_dict (dict[str, str]) – input sequences in dictionary format with the key being the sequence id and the value being the sequence as a string
query_id (str) – the id of the query sequence within the idr_dict dictionary
k (int) – the length of the k-mers to use for the alignment
aligner (Align.PairwiseAligner | None, optional) – The Biopython pairwise aligner object to use in the pairwise gapless alignments, by default None. If None, then an aligner object will be created using the scoring matrix specified in matrix_name. If an aligner object is provided, it will take precedence over the matrix_name parameter, i.e. the matrix_name parameter will be ignored.
matrix_name (str, optional) – The name of the scoring matrix to use in the algorithm, by default “EDSSMat50”. The available matrices can be viewed with the function print_available_matrices() in pairk.backend.tools.matrices. If an aligner object is provided, this parameter will be ignored.
- Returns:
an object containing the alignment results. See the pairk.PairkAln class for more information.
- Return type:
PairkAln
- pairk.make_aligner(matrix_name: str) PairwiseAligner[source]
generates a Bio.Align.PairwiseAligner object with the given matrix.
The aligner is set to global mode, with open and extend gap scores set to extremely low values to prevent gaps from being introduced.
- Parameters:
matrix_name (str) – the substitution matrix to be used in the alignment. The available matrices can be viewed with the function print_available_matrices() in pairk.backend.tools.matrices. Some matrices may not be compatible with the Biopython aligner object; an incompatible matrix raises an error.
- Returns:
the aligner object with the given matrix
- Return type:
Align.PairwiseAligner
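To see why extremely low gap scores make a global alignment effectively gapless, consider the following sketch. It is a small hand-written Needleman-Wunsch scorer, not the Biopython aligner that make_aligner returns, and it assumes a simple match/mismatch score and a single linear gap penalty rather than separate open and extend scores.

```python
def gapless_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score two equal-length k-mers position by position, with no gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def needleman_wunsch_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                           gap: int = -1_000_000) -> int:
    """Global alignment score with a linear gap penalty (assumption: not affine)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


query_kmer, ortholog_kmer = "LPKR", "PKRL"
# With a prohibitive gap penalty the best global alignment has no gaps.
print(needleman_wunsch_score(query_kmer, ortholog_kmer))
print(gapless_score(query_kmer, ortholog_kmer))
# With a mild gap penalty the aligner shifts the k-mers and introduces gaps.
print(needleman_wunsch_score(query_kmer, ortholog_kmer, gap=-1))
```

For equal-length k-mers, any gapped alignment needs at least two gaps, so a sufficiently negative gap score guarantees that the ungapped alignment wins.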
- pairk.pairk_alignment_embedding_distance(full_length_sequence_dict: dict[str, str], idr_position_map: dict[str, list[int]], query_id: str, k: int, mod: ESM_Model, device: str = 'cuda', precomputed_embeddings: None | dict[str, Tensor] = None)[source]
run pairwise k-mer alignment method using residue embeddings from a large language model to find the best k-mer matches from each homolog. By default, the ESM2 protein large language model is used to generate residue embeddings. Other residue embeddings (e.g. from different LLMs) can be used by providing the embeddings directly to the precomputed_embeddings argument. If an ortholog IDR is shorter than the k-mer, a string of “-” characters (“-”*k) is assigned as the best matching ortholog k-mer for that ortholog.
Note: if there are multiple top-scoring matches, only one is returned.
If precomputed_embeddings is not provided, sequence embeddings are calculated for each full length sequence in the input dictionary. The idr_position_map dictionary is used to extract the IDR and the IDR embeddings from each sequence.
If precomputed_embeddings is provided, the function will use these embeddings. The provided embeddings must have an extra dimension for the start and end tokens. The start token is at index 0 and the end token is at index -1. These are stripped out before calculating the pairwise distances.
The Euclidean distance is calculated between each query k-mer embedding slice and each ortholog k-mer embedding slice to find the best matching ortholog k-mer from each ortholog.
- Parameters:
full_length_sequence_dict (dict[str, str]) – input sequences in dictionary format with the key being the sequence id and the value being the sequence as a string
idr_position_map (dict[str, list[int]]) – a dictionary where the keys are the sequence ids in full_length_sequence_dict and the values are the start and end positions of the IDR in the sequence (using python indexing). This is used to slice out the IDR embeddings/sequences from the full-length embeddings/sequences.
query_id (str) – the id of the query sequence within the full_length_sequence_dict dictionary and the idr_position_map dictionary. The query id must be present in both dictionaries.
k (int) – the length of the k-mers to use for the alignment
mod (esm_tools.ESM_Model) – ESM2 model used to generate the embeddings
device (str, optional) – whether to use cuda or cpu for pytorch, must be either “cpu” or “cuda”, by default “cuda”. If “cuda” fails, it will default to “cpu”. This argument is passed to the esm_tools.ESM_Model.encode method.
precomputed_embeddings (None | dict[str, torch.Tensor], optional) – a dictionary where the keys are the sequence ids in full_length_sequence_dict and the values are the precomputed embeddings for each sequence. If this is provided, the function will use these embeddings instead of computing them. Allows you to pass any precomputed embeddings (from any LLM) for the sequences. The provided embeddings must have an extra dimension for the start and end tokens. The start token is at index 0 and the end token is at index -1. These are stripped out before calculating the pairwise distances.
- Returns:
an object containing the alignment results. See the pairk.PairkAln class for more information.
- Return type:
PairkAln
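The embedding-based matching can be illustrated with toy data and the standard library. This is a sketch under stated assumptions, not the library's code: embeddings are plain nested lists instead of torch tensors, the helper names are hypothetical, the IDR positions are treated as a Python slice (end exclusive), and each k-mer slice is flattened into one vector before taking the Euclidean distance.

```python
import math
from typing import Optional

Embedding = list[list[float]]  # one row (vector) per token


def strip_special_tokens(embedding: Embedding) -> Embedding:
    """Drop the start token (index 0) and end token (index -1)."""
    return embedding[1:-1]


def kmer_distance(query_slice: Embedding, ortholog_slice: Embedding) -> float:
    """Euclidean distance between two k x d slices, flattened to vectors (assumption)."""
    flat_q = [v for row in query_slice for v in row]
    flat_o = [v for row in ortholog_slice for v in row]
    return math.dist(flat_q, flat_o)


def best_kmer_start(query_idr: Embedding, ortholog_idr: Embedding,
                    query_start: int, k: int) -> Optional[int]:
    """Start position of the closest ortholog k-mer, or None if the IDR is shorter than k."""
    n_kmers = len(ortholog_idr) - k + 1
    if n_kmers < 1:
        return None
    q = query_idr[query_start:query_start + k]
    return min(range(n_kmers), key=lambda s: kmer_distance(q, ortholog_idr[s:s + k]))


# Toy 2-dimensional embeddings, including start/end token rows.
query_full = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [9, 9]]
ortholog_full = [[0, 0], [5, 5], [1, 1], [2, 2], [5, 5], [9, 9]]
idr_position_map = {"query": [2, 4], "orth1": [0, 4]}

q_start, q_end = idr_position_map["query"]
o_start, o_end = idr_position_map["orth1"]
query_idr = strip_special_tokens(query_full)[q_start:q_end]
ortholog_idr = strip_special_tokens(ortholog_full)[o_start:o_end]

print(best_kmer_start(query_idr, ortholog_idr, query_start=0, k=2))
```

Stripping the special tokens first keeps the embedding rows aligned with residue positions, so the same indices can slice both the sequence and its embedding.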
- class pairk.ESM_Model(model_name: str = 'esm2_t33_650M_UR50D', threads: int = 1)[source]
Adapted from the kibby conservation method (DOI: 10.1093/bib/bbac599). See https://github.com/esbgkannan/kibby
Class that loads a specified ESM model. Provides a method for encoding protein sequences.
available models:
- esm1b_t33_650M_UR50S
- esm2_t6_8M_UR50D
- esm2_t12_35M_UR50D
- esm2_t30_150M_UR50D
- esm2_t33_650M_UR50D (default)
- esm2_t36_3B_UR50D
- Variables:
model_name (str) – the name of the model that was loaded.
threads (int) – the number of threads for pytorch to use, by default 1.
- encode(sequence, device='cuda')[source]
encode a protein sequence using the loaded model.
- Parameters:
sequence (str) – the amino acid sequence to encode.
device (str, optional) – whether to use a GPU via “cuda”, or “cpu”, by default “cuda”
- Returns:
sequence embedding tensor
- Return type:
torch.Tensor
- class pairk.PairkAln(orthokmer_df: DataFrame, pos_df: DataFrame, score_df: DataFrame | None = None)[source]
A class to store the results of the pairwise alignment.
The primary data is stored in pandas dataframes. All dataframes have the same structure. One column is the query k-mer sequence (‘query_kmer’). The other columns are named as the ortholog sequence ids. The dataframe indexes are the query k-mer start position in the query sequence.
- Variables:
orthokmer_matrix (pd.DataFrame) – the best scoring k-mer from each ortholog for each query k-mer.
position_matrix (pd.DataFrame) – the start position of the best scoring k-mer from each ortholog for each query k-mer.
score_matrix (pd.DataFrame | None) – the alignment scores for each k-mer in the query sequence against the corresponding best matching ortholog k-mer.
query_kmers (list[str]) – the list of query k-mers that were aligned.
query_sequence (str) – the full query sequence that was originally split into k-mers and aligned.
k (int) – the k-mer size used for the alignment.
- find_query_kmer_positions(kmer: str)[source]
convenience function to search for the positions of a k-mer string.
- Parameters:
kmer (str) – the k-mer string to search for.
- Returns:
the positions in the query sequence that match the input kmer.
- Return type:
list[int]
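A k-mer position search of this kind can be sketched as follows. The function name is hypothetical, and whether the library reports overlapping occurrences is an assumption here.

```python
def find_kmer_positions(query_sequence: str, kmer: str) -> list[int]:
    """Return every 0-based start position where kmer occurs, including overlaps."""
    last_start = len(query_sequence) - len(kmer)
    return [i for i in range(last_start + 1) if query_sequence.startswith(kmer, i)]


print(find_kmer_positions("PPLPPLPK", "PLP"))
```

The returned positions can then be used as the position argument of methods such as get_pseudo_alignment, which expect a 0-based query k-mer position.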
- classmethod from_file(filepath: str | Path)[source]
import the pairwise alignment matrices from a json file.
- Parameters:
filepath (str|Path) – the path to the json file containing the pairwise alignment matrices.
- Returns:
PairkAln object containing the pairwise alignment matrices.
- Return type:
pairk.PairkAln
- get_pseudo_alignment(position: int) list[str][source]
get a list of the best scoring k-mers from each ortholog at a given query position.
- Parameters:
position (int) – the position of the query k-mer in the query sequence (0-based index).
- Returns:
list of the best scoring k-mers from each ortholog for the query k-mer.
- Return type:
list[str]
- plot_position_heatmap(ax: Axes | None = None, **kwargs) Axes[source]
plot a heatmap of the start positions of the best scoring k-mers in each ortholog.
- Parameters:
ax (matplotlib.axes.Axes | None, optional) – The axis to plot the heatmap on, by default None. If None, a new figure is created.
kwargs (optional) – additional keyword arguments to pass to the seaborn heatmap function.
- Returns:
The axis on which the heatmap was plotted.
- Return type:
matplotlib.axes.Axes
- plot_score_heatmap(ax: Axes | None = None, **kwargs) Axes[source]
plots a heatmap of the alignment scores.
- Parameters:
ax (matplotlib.axes.Axes | None, optional) – The axis on which to plot the heatmap, by default None. If None, a new figure is created.
kwargs (optional) – additional keyword arguments to pass to the seaborn heatmap function.
- Returns:
The axis on which the heatmap was plotted.
- Return type:
matplotlib.axes.Axes
- Raises:
ValueError – raised if no score matrix is found.
k-mer conservation
- pairk.calculate_conservation(pairk_aln_results: PairkAln, score_func: Callable = property_entropy) PairkConservation[source]
calculate the conservation scores and z-scores for each k-mer position in the PairkAln object.
- Parameters:
pairk_aln_results (PairkAln) – the results of the pairk alignment step as a pairk.PairkAln object.
score_func (Callable, optional) – A function to calculate conservation scores in a columnwise manner, by default it is the property_entropy function from Capra and Singh 2007, DOI: 10.1093/bioinformatics/btm270 located in the pairk.pairk_conservation.capra_singh_functions module.
- Returns:
PairkConservation object containing the conservation scores and z-scores for each k-mer position.
- Return type:
PairkConservation
- pairk.calculate_conservation_arrays(orthokmer_df: DataFrame, score_func: Callable = property_entropy) tuple[ndarray, ndarray, ndarray][source]
calculate the conservation scores and z-scores from a dataframe of ortholog k-mers
- Parameters:
orthokmer_df (pd.DataFrame) – the best scoring k-mer from each ortholog for each query k-mer. The index should be the query k-mer start positions, and the columns should be ‘query_kmer’ and the ids of the orthologs.
score_func (Callable, optional) – A function to calculate conservation scores in a columnwise manner, by default it is the property_entropy function from Capra and Singh 2007, DOI: 10.1093/bioinformatics/btm270 located in the pairk.pairk_conservation.capra_singh_functions module.
- Returns:
returns the orthokmer_df as a numpy array, the conservation scores as a numpy array, and the z-scores as a numpy array.
- Return type:
tuple[np.ndarray, np.ndarray, np.ndarray]
- Raises:
ValueError – If the orthokmer_df contains non-string elements
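The columnwise scoring and z-score steps can be sketched with the standard library. This is an illustration only: it uses a plain normalised Shannon entropy in place of the default property_entropy function, the helper names are hypothetical, and the use of the population standard deviation is an assumption. The background is taken as all scores flattened together, following the description of bg_scores in pairk.PairkConservation.

```python
import math
import statistics
from collections import Counter


def column_conservation(column: str) -> float:
    """1 minus normalised Shannon entropy of one column; 1.0 means fully conserved."""
    total = len(column)
    counts = Counter(column)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    max_entropy = math.log(min(total, 20))  # at most 20 amino acids
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0


def score_pseudo_msa(kmers: list[str]) -> list[float]:
    """Score each column of a list of equal-length k-mers."""
    return [column_conservation("".join(col)) for col in zip(*kmers)]


def z_scores(score_rows: list[list[float]]) -> list[list[float]]:
    """Convert scores to z-scores against the flattened background of all scores."""
    background = [s for row in score_rows for s in row]
    bg_mean = statistics.fmean(background)
    bg_std = statistics.pstdev(background)  # assumption: population standard deviation
    if bg_std == 0:
        raise ValueError("background scores have zero variance")
    return [[(s - bg_mean) / bg_std for s in row] for row in score_rows]


# One pseudo-MSA (query k-mer plus best ortholog k-mers) per query k-mer position.
pseudo_msas = [
    ["PPLP", "PPLP", "PPLP", "PALP"],
    ["KRSD", "ARTE", "KQSD", "GRAD"],
]
scores = [score_pseudo_msa(msa) for msa in pseudo_msas]
z = z_scores(scores)
print(scores[0])
print(z[0])
```

Because the background is built from the same sequence's k-mers, a z-score expresses how conserved a position is relative to the rest of the query IDR rather than on an absolute scale.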
- class pairk.PairkConservation(orthokmer_arr: ndarray, score_arr: ndarray, z_score_arr: ndarray)[source]
a class to store the results of the conservation scoring
The methods can be used to create plots of the conservation scores and sequence logos.
- Variables:
orthokmer_arr (np.ndarray) – the best scoring k-mer from each ortholog for each query k-mer.
score_arr (np.ndarray) – the conservation scores for each k-mer position.
z_score_arr (np.ndarray) – the z-scores for each k-mer position.
query_kmers (list[str]) – the query k-mers.
query_sequence (str) – the query sequence.
k (int) – the length of the query k-mers.
bg_scores (np.ndarray) – the background conservation scores used to calculate the z-scores. This is just a flattened version of the score_arr.
n_bg_scores (int) – the number of background scores used to calculate the z-scores.
n_bg_kmers (int) – the number of k-mers used to calculate the z-scores.
bg_mean (float) – the mean of the background scores.
bg_std (float) – the standard deviation of the background scores.
- find_query_kmer_positions(kmer: str)[source]
convenience function to search for the positions of a k-mer string.
- Parameters:
kmer (str) – the k-mer string to search for.
- Returns:
the positions in the query sequence that match the input kmer.
- Return type:
list[int]
- get_average_score(position: int, score_type: str = 'z_score', position_mask: ndarray | None = None)[source]
get the average conservation score for a query k-mer, averaged across each position in the k-mer
- Parameters:
position (int) – The starting position of the k-mer in the query sequence.
score_type (str, optional) – which score to use, must be either “score” or “z_score”, by default “z_score”
position_mask (np.ndarray | None, optional) – A position mask to exclude specific positions of the query k-mer from the average, by default None. Must have a length of self.k and should be 1 for positions to include and 0 for positions to exclude.
- Returns:
The average conservation score across the query k-mer positions.
- Return type:
floating[Any]
- Raises:
ValueError – If the score_type is not “score” or “z_score”
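The effect of position_mask can be sketched as a masked mean. The function below is hypothetical and works on a plain list of per-position scores for one k-mer; how the library handles an all-zero mask is not documented, so raising an error in that case is an assumption.

```python
from typing import Optional


def average_score(scores: list[float], position_mask: Optional[list[int]] = None) -> float:
    """Average the scores of one k-mer, keeping only positions where the mask is 1."""
    if position_mask is None:
        position_mask = [1] * len(scores)
    if len(position_mask) != len(scores):
        raise ValueError("position_mask must have one entry per k-mer position")
    kept = [s for s, m in zip(scores, position_mask) if m == 1]
    if not kept:
        raise ValueError("position_mask excludes every position")
    return sum(kept) / len(kept)


kmer_scores = [1.0, 2.0, 3.0, 6.0]
print(average_score(kmer_scores))                              # all positions
print(average_score(kmer_scores, position_mask=[1, 0, 0, 1]))  # first and last only
```

A mask is useful when only some positions of a motif are expected to be conserved, for example the defined positions of a pattern with wildcard positions in between.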
- plot_background_distribution(ax: Axes | None = None, bins=20)[source]
plot the background conservation scores as a histogram
- Parameters:
ax (matplotlib.axes.Axes | None, optional) – the axes on which to plot the histogram, by default None. If None, new axes are created.
bins (int, other, optional) – passed to the plt.hist matplotlib function, by default 20
- Returns:
matplotlib axes with the background conservation score histogram
- Return type:
matplotlib.axes.Axes
- plot_conservation_mosaic(position: int, score_type: str = 'z_score', figsize: tuple[int, int] = (11, 4)) tuple[Figure, dict[str, Axes]][source]
makes a mosaic plot (with multiple subplots) of the conservation scores, sequence logos, and background scores for the pairk conservation results.
- Parameters:
position (int) – starting position of the k-mer in the query sequence.
score_type (str, optional) – either ‘score’ or ‘z_score’. The type of score to plot on the bar plot, by default “z_score”
figsize (tuple[int, int], optional) – the size of the figure, by default (11, 4)
- Returns:
the figure and axes dictionary for the mosaic plot.
- Return type:
tuple[matplotlib.figure.Figure, dict[str, matplotlib.axes.Axes]]
- plot_score_barplot(position: int, score_type: str = 'score', ax: Axes | None = None)[source]
plot the conservation scores as a bar plot
- Parameters:
position (int) – starting position of the k-mer in the query sequence.
score_type (str, optional) – which score to use, must be either “score” or “z_score”, by default “score”
ax (matplotlib.axes.Axes | None, optional) – the axes on which to plot the bar plot, by default None. If None, new axes are created.
- Returns:
matplotlib axes with the plot
- Return type:
matplotlib.axes.Axes
- Raises:
ValueError – If the score_type is not “score” or “z_score”
- plot_sequence_logo(position: int, ax: Axes | None = None)[source]
plot the query k-mer “pseudo-MSA” as a sequence logo where the residue height is proportional to the number of homologs with that residue at that position. The query k-mer is included in the “pseudo-MSA”
- Parameters:
position (int) – The starting position of the k-mer in the query sequence.
ax (matplotlib.axes.Axes | None, optional) – the axes on which to plot the sequence logo, by default None. If None, new axes are created.
- Returns:
matplotlib axes with the plot
- Return type:
matplotlib.axes.Axes
- classmethod read_results_from_file(filename: str | Path)[source]
read the pairk conservation results from a file and return a PairkConservation object
- Parameters:
filename (str | Path) – The filename to read the results from.
- Returns:
a PairkConservation object containing the results from the file.
- Return type:
pairk.backend.conservation.kmer_conservation.PairkConservation
- write_results_to_file(filename: str | Path)[source]
write the PairkConservation object results to a file
note - to avoid having to pickle the numpy arrays, the orthokmer_arr is converted to numpy strings before saving
- Parameters:
filename (str | Path) – the filename to save the results to.
- Returns:
the filename that the results were saved to.
- Return type:
str | Path
utility functions
- pairk.utilities.fasta_MSA_to_idr_dict(alignment_file: str | Path, idr_aln_start: int, idr_aln_end: int) dict[str, str][source]
import a multiple sequence alignment (MSA) in fasta format and return the IDR sequences as a dictionary. uses a single slice of the MSA to define the IDRs of each sequence.
- Parameters:
alignment_file (str | Path) – a path to a fasta file containing a multiple sequence alignment
idr_aln_start (int) – the start position of the IDR in the MSA
idr_aln_end (int) – the end position of the IDR in the MSA
- Returns:
a dictionary with sequence IDs as keys and the IDR region as values
- Return type:
dict[str, str]
- pairk.utilities.fasta_MSA_to_idr_map(alignment_file: str | Path, idr_aln_start: int, idr_aln_end: int)[source]
import a multiple sequence alignment (MSA) from a fasta file and return a dictionary of the start and end positions of the IDRs in the unaligned sequences. The IDRs are defined by a single slice of the alignment.
- Parameters:
alignment_file (str | Path) – a path to a fasta file containing a multiple sequence alignment
idr_aln_start (int) – start position of the IDRs in the alignment
idr_aln_end (int) – end position of the IDRs in the alignment
- Returns:
idr_position_map: dictionary of the start and end positions of the IDRs in the unaligned sequences. The keys are the sequence ids and the values are lists of the start and end positions of the IDRs in the unaligned sequences. The start and end positions are 0-indexed.
- Return type:
dict[str, list[int]]
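Converting a single alignment slice into per-sequence IDR positions amounts to counting non-gap characters. The sketch below works on aligned strings directly rather than reading a fasta file; the function name is hypothetical, and it assumes that the alignment positions behave like a Python slice (start inclusive, end exclusive) and that “-” is the gap character.

```python
def idr_from_alignment(aligned_seq: str, idr_aln_start: int, idr_aln_end: int,
                       gap: str = "-") -> tuple[str, list[int]]:
    """Return the ungapped IDR and its [start, end] in unaligned coordinates."""
    # residues before the slice determine where the IDR starts once gaps are removed
    start = len(aligned_seq[:idr_aln_start].replace(gap, ""))
    idr = aligned_seq[idr_aln_start:idr_aln_end].replace(gap, "")
    return idr, [start, start + len(idr)]


msa = {
    "query": "MK--PPLPKR-AA",
    "orth1": "MKSTPP-PRRGAA",
}
idr_dict = {}
idr_position_map = {}
for seq_id, aligned in msa.items():
    idr_dict[seq_id], idr_position_map[seq_id] = idr_from_alignment(aligned, 4, 10)

print(idr_dict)
print(idr_position_map)
```

The two dictionaries correspond in shape to the outputs described for fasta_MSA_to_idr_dict and fasta_MSA_to_idr_map: slicing each unaligned sequence with its mapped positions recovers the same IDR string.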
- pairk.utilities.fasta_MSA_to_unaligned_sequences(alignment_file: str | Path) dict[str, str][source]
import a multiple sequence alignment (MSA) from a fasta file and return the unaligned sequences as a dictionary.
- Parameters:
alignment_file (str | Path) – a path to a fasta file containing a multiple sequence alignment
- Returns:
a dictionary with sequence IDs as keys and the unaligned sequence as values
- Return type:
dict[str, str]
single k-mer functions
Pairk for single kmer analysis. Aligning one kmer is considerably faster than aligning every kmer in a sequence, so these functions are provided for custom analysis. Relative conservation scores (z-scores) are not available for single kmers.
- pairk.single_kmer.pairk_alignment_single_kmer(kmer: str, ortholog_idrs: dict[str, str], matrix_name: str = 'EDSSMat50')[source]
Align a single kmer to a dictionary of sequences in a pairwise manner, i.e. align the kmer to each sequence one-by-one (with no gaps). Returns the score, subsequence, and position of the best subsequence-kmer match for each sequence in the input dictionary. If an ortholog IDR is shorter than the k-mer, a string of “-” characters (“-”*k) is assigned as the best matching ortholog k-mer for that ortholog.
Note: if there are multiple top-scoring matches, only one is returned.
- Parameters:
kmer (str) – the kmer to align
ortholog_idrs (dict[str, str]) – a dictionary of sequences to align the kmer to, with the key being the sequence id and the value being the sequence as a string
matrix_name (str, optional) – The name of the scoring matrix to use in the algorithm, by default “EDSSMat50”. The available matrices can be viewed with the function pairk.print_available_matrices().
- Returns:
3 dictionaries containing the scores, subsequences, and positions of the best subsequence-kmer matches for each sequence in the input ortholog_idrs dictionary. Dictionary keys are the sequence ids.
- Return type:
dict[str, float], dict[str, str], dict[str, int]