Welcome to sequencing_tools’s documentation!¶
This is Douglas’s repo for useful bioinformatics stuff.
Installation¶
Installation can be done via conda:
$ conda config --add channels bioconda
$ conda create -q -n test-environment python=3.6 \
cython numpy networkx seaborn pyBigwig six pysam ujson pytest scipy \
matplotlib samtools future pytest-cov codecov
$ git clone https://github.com/wckdouglas/sequencing_tools.git
$ cd sequencing_tools
$ pip install .
fastq_tools¶
This module contains functions to interact with fastq files
- class sequencing_tools.fastq_tools._fastq_tools.fastqRecord¶
Fastq record
- Parameters
id (str) – fastq record id
seq (str) – fastq record seq
qual (str) – fastq record quality string
- subseq()¶
Trim off sequences outside of start:end
- Parameters
start (int) – start position on the sequence
end (str) – end position on the sequence
- class sequencing_tools.fastq_tools._fastq_tools.GappedKmer(k=6, m=2)¶
Making Gapped kmer feature vector
- Parameters
k – kmer size
m – number of mismatch tolerated
Example:
gkm = GappedKmer(k=3, m = 1) seq = 'ACTGGAATG' gkm.compute(seq)
- sequencing_tools.fastq_tools._fastq_tools.complement()¶
Find complement a sequence.
- Parameters
seq (str) – sequence to be complemented
- Returns
complemented sequence
- Return type
str
- sequencing_tools.fastq_tools._fastq_tools.extract_kmer()¶
- Parameters
sequence (str) – input sequence for extracting kmer
k (int) – kmer size
- Returns
kmer from the sequence
- Return type
generator(str)
- class sequencing_tools.fastq_tools._fastq_tools.fastqRecord¶
Fastq record
- Parameters
id (str) – fastq record id
seq (str) – fastq record seq
qual (str) – fastq record quality string
- subseq()¶
Trim off sequences outside of start:end
- Parameters
start (int) – start position on the sequence
end (str) – end position on the sequence
- sequencing_tools.fastq_tools._fastq_tools.kmer_bag()¶
K-mer bag method for feature extraction
We used k = 1,2,3,4,5, resulting in 41 + 42 + 43 + 44 + 45 = 1364 features. For example, the sequence ACTGG would produce a length-1364 feature vector where the entries corresponding to the k-mers A, C, T, AC, CT, TG, GG, ACT, CTG, TGG, ACTG, CTGG, and ACTGG would each equal 1, and the entry corresponding to G would equal 2. All other entries equal 0. (from Zhang and Kamath. Learning the Language of the Genome using RNNs)
Example:
sequence = 'ACTGG' kmer_bag(sequence, k_start = 1, k_end = 4) #defaultdict(int, # {'A': 1, # 'C': 1, # 'T': 1, # 'G': 2, # 'AC': 1, # 'CT': 1, # 'TG': 1, # 'GG': 1, # 'ACT': 1, # 'CTG': 1, # 'TGG': 1, # 'ACTG': 1, # 'CTGG': 1})
- Parameters
sequence (str) – the sequence for feature extraction
k_start (int) – k for the smallest kmer
k_end (int) – k for the largest kmer
- Returns
key is kmer, values: count
- Return type
dict
- class sequencing_tools.fastq_tools._fastq_tools.onehot_sequence_encoder(bases='ACTGN')¶
A onehot encoder for DNA sequence
usage:
sequence = 'AAACTTTG' dna_encoder = onehot_sequence_encoder() onehot_encoded_matrix = dna_encoder.fit_transform(sequence) onehot_encoded_matrix #array([[1., 0., 0., 0.], # [1., 0., 0., 0.], # [1., 0., 0., 0.], # [0., 1., 0., 0.], # [0., 0., 0., 1.], # [0., 0., 0., 1.], # [0., 0., 0., 1.], # [0., 0., 1., 0.]]) dna_encoder.base_encoder # check which base each column represents # {'A': 0, 'C': 1, 'G': 2, 'T': 3}
- decode(encoded_mat)¶
- Parameters
array (onehot encoded) – len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
- Returns
Sequence, the decoded sequence
- Return type
str
Usage:
mat = np.array([[1., 0., 0., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.]]) dna_encoder.decode(mat) #'AAACTTTG'
- fit(sequence='ACTGN')¶
- Parameters
sequence (str) – the bases to be extract
- fit_transform(sequence)¶
One hot sequence encoder
- Parameters
sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n, with m unique bases
- Returns
onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
- Return type
np.ndarray(n,m)
- transform(sequence)¶
One hot sequence encoder
- Parameters
sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n
- Returns
onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
- Return type
np.ndarray(n,m)
- sequencing_tools.fastq_tools._fastq_tools.read_interleaved()¶
A interleaved fastq iterator
Usage:
with open('test.fq') as fastq: for read1, read2 in parse_interleaved(fastq): print(read1.id, read1.seq, read1.qual, read1.description) print(read2.id, read2.seq, read2.qual, read2.description)
- Parameters
fp – file handle of a fastq file
- Returns
(fastqRecord for R1, fastqRecord for R2)
sequencing_tools.fastq_tools._fastq_tools.fastqRecord
- Return type
Generator(tuple)
- sequencing_tools.fastq_tools._fastq_tools.readfq()¶
A fastq iterator from Heng li
- Parameters
fp – file handle of a fastq file
- Returns
- Return type
Generator(FastqRecord)
Usage:
with open('test.fq') as fastq: for record in readfq(fastq): print(record.id, record.seq, record.qual, record.description)
- sequencing_tools.fastq_tools._fastq_tools.reverse_complement()¶
Reverse complement a sequence.
- Parameters
seq (str) – sequence to be reverse complemented
- Returns
reverse complemented sequence
- Return type
str
- class sequencing_tools.fastq_tools.pe_align.ConsensusBuilder(error_toleration=0.1, min_len=15, report_all=False, conserved=False, highlight=False)¶
Merging the two reads from paired end sequencing, and build consensus sequence for the overlapped part, paired end sequencing of a fragments:
Illustration:
Reads: | adapter| |adapter | R1: |------------------> DNA Fragment: ========|----------------------------|======== R2: <----------------| report all: |----------------------------> not report all: |------>
- Parameters
error_toleration (float) – maximum mismatch fraction at the overlap region, reads are merged if they have mismatch fraction lower than this
min_len (int) – minumum overlapping bases at the overlapping region, if lower than this, reads won’t merge
report_all (boolean) – True if the output merged sequence contains non-overlapping bases
Example:
consensus_builder = ConsensusBuilder(error_toleration = 0.1, min_len = 15, report_all=False) read1 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG', 'ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAAAGATCGGAAGAGCACACGTCTGAA', 'A6AA6//EA/AEE/AEAE///EEE/AAA/6/EE/A//EEE/A//EE/AE//E/////AEE///////////////', 'NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG Read1', ) read2 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG', 'TTTCCCGCACCGATAGCAGCCGCTCTGGTCTCCCATCCCGGGCAATTGTGTGATCGTCGGACTGTAGAACTCTGA', 'AAA//E/E/E/A/A///EEE/E///<//EE/EE6/</EEA/A</EE<AAE/E/</<<AAE/E<///EE/EA///A', 'NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG Read2', ) out = consensus_builder.run(read1, read2) print(out) # @NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG # ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAA # + # IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- run(R1, R2)¶
reverse complement read2 sequence and find matching position on read1
- Parameters
- Returns
fastq line for the merged concensus sequence [+ non-overlapping sequence]
- Return type
str
- class sequencing_tools.fastq_tools.function_clip.ReadTrimmer(barcode_cut_off=20, constant='', constant_no_evaluation=True, umi_bases=6, usable_seq=6, hamming_threshold=6, adapter='GATCGTCGGACTGTAGAACTCTGAACGTGTAGA', min_length=12)¶
Implementation for a paired end UMI read trimmer
Triming UMI bases from 5’ of the UMI read sequence
append the UMI bases to the supplied adapter, and use that to trim the other read from 3’
trimming adapter sequence if overhang bases are sequenced
input reads (suppose A is adapter, X is UMI and N is biological fragment):
Example 1:
Input: Read 1: |---------------------------------> library fragment: AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA Read 2: <----------------------------------| Trimming from Read 2: AAAAAAAAAXXXXXX-> Trimming from Read 1: <--- Output: Read 1: |------------------------> library fragment: AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA Read 2: <------------------------|
Example 2:
Input: Read 1: |---------------------> library fragment: AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA Read 2: <-----------------------| Output: Read 1: |------------------------> library fragment: AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA Read 2: <-----------------------|
Usage:
clipping = ReadTrimmer(barcode_cut_off, constant, constant_no_evaluation, umi_bases, usable_seq, hamming_threshold, adapter, min_length) with xopen(inFastq1, mode = 'r') as in1, xopen(inFastq2, mode = 'r') as in2: adapter = Adapters.TGIRT_R1.value iterable = zip(readfq(in1), readfq(in2)) for count, (umi_read, opposite_read) in enumerate(iterable): ret_code, umi_read, opposite_read = clipping.trim_reads(umi_read, opposite_read) if ret_code == 1: if read == "read1": print(umi_read) print(opposite_read) if read == "read2": print(opposite_read) print(umi_read)
- trim_reads(umi_read, opposite_read)¶
from each UMI read, clipped barcode and constant regions and make it as ID only when:
barcode quality > cutoff and
constant region matched
- Parameters
umi_reads (
sequencing_tools.fastq_tools._fastq_tools.fastqRecord
) – Read containing UMIopposite_Read (
sequencing_tools.fastq_tools._fastq_tools.fastqRecord
) – The opposite read in the read pair
- Returns
return code (0: read pair doesn’t pass filter, 1: read pair is good), clipped UMI read (in 4-line fastq format), clipped the other side of read pair (in 4-line fastq format)
- Return type
tuple(int, str, str)
fasta_tools¶
This module contains functions to interact with fasta files
- class sequencing_tools.fasta_tools.IUPAC¶
Working with IUPAC bases
Usage:
iupac = IUPAC() iupac.check_degenerate('ACTGN') # {'N'} iupac.check_degenerate('ACTGY') # {'Y'} list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG'] list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']
- check_degenerate(seq)¶
- Parameters
seq (str) – sequence with dengerate base
- Returns
all degenerated bases from this sequence
- Return type
set
- expand(seq)¶
output all possible sequence by looping over the dengerative base
- Parameters
seq (str) – sequence with dengerate base
- Returns
all possible sequences from the DNA/RNA pool
- Return type
Generator(str)
- sequencing_tools.fasta_tools.readfa(file_handle)¶
A fasta reader iterator
- Parameters
fp – file handle of a fasta file
- Returns
sequence id, sequence
- Return type
(str, str)
Usage:
with open('test.fa') as fasta: for seq_id, seq in readfq(fasta): print(seq_id, seq)
- class sequencing_tools.fasta_tools.IUPAC¶
Working with IUPAC bases
Usage:
iupac = IUPAC() iupac.check_degenerate('ACTGN') # {'N'} iupac.check_degenerate('ACTGY') # {'Y'} list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG'] list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']
- check_degenerate(seq)¶
- Parameters
seq (str) – sequence with dengerate base
- Returns
all degenerated bases from this sequence
- Return type
set
- expand(seq)¶
output all possible sequence by looping over the dengerative base
- Parameters
seq (str) – sequence with dengerate base
- Returns
all possible sequences from the DNA/RNA pool
- Return type
Generator(str)
- class sequencing_tools.fasta_tools.MultiAlignments(fa_file, RNA=False)¶
- PairMatrix()¶
Calculate the hamming distances between each sequence pair
- colors¶
color dictionary guiding the multiplex alignment plotting
- concensus()¶
compute a consensus sequence from highest frequency base at each position
- Returns
(consensus sequence, fraction of base making up the composition of the position)
- Return type
tuple(str, list of numpy array
- mul_df¶
sequence matrix, each column is a position, and nucleotide as value
- pairwise¶
pairwise matrix computed by
sequencing_tools.fasta_tools.MultiAlignment.PairMatrix()
- plot(ax, min_pos=0, max_pos=None, fontsize=20, labelsize=20, sample_regex='[A-Za-z0-9_-]+')¶
- Parameters
ax (plt.axes) – matplotlib axes
min_pos (float) – start position to plot on the multialignments
max_pos (float) – end position to plot on the mutlialignments
fontsize (int) – fontsize for the nucleotides
labelsize (int) – fontsize for sequence id
sample_regex (str) – regex for including sequecne id
bam_tools¶
This module contains functions to interact with bam files and bed files
- sequencing_tools.bam_tools._bam_tools.check_concordant()¶
Check if read pairs are concordant
- Parameters
read1 – first pysam alignment
read2 – second pysam alignment
- Returns
True if they have same read ID, ref ID, are read 1 and read2, and is opposite strand
- Return type
boolean
- sequencing_tools.bam_tools._bam_tools.check_primary()¶
Check if read pairs are both primary
- Parameters
read1 – first pysam alignment
read2 – second pysam alignment
- Returns
True if both of them are primary alignments
- Return type
boolean
- sequencing_tools.bam_tools._bam_tools.cigar_to_str()¶
Cigarstring to string, only extract cigar op == M, I or S, Deletions (D) are skipepd to return same length cigar_seq as the query_sequence
- Parameters
cigar_string – standard cigar string from BAM file
- Returns
cigar_seq - same length string as sequence, with every character matched the aligned status of the base
- Return type
str
Example:
$ cigar_seq = cigar_to_str('3S5M1I3M') $ print cigar_seq 'SSSMMMMMIMMM'
- sequencing_tools.bam_tools._bam_tools.concordant_alignment()¶
Check if alignment is properly mapped flag == 99, 147, 163, 83
- Parameters
alignment – pysam alignment segment
- Returns
True if 83 or 163 or 99 or 147 for their flags otherwise False
- Return type
boolean
- sequencing_tools.bam_tools._bam_tools.concordant_pairs()¶
Check if pair is properly mapped flag == 99, 147, 163, 83
Example:
usage: concordant_pairs(read1, read2)
- Parameters
read1 – read1 of the pair
read2 – read2 of the pair
- Returns
True if 83 and 163 or 99 and 147 for their flags otherwise False
- Return type
boolean
- sequencing_tools.bam_tools._bam_tools.fragment_ends()¶
get start and end position of a pair of reads
Example:
fragment_ends(read1, read2)
- Parameters
read1 – a pysam alignment
read2 – a pysam alignment
- Returns
(start, end), leftmost and rightmost position of the fragment
- Return type
tuple
- sequencing_tools.bam_tools._bam_tools.get_strand()¶
Get strand of the paired fragment
- Parameters
alignment – an pysam aligned segment
- Returns
Strand - “+” if it is reverse read2 or forward read1 “-” if it is reverse read1 or forward read2
- Return type
str
- sequencing_tools.bam_tools._bam_tools.make_cigar_seq()¶
Generator convert number and operator into a sequence of aligned status of bases, see split_cigar
- Parameters
cigar_numbers – list of numbers in string format
cigar_operator – list of single character
- Returns
cigar_base, sequence of cigar base
- Return type
Generator
Example:
$ for c in make_cigar_seq(['3','5','1','3'],['S','M','I','M']): $ print c SSS MMMMM I MMM
- sequencing_tools.bam_tools._bam_tools.make_regions()¶
generator: segment chromosome in to regions usage: make_regions(chromosome_length, how_many_bases_every_time) return cigar_base
- Parameters
chromosome_length – last base you want to look at
how_many_bases_every_time – segment size
- Returns
(start, end): start, end of segment
- Return type
tuple
Example:
for start, end in make_regions(100,10): print start, end 0 10 10 20 20 30 30 40 40 50 50 60 60 70 70 80 80 90 90 100
- sequencing_tools.bam_tools._bam_tools.paired_bam()¶
iterator for paired end bam file
- Parameters
bam_handle – bam file handle opened by pysam.Samfile
- Returns
(read1, read2)
- Return type
Generator(tuple)
Example:
with pysam.Samfile('path/to/bam/file') as bam: for read1, read2 in paired_bam(bam): print(read1.query_name) print(read2.query_name)
- sequencing_tools.bam_tools._bam_tools.read_ends()¶
Get read end positions, output start and end position of a read
- Parameters
AlignedSegment – a pysam alignment
- Returns
(start, end), leftmost and rightmost position of the read
- Return type
tuple
- sequencing_tools.bam_tools._bam_tools.remove_insert()¶
Iterator remove insertion base from aligned sequence
usage: remove_insert(sequence, quality_string, cigar_seq) return: base, base_quality
- Parameters
sequence – DNA sequence from BAM
quality_string – qual string from BAM
cigar_seq – cigar seq from cigar_to_str
- Returns
(base, base_qual) base that are not annotated as insertion, BAQ associated with the base
- Return type
Generator(tuple)
- sequencing_tools.bam_tools._bam_tools.split_cigar()¶
Splitting cigar string to numpy array
- Parameters
cigar_string – a cigar string e.g. ‘63M1S4D10M’
- Returns
([list of cigar numbers], [list of cigar operators correspongs to the numbers])
- Return type
(list,list)
Example:
$ split_cigar('63M') [63], [M]
consensus_tools¶
- class sequencing_tools.consensus_tools._consensus_tools.ErrorCorrection(mode='prob', threshold=0.8)¶
A module for error correction in fastq records, included two modes:
prob: using base quality as probability of errors as prior to calculate posterior quality, see fgbio explanation
vote: using a voting scheme as SafeSeq to generate consensus sequences
- Parameters
mode – ‘prob’ or ‘vote’
threshold – only consider for “vote” mode, as a cutoff for returning a “N” if not enough fraction of bases agree
Example:
ec = ErrorCorrection(mode='prob') ec.Correct(['AAACA','AAAAA','AAACA','AAACA'], ['IIIII','IIIAI','FFFFF', 'FFFFF'])
- Correct(seq_list, qual_list)¶
Given a list of sequences and a list of quality, return a consensus sequence and a recalibrated quality
- Parameters
seq_list – list of sequences with the same length
qual_list – list of the corresponding quality strings, must be in the same length as seq_list
- Returns
(consensus seq, consensus qual)
- Return type
tuple
gene_tools¶
This module contains functions to interact with bed files, refflat, and gtf files.
- class sequencing_tools.gene_tools.Bed12Record(line)¶
Parser for bed12 line
- genomic_position(tpos)¶
Translating transcript position into genomic position
- Parameters
tpos (int) – position on transcript
- Returns
genomic position
- Return type
int
- get_introns()¶
generator for introns, for each transcript (bed12 line), return all its intron
- Returns
intron_line, 6-columns bed for the introns
- Return type
generator
- class sequencing_tools.gene_tools.GTFRecord(gtf_line)¶
parsing a GTF record line
Usage:
line = 'chr1 ensGene exon 14404 14501 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000488147"; exon_number "1"; exon_id "ENST00000488147.1"; gene_name "ENSG00000227232";' gtf = GTFRecord(line) print(gtf.info['gene_id']) # 'ENSG00000227232'
- chrom¶
chromosome name
- end¶
genomic end
- feature_type¶
feature type of the transcript
- info¶
info field as a dictionary
- start¶
genomic start
- strand¶
strand
- class sequencing_tools.gene_tools.transcriptome.Transcriptome(refflat=None, sqldb=None, coding_only=True)¶
all annotated transcripts read in a refflat file
- Parameters
refflat – refflat file or url
sqldb – Can be found under Annotationhub (if no refflat is supplied)
coding_only – only index coding transcript?
Usage:
url = 'http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz' txome = Transcriptome(refflat=url, coding_only=False) gene = txome.get_gene('WASH7P') tx = gene['NR_024540'] print(tx.exons[1].start, tx.exons[2].end) # 29320 24891 print(tx.genomic_position(300)) #18171 print(tx.blocks(0, 300)) #[(29320, 29370), (24737, 24891), (18270, 18366)]
Or:
# wget http://s3.amazonaws.com/annotationhub/AHEnsDbs/v87/EnsDb.Hsapiens.v87.sqlite txome = Transcriptome(sqldb="EnsDb.Hsapiens.v87.sqlite", coding_only=False)
- coding_only¶
Only index coding genes (mRNA)?
- get_gene(gene_name)¶
Get transcripts from gene
- Parameters
gene_name (str) –
- Returns
dictionary with key as transcript id as key and values are
sequencing_tools.gene_tools.transcriptome.Transcript
- Return type
dict
- transcript_count¶
how many transcript was indexed?
- transcript_dict¶
Dict storing transcripts, with gene name as key and values are
sequencing_tools.gene_tools.transcriptome.Transcript
- class sequencing_tools.gene_tools.transcriptome.Transcript(transcript)¶
- blocks(tstart_pos: int, tend_pos: int)¶
given a start position and end position along the transcript, return the block starts and block sizes on the genome scale
Example:
exon 1 exon 2 Transcript: |===========|-----------------|=================| Amplicon: |------> <--------| blocks: block1 block2 Return: [(a, b) (c, d) ]
- Parameters
tstart_pos (int) – left position on the transcript
tend_pos (int) – right position on the transcript
- Returns
list of tuples containing the (start, end) of each block
- Return type
list
- cde¶
coding end site (genomic)
- cds¶
coding start site (genomic)
- chrom¶
chromosome name
- exon_count¶
number of exons in this transcript
- exon_ends¶
str, comman list of exon end sites
- exon_starts¶
str, comma list of exon start sites
- exons¶
dictionary with key as exon rank, values are
sequencing_tools.gene_tools.transcriptome.Exon
- five_UTR_length¶
how long is the 5’ UTR? 5’ as of genomic coordinate
- genomic_position(tpos)¶
translate transcriptome position to genomic position
- Parameters
tpos – transcript position
- Returns
the corresponding genomic position
- Return type
int
Illustration:
|-------------|-----------*---------|-----------| / exon1 /\ exon2 / \ exon3 / / \ / \ |------------|xxxx|---------*-----|xxxxx|----------| intron intron
- id¶
transcript ID
- strand('+' or '-')¶
strand (‘+’ or ‘-‘)
- transcript_length¶
transcript size
- tx_end¶
transcription end site (genomnic)
- tx_start¶
transcription start site (genomic)
- class sequencing_tools.gene_tools.transcriptome.Exon(start, end, exon_num, strand)¶
Exon object storing the coordinate, exon rank and strand information
- Parameters
start (int) – genomic start coordinate of the exon
end (int) – genomic end coordinate of the exon
exon_num (int) – exon rank
strand (string) – strandeness of the exon (either + or -)
- after_cde¶
1 if downstream of coding end site else 0
- after_cds¶
1 if downstream of coding start site else 0
- coding_bases¶
how many bases are within coding frame
- contain_cde¶
1 if this exon contains coding end site else 0
- contain_cds¶
1 if this exon contains coding start site else 0
- end¶
Exon end position on the chromosome
- exon_num¶
Exon rank
- length¶
Exon size
- start¶
Exon start position on the chromosome
- strand¶
Exon strand
- transcript_end¶
the transcript position of the last exon base
- transcript_start¶
the transcript position of the first exon base
viz_tools¶
This module contains some functions that work with figures
- class sequencing_tools.viz_tools.color_encoder¶
color-encoding a categoric vector
Example:
categorical_vector = ['a','b','c','a'] colors = obakeito_palette() ce = color_encoder() ce.fit(categorical_vector, colors) encoded_colors = ce.transform(new_categorical_vector)
or:
ce = color_encoder() encoded_colors = ce.fit_transform(categorical_vector, colors)
access color encoder:
encoded_color_map = ce.encoder
- encoder¶
color encoder dictionary
- fit(x, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])¶
Usage:
ce = color_encoder() ce.fit(categroical_vector, colors)
- fit_transform(xs, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])¶
ce = color_encoder() encoded_colors = ce.fit_transform(categorical_vector, colors)
- show_legend(ax=None, sort=False, **kwargs)¶
Adding matplotlib legend
- transform(xs)¶
Usage:
ce = color_encoder() ce.fit(categroical_vector, colors) encoded_colors = ce.transform(new_categorical_vector)
- sequencing_tools.viz_tools.cor_plot(plot_df, fig, diagonal_line=True, method='pearson', **kwargs)¶
given a data frame with columns storing data for each sample, output a matplotlib figure object as correlation plots.
usage: cor_plot(plot_df, fig=fig, diagonal_line = True, method = ‘pearson’)
- Parameters
plot_df – a pandas dataframe
fig – matplotlib figure object (optional)
diagonal_line – Boolean controlling if a diagonal line should be drawn
method – correlation method, [spearman or pearson]
- Returns
matplotlib figure object
- Return type
matplotlib.pyplot.figure
- sequencing_tools.viz_tools.douglas_palette()¶
Automatic set color if seaborn is installed, otherwise return list of colors Example:
colors = douglas_palette() ax=plt.subplot(); for i in range(14): ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i]) ax.legend()
- sequencing_tools.viz_tools.maximum_palette()¶
modified from: https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
Example:
colors = maximum_palette() ax=plt.subplot(); for i in range(22): ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i]) ax.legend()
- sequencing_tools.viz_tools.mixed_sort(list_of_elements)¶
https://arcpy.wordpress.com/2012/05/11/sorting-alphanumeric-strings-in-python/
- Parameters
of (a list) –
- Returns
ordered list
- Return type
list
Example:
a = ['A1','A10','A2','A20'] mixed_sort(a) > ['A1','A2', 'A10', 'A20']
- sequencing_tools.viz_tools.okabeito_palette()¶
Color palette proposed by Okabe and Ito copy from colorboindr R package https://github.com/clauswilke/colorblindr/blob/master/R/palettes.R
Example:
colors = okabeito_palette() ax=plt.subplot(); for i in range(8): ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i]) ax.legend()
- sequencing_tools.viz_tools.plot_upset(fig, upset_df, ylab='Intersected count', matrix_to_plot_ratio=0.4, fontsize=15)¶
- Parameters
fig – matplotlib figure object
upset_df – pandas dataframe, contain column: sample matrix (binary), “count”, ‘index’
ylab – y label
matrix_to_plot_ratio – ratio between interaction matrix and batplot
fontsize – fontsize
input example:
index HeLa K562 UHRR Plasma count 3 0 1 1 1 1 4 1 0 0 0 2 5 1 1 0 0 3 1 0 1 0 0 11 6 1 1 0 1 12 7 1 1 1 1 12 2 0 1 0 1 13 0 0 0 0 1 30
- sequencing_tools.viz_tools.simpsons_palette()¶
A palette from ggsci R package https://github.com/road2stat/ggsci/blob/master/data-raw/data-generator.R
Example:
colors = simpsons_palette() ax=plt.subplot(); for i in range(16): ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i]) ax.legend()
stats_tools¶
This module contains some math functions
- class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)¶
- Parameters
seed – seed for random number generater
- generate(xs, group_size=100, n_boots=100)¶
boostrap 1d array
Usage:
xs = np.arange(100) bs = Bootstrap(seed=123) for idx in bs.generate(xs, group_size=50, n_boots=10): print(xs[idx].mean())
- Parameters
xs – 1d np.array
group_size – number of values in each bootstrap iteration
n_boots – how many bootstrap groups
- Returns
bootstrapped
- Return type
iterator
- sequencing_tools.stats_tools._stats_tools.binom_test()¶
Vectorized binomial test
Usage:
success_test = [10,10,10] total_test = [100,100,100] binom_test(success_test, total_test, expected_p = 0.5) #array([1.53164509e-17, 1.53164509e-17, 1.53164509e-17])
- Parameters
success_test – number of success
total_test – number of total trials
expected_p – expected probability of success
- Returns
list of p-values
- Return type
list
- sequencing_tools.stats_tools._stats_tools.cy_mean()¶
Fast numerical mean calculator, works with number generator too
Example:
from sequencing_tools.stats_tools import cy_mean import numpy as np a = range(10) b = np.array(a) %timeit b.mean() # 6.37 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) %timeit cy_mean(a) # 385 ns ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Usage:
cy_mean(list_of_numbers)
- Parameters
xs – list of numbers
- Returns
Mean of the list
- Return type
float
- sequencing_tools.stats_tools._stats_tools.hamming_distance()¶
Calculating hamming distance from two strings, the two strings has to be in the same length
Usage:
hamming_distance(string1, string2)
- Parameters
string1 – first string in the comparison
string2 – second string in the comparison
- Returns
the hamming edit distance between two strin
- Return type
int
- sequencing_tools.stats_tools._stats_tools.levenshtein_distance()¶
Calculating Levenshtein distance from two strings algorithm from: http://rosettacode.org/wiki/Levenshtein_distance#Python
usage:
levenshtein_distance(string1, string2)
- Parameters
string1 – first string in the comparison
string2 – second string in the comparison
- Returns
the edit distance between two string
- Return type
int
- sequencing_tools.stats_tools._stats_tools.normalize_count()¶
-
- Parameters
count_mat – pandas dataframe with shape(m,n): m is gene count, n is sample number
return_sf – boolean, if it needs returning size factors
- Returns
normalized count matrix
- Return type
np.ndarray(m,n)
- sequencing_tools.stats_tools._stats_tools.p_adjust()¶
Benjamini-Hochberg p-value correction for multiple hypothesis testing. adapted from https://stackoverflow.com/questions/7450957/how-to-implement-rs-p-adjust-in-python/21739593
Usage:
padj = p_adjust(pvalue_array)
- Parameters
pvalue_array – list of pvalue
- Returns
list of adjusted pvalue
- Return type
list
An equivalent in R:
nm <- names(p) vp <- as.numeric(p) p0 <- setNames(p, nm) lp <- length(p) i <- lp:1L o <- order(p, decreasing = TRUE) ro <- order(o) pmin(1, cummin(n/i * p[o]))[ro]
- class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)¶
- Parameters
seed – seed for random number generater
- generate(xs, group_size=100, n_boots=100)¶
boostrap 1d array
Usage:
xs = np.arange(100) bs = Bootstrap(seed=123) for idx in bs.generate(xs, group_size=50, n_boots=10): print(xs[idx].mean())
- Parameters
xs – 1d np.array
group_size – number of values in each bootstrap iteration
n_boots – how many bootstrap groups
- Returns
bootstrapped
- Return type
iterator
- class sequencing_tools.stats_tools.regression.GradientDescent(lr=0.01, max_iter=10000, limit=0.0001, verbose=False, seed=123, n_iter_no_change=5, method='mean')¶
- Adam_update()¶
Adam optimizer https://github.com/sagarvegad/Adam-optimizer/blob/master/Adam.py
- Cost()¶
compute error with new regression coefficien
- fit(X, y)¶
fitting B for
y = XB
utils¶
- exception sequencing_tools.utils.SeqUtilsError¶
Exception type for the package sequencing tools
Example:
a = -1 if a < 0: raise SeqUtilsError("a should be > 0")