Welcome to sequencing_tools’s documentation!¶

This is Douglas’s repo for useful bioinformatics stuff.

Installation¶

Installation can be done via conda:

$ conda config --add channels bioconda
$ conda create -q -n test-environment python=3.6 \
    cython numpy networkx seaborn pyBigwig six pysam ujson pytest scipy   \
    matplotlib samtools future pytest-cov codecov
$ git clone https://github.com/wckdouglas/sequencing_tools.git
$ cd sequencing_tools
$ pip install .

fastq_tools¶

This module contains functions to interact with fastq files

class sequencing_tools.fastq_tools._fastq_tools.fastqRecord¶

Fastq record

Parameters

id (str) – fastq record id
seq (str) – fastq record seq
qual (str) – fastq record quality string

subseq()¶

Trim off sequences outside of start:end

Parameters

start (int) – start position on the sequence
end (str) – end position on the sequence

class sequencing_tools.fastq_tools._fastq_tools.GappedKmer(k=6, m=2)¶

Making Gapped kmer feature vector

Parameters

k – kmer size
m – number of mismatch tolerated

Example:

gkm = GappedKmer(k=3, m = 1)
seq = 'ACTGGAATG'
gkm.compute(seq)

sequencing_tools.fastq_tools._fastq_tools.complement()¶

Find complement a sequence.

Parameters: seq (str) – sequence to be complemented
Returns: complemented sequence
Return type: str

sequencing_tools.fastq_tools._fastq_tools.extract_kmer()¶

Parameters

sequence (str) – input sequence for extracting kmer
k (int) – kmer size

Returns

kmer from the sequence

Return type

generator(str)

class sequencing_tools.fastq_tools._fastq_tools.fastqRecord¶

Fastq record

Parameters

id (str) – fastq record id
seq (str) – fastq record seq
qual (str) – fastq record quality string

subseq()¶

Trim off sequences outside of start:end

Parameters

start (int) – start position on the sequence
end (str) – end position on the sequence

sequencing_tools.fastq_tools._fastq_tools.kmer_bag()¶

K-mer bag method for feature extraction

We used k = 1,2,3,4,5, resulting in 41 + 42 + 43 + 44 + 45 = 1364 features. For example, the sequence ACTGG would produce a length-1364 feature vector where the entries corresponding to the k-mers A, C, T, AC, CT, TG, GG, ACT, CTG, TGG, ACTG, CTGG, and ACTGG would each equal 1, and the entry corresponding to G would equal 2. All other entries equal 0. (from Zhang and Kamath. Learning the Language of the Genome using RNNs)

Example:

sequence = 'ACTGG'
kmer_bag(sequence, k_start = 1, k_end = 4)
#defaultdict(int,
#    {'A': 1,
#     'C': 1,
#     'T': 1,
#     'G': 2,
#     'AC': 1,
#     'CT': 1,
#     'TG': 1,
#     'GG': 1,
#     'ACT': 1,
#     'CTG': 1,
#     'TGG': 1,
#     'ACTG': 1,
#     'CTGG': 1})

Parameters

sequence (str) – the sequence for feature extraction
k_start (int) – k for the smallest kmer
k_end (int) – k for the largest kmer

Returns

key is kmer, values: count

Return type

dict

class sequencing_tools.fastq_tools._fastq_tools.onehot_sequence_encoder(bases='ACTGN')¶

A onehot encoder for DNA sequence

usage:

sequence = 'AAACTTTG'
dna_encoder = onehot_sequence_encoder()
onehot_encoded_matrix = dna_encoder.fit_transform(sequence)
onehot_encoded_matrix
#array([[1., 0., 0., 0.],
#    [1., 0., 0., 0.],
#    [1., 0., 0., 0.],
#    [0., 1., 0., 0.],
#    [0., 0., 0., 1.],
#    [0., 0., 0., 1.],
#    [0., 0., 0., 1.],
#    [0., 0., 1., 0.]])
dna_encoder.base_encoder  # check which base each column represents
# {'A': 0, 'C': 1, 'G': 2, 'T': 3}

decode(encoded_mat)¶

Parameters: array (onehot encoded) – len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
Returns: Sequence, the decoded sequence
Return type: str

Usage:

mat = np.array([[1., 0., 0., 0.],
                [1., 0., 0., 0.],
                [1., 0., 0., 0.],
                [0., 1., 0., 0.],
                [0., 0., 0., 1.],
                [0., 0., 0., 1.],
                [0., 0., 0., 1.],
                [0., 0., 1., 0.]])
dna_encoder.decode(mat)   #'AAACTTTG'

fit(sequence='ACTGN')¶

Parameters: sequence (str) – the bases to be extract

fit_transform(sequence)¶

One hot sequence encoder

Parameters: sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n, with m unique bases
Returns: onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
Return type: np.ndarray(n,m)

transform(sequence)¶

One hot sequence encoder

Parameters: sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n
Returns: onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence
Return type: np.ndarray(n,m)

sequencing_tools.fastq_tools._fastq_tools.read_interleaved()¶

A interleaved fastq iterator

Usage:

with open('test.fq') as fastq:
    for read1, read2 in parse_interleaved(fastq):
        print(read1.id, read1.seq, read1.qual, read1.description)
        print(read2.id, read2.seq, read2.qual, read2.description)

Parameters: fp – file handle of a fastq file
Returns: (fastqRecord for R1, fastqRecord for R2) sequencing_tools.fastq_tools._fastq_tools.fastqRecord
Return type: Generator(tuple)

sequencing_tools.fastq_tools._fastq_tools.readfq()¶

A fastq iterator from Heng li

Parameters: fp – file handle of a fastq file
Returns: sequencing_tools.fastq_tools._fastq_tools.fastqRecord
Return type: Generator(FastqRecord)

Usage:

with open('test.fq') as fastq:
    for record in readfq(fastq):
        print(record.id, record.seq, record.qual, record.description)

sequencing_tools.fastq_tools._fastq_tools.reverse_complement()¶

Reverse complement a sequence.

Parameters: seq (str) – sequence to be reverse complemented
Returns: reverse complemented sequence
Return type: str

class sequencing_tools.fastq_tools.pe_align.ConsensusBuilder(error_toleration=0.1, min_len=15, report_all=False, conserved=False, highlight=False)¶

Merging the two reads from paired end sequencing, and build consensus sequence for the overlapped part, paired end sequencing of a fragments:

Illustration:

Reads:
               | adapter|                            |adapter |
            R1:         |------------------>
  DNA Fragment: ========|----------------------------|========
            R2:                     <----------------|
    report all:         |---------------------------->
not report all:                     |------>

Parameters

error_toleration (float) – maximum mismatch fraction at the overlap region, reads are merged if they have mismatch fraction lower than this
min_len (int) – minumum overlapping bases at the overlapping region, if lower than this, reads won’t merge
report_all (boolean) – True if the output merged sequence contains non-overlapping bases

Example:

consensus_builder = ConsensusBuilder(error_toleration = 0.1,
                            min_len = 15, report_all=False)
read1 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG',
    'ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAAAGATCGGAAGAGCACACGTCTGAA',
    'A6AA6//EA/AEE/AEAE///EEE/AAA/6/EE/A//EEE/A//EE/AE//E/////AEE///////////////',
    'NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG    Read1',
    )

read2 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG',
    'TTTCCCGCACCGATAGCAGCCGCTCTGGTCTCCCATCCCGGGCAATTGTGTGATCGTCGGACTGTAGAACTCTGA',
    'AAA//E/E/E/A/A///EEE/E///<//EE/EE6/</EEA/A</EE<AAE/E/</<<AAE/E<///EE/EA///A',
    'NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG    Read2',
    )

out = consensus_builder.run(read1, read2)
print(out)
# @NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG
# ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAA
# +
# IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

run(R1, R2)¶

reverse complement read2 sequence and find matching position on read1

Parameters

R1 – sequencing_tools.fastq_tools._fastq_tools.fastqRecord
R2 – sequencing_tools.fastq_tools._fastq_tools.fastqRecord

Returns

fastq line for the merged concensus sequence [+ non-overlapping sequence]

Return type

str

class sequencing_tools.fastq_tools.function_clip.ReadTrimmer(barcode_cut_off=20, constant='', constant_no_evaluation=True, umi_bases=6, usable_seq=6, hamming_threshold=6, adapter='GATCGTCGGACTGTAGAACTCTGAACGTGTAGA', min_length=12)¶

Implementation for a paired end UMI read trimmer

Triming UMI bases from 5’ of the UMI read sequence
append the UMI bases to the supplied adapter, and use that to trim the other read from 3’
trimming adapter sequence if overhang bases are sequenced

input reads (suppose A is adapter, X is UMI and N is biological fragment):

Example 1:

Input:
    Read 1:                         |--------------------------------->
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                     <----------------------------------|
    Trimming from Read 2:  AAAAAAAAAXXXXXX->
    Trimming from Read 1:                                          <---


Output:
    Read 1:                               |------------------------>
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                               <------------------------|

Example 2:

Input:
    Read 1:                         |--------------------->
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                                <-----------------------|


Output:
    Read 1:                               |------------------------>
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                                <-----------------------|

Usage:

clipping = ReadTrimmer(barcode_cut_off, constant,
            constant_no_evaluation, umi_bases,
            usable_seq, hamming_threshold, adapter,
            min_length)
with xopen(inFastq1, mode = 'r') as in1, xopen(inFastq2, mode = 'r') as in2:
    adapter = Adapters.TGIRT_R1.value
    iterable = zip(readfq(in1), readfq(in2))
    for count, (umi_read, opposite_read) in enumerate(iterable):
        ret_code, umi_read, opposite_read = clipping.trim_reads(umi_read, opposite_read)
        if ret_code == 1:
            if read == "read1":
                print(umi_read)
                print(opposite_read)
            if read == "read2":
                print(opposite_read)
                print(umi_read)

trim_reads(umi_read, opposite_read)¶

from each UMI read, clipped barcode and constant regions and make it as ID only when:

barcode quality > cutoff and
constant region matched

Parameters

umi_reads (sequencing_tools.fastq_tools._fastq_tools.fastqRecord) – Read containing UMI
opposite_Read (sequencing_tools.fastq_tools._fastq_tools.fastqRecord) – The opposite read in the read pair

Returns

return code (0: read pair doesn’t pass filter, 1: read pair is good), clipped UMI read (in 4-line fastq format), clipped the other side of read pair (in 4-line fastq format)

Return type

tuple(int, str, str)

fasta_tools¶

This module contains functions to interact with fasta files

class sequencing_tools.fasta_tools.IUPAC¶

Working with IUPAC bases

Usage:

iupac = IUPAC()
iupac.check_degenerate('ACTGN') # {'N'}
iupac.check_degenerate('ACTGY') # {'Y'}
list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG']
list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']

check_degenerate(seq)¶

Parameters: seq (str) – sequence with dengerate base
Returns: all degenerated bases from this sequence
Return type: set

expand(seq)¶

output all possible sequence by looping over the dengerative base

Parameters: seq (str) – sequence with dengerate base
Returns: all possible sequences from the DNA/RNA pool
Return type: Generator(str)

sequencing_tools.fasta_tools.readfa(file_handle)¶

A fasta reader iterator

Parameters: fp – file handle of a fasta file
Returns: sequence id, sequence
Return type: (str, str)

Usage:

with open('test.fa') as fasta:
    for seq_id, seq in readfq(fasta):
        print(seq_id, seq)

class sequencing_tools.fasta_tools.IUPAC¶

Working with IUPAC bases

Usage:

iupac = IUPAC()
iupac.check_degenerate('ACTGN') # {'N'}
iupac.check_degenerate('ACTGY') # {'Y'}
list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG']
list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']

check_degenerate(seq)¶

Parameters: seq (str) – sequence with dengerate base
Returns: all degenerated bases from this sequence
Return type: set

expand(seq)¶

output all possible sequence by looping over the dengerative base

Parameters: seq (str) – sequence with dengerate base
Returns: all possible sequences from the DNA/RNA pool
Return type: Generator(str)

class sequencing_tools.fasta_tools.MultiAlignments(fa_file, RNA=False)¶

PairMatrix()¶: Calculate the hamming distances between each sequence pair

colors¶: color dictionary guiding the multiplex alignment plotting

concensus()¶

compute a consensus sequence from highest frequency base at each position

Returns: (consensus sequence, fraction of base making up the composition of the position)
Return type: tuple(str, list of numpy array

mul_df¶: sequence matrix, each column is a position, and nucleotide as value

pairwise¶: pairwise matrix computed by sequencing_tools.fasta_tools.MultiAlignment.PairMatrix()

plot(ax, min_pos=0, max_pos=None, fontsize=20, labelsize=20, sample_regex='[A-Za-z0-9_-]+')¶

Parameters

ax (plt.axes) – matplotlib axes
min_pos (float) – start position to plot on the multialignments
max_pos (float) – end position to plot on the mutlialignments
fontsize (int) – fontsize for the nucleotides
labelsize (int) – fontsize for sequence id
sample_regex (str) – regex for including sequecne id

bam_tools¶

This module contains functions to interact with bam files and bed files

sequencing_tools.bam_tools._bam_tools.check_concordant()¶

Check if read pairs are concordant

Parameters

read1 – first pysam alignment
read2 – second pysam alignment

Returns

True if they have same read ID, ref ID, are read 1 and read2, and is opposite strand

Return type

boolean

sequencing_tools.bam_tools._bam_tools.check_primary()¶

Check if read pairs are both primary

Parameters

read1 – first pysam alignment
read2 – second pysam alignment

Returns

True if both of them are primary alignments

Return type

boolean

sequencing_tools.bam_tools._bam_tools.cigar_to_str()¶

Cigarstring to string, only extract cigar op == M, I or S, Deletions (D) are skipepd to return same length cigar_seq as the query_sequence

Parameters: cigar_string – standard cigar string from BAM file
Returns: cigar_seq - same length string as sequence, with every character matched the aligned status of the base
Return type: str

Example:

$ cigar_seq = cigar_to_str('3S5M1I3M')
$ print cigar_seq
'SSSMMMMMIMMM'

sequencing_tools.bam_tools._bam_tools.concordant_alignment()¶

Check if alignment is properly mapped flag == 99, 147, 163, 83

Parameters: alignment – pysam alignment segment
Returns: True if 83 or 163 or 99 or 147 for their flags otherwise False
Return type: boolean

sequencing_tools.bam_tools._bam_tools.concordant_pairs()¶

Check if pair is properly mapped flag == 99, 147, 163, 83

Example:

usage: concordant_pairs(read1, read2)

Parameters

read1 – read1 of the pair
read2 – read2 of the pair

Returns

True if 83 and 163 or 99 and 147 for their flags otherwise False

Return type

boolean

sequencing_tools.bam_tools._bam_tools.fragment_ends()¶

get start and end position of a pair of reads

Example:

fragment_ends(read1, read2)

Parameters

read1 – a pysam alignment
read2 – a pysam alignment

Returns

(start, end), leftmost and rightmost position of the fragment

Return type

tuple

sequencing_tools.bam_tools._bam_tools.get_strand()¶

Get strand of the paired fragment

Parameters: alignment – an pysam aligned segment
Returns: Strand - “+” if it is reverse read2 or forward read1 “-” if it is reverse read1 or forward read2
Return type: str

sequencing_tools.bam_tools._bam_tools.make_cigar_seq()¶

Generator convert number and operator into a sequence of aligned status of bases, see split_cigar

Parameters

cigar_numbers – list of numbers in string format
cigar_operator – list of single character

Returns

cigar_base, sequence of cigar base

Return type

Generator

Example:

$ for c in make_cigar_seq(['3','5','1','3'],['S','M','I','M']):
$     print c
SSS
MMMMM
I
MMM

sequencing_tools.bam_tools._bam_tools.make_regions()¶

generator: segment chromosome in to regions usage: make_regions(chromosome_length, how_many_bases_every_time) return cigar_base

Parameters

chromosome_length – last base you want to look at
how_many_bases_every_time – segment size

Returns

(start, end): start, end of segment

Return type

tuple

Example:

for start, end in make_regions(100,10):
    print start, end

10
20
30
40
50
60
70
80
90
100

sequencing_tools.bam_tools._bam_tools.paired_bam()¶

iterator for paired end bam file

Parameters: bam_handle – bam file handle opened by pysam.Samfile
Returns: (read1, read2)
Return type: Generator(tuple)

Example:

with pysam.Samfile('path/to/bam/file') as bam:
    for read1, read2 in paired_bam(bam):
        print(read1.query_name)
        print(read2.query_name)

sequencing_tools.bam_tools._bam_tools.read_ends()¶

Get read end positions, output start and end position of a read

Parameters: AlignedSegment – a pysam alignment
Returns: (start, end), leftmost and rightmost position of the read
Return type: tuple

sequencing_tools.bam_tools._bam_tools.remove_insert()¶

Iterator remove insertion base from aligned sequence

usage: remove_insert(sequence, quality_string, cigar_seq) return: base, base_quality

Parameters

sequence – DNA sequence from BAM
quality_string – qual string from BAM
cigar_seq – cigar seq from cigar_to_str

Returns

(base, base_qual) base that are not annotated as insertion, BAQ associated with the base

Return type

Generator(tuple)

sequencing_tools.bam_tools._bam_tools.split_cigar()¶

Splitting cigar string to numpy array

Parameters: cigar_string – a cigar string e.g. ‘63M1S4D10M’
Returns: ([list of cigar numbers], [list of cigar operators correspongs to the numbers])
Return type: (list,list)

Example:

$ split_cigar('63M')
[63], [M]

consensus_tools¶

class sequencing_tools.consensus_tools._consensus_tools.ErrorCorrection(mode='prob', threshold=0.8)¶

A module for error correction in fastq records, included two modes:

prob: using base quality as probability of errors as prior to calculate posterior quality, see fgbio explanation
vote: using a voting scheme as SafeSeq to generate consensus sequences

Parameters

mode – ‘prob’ or ‘vote’
threshold – only consider for “vote” mode, as a cutoff for returning a “N” if not enough fraction of bases agree

Example:

ec = ErrorCorrection(mode='prob')
ec.Correct(['AAACA','AAAAA','AAACA','AAACA'],
            ['IIIII','IIIAI','FFFFF', 'FFFFF'])

Correct(seq_list, qual_list)¶

Given a list of sequences and a list of quality, return a consensus sequence and a recalibrated quality

Parameters

seq_list – list of sequences with the same length
qual_list – list of the corresponding quality strings, must be in the same length as seq_list

Returns

(consensus seq, consensus qual)

Return type

tuple

gene_tools¶

This module contains functions to interact with bed files, refflat, and gtf files.

class sequencing_tools.gene_tools.Bed12Record(line)¶

Parser for bed12 line

genomic_position(tpos)¶

Translating transcript position into genomic position

Parameters: tpos (int) – position on transcript
Returns: genomic position
Return type: int

get_introns()¶

generator for introns, for each transcript (bed12 line), return all its intron

Returns: intron_line, 6-columns bed for the introns
Return type: generator

class sequencing_tools.gene_tools.GTFRecord(gtf_line)¶

parsing a GTF record line

Usage:

line = 'chr1    ensGene exon    14404   14501   .       -       .       gene_id "ENSG00000227232"; transcript_id "ENST00000488147"; exon_number "1"; exon_id "ENST00000488147.1"; gene_name "ENSG00000227232";'
gtf = GTFRecord(line)
print(gtf.info['gene_id']) # 'ENSG00000227232'

chrom¶: chromosome name

end¶: genomic end

feature_type¶: feature type of the transcript

info¶: info field as a dictionary

start¶: genomic start

strand¶: strand

class sequencing_tools.gene_tools.transcriptome.Transcriptome(refflat=None, sqldb=None, coding_only=True)¶

all annotated transcripts read in a refflat file

Parameters

refflat – refflat file or url
sqldb – Can be found under Annotationhub (if no refflat is supplied)
coding_only – only index coding transcript?

Usage:

url = 'http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz'
txome = Transcriptome(refflat=url, coding_only=False)
gene = txome.get_gene('WASH7P')
tx = gene['NR_024540']
print(tx.exons[1].start, tx.exons[2].end) # 29320 24891
print(tx.genomic_position(300)) #18171
print(tx.blocks(0, 300))
#[(29320, 29370), (24737, 24891), (18270, 18366)]

Or:

# wget http://s3.amazonaws.com/annotationhub/AHEnsDbs/v87/EnsDb.Hsapiens.v87.sqlite
txome = Transcriptome(sqldb="EnsDb.Hsapiens.v87.sqlite", coding_only=False)

coding_only¶: Only index coding genes (mRNA)?

get_gene(gene_name)¶

Get transcripts from gene

Parameters: gene_name (str) –
Returns: dictionary with key as transcript id as key and values are sequencing_tools.gene_tools.transcriptome.Transcript
Return type: dict

transcript_count¶: how many transcript was indexed?

transcript_dict¶: Dict storing transcripts, with gene name as key and values are sequencing_tools.gene_tools.transcriptome.Transcript

class sequencing_tools.gene_tools.transcriptome.Transcript(transcript)¶

blocks(tstart_pos: int, tend_pos: int)¶

given a start position and end position along the transcript, return the block starts and block sizes on the genome scale

Example:

                        exon 1                           exon 2
Transcript: |===========|-----------------|=================|
Amplicon:        |------>                 <--------|
blocks:             block1                   block2
Return:        [(a,     b)                (c,      d)   ]

Parameters

tstart_pos (int) – left position on the transcript
tend_pos (int) – right position on the transcript

Returns

list of tuples containing the (start, end) of each block

Return type

list

cde¶: coding end site (genomic)

cds¶: coding start site (genomic)

chrom¶: chromosome name

exon_count¶: number of exons in this transcript

exon_ends¶: str, comman list of exon end sites

exon_starts¶: str, comma list of exon start sites

exons¶: dictionary with key as exon rank, values are sequencing_tools.gene_tools.transcriptome.Exon

five_UTR_length¶: how long is the 5’ UTR? 5’ as of genomic coordinate

genomic_position(tpos)¶

translate transcriptome position to genomic position

Parameters: tpos – transcript position
Returns: the corresponding genomic position
Return type: int

Illustration:

|-------------|-----------*---------|-----------|
/    exon1   /\        exon2      / \    exon3              /            /  \                 /   \                       |------------|xxxx|---------*-----|xxxxx|----------|
            intron              intron

id¶: transcript ID

strand('+' or '-')¶: strand (‘+’ or ‘-‘)

transcript_length¶: transcript size

tx_end¶: transcription end site (genomnic)

tx_start¶: transcription start site (genomic)

class sequencing_tools.gene_tools.transcriptome.Exon(start, end, exon_num, strand)¶

Exon object storing the coordinate, exon rank and strand information

Parameters

start (int) – genomic start coordinate of the exon
end (int) – genomic end coordinate of the exon
exon_num (int) – exon rank
strand (string) – strandeness of the exon (either + or -)

after_cde¶: 1 if downstream of coding end site else 0

after_cds¶: 1 if downstream of coding start site else 0

coding_bases¶: how many bases are within coding frame

contain_cde¶: 1 if this exon contains coding end site else 0

contain_cds¶: 1 if this exon contains coding start site else 0

end¶: Exon end position on the chromosome

exon_num¶: Exon rank

length¶: Exon size

start¶: Exon start position on the chromosome

strand¶: Exon strand

transcript_end¶: the transcript position of the last exon base

transcript_start¶: the transcript position of the first exon base

viz_tools¶

This module contains some functions that work with figures

class sequencing_tools.viz_tools.color_encoder¶

color-encoding a categoric vector

Example:

categorical_vector = ['a','b','c','a']
colors = obakeito_palette()
ce = color_encoder()
ce.fit(categorical_vector, colors)
encoded_colors = ce.transform(new_categorical_vector)

or:

ce = color_encoder()
encoded_colors = ce.fit_transform(categorical_vector, colors)

access color encoder:

encoded_color_map = ce.encoder

encoder¶: color encoder dictionary

fit(x, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])¶

Usage:

ce = color_encoder()
ce.fit(categroical_vector, colors)

fit_transform(xs, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])¶: ce = color_encoder() encoded_colors = ce.fit_transform(categorical_vector, colors)

show_legend(ax=None, sort=False, **kwargs)¶: Adding matplotlib legend

transform(xs)¶

Usage:

ce = color_encoder()
ce.fit(categroical_vector, colors)
encoded_colors = ce.transform(new_categorical_vector)

sequencing_tools.viz_tools.cor_plot(plot_df, fig, diagonal_line=True, method='pearson', **kwargs)¶

given a data frame with columns storing data for each sample, output a matplotlib figure object as correlation plots.

usage: cor_plot(plot_df, fig=fig, diagonal_line = True, method = ‘pearson’)

Parameters

plot_df – a pandas dataframe
fig – matplotlib figure object (optional)
diagonal_line – Boolean controlling if a diagonal line should be drawn
method – correlation method, [spearman or pearson]

Returns

matplotlib figure object

Return type

matplotlib.pyplot.figure

sequencing_tools.viz_tools.douglas_palette()¶

Automatic set color if seaborn is installed, otherwise return list of colors Example:

colors = douglas_palette()

ax=plt.subplot();
for i in range(14):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()

sequencing_tools.viz_tools.maximum_palette()¶

modified from: https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/

Example:

colors = maximum_palette()

ax=plt.subplot();
for i in range(22):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()

sequencing_tools.viz_tools.mixed_sort(list_of_elements)¶

https://arcpy.wordpress.com/2012/05/11/sorting-alphanumeric-strings-in-python/

Parameters: of (a list) –
Returns: ordered list
Return type: list

Example:

a = ['A1','A10','A2','A20']
mixed_sort(a)
> ['A1','A2', 'A10', 'A20']

sequencing_tools.viz_tools.okabeito_palette()¶

Color palette proposed by Okabe and Ito copy from colorboindr R package https://github.com/clauswilke/colorblindr/blob/master/R/palettes.R

Example:

colors = okabeito_palette()

ax=plt.subplot();
for i in range(8):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()

sequencing_tools.viz_tools.plot_upset(fig, upset_df, ylab='Intersected count', matrix_to_plot_ratio=0.4, fontsize=15)¶

Parameters

fig – matplotlib figure object
upset_df – pandas dataframe, contain column: sample matrix (binary), “count”, ‘index’
ylab – y label
matrix_to_plot_ratio – ratio between interaction matrix and batplot
fontsize – fontsize

input example:

index   HeLa    K562    UHRR    Plasma  count
     0       1       1       1       1
     1       0       0       0       2
     1       1       0       0       3
     0       1       0       0       11
     1       1       0       1       12
     1       1       1       1       12
     0       1       0       1       13
     0       0       0       1       30

sequencing_tools.viz_tools.simpsons_palette()¶

A palette from ggsci R package https://github.com/road2stat/ggsci/blob/master/data-raw/data-generator.R

Example:

colors = simpsons_palette()

ax=plt.subplot();
for i in range(16):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()

stats_tools¶

This module contains some math functions

class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)¶

Parameters: seed – seed for random number generater

generate(xs, group_size=100, n_boots=100)¶

boostrap 1d array

Usage:

xs = np.arange(100)
bs = Bootstrap(seed=123)
for idx in bs.generate(xs, group_size=50, n_boots=10):
    print(xs[idx].mean())

Parameters

xs – 1d np.array
group_size – number of values in each bootstrap iteration
n_boots – how many bootstrap groups

Returns

bootstrapped

Return type

iterator

sequencing_tools.stats_tools._stats_tools.binom_test()¶

Vectorized binomial test

Usage:

success_test = [10,10,10]
total_test = [100,100,100]
binom_test(success_test, total_test, expected_p = 0.5)
#array([1.53164509e-17, 1.53164509e-17, 1.53164509e-17])

Parameters

success_test – number of success
total_test – number of total trials
expected_p – expected probability of success

Returns

list of p-values

Return type

list

sequencing_tools.stats_tools._stats_tools.cy_mean()¶

Fast numerical mean calculator, works with number generator too

Example:

from sequencing_tools.stats_tools import cy_mean
import numpy as np
a = range(10)
b = np.array(a)
%timeit b.mean()
# 6.37 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit cy_mean(a)
# 385 ns ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Usage:

cy_mean(list_of_numbers)

Parameters: xs – list of numbers
Returns: Mean of the list
Return type: float

sequencing_tools.stats_tools._stats_tools.hamming_distance()¶

Calculating hamming distance from two strings, the two strings has to be in the same length

Usage:

hamming_distance(string1, string2)

Parameters

string1 – first string in the comparison
string2 – second string in the comparison

Returns

the hamming edit distance between two strin

Return type

int

sequencing_tools.stats_tools._stats_tools.levenshtein_distance()¶

Calculating Levenshtein distance from two strings algorithm from: http://rosettacode.org/wiki/Levenshtein_distance#Python

usage:

levenshtein_distance(string1, string2)

Parameters

string1 – first string in the comparison
string2 – second string in the comparison

Returns

the edit distance between two string

Return type

int

sequencing_tools.stats_tools._stats_tools.normalize_count()¶

DESeq2 size factor

Parameters

count_mat – pandas dataframe with shape(m,n): m is gene count, n is sample number
return_sf – boolean, if it needs returning size factors

Returns

normalized count matrix

Return type

np.ndarray(m,n)

sequencing_tools.stats_tools._stats_tools.p_adjust()¶

Benjamini-Hochberg p-value correction for multiple hypothesis testing. adapted from https://stackoverflow.com/questions/7450957/how-to-implement-rs-p-adjust-in-python/21739593

Usage:

padj = p_adjust(pvalue_array)

Parameters: pvalue_array – list of pvalue
Returns: list of adjusted pvalue
Return type: list

An equivalent in R:

nm <- names(p)
vp <- as.numeric(p)
p0 <- setNames(p, nm)
lp <- length(p)

i <- lp:1L
o <- order(p, decreasing = TRUE)
ro <- order(o)
pmin(1, cummin(n/i * p[o]))[ro]

class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)¶

Parameters: seed – seed for random number generater

generate(xs, group_size=100, n_boots=100)¶

boostrap 1d array

Usage:

xs = np.arange(100)
bs = Bootstrap(seed=123)
for idx in bs.generate(xs, group_size=50, n_boots=10):
    print(xs[idx].mean())

Parameters

xs – 1d np.array
group_size – number of values in each bootstrap iteration
n_boots – how many bootstrap groups

Returns

bootstrapped

Return type

iterator

class sequencing_tools.stats_tools.regression.GradientDescent(lr=0.01, max_iter=10000, limit=0.0001, verbose=False, seed=123, n_iter_no_change=5, method='mean')¶

Adam_update()¶: Adam optimizer https://github.com/sagarvegad/Adam-optimizer/blob/master/Adam.py

Cost()¶: compute error with new regression coefficien

fit(X, y)¶

fitting B for

y = XB

utils¶

exception sequencing_tools.utils.SeqUtilsError¶

Exception type for the package sequencing tools

Example:

a = -1
if a < 0:
   raise SeqUtilsError("a should be > 0")

Welcome to sequencing_tools’s documentation!¶

Installation¶

fastq_tools¶

fasta_tools¶

bam_tools¶

consensus_tools¶

gene_tools¶

viz_tools¶

stats_tools¶

utils¶

Indices and tables¶