Welcome to sequencing_tools’s documentation!

This is Douglas’s repo for useful bioinformatics stuff.

Installation

Installation can be done via conda:

$ conda config --add channels bioconda
$ conda create -q -n test-environment python=3.6 \
    cython numpy networkx seaborn pyBigwig six pysam ujson pytest scipy   \
    matplotlib samtools future pytest-cov codecov
$ git clone https://github.com/wckdouglas/sequencing_tools.git
$ cd sequencing_tools
$ pip install .

fastq_tools

This module contains functions to interact with fastq files

class sequencing_tools.fastq_tools._fastq_tools.fastqRecord

Fastq record

Parameters
  • id (str) – fastq record id

  • seq (str) – fastq record seq

  • qual (str) – fastq record quality string

subseq()

Trim off sequences outside of start:end

Parameters
  • start (int) – start position on the sequence

  • end (str) – end position on the sequence

class sequencing_tools.fastq_tools._fastq_tools.GappedKmer(k=6, m=2)

Making Gapped kmer feature vector

Parameters
  • k – kmer size

  • m – number of mismatch tolerated

Example:

gkm = GappedKmer(k=3, m = 1)
seq = 'ACTGGAATG'
gkm.compute(seq)
sequencing_tools.fastq_tools._fastq_tools.complement()

Find complement a sequence.

Parameters

seq (str) – sequence to be complemented

Returns

complemented sequence

Return type

str

sequencing_tools.fastq_tools._fastq_tools.extract_kmer()
Parameters
  • sequence (str) – input sequence for extracting kmer

  • k (int) – kmer size

Returns

kmer from the sequence

Return type

generator(str)

class sequencing_tools.fastq_tools._fastq_tools.fastqRecord

Fastq record

Parameters
  • id (str) – fastq record id

  • seq (str) – fastq record seq

  • qual (str) – fastq record quality string

subseq()

Trim off sequences outside of start:end

Parameters
  • start (int) – start position on the sequence

  • end (str) – end position on the sequence

sequencing_tools.fastq_tools._fastq_tools.kmer_bag()

K-mer bag method for feature extraction

We used k = 1,2,3,4,5, resulting in 41 + 42 + 43 + 44 + 45 = 1364 features. For example, the sequence ACTGG would produce a length-1364 feature vector where the entries corresponding to the k-mers A, C, T, AC, CT, TG, GG, ACT, CTG, TGG, ACTG, CTGG, and ACTGG would each equal 1, and the entry corresponding to G would equal 2. All other entries equal 0. (from Zhang and Kamath. Learning the Language of the Genome using RNNs)

Example:

sequence = 'ACTGG'
kmer_bag(sequence, k_start = 1, k_end = 4)
#defaultdict(int,
#    {'A': 1,
#     'C': 1,
#     'T': 1,
#     'G': 2,
#     'AC': 1,
#     'CT': 1,
#     'TG': 1,
#     'GG': 1,
#     'ACT': 1,
#     'CTG': 1,
#     'TGG': 1,
#     'ACTG': 1,
#     'CTGG': 1})
Parameters
  • sequence (str) – the sequence for feature extraction

  • k_start (int) – k for the smallest kmer

  • k_end (int) – k for the largest kmer

Returns

key is kmer, values: count

Return type

dict

class sequencing_tools.fastq_tools._fastq_tools.onehot_sequence_encoder(bases='ACTGN')

A onehot encoder for DNA sequence

usage:

sequence = 'AAACTTTG'
dna_encoder = onehot_sequence_encoder()
onehot_encoded_matrix = dna_encoder.fit_transform(sequence)
onehot_encoded_matrix
#array([[1., 0., 0., 0.],
#    [1., 0., 0., 0.],
#    [1., 0., 0., 0.],
#    [0., 1., 0., 0.],
#    [0., 0., 0., 1.],
#    [0., 0., 0., 1.],
#    [0., 0., 0., 1.],
#    [0., 0., 1., 0.]])
dna_encoder.base_encoder  # check which base each column represents
# {'A': 0, 'C': 1, 'G': 2, 'T': 3}
decode(encoded_mat)
Parameters

array (onehot encoded) – len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence

Returns

Sequence, the decoded sequence

Return type

str

Usage:

mat = np.array([[1., 0., 0., 0.],
                [1., 0., 0., 0.],
                [1., 0., 0., 0.],
                [0., 1., 0., 0.],
                [0., 0., 0., 1.],
                [0., 0., 0., 1.],
                [0., 0., 0., 1.],
                [0., 0., 1., 0.]])
dna_encoder.decode(mat)   #'AAACTTTG'
fit(sequence='ACTGN')
Parameters

sequence (str) – the bases to be extract

fit_transform(sequence)

One hot sequence encoder

Parameters

sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n, with m unique bases

Returns

onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence

Return type

np.ndarray(n,m)

transform(sequence)

One hot sequence encoder

Parameters

sequence (str) – a string of sequence, only accept the bases suppled in fit, with length n

Returns

onehot encoded array, len(sequence)-by-distinct(base) matrix, columns represent each base, rows represent each position along the sequence

Return type

np.ndarray(n,m)

sequencing_tools.fastq_tools._fastq_tools.read_interleaved()

A interleaved fastq iterator

Usage:

with open('test.fq') as fastq:
    for read1, read2 in parse_interleaved(fastq):
        print(read1.id, read1.seq, read1.qual, read1.description)
        print(read2.id, read2.seq, read2.qual, read2.description)
Parameters

fp – file handle of a fastq file

Returns

(fastqRecord for R1, fastqRecord for R2) sequencing_tools.fastq_tools._fastq_tools.fastqRecord

Return type

Generator(tuple)

sequencing_tools.fastq_tools._fastq_tools.readfq()

A fastq iterator from Heng li

Parameters

fp – file handle of a fastq file

Returns

sequencing_tools.fastq_tools._fastq_tools.fastqRecord

Return type

Generator(FastqRecord)

Usage:

with open('test.fq') as fastq:
    for record in readfq(fastq):
        print(record.id, record.seq, record.qual, record.description)
sequencing_tools.fastq_tools._fastq_tools.reverse_complement()

Reverse complement a sequence.

Parameters

seq (str) – sequence to be reverse complemented

Returns

reverse complemented sequence

Return type

str

class sequencing_tools.fastq_tools.pe_align.ConsensusBuilder(error_toleration=0.1, min_len=15, report_all=False, conserved=False, highlight=False)

Merging the two reads from paired end sequencing, and build consensus sequence for the overlapped part, paired end sequencing of a fragments:

Illustration:

Reads:
               | adapter|                            |adapter |
            R1:         |------------------>
  DNA Fragment: ========|----------------------------|========
            R2:                     <----------------|
    report all:         |---------------------------->
not report all:                     |------>
Parameters
  • error_toleration (float) – maximum mismatch fraction at the overlap region, reads are merged if they have mismatch fraction lower than this

  • min_len (int) – minumum overlapping bases at the overlapping region, if lower than this, reads won’t merge

  • report_all (boolean) – True if the output merged sequence contains non-overlapping bases

Example:

consensus_builder = ConsensusBuilder(error_toleration = 0.1,
                            min_len = 15, report_all=False)
read1 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG',
    'ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAAAGATCGGAAGAGCACACGTCTGAA',
    'A6AA6//EA/AEE/AEAE///EEE/AAA/6/EE/A//EEE/A//EE/AE//E/////AEE///////////////',
    'NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG    Read1',
    )

read2 = fastqRecord('NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG',
    'TTTCCCGCACCGATAGCAGCCGCTCTGGTCTCCCATCCCGGGCAATTGTGTGATCGTCGGACTGTAGAACTCTGA',
    'AAA//E/E/E/A/A///EEE/E///<//EE/EE6/</EEA/A</EE<AAE/E/</<<AAE/E<///EE/EA///A',
    'NB501060:148:HNFYCBGX5:1:11101:10036:1116 2:N:0:GAGTGG    Read2',
    )

out = consensus_builder.run(read1, read2)
print(out)
# @NB501060:148:HNFYCBGX5:1:11101:10036:1116 1:N:0:GAGTGG
# ACACAATTGCCCGGGATGGGAGACCAGAGCGGCTGCTATCGGTGCGGGAAA
# +
# IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
run(R1, R2)

reverse complement read2 sequence and find matching position on read1

Parameters
Returns

fastq line for the merged concensus sequence [+ non-overlapping sequence]

Return type

str

class sequencing_tools.fastq_tools.function_clip.ReadTrimmer(barcode_cut_off=20, constant='', constant_no_evaluation=True, umi_bases=6, usable_seq=6, hamming_threshold=6, adapter='GATCGTCGGACTGTAGAACTCTGAACGTGTAGA', min_length=12)

Implementation for a paired end UMI read trimmer

  1. Triming UMI bases from 5’ of the UMI read sequence

  2. append the UMI bases to the supplied adapter, and use that to trim the other read from 3’

  3. trimming adapter sequence if overhang bases are sequenced

input reads (suppose A is adapter, X is UMI and N is biological fragment):

Example 1:

Input:
    Read 1:                         |--------------------------------->
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                     <----------------------------------|
    Trimming from Read 2:  AAAAAAAAAXXXXXX->
    Trimming from Read 1:                                          <---


Output:
    Read 1:                               |------------------------>
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                               <------------------------|

Example 2:

Input:
    Read 1:                         |--------------------->
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                                <-----------------------|


Output:
    Read 1:                               |------------------------>
    library fragment:      AAAAAAAAAXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAA
    Read 2:                                <-----------------------|

Usage:

clipping = ReadTrimmer(barcode_cut_off, constant,
            constant_no_evaluation, umi_bases,
            usable_seq, hamming_threshold, adapter,
            min_length)
with xopen(inFastq1, mode = 'r') as in1, xopen(inFastq2, mode = 'r') as in2:
    adapter = Adapters.TGIRT_R1.value
    iterable = zip(readfq(in1), readfq(in2))
    for count, (umi_read, opposite_read) in enumerate(iterable):
        ret_code, umi_read, opposite_read = clipping.trim_reads(umi_read, opposite_read)
        if ret_code == 1:
            if read == "read1":
                print(umi_read)
                print(opposite_read)
            if read == "read2":
                print(opposite_read)
                print(umi_read)
trim_reads(umi_read, opposite_read)

from each UMI read, clipped barcode and constant regions and make it as ID only when:

  1. barcode quality > cutoff and

  2. constant region matched

Parameters
Returns

return code (0: read pair doesn’t pass filter, 1: read pair is good), clipped UMI read (in 4-line fastq format), clipped the other side of read pair (in 4-line fastq format)

Return type

tuple(int, str, str)

fasta_tools

This module contains functions to interact with fasta files

class sequencing_tools.fasta_tools.IUPAC

Working with IUPAC bases

Usage:

iupac = IUPAC()
iupac.check_degenerate('ACTGN') # {'N'}
iupac.check_degenerate('ACTGY') # {'Y'}
list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG']
list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']
check_degenerate(seq)
Parameters

seq (str) – sequence with dengerate base

Returns

all degenerated bases from this sequence

Return type

set

expand(seq)

output all possible sequence by looping over the dengerative base

Parameters

seq (str) – sequence with dengerate base

Returns

all possible sequences from the DNA/RNA pool

Return type

Generator(str)

sequencing_tools.fasta_tools.readfa(file_handle)

A fasta reader iterator

Parameters

fp – file handle of a fasta file

Returns

sequence id, sequence

Return type

(str, str)

Usage:

with open('test.fa') as fasta:
    for seq_id, seq in readfq(fasta):
        print(seq_id, seq)
class sequencing_tools.fasta_tools.IUPAC

Working with IUPAC bases

Usage:

iupac = IUPAC()
iupac.check_degenerate('ACTGN') # {'N'}
iupac.check_degenerate('ACTGY') # {'Y'}
list(IUPAC().expand('ACTGN')) #['ACTGA', 'ACTGC', 'ACTGT', 'ACTGG']
list(IUPAC().expand('ACTGY')) #['ACTGC', 'ACTGT']
check_degenerate(seq)
Parameters

seq (str) – sequence with dengerate base

Returns

all degenerated bases from this sequence

Return type

set

expand(seq)

output all possible sequence by looping over the dengerative base

Parameters

seq (str) – sequence with dengerate base

Returns

all possible sequences from the DNA/RNA pool

Return type

Generator(str)

class sequencing_tools.fasta_tools.MultiAlignments(fa_file, RNA=False)
PairMatrix()

Calculate the hamming distances between each sequence pair

colors

color dictionary guiding the multiplex alignment plotting

concensus()

compute a consensus sequence from highest frequency base at each position

Returns

(consensus sequence, fraction of base making up the composition of the position)

Return type

tuple(str, list of numpy array

mul_df

sequence matrix, each column is a position, and nucleotide as value

pairwise

pairwise matrix computed by sequencing_tools.fasta_tools.MultiAlignment.PairMatrix()

plot(ax, min_pos=0, max_pos=None, fontsize=20, labelsize=20, sample_regex='[A-Za-z0-9_-]+')
Parameters
  • ax (plt.axes) – matplotlib axes

  • min_pos (float) – start position to plot on the multialignments

  • max_pos (float) – end position to plot on the mutlialignments

  • fontsize (int) – fontsize for the nucleotides

  • labelsize (int) – fontsize for sequence id

  • sample_regex (str) – regex for including sequecne id

bam_tools

This module contains functions to interact with bam files and bed files

sequencing_tools.bam_tools._bam_tools.check_concordant()

Check if read pairs are concordant

Parameters
  • read1 – first pysam alignment

  • read2 – second pysam alignment

Returns

True if they have same read ID, ref ID, are read 1 and read2, and is opposite strand

Return type

boolean

sequencing_tools.bam_tools._bam_tools.check_primary()

Check if read pairs are both primary

Parameters
  • read1 – first pysam alignment

  • read2 – second pysam alignment

Returns

True if both of them are primary alignments

Return type

boolean

sequencing_tools.bam_tools._bam_tools.cigar_to_str()

Cigarstring to string, only extract cigar op == M, I or S, Deletions (D) are skipepd to return same length cigar_seq as the query_sequence

Parameters

cigar_string – standard cigar string from BAM file

Returns

cigar_seq - same length string as sequence, with every character matched the aligned status of the base

Return type

str

Example:

$ cigar_seq = cigar_to_str('3S5M1I3M')
$ print cigar_seq
'SSSMMMMMIMMM'
sequencing_tools.bam_tools._bam_tools.concordant_alignment()

Check if alignment is properly mapped flag == 99, 147, 163, 83

Parameters

alignment – pysam alignment segment

Returns

True if 83 or 163 or 99 or 147 for their flags otherwise False

Return type

boolean

sequencing_tools.bam_tools._bam_tools.concordant_pairs()

Check if pair is properly mapped flag == 99, 147, 163, 83

Example:

usage: concordant_pairs(read1, read2)
Parameters
  • read1 – read1 of the pair

  • read2 – read2 of the pair

Returns

True if 83 and 163 or 99 and 147 for their flags otherwise False

Return type

boolean

sequencing_tools.bam_tools._bam_tools.fragment_ends()

get start and end position of a pair of reads

Example:

fragment_ends(read1, read2)
Parameters
  • read1 – a pysam alignment

  • read2 – a pysam alignment

Returns

(start, end), leftmost and rightmost position of the fragment

Return type

tuple

sequencing_tools.bam_tools._bam_tools.get_strand()

Get strand of the paired fragment

Parameters

alignment – an pysam aligned segment

Returns

Strand - “+” if it is reverse read2 or forward read1 “-” if it is reverse read1 or forward read2

Return type

str

sequencing_tools.bam_tools._bam_tools.make_cigar_seq()

Generator convert number and operator into a sequence of aligned status of bases, see split_cigar

Parameters
  • cigar_numbers – list of numbers in string format

  • cigar_operator – list of single character

Returns

cigar_base, sequence of cigar base

Return type

Generator

Example:

$ for c in make_cigar_seq(['3','5','1','3'],['S','M','I','M']):
$     print c
SSS
MMMMM
I
MMM
sequencing_tools.bam_tools._bam_tools.make_regions()

generator: segment chromosome in to regions usage: make_regions(chromosome_length, how_many_bases_every_time) return cigar_base

Parameters
  • chromosome_length – last base you want to look at

  • how_many_bases_every_time – segment size

Returns

(start, end): start, end of segment

Return type

tuple

Example:

for start, end in make_regions(100,10):
    print start, end

0 10
10 20
20 30
30 40
40 50
50 60
60 70
70 80
80 90
90 100
sequencing_tools.bam_tools._bam_tools.paired_bam()

iterator for paired end bam file

Parameters

bam_handle – bam file handle opened by pysam.Samfile

Returns

(read1, read2)

Return type

Generator(tuple)

Example:

with pysam.Samfile('path/to/bam/file') as bam:
    for read1, read2 in paired_bam(bam):
        print(read1.query_name)
        print(read2.query_name)
sequencing_tools.bam_tools._bam_tools.read_ends()

Get read end positions, output start and end position of a read

Parameters

AlignedSegment – a pysam alignment

Returns

(start, end), leftmost and rightmost position of the read

Return type

tuple

sequencing_tools.bam_tools._bam_tools.remove_insert()

Iterator remove insertion base from aligned sequence

usage: remove_insert(sequence, quality_string, cigar_seq) return: base, base_quality

Parameters
  • sequence – DNA sequence from BAM

  • quality_string – qual string from BAM

  • cigar_seq – cigar seq from cigar_to_str

Returns

(base, base_qual) base that are not annotated as insertion, BAQ associated with the base

Return type

Generator(tuple)

sequencing_tools.bam_tools._bam_tools.split_cigar()

Splitting cigar string to numpy array

Parameters

cigar_string – a cigar string e.g. ‘63M1S4D10M’

Returns

([list of cigar numbers], [list of cigar operators correspongs to the numbers])

Return type

(list,list)

Example:

$ split_cigar('63M')
[63], [M]

consensus_tools

class sequencing_tools.consensus_tools._consensus_tools.ErrorCorrection(mode='prob', threshold=0.8)

A module for error correction in fastq records, included two modes:

  1. prob: using base quality as probability of errors as prior to calculate posterior quality, see fgbio explanation

  2. vote: using a voting scheme as SafeSeq to generate consensus sequences

Parameters
  • mode – ‘prob’ or ‘vote’

  • threshold – only consider for “vote” mode, as a cutoff for returning a “N” if not enough fraction of bases agree

Example:

ec = ErrorCorrection(mode='prob')
ec.Correct(['AAACA','AAAAA','AAACA','AAACA'],
            ['IIIII','IIIAI','FFFFF', 'FFFFF'])
Correct(seq_list, qual_list)

Given a list of sequences and a list of quality, return a consensus sequence and a recalibrated quality

Parameters
  • seq_list – list of sequences with the same length

  • qual_list – list of the corresponding quality strings, must be in the same length as seq_list

Returns

(consensus seq, consensus qual)

Return type

tuple

gene_tools

This module contains functions to interact with bed files, refflat, and gtf files.

class sequencing_tools.gene_tools.Bed12Record(line)

Parser for bed12 line

genomic_position(tpos)

Translating transcript position into genomic position

Parameters

tpos (int) – position on transcript

Returns

genomic position

Return type

int

get_introns()

generator for introns, for each transcript (bed12 line), return all its intron

Returns

intron_line, 6-columns bed for the introns

Return type

generator

class sequencing_tools.gene_tools.GTFRecord(gtf_line)

parsing a GTF record line

Usage:

line = 'chr1    ensGene exon    14404   14501   .       -       .       gene_id "ENSG00000227232"; transcript_id "ENST00000488147"; exon_number "1"; exon_id "ENST00000488147.1"; gene_name "ENSG00000227232";'
gtf = GTFRecord(line)
print(gtf.info['gene_id']) # 'ENSG00000227232'
chrom

chromosome name

end

genomic end

feature_type

feature type of the transcript

info

info field as a dictionary

start

genomic start

strand

strand

class sequencing_tools.gene_tools.transcriptome.Transcriptome(refflat=None, sqldb=None, coding_only=True)

all annotated transcripts read in a refflat file

Parameters
  • refflat – refflat file or url

  • sqldb – Can be found under Annotationhub (if no refflat is supplied)

  • coding_only – only index coding transcript?

Usage:

url = 'http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz'
txome = Transcriptome(refflat=url, coding_only=False)
gene = txome.get_gene('WASH7P')
tx = gene['NR_024540']
print(tx.exons[1].start, tx.exons[2].end) # 29320 24891
print(tx.genomic_position(300)) #18171
print(tx.blocks(0, 300))
#[(29320, 29370), (24737, 24891), (18270, 18366)]

Or:

# wget http://s3.amazonaws.com/annotationhub/AHEnsDbs/v87/EnsDb.Hsapiens.v87.sqlite
txome = Transcriptome(sqldb="EnsDb.Hsapiens.v87.sqlite", coding_only=False)
coding_only

Only index coding genes (mRNA)?

get_gene(gene_name)

Get transcripts from gene

Parameters

gene_name (str) –

Returns

dictionary with key as transcript id as key and values are sequencing_tools.gene_tools.transcriptome.Transcript

Return type

dict

transcript_count

how many transcript was indexed?

transcript_dict

Dict storing transcripts, with gene name as key and values are sequencing_tools.gene_tools.transcriptome.Transcript

class sequencing_tools.gene_tools.transcriptome.Transcript(transcript)
blocks(tstart_pos: int, tend_pos: int)

given a start position and end position along the transcript, return the block starts and block sizes on the genome scale

Example:

                        exon 1                           exon 2
Transcript: |===========|-----------------|=================|
Amplicon:        |------>                 <--------|
blocks:             block1                   block2
Return:        [(a,     b)                (c,      d)   ]
Parameters
  • tstart_pos (int) – left position on the transcript

  • tend_pos (int) – right position on the transcript

Returns

list of tuples containing the (start, end) of each block

Return type

list

cde

coding end site (genomic)

cds

coding start site (genomic)

chrom

chromosome name

exon_count

number of exons in this transcript

exon_ends

str, comman list of exon end sites

exon_starts

str, comma list of exon start sites

exons

dictionary with key as exon rank, values are sequencing_tools.gene_tools.transcriptome.Exon

five_UTR_length

how long is the 5’ UTR? 5’ as of genomic coordinate

genomic_position(tpos)

translate transcriptome position to genomic position

Parameters

tpos – transcript position

Returns

the corresponding genomic position

Return type

int

Illustration:

|-------------|-----------*---------|-----------|
/    exon1   /\        exon2      / \    exon3              /            /  \                 /   \                       |------------|xxxx|---------*-----|xxxxx|----------|
            intron              intron
id

transcript ID

strand('+' or '-')

strand (‘+’ or ‘-‘)

transcript_length

transcript size

tx_end

transcription end site (genomnic)

tx_start

transcription start site (genomic)

class sequencing_tools.gene_tools.transcriptome.Exon(start, end, exon_num, strand)

Exon object storing the coordinate, exon rank and strand information

Parameters
  • start (int) – genomic start coordinate of the exon

  • end (int) – genomic end coordinate of the exon

  • exon_num (int) – exon rank

  • strand (string) – strandeness of the exon (either + or -)

after_cde

1 if downstream of coding end site else 0

after_cds

1 if downstream of coding start site else 0

coding_bases

how many bases are within coding frame

contain_cde

1 if this exon contains coding end site else 0

contain_cds

1 if this exon contains coding start site else 0

end

Exon end position on the chromosome

exon_num

Exon rank

length

Exon size

start

Exon start position on the chromosome

strand

Exon strand

transcript_end

the transcript position of the last exon base

transcript_start

the transcript position of the first exon base

viz_tools

This module contains some functions that work with figures

class sequencing_tools.viz_tools.color_encoder

color-encoding a categoric vector

Example:

categorical_vector = ['a','b','c','a']
colors = obakeito_palette()
ce = color_encoder()
ce.fit(categorical_vector, colors)
encoded_colors = ce.transform(new_categorical_vector)

or:

ce = color_encoder()
encoded_colors = ce.fit_transform(categorical_vector, colors)

access color encoder:

encoded_color_map = ce.encoder
encoder

color encoder dictionary

fit(x, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])

Usage:

ce = color_encoder()
ce.fit(categroical_vector, colors)
fit_transform(xs, colors=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7', '#999999'])

ce = color_encoder() encoded_colors = ce.fit_transform(categorical_vector, colors)

show_legend(ax=None, sort=False, **kwargs)

Adding matplotlib legend

transform(xs)

Usage:

ce = color_encoder()
ce.fit(categroical_vector, colors)
encoded_colors = ce.transform(new_categorical_vector)
sequencing_tools.viz_tools.cor_plot(plot_df, fig, diagonal_line=True, method='pearson', **kwargs)

given a data frame with columns storing data for each sample, output a matplotlib figure object as correlation plots.

usage: cor_plot(plot_df, fig=fig, diagonal_line = True, method = ‘pearson’)

Parameters
  • plot_df – a pandas dataframe

  • fig – matplotlib figure object (optional)

  • diagonal_line – Boolean controlling if a diagonal line should be drawn

  • method – correlation method, [spearman or pearson]

Returns

matplotlib figure object

Return type

matplotlib.pyplot.figure

sequencing_tools.viz_tools.douglas_palette()

Automatic set color if seaborn is installed, otherwise return list of colors Example:

colors = douglas_palette()

ax=plt.subplot();
for i in range(14):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()
sequencing_tools.viz_tools.maximum_palette()

modified from: https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/

Example:

colors = maximum_palette()

ax=plt.subplot();
for i in range(22):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()
sequencing_tools.viz_tools.mixed_sort(list_of_elements)

https://arcpy.wordpress.com/2012/05/11/sorting-alphanumeric-strings-in-python/

Parameters

of (a list) –

Returns

ordered list

Return type

list

Example:

a = ['A1','A10','A2','A20']
mixed_sort(a)
> ['A1','A2', 'A10', 'A20']
sequencing_tools.viz_tools.okabeito_palette()

Color palette proposed by Okabe and Ito copy from colorboindr R package https://github.com/clauswilke/colorblindr/blob/master/R/palettes.R

Example:

colors = okabeito_palette()

ax=plt.subplot();
for i in range(8):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()
sequencing_tools.viz_tools.plot_upset(fig, upset_df, ylab='Intersected count', matrix_to_plot_ratio=0.4, fontsize=15)
Parameters
  • fig – matplotlib figure object

  • upset_df – pandas dataframe, contain column: sample matrix (binary), “count”, ‘index’

  • ylab – y label

  • matrix_to_plot_ratio – ratio between interaction matrix and batplot

  • fontsize – fontsize

input example:

index   HeLa    K562    UHRR    Plasma  count
3       0       1       1       1       1
4       1       0       0       0       2
5       1       1       0       0       3
1       0       1       0       0       11
6       1       1       0       1       12
7       1       1       1       1       12
2       0       1       0       1       13
0       0       0       0       1       30
sequencing_tools.viz_tools.simpsons_palette()

A palette from ggsci R package https://github.com/road2stat/ggsci/blob/master/data-raw/data-generator.R

Example:

colors = simpsons_palette()

ax=plt.subplot();
for i in range(16):
    ax.plot(np.arange(10),np.arange(10) + i,label=i, color = colors[i])
ax.legend()

stats_tools

This module contains some math functions

class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)
Parameters

seed – seed for random number generater

generate(xs, group_size=100, n_boots=100)

boostrap 1d array

Usage:

xs = np.arange(100)
bs = Bootstrap(seed=123)
for idx in bs.generate(xs, group_size=50, n_boots=10):
    print(xs[idx].mean())
Parameters
  • xs – 1d np.array

  • group_size – number of values in each bootstrap iteration

  • n_boots – how many bootstrap groups

Returns

bootstrapped

Return type

iterator

sequencing_tools.stats_tools._stats_tools.binom_test()

Vectorized binomial test

Usage:

success_test = [10,10,10]
total_test = [100,100,100]
binom_test(success_test, total_test, expected_p = 0.5)
#array([1.53164509e-17, 1.53164509e-17, 1.53164509e-17])
Parameters
  • success_test – number of success

  • total_test – number of total trials

  • expected_p – expected probability of success

Returns

list of p-values

Return type

list

sequencing_tools.stats_tools._stats_tools.cy_mean()

Fast numerical mean calculator, works with number generator too

Example:

from sequencing_tools.stats_tools import cy_mean
import numpy as np
a = range(10)
b = np.array(a)
%timeit b.mean()
# 6.37 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit cy_mean(a)
# 385 ns ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Usage:

cy_mean(list_of_numbers)
Parameters

xs – list of numbers

Returns

Mean of the list

Return type

float

sequencing_tools.stats_tools._stats_tools.hamming_distance()

Calculating hamming distance from two strings, the two strings has to be in the same length

Usage:

hamming_distance(string1, string2)
Parameters
  • string1 – first string in the comparison

  • string2 – second string in the comparison

Returns

the hamming edit distance between two strin

Return type

int

sequencing_tools.stats_tools._stats_tools.levenshtein_distance()

Calculating Levenshtein distance from two strings algorithm from: http://rosettacode.org/wiki/Levenshtein_distance#Python

usage:

levenshtein_distance(string1, string2)
Parameters
  • string1 – first string in the comparison

  • string2 – second string in the comparison

Returns

the edit distance between two string

Return type

int

sequencing_tools.stats_tools._stats_tools.normalize_count()

DESeq2 size factor

Parameters
  • count_mat – pandas dataframe with shape(m,n): m is gene count, n is sample number

  • return_sf – boolean, if it needs returning size factors

Returns

normalized count matrix

Return type

np.ndarray(m,n)

sequencing_tools.stats_tools._stats_tools.p_adjust()

Benjamini-Hochberg p-value correction for multiple hypothesis testing. adapted from https://stackoverflow.com/questions/7450957/how-to-implement-rs-p-adjust-in-python/21739593

Usage:

padj = p_adjust(pvalue_array)
Parameters

pvalue_array – list of pvalue

Returns

list of adjusted pvalue

Return type

list

An equivalent in R:

nm <- names(p)
vp <- as.numeric(p)
p0 <- setNames(p, nm)
lp <- length(p)

i <- lp:1L
o <- order(p, decreasing = TRUE)
ro <- order(o)
pmin(1, cummin(n/i * p[o]))[ro]
class sequencing_tools.stats_tools._stats_tools.Bootstrap(seed=123)
Parameters

seed – seed for random number generater

generate(xs, group_size=100, n_boots=100)

boostrap 1d array

Usage:

xs = np.arange(100)
bs = Bootstrap(seed=123)
for idx in bs.generate(xs, group_size=50, n_boots=10):
    print(xs[idx].mean())
Parameters
  • xs – 1d np.array

  • group_size – number of values in each bootstrap iteration

  • n_boots – how many bootstrap groups

Returns

bootstrapped

Return type

iterator

class sequencing_tools.stats_tools.regression.GradientDescent(lr=0.01, max_iter=10000, limit=0.0001, verbose=False, seed=123, n_iter_no_change=5, method='mean')
Adam_update()

Adam optimizer https://github.com/sagarvegad/Adam-optimizer/blob/master/Adam.py

Cost()

compute error with new regression coefficien

fit(X, y)

fitting B for

y = XB

utils

exception sequencing_tools.utils.SeqUtilsError

Exception type for the package sequencing tools

Example:

a = -1
if a < 0:
   raise SeqUtilsError("a should be > 0")

Indices and tables