MolBioWiz by JoeCouto

help

Press Choose File to select plain text files (e.g. .txt .py .html) from your computer, then press Upload to INPUT

If you upload binary files (e.g. .doc .pdf) you will get nonsense characters.

If you want to work with text from web pages, pdf files, etc., you can highlight the text, copy it (ctrl + c), and paste it (ctrl + v) into INPUT.

start index→ [ ... ] ←end index

help

Tutorial: press this to load some text lines into INPUT. Then play with the buttons.

grep extracts lines that contain the search motif
Optionally, check the case-sensitive box
search motif accepts regular expressions
findAll outputs a list of found motifs and their positions in the complete text
replace finds every match of search motif and replaces it with the text entered in replace with
substring extracts vertical columns of text

press the add ruler button to help you select the indices
note that the ruler is added only if there is some text in INPUT
once you decide what the indices should be, press the remove ruler button and then substring

delRepeats removes repeated lines - (flanking blank spaces are ignored)
other text functions - At least with Chrome or Edge browsers, you can press ctrl + F to search text, and you can right-click over the text areas for additional text functions.
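As a rough illustration of what grep, findAll and substring do, here is a minimal Python sketch. It is not the tool's own code: the function names, the 0-based positions and the end-exclusive column range are assumptions made for the example.

```python
import re

def grep(text, motif, case_sensitive=False):
    """Return the lines of text that contain the regular expression motif."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(motif, flags)
    return [line for line in text.splitlines() if pattern.search(line)]

def find_all(text, motif, case_sensitive=False):
    """Return (match, position) pairs over the complete text (0-based, assumed)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return [(m.group(0), m.start()) for m in re.finditer(motif, text, flags)]

def substring(text, start, end):
    """Keep the vertical column [start, end) of every line (end-exclusive, assumed)."""
    return [line[start:end] for line in text.splitlines()]

sample = "alpha ATG\nbeta ctg\ngamma atg"
print(grep(sample, "atg"))          # lines containing the motif, ignoring case
print(find_all(sample, "[ac]tg"))   # matches with their positions in the whole text
print(substring(sample, 0, 5))      # first five columns of each line
```

The position convention used by the actual tool may differ, so the ruler remains the reliable way to pick indices.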

clean name rev tra xtrct show

help

Seqs works on fastA-formatted sequences entered in INPUT. You can upload up to thousands of sequences from a text file, or copy them from another file or web page and paste them into INPUT.
Tutorial: press this to load some sequences into INPUT. Then play with the buttons.

clean outputs a single line of sequence, controls up/lower cases, removes/keeps dashes
name batch sequence renaming
rev reverse-complement, reverse-transcribe
tran translate
xtrct extract sequences that contain search motif
show show motifs aligned with seqs
The openFormat button in the output window opens another window for formatting (i.e. line breaks, residue numbering).

SeqOnly

Seq&Dashes

Case:

UPPER

lower

Ori

pad rplc cut

help

Works on sequence names (lines starting with ">") only.

pad makes all the names the same length as the longest name (by adding _).
rplc replaces search motif with replace with.
cut slices sequence name lines by keeping only length characters from start. If length is 0 or if it exceeds the number of leftover characters, all characters are kept from start through the end of the name.
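The pad and cut behaviours described above could be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that start is 0-based and counted after the leading ">", and that padding is measured on the whole name line. None of these details are stated in the help text.

```python
def pad_names(lines):
    """Pad every name line (starting with '>') with '_' to the longest name length."""
    names = [line for line in lines if line.startswith(">")]
    width = max((len(name) for name in names), default=0)
    return [line.ljust(width, "_") if line.startswith(">") else line
            for line in lines]

def cut_names(lines, start, length):
    """Keep `length` characters of each name from `start` (assumed 0-based, after '>').

    A length of 0, or one that runs past the end of the name, keeps everything
    from start through the end of the name.
    """
    result = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:]
            end = None if length == 0 else start + length
            line = ">" + name[start:end]
        result.append(line)
    return result

fasta = [">ab", "ACGT", ">abcdef", "GG"]
print(pad_names(fasta))
print(cut_names(fasta, 1, 3))
```

Sequence lines pass through unchanged, matching the statement that these functions work on sequence names only.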

RevComp RT-human RT-other

help

Processing of INPUT fastA-formatted DNA and protein sequences.

RC reverse-complements DNA sequences containing only IUPAC codes plus X.
5'-ACGTURYSWKMBDHVNX-3' → 5'-XNBDHVKMWSRYAACGT-3'
↓
3'-TGCAAYRSWMKVHDBNX-5'
where:U=T, R=A/G, Y=C/T, S=G/C, W=A/T, K=G/T, M=A/C, B=C/G/T, D=A/G/T, H=A/C/T, V=A/C/G, and N=X=A/C/G/T
RT-human reverse-transcribes protein sequences using an internal human codon frequency table
RT-other reverse-transcribes protein sequences using other codon-frequency tables
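The reverse-complement mapping listed above can be reproduced with a small Python sketch. The function name is hypothetical, and upper-casing the input and rejecting unknown characters are assumptions; the tool may handle case and invalid letters differently.

```python
# Complement table built from the IUPAC pairs given above (U is treated as T).
IUPAC_CODES = "ACGTURYSWKMBDHVNX"
COMPLEMENTS = "TGCAAYRSWMKVHDBNX"
COMPLEMENT_TABLE = str.maketrans(IUPAC_CODES, COMPLEMENTS)

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence of IUPAC codes plus X."""
    seq = seq.upper()  # assumption: case is normalised to upper case
    unknown = set(seq) - set(IUPAC_CODES)
    if unknown:
        raise ValueError(f"non-IUPAC characters: {sorted(unknown)}")
    # Complement each base, then reverse the string.
    return seq.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement("ACGTURYSWKMBDHVNX"))  # XNBDHVKMWSRYAACGT
```

The printed result matches the 5'→3' example shown in the RC description.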

name seq

help

Reads fastA-formatted sequences in INPUT and separates those that have search motif from those that don't. Optionally, check the case-sensitive box.

name extraction if search motif is found in the sequence name. Sequences are not reformatted
seq extraction if search motif is found in the sequence. Sequences are converted into a single line of letters only

→ ⇄

help

Finds case-insensitive search motif in fastA-formatted INPUT sequences and shows the motif aligned with the sequence. Search motif can be a regular expression. Adds a list of all found positions and patterns to the sequence name line. Sequences that do not contain the search motif are separated into OUT2, a different output text area.

→ the motif is shown at all found positions above a single line of sequence, which could be protein or ssDNA
⇄ the motif and its reverse-complement are shown at all found positions, respectively above and below dsDNA
sequences containing at least one non-ATGC code are assumed to be protein (i.e. the ⇄ option is ignored)

help

fetchFromUniprot reads a list of Uniprot accession numbers (e.g., Q9NZQ7 Q01279, ... separated by white spaces or new lines) in INPUT and fetches their fastA-formatted sequences from https://www.uniprot.org/. Note that Uniprot could possibly block this kind of data scraping in the future. You may also submit a list of accession numbers and retrieve the respective sequences on the Uniprot site, though it takes longer.
Properties reads fastA-formatted protein sequences in INPUT and outputs a table of properties. It is best to copy the output (ctrl + A) (ctrl + C) and paste it (ctrl + V) to a spreadsheet.

R K H... Aa counts
sz Total # Aas
kd Molecular mass (KiloDaltons)
charge atpH approximate pI of the protein if the charge is 0 (see procedure below)
kd hw Kyte-Doolittle and Hopp-Woods weighted arithmetic means

Tutorial: (a) clear INPUT, (b) copy and paste these accession numbers Q9NZQ7 Q01279 to INPUT, (c) press fetchFromUniprot, (d) on the UNIPROT output area press the copy to INPUT button, (e) on the INPUT area, which now contains the fastA-formatted sequences, press Properties. You should see the results below, which you can manually copy to a spreadsheet.

Here is my procedure to calculate the pI (charge atpH):

1. Calculate the charge of the protein at pH 6.5 using Bjellqvist's pKa values: R(12) K(10) H(5.98) Nterm(7.5) D(4.05) E(4.45) Y(10) C(9) Cterm(3.55).
   This is done by adding the following two sums:
   for positive Aas and Nterm: ∑ 1/(1 + 10^(pH - pK_Aa))
   for negative Aas and Cterm: ∑ -1/(1 + 10^(pK_Aa - pH))
2. If the charge is negative, lower the pH by 1 unit; otherwise, increase the pH by 1 unit.
3. Re-calculate the charge.
4. Iterate steps 2 and 3, but if the charge changes from positive to negative or vice versa, add or subtract only half a pH unit.
5. Keep iterating and halving the pH units until the charge is within 0.01 of zero, or until the number of iterations reaches 30.
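The pI procedure above can be sketched in Python as follows. This is one reading of the steps, not the tool's actual code: the function names are hypothetical, and halving the step at every sign change of the charge is an interpretation of "keep iterating and halving".

```python
# pKa values quoted in the procedure above.
POSITIVE_PK = {"R": 12.0, "K": 10.0, "H": 5.98}
NEGATIVE_PK = {"D": 4.05, "E": 4.45, "Y": 10.0, "C": 9.0}
NTERM_PK, CTERM_PK = 7.5, 3.55

def net_charge(seq, ph):
    """Sum the partial charges of ionizable residues and both termini at a pH."""
    positive = [NTERM_PK] + [POSITIVE_PK[aa] for aa in seq if aa in POSITIVE_PK]
    negative = [CTERM_PK] + [NEGATIVE_PK[aa] for aa in seq if aa in NEGATIVE_PK]
    charge = sum(1.0 / (1.0 + 10 ** (ph - pk)) for pk in positive)
    charge -= sum(1.0 / (1.0 + 10 ** (pk - ph)) for pk in negative)
    return charge

def estimate_pi(seq, start_ph=6.5, tolerance=0.01, max_iterations=30):
    """Return (pH, charge) where the net charge is approximately zero."""
    seq = seq.upper()
    ph, step = start_ph, 1.0
    charge = net_charge(seq, ph)
    for _ in range(max_iterations):
        if abs(charge) < tolerance:
            break
        # Negative charge: lower the pH; otherwise raise it.
        ph += -step if charge < 0 else step
        new_charge = net_charge(seq, ph)
        # Assumed reading: halve the step whenever the charge changes sign.
        if (new_charge < 0) != (charge < 0):
            step /= 2.0
        charge = new_charge
    return ph, charge

ph, charge = estimate_pi("GGGG")
print(round(ph, 2), round(charge, 4))
```

For a peptide with no ionizable side chains, the estimate should land near the midpoint of the two terminal pKa values, (7.5 + 3.55) / 2 ≈ 5.5.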

%IdFr1	%IdFr2	%IdFr3	%IdFr4	truncL	trunc%

AbAlign help

AbAlign aligns fastA-formatted antibody sequences entered in INPUT ignoring all but the variable regions.

It finds VH, VK or VL frameworks using a small database of template sequences, which can only be edited at the code level (let me know if you think the template needs to be expanded). As is, it should correctly align most human, mouse, and rabbit sequences to an IMGT-like FR numbering system.
⚠ Sequence names are truncated to 10 characters.
There are five extra residues in FR1 (0, 11a-d) to accommodate rare rabbit sequences.
The first 4 boxes allow you to change the %ID threshold for finding a FR match.
If you know that a FR is NOT present in your sequence enter 100% in its box to avoid finding a similar but wrong FR sequence.
If the tool can't find a complete FR, it tries to find a truncated one using the last two parameters, minimum length of truncated FR and %ID, respectively.
The AbDev button in the output window opens another movable window with its own functions that look for sequence motifs that could be problematic for antibody developability.

Developability help

Antibody Developability

These functions check for motifs that might cause problems in antibody developability.
They work only on the aligned antibody sequences in the Ab Alignment window.
You cannot download the output because it has colored text, but you can select it (CTRL+A), copy it (CTRL+C) and paste it into an MS Word document. If you paste with "keep source formatting", the text colors will be preserved in MS Word.
Note: the regular expressions for the following motifs are more complex than usual because they must account for possible intervening "_" characters in the antibody sequence alignments.

glyc - N-linked glycosylation sites N_*[^P]_*[ST]
NxDx - CDR motifs that may lead to deamidation/isomerization N_*[GSTNH]|G_*N_*[FY]|D_*[GSDTH]
specf - CDR3 scFv non-specificity motifs G+_*G+|R+_*R+|V_*G|V+_*V+|W+_*W+|Y+_*Y+|W_*\w_*W
aggr - motifs in CDRs that cause aggregation F_*H_*W
visc - motifs in CDRs that lead to viscosity H_*Y_*F|H_*W_*H
posit - marks R,K,D,E CDR residues if Nmbr(R,K) - Nmbr(D,E) > 1
cys - colors canonical Cys residues at positions 23 and 104 in red and others in blue
motif - enter your own regular expression in search motif
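To show how the "_*" parts of these patterns tolerate alignment gaps, here is a small Python sketch that applies two of the listed regular expressions to a made-up aligned fragment. The helper name and the example sequences are hypothetical, and positions are 0-based alignment columns (gaps included), which may differ from how the tool reports them.

```python
import re

# Two of the patterns listed above, copied as written.
MOTIFS = {
    "glyc": r"N_*[^P]_*[ST]",
    "aggr": r"F_*H_*W",
}

def find_motif(aligned_seq, pattern):
    """Return (column, matched text) for every hit in a gapped sequence."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, aligned_seq)]

aligned = "QVN__GTAF_HW"  # invented fragment with "_" alignment gaps
print(find_motif(aligned, MOTIFS["glyc"]))  # N, gap, gap, G, T is still found
print(find_motif(aligned, MOTIFS["aggr"]))  # F, gap, H, W is still found
```

Without the "_*" parts, the same motifs would be missed whenever the alignment places a gap between their residues.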

minPepLen	maxPepLen	maxKyDo	EDRK	ILVMF	duplc	forbid	score	bestN	nextStart

help

PepScan helps find good peptide immunogens in fastA-formatted protein sequences entered in INPUT.
It outputs protein-peptide sequence alignments and a peptide list into the ALIGNED and PEPTIDES text areas, respectively.
The initial selection parameters are fairly strict, so you may not get any peptides until you change the parameters.

You can FORMAT the ALIGNED output to wrap the lines while maintaining the alignments
To fastA-format the peptide list swap the outputs (swap OUT1⇄OUT2) and once the list is in the top output textarea you can toggle the format by pressing fastA⇄name::seq

Selection Parameters

minPepLen Exclude peps shorter than # Aas
maxPepLen Exclude peps longer than # Aas
maxKyDo Exclude peps with KyDo > value (KyDo = weighted arithmetic mean of Aa residue Kyte-Doolittle values, excluding N and C termini)
EDRK Point-subtraction for excessive charges (see Scoring below)
ILVMFAC Point-subtraction for hydrophobicity (see Scoring below)
duplc Point-subtraction for Aa duplications (see Scoring below)
forbid Exclude peps if they contain listed Aa residues
score Exclude peps with scores < value
bestN Show peps with the top N scores
nextStart The next potential pep starts n residues after previous one

Scoring

initial score = peptide length + number of unique residues
+ 2 pts if -1 < kydo < -0.2
+ 1 pt if kydo < -1 (too many +/- charges)
- x pts if s = 2^p(n-2); n = # repeated Aas, p = entered parameter.
- x pts if pep contains > 2 consecutive charged (EDRK) Aas
- x pts if pep contains > 2 consecutive hydrophobic (ILVMFAC) Aas

Disclaimer