Disclaimer

This software is provided "as is" without any warranties, express or implied. The developer shall not be held liable for any errors, inaccuracies, or damages resulting from the use of this software. By using this software, you agree to the terms of this disclaimer and waive any rights to pursue legal action against the developer.

This software is intended for educational and informational purposes only. Users are responsible for verifying the accuracy and suitability of the software for their specific needs.

If you do not agree with these terms, please do not use the software.




About MOLBIOWIZ Joe Couto JAN 2025
MolBioWiz can process up to thousands of DNA or protein sequences at once, for sequence formatting, translation, reverse-complementing, reverse-transcription and motif-searches.
It can also do multiple antibody V-region sequence alignments, find antibody developability problems, and help design peptide immunogens.

Press Choose File to select plain text files (e.g. .txt .py .html) from your computer, then press Upload to INPUT.

If you upload binary files (e.g. .doc .pdf) you will get nonsense characters.

If you want to work with text from web pages, pdf files, etc, you can highlight the text, copy it (ctrl + c), and paste it (ctrl + v) into INPUT.

[ ... ]
help Tutorial: press this to load some text lines into INPUT. Then play with the buttons.
  • grep extracts lines that contain the search motif
  • Optionally, check the case-sensitive box
  • search motif accepts regular expressions
  • findAll outputs a list of found motifs and their positions on the complete text
  • replace finds search motif and replaces it with replace with
  • substring extracts vertical columns of text
    • press the add ruler button to help you select the indices
    • note that the ruler is added only if there is some text in INPUT
    • once you decide what the indices should be, press the remove ruler button and then substring
  • delRepeats removes repeated lines - (flanking blank spaces are ignored)
  • other text functions - At least with Chrome or Edge browsers, you can press ctrl + F to search text, and you can right-click over the text areas for additional text functions.
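The text functions above can be sketched in a few lines of Python. This is only an illustration of what grep, findAll and delRepeats do, not the tool's own code; the function names and the 0-based positions are assumptions.

```python
import re

def grep(text, motif, case_sensitive=False):
    """Return the lines that contain the regular-expression motif."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(motif, flags)
    return [line for line in text.splitlines() if pattern.search(line)]

def find_all(text, motif, case_sensitive=False):
    """Return (found motif, position) pairs over the complete text.
    Positions are 0-based here, which is an assumption."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return [(m.group(), m.start()) for m in re.finditer(motif, text, flags)]

def del_repeats(text):
    """Remove repeated lines, ignoring flanking blank spaces.
    The first occurrence of each line is kept."""
    seen, kept = set(), []
    for line in text.splitlines():
        key = line.strip()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

sample = "alpha beta\nGAMMA\n  alpha beta  \ngamma delta"
print(grep(sample, "gam+a"))
print(find_all(sample, "a[lm]"))
print(del_repeats(sample))
```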
help Seqs works on fastA-formatted sequences entered in INPUT. You can upload up to thousands of sequences from a text file, or copy them from another file or web page and paste them into INPUT.
Tutorial: press this to load some sequences into INPUT. Then play with the buttons.
  • clean outputs a single line of sequence, controls up/lower cases, removes/keeps dashes
  • name batch sequence renaming
  • rev reverse-complement, reverse-transcribe
  • tran translate
  • xtrct extract sequences that contain search motif
  • show show motifs aligned with seqs
  • The openFormat button in the output window opens another window for formatting (i.e. line breaks, residue numbering).


   
help Works on sequence names (lines starting with ">") only.
  • pad makes all the names the same length as the longest name (by adding _).
  • rplc replaces search motif with replace with.
  • cut slices sequence name lines by keeping only length characters from start. If length is 0 or if it exceeds the number of leftover characters, all characters are kept from start through the end of the name.
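As a rough illustration of pad and cut (not the tool's own code), the two rules can be written as below. The function names are made up, and start is treated as a 0-based index into the name, which is an assumption.

```python
def pad_names(names):
    """Pad every name with '_' up to the length of the longest name."""
    width = max(len(name) for name in names)
    return [name.ljust(width, "_") for name in names]

def cut_name(name, start, length):
    """Keep `length` characters from `start` (0-based, an assumption).
    If length is 0 or exceeds what is left, keep everything from start on."""
    if length == 0:
        return name[start:]
    return name[start:start + length]  # slicing past the end is safe in Python

print(pad_names([">a", ">abc"]))
print(cut_name("heavy_chain_01", 6, 5))
print(cut_name("heavy_chain_01", 6, 0))
```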
CODON USAGE .txt
The human codon usage table on the left appears when you select RT-other and will disappear if you select Rev-Comp or RT-human
You can replace it by uploading your own alternative codon-usage text file - Warning: your file must be in exactly the same format as the table on the left, meaning:
  • do not touch the codons (e.g. F:TTT,TTC)
  • change ONLY the percentages and use INTEGER numbers (e.g. F:58,42)
  • optionally, change the word "Human" in the 1st line, but DO NOT CHANGE THE WORD "Organism"
  • list the codons you want to exclude on last line (keeping the word:"avoid"), but:
    1. use only capital letters, triplets separated by commas (no spaces)
    2. keep the XYZ
 










 
help Translation of fastA-formatted INPUT DNA sequences
⚠ AGCT only - seqs containing non-AGCT codes are flagged but not processed
Stop codons translate to *
  • Translation starts at the designated nt#
  • If motif is checked:
    • translation starts at the first incidence of search motif plus the designated nt# after the search motif
    • if the search motif is not found there is no processing and no output
  • fr1 fr2 fr3: choose any combination of translation frames; (note: it translates the top strand only)
  • dna: output translation with no DNA shown
  • ⇀: output translation below ssDNA or output just the ssDNA if no frames are selected
  • ⇌: output translation below dsDNA or output just the dsDNA if no frames are selected
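Here is a minimal sketch of the translation rules above: AGCT only, stop codons as *, top strand only, and a choice of frames. It assumes the standard genetic code and a 1-based start nt#; the function names are hypothetical and the motif option is not covered.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, start_nt=1, frame=1):
    """Translate the top strand from start_nt (1-based, an assumption) in frame 1, 2 or 3.
    Returns None for sequences with non-AGCT codes (flagged, not processed)."""
    seq = dna.upper()
    if set(seq) - set("AGCT"):
        return None
    offset = (start_nt - 1) + (frame - 1)
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def translate_frames(dna, frames=(1, 2, 3), start_nt=1):
    """Translate any combination of the three top-strand frames."""
    return {f"fr{frame}": translate(dna, start_nt, frame) for frame in frames}

print(translate_frames("ATGGCCTAA"))
```

Incomplete codons at the end of the sequence are simply dropped in this sketch.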
 
help Read fastA-formatted sequences in INPUT and separate those that have search motif from those that don't
  • Optionally, check the case-sensitive box
  • name extraction if search motif is found in the sequence name. Sequences are not reformatted
  • seq extraction if search motif is found in the sequence. Sequences are converted into a single line of letters only
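The separation into sequences with and without the search motif can be sketched as follows. This is an assumption-laden illustration, not the tool's code: the function names are hypothetical, and for simplicity sequences are always joined into a single line of letters, even for name extraction.

```python
import re

def parse_fasta(text):
    """Return (name, sequence) pairs; each sequence is one line of letters only."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def extract(text, motif, where="seq", case_sensitive=False):
    """Split records into (with motif, without motif), searching names or sequences."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(motif, flags)
    hits, misses = [], []
    for name, seq in parse_fasta(text):
        target = name if where == "name" else seq
        (hits if pattern.search(target) else misses).append((name, seq))
    return hits, misses

fasta = ">s1 heavy\nATGC\nGG-TT\n>s2 light\nCCCC\n"
print(extract(fasta, "gcgg"))
print(extract(fasta, "light", where="name"))
```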
    
help Finds case-insensitive search motif in fastA-formatted INPUT sequences and shows the motif aligned with the sequence. Search motif can be a regular expression. Adds a list of all found positions and patterns to the sequence name line. Sequences that do not contain the search motif are separated into OUT2, a different output text area.
  • → the motif is shown at all found positions above a single line of sequence, which could be protein or ssDNA
  • ⇄ the motif and its reverse-complement are shown at all found positions respectively above and below dsDNA
  • sequences containing at least one non-ATGC code are assumed to be protein (i.e. the ⇄ option is ignored)
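A possible way to find the positions reported by show is sketched below. It assumes 1-based positions and, because the search motif can be a regular expression, it finds bottom-strand hits by searching the reverse-complemented sequence rather than by reverse-complementing the motif. The function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def is_dna(seq):
    """Sequences with at least one non-ATGC code are assumed to be protein."""
    return not (set(seq.upper()) - set("ATGC"))

def motif_positions(seq, motif):
    """Case-insensitive search; returns (1-based start, found pattern) pairs."""
    return [(m.start() + 1, m.group()) for m in re.finditer(motif, seq, re.IGNORECASE)]

def show(seq, motif):
    """Return (top-strand hits, bottom-strand hits).
    Bottom-strand starts are converted back to top-strand coordinates."""
    top = motif_positions(seq, motif)
    bottom = []
    if is_dna(seq):
        rc = reverse_complement(seq)
        bottom = [(len(seq) - (pos - 1) - len(hit) + 1, hit)
                  for pos, hit in motif_positions(rc, motif)]
    return top, bottom

print(show("ATGCAAAGCAT", "ATGC"))
print(show("MKTAYIAK", "K"))
```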
help
  • fetchFromUniprot reads a list of Uniprot accession numbers (e.g., Q9NZQ7 Q01279, ... separated by white spaces or new lines) in INPUT and fetches their fastA-formatted sequences from https://www.uniprot.org/. Note that Uniprot could possibly block this kind of data scraping in the future. You may also submit a list of accession numbers and retrieve the respective sequences on the Uniprot site, though it takes longer.
  • Properties reads fastA-formatted protein sequences in INPUT and outputs a table of properties. It is best to copy the output (ctrl + A) (ctrl + C) and paste it (ctrl + V) to a spreadsheet.
    • R K H... Aa counts
    • sz Total # Aas
    • kd Molecular mass (KiloDaltons)
    • charge atpH approximate pI of the protein if the charge is 0 (see procedure below)
    • kd hw Kyte-Doolittle and Hopp-Woods weighted arithmetic means
Tutorial: (a) clear INPUT, (b) copy and paste these accession numbers Q9NZQ7 Q01279 to INPUT, (c) press fetchFromUniprot, (d) on the UNIPROT output area press the copy to INPUT button, (e) on the INPUT area, which now contains the fastA-formatted sequences, press Properties. You should see the results below, which you can manually copy to a spreadsheet.

Here is my procedure to calculate the pI (charge atpH):
  1. calculate the charge of the protein at pH 6.5 using Bjellqvist's pKa values: R(12) K(10) H(5.98) Nterm(7.5) D(4.05) E(4.45) Y(10) C(9) Cterm(3.55)
    This is done by adding the following two sums:
    for positive Aas and Nterm: ∑ 1/(1 + 10^(pH - pK_Aa))
    for negative Aas and Cterm: ∑ -1/(1 + 10^(pK_Aa - pH))
  2. if the charge is negative lower the pH by 1 unit; otherwise, increase the pH by 1 unit
  3. re-calculate the charge
  4. iterate steps 2 and 3 but if the charge changes from pos to neg or vice-versa add or subtract only half a pH unit.
  5. keep iterating and halving the pH units until the charge is within 0.01 of zero, or until the number of iterations reaches 30.
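The procedure above can be sketched in Python as follows. This is one reading of the steps, not the tool's actual code: it assumes the step is halved every time the charge changes sign, and the function names are made up. The pKa values are the ones listed in step 1.

```python
POSITIVE_PKA = {"R": 12.0, "K": 10.0, "H": 5.98}
NEGATIVE_PKA = {"D": 4.05, "E": 4.45, "Y": 10.0, "C": 9.0}
N_TERM_PKA, C_TERM_PKA = 7.5, 3.55

def charge_at(seq, ph):
    """Net charge: sum over positive groups minus sum over negative groups."""
    pos = [POSITIVE_PKA[aa] for aa in seq if aa in POSITIVE_PKA] + [N_TERM_PKA]
    neg = [NEGATIVE_PKA[aa] for aa in seq if aa in NEGATIVE_PKA] + [C_TERM_PKA]
    return (sum(1 / (1 + 10 ** (ph - pka)) for pka in pos)
            - sum(1 / (1 + 10 ** (pka - ph)) for pka in neg))

def estimate_pi(seq, start_ph=6.5, tolerance=0.01, max_iter=30):
    """Walk the pH toward zero charge, halving the step at each sign change."""
    seq = seq.upper()
    ph, step = start_ph, 1.0
    charge = charge_at(seq, ph)
    for _ in range(max_iter):
        if abs(charge) <= tolerance:
            break
        previous = charge
        ph += -step if charge < 0 else step  # negative charge: lower the pH
        charge = charge_at(seq, ph)
        if (charge < 0) != (previous < 0):   # sign changed: use half the step
            step /= 2
    return ph, charge

for peptide in ("DDDD", "KKKK"):
    ph, charge = estimate_pi(peptide)
    print(peptide, round(ph, 2), round(charge, 3))
```

The function returns both the final pH and the residual charge, so you can check whether the search stopped because the charge was close enough to zero or because it ran out of iterations.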
%IdFr1 %IdFr2 %IdFr3 %IdFr4 truncL trunc%
AbAlign help AbAlign aligns fastA-formatted antibody sequences entered in INPUT ignoring all but the variable regions.
  • It finds VH, VK or VL frameworks using a small database of template sequences, which can only be edited at the code level (let me know if you think the template needs to be expanded). As is, it should correctly align most human, mouse, and rabbit sequences to an IMGT-like FR numbering system.
  • ⚠ Sequence names are truncated to 10 characters.
  • There are five extra residues in FR1 (0, 11a-d) to accommodate rare rabbit sequences.
  • The first 4 boxes allow you to change the %ID threshold for finding a FR match.
  • If you know that a FR is NOT present in your sequence enter 100% in its box to avoid finding a similar but wrong FR sequence.
  • If the tool can't find a complete FR, it tries to find a truncated one using the last two parameters, minimum length of truncated FR and %ID, respectively.
  • The AbDev button in the output window opens another movable window with its own functions that look for sequence motifs that could be problematic for antibody developability.
Developability help Antibody Developability
  • These functions check for motifs that might cause problems in antibody developability.
  • They work only on the aligned antibody sequences in the Ab Alignment window.
  • You cannot download the output because it has colored text, but you can select it (CTRL+A), copy it (CTRL+C) and paste it into an MS Word document. If you paste with "keep source formatting", the text colors will be preserved in MS Word.
  • note the regular expressions for the following motifs are more complex than usual because we need to account for possible intervening "_" in the antibody sequence alignments
    • glyc - N-linked glycosylation sites N_*[^P]_*[ST]
    • NxDx - CDR motifs that may lead to deamidation/isomerization N_*[GSTNH]|G_*N_*[FY]|D_*[GSDTH]
    • specf - CDR3 scFv non-specificity motifs G+_*G+|R+_*R+|V_*G|V+_*V+|W+_*W+|Y+_*Y+|W_*\w_*W
    • aggr - motifs in CDRs that cause aggregation F_*H_*W
    • visc - motifs in CDRs that lead to viscosity H_*Y_*F|H_*W_*H
    • posit - marks R,K,D,E CDR residues if Nmbr(R,K) - Nmbr(D,E) > 1
    • cys - colors canonical Cys residues at positions 23 and 104 in red and others in blue
    • motif - enter your own regular expression in search motif
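To see how the "_*" parts let a motif span alignment gaps, here is a small Python check using three of the regular expressions listed above. The test string and the scan function are made up for illustration, and unlike the tool this sketch scans the whole string rather than only the CDRs.

```python
import re

# Regular expressions copied from the list above
MOTIFS = {
    "glyc": r"N_*[^P]_*[ST]",
    "NxDx": r"N_*[GSTNH]|G_*N_*[FY]|D_*[GSDTH]",
    "aggr": r"F_*H_*W",
}

def scan(aligned_seq, motif_name):
    """Return (0-based start, matched text) for each hit in an aligned sequence."""
    pattern = re.compile(MOTIFS[motif_name])
    return [(m.start(), m.group()) for m in pattern.finditer(aligned_seq)]

aligned = "QVN_G_SLKF_HW"  # hypothetical aligned fragment with "_" gaps
for name in MOTIFS:
    print(name, scan(aligned, name))
```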
minPepLen maxPepLen maxKyDo EDRK ILVMF duplc forbid nextStart
help PepScan helps find good peptide immunogens for INPUT fastA-formatted protein sequences.
It outputs protein-peptide sequence alignments and a peptide list into the ALIGNED and PEPTIDES text areas, respectively.
The initial selection parameters are fairly strict, so you may not get any peptides until you change the parameters.

You can FORMAT the ALIGNED output to wrap the lines while maintaining the alignments
To fastA-format the peptide list swap the outputs (swap OUT1⇄OUT2) and once the list is in the top output textarea you can toggle the format by pressing fastA⇄name::seq

Selection Parameters
  • minPepLen Exclude peps shorter than # Aas
  • maxPepLen Exclude peps longer than # Aas
  • maxKyDo Exclude peps with KyDo > value (KyDo = weighted arithmetic mean of Aa residue Kyte-Doolittle values, excluding N and C termini)
  • EDRK Point-subtraction for excessive charges (see Scoring below)
  • ILVMFAC Point-subtraction for hydrophobicity (see Scoring below)
  • duplc Point-subtraction for Aa duplications (see Scoring below)
  • forbid Exclude peps if they contain listed Aa residues
  • score Exclude peps with scores < value
  • bestN Show peps with the top N scores
  • nextStart The next potential pep starts n residues after previous one
Scoring
  • initial score = peptide length + number of unique residues
  • + 2 pts if -1 < kydo < -0.2
  • + 1 pt if kydo < -1 (too many +/- charges)
  • - s pts, where s = 2p(n-2); n = # repeated Aas, p = entered parameter.
  • - x pts if pep contains > 2 consecutive charged (EDRK) Aas
  • - x pts if pep contains > 2 consecutive hydrophobic (ILVMFAC) Aas
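As a partial illustration only, the sketch below enumerates candidate peptides and gives each one the initial score (peptide length + number of unique residues). It leaves out the KyDo bonus and all point subtractions, and it assumes that nextStart counts from the start of the previous peptide. The function names are hypothetical.

```python
def initial_score(pep):
    """Initial score = peptide length + number of unique residues."""
    return len(pep) + len(set(pep))

def candidate_peptides(protein, min_len, max_len, forbid="", next_start=1):
    """List (1-based start, peptide, initial score) for every window that passes
    the length limits and contains none of the forbidden residues."""
    peptides = []
    for start in range(0, len(protein), next_start):
        for length in range(min_len, max_len + 1):
            pep = protein[start:start + length]
            if len(pep) < length:   # ran off the end of the protein
                break
            if any(aa in forbid for aa in pep):
                continue
            peptides.append((start + 1, pep, initial_score(pep)))
    return peptides

for row in candidate_peptides("MKTAYC", 3, 4, forbid="C", next_start=2):
    print(row)
```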

INPUT

OUT1
fastA⇄name::seq works in place and toggles the output between the following two formats.
Note that the sequence must be in a single line (which you can achieve by using the function clean)
>name__1
ATGGCGGCGGGATTTCGGACTCT
>name2
MGATTRDHILPRRQW
   ⇅
>name__1::ATGGCGGCGGGATTTCGGACTCT
>name2::MGATTRDHILPRRQW
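The toggle between the two formats can be sketched as below. This is not the tool's code: the function names are made up, and it assumes every sequence is already on a single line, as noted above.

```python
def to_name_seq(fasta_text):
    """Join each '>name' line with its single-line sequence as '>name::seq'."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    return "\n".join(f"{lines[i]}::{lines[i + 1]}" for i in range(0, len(lines) - 1, 2))

def to_fasta(name_seq_text):
    """Split each '>name::seq' line back into a name line and a sequence line."""
    out = []
    for ln in name_seq_text.splitlines():
        if ln.strip():
            name, _, seq = ln.strip().partition("::")
            out += [name, seq]
    return "\n".join(out)

def toggle(text):
    """Switch formats, judging by whether the first non-blank line contains '::'."""
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    return to_fasta(text) if "::" in first else to_name_seq(text)

fasta = ">name__1\nATGGCGGCGGGATTTCGGACTCT\n>name2\nMGATTRDHILPRRQW"
print(toggle(fasta))
print(toggle(toggle(fasta)))
```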

OUT2
  ✥   You can drag and resize this window   

ANTIBODY DEVELOPABILITY OUTPUT AREA

FORMAT
help Formats the OUT1 content according to entered wrdSz (word size) and lineSz (line size), with options to show residue numbers. It works only on fastA-formatted sequences and other text blocks that start with a fastA-like ">name line", such as the outputs from translation or the alignments from show

Example wrdSz:3 lineSz:30 show#s on line = 1
>a
ATT GTT CGG ATT CGG ATG AGT TCT ACT TTA 30
ACT TCT TAT TCT CTT TCT TAT TCT TCT ATC 60
TAA CAA GCC TAA GCC TAC TCA AGA 84
>b
TAA CAA GCC TAA GCC TAC TCA AGA TGA AAT 30
TGA AGA ATA AGA GAA AGA ATA AGA AGA TAG 60
ATT GTT CG 68
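The example above can be reproduced with a short Python sketch. It assumes that "show#s on line" appends the number of the last residue at the end of each line, as in the example; the function name and parameter names are hypothetical.

```python
def format_seq(seq, wrd_sz=3, line_sz=30, show_numbers=True):
    """Break a single-line sequence into lines of line_sz residues,
    split each line into words of wrd_sz residues, and optionally
    append the number of the last residue on the line."""
    lines = []
    for start in range(0, len(seq), line_sz):
        chunk = seq[start:start + line_sz]
        words = [chunk[i:i + wrd_sz] for i in range(0, len(chunk), wrd_sz)]
        line = " ".join(words)
        if show_numbers:
            line += f" {start + len(chunk)}"
        lines.append(line)
    return lines

print("\n".join(format_seq("ATTGTTCGGATT", wrd_sz=3, line_sz=6)))
```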
