Mutalyzer 2 Help

About Mutalyzer

Mutalyzer is a tool primarily designed to check descriptions of sequence variants according to the standard human sequence variant nomenclature of the Human Genome Variation Society (HGVS) (for an overview, visit http://www.hgvs.org/mutnomen/). Mutalyzer aims to encourage the proper use of nomenclature in publications and to reduce redundancy in sequence variation databases. In principle, Mutalyzer can check descriptions of sequence variants detected in other organisms, provided that the standard HGVS nomenclature is applied.

Mutalyzer 2 flow

The user specifies a reference sequence (file) and a variant using the Name Generator or the Name Checker interface. The Name Generator builds the complete variant description for the Name Checker. Mutalyzer uses this input to perform the nomenclature check in the following steps:

1) Retriever: retrieves reference sequence records from the NCBI or LRG websites.

2) Reference sequence parser: extracts sequence and annotation from reference sequence records.

3) Syntax checker: context-free parser using the complete sequence variant description to check whether the syntax is correct according to standard HGVS sequence variant nomenclature.

4) Name checker: the core nomenclature checker using the complete sequence variant description to check whether it is correct according to standard HGVS sequence variant nomenclature.

 

Additional Mutalyzer 2 functionality:

- Position Converter: converts hg18 and hg19 chromosomal positions to transcript positions in HGVS n. or c. notation and vice versa. The n. or c. notation should be checked with the Name Checker.

- Reference File Loader: allows you to upload and use your own reference sequence.

- SNP converter: allows you to convert a dbSNP rsId to HGVS notation.

- Batch Checkers: interfaces for the different checkers that accept a large list of descriptions as input.

- Webservices: programmatic access to Mutalyzer's functionality.

 

Introduction

Reference sequences

We strongly recommend the use of genomic reference sequences containing proper annotation for optimal use of Mutalyzer's capabilities to generate descriptions for all transcripts and protein isoforms of the gene(s) affected by the sequence variation.

Mutalyzer accepts the following reference sequences:

1) GenBank files

GenBank records (e.g., NG_007400.1) are specified by a GenBank accession number (NG_007400) and a version number (.1). Omission of the version number automatically results in selection of the most recent version of that record. In case of outdated versions, Mutalyzer will issue a warning. Alternatively, the unique GenInfo identifier (gi) of the reference sequence (e.g., 4506864) can be used with or without the letters "gi". Mutalyzer does not accept GenBank records containing no sequence (e.g., chromosomal reference sequence identifiers referring to contig accession numbers) or files larger than 10 MB. Mutalyzer also accepts user-defined files in GenBank format, including slices of chromosomal reference sequences. These files are specified by unique UD identifiers, which are returned by Mutalyzer after upload (see the Reference File Loader section for more information).
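The accession and version convention described above can be illustrated with a short Python sketch. The function name split_reference and the regular expression are assumptions made for this illustration only; they are not part of Mutalyzer, and GenInfo (gi) numbers are deliberately not handled.

```python
import re

# Hypothetical helper for illustration; not Mutalyzer's own parser.
# Matches letters, an optional underscore, digits, and an optional ".version".
_REFERENCE = re.compile(
    r"^(?P<accession>[A-Za-z]+_?\d+)(?:\.(?P<version>\d+))?$"
)


def split_reference(identifier):
    """Return (accession, version); version is None when it is omitted."""
    match = _REFERENCE.match(identifier.strip())
    if match is None:
        raise ValueError("not an accession[.version] identifier: %r" % identifier)
    version = match.group("version")
    return match.group("accession"), int(version) if version else None


print(split_reference("NG_007400.1"))   # accession with version
print(split_reference("NG_007400"))     # version omitted
```

A missing version (None) corresponds to the case in which Mutalyzer selects the most recent version of the record.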

2) LRG files

Locus Reference Genomic (LRG) files contain unique and stable reference DNA sequences along with all relevant transcript and protein sequences essential to the description of gene variants (see the LRG website for more information). LRG files are based on NCBI's RefSeqGene project and created in collaboration with the community of research and diagnostic labs, LSDB curators and mutation consortia. LRG files are specified by the prefix "LRG_" followed by a number (e.g., LRG_1). The LRG website lists existing LRG sequences and has an FTP site for downloading LRGs. To maintain LRG stability, Mutalyzer's Reference File Loader does not accept user-defined LRG files.

Variant descriptions

The Mutalyzer nomenclature checker accepts variant descriptions in standard human sequence variant nomenclature format. For users who are not familiar with the nomenclature syntax, Mutalyzer's Name Generator provides a form to acquire the separate components necessary to construct variant descriptions.

These components, which are discussed in more detail below, are:

1) Position numbering scheme (Sequence Type)

2) Gene symbol, transcript variant and protein isoform

3) Variant start and end positions

4) Mutation type

5) Deleted and Inserted sequence

Position numbering scheme (Sequence Type)

The standard human sequence variant nomenclature uses different position numbering schemes to describe variants relative to the reference sequence. Mutalyzer checks if the specified reference sequence is compatible with the selected position numbering scheme for the sequence variation. Variant descriptions involving upstream or downstream regulatory sequences and intron sequences can only be checked using genomic sequence records. Therefore, genomic records with correct annotation of all genes, transcripts and protein isoforms support most position numbering schemes. Mutalyzer automatically converts the given variant description to other position numbering schemes supported by the reference sequence and its annotation. Mutalyzer will not return results when the selected reference sequence does not contain sufficient sequence or annotation to support the nomenclature check of the variant.

There are seven position numbering schemes (Sequence Types):

Genomic

The Genomic position numbering scheme is applied to raw genomic records. The value 1 is assigned to the first base in the record and all bases are counted from there. In the output, genomic numbering is indicated by the g. prefix preceding the position number(s). LRG records and all GenBank records with 'DNA' in the first line will be accepted.

Please note that well-annotated genomic sequence records containing annotated transcripts and corresponding coding sequences can be used in combination with non-coding DNA, coding DNA and protein position numbering schemes.

Non-coding DNA (ncDNA)

The Non-coding DNA or ncDNA position numbering scheme can be used with:
a) GenBank records containing genomic sequences with annotated transcripts without a corresponding coding sequence.
b) LRG records
c) GenBank records containing transcript sequences without annotated coding sequences, provided that no intronic bases are involved in the variation. Mutalyzer needs a correctly annotated genomic reference sequence to check HGVS Non-coding DNA numbering of intron positions.

The value 1 is assigned to the first base of the transcript in the record and all the exonic bases are counted from there. Intronic bases are numbered x+1, x+2, x+3, ... y-3, y-2, y-1 where x is the value of the last exonic base upstream of the intron, y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers. Intronic position numbers are always counted from the closest exonic base. In case of a tie, the upstream base is used. In the output, ncDNA numbering is indicated by the n. prefix preceding the position number(s).
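The intron numbering rule can be made concrete with a small Python sketch. The function intron_labels is a hypothetical helper written for this help text, assuming an intron that lies between exonic positions x and x+1; it is not taken from Mutalyzer's code.

```python
def intron_labels(x, intron_length):
    """Label every base of an intron located between exonic bases x and x + 1.

    Each base is counted from the closest exonic base; when both exonic
    bases are equally close, the upstream base (x) is used.
    """
    y = x + 1
    labels = []
    for i in range(1, intron_length + 1):
        to_upstream = i                        # distance to exonic base x
        to_downstream = intron_length - i + 1  # distance to exonic base y
        if to_upstream <= to_downstream:       # a tie goes to the upstream base
            labels.append("%d+%d" % (x, to_upstream))
        else:
            labels.append("%d-%d" % (y, to_downstream))
    return labels


print(intron_labels(10, 5))
```

For an intron of five bases after exonic position 10, the middle base is equally close to both exons and is therefore labelled 10+3 rather than 11-3.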

Coding DNA (cDNA)

The Coding DNA or cDNA position numbering scheme can be used with:
a) GenBank records containing genomic sequences with annotated transcripts and corresponding coding sequences.
b) LRG records
c) GenBank records containing transcript sequences with annotated coding sequences, provided that no intronic bases are involved in the variation. Mutalyzer needs a correctly annotated genomic reference sequence to check HGVS Coding DNA numbering of intron positions.

The value 1 is assigned to the A of the ATG start codon and all the exonic bases between start and stop are counted normally.
5' untranslated region: Exonic bases upstream of (i.e. before) the ATG are numbered -1, -2, -3 and so on.
3' untranslated region: Exonic bases downstream of (i.e. after) the stop codon are numbered *1, *2, *3 and so on.
Intronic bases in the Coding sequence are numbered x+1, x+2, x+3, ... y-3, y-2, y-1 where x is the value of the last exonic base upstream of the intron, y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers. Intronic position numbers are always counted from the closest exonic base. In case of a tie, the upstream base is used.
In case of a 5' untranslated region split over two or more exons: Intronic bases are numbered -x+1, -x+2, -x+3, ... -y-3, -y-2, -y-1 where -x is the value of the last exonic base upstream of the intron, -y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers.
In case of a 3' untranslated region split over two or more exons: Intronic bases are numbered *x+1, *x+2, *x+3, ... *y-3, *y-2, *y-1 where *x is the value of the last exonic base upstream of the intron, *y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers.
In the output, cDNA numbering is indicated by the c. prefix preceding the position number(s).
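The numbering of exonic bases in the Coding DNA scheme can be sketched as follows. The function coding_position and its parameters are assumptions for illustration: cds_start is taken to be the 1-based transcript position of the A of the ATG, and cds_stop the 1-based transcript position of the last base of the stop codon. Intronic positions are not covered.

```python
def coding_position(transcript_index, cds_start, cds_stop):
    """Convert a 1-based exonic transcript position to c. notation.

    cds_start: transcript position of the A of the ATG start codon.
    cds_stop:  transcript position of the last base of the stop codon.
    """
    if transcript_index < 1:
        raise ValueError("transcript positions start at 1")
    if transcript_index < cds_start:
        # 5' untranslated region: -1 is the base directly before the ATG.
        return "c.-%d" % (cds_start - transcript_index)
    if transcript_index > cds_stop:
        # 3' untranslated region: *1 is the base directly after the stop codon.
        return "c.*%d" % (transcript_index - cds_stop)
    # Coding sequence: the A of the ATG is position 1, there is no position 0.
    return "c.%d" % (transcript_index - cds_start + 1)


for index in (10, 11, 40, 41):
    print(index, coding_position(index, cds_start=11, cds_stop=40))
```

Note that the sketch never produces c.0: the numbering goes directly from c.-1 to c.1.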

RNA

The RNA position numbering scheme has not yet been implemented in Mutalyzer 2. The value 1 is assigned to the first base in the record and from there all bases are counted normally. In the output, RNA numbering is indicated by the r. notation preceding the position number(s).

Mitochondrial DNA (mtDNA)

The Mitochondrial DNA (mtDNA) position numbering scheme uses raw genomic records. The value 1 is assigned to the first base in the record and from there all bases are counted normally.

Protein

The Protein position numbering scheme is used to generate variant descriptions at protein level from genomic or Coding DNA descriptions by translation of the Coding sequence. The current version of Mutalyzer 2 does not yet support checks of protein variants using a GenBank protein record. The value 1 is assigned to the first amino acid of the translated Coding sequence and from there all amino acids are counted normally. In the output, protein variants have the prefix p. followed by the amino acid changes between parentheses to indicate that they are predicted by translation of the modified Coding sequence.

EST

The EST position numbering scheme can be used with GenBank EST records. The value 1 is assigned to the first base in the record and from there all bases are counted normally. Sequence variation descriptions based on EST sequences lack the c. prefix to indicate that only part of the coding sequence may be present. All records with 'EST' in the first line will be accepted. These records do not allow checks of intronic sequence variations.

Gene Symbol and Variant

In genomic records containing annotation of multiple genes, alternative transcript variants and protein isoforms, only genomic positions are unambiguous. Descriptions at Non-coding DNA, Coding DNA, or protein level may be ambiguous. Mutalyzer parses the annotation of the reference sequence record and displays the detected genes, transcript variants or protein isoforms in the legend at the bottom of the output page. Mutalyzer uses the annotation to detect potential ambiguity in a variant description. Further specification of genes, transcript variants or protein isoforms may be required to solve it. Only Gene Symbols matching the reference sequence annotation are allowed. Usually, gene symbols have to be combined with the desired transcript variant or protein isoform.

Transcript variants and protein isoforms can be specified in two formats:

- A positive integer referring to the order of the transcripts in the annotation, e.g. 1, 2, 3, ...
- The exact identifier following the underscore after the Gene symbol in the legend, e.g. v002 for a transcript variant or i002 for a protein isoform

Start and End Position

The Start position is the positional value of the most upstream base or amino acid in the reference sequence affected by the mutation. The End position is the positional value of the most downstream base or amino acid in the reference sequence affected by the mutation. Mutalyzer only accepts positions contained within the reference sequence. The values should be positive integers (whole numbers) for all position numbering schemes, except Non-coding DNA and Coding DNA. For Non-coding and Coding DNA, these positions may also contain + and - signs to indicate intron positions. For Coding DNA, positions can also have prefixes - and * to indicate exonic positions in 5' or 3' untranslated regions. Furthermore, in descriptions of deletions, exonic positions can be followed by +? or -? to indicate unknown intronic positions.
The Mutalyzer nomenclature checker has a strict implementation of Start and End positions in Non-coding DNA and Coding DNA position numbering schemes. To prevent discrepancies between Non-coding DNA and Coding DNA descriptions based on genomic RefSeqGene (NG_) records and the corresponding RefSeq transcript (NR_ or NM_) records, exon positions may not exceed those of the transcript annotated in the genomic reference sequence record. Therefore, Mutalyzer cannot use - or * prefixes to indicate positions in upstream or downstream intergenic regions.

For upstream intergenic positions, Mutalyzer combines the position of the first nucleotide of the transcript with the suffix -u followed by the position of the upstream nucleotide. Intergenic bases upstream of Non-coding DNA are numbered n.1-uy, ..., n.1-u3, n.1-u2, n.1-u1 where y is the value of the most upstream base and n.1-u1 is the value of the first intergenic base upstream of the first exon. Intergenic bases upstream of Coding DNA are numbered c.x-uy, ..., c.x-u3, c.x-u2, c.x-u1 where x is the value of the first nucleotide of the first exon and y is the value of the most upstream base. The advantage of this notation is that the -u position corresponds to the - position used to describe transcription factor binding sites.

For downstream intergenic positions, Mutalyzer combines the position of the last nucleotide of the transcript with the suffix +d followed by the position of the downstream nucleotide. Intergenic bases downstream of Non-coding DNA are numbered n.x+d1, n.x+d2, n.x+d3 ... where x is the value of the last nucleotide of the last exon. Intergenic bases downstream of Coding DNA are numbered c.x+d1, c.x+d2, c.x+d3, ... where x is the value of the last nucleotide of the last exon.
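The position formats described in this section (plain integers, - and * prefixes, intronic offsets, unknown offsets and the -u and +d suffixes) can be recognised with a single pattern. The following Python sketch is an illustration only; the function parse_position and its return value are assumptions and do not reflect how Mutalyzer's own parser is implemented.

```python
import re

# Hypothetical pattern for one position, without the n. or c. prefix:
# optional "-" or "*", an integer, then an optional offset such as
# "+3", "-2", "+?", "-u5" or "+d2".
_POSITION = re.compile(
    r"^(?P<prefix>[-*]?)(?P<main>\d+)"
    r"(?:(?P<sign>[+-])(?P<flank>[ud]?)(?P<offset>\d+|\?))?$"
)


def parse_position(text):
    """Return (kind, prefix, main, offset) for a single position string."""
    match = _POSITION.match(text)
    if match is None:
        raise ValueError("unrecognised position: %r" % text)
    sign, flank = match.group("sign"), match.group("flank")
    # "u" is only meaningful after "-", "d" only after "+".
    if (flank == "u" and sign != "-") or (flank == "d" and sign != "+"):
        raise ValueError("use -u for upstream and +d for downstream: %r" % text)
    if flank:
        kind = "intergenic"
    elif sign:
        kind = "intronic"
    else:
        kind = "exonic"
    offset = sign + flank + match.group("offset") if sign else None
    return kind, match.group("prefix"), int(match.group("main")), offset


for example in ("112", "-12+3", "*5-2", "100+?", "1-u3", "250+d2"):
    print(example, parse_position(example))
```

The sketch checks the shape of a position only. Whether the position actually lies within the reference sequence or the annotated transcript is a separate check that requires the reference record.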

Mutation Type

The syntax of the standard human sequence variant nomenclature depends on the type of mutation. Six mutation types are supported:

Substitution

A substitution is the replacement of a single nucleotide or amino acid by another. A substitution involving multiple residues is classified as an indel. The start and end position should be identical. The original residue and the new residue have to be specified and must be non-identical. In the Name Generator, the Deleted Sequence and Inserted Sequence fields must be filled in.

Deletion

A deletion is the removal of one or more nucleotides or amino acids without replacement. In the Name Generator, the Inserted Sequence field must remain empty. The Deleted Sequence field can be filled in to check the start and end positions and to match the deleted residues with the reference sequence (Optional). Please note that the start and end positions should be equal when only one nucleotide or amino acid is deleted.

Insertion

An insertion is the addition of one or more nucleotides or amino acids without removing any previously existing ones. The starting and end positions should differ by exactly one. In the Name Generator, Inserted Sequence must be filled in with the actual new sequence. If the inserted sequence is already present in the reference sequence at the location of the insertion, it should be represented as a duplication.

Duplication

Duplication is the addition of one or more nucleotides or amino acids identical to the sequence from the specified start position to the specified end position, at the end position. In the Name Generator, Deleted Sequence must remain empty. Inserted Sequence can be filled in to check the start and end positions and to match the duplicated residues with the reference sequence (Optional).

Insertion/Deletion (indel)

An indel is the removal of one or more bases or amino acids, combined with the addition of one or more bases or amino acids. In case a single residue is deleted and another residue is inserted, the mutation should be described as a substitution, not an indel. If the inserted sequence is the reverse complement of the original sequence, it should be described as an inversion. Start and end position define the boundaries of the deletion in the original sequence. In the Name Generator, the deleted sequence should be entered in the Deleted Sequence field and the inserted sequence in the Inserted Sequence field.

Inversion (nucleotide sequences only)

An inversion is a sequence of two or more bases inserted as its reverse complement. Start and end position must be non-identical. In the Name Generator, the Deleted Sequence and Inserted Sequence fields must remain empty. 
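The effect of the six mutation types on a nucleotide sequence can be summarised in a short Python sketch. The function apply_variant and its arguments are hypothetical and written only to illustrate the definitions above, using 1-based inclusive start and end positions on a plain DNA string. It does not reproduce Mutalyzer's checks, such as the rule that an insertion identical to the flanking sequence should be described as a duplication.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def apply_variant(reference, kind, start, end, inserted=""):
    """Apply one variant; start and end are 1-based and inclusive."""
    if not 1 <= start <= end <= len(reference):
        raise ValueError("positions must lie within the reference sequence")
    before = reference[:start - 1]
    affected = reference[start - 1:end]
    after = reference[end:]

    if kind == "substitution":
        # Exactly one residue is replaced by a different one.
        if start != end or len(inserted) != 1 or inserted == affected:
            raise ValueError("a substitution replaces one residue by another")
        return before + inserted + after
    if kind == "deletion":
        return before + after
    if kind == "insertion":
        # The new sequence goes between two neighbouring positions.
        if end != start + 1 or not inserted:
            raise ValueError("insertion positions must differ by exactly one")
        return reference[:start] + inserted + reference[start:]
    if kind == "duplication":
        return before + affected + affected + after
    if kind == "indel":
        if not inserted:
            raise ValueError("an indel needs an inserted sequence")
        return before + inserted + after
    if kind == "inversion":
        if start == end:
            raise ValueError("an inversion spans two or more bases")
        return before + reverse_complement(affected) + after
    raise ValueError("unknown mutation type: %r" % kind)


print(apply_variant("ATGCCGTA", "substitution", 4, 4, "T"))
print(apply_variant("ATGCCGTA", "inversion", 1, 3))
```

Keeping the position checks next to each mutation type mirrors the constraints listed above: identical positions for a substitution, positions differing by one for an insertion, and non-identical positions for an inversion.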

Deleted and Inserted Sequence

The syntax of the standard human sequence variant nomenclature requires specification of the inserted residue(s) for several mutation types. Specification of the original residue(s) is optional for most types, except for substitutions. In the Name Generator, the presence or absence of these fields depends on the selected Mutation Type. These fields should be used:
- to enter the original amino acid or nucleotide residue(s) present in the reference sequence (Deleted Sequence).
- to enter the amino acid(s) or nucleotide residue(s) introduced by the change (Inserted Sequence).


Mutalyzer Name Checker Help

Users can check the correctness of a variant description. The Name Checker will try to regenerate the variant sequence and name it according to the HGVS standard human sequence variant nomenclature.

Examples:
AB026906.1:c.3_4insG
AB026906.1:c.[1del;4G>T]
AL449423.14(CDKN2A_v1):c.1_10del
UD_127955523176(DMD_v002):c.136G>T
LRG_1t1:c.266G>T


Mutalyzer Syntax Checker Help

Users can check the correctness of the standard nomenclature syntax. The Syntax Checker uses a context-free parser to detect deviations from the standard nomenclature syntax in the input. The position of the deviation is indicated in the error message and by a caret (^) below the description.

Examples:
AB026906:c.3_4inG
AB026906.1:c.35_36ins
LRG_1t1:c.266G>T


Mutalyzer Position Converter Help

The Position Converter will convert the positions of the variation description from the chromosomal position for a specific human genome build to a position relative to RefSeq transcript reference sequences. The Position Converter uses a local database containing the mapping information from the UCSC genome browser for human genome builds hg18 (NCBI 36) and hg19 (GRCh37). The specified version of the RefSeq transcript Accession number has to be present in the database. The sequence variation description has not been checked by Mutalyzer's Name Checker.

Examples:
NM_003002.2:c.274G>T
chr11:g.111959693G>T
NC_000011.9:g.111959693G>T


Mutalyzer SNP Converter Help

The SNP Converter will submit a dbSNP rsID to dbSNP to retrieve the sequence variation description according to the HGVS sequence variation description listed in dbSNP.
The sequence variation description has not been checked by Mutalyzer's Name Checker.

Example:
SNP Accession number: rs9919552


Mutalyzer Name Generator Help

The Name Generator aims to assist users who are not familiar with all the details of the HGVS standard human sequence variant nomenclature in constructing variant descriptions. The Name Generator presents a form to collect the separate components of a variant description described above. The variant description generated is subsequently used by the Name Checker to construct the variant sequence and name it according to the HGVS standard human sequence variant nomenclature.

Example:
Reference: AL449423.14
Sequence Type: Coding DNA
Gene symbol: CDKN2A
Transcript: v_1

Variant 1
Mutation Type: Substitution
Start Position: 112
End Position: 112
Deleted Sequence: C
Inserted Sequence: T


Mutalyzer Batch Checker Help


The Batch checkers support submission of files containing large datasets to the Name Checker, Syntax Checker, and Position Converter tools.

The Batch Checkers accept two types of input files:

New Style

This file format has no header row and no columns. Instead, each row contains a single variant for the Batch check.

AB026906.1:c.274G>T
AL449423.14(CDKN2A_v002):c.5_400del

Old Style

This file format has a header row, which consists of three tab-delimited fields. In each following row, the corresponding data is also tab-delimited. The gene symbol field may be left empty when it is not necessary to select a particular gene or transcript.

AccNo	Genesymbol	Mutation
AB026906.1	SDHD	g.7872G>T

Output Format

The output of a Mutalyzer Batch run is a Tab Delimited Text file, which has a header-row to clarify the results.

Users can upload a tab-delimited text file with the sequence variations to be checked. Files for the Name Checker and the Syntax Checker may contain any combination of reference sequences and sequence types for different genes. Mutalyzer's UD identifiers can also be used, but we strongly suggest to update any GenBank record following these instructions.
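For users who prepare batch input with a script, the relation between a tab-delimited file with accession number, gene symbol and mutation fields and a file with one description per row can be sketched in Python. The function old_style_to_new_style is a hypothetical helper, and the way the three fields are joined is an assumption based on the description format shown in the Mutalyzer Output section; it is not an official conversion tool.

```python
import csv
import io


def old_style_to_new_style(text):
    """Turn tab-delimited rows (AccNo, Genesymbol, Mutation) into descriptions.

    The first row is treated as the header row and skipped.
    """
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    next(rows, None)  # skip the header row
    descriptions = []
    for row in rows:
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # ignore empty lines
        if len(fields) != 3:
            raise ValueError("expected three tab-delimited fields: %r" % row)
        accession, gene, mutation = fields
        # The gene symbol is optional and goes between parentheses.
        selector = "(%s)" % gene if gene else ""
        descriptions.append("%s%s:%s" % (accession, selector, mutation))
    return descriptions


example = (
    "AccNo\tGenesymbol\tMutation\n"
    "AB026906.1\tSDHD\tg.7872G>T\n"
    "AB026906.1\t\tc.274G>T\n"
)
for description in old_style_to_new_style(example):
    print(description)
```

Each returned string can be written on its own row to obtain a file without a header row.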

A message containing a link to the results will be sent to the specified e-mail address when the analysis is finished, but Mutalyzer's progress can also be followed in the browser window. Performance depends on the server load and the number of reference sequence records to be downloaded. The program will process approximately 100 variations per minute when using a single reference sequence record.

The Batch checkers use JavaScript to update the progress report. In Internet Explorer, progress may not be reported correctly. Adding Mutalyzer to your trusted sites is one option to solve this.


Mutalyzer Output

Mutalyzer has been designed to issue warnings when correcting entries, encountering inconsistencies, incomplete sequences or annotation, or identifying variations with potential effects on splicing, before presenting the results of the analysis. Errors will be generated when the entries cannot be processed properly (see below for more information).
The sequence variation description will always be in the format:

<Accession Number>.<version number>:<sequence type>.<mutation>
(Examples: NM_003002.1:c.5delC or AL449423.14:g.61866_85191del)
or
<Accession Number>.<version number><(Gene Symbol)>:<sequence type>.<mutation>
In the latter case, the gene symbol may be followed by transcript variant or protein isoform numbers (e.g., _v001 or _i001, respectively).
Example: the fictitious sequence variation AL449423.14:g.61866_85191del corresponds with the following changes in transcript variants and protein isoforms:
AL449423.14(CDKN2A_v001):c.-271_234del
AL449423.14(CDKN2A_v002):c.5_400del
AL449423.14(CDKN2A_v003):c.1_*3352del
and
CAH70600.1(CDKN2A_i001):p.Met1?
CAH70601.1(CDKN2A_i002):p.Gly2AspfsX41
CAH70599.1(CDKN2A_i003):p.Met1?

From the example “CAD55702.1:p.Pro2Arg (missense mutation)”, you can conclude that the protein in version 1 of the record CAD55702 has a mutation denoted as Pro2Arg (which signifies an arginine substituted for a proline at position 2).
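The output format above can be taken apart with a small Python sketch. The function split_description and its regular expression are assumptions for illustration; they only split a description into its parts and are much less strict than Mutalyzer's Syntax Checker. Only the g., n., c., r. and p. prefixes mentioned in this help text are recognised, so EST descriptions without a prefix are not handled.

```python
import re

# Hypothetical pattern: reference, optional "(selector)", ":", one-letter
# sequence type, ".", and the remaining variant part.
_DESCRIPTION = re.compile(
    r"^(?P<reference>[^:()]+)"
    r"(?:\((?P<selector>[^()]+)\))?"
    r":(?P<type>[cgnrp])\.(?P<variant>.+)$"
)


def split_description(description):
    """Return (reference, selector, sequence_type, variant).

    selector is None when no gene symbol is given between parentheses.
    """
    match = _DESCRIPTION.match(description.strip())
    if match is None:
        raise ValueError("cannot split description: %r" % description)
    return (
        match.group("reference"),
        match.group("selector"),
        match.group("type"),
        match.group("variant"),
    )


print(split_description("AL449423.14(CDKN2A_v001):c.-271_234del"))
print(split_description("NM_003002.1:c.5delC"))
```

The selector keeps the gene symbol together with its transcript variant or protein isoform suffix, for example CDKN2A_v001.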

Please note the following:

- Sequence variation descriptions using genomic references in combination with Sequence Type "Coding DNA" will result in the use of nucleotides in reverse complement for genes transcribed in the opposite orientation.

- GenBank Identifiers are always converted to GenBank Accession Numbers, which are automatically retrieved from the annotation based on the selected Sequence Type. Example: 4506864:c.5del will be converted into NM_003002.1:c.5delC


Reference File Loader Help

Users can upload their own reference sequence file in GenBank Flat file format, retrieve the genomic sequence of a gene with its flanking regions, or specify a chromosomal range for use as a reference sequence. Mutalyzer checks whether the file is in valid GenBank Flat file format. If so, Mutalyzer stores the file locally and returns a unique number, the UD identifier, that can be used with all different forms of the Mutalyzer Sequence Variation Nomenclature Checker. This option allows users to use reference files that are not present in GenBank, or to add information about alternative transcripts or proteins or additional genes contained within or derived from the reference sequence to an existing GenBank file. Users are encouraged to limit their use of this option by submitting annotation updates and corrections of existing GenBank files following these instructions.

Loader options:

The reference sequence file is a local file

Browse to locate your GenBank Flat file with a .gb extension and press the submit button.

 

The reference sequence file can be found at the following URL

Enter the URL where the GenBank Flat file with a .gb extension can be found and press the submit button.

 

Retrieve part of the reference genome for a (HGNC) gene symbol

This option retrieves part of the chromosomal reference sequence, which is annotated for this gene in the last genome build of the organism.

The organism name should not contain any spaces (e.g., use homo_sapiens, human or man)

Input:

Please enter the Gene symbol and organism name without spaces and specify the length of the flanking sequences
Gene symbol
Organism name
Number of 5' flanking nucleotides
Number of 3' flanking nucleotides

Retrieve a range of a chromosome

Use of NC_accession numbers without version number will result in retrieval of the latest version.

Input:

Please enter the accession number of the chromosome or contig and specify the range
Chromosome Accession Number
Start Position
Stop Position
Orientation

Mutalyzer output for all options:

Output:

Your reference sequence was loaded successfully. You now can use mutalyzer with the following accession number as reference: UD_127955523176
Download this reference sequence.

The Reference File Loader uses JavaScript to change the form depending on the selected option. In Internet Explorer, forms may not be displayed correctly. Adding Mutalyzer to your trusted sites is one option to solve this.


Mutalyzer Webservices

Mutalyzer's webservices provide programmatic access to different parts of Mutalyzer's functionality. In the future, these will be used by LOVD to convert coding DNA positions to chromosomal positions for mapping and display purposes. A full description of available webservices can be found at the Webservice documentation page. Example scripts and requirements can be found at the Webservice page.


Using Mutalyzer with sequences from other organisms

Mutalyzer can process GenBank reference files from organisms other than human and will apply the appropriate coding table to translate an open reading frame into a protein sequence. Please note that all variants will be described according to the HGVS standard human sequence variation nomenclature. When trying to retrieve genomic reference sequences using gene symbols with the Reference File Loader or when specifying a particular gene in a genomic reference sequence, the gene symbol should be similar to that used in the (genome) sequence annotation.

Errors and feature requests

Any error message gives an indication of the problem encountered and replicates the input of the user. Most errors occurring after mistyping should be easy to understand and can be corrected immediately by altering the data in the field specified. In other cases, Mutalyzer should advise you to contact us when the error persists. Please specify your input and which tool you used.

Occasionally, Mutalyzer will display an Internal Server Error message due to unexpected behavior. You can use Mutalyzer's bugtracking system to report errors and send in feature requests.

Citing Mutalyzer

When you use Mutalyzer, please cite this paper: Wildeman M, van Ophuizen E, den Dunnen JT, Taschner PE. Improving sequence variant descriptions in mutation databases and literature using the MUTALYZER sequence variation nomenclature checker. Hum Mutat 29:6-13 (2008) [PMID: 18000842].

Mutalyzer 2 has been completely redesigned by Jeroen F.J. Laros, with help from Gerben R. Stouten and Gerard C. P. Schaafsma, according to specifications provided by Peter E. M. Taschner and Johan T. den Dunnen. The different parts of the nomenclature checker functionality have been separated into modules, which can be used as independent webservices and undergo further development and extension in the future.


If you have any comments or suggestions be sure to let us know!

Last modified: November 5, 2010

mutalyzer@humgen.nl