Help and documentation for SelTarbase

MNRs: what are they?

MNRs (mononucleotide repeats), also known as mononucleotide microsatellites, are small repetitive DNA elements consisting of a single nucleotide (A, C, G, or T) as the repeat unit. They are prone to deletion and insertion mutations when DNA mismatch repair is deficient (MMR deficiency). The manifestation of these mutations is called microsatellite instability (MSI). High-grade MSI (MSI-H) is found in up to 15% of human sporadic tumors and in nearly all hereditary tumors of HNPCC/Lynch syndrome. In most sporadic tumors, the underlying MMR deficiency is primarily caused by epigenetic inactivation of one allele of the MMR gene MLH1 by promoter methylation, followed by some other inactivation of the second allele. Mutation carriers in HNPCC/Lynch syndrome families already have one defective allele of one of the MMR genes (MLH1, MSH2, MSH6, PMS2, among others) from birth (predisposition syndrome). The acquired loss of the second allele leads to MMR deficiency and tumor development in a variety of organs (mainly colon, endometrium, and stomach). A defined number of MNR mutations is believed to be the driving force of MSI tumorigenesis. MSI-H tumors show a number of characteristic clinico-pathological properties; chemoresistance to 5-FU and a more favourable prognosis are intensively discussed topics. The underlying mechanisms could be due to frameshift mutations arising from cMNR/iMNRx alterations, which lead to functional protein loss or immunologic effects. Therefore, a systematic classification of MNRs by their coding status seems very useful.

MNRs can be divided into several functional groups according to their coding status:

Currently, there are two categories of MNRs with a trailing x, which indicates that the MNR is involved in splicing or altered by splicing. A cMNRx, uMNR(5|3)x, or pMNRx differs in length between the genomic and the transcript level as the result of a splice event; the transcribed MNR can be shorter or longer than at the genomic level. An iMNRx is a polypyrimidine iMNR (A/T) near a splice acceptor site. If such an iMNR varies in length, it can alter splicing and cause exon skipping. In two thirds of exon-skipping events the reading frame of the resulting skip-cDNA is shifted, as with cMNR mutations, resulting in a truncated peptide/protein and a neopeptide.

MNRs that end in a ? (e.g. cMNR?) cannot be definitively assigned to that group using the current Ensembl rel. 66_37; either other sources provide evidence for that status, or the classification is in doubt, possibly only because information is missing from the current Ensembl release.

Using SelTarbase

The versions of SelTarbase

There are four versions of SelTarbase: 2003, last, latest, and FIUO (For Internal Use Only).

The site structure of SelTarbase is described below. The site is divided into three frames:

Navigation frame

The navigation frame lets you select the database version (data set: 2003, last, latest, FIUO) and provides links to the general sub-pages (Services) within SelTarbase (Status History, Help and documentation, How to cite SelTarbase, Submit new data, Register with SelTarbase, and Contact/Impressum).
The remaining links are version-specific (Start here, the statistical analyses, the tract list, and the reference list).

Header frame

The header frame provides the login and the search function, with which you can search SelTarbase for genes, accession numbers, tracts, or authors and titles. Any filters that are set are noted here. Currently, filters are available for references, MNR types, and/or MNR length.

Main page

Starting with some general information about SelTarbase ...

SelTarbase main page (1)

the main page always will provide a summary status.

SelTarbase main page (2)

This is an overview of the information content of SelTarbase, in which the selected version is compared with the previous one. Most users will start with version 'latest', which is compared with the smaller version 'last', so the differences between these two versions (what has changed since) are easy to recognize. You will find data on references, genes, tracts (MNRs), and observations. Each item is split into 'analyzed' and 'included' and stratified by entity (colon, stomach, endometrium, and colon culture). Tracts (MNRs) are additionally subdivided into coding and non-coding ones.

SelTarbase News

SelTarbase News provides a short overview of which references contributed new data to the mutation database compared to the previous version, and to which genes/MNRs these data relate.

SelTarbase News

Regression analysis

For each entity (colon, stomach, endometrium, and colon culture) there is a separate regression analysis. It is a combined calculation of ncMNRs (open diamonds) and cMNRs (gray filled circles). ncMNRs or cMNRs marked in red have changed since the last release or represent new data. The fitted regression line is drawn as a solid black line, the upper and lower prediction lines as bold dashed gray lines. For details of the statistical methods, see the section "The model, assumptions and statistics".

SelTarbase regression analysis (1)

Each entity has two graphs with different x scales (maximum length x = 20 at the top, maximum length x = 40 at the bottom of the page) and a complete tract list for this entity. Tracts with suspicious mutation frequencies are marked green (elevated) or red (reduced). If you move your pointing device over data points that lie above the upper or below the lower prediction line, a popup will appear showing the name of the gene and the mutation frequency (in the bottom graph (x = 40) for x > 15 only). If you click this name, you will be forwarded to the respective position within the entity table.
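
The marking of suspicious tracts can be pictured as a comparison of each observed mutation frequency with the prediction band at that tract length. The following is a minimal sketch only: the function name and the `lower`/`upper` arguments are hypothetical, and the bounds are assumed to have been read off the lower and upper prediction lines beforehand. It is not SelTarbase's own code.

```python
def flag_tract(frequency, lower, upper):
    """Classify an observed mutation frequency against a prediction band.

    frequency, lower and upper are fractions between 0 and 1;
    lower/upper are assumed to come from the fitted prediction lines.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if frequency > upper:
        return "elevated"   # marked green in the entity table
    if frequency < lower:
        return "reduced"    # marked red in the entity table
    return "expected"       # inside the prediction band, not marked
```

A tract with a frequency inside the band is left unmarked; only the two outer classes correspond to the green and red markings described above.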

SelTarbase regression analysis (2)

The table shows the gene name, alternative gene names (if available), the tract name (accession number, nucleotide, length, position), the total number of investigated tumors, the number of mutated tumors among them, the resulting mutation frequency, and finally the number of references that contributed to these numbers. You can choose the tract name to get to the respective position in the complete tract list (loading can take a while). You can also pick the reference number: a popup will appear, from which you can get to the respective position in the complete reference list.

SelTarbase regression analysis (5)

All entity-related regression results are summarized in the Predicted Targets list. Following the table view of each entity, it lists all MNRs that show a significantly deviating mutation frequency, elevated or reduced, in any entity. Cross-links to gene, tract, and entity-related information are also available here.

A further regression analysis only for colorectal data is presented stratified by coding status of all included MNRs: cMNRs (coding, transcribed and translated into amino acids), ncMNRs (untranslated [uMNRs] and intronic [iMNRs]), and nMNRs (intergenic). Additional information is provided in the figure legend of this analysis.

SelTarbase regression analysis (3)

As a final supporting statistical analysis there is also a non-regression calculation, in which the mean mutation frequency and two times the standard error (2 x SE) are shown for each nucleotide type and length. The positions of the mean points clearly describe an S-shaped course. One can also see that the data density decreases strikingly towards longer MNRs and that the variation therefore increases. Additional information is provided in the figure legend of this analysis.

SelTarbase regression analysis (4)

Complete tract list

The tract list is ordered by gene name (hugoID) and is now subdivided by the 26 starting letters to improve page loading performance. You can select a starting letter, and all MNRs whose gene name starts with that letter will be shown.

SelTarbase tract list

From here, you can choose to see detailed information on the listed tracts or follow the links to the ensembl genome browser or NCBI EntrezGene for the respective genes. If data are provided, you can also go directly to the statistical regression analysis of an entity by clicking the mutation rate in the respective column.

Complete reference list

The reference list is ordered by the reference ID (first author, year, suffix) and the publication date (PubMed). If publication dates are identical, the PubMed ID is used to set the order.

SelTarbase reference list (1)

Additionally, there is a complete journal list that shows where the included articles were published. This list is sorted in descending order by the number of contributions from each journal.

SelTarbase reference list (2)

Furthermore, there is a supplemental reference list of sources from which information on MSI-H colorectal cancer cell lines or primer sequences for MNR analysis is derived. References that also contributed to the mutation database are linked to the reference detail information.

SelTarbase reference list (3)

Search function (genes, accession numbers, tracts)

Within the header frame there is a search function where you can enter a gene name, an accession number, or a tract pseudonym (e. g. BAT26), or even parts of them. It is also possible to search for several terms at once. By default, all terms are combined in an OR clause. For combined searches, where two or more terms should match together, prepend a "+" directly before the second, third, ... search term (e. g. "ACVR2 +TGFBR2" will only present results where both terms match). If there are more than 100 hits per sub-search, a hint to use a more precise search phrase is given in order to avoid the long loading times that result from such listings.
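
The OR/"+" behaviour described above can be illustrated with a small sketch. It rests on one assumption that the help text does not spell out: a term prefixed with "+" is combined (AND) with the term directly before it, and the resulting groups are alternatives (OR). The function names are hypothetical, and matching is done as a case-insensitive substring test because parts of names are accepted.

```python
def parse_query(query):
    """Split a search phrase into OR-groups of AND-ed terms."""
    groups = []
    for token in query.split():
        term = token.lstrip("+").lower()
        if not term:
            continue  # ignore a lone "+"
        if token.startswith("+") and groups:
            groups[-1].append(term)   # "+term" joins the preceding group
        else:
            groups.append([term])     # plain term starts a new OR-group
    return groups


def matches(text, groups):
    """True if any group has all of its terms contained in text."""
    text = text.lower()
    return any(all(term in text for term in group) for group in groups)
```

With this reading, "ACVR2 TGFBR2" yields two groups of one term each, whereas "ACVR2 +TGFBR2" yields a single group in which both terms must match.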

Option direct

If activated, this option takes you immediately to the tract detail information when a single MNR matches your query. With this option set, only MNRs are searched and the other topics (MNR_ensembl, sample, int, and pdf) are omitted. If you are more interested in an overview of the repeat tracts of the gene you are looking for, or if you also wish to search within the other topics, you should perform your search without option direct.

Search function (MNRs)

The example below shows the results for the search for "TGFBR". There are two MNR hits: for the gene TGFBR2 and the most interesting long coding MNR A10 within this gene.
Additionally, the results of the search for "TGFBR" within the reference list are shown. Here, three references are reported.

SelTarbase search function (1)

By selecting the third link you will be directed to the tract detail information of "M85079.A10.709". Normally only one link per accession number appears, announcing the most 5' localized MNR (here "M85079.G5.343"). If there are several published accession numbers for one gene, all of them are reported. If there is a special repeat tract within a gene (with a pseudonym, e. g. "BATRII"), more than one hit is announced for this accession number.

Using the first link "show all MNRs of genes/ tracts containing ..." one will see a detailed tract list according to the complete tract list, containing only MNRs matching the search term.

SelTarbase Partial tract list

Since MNR_ensembl (database for human coding, untranslated, non-coding, and intronic MNRs) is now fully integrated within SelTarbase, you can search it as well.
This service requires free registration and login.

Search function (references)

In addition to the tract detail information (hugoID, accession number, and so on), the phrase is by default also searched for in the included reference list: within the reference ID, the PubMed ID, the DOI, the author names, and the manuscript titles. The returned list is sorted by reference ID; clicking the reference ID opens the reference detail information, and a further link leads to the reference's PubMed entry at NCBI. The same search algorithm is used as in the other search areas: terms separated by spaces are treated as OR clauses, and to require that all words match, a "+" has to be prepended to the second, third, ... term of the search phrase.

SelTarbase search function (2)

Option pdf (version FIUO only)

In version FIUO (For Internal Use Only) the option pdf is available and activated by default; it allows searching the full-text pdf documents of included references.

SelTarbase search function (3)

Search function (cell lines / mutation status)

A completely new search tool has been integrated that reports detailed mutational status information for MSI-H colorectal cancer cell lines when you enter the name, or even part of the name, of a cell line. This option is not set by default. The result list is similar to the complete/partial tract list, but provides information about the contributing reference, the mutational status (1 = mutated, 0 = wildtype), and the reported allele(s) if documented. Where no detailed allele information is documented, a "-" appears. The following example shows the results for the search "SNU-C2A". Mutated tracts are highlighted in bold letters. The tract name links to the respective tract detail information and the reference ID to the reference detail information. By default only the first 100 tracts are reported to keep the result list loading quickly. If there are more than 100 tract hits, a link to the complete result list is provided ("Your query (...) matched to more than 100 sample entries in SelTarbase. For a complete list of all ... hits click here.").

SelTarbase search function (4)

Tract detail information

The tract detail information is subdivided into several sections. It starts with general information about the MNR: its length (possibly also a reported length), the coding status (for a classification see above), the publication in which this MNR was first mentioned, the accession number (EMBL, NCBI, ensembl), the official gene name (hugoID) as well as alternative names (including a pseudonym if one exists, e. g. BAT26), the chromosomal location, and links to other databases (ensembl, NCBI, SOURCE, Genecards) for further information about the containing gene or locus.
The second part summarizes all available mutation frequency data in different tissues (Colon, Endometrium, and Stomach) with the respective underlying cumulative sample numbers. From here, one can go to the regression analyses or get an overview of the contributing references.
The next part shows all available sequence information: the relevant genomic sequence (using BLAST one can see where the MNR region maps to the chromosome); where available, information about primers for genomic sequencing or fragment analysis (with reference); a transcript sequence if the MNR is located on a transcript (cMNRs/uMNRs) or close to one (some iMNRs); a list of all currently annotated ensembl transcripts of the corresponding gene (except for nMNRs) including ATG, STOP, and MNR position; links to ensembl transcript and peptide information; a link to the corresponding information in the subcellular localization database LOCATE; and the final coding status.

SelTarbase tract info (1)

The last section provides the available detailed information about the mutational status and described allele(s) of human MSI-H colorectal cancer cell lines with the respective source. Either by clicking the mutated allele(s) or by following the link "Show consequences of mutated alleles ..." one arrives at the Tract transcription information (for cMNRs only).

SelTarbase tract info (2)

Tract transcription information

All currently annotated ensembl transcripts for the respective gene are shown here. The transcript sequences are numbered and aligned with the corresponding amino acid sequence (length in bp and aa is shown). The MNR is highlighted in bold red letters with a dark yellow background. Wildtype transcripts that do not contain the MNR are grayed out.
The stop codon of each transcript is highlighted in black letters with a dark yellow background. The last exon of each transcript is highlighted in orange letters.
For each transcript and each mutated allele (up to four variants: m2 (MNR shifted minus 2), m1, p1 (MNR shifted plus 1), p2) the transcript sequence is shown together with the derived amino acid sequence. The peptide sequence from the end of the MNR to the first stop codon (neopeptide) is highlighted in red letters. The complete lengths of the variant peptide and the neopeptide are shown. The molecular weight and pI can be calculated by following the link under the aa length, which uses the Compute pI/Mw tool at the ExPASy Proteomics Server. Variant transcripts in which the MNR is not translated are also grayed out, as is the corresponding (wildtype) peptide sequence.
For each frameshift transcript one of the following symbols indicates whether the transcript is subject to nonsense-mediated mRNA decay (NMD): NMDs (NMD-sensitive), NMDi (NMD-irrelevant), NMDe (NMD-escape, experimentally shown), or NMDu (NMD-unclear). PTC denotes the premature termination codon.
All transcripts containing the MNR are aligned by MNR position on load. After scrolling, one can restore this position for a single transcript by clicking on its sequence, or for all transcripts by hitting the "align MNRs" button. Using the navigation arrows one can scroll all transcript sequences together in steps of 10, 25, 50, 100, 200, or 500 bp downstream or upstream.

SelTarbase tract info (3)

iMNRx transcription information

iMNRx's are polypyrimidine iMNRs (A/T) near splice acceptor sites that could alter splicing if varied in length, leading to exon skipping, as has been shown for MRE11 (iMNRx MRE11_IVS4). All currently annotated ensembl transcripts for the respective gene are shown as for cMNRs, but the iMNRx is not highlighted since it is not part of the cDNA.
For each transcript that contains the possible skip-exon, the skip-cDNA is shown together with the derived amino acid sequence. In two thirds of exon-skipping events a neopeptide results from a frameshift, depending on the length of the skipped exon (in about one third the exon length is divisible by three, resulting in an in-frame deletion). See also Tract transcription information above.
All transcripts containing the skipped exons are aligned by exon start position on load. After scrolling, one can restore this position for a single transcript by clicking on its sequence, or for all transcripts by hitting the "align MNRs" button. Using the navigation arrows one can scroll all transcript sequences together in steps of 10, 25, 50, 100, 200, or 500 bp downstream or upstream.
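
The "two thirds" figure follows from simple arithmetic on the exon length: skipping an exon keeps the reading frame only if its length is a multiple of three. A minimal sketch, with a hypothetical function name and the assumption that the skipped exon lies entirely within the coding sequence:

```python
def skip_consequence(exon_length):
    """Classify the effect of skipping a coding exon of the given length (bp)."""
    if exon_length <= 0:
        raise ValueError("exon length must be positive")
    if exon_length % 3 == 0:
        return "in-frame deletion"   # reading frame preserved
    return "frameshift"              # downstream frame shifted, neopeptide possible


# If exon lengths were spread evenly over the three remainders,
# two out of three skipped exons would shift the frame.
lengths = range(1, 301)
frameshift_share = sum(skip_consequence(n) == "frameshift" for n in lengths) / len(lengths)
```

Here `frameshift_share` evaluates to 2/3 for the evenly spread toy lengths; real exon length distributions need not be exactly uniform.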

SelTarbase tract info (3a)

If you experience problems using the cDNA and/or protein tools, in particular the BLASTN links, with sequences of about 2000 or more in length, and receive a Request Error as shown below, you may be using these functions from behind a firewall or proxy of your local network. Some of these proxies truncate long URLs, so that the called function will not work properly. In that case, please use the fasta tool provided instead and copy/paste the sequence for further analysis.

SelTarbase Request Error

Reference filter

Within the tract detail information one can set a reference filter by clicking the button "select" and clear it again with the button "all". With this filter, the navigation arrows in the upper left corner of the frame "scroll" only through the included MNRs to which a single reference contributed, instead of through all included MNRs. The order is determined by hugoID, accession number, and tract position. In a new window one can select one of the references that contributed data to the currently selected MNR. The reference that first described this MNR is preselected. On leaving the tract detail information, the filter is cleared automatically.

SelTarbase set reference filter

MNR filter

Within the tract detail information one can also set additional filters by clicking the button "MNR" and clear them again. With these filters, the navigation arrows in the upper left corner of the frame "scroll" only through the included MNRs that match the selected coding status or repeat length. In a new window one can select several parameters: cMNRs for coding, uMNRs for untranslated, iMNRs for intronic, pMNRs for pseudogene, and nMNRs for intergenic MNRs. Additionally, the MNR list to "scroll" through can be restricted by a minimum and/or a maximum MNR length. These filters can be used in combination with the reference filter. On leaving the tract detail information, the filters are cleared automatically.

SelTarbase set MNR filter

Reference detail information

The reference detail information provides a complete list of all tracts to which the respective reference contributed. The design of that list is similar to the complete/partial tract list. It supplies details about the coding status as well as the mutation frequencies in all investigated entities. The tract name links to the respective tract detail information.
Within this view, the navigation arrows in the upper left corner of the frame allow "scrolling" through all included references sorted alphabetically by reference ID.

SelTarbase reference info

Registering with SelTarbase

Some functions of SelTarbase are available to registered users only, since they are CPU- and time-consuming. Registration is required for model recalculation after data submission and for access to MNR_ensembl. All public SelTarbase data (versions 2003, last, and latest; published MNR data) are freely accessible without registration. Only our own not yet published data (version FIUO) are not freely accessible.
Registration with SelTarbase is completely free.
No data are stored and no activity of registered users is logged.
In order to obtain an account for SelTarbase please provide the following information on the registration form:

That's all. After you submit these data, a registration confirmation email will be sent to your email address. If you reply to this email within 24 hours, your account will work as soon as the confirmation has been recognized by SelTarbase. After 24 hours without confirmation, the account request is erased and the username becomes available again.

SelTarbase register


The login window provides the login functionality. To log in, enter your username and password. If you have forgotten your password, you can ask to have it sent to you again. If you are not registered yet, simply use the button "Register".

SelTarbase login

Submit function

Here, we give people working on MNR data in MSI-H tumors the possibility to test their own new data in a fresh regression analysis that combines these new data with all published data (data basis: latest).
For our internal users we additionally provide a larger data set of MNRs including a considerable number of MNR mutation data not published so far (data basis: FIUO).
To protect your own unpublished data, you can upload your data with a minimum of information using the syntax of virtual MNR names ("x.N9.0", where x is your pseudo-accession number, N9 the nucleotide and length, and 0 your pseudo-position). If you have supplemental data for already published MNRs, it is recommended to use the exact MNR names of SelTarbase for data submission so that these data are added correctly. Otherwise, they would falsely appear as additional new data points and would not contribute to the known cumulative mutation frequency.
More detailed information for this service is available at Submit data.
This service requires free registration and login.
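
The MNR name syntax (accession number, nucleotide plus length, position, separated by dots) lends itself to a quick check before submission. The sketch below is an assumption-laden illustration: the function name and the exact pattern are hypothetical, and "N" is accepted as a nucleotide only because it appears in the virtual name "x.N9.0" quoted above.

```python
import re

# accession . nucleotide+length . position, e.g. "M85079.A10.709" or "x.N9.0"
MNR_NAME = re.compile(
    r"(?P<accession>.+)\.(?P<nucleotide>[ACGTN])(?P<length>\d+)\.(?P<position>\d+)"
)


def parse_mnr_name(name):
    """Split an MNR name into (accession, nucleotide, length, position)."""
    match = MNR_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid MNR name: {name!r}")
    return (
        match["accession"],
        match["nucleotide"],
        int(match["length"]),
        int(match["position"]),
    )
```

Parsing names this way makes it easy to spot a mistyped name that would otherwise be treated as a new data point instead of being added to an existing tract.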

SelTarbase submit function


MNR_ensembl

MNR_ensembl, the database for human coding, untranslated, non-coding, and intronic MNRs, is now fully integrated within SelTarbase. Here, one can search the entire human genome for cMNRs, uMNRs, ncrMNRs, and iMNRs (mouse genome: cMNRs, uMNRs, iMNRs). Simply provide a gene name, an accession number, or an ensembl human/mouse gene number (ENSG/ENSMUSG). The same search algorithm is used as in the other search areas: terms separated by spaces are treated as OR clauses, and to require that all words match, a "+" has to be prepended to the second, third, ... term of the search phrase. Besides the release (human is based on ensembl rel. 66_37 and mouse on rel. 45.36f), one can choose which MNRs are reported (default: cMNRs only) and in which order they are listed (default: length DESC(ending), position), since most people will mainly be interested in the longest cMNR of a gene. You can also choose to get a complete MNR list of that gene ordered by genomic position; the genomic orientation of the gene is taken into account. This service requires free registration and login.

SelTarbase MNR_ensembl (1)

The example below shows the results for the search for "TGFBR" in combination with default values (only cMNRs, minimum MNR length 4 nt). There are four hits ordered by ENSG: TGFBR3, TGFBR1, TGFBRAP1, and TGFBR2.

SelTarbase MNR_ensembl (2)

The second example shows the results of another search using the wildcard asterisk (*) as search term (caution: this selects anything) in combination with a minimum MNR length of 10 nt and a further restriction of candidates using human LOCATE (subcellular localization database, v3 (200706)) information by selecting three GO terms (GO:0016020: membrane, GO:0016021: integral to membrane, and GO:0016023: cytoplasmic membrane-bound vesicle). Multiple GO terms are selected by holding CTRL while left-clicking. There are 20 hits, also ordered by ENSG: ICA1, SEC63, ..., TGFBR2, ..., GRM7, and TOMM7. MNR_ensembl lookups using LOCATE information are performed as batch jobs. Results are presented after job completion/reload and are visible as soon as the result notification email has been sent. They can be re-fetched for up to 24 hours after the search was started, using the request ID (RID) that is also included in the result notification email.

SelTarbase MNR_ensembl (2a)

By selecting "TGFBR2" and leaving all options at their defaults, one gets the list of all cMNRs of TGFBR2 ordered by length in descending order and secondarily by position. By default, all ENSTs containing MNRs are shown, which means that an MNR appears several times if it is included in several transcripts. To avoid this and obtain a complete MNR list of a gene without duplicate entries, one can select "unique positions" in the show option of the search form. Each MNR is then shown only once, with the lowest ENST entry number in which it is annotated.
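
The default ordering and the "unique positions" option can be pictured with plain records. The sketch below is illustrative only: the record fields, the function names, and the ENST identifiers and positions in the usage note are made up, and ENST identifiers are assumed to be of equal width so that string comparison finds the lowest one.

```python
def order_mnrs(records):
    """Sort by length (descending), then by position (ascending)."""
    return sorted(records, key=lambda r: (-r["length"], r["position"]))


def unique_positions(records):
    """Keep each MNR once, reported with the lowest ENST it is annotated in."""
    best = {}
    for record in records:
        key = (record["nucleotide"], record["length"], record["position"])
        if key not in best or record["enst"] < best[key]["enst"]:
            best[key] = record
    return order_mnrs(best.values())


# Made-up example rows: one A10 tract annotated in two transcripts, one G5 tract.
rows = [
    {"enst": "ENST00000000002", "nucleotide": "A", "length": 10, "position": 709},
    {"enst": "ENST00000000001", "nucleotide": "G", "length": 5, "position": 343},
    {"enst": "ENST00000000001", "nucleotide": "A", "length": 10, "position": 709},
]
```

`order_mnrs(rows)` lists the A10 tract twice, once per transcript, whereas `unique_positions(rows)` returns it once with the lower ENST, followed by the G5 tract.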

This list also makes it possible to look at the ensembl genome browser's transcript or peptide info page (via ENST or ENSP). Additionally, one can choose to see the hypothetical results of a frameshift mutation of that MNR (Tract transcription information) by clicking "cMNR" (cMNRs only).

SelTarbase MNR_ensembl (3)

Following the link under the genomic position, one gets a new window containing the genomic sequence flanking the MNR (by default 150 bp upstream and 150 bp downstream). The sequence is formatted in GCG and can easily be copied and pasted into, e. g., any primer design tool. If the length of the flanking sequence is not sufficient, it can be increased to 300 or 600 bp.
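
Extracting the flanks is a matter of slicing around the tract. A minimal sketch, assuming a hypothetical function that works on an in-memory sequence string with a 0-based MNR start; the allowed flank sizes mirror the 150, 300, and 600 bp options mentioned above, and flanks are clipped at the sequence ends.

```python
ALLOWED_FLANKS = (150, 300, 600)


def flanking_sequence(sequence, mnr_start, mnr_length, flank=150):
    """Return (upstream, mnr, downstream) around an MNR in sequence."""
    if flank not in ALLOWED_FLANKS:
        raise ValueError(f"flank must be one of {ALLOWED_FLANKS}")
    mnr_end = mnr_start + mnr_length
    if mnr_start < 0 or mnr_length <= 0 or mnr_end > len(sequence):
        raise ValueError("MNR coordinates lie outside the sequence")
    upstream = sequence[max(0, mnr_start - flank):mnr_start]
    downstream = sequence[mnr_end:mnr_end + flank]
    return upstream, sequence[mnr_start:mnr_end], downstream
```

Returning the three parts separately keeps the tract itself easy to mark, for example when placing primers on either side of it.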

SelTarbase MNR_ensembl (4)

The model, assumptions and statistics

Several assumptions were made to model the correlation between MNR length and mutation frequency:

Data mining and evaluation

The first extensive survey of the literature (April 2002) revealed 110 publications referring to mutation analyses of 245 coding and noncoding microsatellites in 177 genes in MSI-H colorectal, gastric, or endometrial cancer. For the current data set, see version latest.
PubMed is now screened weekly for the search terms (microsatellite instability, MSI, MMR, HNPCC). Manuscripts whose title or abstract contains evidence of MNR data are screened as full text. The total number of analyzed manuscripts now exceeds 1,700.

For statistical analyses only those primary data were included which unambiguously assigned specific MNR mutations to individual MSI-H tumor samples or cell lines.
The microsatellite status of tumor samples was in the vast majority of cases true MSI-H according to the Bethesda (BOLAND et al., 1998) or revised guidelines (UMAR et al., 2006). For the data of some older manuscripts, however, authors used only dinucleotide tracts (e. g. D5S123 and so on) for instabilotyping. Where raw data were available, a quasi MSI-H status was confirmed and only those data were included. If not, "RER++" samples were believed to be true MSI-H. A few included data rely on BAT26-MSI only. In all other cases, whenever instabilotyping data were available, the true MSI-H status was verified and only data of true MSI-H samples were considered. If instabilotyping clearly did not follow the Bethesda or revised guidelines, so that the sample collective represented only a subset of true MSI-H tumors or included MSI-L tumors, and raw data were not obtainable for retyping, the data were not included.
Sporadic MSI-H and HNPCC MSI-H tumors were not distinguished for data inclusion, although the distinction was documented where noted.
A minimum cumulative sample number (n = 10) was defined as a calculation entry criterion for each MNR in order to reduce sampling errors. Because the total numbers of cell lines are smaller, the threshold for them was decreased to 5 in order not to lose too many data points. This possibly leads to lower accuracy and certainly to higher variance.

For the initial data set (version 2003) mutation data on 161 cMNRs in 108 genes originating from 101 publications met these criteria. In addition, unpublished mutation data on eight novel cMNRs were included. Also, cumulative mutation frequencies in MSI-H tumors for 25 noncoding microsatellites in 22 different genes originating from 33 publications and own analyses were included. This data set showed a disproportionate distribution regarding repeat type and length. In particular, homopolymeric runs of A represented the most commonly investigated MNR type [n(A) = 131, n(C) = 20, n(G) = 17, n(T) = 25], which precluded a subsequent regression analysis by nucleotide. This is still the case for the current data set (version latest) and also for the data set of version FIUO. Especially short and long MNRs were underrepresented [n(N≤4) = 9, n(N5) = 21, n(N6) = 24, n(N7) = 10, n(N8) = 59, n(N9) = 38, n(N10) = 21, n(N≥11) = 12]. Through intensive detailed revision of all formerly included references and careful interpretation of sequencing data, the number of small MNRs could be increased considerably. In the current version, particularly long MNRs with more than 10 repeat units represent less than 3 % (1 %) of all MNRs included (CRC/ ZCRC). Overall, 194 MNRs (169 cMNRs; 25 non-coding MNRs) in 137 genes had been included for statistical analysis in version 2003.

Some MNRs had to be corrected with regard to the reported repeat tract length. In 39 of 2376 MNRs (1.6 %) the reported length differed from the genomic information from ENSEMBL or NCBI. After carefully weighing the possible consequences and taking additional sequence information from cDNA databases into account, we corrected the reported length in these cases. In most cases, the difference was only one nucleotide (26/39) or two nucleotides (6/39). The largest differences were found for ADAMTS9 (T37 instead of real T25), TRPM1 (reported as A14, in reality interrupted twice by a G resulting in two A5 tracts), BAT20 (T20 instead of real T25), MYB (T19 instead of real T22), PPP3CA-T29 (reported as T26), and BAT40 (T37 reported as T40).

All data were stored in a MySQL database.

SelTarbase statistics

Building data tables

Cumulative mutation frequencies were determined separately for each tumor entity by several SQL statements. These tables were the basis for statistical regression calculation.
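
The cumulation step and the entry thresholds can be sketched outside SQL as well. This is not the database's actual statement set: the function name, the tuple layout, and the sample counts in the usage note are hypothetical, and the lower threshold of 5 is assumed to apply to the cell line entity "colon culture".

```python
from collections import defaultdict

# Minimum cumulative sample numbers per entity (10 for tumors, 5 for cell lines).
MIN_SAMPLES = {"colon": 10, "stomach": 10, "endometrium": 10, "colon culture": 5}


def cumulative_frequencies(observations):
    """Sum per-reference counts per (tract, entity) and apply the entry criterion.

    observations: iterable of (tract, entity, investigated, mutated) tuples,
    one per contributing reference.
    Returns {(tract, entity): (investigated, mutated, frequency)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for tract, entity, investigated, mutated in observations:
        totals[(tract, entity)][0] += investigated
        totals[(tract, entity)][1] += mutated
    table = {}
    for (tract, entity), (n, m) in totals.items():
        if n >= MIN_SAMPLES[entity]:          # too few samples: not calculated
            table[(tract, entity)] = (n, m, m / n)
    return table


# Made-up counts for illustration only.
toy_observations = [
    ("M85079.A10.709", "colon", 6, 5),
    ("M85079.A10.709", "colon", 8, 6),
    ("x.N9.0", "colon", 4, 1),
    ("x.N9.0", "colon culture", 5, 2),
]
```

In the toy data, the two colon references for the first tract are pooled to 11 of 14, the second tract is dropped for colon (4 samples) but kept for colon culture (5 samples).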

Statistical analysis

To model the dependency of the cumulative mutation rate y_ij on the repeat length x_ij of publications i to j, a nonlinear regression model was chosen: y_ij = f(x,θ) + ε, where f(x,θ) describes the nonlinear relationship between repeat length and mutation rate. The errors are assumed to be centered random variables, E(ε) = 0, with homogeneous variance. Assuming that the mutation rate is logistic in log length, we use a four-parameter logistic regression model defined by

f(x,θ) = δ + α / [ 1 + exp{β - γ * log(x)} ]

with a possibly nonzero lower asymptote δ ≥ 0. The upper asymptote δ + α ≤ 1 represents the maximum mutation rate possible. This parameterization was chosen due to its good statistical properties (RATKOWSKY et al., 1986).
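
The model function is easy to evaluate directly, which helps when reading the regression plots. The sketch below only restates the formula; the function name and the parameter values in the usage note are made up for illustration and are not fitted SelTarbase estimates. From the formula, the curve reaches half of its rise (δ + α/2) where β - γ * log(x) = 0, that is at x = exp(β/γ).

```python
import math


def mutation_rate(x, delta, alpha, beta, gamma):
    """Four-parameter logistic in log length: delta + alpha / (1 + exp(beta - gamma*log(x)))."""
    if x <= 0:
        raise ValueError("repeat length must be positive")
    return delta + alpha / (1.0 + math.exp(beta - gamma * math.log(x)))


# Hypothetical parameters: lower asymptote 0.02, upper asymptote 0.92.
params = dict(delta=0.02, alpha=0.90, beta=12.0, gamma=6.0)
midpoint = math.exp(params["beta"] / params["gamma"])  # about 7.4 repeat units
```

With these made-up values the predicted rate rises from near 0.02 for short tracts towards 0.92 for long ones, passing 0.47 at the midpoint.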

The parameter vector θ = (δ, α, β, γ)T is estimated by maximum likelihood (ML) using constraints for the lower and upper asymptotes as described above. The fitted curve is skew-symmetric on the log-length scale, with an inflection point at log(x) = β/γ. To determine the accuracy of the parameter estimates, θ̂, and of estimates for functions of parameters, λ̂ = λ(θ̂), we use the Wald test and the Likelihood Ratio test, respectively (HUET et al., 2003, Statistical Tools for Nonlinear Regression. A Practical Guide with S-PLUS examples, Springer, New York, USA). In addition, asymptotic 99 % confidence and prediction intervals were computed.

The software used for data analysis and visualization is R version 2.4.0 (2006-10-03) (R project), with the additional software library nls2 (version 2003.2) for nonlinear regression (nls2).
