This page contains information that we have assembled to help you solve problems that you may experience. We recommend reading this information even if you don't think that you are having problems, since it may help you to obtain better sequences.
What to do when the sequence isn't good
Why can't I open the files?
All I see is 5 "N"s...
The most common reasons for bad data
The "flat line" or "DOA - Dead on Analysis"
Reading near the primer
Unusual base composition of templates and secondary structure
We provide you with a lot of information when you get your sequence. In addition to the sequence itself as a text file, you will get the electropherogram file and the sample reference spreadsheet. First check (using the sample sheet) that the samples you have are the right numbers (mistakes do happen!). Next use a chromatogram viewing application to view the actual data. If you see just 5 "N"s, this means that the base calling program did not find any data worth analysing. If you take a look at the raw data, you will see no good sequence peaks, and if you take a look at the annotation page, you will see that the signal intensities for the individual bases are all very low (typically less than 50). If you get sequence, but it is not good, or it stops quickly, then this indicates a problem as well. The chromatogram file contains a lot of information that will help you to decide on the most appropriate course of action, and it is vital that you look at it. If you still have no idea what to do, first read the rest of this page (it may answer your questions!), then contact us to ask our advice.
We store the data on our server in compressed data archives. These archives have a ".zip" file extension and require a piece of software to expand the archive so that the results files are available. Most computers should be able to expand these archives using software already installed. However, if you are having problems, free software is available to do this.
The base caller we use will produce just 5 "N"s if it decides that there is nothing worth analysing. It is usually (but not always) correct in its decision. The most common cause of analysis failure is a lack of signal, so if the raw data shows no sequence peaks and the annotation shows that the signal intensities for the bases are very low (typically less than 50), then this is the problem. Take a look below for more help. If there are visible products (especially very short products, but not just dye terminators), then please contact us and we will attempt to reanalyse the sample(s) to obtain data for you.
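The check described above can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary of per-base signal intensities are hypothetical, and you would need to copy the intensity values by hand from the annotation page of your chromatogram viewer. The threshold of 50 is the "typically less than 50" figure quoted above, not a hard limit.

```python
# Illustrative sketch only: names and data layout are assumptions, not part of our software.
LOW_SIGNAL_THRESHOLD = 50  # "typically less than 50", as quoted in the text


def is_low_signal(intensities, threshold=LOW_SIGNAL_THRESHOLD):
    """Return True if the signal for every base (A, C, G, T) is below the threshold."""
    return all(value < threshold for value in intensities.values())


def looks_like_failed_call(sequence):
    """Return True if the text file holds nothing but a short run of N's (5 or fewer)."""
    seq = sequence.strip().upper()
    return seq != "" and set(seq) == {"N"} and len(seq) <= 5


def diagnose(sequence, intensities):
    """Give a rough first diagnosis from the called sequence and signal intensities."""
    failed = looks_like_failed_call(sequence)
    if failed and is_low_signal(intensities):
        return "no usable signal"
    if failed:
        return "failed call despite signal: ask for reanalysis"
    return "sequence present"


print(diagnose("NNNNN", {"A": 12, "C": 20, "G": 8, "T": 15}))
```

A result of "failed call despite signal" corresponds to the case where products are visible in the raw data, which is when it is worth contacting us about reanalysis.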
Over the years it has become obvious that certain common mistakes appear time and time again. Here they are in descending order of frequency (the solution to each problem is in brackets):
- Too little DNA (Check on agarose gel and use correct amount)
- Too much DNA (Check on agarose gel and use correct amount)
- Primer with too low a Tm (Ensure Tm is about 60 °C; follow design guide)
- Multiple templates (Pick well spaced colonies off plate)
- Multiple priming sites (Check to make sure template has only one priming site)
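For the primer Tm item in the list above, a quick sanity check can be done with the widely used "2 + 4" rule of thumb (often called the Wallace rule): Tm ≈ 2 °C for each A or T plus 4 °C for each G or C. This is a rough approximation for short oligonucleotides and is not necessarily the calculation used in our primer design guide, so treat the result as a first filter only. The function name and example primer below are made up for illustration.

```python
def wallace_tm(primer):
    """Estimate primer Tm in degrees C using the 2 + 4 rule of thumb.

    Tm = 2 * (number of A and T) + 4 * (number of G and C).
    Rough approximation only; real Tm depends on salt, concentration and sequence context.
    """
    seq = primer.strip().upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters in primer: {sorted(invalid)}")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-mer with 50% GC content
print(wallace_tm("ATGCATGCATGCATGCATGC"))
```

A primer that comes out far below 60 by this estimate is a likely candidate for the "too low Tm" failure and is worth redesigning.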
A total lack of any sequence can be very frustrating and does not give much information on which to base a solution to the problem. The most common reasons for this type of result are:
- One of the reaction components left out!
- Very low (<50ng) amount of template DNA (top of list above)
- Lack of priming site for primer used
- Totally unsuitable primer used (e.g. very low Tm, degenerate)
- Reaction products lost on ethanol precipitation
Some of these problems are ones that we have control over (e.g. addition of reaction components) and we can make mistakes. We are, after all, only human! If you really think that the reaction should have worked, please contact us and we will be happy to repeat the sample using material already provided. If we have made a mistake (i.e. you get a good result with the repeat), then we will not charge for the repeat or the original. However, we do reserve the right to charge for both reactions should the repeat also fail (showing that the failure was not down to us). Hence, please double check everything before asking for repeats.
Due to the nature of DNA sequencing, the signal intensity always tends to decrease somewhat as the sequence extends. However, this should not be to the extent that the sequence becomes unreadable within a short distance of the primer. When a sequence starts off okay and then "dies" quickly (typically after a couple of hundred bases), this is often a result of low template quantity (see the list above again!). However, there are other causes. Salt or ethanol in the template DNA reduces the processivity of Taq DNA polymerase and results in the inability of the enzyme to produce long extension products. Salt in plasmid preps can be due to insufficient washing of ethanol pellets, insufficient washing of resin columns (mini-preps) or residual Caesium Chloride in the sample (maxi-preps). Ethanol in the DNA from precipitations that have not been dried properly can also result in the same effect.
Too much template also causes this effect (although the reaction has to be massively overloaded with template) since there is an excess of priming events leading to an abundance of short products. Massively too much DNA also leads to "retention" of the DNA on the capillary, so the sequence data only starts to come through when the run is almost finished. The intensity is also often too high for accurate sequence determination. This effect will be seen as very broad "ragged" peaks late in the read and nothing before them. The solution is to use the correct amount of DNA, which will not cause capillary retention.
Guess what the most common reason for this is? That's it, lack of template! If the signal is low, then the background starts to become a problem. Automated DNA sequencers will always try to find a sequence even if there actually isn't anything worth analysing. As stated above, if the base caller finds nothing at all to analyse, it will often just produce 5 "N"s. However, if it feels that there is something to analyse, it will try to do this. The base caller we use will not normally call "N"s, since it gives a quality score to each base instead (depending on your chromatogram reader, you may or may not be able to access this information). Hence a sequence full of mistaken base calls is usually due to weak sequence. Even in good sequence, it is possible to find regions where the sequence is not so good and this can be due to template-specific effects (i.e. "odd" sequence) or due to contamination with unincorporated dye-terminators (see below - reading near the primer). Contamination with greater amounts of ethanol than those that cause short reads can yield generally poor sequence over the entire read. RNA in the sample will also do this since the primers and Taq will tend to bind to it and result in weak/absent sequence. The host used to generate plasmid templates can have a significant effect. Some hosts (such as HB101 and its derivatives, including TG1, TG2 and the JM100 series) contain large amounts of carbohydrates that are released on lysis and can contaminate DNA prepared from them. These strains also contain an intact endA locus and so produce a nuclease that can degrade the plasmid DNA and result in a poor template.
With any sequencing strategy there is a finite limit to how near to the primer it is possible to read. This is based on a number of factors to do with when chain termination begins and the resolving power of the system used to separate extension products. With automated DNA sequencers it is possible to read very near to the primer (up to a few bases away). However, with standard dye-terminator reactions reading closer than 20-30 bases is not usually possible due to the inefficiency of recovery of very short extension products during the purification step and lack of resolution of the short fragments. Should you wish to read very close to the primer, please contact us to discuss approaches that can be used to attempt this. One option you could try is to use modified primers. One such solution is "AMBeR Sequencing Primers", which are available from Biolegio.com (distributed via Nimagen in the UK). These primers are said to allow reading from very near to the primer and can be useful if you cannot design a primer further away. If possible, using a regular primer that is 50 bases or so away from the area where you wish to start reading is the simplest option.
This is quite different from what is seen with weak sequence where the background signal becomes significant, leading to ambiguous base calls. When multiple sequences are seen, two or more distinct peaks are present at each base location (unless, of course, both/all bases happen to be the same one), resulting in "peaks on top of peaks". There are several causes of this. The presence of more than one template in the sample is a common cause. What is often seen is that the sequence is good within the plasmid sequence, then becomes unreadable past the cloning site. The plasmid backbone sequence is the same in both/all templates while the inserts are different, with predictable results. The presence of more than one priming site in the plasmid (or more than one primer in the reaction!) will also cause this and the only way around this is to use a different primer.
Another reason for multiple sequences is poly A tails. Most cDNAs are generated by oligo dT priming. This results in a poly A/T stretch at the 3' end of the cDNA. The sequencing chemistry utilises Taq DNA polymerase. This polymerase is not very good at polymerising accurately through long stretches of homopolymeric sequence and it "slips", by which I mean that it will either add extra nucleotides or miss nucleotides out. Hence, not all the synthesised products will be of the same length. This results in multiple sequences after the poly A/T region. Depending on the length of the poly A/T section, the result can be either perfectly fine (generally less than 20 residues of A/T), mildly affected (20-40 A/Ts) or almost unreadable (over 40 A/Ts). In mild cases, it is often possible to correct the sequence by looking for "pre-peaks". This is an identical (but weaker) sequence running 1 nt before the correct sequence and is characterised by a small signal that is identical to the next major signal. For example, if there is a G followed by a C, one will see a weak C peak under the G peak. If this causes an ambiguous base call (N), then it is fairly easy to correct this. Since this effect is a result of an inherent property of the polymerase, there is no way for us to correct it.
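If you already know (or can guess) the template sequence, the length ranges given above can be applied with a short Python sketch such as the one below. The function names are hypothetical, and the cut-offs are simply the approximate figures quoted in the text (under 20, 20-40, over 40), so the output is a guide to what to expect and not a guarantee.

```python
import re


def longest_poly_at_run(sequence):
    """Return the length of the longest uninterrupted run of A's or of T's."""
    runs = re.findall(r"A+|T+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def expected_effect(sequence):
    """Classify the likely effect of the longest poly A/T run on the read."""
    n = longest_poly_at_run(sequence)
    if n < 20:
        return "fine"
    if n <= 40:
        return "mildly affected"
    return "almost unreadable"


# Hypothetical cDNA end with a 25-residue poly A tail
print(expected_effect("GCTAGC" + "A" * 25 + "GGATCC"))
```

Sequencing from the other strand gives a poly T run of the same length, which is why the sketch treats runs of A and runs of T in the same way.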
One cause that has nothing to do with the template, but produces the same effect, is poor primer synthesis. Primers are synthesised from the 3' end and so if things go wrong (e.g. a base is not incorporated) it is often at the 5' end (made last). Since the sequencing reaction products are extended from the 3' end, these errors are not removed (if a base was missing from the 3' end, the polymerase would simply fill it in and nobody would ever know!). If a proportion of the primer is n-1 (i.e. lacking the last base) then sequence products made with this primer will be of two lengths (n and n-1 plus whatever gets added by the polymerase). This will yield two sequences at every position. Depending on the amount of n-1 primer, this may or may not be a problem. Usually the amount of n-1 primer is very low (a good primer). However, if the amount of n-1 primer is above about 10% of the total, then it is possible to see this in the sequence as a "pre-read" (i.e. there is a small amount of the next base at every position in the sequence). Whilst it is possible to HPLC purify the full length primer from the truncated one, an easier solution is to have the primer remade.
Some templates have an unusual abundance of certain bases such as AT or GC or contain repeating regions or regions capable of forming stable secondary structures (e.g. hairpin loops). Alternatively they may contain homopolymeric regions, where only a single base is present. All of these characteristics can present problems. AT-rich templates can make it very difficult to design primers with the desired characteristics (see primer design), whilst GC-rich templates are often very difficult to sequence satisfactorily because the Taq polymerase has great difficulty separating the strands of DNA once they anneal together. Any secondary structures within a single strand also tend to be much more difficult for Taq to polymerise through. Altered reaction conditions can help (e.g. increased denaturation time/temperature and the inclusion of compounds that destabilise base pairing such as DMSO at 5-10%) as can subcloning out smaller regions of the DNA. Homopolymeric regions of sequence are very difficult for Taq to read through accurately, since it tends to "slip" in such areas and miss out bases. This then results in a classical double sequence as described above. Long poly A tails on cDNA clones are often a cause of this type of problem.
We have various methods at our disposal that can (emphasise can, not will!) help with reading through difficult regions of sequence. Strong secondary structures cause special difficulty for the standard reaction mix and often result in "hard stops" where the sequence just dies off well before it should because the polymerase is incapable of separating the base-paired nucleotides. A number of approaches can be used to assist with such problems and customers should contact us to discuss their possible use. Sometimes these approaches still fail to generate sequence and in these circumstances the customer has various options:
- Give up (usually not really an option!).
- Try to generate a PCR product containing the area of sequence and use either 7-deaza-GTP or dITP in the PCR reaction instead of dGTP. These two base analogues are not capable of forming as many hydrogen bonds as dG and so generate weaker base-pair interactions. This often allows Taq polymerase to separate the base-paired nucleotides in secondary structures and so read through the problem area.
- Another option that can be tried is a really clever method that relies on randomly mutating the template DNA so that secondary structures (or other problem regions) are changed to sequences that no longer cause problems for the sequencing. This method is known as "Sequencing Analysis via Mutagenesis" (SAM). Quite a lot of work is involved, since mutagenic PCR reactions must be performed using various base analogues that result in altered sequence of the PCR product (about 20% change is good), followed by sub-cloning of these PCR products and sequencing of reasonable numbers of them. By mutating the region, the secondary structure or repeat that caused a problem will be abolished in some of the PCR products and so good sequence can be obtained. By sequencing multiple products, a consensus sequence can be produced that represents the original (unmutated) sequence. This approach clearly entails a significant amount of work, but it has been shown to be highly effective at sequencing through "impossible" regions of DNA. The work is published in a Nucleic Acids Research paper (Nucl. Acids Res. (2004) 32(3): e35 doi:10.1093/nar/gnh022).
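The consensus step in the last option can be illustrated with a simple majority vote at each position. This is a minimal sketch and not the procedure from the published paper: it assumes the mutated clone sequences have already been aligned to the same length (with no gaps), and the function name and example reads are invented for illustration. The idea is that because each clone carries different random changes, the original base should still be the most common one at any given position once enough clones are sequenced.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return a majority-vote consensus of pre-aligned, equal-length reads.

    Positions where the two most common bases tie are reported as "N".
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base.upper() for base in column).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            result.append("N")  # no clear majority at this position
        else:
            result.append(counts[0][0])
    return "".join(result)


# Hypothetical clones, each carrying a different random mutation
clones = ["ACGTAC", "ACGTTC", "ATGTAC", "ACGTAC"]
print(consensus(clones))
```

With only a handful of clones, ties and wrong majority calls become more likely, which is one reason why reasonable numbers of mutated products need to be sequenced.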