Repeated DNA Sequences is best solved with a fixed-size sliding window over every 10-character substring. The core move is to record each window the first time you see it, then add it to the result only when it appears again. A plain hash-set solution is easy to explain, while a 2-bit encoding turns the running window into a compact rolling state update.
Problem Statement
In Repeated DNA Sequences, you are given a string made only of the DNA characters A, C, G, and T. Your task is to find every substring of length 10 that appears at least twice anywhere in the string, and the return order does not matter.
The interview focus is not generating all substrings blindly, but scanning overlapping length-10 windows without missing duplicates or reporting the same repeated block multiple times. For example, in s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", the repeated windows are "AAAAACCCCC" and "CCCCCAAAAA", while in s = "AAAAAAAAAAAAA", many overlapping windows exist but the answer still contains only "AAAAAAAAAA" once.
Examples
Example 1
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example details omitted.
Example 2
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Example details omitted.
Constraints
- 1 <= s.length <= 105
- s[i] is either 'A', 'C', 'G', or 'T'.
Solution Approach
Start with a fixed-size sliding window
Because every valid answer has length 10, slide one window from left to right and inspect each substring s[i:i+10]. This avoids pointless work on other lengths and keeps the reasoning tied to the exact constraint that makes Repeated DNA Sequences a window problem instead of a general substring search problem.
Use two sets to separate seen windows from repeated windows
Store each 10-letter sequence in a seen set the first time it appears. When the same sequence appears again, place it into a second set or the answer collection. That second bucket matters because this problem asks for each repeated DNA block once, even when overlaps cause the same 10-letter substring to appear three or more times.
Optimize with 2-bit encoding and running state updates
Since DNA has only four characters, map A, C, G, and T to 2 bits and encode each 10-letter window into 20 bits. Then update the window state by shifting left, adding the new character, and masking away bits that fell out of the 10-letter range. This removes repeated substring construction and matches the intended sliding-window-with-running-state pattern for Repeated DNA Sequences.
Complexity Analysis
| Metric | Value |
|---|---|
| Time | Depends on the final approach |
| Space | Depends on the final approach |
With direct substring hashing, the scan touches each starting index once, so the overall pass is linear in the number of windows, with extra memory for stored patterns. If substring creation is treated as copying 10 characters, the constant factor stays small because the window length is fixed. With 2-bit rolling encoding, each step updates a 20-bit state in O(1) time and still uses set space proportional to the number of distinct 10-letter windows encountered.
What Interviewers Usually Probe
- They expect you to notice the substring length is fixed at 10, which strongly suggests a sliding window instead of variable-length scanning.
- They want you to handle overlap correctly, because repeated DNA blocks can start one index apart and still count.
- They may push for a bit manipulation upgrade after the hash-set solution, especially if they mention A, C, G, and T as a tiny alphabet.
Common Pitfalls or Variants
Common pitfalls
- Adding a sequence to the result every time it repeats, which produces duplicates when the same 10-letter block appears many times.
- Forgetting the early return when s.length < 10, which leads to unnecessary logic around impossible windows.
- Breaking the rolling encoding by not masking out old bits, causing the running state to represent more than 10 characters.
Follow-up variants
- Return the count of each repeated 10-letter DNA sequence instead of only the unique repeated strings.
- Change the target length from 10 to k, which keeps the same sliding-window idea but changes the mask width in the encoded version.
- Process a stream of DNA characters online, where the last 10 characters define the current window and repeats are detected incrementally.
How GhostInterview Helps
- GhostInterview can walk window by window through Repeated DNA Sequences and show exactly when a DNA block moves from seen to repeated.
- GhostInterview can compare the plain hash-set version with the 2-bit rolling-state version so you can choose the right implementation trade-off quickly.
- GhostInterview can catch problem-specific mistakes like duplicate result inserts, off-by-one window bounds, and incorrect bit masking for the 10-letter window.
Topic Pages
Related GhostInterview Pages
- LeetCode Interview Copilot - Use GhostInterview as a live solver when you want direct help with LeetCode-style coding questions.
- Coding Interview Assistant - See how GhostInterview supports array, string, linked list, graph, and tree interview workflows.
- How GhostInterview Works - Review the screenshot, reasoning, and answer flow before using the solver in a live interview.
FAQ
What is the best pattern for Repeated DNA Sequences?
The cleanest pattern is a fixed-length sliding window backed by hashing. Because every target substring is exactly 10 characters long, you only need to scan those windows and record which ones have appeared before.
Why does overlap matter in this problem?
The repeated 10-letter DNA sequences can overlap heavily, especially in inputs like "AAAAAAAAAAAAA". You cannot skip ahead after a match because the next valid repeated window may start just one position later.
Should I use strings or bit manipulation for LeetCode 187?
Start with strings and sets if you want the fastest correct explanation. Use bit manipulation when you want a tighter implementation that updates the 10-letter window as a compact rolling state rather than rebuilding substrings.
Why do we need two sets or an equivalent repeated marker?
One set tracks whether a 10-letter sequence has appeared before, and the second prevents duplicate answers. Without that separation, a sequence that appears three times would be appended multiple times even though the output should list it once.
What edge case is easiest to miss in Repeated DNA Sequences?
The easiest miss is when the input length is less than 10, because no valid DNA substring can exist. Another frequent miss is mishandling the encoded window mask, which silently corrupts later matches.
Need direct help with Repeated DNA Sequences instead of spending more time grinding it?
Download GhostInterview when you want a LeetCode solver, not another long practice loop. Capture Repeated DNA Sequences from a screenshot, get the answer path and complexity, and use supported stealth workflows that stay outside captured layers.
Capture the prompt fast instead of rewriting the problem by hand.
Get the solution path, trade-offs, and complexity summary in one pass.
Stay outside captured layers on supported screen-share workflows.
Stay in the same pattern family
Check if binary string contains all integers from 1 to n as substrings, leveraging sliding window and bit manipulation techniques.
Open problem page#76 Minimum Window SubstringFind the smallest substring of s containing all characters from t using a sliding window with running state updates for exact matches.
Open problem page#30 Substring with Concatenation of All WordsFind all starting indices of substrings in a string that are concatenations of a given list of words.
Open problem page