LeetCode Problem

How to Solve Repeated DNA Sequences

Repeated DNA Sequences is best solved with a fixed-size sliding window over every 10-character substring. The core move is to record each window the first time you see it, then add it to the result only when it appears again. A plain hash-set solution is easy to explain, while a 2-bit encoding turns the running window into a compact rolling state update.

GhostInterview Help

Need help with Repeated DNA Sequences without spending extra time grinding it?

GhostInterview can read Repeated DNA Sequences from a screenshot, generate the answer path, explain the complexity, and support solver-first interview workflows when you need direct help fast.

Screenshot Input

Capture the prompt fast instead of rewriting the problem by hand.

Answer + Complexity

Get the solution path, trade-offs, and complexity summary in one pass.

Stealth Workflow

Stay outside captured layers on supported screen-share workflows.

Problem #187Sliding window with running state updatesReviewed 2026-03-07
Difficulty
Medium
Primary pattern
Sliding window with running state updates
Answer-first problem summary
Step-by-step approach and complexity
GhostInterview solver workflow

Repeated DNA Sequences is best solved with a fixed-size sliding window over every 10-character substring. The core move is to record each window the first time you see it, then add it to the result only when it appears again. A plain hash-set solution is easy to explain, while a 2-bit encoding turns the running window into a compact rolling state update.

Problem Statement

In Repeated DNA Sequences, you are given a string made only of the DNA characters A, C, G, and T. Your task is to find every substring of length 10 that appears at least twice anywhere in the string, and the return order does not matter.

The interview focus is not generating all substrings blindly, but scanning overlapping length-10 windows without missing duplicates or reporting the same repeated block multiple times. For example, in s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", the repeated windows are "AAAAACCCCC" and "CCCCCAAAAA", while in s = "AAAAAAAAAAAAA", many overlapping windows exist but the answer still contains only "AAAAAAAAAA" once.

Examples

Example 1

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example details omitted.

Example 2

Input: s = "AAAAAAAAAAAAA"

Output: ["AAAAAAAAAA"]

Example details omitted.

Constraints

  • 1 <= s.length <= 105
  • s[i] is either 'A', 'C', 'G', or 'T'.

Solution Approach

Start with a fixed-size sliding window

Because every valid answer has length 10, slide one window from left to right and inspect each substring s[i:i+10]. This avoids pointless work on other lengths and keeps the reasoning tied to the exact constraint that makes Repeated DNA Sequences a window problem instead of a general substring search problem.

Use two sets to separate seen windows from repeated windows

Store each 10-letter sequence in a seen set the first time it appears. When the same sequence appears again, place it into a second set or the answer collection. That second bucket matters because this problem asks for each repeated DNA block once, even when overlaps cause the same 10-letter substring to appear three or more times.

Optimize with 2-bit encoding and running state updates

Since DNA has only four characters, map A, C, G, and T to 2 bits and encode each 10-letter window into 20 bits. Then update the window state by shifting left, adding the new character, and masking away bits that fell out of the 10-letter range. This removes repeated substring construction and matches the intended sliding-window-with-running-state pattern for Repeated DNA Sequences.

Complexity Analysis

MetricValue
TimeDepends on the final approach
SpaceDepends on the final approach

With direct substring hashing, the scan touches each starting index once, so the overall pass is linear in the number of windows, with extra memory for stored patterns. If substring creation is treated as copying 10 characters, the constant factor stays small because the window length is fixed. With 2-bit rolling encoding, each step updates a 20-bit state in O(1) time and still uses set space proportional to the number of distinct 10-letter windows encountered.

What Interviewers Usually Probe

  • They expect you to notice the substring length is fixed at 10, which strongly suggests a sliding window instead of variable-length scanning.
  • They want you to handle overlap correctly, because repeated DNA blocks can start one index apart and still count.
  • They may push for a bit manipulation upgrade after the hash-set solution, especially if they mention A, C, G, and T as a tiny alphabet.

Common Pitfalls or Variants

Common pitfalls

  • Adding a sequence to the result every time it repeats, which produces duplicates when the same 10-letter block appears many times.
  • Forgetting the early return when s.length < 10, which leads to unnecessary logic around impossible windows.
  • Breaking the rolling encoding by not masking out old bits, causing the running state to represent more than 10 characters.

Follow-up variants

  • Return the count of each repeated 10-letter DNA sequence instead of only the unique repeated strings.
  • Change the target length from 10 to k, which keeps the same sliding-window idea but changes the mask width in the encoded version.
  • Process a stream of DNA characters online, where the last 10 characters define the current window and repeats are detected incrementally.

How GhostInterview Helps

  • GhostInterview can walk window by window through Repeated DNA Sequences and show exactly when a DNA block moves from seen to repeated.
  • GhostInterview can compare the plain hash-set version with the 2-bit rolling-state version so you can choose the right implementation trade-off quickly.
  • GhostInterview can catch problem-specific mistakes like duplicate result inserts, off-by-one window bounds, and incorrect bit masking for the 10-letter window.

Topic Pages

FAQ

What is the best pattern for Repeated DNA Sequences?

The cleanest pattern is a fixed-length sliding window backed by hashing. Because every target substring is exactly 10 characters long, you only need to scan those windows and record which ones have appeared before.

Why does overlap matter in this problem?

The repeated 10-letter DNA sequences can overlap heavily, especially in inputs like "AAAAAAAAAAAAA". You cannot skip ahead after a match because the next valid repeated window may start just one position later.

Should I use strings or bit manipulation for LeetCode 187?

Start with strings and sets if you want the fastest correct explanation. Use bit manipulation when you want a tighter implementation that updates the 10-letter window as a compact rolling state rather than rebuilding substrings.

Why do we need two sets or an equivalent repeated marker?

One set tracks whether a 10-letter sequence has appeared before, and the second prevents duplicate answers. Without that separation, a sequence that appears three times would be appended multiple times even though the output should list it once.

What edge case is easiest to miss in Repeated DNA Sequences?

The easiest miss is when the input length is less than 10, because no valid DNA substring can exist. Another frequent miss is mishandling the encoded window mask, which silently corrupts later matches.

GhostInterview Solver

Need direct help with Repeated DNA Sequences instead of spending more time grinding it?

Download GhostInterview when you want a LeetCode solver, not another long practice loop. Capture Repeated DNA Sequences from a screenshot, get the answer path and complexity, and use supported stealth workflows that stay outside captured layers.

Screenshot Input

Capture the prompt fast instead of rewriting the problem by hand.

Answer + Complexity

Get the solution path, trade-offs, and complexity summary in one pass.

Stealth Workflow

Stay outside captured layers on supported screen-share workflows.