Technical SEO

robots.txt for SaaS: What to Block, What to Allow, and the Mistakes That Get You Deindexed

A single bad line in robots.txt can silently block Googlebot from your entire website. Here's the complete guide — with copy-paste templates for Next.js, Webflow, and WordPress.

April 15, 2026 · 10 min read

We ran a technical audit on a funded Indian SaaS company last month. They'd been live for two years, published 60+ blog posts, and built backlinks through guest posting. Their organic traffic had flatlined for 18 months.

The culprit: a single line in their robots.txt file from a developer who'd blocked the entire /blog/ directory during a content migration and forgot to re-enable it.

Sixty blog posts. Invisible to Google. For over a year.

robots.txt is the most dangerous file on your website — not because it's complicated, but because mistakes are silent. Google won't email you. Search Console may flag it if the blocked pages are important enough, but it won't catch everything. By the time you notice organic traffic dropping, weeks or months may have passed.

What Is robots.txt?

robots.txt is a plain text file that lives at the root of your domain: https://yoursite.com/robots.txt. It follows the Robots Exclusion Protocol (REP) and tells search engine crawlers — Googlebot, Bingbot, and others — which pages they're allowed to crawl.

The file structure is simple:

robots.txt — basic structure
# Basic robots.txt structure

User-agent: *          # Applies to all bots
Disallow: /admin/      # Block this directory
Allow: /public/        # Explicitly allow this path

User-agent: Googlebot  # Applies only to Google
Disallow: /api/        # Block API paths from Google only

Sitemap: https://yoursite.com/sitemap.xml  # Declare your sitemap

Key rules:

- The file must live at the root of the host (yoursite.com/robots.txt, not yoursite.com/pages/robots.txt), and each subdomain and protocol needs its own file.
- Paths are case-sensitive: Disallow: /Admin/ does not block /admin/.
- Rules are grouped under User-agent lines; a crawler obeys the most specific group that names it and ignores the rest.
- For Google, when an Allow and a Disallow rule both match a URL, the longest (most specific) matching path wins.
- robots.txt is publicly readable, so never list secret or sensitive URLs in it.

Critical distinction: robots.txt controls crawling, not indexing. Blocking a page in robots.txt prevents Googlebot from reading it — but if other pages link to that URL, Google can still index it as a shell (URL + title from backlinks, no content). To prevent indexing, use <meta name="robots" content="noindex"> on the page itself — but Google must be able to crawl it to read that tag.
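To make the distinction concrete, this is the standard noindex tag. It goes in the page's <head>, and Google must be able to crawl the page to see it:

```html
<!-- Blocks indexing, but only works if crawlers can fetch the page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the HTTP response header X-Robots-Tag: noindex does the same job.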

The robots.txt Directives You Need to Know

Directive reference:

- User-agent: specifies which bot the following rules apply to; * means all bots. Example: User-agent: Googlebot
- Disallow: tells that bot not to crawl the given path; an empty value allows everything. Example: Disallow: /admin/
- Allow: explicitly permits a path, used to carve out exceptions within a broader Disallow. Example: Allow: /blog/
- Sitemap: declares the URL of your sitemap so crawlers can discover pages efficiently. Example: Sitemap: https://yoursite.com/sitemap.xml
- Crawl-delay: asks bots to wait N seconds between requests. Google ignores this directive entirely (Bing and Yandex honor it). Example: Crawl-delay: 10
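When an Allow and a Disallow rule both match a URL, Google picks the rule with the longest matching path, and Allow wins ties. A minimal TypeScript sketch of that precedence logic (wildcards omitted for brevity; this illustrates the rule, it is not Google's actual implementation):

```typescript
// Among all rules whose path is a prefix of the URL path, the longest
// match wins; Allow beats Disallow on a tie. No matching rule means
// the URL is crawlable by default.
type Rule = { type: 'allow' | 'disallow'; path: string };

function isCrawlAllowed(urlPath: string, rules: Rule[]): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!urlPath.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === 'allow')
    ) {
      best = rule;
    }
  }
  return best === null || best.type === 'allow';
}

// Example: Allow: /blog/ carves an exception out of Disallow: /
const rules: Rule[] = [
  { type: 'disallow', path: '/' },
  { type: 'allow', path: '/blog/' },
];
console.log(isCrawlAllowed('/blog/post-1', rules)); // true: /blog/ is more specific
console.log(isCrawlAllowed('/pricing', rules));     // false: only Disallow: / matches
```

This is exactly why the "carve-out" pattern in the next section works: the more specific Allow path overrides the broader Disallow.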

The Right robots.txt for a SaaS Marketing Site

Most SaaS companies have a marketing site (public-facing, fully indexable) and a product app (authenticated, should not be indexed). Here's the baseline template for a marketing site:

robots.txt — SaaS marketing site (baseline)
# autoseobot.com robots.txt
# Updated: 2026-04-15

User-agent: *
# Allow everything by default — only block what needs protecting
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /login
Disallow: /signup
Disallow: /checkout/
# Do not add Disallow: /_next/ here: Googlebot needs /_next/static/ (CSS/JS) to render
Disallow: /cdn-cgi/        # Cloudflare internal paths

# Allow crawling of all public content
Allow: /blog/
Allow: /pricing/
Allow: /features/
Allow: /

# Declare sitemap location
Sitemap: https://yoursite.com/sitemap.xml

If your app lives at a separate subdomain (app.yoursite.com), that subdomain needs its own robots.txt completely blocking Googlebot:

robots.txt — app subdomain (block all crawlers)
# app.yoursite.com/robots.txt
# This is the authenticated product — do not index

User-agent: *
Disallow: /

# No sitemap declaration

What to Always Block in robots.txt

For a typical SaaS site, the paths worth disallowing are the ones with no search value or with duplication risk, as in the template above:

- Internal APIs (/api/): no content for searchers, and crawling them wastes crawl budget
- Admin and dashboard areas (/admin/, /dashboard/): authenticated pages Google can't render anyway
- Auth and checkout flows (/login, /signup, /checkout/): thin pages that should never rank
- Infrastructure paths (e.g. /cdn-cgi/): machine-generated, with no indexable content

What You Must Never Block

Next.js sites: Do NOT block /_next/static/. This directory contains your compiled CSS, JavaScript, and fonts. Blocking it prevents Googlebot from rendering your pages correctly, which can result in pages being treated as empty by Google's indexer. A blanket Disallow: /_next/ also blocks /_next/static/, so avoid it; if you must hide other Next.js internals from crawlers, pair a narrow Disallow with an explicit Allow: /_next/static/ exception so rendering assets stay crawlable.
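If you do want to keep other Next.js internals away from crawlers, a sketch of the safe pattern pairs the broad rule with an explicit exception. Because the longest matching path wins for Google, the static assets stay crawlable:

```text
User-agent: *
Allow: /_next/static/    # compiled CSS/JS/fonts stay crawlable
Disallow: /_next/        # everything else under /_next/ is blocked
```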

The 7 robots.txt Mistakes That Cost SaaS Companies Rankings

Mistake 1: Disallow: / (blocking the entire site)

❌ The problem

A developer adds Disallow: / for a staging environment and pushes the same robots.txt to production. Or a CMS migration tool overwrites the production robots.txt with a "block all" version. Google stops crawling the entire site. Traffic drops gradually over weeks as cached pages expire.

✅ The fix

Check yoursite.com/robots.txt right now. If it says Disallow: / with no other rules, you have a site-wide block. Remove it immediately, then open the URL Inspection tool in Google Search Console for your homepage and click "Request Indexing" to trigger a recrawl. Search Console also emails property owners when it detects that important pages have become blocked, so make sure those notifications go to an inbox someone actually reads.

Mistake 2: Accidentally blocking CSS and JavaScript

❌ The problem

An older SEO guide suggested blocking CSS and JS to "save crawl budget." This was wrong then and catastrophic now. Google renders pages using a headless Chromium instance — if it can't load your CSS and JS, it sees a broken, unstyled page with potentially no visible content. Rankings tank for affected pages.

✅ The fix

Remove any rules like Disallow: *.css or Disallow: *.js. Then use Google Search Console's URL Inspection tool to "Test Live URL" — it shows the rendered version of the page. If it looks broken or empty, crawlers have been seeing the same broken version.

Mistake 3: Blocking a blog directory during migration and forgetting

❌ The problem

During a content migration, a developer adds Disallow: /blog/ to prevent Google from indexing half-migrated content. The migration completes, but the disallow rule stays. Every new blog post is blocked from the moment it's published. The team may still see new posts appearing in search, but only as contentless URL shells discovered through the sitemap, never crawled and ranked properly.

✅ The fix

Add a comment to any temporary disallow rule: # TEMPORARY — remove after migration completes [date]. After any site migration, audit the full robots.txt as part of your launch checklist. Monitor Search Console's Page indexing report (formerly Coverage) for "Blocked by robots.txt" entries.

Mistake 4: Platform-generated robots.txt overwriting your custom rules

❌ The problem

Webflow, WordPress, or a headless CMS generates its own robots.txt file automatically, and your custom rules can conflict with, or be silently overwritten by, the platform-generated version: a plugin update, theme change, or site republish can revert your edits without warning.

✅ The fix

Always verify what's actually being served at yoursite.com/robots.txt, not just what you've configured in your CMS. Fetch it with curl -s https://yoursite.com/robots.txt and compare against your intended configuration. For Next.js, generate robots.txt at build time with the next-sitemap package or the App Router's app/robots.ts metadata route (covered below).

Mistake 5: Staging URLs crawlable and indexed

❌ The problem

Your Vercel preview deployment (yoursite-abc123.vercel.app) has no robots.txt blocking crawlers. Google discovers it via links in PR comments, Slack previews, or your own Vercel dashboard. Google indexes near-identical versions of your production pages at a different URL — creating duplicate content issues and potentially ranking the staging version instead of production.

✅ The fix

For Vercel: use vercel.json to add an X-Robots-Tag: noindex response header on preview deployments. For Next.js, conditionally set robots.txt to Disallow: / when VERCEL_ENV !== 'production'. Also add the canonical tag pointing to production on every staging page as a secondary measure.
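One way to serve that noindex header from Next.js itself, rather than vercel.json, is a conditional headers() rule in the config file. A sketch assuming Next.js 15+ (which supports a TypeScript config file); VERCEL_ENV is set automatically by Vercel:

```typescript
// next.config.ts: add X-Robots-Tag: noindex to every response on
// non-production deployments so previews never get indexed.
import type { NextConfig } from 'next';

const isProduction = process.env.VERCEL_ENV === 'production';

const nextConfig: NextConfig = {
  async headers() {
    if (isProduction) return []; // production: no noindex header
    return [
      {
        source: '/:path*',
        headers: [{ key: 'X-Robots-Tag', value: 'noindex, nofollow' }],
      },
    ];
  },
};

export default nextConfig;
```

The header approach beats a robots.txt block here: the header tells Google "don't index this," while a robots.txt Disallow would merely hide the pages without preventing shell indexing.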

Mistake 6: Using wildcards incorrectly

❌ The problem

Disallow: /*?* is meant to block URLs with query parameters. But if written as Disallow: /* (missing the ?), it blocks all URLs — the asterisk matches everything. Similarly, Disallow: /blog* blocks /blog, /blog-archive, /blog-categories, and /blog/ — probably not intended.

✅ The fix

Be explicit. Use Disallow: /blog/ with a trailing slash to block only the blog directory, not every URL that happens to start with "blog." Verify wildcard rules before deploying: Search Console's robots.txt report shows exactly what Google fetched, and the URL Inspection tool tells you whether a specific URL is crawlable.

Mistake 7: No Sitemap declaration

❌ The problem

robots.txt makes no mention of the sitemap. Googlebot has to discover your sitemap through other means (Search Console submission, links). Pages not yet linked internally may not get discovered or crawled efficiently. New blog posts can take weeks to be indexed instead of days.

✅ The fix

Always end your robots.txt with Sitemap: https://yoursite.com/sitemap.xml. If you have multiple sitemaps (blog, products, pages), list each one on a separate Sitemap: line. This is one of the highest-leverage lines in your robots.txt — it tells every crawler exactly where to find your content index.
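With multiple sitemaps, the declarations simply stack, one per line (the per-section filenames here are illustrative; use whatever paths your generator produces):

```text
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/blog-sitemap.xml
Sitemap: https://yoursite.com/pages-sitemap.xml
```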

robots.txt for Common SaaS Tech Stacks

Next.js (App Router)

In Next.js 13+ App Router, create an app/robots.ts file:

Next.js — app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  const isProduction = process.env.VERCEL_ENV === 'production'

  return {
    rules: isProduction
      ? [
          {
            userAgent: '*',
            allow: '/',
            disallow: ['/api/', '/admin/', '/dashboard/', '/login', '/signup'],
          },
        ]
      : [
          {
            userAgent: '*',
            disallow: '/',  // Block all crawlers on staging/preview
          },
        ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  }
}

Next.js (Pages Router / static file)

Place robots.txt in the public/ directory:

public/robots.txt — Pages Router
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /login
Disallow: /signup
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Webflow

In Webflow: Site Settings → SEO → Custom robots.txt. Webflow will serve this at /robots.txt. If this field is empty, Webflow generates a default permissive robots.txt. Always fill it in explicitly.

WordPress

WordPress serves a virtual robots.txt generated dynamically at the site root. Override it with a plugin (Rank Math and Yoast both include robots.txt editors) or by uploading a physical robots.txt file to your web root, which takes precedence over the virtual one. The WordPress default allows everything, so make sure plugins or custom code haven't added unintended Disallow rules.

How to Audit Your robots.txt Right Now

  1. Fetch the live file: curl -s https://yoursite.com/robots.txt — read every line carefully.
  2. Check for site-wide block: Look for Disallow: / with no Allow exceptions. If it's there, fix immediately.
  3. Check crawlability in Google Search Console: open the robots.txt report (Settings → robots.txt) to confirm Google fetches the file successfully, then run URL Inspection on your homepage, a blog post, and your pricing page. None should be blocked by robots.txt.
  4. Check Search Console's Page indexing report: under "Why pages aren't indexed," look for "Blocked by robots.txt." If you see important pages here, you have a misconfiguration.
  5. Confirm Sitemap is declared: Your robots.txt should include at least one Sitemap: line with an absolute URL.
  6. Verify no CSS/JS is blocked: Search your robots.txt for .css, .js, _next, or static — if you see Disallow rules targeting these, remove them.

Quick health check: Run curl -s https://yoursite.com/robots.txt | grep -i disallow to list every blocked path in one view. Review each line and ask: "Does this path have content Google should see?" If yes — it should be allowed.
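The grep-level check can be scripted end to end. A minimal TypeScript sketch (Node 18+ for global fetch; the heuristics mirror the audit steps above and are illustrative, not exhaustive):

```typescript
// Flags the three highest-risk problems: a site-wide block, Disallow
// rules that could break rendering (CSS/JS/static assets), and a
// missing Sitemap declaration.
function auditRobotsText(text: string): string[] {
  const warnings: string[] = [];
  const lines = text
    .split('\n')
    .map((l) => l.split('#')[0].trim()) // strip comments
    .filter((l) => l.length > 0);

  for (const line of lines) {
    const lower = line.toLowerCase();
    if (lower === 'disallow: /' || lower === 'disallow: /*') {
      warnings.push(`Site-wide block: "${line}"`);
    }
    if (
      lower.startsWith('disallow:') &&
      /(\.css|\.js|_next\/static|\/static\/)/.test(lower)
    ) {
      warnings.push(`May break rendering: "${line}"`);
    }
  }
  if (!lines.some((l) => l.toLowerCase().startsWith('sitemap:'))) {
    warnings.push('No Sitemap: declaration found');
  }
  return warnings;
}

// Usage (Node 18+ has a global fetch):
// const text = await (await fetch('https://yoursite.com/robots.txt')).text();
// console.log(auditRobotsText(text));
```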

Frequently Asked Questions

What is a robots.txt file and what does it do?

A robots.txt file at yoursite.com/robots.txt tells search engine crawlers which pages they can and cannot crawl. Googlebot checks it before crawling any page. If a page is disallowed, Google won't crawl it — but may still index the URL if other pages link to it. Robots.txt controls crawling, not indexing.

Does robots.txt prevent a page from being indexed in Google?

No — robots.txt prevents crawling, not indexing. If other pages link to a disallowed URL, Google can still index it as a shell (URL without content). To prevent indexing, use a noindex meta tag — but Google must be able to crawl the page to read that tag. So for pages you never want indexed: either disallow AND ensure no links point to them, or allow crawling and add noindex.

Should I disallow my staging environment in robots.txt?

Yes — add Disallow: / to your staging domain's robots.txt, plus a noindex HTTP header on all pages. For Vercel preview deployments, use the VERCEL_ENV environment variable to serve a blocking robots.txt on non-production environments. Password protection is the most reliable option.

What's the difference between robots.txt Disallow and noindex?

Disallow blocks crawling: Googlebot won't fetch the page at all. noindex blocks indexing: Googlebot crawls the page, reads the tag, and drops it from the index. Because Google must crawl a page to read its noindex tag, combining Disallow with noindex means the noindex is never seen. Use Disallow for admin areas and internal APIs. Use noindex for crawlable-but-not-indexable pages like thank-you pages and filtered results.

How do I check if my robots.txt is blocking Googlebot?

Use Google Search Console's robots.txt report (Settings → robots.txt) to confirm which file Google last fetched, then run URL Inspection on your homepage, blog directory, and pricing page; none should be blocked by robots.txt. Also check the Page indexing report for pages listed under "Blocked by robots.txt." Run curl -s https://yoursite.com/robots.txt to see exactly what crawlers see.

Can robots.txt hurt my SEO?

Yes — severely misconfigured robots.txt can destroy organic traffic. Common damage: blocking your homepage with Disallow: /, blocking CSS/JS so Google sees broken pages, or blocking a blog directory and not noticing for months. The most dangerous aspect is that mistakes are silent — traffic drops gradually as Google stops refreshing pages, with no immediate error in your analytics.

Want us to audit your robots.txt?

Our free technical audit checks your robots.txt for blocking mistakes, CSS/JS access, sitemap declaration, staging exposure, and crawl budget waste — with a prioritized fix list delivered within 24 hours.

Get Your Free Technical Audit