We ran a technical audit on a funded Indian SaaS company last month. They'd been live for two years, published 60+ blog posts, and built backlinks through guest posting. Their organic traffic had flatlined for 18 months.
The culprit: a single line in their robots.txt file. A developer had blocked the entire /blog/ directory during a content migration and forgotten to remove the rule afterward.
Sixty blog posts. Invisible to Google. For over a year.
robots.txt is the most dangerous file on your website — not because it's complicated, but because mistakes are silent. Google won't email you. Search Console may flag it if the blocked pages are important enough, but it won't catch everything. By the time you notice organic traffic dropping, weeks or months may have passed.
What Is robots.txt?
robots.txt is a plain text file that lives at the root of your domain: https://yoursite.com/robots.txt. It follows the Robots Exclusion Protocol (REP) and tells search engine crawlers — Googlebot, Bingbot, and others — which pages they're allowed to crawl.
The file structure is simple:
# Basic robots.txt structure
User-agent: * # Applies to all bots
Disallow: /admin/ # Block this directory
Allow: /public/ # Explicitly allow this path
User-agent: Googlebot # Applies only to Google
Disallow: /api/ # Block API paths from Google only
Sitemap: https://yoursite.com/sitemap.xml # Declare your sitemap
Key rules:
- User-agent: * applies to all crawlers. User-agent: Googlebot applies only to Google's crawler.
- Disallow: /path/ blocks a directory. Disallow: /specific-page.html blocks a single page.
- Disallow: with no path after it — or not having the file at all — means allow everything.
- Rule order doesn't matter to Google: the most specific (longest) matching rule wins.
Critical distinction: robots.txt controls crawling, not indexing. Blocking a page in robots.txt prevents Googlebot from reading it — but if other pages link to that URL, Google can still index it as a shell (URL + title from backlinks, no content). To prevent indexing, use <meta name="robots" content="noindex"> on the page itself — but Google must be able to crawl it to read that tag.
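To make the distinction concrete, here is a minimal sketch of how a Next.js App Router page would carry that noindex tag. The /thank-you route is purely illustrative:

// app/thank-you/page.tsx (hypothetical route, shown only to illustrate noindex)
import type { Metadata } from 'next'

// Renders <meta name="robots" content="noindex, nofollow"> in the page head.
// The page must remain crawlable (not disallowed in robots.txt) or Google never reads this.
export const metadata: Metadata = {
  robots: { index: false, follow: false },
}

export default function ThankYouPage() {
  return <h1>Thanks, we will be in touch.</h1>
}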
The robots.txt Directives You Need to Know
| Directive | What it does | Example |
|---|---|---|
| User-agent | Specifies which bot the following rules apply to. * = all bots. | User-agent: Googlebot |
| Disallow | Tells the specified bot not to crawl this path. Empty value = allow all. | Disallow: /admin/ |
| Allow | Explicitly allows a path — used to carve out exceptions within a broader Disallow. | Allow: /blog/ |
| Sitemap | Declares the URL of your sitemap. Google uses this to discover pages efficiently. | Sitemap: https://yoursite.com/sitemap.xml |
| Crawl-delay | Asks bots to wait N seconds between requests. Google officially ignores this — use Search Console's crawl rate settings instead. | Crawl-delay: 10 |
The Right robots.txt for a SaaS Marketing Site
Most SaaS companies have a marketing site (public-facing, fully indexable) and a product app (authenticated, should not be indexed). Here's the baseline template for a marketing site:
# autoseobot.com robots.txt
# Updated: 2026-04-15
User-agent: *
# Allow everything by default — only block what needs protecting
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /login
Disallow: /signup
Disallow: /checkout/
Disallow: /_next/ # Next.js internal routes (see the Allow below)
Allow: /_next/static/ # Googlebot needs these compiled assets to render pages
Disallow: /cdn-cgi/ # Cloudflare internal paths
# Allow crawling of all public content
Allow: /blog/
Allow: /pricing/
Allow: /features/
Allow: /
# Declare sitemap location
Sitemap: https://yoursite.com/sitemap.xml
If your app lives on a separate subdomain (app.yoursite.com), that subdomain needs its own robots.txt that blocks all crawlers:
# app.yoursite.com/robots.txt
# This is the authenticated product — do not index
User-agent: *
Disallow: /
# No sitemap declaration
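As a second line of defense, the app itself can send a noindex header on every response. A minimal sketch using Next.js middleware, assuming the product is a Next.js app; adapt the idea to whatever framework serves app.yoursite.com:

// middleware.ts (sketch: send a noindex header from the authenticated app)
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

export function middleware(request: NextRequest) {
  const response = NextResponse.next()
  // Belt and braces: even if a crawler reaches a URL on this subdomain,
  // the X-Robots-Tag header tells it not to index the page.
  response.headers.set('X-Robots-Tag', 'noindex, nofollow')
  return response
}

Remember the crawl/index distinction above: a crawler only sees this header on URLs it is allowed to fetch, so it mainly protects URLs that slip past the robots.txt block.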
What to Always Block in robots.txt
- Admin and authentication paths: /admin/, /wp-admin/, /login, /signup, /auth/. These pages have no SEO value and shouldn't be indexed.
- API endpoints: /api/. JSON responses from API endpoints aren't indexable content — blocking them saves crawl budget.
- Internal application routes: Anything behind authentication (/dashboard/, /settings/, /account/). Googlebot can't log in — these pages return login redirects or errors, wasting crawl budget.
- Checkout and payment flows: /checkout/, /payment/. These are transactional flows with no indexable content and may contain session-sensitive URLs.
- Search result pages: /search?, /?s=. Internal search results are dynamically generated, often thin content, and create an infinite crawl loop if not blocked.
- Staging and test paths: /staging/, /test/, /dev/. If these are subpaths on your production domain (rare but happens), block them.
What You Must Never Block
- Your blog and content directories: /blog/, /resources/, /case-studies/. This is the content you've invested in — never block it.
- Your CSS and JavaScript files: Google renders pages to understand their content. If Googlebot can't load your CSS/JS, it can't render your pages correctly and may rank them poorly or not at all. Never block *.css or *.js via robots.txt.
- Your sitemap: Never disallow your sitemap path.
- Core marketing pages: Homepage, pricing page, features pages, landing pages. These should always be crawlable.
- Images: Blocking image crawling (Disallow: /*.jpg$) removes your images from Google Images and may affect ranking signals for pages that include them.
Next.js sites: Do NOT block /_next/static/. This directory contains your compiled CSS, JavaScript, and fonts. Blocking it prevents Googlebot from rendering your pages correctly, which can result in pages being treated as empty by Google's indexer. If you disallow /_next/ (as in the template above), pair it with Allow: /_next/static/ so rendering assets stay crawlable; reserve a blanket Disallow: /_next/ for bots that don't render pages.
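Expressed in the app/robots.ts format covered in the Next.js section later in this post, the carve-out looks like this. A sketch only; the exact paths depend on your app:

// app/robots.ts (sketch: block /_next/ in general but keep rendering assets crawlable)
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      // Google applies the most specific (longest) matching rule,
      // so this Allow wins over the broader Disallow for /_next/static/ URLs.
      allow: ['/', '/_next/static/'],
      disallow: ['/_next/'],
    },
  }
}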
The 7 robots.txt Mistakes That Cost SaaS Companies Rankings
Mistake 1: Disallow: / (blocking the entire site)
❌ The problem
A developer adds Disallow: / for a staging environment and pushes the same robots.txt to production. Or a CMS migration tool overwrites the production robots.txt with a "block all" version. Google stops crawling the entire site. Traffic drops gradually over weeks as cached pages expire.
✅ The fix
Check yoursite.com/robots.txt right now. If it says Disallow: / with no other rules, you have a site-wide block. Remove it immediately, then run your homepage through the URL Inspection tool in Google Search Console and request indexing to trigger a recrawl. Search Console also emails you when it detects new "Blocked by robots.txt" issues on important pages, so make sure those notifications reach someone who will act on them.
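If you want that check automated, here is a minimal sketch in TypeScript (Node 18+, built-in fetch). The URL is a placeholder and the check is deliberately naive, so treat a warning as a prompt to open the file, not as a verdict:

// check-robots.ts (sketch: warn if robots.txt contains a bare "Disallow: /")
const ROBOTS_URL = 'https://yoursite.com/robots.txt' // placeholder: use your own domain

async function main(): Promise<void> {
  const res = await fetch(ROBOTS_URL)
  if (!res.ok) throw new Error(`Could not fetch robots.txt: HTTP ${res.status}`)

  const lines = (await res.text())
    .split('\n')
    .map((line) => line.split('#')[0].trim()) // strip comments and whitespace

  // A line that is exactly "Disallow: /" blocks everything for its user-agent group
  if (lines.some((line) => /^disallow:\s*\/$/i.test(line))) {
    console.error('WARNING: robots.txt contains "Disallow: /" (site-wide crawl block)')
    process.exit(1)
  }
  console.log('No site-wide Disallow found')
}

main()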
Mistake 2: Accidentally blocking CSS and JavaScript
❌ The problem
An older SEO guide suggested blocking CSS and JS to "save crawl budget." This was wrong then and catastrophic now. Google renders pages using a headless Chromium instance — if it can't load your CSS and JS, it sees a broken, unstyled page with potentially no visible content. Rankings tank for affected pages.
✅ The fix
Remove any rules like Disallow: *.css or Disallow: *.js. Then use Google Search Console's URL Inspection tool to "Test Live URL" — it shows the rendered version of the page. If it looks broken or empty, crawlers have been seeing the same broken version.
Mistake 3: Blocking a blog directory during migration and forgetting
❌ The problem
During a content migration, a developer adds Disallow: /blog/ to prevent Google from indexing half-migrated content. The migration completes, but the disallow rule stays. Every new blog post is blocked from the moment it's published. The team may still see new posts surface in search as bare, content-less results: Google discovered the URLs through the sitemap and internal links but never crawled the pages themselves.
✅ The fix
Add a comment to any temporary disallow rule: # TEMPORARY — remove after migration completes [date]. After any site migration, audit the full robots.txt as part of your launch checklist. Monitor Search Console's Coverage report for "Blocked by robots.txt" errors.
Mistake 4: Platform-generated robots.txt overwriting your custom rules
❌ The problem
Webflow, WordPress, or a headless CMS generates its own robots.txt file automatically. Your custom rules conflict with or are overwritten by the platform-generated version. Webflow by default generates a minimal robots.txt — if you've published a custom one, the platform version may take precedence.
✅ The fix
Always verify what's actually being served at yoursite.com/robots.txt — not just what you've configured in your CMS. Fetch it with curl -s https://yoursite.com/robots.txt and compare against your intended configuration. For Next.js, use the next-sitemap package or an app/robots.ts metadata route (shown below) to generate robots.txt deterministically at build time.
Mistake 5: Staging URLs crawlable and indexed
❌ The problem
Your Vercel preview deployment (yoursite-abc123.vercel.app) has no robots.txt blocking crawlers. Google discovers it via links in PR comments, Slack previews, or your own Vercel dashboard. Google indexes near-identical versions of your production pages at a different URL — creating duplicate content issues and potentially ranking the staging version instead of production.
✅ The fix
For Vercel: use vercel.json to add an X-Robots-Tag: noindex response header on preview deployments. For Next.js, conditionally set robots.txt to Disallow: / when VERCEL_ENV !== 'production'. Also add the canonical tag pointing to production on every staging page as a secondary measure.
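One way to set that header is the same middleware idea as the app-subdomain example earlier, gated on the environment. A sketch assuming the preview runs on Vercel and exposes VERCEL_ENV; the vercel.json header route works too:

// middleware.ts (sketch: noindex every response on non-production deployments)
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

export function middleware(request: NextRequest) {
  const response = NextResponse.next()
  // VERCEL_ENV is 'production', 'preview', or 'development' on Vercel
  if (process.env.VERCEL_ENV !== 'production') {
    response.headers.set('X-Robots-Tag', 'noindex, nofollow')
  }
  return response
}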
Mistake 6: Using wildcards incorrectly
❌ The problem
Disallow: /*?* is meant to block URLs with query parameters. But if written as Disallow: /* (missing the ?), it blocks all URLs — the asterisk matches everything. Similarly, Disallow: /blog* blocks /blog, /blog-archive, /blog-categories, and /blog/ — probably not intended.
✅ The fix
Be explicit. Use Disallow: /blog/ with a trailing slash to block only the blog directory — not URLs that happen to start with "blog." Test your wildcard rules using Google Search Console's robots.txt Tester before deploying.
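If you want to sanity-check a pattern locally first, the wildcard semantics are easy to approximate. A rough sketch that mirrors Google's documented matching rules (an asterisk matches any sequence, a trailing $ anchors to the end of the URL); it is not Google's implementation, so treat it as a sanity check only:

// robots-pattern.ts (sketch: approximate robots.txt wildcard matching against URL paths)
function robotsPatternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
    .join('.*') // each '*' matches any sequence of characters
  const anchored = escaped.endsWith('\\$')
    ? escaped.slice(0, -2) + '$' // trailing '$' in robots syntax anchors to the end of the URL
    : escaped
  return new RegExp('^' + anchored)
}

console.log(robotsPatternToRegExp('/blog/').test('/blog/my-post')) // true: inside the directory
console.log(robotsPatternToRegExp('/blog/').test('/blog-archive')) // false: trailing slash protects it
console.log(robotsPatternToRegExp('/blog*').test('/blog-archive')) // true: prefix match, probably unintended
console.log(robotsPatternToRegExp('/*?*').test('/pricing?ref=ad')) // true: catches query-string URLs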
Mistake 7: No Sitemap declaration
❌ The problem
robots.txt makes no mention of the sitemap. Googlebot has to discover your sitemap through other means (Search Console submission, links). Pages not yet linked internally may not get discovered or crawled efficiently. New blog posts can take weeks to be indexed instead of days.
✅ The fix
Always end your robots.txt with Sitemap: https://yoursite.com/sitemap.xml. If you have multiple sitemaps (blog, products, pages), list each one on a separate Sitemap: line. This is one of the highest-leverage lines in your robots.txt — it tells every crawler exactly where to find your content index.
robots.txt for Common SaaS Tech Stacks
Next.js (App Router)
In Next.js 13+ App Router, create an app/robots.ts file:
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  const isProduction = process.env.VERCEL_ENV === 'production'

  return {
    rules: isProduction
      ? [
          {
            userAgent: '*',
            allow: '/',
            disallow: ['/api/', '/admin/', '/dashboard/', '/login', '/signup'],
          },
        ]
      : [
          {
            userAgent: '*',
            disallow: '/', // Block all crawlers on staging/preview
          },
        ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  }
}
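A companion app/sitemap.ts keeps the Sitemap declaration above pointing at something real. A minimal sketch with placeholder URLs; in practice you'd build the list from your CMS or filesystem:

// app/sitemap.ts (sketch: serve sitemap.xml alongside robots.ts; URLs are placeholders)
import { MetadataRoute } from 'next'

export default function sitemap(): MetadataRoute.Sitemap {
  return [
    {
      url: 'https://yoursite.com/',
      lastModified: new Date(),
      changeFrequency: 'weekly',
      priority: 1,
    },
    {
      url: 'https://yoursite.com/pricing',
      lastModified: new Date(),
      changeFrequency: 'monthly',
      priority: 0.8,
    },
    // In a real build, map over blog posts fetched from your CMS here
  ]
}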
Next.js (Pages Router / static file)
Place robots.txt in the public/ directory:
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /login
Disallow: /signup
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Webflow
In Webflow: Site Settings → SEO → Custom robots.txt. Webflow will serve this at /robots.txt. If this field is empty, Webflow generates a default permissive robots.txt. Always fill it in explicitly.
WordPress
WordPress serves robots.txt dynamically at the site root; there's no physical file unless you create one. You can override the default through an SEO plugin (Rank Math and Yoast both include robots.txt editors in their Tools sections) or by uploading a static robots.txt to your web root. The WordPress default allows everything — make sure plugins or custom code haven't added unintended Disallow rules.
How to Audit Your robots.txt Right Now
- Fetch the live file: curl -s https://yoursite.com/robots.txt — read every line carefully.
- Check for site-wide block: Look for Disallow: / with no Allow exceptions. If it's there, fix immediately.
- Test critical URLs in Google Search Console: Settings → robots.txt Tester. Test your homepage, a blog post, and your pricing page. All should show "Allowed."
- Check Search Console Coverage: Under "Excluded," look for "Blocked by robots.txt." If you see important pages here, you have a misconfiguration.
- Confirm Sitemap is declared: Your robots.txt should include at least one Sitemap: line with an absolute URL.
- Verify no CSS/JS is blocked: Search your robots.txt for .css, .js, _next, or static — if you see Disallow rules targeting these, remove them.
Quick health check: Run curl -s https://yoursite.com/robots.txt | grep -i disallow to list every blocked path in one view. Review each line and ask: "Does this path have content Google should see?" If yes — it should be allowed.
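For a slightly deeper check, the sketch below tests a handful of critical paths against the live Disallow rules. It does simple prefix matching only and ignores wildcards, Allow overrides, and per-bot groups, so a warning means "go look", not "definitely blocked"; the domain and path list are placeholders:

// audit-robots.ts (sketch: flag critical paths that a Disallow rule appears to cover)
const SITE = 'https://yoursite.com' // placeholder: use your own domain
const CRITICAL_PATHS = ['/', '/blog/', '/pricing/', '/features/'] // adjust to your site

async function main(): Promise<void> {
  const res = await fetch(`${SITE}/robots.txt`)
  if (!res.ok) throw new Error(`Could not fetch robots.txt: HTTP ${res.status}`)

  // Collect every Disallow value, ignoring comments and blank values
  const disallows = (await res.text())
    .split('\n')
    .map((line) => line.split('#')[0].trim())
    .filter((line) => /^disallow:/i.test(line))
    .map((line) => line.slice(line.indexOf(':') + 1).trim())
    .filter((path) => path.length > 0)

  for (const page of CRITICAL_PATHS) {
    // Naive check: a path is "possibly blocked" if any Disallow value is a prefix of it
    const matches = disallows.filter((rule) => page.startsWith(rule))
    if (matches.length > 0) {
      console.warn(`POSSIBLY BLOCKED: ${page} (matched by ${matches.join(', ')})`)
    } else {
      console.log(`OK: ${page}`)
    }
  }
}

main()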
Frequently Asked Questions
What is a robots.txt file and what does it do?
A robots.txt file at yoursite.com/robots.txt tells search engine crawlers which pages they can and cannot crawl. Googlebot checks it before crawling any page. If a page is disallowed, Google won't crawl it — but may still index the URL if other pages link to it. Robots.txt controls crawling, not indexing.
Does robots.txt prevent a page from being indexed in Google?
No — robots.txt prevents crawling, not indexing. If other pages link to a disallowed URL, Google can still index it as a shell (URL without content). To prevent indexing, use a noindex meta tag — but Google must be able to crawl the page to read that tag. So for pages you never want indexed: either disallow AND ensure no links point to them, or allow crawling and add noindex.
Should I disallow my staging environment in robots.txt?
Yes — add Disallow: / to your staging domain's robots.txt, plus a noindex HTTP header on all pages. For Vercel preview deployments, use the VERCEL_ENV environment variable to serve a blocking robots.txt on non-production environments. Password protection is the most reliable option.
What's the difference between robots.txt Disallow and noindex?
Disallow blocks crawling — Googlebot won't fetch the page. noindex blocks indexing — Googlebot crawls the page, reads the tag, and drops it from the index. Google must crawl a page to read its noindex tag, so if you both disallow a page and add noindex, Google never sees the noindex. Use Disallow for admin areas and internal APIs. Use noindex for crawlable-but-not-indexable pages like thank-you pages and filtered results.
How do I check if my robots.txt is blocking Googlebot?
Use Google Search Console's robots.txt Tester (Settings > robots.txt). Test your homepage, blog directory, and pricing page — all should show "Allowed." Also check the Coverage report for pages showing "Blocked by robots.txt." Run curl -s https://yoursite.com/robots.txt to see exactly what crawlers see.
Can robots.txt hurt my SEO?
Yes — severely misconfigured robots.txt can destroy organic traffic. Common damage: blocking your homepage with Disallow: /, blocking CSS/JS so Google sees broken pages, or blocking a blog directory and not noticing for months. The most dangerous aspect is that mistakes are silent — traffic drops gradually as Google stops refreshing pages, with no immediate error in your analytics.
Want us to audit your robots.txt?
Our free technical audit checks your robots.txt for blocking mistakes, CSS/JS access, sitemap declaration, staging exposure, and crawl budget waste — with a prioritized fix list delivered within 24 hours.
Get Your Free Technical Audit