What the Bot — StratechMedia

Access control

robots.txt

A plain text file at the root of every website that tells automated crawlers — including AI systems — what they are and are not allowed to access. It has existed since 1994 and was originally designed for search engine bots. It was never designed with AI in mind, but it is now the primary instrument publishers use to block AI crawlers.

What it means for AI visibility

If you block a crawler in robots.txt, it cannot access your content from that point forward. This does not remove your content from AI systems that have already scraped it. It does not shape how AI currently represents you. It is a technical instruction — not a policy, not a legal declaration, not a licensing agreement.

robots.txt

# Allow all crawlers (default if no robots.txt exists)
User-agent: *
Allow: /

# Block all AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Block training, allow citation (nuanced approach)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

The difference between GPTBot (training crawler) and ChatGPT-User (live browsing) matters. Most publishers block both. Some block training but allow live citation.

In our scan of 5,125 domains: 21% block all AI crawlers — 47% allow all — 28% have set no policy at all.See the data →

Declaration standard

ai.txt

A declaration file developed by Spawning.ai that allows publishers to state specifically what AI systems may and may not do with their content — with more nuance than robots.txt allows. Where robots.txt says yes or no to access, ai.txt can say: no training, yes citation, contact us for licensing. Globally, 1.9% of publisher domains have implemented it. In Denmark, 25% have — the highest rate in our dataset.

What it means for AI visibility

ai.txt is what separates a block from a policy. A publisher with only robots.txt has issued a technical instruction. A publisher with ai.txt has made a documented institutional decision — one that specifies terms and can be referenced in a legal or commercial context.

ai.txt

# ai.txt — Spawning standard

[ai]
version = 1.0

[permissions]
allow_training = false
allow_citation = true
allow_summarization = true
allow_search_indexing = true

[contact]
licensing = mailto:rights@publisher.com
inquiries = mailto:ai@publisher.com

This configuration blocks training data use while explicitly allowing AI systems to cite and summarise content — the most commercially useful position for publishers who want to be in AI answers without surrendering training rights.

Globally, 1.9% of publisher domains have implemented ai.txt. In Denmark, 25% have — the highest rate in the dataset.See the data →

Authority signal

llms.txt

A machine-readable file that tells AI agents what a publisher considers authoritative — what to read, what to trust, and how to reach the publisher for licensing or commercial terms. Originally proposed by Jeremy Howard (fast.ai) in 2024. Globally, 13% of publisher domains have implemented it.

What it means for AI visibility

llms.txt is the file that makes a publisher legible to an AI agent — not just crawlable, but understandable. Without it, an AI system can reach your content but has no machine-readable guide to what you consider your most important material, who you are, or how to contact you commercially.

llms.txt

# llms.txt

> KrimiNyt is a Danish true crime media covering criminal cases,
> police investigations and court proceedings since 2023.
> Editorial contact: redaktion@kriminyt.dk
> Licensing: rights@kriminyt.dk
> Training data: not permitted without license agreement

## Key content areas

- [Cases](/sager)
- [Investigations](/efterforskning)
- [Court](/retten)

## Editorial standards

- [Editorial guidelines](/om-kriminyt/redaktionelle-retningslinjer)
- [Corrections policy](/om-kriminyt/rettelser)

llms.txt is written in plain language with Markdown structure. AI agents read it to understand who you are before they read your articles.

13% of 5,125 publisher domains scanned have published llms.txt.See the data →

Structured data

JSON-LD

JSON-LD (JavaScript Object Notation for Linked Data) is the primary mechanism by which AI systems understand what a piece of content is, who created it, and what rights apply. It sits inside a script tag in a page's HTML — invisible to human readers, but the first thing a structured crawler reads. 47% of publishers globally have implemented JSON-LD. The question is not just whether it is present, but what it says.

What it means for AI visibility

A publisher with no JSON-LD is present on the web but unidentifiable. An AI system sees 'a page with text.' A publisher with JSON-LD for their organisation, their editors, and their articles is a named editorial entity — one whose journalism is identifiable, citable, and attributable. This is the difference between being crawlable and being citable.

JSON-LD

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Nordic publishers block AI at three times the global rate",
  "datePublished": "2026-05-01",
  "author": {
    "@type": "Person",
    "name": "Susanne Sperling",
    "url": "https://stratechmedia.com/about"
  },
  "publisher": {
    "@type": "NewsMediaOrganization",
    "name": "StratechMedia"
  },
  "isAccessibleForFree": true,
  "copyrightHolder": {
    "@type": "Organization",
    "name": "StratechMedia"
  }
}
</script>

Most publishers with JSON-LD have only partial implementation — organisation name but no author, or author but no rights holder. Partial JSON-LD is better than none, but incomplete.

47% of 5,125 publishers globally have JSON-LD implemented. The question is not whether it is present — but what it says.See the data →

Structured data

NewsArticle schema

A specific Schema.org type that identifies a piece of content as journalism — as distinct from a blog post, a product page, or a social media post. When a publisher uses NewsArticle schema, they are telling AI systems: this content was produced by a named editor, has a publication date, belongs to a news organisation, and carries editorial accountability. In our dataset: 9% of Danish publishers use it. Norway, Sweden and Finland: 0%.

What it means for AI visibility

AI systems increasingly distinguish between editorial and non-editorial sources. A page with NewsArticle schema signals that this is journalism — with a rights holder, a date, and a human editor responsible for it. A page without it is just content. Publishers blocking AI crawlers with robots.txt have not told AI systems what that content is. They defend the door but have not put a name on the building.

NewsArticle schema

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline here",
  "datePublished": "2026-05-01T09:00:00+01:00",
  "author": [{
    "@type": "Person",
    "name": "Editor Name",
    "url": "https://example.com/editors/name"
  }],
  "publisher": {
    "@type": "NewsMediaOrganization",
    "name": "Publisher Name",
    "url": "https://example.com"
  },
  "articleSection": "Politics",
  "inLanguage": "da",
  "copyrightYear": 2026,
  "copyrightHolder": {
    "@type": "Organization",
    "name": "Publisher Name"
  }
}
</script>

Note @type: NewsMediaOrganization for the publisher — not just Organization. This specifically signals to AI systems that the publisher is a news entity.

In our dataset: 9% of Danish publishers use NewsArticle schema. Norway, Sweden and Finland: 0%.See the data →

Paywall signal

isAccessibleForFree

A Schema.org property that tells AI systems whether content is behind a paywall. Without it, an AI system or crawler that encounters a login prompt, a cookie gate, or an email form has no machine-readable way to know whether the content behind it is free or paid. Only 6.1% of Nordic publishers have implemented it.

What it means for AI visibility

If your site has any gate — registration wall, cookie consent, email form — AI crawlers may classify your entire site as paywalled, even if 90% of your content is free. isAccessibleForFree: true is the signal that says: this content is open. Without it, ChatGPT, Perplexity and Google may refuse to surface your content, citing 'gated content' — even when it is not.

isAccessibleForFree

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article title",
  "isAccessibleForFree": true,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": true,
    "cssSelector": ".article-body"
  }
}
</script>

isAccessibleForFree: false on paywalled articles is equally important — it tells AI systems not to try to surface content users cannot reach.

Brand visibility

Your brand in the LLMs

You do not have one identity in AI. You have potentially six — one per model. ChatGPT, Perplexity, Gemini, Claude, Grok and Copilot were each trained on different data, at different times, from different sources. ChatGPT might describe your organisation one way. Perplexity might describe it differently, based on what it has indexed most recently. Gemini might not know you exist at all. None of them are wrong — they are each reflecting whatever signal they received, from whatever source was available at the time they were trained or last updated.

What it means for AI visibility

This means your brand in AI answers is not a single thing you can control with one action. Each LLM is its own audience, with its own sources and its own update cycle. A publisher who appears correctly in ChatGPT may be misrepresented in Gemini, or missing entirely in Perplexity. The only way to know is to check — regularly, across all of them. Because every model update, every new training run, every change in the sources a model prioritises can change how you are described. What was accurate last month may not be accurate today.

Your brand in the LLMs

# Check: how does each LLM describe your organisation?

Prompt to test across all models:
"What does [publisher name] cover?
Who is the editor?
What are they known for?
Are they a reliable source on [topic]?"

Then check:
→ Is the description accurate?
→ Is the editor named correctly?
→ Is your coverage area described correctly?
→ Are you described as a publisher, or as something else?
→ Is the information current, or outdated by 1-2 years?

Platforms to test:
• chat.openai.com (ChatGPT)
• perplexity.ai
• gemini.google.com
• claude.ai
• copilot.microsoft.com

Discrepancies between models are normal and expected — but they should be known and monitored. A model that misidentifies your editorial stance, your coverage area, or your ownership structure is shaping how readers perceive you before they ever reach your site.

AI behaviour

Hallucination

A hallucination is when an AI system generates information that sounds accurate but is not. The model does not know it is wrong — it is producing a plausible-sounding output based on patterns in its training data, not on verified facts. For publishers, hallucination is not just a concern about what AI says about other topics. It is a concern about what AI says about you: your editorial stance, your ownership, your coverage area, your editors, your history.

What it means for AI visibility

If an AI system has insufficient or outdated information about your organisation, it fills the gap with inference. The result may be a description that is partially correct, subtly wrong, or in some cases completely fabricated — but stated with the same confidence as accurate information. Publishers who are not present in AI training data in a structured, authoritative way are more likely to be hallucinated about. The fix is not to complain to the model. The fix is to give it better source material: structured data, named authorship, documented editorial standards, and regular presence in authoritative sources that AI systems trust.

Hallucination

# Signs you may be being hallucinated about:

→ AI describes your outlet as covering topics you don't cover
→ AI names editors who no longer work for you
→ AI attributes articles to your publication that you didn't publish
→ AI describes your ownership or funding incorrectly
→ AI refers to your organisation by a name you don't use
→ AI gives your founding date, subscriber count or reach incorrectly

# What reduces hallucination risk:

→ JSON-LD with accurate Organisation and Person markup
→ Named editors on articles (bylines + author pages)
→ Wikipedia presence with accurate, cited information
→ Regular appearance in AI-indexed authoritative sources
→ llms.txt with accurate self-description
→ Consistent naming across all platforms (social, LinkedIn, press releases)

Hallucination risk is higher for smaller or regional publishers, because AI models have less training signal to draw on. The less you have told AI systems about yourself in structured, machine-readable form, the more room there is for inference.

Governance signal

Editorial policy page

A publicly accessible page that states how a publisher makes editorial decisions — who is responsible, what standards apply, how errors are corrected, and what the relationship is between editorial and commercial operations. For AI systems, this is a trust signal: it is how a publisher says, at the institutional level, that there is a human with accountability behind the content. In our dataset, 0% of Finnish publishers have an editorial policy page.

What it means for AI visibility

AI systems increasingly use editorial policy pages as a source credibility signal. Publishers without one are indistinguishable — at the machine level — from content farms and AI-generated content sites. For EU AI Act compliance, an editorial policy page that addresses AI use is one of the transparency provisions in the regulation — delayed, but on its way.

Editorial policy page

<!-- Minimum content for AI legibility: -->

Publisher name and legal entity
Editorial director: [name, contact]
Editorial standards: [link or text]
Corrections policy: [process described]
AI use policy: [how AI is used in editorial, if at all]
Commercial independence statement
Date last reviewed: [date]

<!-- Reference it in JSON-LD on homepage: -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsMediaOrganization",
  "name": "Publisher Name",
  "ethicsPolicy": "https://example.com/editorial-policy",
  "correctionsPolicy": "https://example.com/corrections",
  "masthead": "https://example.com/about/team"
}
</script>

The editorial policy page URL should be referenced in JSON-LD on the homepage and article pages — not just exist as a standalone page that no structured data points to.

Structured data

OpenGraph

OpenGraph (og:) meta tags are HTML properties originally designed to control how content appears when shared on social media. They specify title, description, image, and content type. AI crawlers and aggregators read them as a fast summary of what a page is about — before they read the body text. Most publishers have OpenGraph tags. Most have not reviewed them with AI systems in mind.

What it means for AI visibility

When an AI crawler or aggregator reads a publisher's page, og:title and og:description often form the first representation of that content in a summary or citation. If those tags are generic, outdated, or auto-generated, the AI's representation of your article may be based on bad metadata — not the article itself. og:type: article is a basic signal that this is content, not a homepage or a product page.

OpenGraph

<meta property="og:type" content="article" />
<meta property="og:title" content="Nordic publishers block AI at three times the global rate" />
<meta property="og:description" content="A data report on 689 publisher domains across five Nordic countries." />
<meta property="og:image" content="https://example.com/images/article-og.jpg" />
<meta property="og:url" content="https://example.com/article-slug" />
<meta property="article:published_time" content="2026-05-01T09:00:00+01:00" />
<meta property="article:author" content="https://example.com/editors/name" />

article:published_time and article:author are OpenGraph extensions for news content. They are not widely used but are read by AI systems that want to establish recency and attribution.

Technical signal

canonical

The canonical tag (rel="canonical") tells crawlers which URL is the authoritative version of a page when the same content exists at multiple URLs — AMP versions, paginated articles, syndicated copies, or URL parameter variants. Search engines have respected canonical tags for over a decade. AI crawlers increasingly do too.

What it means for AI visibility

Publishers with syndicated content, AMP pages, or complex URL structures risk having AI systems index the wrong version of their content — or index multiple copies and treat them as different sources. A canonical tag that points to the correct, permanent URL ensures that citations and attributions point to the publisher's own domain, not a syndication partner, a cached copy, or a platform-hosted version.

canonical

<!-- On all article pages: -->
<link rel="canonical" href="https://publisher.com/article-slug" />

<!-- On AMP version, point back to original: -->
<link rel="canonical" href="https://publisher.com/article-slug" />

<!-- On paginated series, each page is canonical to itself: -->
<link rel="canonical" href="https://publisher.com/article-slug/page/2" />

The most common canonical error for publishers is AMP pages without a canonical pointing back to the main domain. AI systems that index only the AMP version may attribute the content to Google's AMP cache rather than the publisher.

Technical signal

sitemap.xml

An XML file that lists all URLs on a site, with optional metadata about when each page was last updated and how frequently it changes. Search crawlers have used sitemaps for years to discover content systematically. AI crawlers use sitemaps for the same reason: to find pages they might otherwise miss. A sitemap submitted via robots.txt is one of the clearest signals a publisher can send about what they consider canonical content.

What it means for AI visibility

A publisher who blocks AI crawlers in robots.txt but has a sitemap is sending a locked door with a window. A publisher who allows AI crawlers but has no sitemap may find that only their most-linked pages are indexed — leaving everything behind a navigation layer, in archives, or in less-linked sections invisible to AI. For publishers with large back-catalogues, a sitemap is the difference between partial AI visibility and comprehensive AI visibility.

sitemap.xml

# In robots.txt, reference your sitemap:
Sitemap: https://publisher.com/sitemap.xml

# sitemap.xml structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://publisher.com/article-slug</loc>
    <lastmod>2026-05-01</lastmod>
    <news:news>
      <news:publication>
        <news:name>Publisher Name</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:title>Article headline</news:title>
      <news:publication_date>2026-05-01T09:00:00+01:00</news:publication_date>
    </news:news>
  </url>
</urlset>

The Google News sitemap extension (xmlns:news) is read by several AI systems in addition to Google News. It allows publishers to include structured metadata about articles directly in the sitemap — title, publication date, and language.

Structured data

Speakable schema

A Schema.org property that marks specific sections of an article as the most suitable for audio rendering and summarisation — originally designed for Google Assistant, now relevant for any AI system that needs to identify the most quotable, summary-worthy content on a page. It is a way of telling an AI: this is the part that matters most.

What it means for AI visibility

AI systems that generate summaries, answer questions, or power voice interfaces benefit from knowing which part of an article is the core claim — not the preamble, not the boilerplate, not the related-articles section. Speakable schema gives publishers direct editorial control over what gets quoted. Without it, the AI decides which sentence to pull — and it may pull the wrong one.

Speakable schema

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-headline", ".article-summary", ".article-lead"]
  }
}
</script>

Speakable schema is underused by publishers — it is one of the few structured data properties that directly signals editorial prioritisation to AI systems. Mark your headline and first two paragraphs as speakable at minimum.

Monetisation signal

ads.txt

Authorised Digital Sellers — a plain text file that lists which companies are authorised to sell programmatic advertising on a publisher's behalf. ads.txt has no direct effect on AI crawlers. Its relevance here is structural: it represents the revenue model that AI is disrupting. Publishers who are protecting their content from AI crawlers are, in most cases, protecting ad-supported pageviews. Understanding ads.txt is understanding what is at stake.

What it means for AI visibility

A publisher whose content is summarised by an AI and answered directly — without a click — loses the pageview. No pageview means no ad impression. No ad impression means no revenue from that piece of content. This is why blocking AI crawlers is commercially rational for ad-funded publishers right now. The tension is that blocking also removes you from AI citations — so you lose both the pageview and the citation. ads.txt is not a solution to this problem. It is the reason the problem is real.

ads.txt

# ads.txt — authorised sellers for publisher.com
# Format: domain, publisher-id, relationship, cert-authority-id

google.com, pub-1234567890, DIRECT, f08c47fec0942fa0
appnexus.com, 1234, DIRECT
rubiconproject.com, 5678, RESELLER, 0bfd66d529a55807

The ads.txt model — where publisher revenue depends on human clicks — is the core tension in the AI access debate. Publishers who shift to licensing and citation-based models (via Tollbit, Perplexity Publisher Program, or direct API deals) are building revenue that does not require a click.

Monetisation infrastructure

Tollbit

Tollbit is a commercial infrastructure layer that allows publishers to set a price for AI crawler access at the HTTP level. When a Tollbit-enabled AI crawler requests a page, it receives a price in the response headers. If it accepts the price, it pays and gets access. If not, it is blocked. Publishers set their own prices per page or per domain. Tollbit handles billing, access control, and reporting. It is the first working implementation of AI access as a metered commercial transaction.

What it means for AI visibility

Tollbit changes the question from 'open or closed' to 'at what price.' A publisher who has blocked all AI crawlers could unblock them for paying systems while keeping free access blocked. A publisher who has been open for free could start charging without losing citation. Tollbit requires the same infrastructure that enables structured AI access: a clear robots.txt policy, ideally llms.txt, and page-level canonical and structured data so AI systems know what they are paying for.

Tollbit

# To signal Tollbit availability, publishers add to robots.txt:
User-agent: TollbitBot
Allow: /

# Tollbit responds with HTTP headers on crawl:
# X-Robots-Tag: tollbit
# Tollbit-Price: 0.001 USD per page

# Publishers configure pricing at tollbit.com:
# - Set price per domain or per content type
# - Set free allowance (e.g. 100 pages/month before billing starts)
# - Set whitelist of AI companies that get free access

As of mid-2026, not all AI companies respect Tollbit pricing — compliance varies by crawler. Publishers using Tollbit can see in logs which crawlers paid, which ignored the price signal, and which were blocked.

Monetisation infrastructure

Perplexity Publisher Program

Perplexity's revenue-sharing arrangement for publishers whose content is used in AI-generated answers. Publishers who participate receive a share of advertising revenue generated by answers that draw on their content. Participation requires open access — publishers who block PerplexityBot cannot participate. The programme is the first AI platform to implement a structured commercial relationship with publishers at scale, and it requires the kind of infrastructure this glossary describes: llms.txt, structured data, and a clear access policy.

What it means for AI visibility

The Perplexity Publisher Program changes the calculation for publishers who are blocking AI crawlers. A publisher who blocks Perplexity gets no citation and no revenue. A publisher who participates may receive citation credit and a revenue share. The amounts are not yet large, but the model points to where AI monetisation for publishers is heading: away from pageview-based advertising and toward citation-based licensing. Publishers who block now may find themselves outside these arrangements as they grow.

Perplexity Publisher Program

# To participate in the Perplexity Publisher Program:
# 1. Unblock PerplexityBot in robots.txt:
User-agent: PerplexityBot
Allow: /

# 2. Add llms.txt to signal your content policy:
# (see llms.txt entry above)

# 3. Ensure articles have NewsArticle schema with:
# - headline, datePublished, author, publisher

# 4. Apply at perplexity.ai/publishers
# Perplexity reviews and confirms eligibility

The Perplexity Publisher Program is still evolving. Revenue shares and eligibility criteria have changed since launch. Check perplexity.ai/publishers for current terms.

Monetisation infrastructure

Cloudflare Pay Per Crawl

Cloudflare's implementation of metered AI crawler access, available to publishers using Cloudflare's CDN. Publishers can set a price per crawl request from AI systems. When a supported AI crawler requests a page, Cloudflare intercepts the request, presents the price, and either grants access (if the crawler accepts) or blocks it. Publishers who are already on Cloudflare can activate this without changing their tech stack.

What it means for AI visibility

Cloudflare Pay Per Crawl links access and payment at the infrastructure level — before a request even reaches the publisher's server. For publishers already using Cloudflare for CDN, DDoS protection, or caching, it is a low-friction path to charging AI crawlers without a separate integration. The constraint is the same as Tollbit: publishers who block AI crawlers entirely cannot participate. Open access is a prerequisite for commercial access.

Cloudflare Pay Per Crawl

# Cloudflare Pay Per Crawl is configured in the Cloudflare dashboard:
# - Navigate to: Security > Bots > AI Scrapers and Crawlers
# - Set pricing per 1,000 requests or per page type
# - Configure which AI user-agents are subject to pricing
# - Set a free tier (e.g. 1,000 requests/month before billing)

# No robots.txt changes required — Cloudflare handles access control
# before requests reach your origin server.

# Supported crawlers check for pricing headers:
# CF-AIBot-Price: 0.002 USD per 1000 requests

Not all AI crawlers respect Cloudflare's pricing signals — crawler compliance is voluntary until regulation requires it. Cloudflare publishes data on which crawlers pay and which do not.

Governance signal

EU AI Act compliance signals

The EU AI Act's transparency provisions — delayed but advancing — require providers of general-purpose AI models to document and disclose what content they trained on. For publishers, this creates an audit trail question: can you demonstrate what policy you had in place, when, and whether it was respected? The compliance signals are the same files already in this glossary — robots.txt, ai.txt, llms.txt — but now they carry a legal dimension. A policy file that existed at training time is a record. An absence of policy is a gap.

What it means for AI visibility

Publishers who have had a documented AI policy in place since before training occurred are in a stronger position to assert rights under the EU AI Act. Publishers who set no policy — or who have no dated record of what their policy was — may find it harder to demonstrate that their content was used without authorisation. The Act does not require publishers to have allowed access; it requires AI providers to disclose what was used. But the publisher's documented position is evidence in any dispute.

EU AI Act compliance signals

# Minimum compliance documentation for publishers:

1. robots.txt with dated AI policy
   → Keep a version history (e.g. via git or CMS audit log)

2. ai.txt with explicit terms
   → Date of implementation matters — earlier is stronger

3. llms.txt with licensing contact
   → Gives AI companies a documented path to legitimate licensing

4. Editorial policy page stating AI position
   → Required under EU AI Act for content AI disclosure

5. Server logs showing AI crawler activity
   → Evidence of which crawlers accessed what, and when

The EU AI Act deadline has moved — but the obligation on AI model providers to document what they trained on is coming. Publishers who want to assert rights need their own documentation in order now. A policy that exists only in someone's memory is not a policy.

Governance signal

Named editor signal

The presence of a named, identifiable human editor on a piece of content — a byline, an author page, and ideally a Person schema in JSON-LD that links the name to a stable URL. AI systems use named authorship as an editorial quality and attribution signal. Content with a named editor is more likely to be cited with attribution. Content without a named editor is indistinguishable — at the machine level — from AI-generated content, syndicated filler, or anonymous commentary.

What it means for AI visibility

As AI-generated content proliferates, named human editorship becomes a differentiator. A publisher whose articles carry structured author data — name, role, author page URL, linked social profiles — signals to AI systems that this content has human accountability behind it. A publisher whose articles have a byline in the body text but no structured author data has a human editor that machines cannot see.

Named editor signal

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "author": [{
    "@type": "Person",
    "name": "Editor Name",
    "url": "https://publisher.com/editors/name",
    "sameAs": [
      "https://www.linkedin.com/in/editorname",
      "https://twitter.com/editorname"
    ],
    "jobTitle": "Senior Reporter",
    "worksFor": {
      "@type": "NewsMediaOrganization",
      "name": "Publisher Name"
    }
  }]
}
</script>

sameAs links — to LinkedIn, Twitter/X, or a Wikipedia page — help AI systems confirm that the named editor is a real person with a verifiable identity. This is one of the strongest signals against hallucinated authorship.

Governance signal

AI labeling

Machine-readable and human-readable signals that indicate whether content was produced with AI assistance, generated fully by AI, or reviewed and edited by a human. Required under some jurisdictions; increasingly expected by platforms and AI systems. Schema.org has proposed properties for this; some publishers have implemented their own labeling conventions. The EU AI Act includes provisions on AI-generated content disclosure.

What it means for AI visibility

AI systems that index content need to distinguish between human journalism and AI-generated text — both for quality assessment and for their own training data curation. A publisher who labels AI-assisted content clearly is giving AI systems a signal they increasingly need. A publisher who does not label at all leaves AI systems to infer — and inference at scale tends to penalise ambiguity. For publishers who use AI tools in their workflow, labeling is also a trust signal to readers in an environment where AI-generated content is widespread.

AI labeling

<!-- Human-readable disclosure in article body: -->
<p class="ai-disclosure">
  This article was reported and written by a human editor.
  AI tools were used for research assistance only.
</p>

<!-- Machine-readable: proposed Schema.org approach -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "author": {"@type": "Person", "name": "Editor Name"},
  "creativeWorkStatus": "Published",
  "additionalProperty": {
    "@type": "PropertyValue",
    "name": "aiAssistanceLevel",
    "value": "research-only"
  }
}
</script>

There is no universal standard for AI labeling yet. The most credible approach is a combination: a human-readable disclosure in the article body, a structured data property where one exists, and an editorial policy page that describes your AI use policy in institutional terms.

The files that decideif AI can find you.

robots.txt

ai.txt

llms.txt

JSON-LD

NewsArticle schema

isAccessibleForFree

Your brand in the LLMs

Hallucination

Editorial policy page

OpenGraph

canonical

sitemap.xml

Speakable schema

ads.txt

Tollbit

Perplexity Publisher Program

Cloudflare Pay Per Crawl

EU AI Act compliance signals

Named editor signal

AI labeling

The files that decide
if AI can find you.