Context
In April 2026, all three major AI research platforms — Google (Gemini Deep Research), OpenAI (ChatGPT Deep Research), and Anthropic (Claude Opus 4.6 Thinking) — independently produced comprehensive analyses of the Agentic Reasoning Protocol.
These analyses converged on ARP's core thesis: the epistemological gap between
descriptive web standards and prescriptive AI cognition is real, and reasoning.json
addresses it. However, they also identified specific open research questions that
require community investigation.
This page consolidates those questions into a formal, open research agenda. Contributions are welcome via GitHub Issues.
RQ1: Standardized Evaluation Benchmarks
Source: ChatGPT Deep Research
Do AI-generated responses improve measurably when a domain's
reasoning.json is present in the retrieval context?
Proposed Methodology
- Select N domains across verticals (SaaS, consulting, e-commerce, healthcare)
- Generate baseline AI responses about each entity without ARP
- Deploy reasoning.json with verified corrections and context
- Re-query after indexing and measure: hallucination rate, factual accuracy, entity attribution correctness
- Use automated fact-checking against evidence_url references
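The before/after comparison in this methodology could be scored along the following lines. This is a minimal sketch, not an agreed benchmark: the `ScoredResponse` record, its field names, and the way claims are counted are all hypothetical, and it assumes each response has already been fact-checked by some separate process.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    """One fact-checked AI response (hypothetical record, not part of ARP)."""
    claims_total: int                   # factual claims extracted from the response
    claims_correct: int                 # claims confirmed against reference evidence
    claims_fabricated: int              # claims with no support in any reference
    entity_attributed_correctly: bool   # response describes the intended entity


def summarize(responses: list[ScoredResponse]) -> dict[str, float]:
    """Aggregate the three metrics named in the methodology over one batch."""
    total_claims = sum(r.claims_total for r in responses)
    if not responses or total_claims == 0:
        raise ValueError("need at least one response containing at least one claim")
    return {
        "hallucination_rate": sum(r.claims_fabricated for r in responses) / total_claims,
        "factual_accuracy": sum(r.claims_correct for r in responses) / total_claims,
        "attribution_correctness": sum(r.entity_attributed_correctly for r in responses)
        / len(responses),
    }


def delta(baseline: dict[str, float], treated: dict[str, float]) -> dict[str, float]:
    """Per-metric change from the baseline run to the run with reasoning.json deployed."""
    return {name: treated[name] - baseline[name] for name in baseline}
```

Calling `summarize` once on the baseline responses and once on the post-indexing responses, then passing both results to `delta`, gives a per-metric change. A negative change in `hallucination_rate` would indicate improvement. Whether any such change is statistically meaningful depends on N and is not addressed by this sketch.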
Open Sub-Questions
- Which AI platforms (Perplexity, ChatGPT, Gemini, Claude) show the strongest ARP responsiveness?
- Does the Pink Elephant Fix demonstrably outperform traditional negation-based corrections?
- What is the minimum indexing latency before ARP corrections take effect?
RQ2: Independent Experiment Replication
Source: ChatGPT Deep Research
Can the Ghost Site experiment, Canary Token forensics, and Citation Tracking results be independently replicated by third parties?
Experiments to Replicate
| Experiment | Original Finding | Replication Needs |
|---|---|---|
| Ghost Site | Dominant AI source within 24h | New domain, structured data only, multi-platform query |
| Canary Tokens | GPT/Gemini ingest reasoning.json | Unique tokens per platform, automated monitoring |
| Citation Tracking | 0% → 67% across 6 platforms in 22 days | Standardized query set, daily measurement |
| Zero Hallucination | Controlled ChatGPT case study | Multiple LLMs, statistical significance |
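For the Canary Tokens row, the "unique tokens per platform" requirement could be approached roughly as follows. This is an illustrative sketch only: the token format, the `arp-canary` prefix, and both function names are assumptions, and how each token is placed into reasoning.json and exposed to a given platform is left to the experimenter.

```python
import secrets

# Platforms named elsewhere on this page; adjust to the platforms under test.
PLATFORMS = ["ChatGPT", "Gemini", "Perplexity", "Claude"]


def make_canary_tokens(platforms: list[str], prefix: str = "arp-canary") -> dict[str, str]:
    """Create one unguessable, platform-labelled token per platform."""
    return {p: f"{prefix}-{p.lower()}-{secrets.token_hex(8)}" for p in platforms}


def find_leaked_tokens(response_text: str, tokens: dict[str, str]) -> list[str]:
    """Return the platforms whose token appears verbatim in a response."""
    return [p for p, token in tokens.items() if token in response_text]
```

Because each token is random, its appearance in a response is hard to explain by anything other than ingestion of the file that carried it. Automated monitoring would then amount to running `find_leaked_tokens` over logged responses on a schedule.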
RQ3: IETF Standardization Pathway
Source: ChatGPT Deep Research, Gemini Deep Research
What is the optimal standardization pathway for a .well-known URI
serving cognitive reasoning directives?
Current Status
- IETF Internet-Draft prepared: draft-deforth-arp-01 (not yet submitted to IETF Datatracker)
- W3C AIVS Community Group introduction in progress
Open Questions
- Should ARP pursue IETF RFC status, W3C Community Group Report, or both?
- How should the protocol handle versioning across RFC iterations?
- What is the relationship between ARP and the emerging AI Verifiable Standards (AIVS)?
RQ4: Multimodal Extension
Source: ChatGPT Deep Research
Can the ARP schema be extended to govern reasoning about non-text entities — images, video, IoT devices, autonomous vehicles?
Considerations
- Image agents: Can reasoning.json provide correction directives for visual AI (e.g., product image misidentification)?
- IoT agents: Can sensor-equipped autonomous systems use domain-hosted reasoning directives for decision boundaries?
- Video: Can reasoning directives be temporally scoped (valid for specific content windows)?
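To make the temporal-scoping question concrete, one possible shape for a time-windowed directive is sketched below. Nothing here is part of the ARP schema: the `ScopedDirective` type and its `start_s` and `end_s` fields are hypothetical, chosen only to show what "valid for specific content windows" could mean in practice.

```python
from dataclasses import dataclass


@dataclass
class ScopedDirective:
    """A directive that applies to one window of a video (hypothetical shape)."""
    text: str
    start_s: float  # window start, in seconds from the beginning of the video
    end_s: float    # window end in seconds, exclusive


def active_directives(directives: list[ScopedDirective], t_s: float) -> list[ScopedDirective]:
    """Return the directives whose window contains playback time t_s."""
    return [d for d in directives if d.start_s <= t_s < d.end_s]
```

A half-open window (`start_s <= t < end_s`) means back-to-back windows do not both match at their shared boundary. Whether overlapping windows should be allowed at all is one of the design questions this research question leaves open.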
RQ5: Trust Model Adversarial Analysis
Source: ChatGPT Deep Research, Gemini Deep Research
What are the attack surfaces of a self-attested reasoning file, and how effectively does v1.2 cryptographic signing mitigate them?
Threat Vectors
| Threat | ARP v1.1 Mitigation | ARP v1.2 Mitigation |
|---|---|---|
| False self-attestation | Good faith (same as schema.org) | Ed25519 signature = non-repudiation |
| Man-in-the-middle | HTTPS transport security | HTTPS + signature verification |
| Domain spoofing | DNS resolution | DNS TXT record binding |
| Competitor sabotage | Ethics policy | Signature attribution + community reporting |
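Any signature-based mitigation in the v1.2 column depends on signer and verifier agreeing on exactly which bytes are signed. This page does not specify that, so the sketch below simply assumes a deterministic JSON serialization and uses a SHA-256 digest to show tamper detection. It is a simplified stand-in, not a full canonicalization scheme such as RFC 8785, and it does not perform Ed25519 verification, which Python's standard library does not provide.

```python
import hashlib
import json


def canonical_bytes(doc: dict) -> bytes:
    """Serialize a JSON document deterministically (assumed scheme, not ARP's)."""
    # Sorted keys and fixed separators make key order and whitespace irrelevant.
    return json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def digest(doc: dict) -> str:
    """Hex SHA-256 digest of the canonical form; changes if any value changes."""
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()
```

Two files that differ only in key order or whitespace produce the same digest, while any change to a value produces a different one. An adversarial analysis would need to check that the real signing input has the same property.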
RQ6: Long-Term Search Impact
Source: ChatGPT Deep Research
What is the long-term impact of ARP on AI search results? Does the effect persist, amplify, or decay over time as AI models retrain?
Measurement Dimensions
- Citation persistence: Do AI platforms continue citing reasoning.json after model updates?
- Training integration: Do ARP directives eventually enter model training data?
- Competitive dynamics: When multiple entities in a vertical deploy ARP, how do AI systems resolve conflicting claims?
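The persist, amplify, or decay question could be operationalized by comparing citation rates before and after a known model update. The sketch below is one hedged way to do that; the function names, the 0.05 tolerance, and the idea of a single update date are assumptions made for illustration, not findings or recommendations from the original experiments.

```python
from datetime import date
from statistics import mean


def citation_rate(cited_flags: list[bool]) -> float:
    """Fraction of one day's standardized queries whose answer cited the domain."""
    if not cited_flags:
        raise ValueError("need at least one query result")
    return sum(cited_flags) / len(cited_flags)


def classify_trend(
    daily_rates: dict[date, float], model_update: date, tolerance: float = 0.05
) -> str:
    """Label the change in mean citation rate across a model update."""
    before = [rate for day, rate in daily_rates.items() if day < model_update]
    after = [rate for day, rate in daily_rates.items() if day >= model_update]
    if not before or not after:
        raise ValueError("need measurements on both sides of the update date")
    change = mean(after) - mean(before)
    if change > tolerance:
        return "amplify"
    if change < -tolerance:
        return "decay"
    return "persist"
```

The tolerance band exists so that day-to-day noise is not reported as a trend. A real study would likely replace it with a proper significance test over a longer measurement window.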
How to Contribute
This research agenda is open. We invite AI researchers, RAG engineers, and domain owners to contribute:
- Replicate experiments — Run the Ghost Site or Canary Token experiments independently and share results
- Propose benchmarks — Define standardized evaluation datasets via GitHub Issues
- Submit findings — Formal research contributions welcome via GitHub Issues or as independent publications
- Build integrations — LlamaIndex, CrewAI, AutoGen loaders welcome via Pull Request