Semantic Scholar API - Complete Developer Guide
The Semantic Scholar API (from Ai2 / Semantic Scholar) is one of the most practical ways to programmatically work with scholarly metadata: search papers by relevance, fetch paper details, list citations/references, look up authors and their publications, and—when you need higher throughput—download bulk datasets and query locally. The API uses standard HTTP verbs and responses, and is explicitly organized into three services: Academic Graph, Recommendations, and Datasets. (See the official overview and tutorial.)
1) What is the Semantic Scholar API?
Semantic Scholar is widely known as a free, AI-powered research search engine. For developers, the most relevant piece is its REST API that provides structured access to paper and author metadata and related graph relationships (citations, references, co-authorship, venues). According to Semantic Scholar’s API overview, the REST API helps you “find and explore scientific publication data about authors, papers, citations, venues, and more” and is organized into Academic Graph, Recommendations, and Datasets services. That framing matters: the API is not only a “search endpoint,” but a platform for graph traversal and bulk-access workflows.
What you can build
If you can describe your product as “research discovery,” “citation intelligence,” or “paper metadata enrichment,” you can likely use the Semantic Scholar API. Common builds include:
- Literature discovery tools: search by keywords, filter by year, and show the most relevant results with paper metadata.
- Citation graphs: traverse references and citations to visualize how ideas connect over time.
- Author pages: show an author profile, their papers, topics, and citation counts.
- Reading list workflows: enrich citations from a bibliography (DOIs, arXiv IDs, PubMed IDs) into structured paper objects.
- Recommendation feeds: recommend “similar papers” from a seed paper or a set of interests.
- Data pipelines: pull large-scale metadata via datasets, then run local analytics and custom ranking.
How “Semantic Scholar API” differs from scraping
Scraping an academic site is brittle and often violates terms. Semantic Scholar provides documented endpoints, clear request/response formats, and explicit best practices: use batch endpoints, limit fields, and download datasets for large needs. That combination is exactly what you want when building a production integration that must be stable over time.
2) The Semantic Scholar API suite (Academic Graph, Recommendations, Datasets)
The official tutorial states that Semantic Scholar provides three APIs, each with its own base URL:
| API | What it does | Base URL |
|---|---|---|
| Academic Graph API | Paper + author details, citations, references, and scholarly graph traversal | https://api.semanticscholar.org/graph/v1 |
| Recommendations API | Recommend papers similar to a given paper (and related workflows) | https://api.semanticscholar.org/recommendations/v1 |
| Datasets API | Download full datasets and incremental diffs for local querying and high-rate needs | https://api.semanticscholar.org/datasets/v1 |
The overview page explains the same high-level organization (Academic Graph, Recommendations, Datasets) and emphasizes that Academic Graph provides data for authors, papers, citations, venues, and more, including paper embeddings, with links back to semanticscholar.org.
Best for product apps
Use Academic Graph + Recommendations for interactive search, paper details, and personalized “similar papers.”
Best for heavy pipelines
Use Datasets for large-scale analyses, high-throughput enrichment, and local ranking.
3) Quickstart: your first requests
The official tutorial walks through searching for papers and authors, requesting paper details, and using recommendations and datasets. It also clarifies what an endpoint is, what query parameters and headers are, and how to interpret standard HTTP status codes (including 429 for rate limiting).
Base request shape
Most Semantic Scholar API requests follow a simple pattern:
- Choose a base URL (Graph, Recommendations, or Datasets).
- Choose a resource path (e.g., `/paper/search` or `/paper/{paperId}`).
- Add query parameters (e.g., `fields=...`, `limit=...`, year filters, offsets).
- Optionally add headers like `x-api-key` if authenticated.
Keep `fields` small (only ask for what you need). For even larger needs, download datasets and query locally.
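As an illustration, the request-shape steps above can be sketched as a small URL builder. This is plain stdlib Python, not an official client; the paths and field names mirror the examples in this guide:

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"

def build_request(path, params=None, api_key=None):
    """Assemble the URL and headers for a Graph API call."""
    url = GRAPH_BASE + path
    if params:
        # safe="," keeps fields=title,year readable instead of percent-encoding commas
        url += "?" + urlencode(params, safe=",")
    headers = {"Accept": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key  # send only from server-side code
    return url, headers

url, headers = build_request(
    "/paper/search",
    params={"query": "attention", "limit": 10, "fields": "title,year,citationCount"},
)
```

A real client would feed `url` and `headers` into whatever HTTP library you use, plus the caching and backoff layers discussed later in this guide.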
Example: fetch paper details (Graph API)
The tutorial shows the paper details endpoint pattern as /paper/{paper_id}, and demonstrates using a fields parameter
to retrieve exactly what you need (e.g., title, year, abstract, citationCount).
```http
# Example request shape
# Base URL (Graph API): https://api.semanticscholar.org/graph/v1
# Resource path: /paper/{paperId}
# Query params: fields=title,year,abstract,citationCount

GET https://api.semanticscholar.org/graph/v1/paper/{paperId}?fields=title,year,abstract,citationCount

# Optional header for authenticated usage:
x-api-key: YOUR_PRIVATE_API_KEY
```
In real apps, you’ll wrap this in a client that also handles caching, retries on transient errors, and backoff on 429 responses.
4) Authentication & API keys (and how auth really works)
Semantic Scholar supports public access for most endpoints, but also offers API keys for authenticated usage. The official overview page says certain endpoints require authentication via an API key and that using an API key is recommended as a best practice. It also warns: you receive a private API key via email and you should not share your key with anyone.
Header-based auth: x-api-key
The tutorial demonstrates adding your API key via request headers using the x-api-key header.

```python
headers = {
    "x-api-key": "YOUR_PRIVATE_API_KEY"
}
```
When you should request a key
Even if you can call many endpoints without auth, requesting an API key is generally worth it because:
- It gives you a dedicated (though initially modest) rate limit profile.
- It makes support and troubleshooting easier because your traffic is identifiable.
- Some endpoints require authentication.
Key management and safety
Treat the API key as a secret. Common failures include:
- Embedding the key in a public frontend bundle
- Checking the key into git
- Printing the key into logs or error traces
- Exposing the key through “copy as cURL” debugging tools
Instead, store keys in server-side environment variables, secrets managers, or CI/CD secret stores. Your client-side app should call your own backend, and the backend calls Semantic Scholar with the key.
5) Rate limits, throttling, and how to avoid 429s
Rate limits are one of the most important practical aspects of the Semantic Scholar API—because research apps often feel “bursty” by nature. Users search quickly, click around, open details, and request lots of citations.
Unauthenticated behavior
According to the official overview page, most endpoints are available without authentication, but unauthenticated access is rate-limited to 1000 requests per second shared among all unauthenticated users, and requests may be further throttled during heavy use.
Authenticated behavior (API key)
The same official overview page states that authenticated users have access to higher rate limits and recommends including your key with every request. It also notes: the introductory rate limit for an API key is 1 RPS on all endpoints.
That “1 RPS” starter limit surprises some developers. The right interpretation is: the platform is intentionally conservative for new keys, and you should design your integration using:
- Batch/bulk endpoints whenever possible
- Fields minimization to reduce response size and server load
- Caching for popular queries and repeat paper lookups
- Debouncing on search boxes and interactive UI controls
- Queue-based enrichment for high-volume pipelines
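One way to respect a 1 RPS budget on the client side is a small token bucket placed in front of the HTTP client. This is a sketch, not an official mechanism; the rate and capacity numbers are placeholders you should set to match your actual key's limit:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity` tokens."""

    def __init__(self, rate=1.0, capacity=1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def try_acquire(self):
        """Return True and spend a token if one is available, else False."""
        now = self.clock()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Callers that get `False` should delay or queue the request rather than fire it, which keeps your traffic smooth instead of bursty.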
Official best practices to reduce slowdowns
The tutorial’s “How to make requests faster and more efficiently” section is essentially a production checklist: use an API key, use batch endpoints, limit fields, and use datasets when you need higher request rates than API keys provide.
Backoff strategy (what to do on 429)
The tutorial includes 429 as “Too Many Requests” and advises slowing down. For production, implement:
- Exponential backoff with jitter
- Client-side request budgets per user session
- Retry only idempotent requests (GET; POST only if safe)
- Circuit breaker for prolonged throttling (serve cached results or degrade gracefully)
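The backoff schedule above reduces to a small helper: exponential growth, a cap, and random jitter. The base and cap values here are illustrative defaults, not official recommendations:

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25, rng=random.random):
    """Delay in seconds before retry `attempt` (1-indexed): exponential, capped, jittered."""
    delay = min(cap, base * (2 ** (attempt - 1)))
    return delay + rng() * jitter

# attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 3 -> ~4s, ..., capped near 30s
```

The jitter term matters more than it looks: without it, many clients that were throttled at the same moment retry at the same moment, recreating the spike that caused the 429 in the first place.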
6) Core endpoint map (Graph API focus)
The Graph API is the workhorse for most applications. The tutorial describes that the Academic Graph API “returns details about papers, paper authors,
paper citations and references,” and uses endpoints like /paper/search, /paper/{paperId}, and their bulk/batch variations.
Core resources you’ll use
Papers
Search papers by relevance, fetch details, list references/citations, and enrich external IDs into canonical paper objects.
Authors
Look up authors, fetch profiles, list an author’s papers, and build author-centric views.
References & citations
Traverse the graph: what a paper cites (references) and who cites the paper (citations).
Venues & metadata
Support filtering and display: venues, years, publication types, open access fields, and more (based on selected fields).
Common Graph API routes (conceptual)
Exact request/response shapes are best verified in the official API documentation, but these are the routes most developers rely on:
| Job to be done | Typical route pattern | Notes |
|---|---|---|
| Search papers by relevance | GET /paper/search | Use query, limit, year filters, and fields. |
| Paper details by ID | GET /paper/{paper_id} | Request only needed fields. |
| Bulk paper search | GET /paper/search/bulk | Tutorial references bulk endpoints for efficiency. |
| Batch paper details | POST /paper/batch | Tutorial references batch endpoints for efficiency. |
| Author lookup | GET /author/{author_id} | Often combined with author’s papers endpoints. |
| Author search | GET /author/search | Find author IDs from name queries. |
Note: Some of the exact paths and parameters are dynamic and maintained in the official docs and Postman collection; always verify before shipping.
IDs you’ll encounter (practical approach)
In scholarly systems, IDs are messy: DOI, arXiv IDs, PubMed IDs, internal corpus IDs, and canonical Semantic Scholar paper IDs. Your integration should:
- Store the canonical paper identifier returned by Semantic Scholar for future lookups
- Retain original IDs (DOI/arXiv/PMID) for traceability
- Deduplicate papers by DOI when possible, but don’t assume DOI exists for everything
- Log “ID resolution misses” for later cleanup
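A minimal sketch of that dedup policy follows. The record shape (`doi`, `arxiv_id`, `pmid`, `title` keys) is hypothetical; only the DOI-first, fall-back-to-other-IDs logic is the point:

```python
def dedup_key(record):
    """Prefer DOI; fall back to arXiv ID, then PubMed ID, then title as a last resort."""
    for field in ("doi", "arxiv_id", "pmid"):
        value = record.get(field)
        if value:
            return (field, value.strip().lower())
    return ("title", record.get("title", "").strip().lower())

def dedup(records):
    """Return (unique records, records that had no stable ID) for later cleanup."""
    seen, misses, out = set(), [], []
    for rec in records:
        key = dedup_key(rec)
        if key[0] == "title":
            misses.append(rec)  # an "ID resolution miss" worth logging
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out, misses
```

Lowercasing DOIs before comparison is deliberate: DOI matching is case-insensitive, and mixed-case duplicates are a common source of silent double-counting.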
7) Fields: response shaping, cost control, and speed
The single highest-leverage performance tool in Semantic Scholar’s API is the fields query parameter.
The official tutorial states that most endpoints include a fields parameter letting users specify what data they want returned,
and explicitly advises avoiding more fields than needed because it can slow down response rates.
Why “fields” matters in real products
Think of fields like a “projection” in SQL:
- It reduces payload size (faster over the network)
- It reduces server work (faster response)
- It reduces your parsing cost
- It lowers the chance of hitting response size ceilings
Suggested field bundles (human-friendly presets)
Use presets in your code, not ad-hoc field strings scattered everywhere. Example presets:
- Search card: title, year, venue, authors (names), URL, openAccessPdf (if needed)
- Paper details: abstract, citationCount, referenceCount, influentialCitationCount (if available), publication date
- Graph view: references (IDs + title), citations (IDs + title), topics/fields of study (if available)
- Export: DOI/arXiv/PMID + BibTeX/citation style fields if your UI supports “Cite” features
Treat `fields` as a product contract and a performance budget.
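In code, presets can be plain constants with one lookup function, so every call site shares the same audited field strings. The field names below (title, year, abstract, citationCount, and so on) follow the examples used in this guide, but verify each against the official docs before relying on it:

```python
# Named field presets: one place to change, easy to audit payload size.
FIELD_PRESETS = {
    "search_card":   "title,year,venue,authors,url",
    "paper_details": "title,abstract,citationCount,referenceCount,publicationDate",
    "graph_view":    "title,references.title,citations.title",
}

def fields_for(preset):
    """Resolve a preset name to its fields string; fail loudly on typos."""
    try:
        return FIELD_PRESETS[preset]
    except KeyError:
        raise ValueError(f"Unknown field preset: {preset!r}")
```

Failing loudly on an unknown preset is the design choice here: a typo that silently fell back to a default preset would either over-fetch or break the UI downstream.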
8) Pagination and batch retrieval (how to scale without pain)
Many Semantic Scholar endpoints are paginated. The official tutorial includes a dedicated “Pagination” section in Additional Resources and also encourages using batch/bulk endpoints for large quantities of data.
Why batch/bulk endpoints matter
When your rate limit is constrained (especially at 1 RPS initially for keys), batch/bulk endpoints become the only practical way to retrieve meaningful volumes:
- Bulk search: retrieve more results in one response rather than firing many small queries
- Batch details: fetch details for multiple paper IDs at once
The tutorial explicitly calls out examples: paper relevance search has a bulk version, and the paper details endpoint has a batch version.
Pagination pattern for UI
For interactive search UIs, a common pattern is:
- Perform an initial search with a small limit (e.g., 10–20)
- Render results immediately
- Fetch the next page when the user scrolls or clicks “Load more”
- Cache pages to avoid refetch on back navigation
Pagination pattern for pipelines
For back-office enrichment, you typically:
- Queue a list of IDs
- Pull N IDs per worker tick
- Use batch endpoints if available
- Persist a checkpoint (last processed ID or cursor) to resume after failures
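The queue-and-checkpoint loop above can be sketched as a chunking generator plus a resumable offset. The checkpoint store here is an in-memory dict for illustration; a real pipeline would persist it to a database or file after each batch:

```python
def chunks(ids, size):
    """Yield successive batches of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def process_with_checkpoint(ids, size, worker, checkpoint):
    """Resume from checkpoint['offset']; advance it only after a batch succeeds."""
    start = checkpoint.get("offset", 0)
    for i in range(start, len(ids), size):
        worker(ids[i:i + size])          # e.g., one batch-endpoint call
        checkpoint["offset"] = i + size  # persist this in production
```

Advancing the offset only after the worker returns means a crash mid-run re-processes at most one batch, which is safe as long as your writes are idempotent (upserts keyed by paper ID).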
9) Recommendations API (build “similar papers” features)
The API overview lists a Recommendations service that “provides recommended papers similar to a given paper.” The official tutorial includes “Step 2: Get recommended papers,” indicating a guided workflow for using this API in practice.
Common recommendation UX patterns
- Paper page sidebar: “Similar papers” list with filters (year range, venue relevance)
- Reading list augmentation: recommend new papers for a folder or topic
- Research feed: periodic refresh of recommended papers based on user interests
How to keep recommendations trustworthy
Recommendations feel magical when they’re grounded in data that users can verify. Show:
- Why it’s recommended (shared topics, shared citations, similar abstracts—whatever your UX supports)
- Clear metadata (venue/year/authors)
- Direct link to the Semantic Scholar page for verification when appropriate
10) Datasets API and bulk data workflows
If you need request rates higher than API keys provide, the official tutorial explicitly recommends downloading Semantic Scholar datasets and querying locally. The Datasets API exists to make that practical: it provides endpoints for downloading and maintaining datasets, and the tutorial includes guidance on downloading full datasets and updating with incremental diffs.
When datasets are the right approach
- You need to enrich millions of citations in a batch job
- You want to build your own ranking/retrieval system over academic metadata
- You want offline analytics (topic modeling, venue analysis, citation networks)
- You need consistent throughput without UI-driven variability
Typical dataset pipeline architecture
A robust dataset-driven system often looks like:
- Bootstrap: download an initial full dataset snapshot using Datasets API
- Ingest: parse JSON archives into a warehouse (Postgres/BigQuery/Snowflake) or a search index (Elasticsearch/OpenSearch)
- Serve: run your own API or UI against your local store
- Update: apply incremental diffs regularly to stay current
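At its core, the update step reduces to upserts and deletes keyed by paper ID. The record shape below (`paper_id` key, dict store) is hypothetical; the real dataset schemas are documented by the Datasets API:

```python
def apply_diff(store, updated, deleted_ids):
    """Merge one incremental diff into a local keyed store (dict as stand-in for a DB)."""
    for rec in updated:
        store[rec["paper_id"]] = rec   # upsert: insert new or overwrite existing
    for pid in deleted_ids:
        store.pop(pid, None)           # delete if present, ignore if already gone
    return store
```

Because both operations are idempotent, replaying a diff that was partially applied before a crash is harmless, which simplifies the "Update" stage of the pipeline considerably.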
Choosing between “live API” vs “bulk datasets”
| Need | Use live APIs | Use datasets |
|---|---|---|
| Interactive search UI | ✅ Great | Sometimes (if you run your own search index) |
| High-volume enrichment | Limited by rate limits | ✅ Best |
| Custom ranking / experimentation | Limited | ✅ Best |
| Minimal ops / fastest MVP | ✅ Best | Heavier setup |
11) Data quality, coverage, and real-world caveats
Scholarly data is inherently messy. Your product quality depends on how you handle:
- Coverage gaps: not every paper has a DOI or complete abstract
- Author disambiguation: same name, different person; same person, multiple name variants
- Venue normalization: conferences and journals have naming inconsistencies
- Open access links: some papers have open PDFs; many do not
- Permission boundaries: metadata is not the same as redistributing full text
Permissions and full text
Semantic Scholar’s FAQ on permissions makes it explicit: if you want permission to use/publish/distribute any part of a paper, you must contact the author or publisher directly (Semantic Scholar cannot provide those permissions for you). This is crucial for product compliance: your app can show metadata and links, but treat full-text hosting or redistribution as a separate legal problem.
Trustworthy UX: show provenance
Users trust academic tools more when they can verify sources. Consider:
- Link back to the Semantic Scholar page for a paper (when appropriate)
- Show stable identifiers (DOI, arXiv ID) when available
- Display retrieval timestamps in your UI (“Data fetched on…”) for transparency
- Provide a “Report an issue” link for metadata errors or missing records
12) Production architecture: cache, retries, queues, and cost control
A production-grade Semantic Scholar integration is mostly about systems engineering, not “just calling an endpoint.” Below are patterns that prevent slow UIs, 429 storms, and fragile pipelines.
Pattern A: Read-through cache for paper details
Paper detail calls repeat constantly across users. Implement a cache layer:
- Key by canonical paper ID (and optionally by fields preset)
- TTL of 1–7 days depending on how “fresh” you need counts/metadata to be
- Cache negative results briefly (e.g., 5 minutes) to avoid repeated misses
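Pattern A in miniature: an in-memory dict with TTLs and short negative caching. This is a sketch; a real deployment would typically use Redis or similar, and `fetch` stands in for the actual API call:

```python
import time

class ReadThroughCache:
    def __init__(self, fetch, ttl=86400, negative_ttl=300, clock=time.monotonic):
        self.fetch = fetch                  # callable: key -> value or None
        self.ttl = ttl                      # e.g., 1 day for found papers
        self.negative_ttl = negative_ttl    # e.g., 5 minutes for misses
        self.clock = clock                  # injectable for testing
        self.store = {}                     # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > self.clock():
            return entry[1]                 # hit (may be a cached None)
        value = self.fetch(key)             # miss or expired: go to the API
        ttl = self.ttl if value is not None else self.negative_ttl
        self.store[key] = (self.clock() + ttl, value)
        return value
```

Caching `None` briefly is the subtle part: without it, a popular-but-missing ID (a bad DOI in a widely shared bibliography, say) turns into a steady stream of pointless API calls.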
Pattern B: Debounce search and prefetch carefully
Search boxes can trigger dozens of calls per user. Debounce typing (e.g., 300–600ms), limit “search-as-you-type” to short results, and only fetch full details when a user opens a result.
Pattern C: Bulk enrichment via queue + batch endpoints
For pipelines (bibliography import, citation enrichment):
- Accept user upload (DOIs, arXiv IDs, PMIDs)
- Normalize and deduplicate IDs
- Queue ID resolution jobs
- Use batch endpoints when possible (tutorial recommends batch/bulk usage)
- Persist results to your DB and notify user when ready
Pattern D: Resilience and retries
The tutorial lists standard status codes including 500 and 429. Treat 429 differently from 500:
- 429: backoff, reduce concurrency, serve cached data if possible
- 5xx: retry with backoff (short window), then fail gracefully
- 4xx: do not retry blindly; validate parameters and fix inputs
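That policy can be encoded as a single decision function your retry loop consults. A sketch; the status groupings follow the list above, and the action names are just labels for your own handlers:

```python
def retry_action(status, attempt, max_attempts=5):
    """Map an HTTP status and attempt count to what the client should do next."""
    if status == 429:
        # Throttled: back off, or fall back to cached data once retries run out.
        return "backoff" if attempt < max_attempts else "serve_cached"
    if 500 <= status <= 599:
        # Server error: retry briefly, then fail gracefully.
        return "retry" if attempt < max_attempts else "fail"
    if 400 <= status <= 499:
        # Client error: retrying won't help; fix parameters or inputs.
        return "fix_request"
    return "ok"
```

Centralizing this logic keeps the 429/5xx/4xx distinction consistent across every call site instead of re-deciding it in each handler.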
Pattern E: Observability (you’ll need it)
Track these metrics:
- Requests per endpoint and per user
- Latency (p50, p95) by endpoint
- 429 rate (throttle events)
- Cache hit ratio
- Batch job completion time and failure rate
Concrete sample: a safe server-side client (Node)
```javascript
// Minimal server-side client pattern (Node 18+)
// - Uses x-api-key header if available
// - Handles timeouts
// - Adds basic exponential backoff on 429/5xx
const BASE = "https://api.semanticscholar.org/graph/v1";

function sleep(ms) { return new Promise(r => setTimeout(r, ms)); }

async function s2Fetch(path, { params = {}, apiKey, method = "GET", body } = {}) {
  const url = new URL(BASE + path);
  Object.entries(params).forEach(([k, v]) => {
    if (v !== undefined && v !== null) url.searchParams.set(k, String(v));
  });

  const headers = { "Accept": "application/json" };
  if (apiKey) headers["x-api-key"] = apiKey;
  if (body) headers["Content-Type"] = "application/json";

  const maxAttempts = 5;
  let attempt = 0;
  while (true) {
    attempt++;
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), 15000);
    try {
      const res = await fetch(url, {
        method,
        headers,
        body: body ? JSON.stringify(body) : undefined,
        signal: ctrl.signal
      });
      if (res.status === 429 || (res.status >= 500 && res.status <= 599)) {
        if (attempt >= maxAttempts) throw new Error(`Retry limit hit: ${res.status}`);
        const backoff = Math.min(2000 * Math.pow(2, attempt - 1), 20000);
        const jitter = Math.floor(Math.random() * 250);
        await sleep(backoff + jitter);
        continue;
      }
      if (!res.ok) {
        const text = await res.text().catch(() => "");
        throw new Error(`HTTP ${res.status}: ${text.slice(0, 400)}`);
      }
      return await res.json();
    } finally {
      clearTimeout(t);
    }
  }
}

// Example usage:
// const paper = await s2Fetch(`/paper/${paperId}`, {
//   apiKey: process.env.S2_API_KEY,
//   params: { fields: "title,year,abstract,citationCount" }
// });
```
13) License & compliance (what you must read before shipping)
Semantic Scholar provides an API License Agreement. The license page states it is a legal agreement between the licensee and The Allen Institute for AI (AI2), governs your use of the API, and includes key terms about how you may use S2 Data.
Key points called out on the license page
A few items that are clearly visible on the license page (summarized, not quoted):
- The agreement is between you (and your organization) and AI2, and governs use of the Semantic Scholar API.
- AI2 grants a limited, non-exclusive, non-transferable, non-sublicensable, terminable license to use the API to access and display S2 Data in accordance with documentation and the agreement.
- If you are provided an API key, you may not disclose it beyond authorized users within your organization.
14) FAQs
Do I need an API key to use the Semantic Scholar API?
Many endpoints are public without authentication, but unauthenticated usage is rate-limited and shared across all unauthenticated users. The official overview says some endpoints require an API key and recommends including your key with every request.
What are the base URLs for the Semantic Scholar APIs?
The official tutorial lists three: Graph API https://api.semanticscholar.org/graph/v1, Recommendations API
https://api.semanticscholar.org/recommendations/v1, and Datasets API https://api.semanticscholar.org/datasets/v1.
What are the unauthenticated and authenticated rate limits?
The official overview says unauthenticated access is rate-limited to 1000 requests per second shared among all unauthenticated users, and that the introductory rate limit for an API key is 1 request per second on all endpoints.
How do I send my API key?
Use the x-api-key request header (shown in the official tutorial).
How do I avoid hitting rate limits?
Follow the official tutorial’s guidance: use an API key, use batch/bulk endpoints, and keep fields small.
For large needs, download datasets and query locally.
Can I download the entire corpus?
Semantic Scholar provides a Datasets API for downloading and maintaining datasets (including incremental diffs), and the tutorial describes how to download full datasets and update with diffs.
Can I redistribute full paper content from Semantic Scholar?
For permission to use, publish, or distribute any part of a paper, Semantic Scholar’s FAQ says you must contact the author or publisher directly.
Is there a publicly accessible API with a search endpoint?
Yes. Semantic Scholar’s FAQ points to the Semantic Scholar Academic Graph API and mentions you can download a subset of the corpus as a single artifact.
Where should I verify the exact endpoint parameters and available fields?
Use the official API documentation and tutorial. The tutorial explicitly points to the documentation for request/response formatting details.
What’s the best approach for a research product: live API or datasets?
Use the live Graph API for interactive experiences and the Datasets API for high-throughput or custom local querying. The official tutorial recommends datasets for needs exceeding key-based request rates.
15) Official sources (verify current schemas, fields, and policies)
Semantic Scholar evolves over time. Use these official pages for the latest, authoritative details:
- API overview (rate limits, services, key request): https://www.semanticscholar.org/product/api
- API tutorial (base URLs, best practices, batch/bulk guidance): https://www.semanticscholar.org/product/api/tutorial
- API documentation hub: https://api.semanticscholar.org/api-docs/
- API license agreement: https://www.semanticscholar.org/product/api/license
- FAQ: public API search endpoint: https://www.semanticscholar.org/faq/public-api
- FAQ: permissions to use/distribute paper content: https://www.semanticscholar.org/faq/permission-to-use
Developer checklist (copy into your repo)
- ✅ Use the correct base URL: `/graph/v1`, `/recommendations/v1`, or `/datasets/v1`
- ✅ Send `x-api-key` on server-side requests if you have a key
- ✅ Keep `fields` minimal for speed
- ✅ Prefer batch/bulk endpoints for high-volume retrieval
- ✅ Cache paper details and common searches
- ✅ Implement backoff on 429 and retries on 5xx
- ✅ Read and comply with the API license agreement
- ✅ Don’t redistribute paper content without publisher/author permission