Semantic Scholar API - Complete Developer Guide
The Semantic Scholar API (from Ai2 / Semantic Scholar) is one of the most practical ways to programmatically work with scholarly metadata: search papers by relevance, fetch paper details, list citations/references, look up authors and their publications, and—when you need higher throughput—download bulk datasets and query locally. The API uses standard HTTP verbs and responses, and is explicitly organized into three services: Academic Graph, Recommendations, and Datasets. (See the official overview and tutorial.)
1) What is the Semantic Scholar API?
Semantic Scholar is widely known as a free, AI-powered research search engine. For developers, the most relevant piece is its REST API that provides structured access to paper and author metadata and related graph relationships (citations, references, co-authorship, venues). According to Semantic Scholar’s API overview, the REST API helps you “find and explore scientific publication data about authors, papers, citations, venues, and more” and is organized into Academic Graph, Recommendations, and Datasets services. That framing matters: the API is not only a “search endpoint,” but a platform for graph traversal and bulk-access workflows.
What you can build
If you can describe your product as “research discovery,” “citation intelligence,” or “paper metadata enrichment,” you can likely use the Semantic Scholar API. Common builds include:
- Literature discovery tools: search by keywords, filter by year, and show the most relevant results with paper metadata.
- Citation graphs: traverse references and citations to visualize how ideas connect over time.
- Author pages: show an author profile, their papers, topics, and citation counts.
- Reading list workflows: enrich citations from a bibliography (DOIs, arXiv IDs, PubMed IDs) into structured paper objects.
- Recommendation feeds: recommend “similar papers” from a seed paper or a set of interests.
- Data pipelines: pull large-scale metadata via datasets, then run local analytics and custom ranking.
How “Semantic Scholar API” differs from scraping
Scraping an academic site is brittle and often violates terms. Semantic Scholar provides documented endpoints, clear request/response formats, and explicit best practices: use batch endpoints, limit fields, and download datasets for large needs. That combination is exactly what you want when building a production integration that must be stable over time.
2) The Semantic Scholar API suite (Academic Graph, Recommendations, Datasets)
The official tutorial states that Semantic Scholar provides three APIs, each with its own base URL:
| API | What it does | Base URL |
|---|---|---|
| Academic Graph API | Paper + author details, citations, references, and scholarly graph traversal | https://api.semanticscholar.org/graph/v1 |
| Recommendations API | Recommend papers similar to a given paper (and related workflows) | https://api.semanticscholar.org/recommendations/v1 |
| Datasets API | Download full datasets and incremental diffs for local querying and high-rate needs | https://api.semanticscholar.org/datasets/v1 |
The overview page explains the same high-level organization (Academic Graph, Recommendations, Datasets) and emphasizes that Academic Graph provides data for authors, papers, citations, venues, and more, including paper embeddings, with links back to semanticscholar.org.
Best for product apps
Use Academic Graph + Recommendations for interactive search, paper details, and personalized “similar papers.”
Best for heavy pipelines
Use Datasets for large-scale analyses, high-throughput enrichment, and local ranking.
3) Quickstart: your first requests
The official tutorial walks through searching for papers and authors, requesting paper details, and using recommendations and datasets. It also clarifies what an endpoint is, what query parameters and headers are, and how to interpret standard HTTP status codes (including 429 for rate limiting).
Base request shape
Most Semantic Scholar API requests follow a simple pattern:
- Choose a base URL (Graph, Recommendations, or Datasets).
- Choose a resource path (e.g., `/paper/search` or `/paper/{paperId}`).
- Add query parameters (e.g., `fields=...`, `limit=...`, year filters, offsets).
- Optionally add headers like `x-api-key` if authenticated.
Keep `fields` small (only ask for what you need). For even larger needs, download datasets and query locally.
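As an illustration, the request-shape steps above can be sketched as a small URL builder. This is plain stdlib Python, not an official client; the paths and field names mirror the examples in this guide:

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"

def build_request(path, params=None, api_key=None):
    """Assemble the URL and headers for a Graph API call."""
    url = GRAPH_BASE + path
    if params:
        # safe="," keeps fields=title,year readable instead of percent-encoding commas
        url += "?" + urlencode(params, safe=",")
    headers = {"Accept": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key  # send only from server-side code
    return url, headers

url, headers = build_request(
    "/paper/search",
    params={"query": "attention", "limit": 10, "fields": "title,year,citationCount"},
)
```

A real client would feed `url` and `headers` into whatever HTTP library you use, plus the caching and backoff layers discussed later in this guide.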
Example: fetch paper details (Graph API)
The tutorial shows the paper details endpoint pattern as /paper/{paper_id}, and demonstrates using a fields parameter
to retrieve exactly what you need (e.g., title, year, abstract, citationCount).
```http
# Example request shape
# Base URL (Graph API): https://api.semanticscholar.org/graph/v1
# Resource path: /paper/{paperId}
# Query params: fields=title,year,abstract,citationCount

GET https://api.semanticscholar.org/graph/v1/paper/{paperId}?fields=title,year,abstract,citationCount

# Optional header for authenticated usage:
x-api-key: YOUR_PRIVATE_API_KEY
```
In real apps, you’ll wrap this in a client that also handles caching, retries on transient errors, and backoff on 429 responses.
4) Authentication & API keys (and how auth really works)
Semantic Scholar supports public access for most endpoints, but also offers API keys for authenticated usage. The official overview page says certain endpoints require authentication via an API key and that using an API key is recommended as a best practice. It also warns: you receive a private API key via email and you should not share your key with anyone.
Header-based auth: x-api-key
The tutorial demonstrates adding your API key via request headers using the x-api-key header.

```python
headers = {
    "x-api-key": "YOUR_PRIVATE_API_KEY"
}
```
When you should request a key
Even if you can call many endpoints without auth, requesting an API key is generally worth it because:
- It gives you a dedicated (though initially modest) rate limit profile.
- It makes support and troubleshooting easier because your traffic is identifiable.
- Some endpoints require authentication.
Key management and safety
Treat the API key as a secret. Common failures include:
- Embedding the key in a public frontend bundle
- Checking the key into git
- Printing the key into logs or error traces
- Exposing the key through “copy as cURL” debugging tools
Instead, store keys in server-side environment variables, secrets managers, or CI/CD secret stores. Your client-side app should call your own backend, and the backend calls Semantic Scholar with the key.
5) Rate limits, throttling, and how to avoid 429s
Rate limits are one of the most important practical aspects of the Semantic Scholar API—because research apps often feel “bursty” by nature. Users search quickly, click around, open details, and request lots of citations.
Unauthenticated behavior
According to the official overview page, most endpoints are available without authentication, but unauthenticated access is rate-limited to 1000 requests per second shared among all unauthenticated users, and requests may be further throttled during heavy use.
Authenticated behavior (API key)
The same official overview page states that authenticated users have access to higher rate limits and recommends including your key with every request. It also notes: the introductory rate limit for an API key is 1 RPS on all endpoints.
That “1 RPS” starter limit surprises some developers. The right interpretation is: the platform is intentionally conservative for new keys, and you should design your integration using:
- Batch/bulk endpoints whenever possible
- Fields minimization to reduce response size and server load
- Caching for popular queries and repeat paper lookups
- Debouncing on search boxes and interactive UI controls
- Queue-based enrichment for high-volume pipelines
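One way to respect a 1 RPS budget on the client side is a small token bucket placed in front of the HTTP client. This is a sketch, not an official mechanism; the rate and capacity numbers are placeholders you should set to match your actual key's limit:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity` tokens."""

    def __init__(self, rate=1.0, capacity=1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def try_acquire(self):
        """Return True and spend a token if one is available, else False."""
        now = self.clock()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Callers that get `False` should delay or queue the request rather than fire it, which keeps your traffic smooth instead of bursty.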
Official best practices to reduce slowdowns
The tutorial’s “How to make requests faster and more efficiently” section is essentially a production checklist: use an API key, use batch endpoints, limit fields, and use datasets when you need higher request rates than API keys provide.
Backoff strategy (what to do on 429)
The tutorial includes 429 as “Too Many Requests” and advises slowing down. For production, implement:
- Exponential backoff with jitter
- Client-side request budgets per user session
- Retry only idempotent requests (GET; POST only if safe)
- Circuit breaker for prolonged throttling (serve cached results or degrade gracefully)
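The backoff schedule above reduces to a small helper: exponential growth, a cap, and random jitter. The base and cap values here are illustrative defaults, not official recommendations:

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25, rng=random.random):
    """Delay in seconds before retry `attempt` (1-indexed): exponential, capped, jittered."""
    delay = min(cap, base * (2 ** (attempt - 1)))
    return delay + rng() * jitter

# attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 3 -> ~4s, ..., capped near 30s
```

The jitter term matters more than it looks: without it, many clients that were throttled at the same moment retry at the same moment, recreating the spike that caused the 429 in the first place.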
6) Core endpoint map (Graph API focus)
The Graph API is the workhorse for most applications. The tutorial describes that the Academic Graph API “returns details about papers, paper authors,
paper citations and references,” and uses endpoints like /paper/search, /paper/{paperId}, and their bulk/batch variations.
Core resources you’ll use
Papers
Search papers by relevance, fetch details, list references/citations, and enrich external IDs into canonical paper objects.
Authors
Look up authors, fetch profiles, list an author’s papers, and build author-centric views.
References & citations
Traverse the graph: what a paper cites (references) and who cites the paper (citations).
Venues & metadata
Support filtering and display: venues, years, publication types, open access fields, and more (based on selected fields).
Common Graph API routes (conceptual)
Exact request/response shapes are best verified in the official API documentation, but these are the routes most developers rely on:
| Job to be done | Typical route pattern | Notes |
|---|---|---|
| Search papers by relevance | GET /paper/search | Use query, limit, year filters, and fields. |
| Paper details by ID | GET /paper/{paper_id} | Request only needed fields. |
| Bulk paper search | GET /paper/search/bulk | Tutorial references bulk endpoints for efficiency. |
| Batch paper details | POST /paper/batch | Tutorial references batch endpoints for efficiency. |
| Author lookup | GET /author/{author_id} | Often combined with author’s papers endpoints. |
| Author search | GET /author/search | Find author IDs from name queries. |
Note: Some of the exact paths and parameters are dynamic and maintained in the official docs and Postman collection; always verify before shipping.
IDs you’ll encounter (practical approach)
In scholarly systems, IDs are messy: DOI, arXiv IDs, PubMed IDs, internal corpus IDs, and canonical Semantic Scholar paper IDs. Your integration should:
- Store the canonical paper identifier returned by Semantic Scholar for future lookups
- Retain original IDs (DOI/arXiv/PMID) for traceability
- Deduplicate papers by DOI when possible, but don’t assume DOI exists for everything
- Log “ID resolution misses” for later cleanup
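A minimal sketch of that dedup policy follows. The record shape (`doi`, `arxiv_id`, `pmid`, `title` keys) is hypothetical; only the DOI-first, fall-back-to-other-IDs logic is the point:

```python
def dedup_key(record):
    """Prefer DOI; fall back to arXiv ID, then PubMed ID, then title as a last resort."""
    for field in ("doi", "arxiv_id", "pmid"):
        value = record.get(field)
        if value:
            return (field, value.strip().lower())
    return ("title", record.get("title", "").strip().lower())

def dedup(records):
    """Return (unique records, records that had no stable ID) for later cleanup."""
    seen, misses, out = set(), [], []
    for rec in records:
        key = dedup_key(rec)
        if key[0] == "title":
            misses.append(rec)  # an "ID resolution miss" worth logging
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out, misses
```

Lowercasing DOIs before comparison is deliberate: DOI matching is case-insensitive, and mixed-case duplicates are a common source of silent double-counting.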
7) Fields: response shaping, cost control, and speed
The single highest-leverage performance tool in Semantic Scholar’s API is the fields query parameter.
The official tutorial states that most endpoints include a fields parameter letting users specify what data they want returned,
and explicitly advises avoiding more fields than needed because it can slow down response rates.
Why “fields” matters in real products
Think of fields like a “projection” in SQL:
- It reduces payload size (faster over the network)
- It reduces server work (faster response)
- It reduces your parsing cost
- It lowers the chance of hitting response size ceilings
Suggested field bundles (human-friendly presets)
Use presets in your code, not ad-hoc field strings scattered everywhere. Example presets:
- Search card: title, year, venue, authors (names), URL, openAccessPdf (if needed)
- Paper details: abstract, citationCount, referenceCount, influentialCitationCount (if available), publication date
- Graph view: references (IDs + title), citations (IDs + title), topics/fields of study (if available)
- Export: DOI/arXiv/PMID + BibTeX/citation style fields if your UI supports “Cite” features
Treat `fields` as a product contract and a performance budget.
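In code, presets can be plain constants with one lookup function, so every call site shares the same audited field strings. The field names below (title, year, abstract, citationCount, and so on) follow the examples used in this guide, but verify each against the official docs before relying on it:

```python
# Named field presets: one place to change, easy to audit payload size.
FIELD_PRESETS = {
    "search_card":   "title,year,venue,authors,url",
    "paper_details": "title,abstract,citationCount,referenceCount,publicationDate",
    "graph_view":    "title,references.title,citations.title",
}

def fields_for(preset):
    """Resolve a preset name to its fields string; fail loudly on typos."""
    try:
        return FIELD_PRESETS[preset]
    except KeyError:
        raise ValueError(f"Unknown field preset: {preset!r}")
```

Failing loudly on an unknown preset is the design choice here: a typo that silently fell back to a default preset would either over-fetch or break the UI downstream.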
8) Pagination and batch retrieval (how to scale without pain)
Many Semantic Scholar endpoints are paginated. The official tutorial includes a dedicated “Pagination” section in Additional Resources and also encourages using batch/bulk endpoints for large quantities of data.
Why batch/bulk endpoints matter
When your rate limit is constrained (especially at 1 RPS initially for keys), batch/bulk endpoints become the only practical way to retrieve meaningful volumes:
- Bulk search: retrieve more results in one response rather than firing many small queries
- Batch details: fetch details for multiple paper IDs at once
The tutorial explicitly calls out examples: paper relevance search has a bulk version, and the paper details endpoint has a batch version.
Pagination pattern for UI
For interactive search UIs, a common pattern is:
- Perform an initial search with a small limit (e.g., 10–20)
- Render results immediately
- Fetch the next page when the user scrolls or clicks “Load more”
- Cache pages to avoid refetch on back navigation
Pagination pattern for pipelines
For back-office enrichment, you typically:
- Queue a list of IDs
- Pull N IDs per worker tick
- Use batch endpoints if available
- Persist a checkpoint (last processed ID or cursor) to resume after failures
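The queue-and-checkpoint loop above can be sketched as a chunking generator plus a resumable offset. The checkpoint store here is an in-memory dict for illustration; a real pipeline would persist it to a database or file after each batch:

```python
def chunks(ids, size):
    """Yield successive batches of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def process_with_checkpoint(ids, size, worker, checkpoint):
    """Resume from checkpoint['offset']; advance it only after a batch succeeds."""
    start = checkpoint.get("offset", 0)
    for i in range(start, len(ids), size):
        worker(ids[i:i + size])          # e.g., one batch-endpoint call
        checkpoint["offset"] = i + size  # persist this in production
```

Advancing the offset only after the worker returns means a crash mid-run re-processes at most one batch, which is safe as long as your writes are idempotent (upserts keyed by paper ID).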
9) Recommendations API (build “similar papers” features)
The API overview lists a Recommendations service that “provides recommended papers similar to a given paper.” The official tutorial includes “Step 2: Get recommended papers,” indicating a guided workflow for using this API in practice.
Common recommendation UX patterns
- Paper page sidebar: “Similar papers” list with filters (year range, venue relevance)
- Reading list augmentation: recommend new papers for a folder or topic
- Research feed: periodic refresh of recommended papers based on user interests
How to keep recommendations trustworthy
Recommendations feel magical when they’re grounded in data that users can verify. Show:
- Why it’s recommended (shared topics, shared citations, similar abstracts—whatever your UX supports)
- Clear metadata (venue/year/authors)
- Direct link to the Semantic Scholar page for verification when appropriate
10) Datasets API and bulk data workflows
If you need request rates higher than API keys provide, the official tutorial explicitly recommends downloading Semantic Scholar datasets and querying locally. The Datasets API exists to make that practical: it provides endpoints for downloading and maintaining datasets, and the tutorial includes guidance on downloading full datasets and updating with incremental diffs.
When datasets are the right approach
- You need to enrich millions of citations in a batch job
- You want to build your own ranking/retrieval system over academic metadata
- You want offline analytics (topic modeling, venue analysis, citation networks)
- You need consistent throughput without UI-driven variability
Typical dataset pipeline architecture
A robust dataset-driven system often looks like:
- Bootstrap: download an initial full dataset snapshot using Datasets API
- Ingest: parse JSON archives into a warehouse (Postgres/BigQuery/Snowflake) or a search index (Elasticsearch/OpenSearch)
- Serve: run your own API or UI against your local store
- Update: apply incremental diffs regularly to stay current
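At its core, the update step reduces to upserts and deletes keyed by paper ID. The record shape below (`paper_id` key, dict store) is hypothetical; the real dataset schemas are documented by the Datasets API:

```python
def apply_diff(store, updated, deleted_ids):
    """Merge one incremental diff into a local keyed store (dict as stand-in for a DB)."""
    for rec in updated:
        store[rec["paper_id"]] = rec   # upsert: insert new or overwrite existing
    for pid in deleted_ids:
        store.pop(pid, None)           # delete if present, ignore if already gone
    return store
```

Because both operations are idempotent, replaying a diff that was partially applied before a crash is harmless, which simplifies the "Update" stage of the pipeline considerably.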
Choosing between “live API” vs “bulk datasets”
| Need | Use live APIs | Use datasets |
|---|---|---|
| Interactive search UI | ✅ Great | Sometimes (if you run your own search index) |
| High-volume enrichment | Limited by rate limits | ✅ Best |
| Custom ranking / experimentation | Limited | ✅ Best |
| Minimal ops / fastest MVP | ✅ Best | Heavier setup |
11) Data quality, coverage, and real-world caveats
Scholarly data is inherently messy. Your product quality depends on how you handle:
- Coverage gaps: not every paper has a DOI or complete abstract
- Author disambiguation: same name, different person; same person, multiple name variants
- Venue normalization: conferences and journals have naming inconsistencies
- Open access links: some papers have open PDFs; many do not
- Permission boundaries: metadata is not the same as redistributing full text
Permissions and full text
Semantic Scholar’s FAQ on permissions makes it explicit: if you want permission to use/publish/distribute any part of a paper, you must contact the author or publisher directly (Semantic Scholar cannot provide those permissions for you). This is crucial for product compliance: your app can show metadata and links, but treat full-text hosting or redistribution as a separate legal problem.
Trustworthy UX: show provenance
Users trust academic tools more when they can verify sources. Consider:
- Link back to the Semantic Scholar page for a paper (when appropriate)
- Show stable identifiers (DOI, arXiv ID) when available
- Display retrieval timestamps in your UI (“Data fetched on…”) for transparency
- Provide a “Report an issue” link for metadata errors or missing records
12) Production architecture: cache, retries, queues, and cost control
A production-grade Semantic Scholar integration is mostly about systems engineering, not “just calling an endpoint.” Below are patterns that prevent slow UIs, 429 storms, and fragile pipelines.
Pattern A: Read-through cache for paper details
Paper detail calls repeat constantly across users. Implement a cache layer:
- Key by canonical paper ID (and optionally by fields preset)
- TTL of 1–7 days depending on how “fresh” you need counts/metadata to be
- Cache negative results briefly (e.g., 5 minutes) to avoid repeated misses
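Pattern A in miniature: an in-memory dict with TTLs and short negative caching. This is a sketch; a real deployment would typically use Redis or similar, and `fetch` stands in for the actual API call:

```python
import time

class ReadThroughCache:
    def __init__(self, fetch, ttl=86400, negative_ttl=300, clock=time.monotonic):
        self.fetch = fetch                  # callable: key -> value or None
        self.ttl = ttl                      # e.g., 1 day for found papers
        self.negative_ttl = negative_ttl    # e.g., 5 minutes for misses
        self.clock = clock                  # injectable for testing
        self.store = {}                     # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > self.clock():
            return entry[1]                 # hit (may be a cached None)
        value = self.fetch(key)             # miss or expired: go to the API
        ttl = self.ttl if value is not None else self.negative_ttl
        self.store[key] = (self.clock() + ttl, value)
        return value
```

Caching `None` briefly is the subtle part: without it, a popular-but-missing ID (a bad DOI in a widely shared bibliography, say) turns into a steady stream of pointless API calls.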
Pattern B: Debounce search and prefetch carefully
Search boxes can trigger dozens of calls per user. Debounce typing (e.g., 300–600ms), limit “search-as-you-type” to short results, and only fetch full details when a user opens a result.
Pattern C: Bulk enrichment via queue + batch endpoints
For pipelines (bibliography import, citation enrichment):
- Accept user upload (DOIs, arXiv IDs, PMIDs)
- Normalize and deduplicate IDs
- Queue ID resolution jobs
- Use batch endpoints when possible (tutorial recommends batch/bulk usage)
- Persist results to your DB and notify user when ready
Pattern D: Resilience and retries
The tutorial lists standard status codes including 500 and 429. Treat 429 differently from 500:
- 429: backoff, reduce concurrency, serve cached data if possible
- 5xx: retry with backoff (short window), then fail gracefully
- 4xx: do not retry blindly; validate parameters and fix inputs
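That policy can be encoded as a single decision function your retry loop consults. A sketch; the status groupings follow the list above, and the action names are just labels for your own handlers:

```python
def retry_action(status, attempt, max_attempts=5):
    """Map an HTTP status and attempt count to what the client should do next."""
    if status == 429:
        # Throttled: back off, or fall back to cached data once retries run out.
        return "backoff" if attempt < max_attempts else "serve_cached"
    if 500 <= status <= 599:
        # Server error: retry briefly, then fail gracefully.
        return "retry" if attempt < max_attempts else "fail"
    if 400 <= status <= 499:
        # Client error: retrying won't help; fix parameters or inputs.
        return "fix_request"
    return "ok"
```

Centralizing this logic keeps the 429/5xx/4xx distinction consistent across every call site instead of re-deciding it in each handler.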
Pattern E: Observability (you’ll need it)
Track these metrics:
- Requests per endpoint and per user
- Latency (p50, p95) by endpoint
- 429 rate (throttle events)
- Cache hit ratio
- Batch job completion time and failure rate
Concrete sample: a safe server-side client (Node)
```javascript
// Minimal server-side client pattern (Node 18+)
// - Uses x-api-key header if available
// - Handles timeouts
// - Adds basic exponential backoff on 429/5xx
const BASE = "https://api.semanticscholar.org/graph/v1";

function sleep(ms) { return new Promise(r => setTimeout(r, ms)); }

async function s2Fetch(path, { params = {}, apiKey, method = "GET", body } = {}) {
  const url = new URL(BASE + path);
  Object.entries(params).forEach(([k, v]) => {
    if (v !== undefined && v !== null) url.searchParams.set(k, String(v));
  });

  const headers = { "Accept": "application/json" };
  if (apiKey) headers["x-api-key"] = apiKey;
  if (body) headers["Content-Type"] = "application/json";

  const maxAttempts = 5;
  let attempt = 0;
  while (true) {
    attempt++;
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), 15000);
    try {
      const res = await fetch(url, {
        method,
        headers,
        body: body ? JSON.stringify(body) : undefined,
        signal: ctrl.signal
      });
      if (res.status === 429 || (res.status >= 500 && res.status <= 599)) {
        if (attempt >= maxAttempts) throw new Error(`Retry limit hit: ${res.status}`);
        const backoff = Math.min(2000 * Math.pow(2, attempt - 1), 20000);
        const jitter = Math.floor(Math.random() * 250);
        await sleep(backoff + jitter);
        continue;
      }
      if (!res.ok) {
        const text = await res.text().catch(() => "");
        throw new Error(`HTTP ${res.status}: ${text.slice(0, 400)}`);
      }
      return await res.json();
    } finally {
      clearTimeout(t);
    }
  }
}

// Example usage:
// const paper = await s2Fetch(`/paper/${paperId}`, {
//   apiKey: process.env.S2_API_KEY,
//   params: { fields: "title,year,abstract,citationCount" }
// });
```
13) License & compliance (what you must read before shipping)
Semantic Scholar provides an API License Agreement. The license page states it is a legal agreement between the licensee and The Allen Institute for AI (AI2), governs your use of the API, and includes key terms about how you may use S2 Data.
Key points called out on the license page
A few items that are clearly visible on the license page (summarized, not quoted):
- The agreement is between you (and your organization) and AI2, and governs use of the Semantic Scholar API.
- AI2 grants a limited, non-exclusive, non-transferable, non-sublicensable, terminable license to use the API to access and display S2 Data in accordance with documentation and the agreement.
- If you are provided an API key, you may not disclose it beyond authorized users within your organization.
14) FAQs
Do I need an API key to use the Semantic Scholar API?
Many endpoints are public without authentication, but unauthenticated usage is rate-limited and shared across all unauthenticated users. The official overview says some endpoints require an API key and recommends including your key with every request.
What are the base URLs for the Semantic Scholar APIs?
The official tutorial lists three: Graph API https://api.semanticscholar.org/graph/v1, Recommendations API
https://api.semanticscholar.org/recommendations/v1, and Datasets API https://api.semanticscholar.org/datasets/v1.
What are the unauthenticated and authenticated rate limits?
The official overview says unauthenticated access is rate-limited to 1000 requests per second shared among all unauthenticated users, and that the introductory rate limit for an API key is 1 request per second on all endpoints.
How do I send my API key?
Use the x-api-key request header (shown in the official tutorial).
How do I avoid hitting rate limits?
Follow the official tutorial’s guidance: use an API key, use batch/bulk endpoints, and keep fields small.
For large needs, download datasets and query locally.
Can I download the entire corpus?
Semantic Scholar provides a Datasets API for downloading and maintaining datasets (including incremental diffs), and the tutorial describes how to download full datasets and update with diffs.
Can I redistribute full paper content from Semantic Scholar?
For permission to use, publish, or distribute any part of a paper, Semantic Scholar’s FAQ says you must contact the author or publisher directly.
Is there a publicly accessible API with a search endpoint?
Yes. Semantic Scholar’s FAQ points to the Semantic Scholar Academic Graph API and mentions you can download a subset of the corpus as a single artifact.
Where should I verify the exact endpoint parameters and available fields?
Use the official API documentation and tutorial. The tutorial explicitly points to the documentation for request/response formatting details.
What’s the best approach for a research product: live API or datasets?
Use the live Graph API for interactive experiences and the Datasets API for high-throughput or custom local querying. The official tutorial recommends datasets for needs exceeding key-based request rates.
15) Official sources (verify current schemas, fields, and policies)
Semantic Scholar evolves over time. Use these official pages for the latest, authoritative details:
- API overview (rate limits, services, key request): https://www.semanticscholar.org/product/api
- API tutorial (base URLs, best practices, batch/bulk guidance): https://www.semanticscholar.org/product/api/tutorial
- API documentation hub: https://api.semanticscholar.org/api-docs/
- API license agreement: https://www.semanticscholar.org/product/api/license
- FAQ: public API search endpoint: https://www.semanticscholar.org/faq/public-api
- FAQ: permissions to use/distribute paper content: https://www.semanticscholar.org/faq/permission-to-use
Developer checklist (copy into your repo)
- ✅ Use the correct base URL: `/graph/v1`, `/recommendations/v1`, or `/datasets/v1`
- ✅ Send `x-api-key` on server-side requests if you have a key
- ✅ Keep `fields` minimal for speed
- ✅ Prefer batch/bulk endpoints for high-volume retrieval
- ✅ Cache paper details and common searches
- ✅ Implement backoff on 429 and retries on 5xx
- ✅ Read and comply with the API license agreement
- ✅ Don’t redistribute paper content without publisher/author permission