diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md
new file mode 100644
index 00000000..1e329d61
--- /dev/null
+++ b/proposals/2021-01-29_citation_api.md
@@ -0,0 +1,274 @@
+
+Describes schemas, APIs, use-cases, and the data store for the citation graph.
+
+## Use Cases
+
+**Outbound reference web pages:** on fatcat.wiki and scholar.archive.org, we
+want a page that lists all of the works cited by ("outbound" from) a paper or
+other fatcat release.
+
+- query by fatcat `release_ident`
+- nice to have: list references in the order they appear in the paper, and
+ annotate with any "key" used in the source document itself (either an index
+ number or a short name for the reference)
+- need to have a formatted reference string for each reference, even if we
+  have not "linked" it to a specific fatcat release (ie, we need structured or
+  unstructured citation text to display)
+
+
+**Inbound reference web pages:** on fatcat.wiki and scholar.archive.org, we
+want to display a list of all works which cite a specific work ("inbound"
+citations).
+
+- query by fatcat `release_ident`, or possibly by `work_ident` with the
+  ability to say "cites a different version of the same work"
+- nice to have: citation context snippet surrounding the citation
+- like outbound, want to have good display options and access options for each
+ entry
+- nice to have: non-traditional works (eg, mentions from wikipedia)
+
+**Inbound reference IA services:** OpenLibrary.org and/or web.archive.org might
+want to show a count or list of papers that reference a web page (by URL) or
+book (by openlibrary work identifier).
+
+**Inbound reference counts:** ability to display the number of inbound
+citation links for a release or work, on demand. Eg, on a fatcat.wiki release
+landing page. Not sure how important this use-case is.
+
+**Bulk Metadata Releases:** we will want to share this citation graph as an
+artifact. We can easily serialize this format into JSON and share that, or push
+into a columnar file format like Parquet to get storage efficiency gains,
+type/schema enforcement, and easier ingest and use for large-scale data
+analysis.
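+
+As a rough sketch of the Parquet path (assuming pyarrow, with the reference
+records already in memory as dicts; filenames and values are illustrative):
+
+    import pyarrow as pa
+    import pyarrow.parquet as pq
+
+    # refs: an iterable of BiblioRef-shaped dicts (see schema below)
+    refs = [
+        {"source_release_ident": "aaaaaaaaaaaaaaaaaaaaaaaaaa", "ref_index": 1, "ref_key": "Lee86"},
+    ]
+
+    # pyarrow infers a schema from the dicts; an explicit schema would give
+    # stronger type enforcement for a published corpus
+    table = pa.Table.from_pylist(refs)
+    pq.write_table(table, "refs.parquet", compression="zstd")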
+
+TODO: more?
+
+
+## Schemas
+
+First, a combined JSON/pydantic/elasticsearch object that represents a
+reference between two things:
+
+ BiblioRef ("bibliographic reference")
+ _key: Optional[str] elasticsearch doc key
+ ("release", source_release_ident, ref_index)
+ ("wikipedia", source_wikipedia_article, ref_index)
+ update_ts: Optional[datetime] elasticsearch doc timestamp
+
+ # metadata about source of reference
+ source_release_ident: Optional[str]
+ source_work_ident: Optional[str]
+ source_wikipedia_article: Optional[str]
+ with lang prefix like "en:Superglue"
+ # skipped: source_openlibrary_work
+ # skipped: source_url_surt
+ source_release_stage: Optional[str]
+ source_year: Optional[int]
+
+ # context of the reference itself
+ ref_index: int
+ 1-indexed, not 0-indexed
+ ref_key: Optional[str]
+ eg, "Lee86", "BIB23"
+ ref_locator: Optional[str]
+ eg, page number
+
+ # target of reference (identifiers)
+ target_release_ident: Optional[str]
+ target_work_ident: Optional[str]
+ target_openlibrary_work: Optional[str]
+ target_url_surt: Optional[str]
+ target_url: Optional[str]
+ would not be stored in elasticsearch, but would be auto-generated
+ by all "get" methods from the SURT, so calling code does not need
+ to do SURT transform
+ # skipped: target_wikipedia_article
+
+ match_provenance: str
+ crossref, pubmed, grobid, etc
+ match_status: Optional[str]
+ strong, weak, etc
+ TODO: "match_strength"?
+ match_reason: Optional[str]
+ "doi", "isbn", "fuzzy title, author", etc
+ maybe "fuzzy-title-author"?
+
+ target_unstructured: string (only if no release_ident link/match)
+ target_csl: free-form JSON (only if no release_ident link/match)
+ CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+ generated from unstructured by a GROBID parse, if needed
+
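+A minimal pydantic sketch of the above (assuming pydantic v1-style models;
+field list condensed, and the elasticsearch `_key`/`update_ts` handling is
+left out since that would happen at index time):
+
+    from typing import Any, Dict, Optional
+    from pydantic import BaseModel
+
+    class BiblioRef(BaseModel):
+        """A single reference ("citation") from a source work to a target work"""
+
+        # metadata about source of reference
+        source_release_ident: Optional[str]
+        source_work_ident: Optional[str]
+        source_wikipedia_article: Optional[str]
+        source_release_stage: Optional[str]
+        source_year: Optional[int]
+
+        # context of the reference itself (1-indexed)
+        ref_index: int
+        ref_key: Optional[str]
+        ref_locator: Optional[str]
+
+        # target of reference (identifiers)
+        target_release_ident: Optional[str]
+        target_work_ident: Optional[str]
+        target_openlibrary_work: Optional[str]
+        target_url_surt: Optional[str]
+
+        # match metadata
+        match_provenance: str
+        match_status: Optional[str]
+        match_reason: Optional[str]
+
+        # fallback citation text, only when there is no release_ident match
+        target_unstructured: Optional[str]
+        target_csl: Optional[Dict[str, Any]]
+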
+Then, two wrapper objects that add more complete metadata. These would be
+pydantic/JSON objects, used in python code, and maybe exposed via API, but not
+indexed in elasticsearch. These are the objects that would, eg, be used by
+jinja templates to display lists of references in the user interface.
+
+ AccessOption
+ access_type: str
+ describes type of access link
+ controlled values: wayback, ia_file, repository, loginwall, etc
+ access_url: str
+ note: for `target_url` refs, would do a CDX lookup and this URL
+ would be a valid/HTTP-200 web.archive.org capture URL
+ mimetype: Optional[str]
+ application/pdf, text/html, etc
+ blank for landing pages
+ size_bytes: Optional[int]
+ thumbnail_url: Optional[str]
+
+ CslBiblioRef
+ # an "enriched" version of BiblioRef with metadata about the source or
+ # target entity. would be "hydrated" via a lookup to, eg, the
+ # `fatcat_release` elasticsearch index (fast mget fetch with a single
+ # request), as opposed to fatcat API fetches
+ biblio_ref: BiblioRef
+ source_csl/target_csl: free-form CSL-JSON
+ source_access/target_access: List[AccessOption]
+
+ FatcatBiblioRef
+ # enriched version of BiblioRef with complete ReleaseEntity object as
+ # fetched from the fatcat API. CSL-JSON metadata would be derived from
+ # the full release entity.
+ biblio_ref: BiblioRef
+ source_release/target_release: Optional[ReleaseEntity]
+ complete ReleaseEntity from API, with optional expand/hide fields
+ source_csl/target_csl: free-form CSL-JSON
+ CSL-JSON version of ReleaseEntity metadata
+ source_access/target_access: List[AccessOption]
+
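+A corresponding sketch for the wrapper/enriched objects (`ReleaseEntity` from
+the fatcat python API client; again just a sketch, not a final schema):
+
+    from typing import Any, Dict, List, Optional
+    from fatcat_openapi_client import ReleaseEntity
+    from pydantic import BaseModel
+
+    class AccessOption(BaseModel):
+        access_type: str         # wayback, ia_file, repository, loginwall, etc
+        access_url: str
+        mimetype: Optional[str]  # blank for landing pages
+        size_bytes: Optional[int]
+        thumbnail_url: Optional[str]
+
+    class FatcatBiblioRef(BaseModel):
+        # BiblioRef as sketched above
+        biblio_ref: BiblioRef
+        source_release: Optional[ReleaseEntity]
+        target_release: Optional[ReleaseEntity]
+        source_csl: Optional[Dict[str, Any]]
+        target_csl: Optional[Dict[str, Any]]
+        source_access: List[AccessOption] = []
+        target_access: List[AccessOption] = []
+
+        class Config:
+            # ReleaseEntity is a plain generated class, not a pydantic model
+            arbitrary_types_allowed = True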
+
+## Datastore
+
+Would store in Elasticsearch as a live database, at least to start.
+
+TODO: try generating ~1 million of these objects to estimate index size (at
+billions of docs).
+
+It might be reasonable to use PostgreSQL in the future, with more explicit
+control over indexes and tuning for latency. But Elasticsearch is pretty easy
+to operate and scale (eg, by adding replicas).
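+
+For a sense of what the mapping might look like, a sketch using the
+elasticsearch-py client (7.x style; the `fatcat_ref_v01` index name and exact
+field types are placeholders):
+
+    from elasticsearch import Elasticsearch
+
+    es = Elasticsearch("http://localhost:9200")
+
+    # identifiers and match fields are exact-match keywords; almost nothing
+    # needs full-text analysis, which should keep the index fairly compact
+    es.indices.create(
+        index="fatcat_ref_v01",
+        body={
+            "mappings": {
+                "properties": {
+                    "source_release_ident": {"type": "keyword"},
+                    "source_work_ident": {"type": "keyword"},
+                    "source_year": {"type": "short"},
+                    "ref_index": {"type": "integer"},
+                    "ref_key": {"type": "keyword"},
+                    "target_release_ident": {"type": "keyword"},
+                    "target_work_ident": {"type": "keyword"},
+                    "target_openlibrary_work": {"type": "keyword"},
+                    "target_url_surt": {"type": "keyword"},
+                    "match_provenance": {"type": "keyword"},
+                    "match_status": {"type": "keyword"},
+                    "target_unstructured": {"type": "text", "index": False},
+                },
+            },
+        },
+    )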
+
+
+## Methods / Implementation
+
+ get_outbound_refs(
+ release_ident | work_ident | wikipedia_article,
+ limit: int = 100,
+ offset: Optional[int] = None,
+ ) -> List[BiblioRef]
+
+ get_inbound_refs(
+ release_ident | work_ident | openlibrary_work | url_surt | url,
+ consolidate_works: bool = True,
+        # for work_ident lookups, whether to consolidate refs from multiple
+        # releases of the same work into a single entry
+ filter_stage: List[str],
+ # eg, only include "published" sources
+ filter_type: List[str],
+ # eg, only include "fatcat" sources, not "wikipedia" article refs
+ limit: int = 25,
+ offset: Optional[int] = None,
+ ) -> List[BiblioRef]
+
+ count_inbound_refs(...) -> int
+ same parameters as get_inbound_refs(), but returns just a count
+
+ get_all_outbound_refs(...) -> List[BiblioRef]
+ get_all_inbound_refs(...) -> List[BiblioRef]
+ same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?)
+ (optional; maybe not public)
+
+ # run elasticsearch mget query for all ref idents and include "enriched" refs when possible
+ # for outbound URL refs, would do wayback CDX fetches to find a direct wayback URL
+ # TODO: for openlibrary, would this query openlibrary.org API? or some fatcat-specific index?
+ enrich_inbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
+ enrich_outbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
+
+ # run fatcat API fetches for each ref and return "enriched" refs
+ enrich_inbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
+ enrich_outbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
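+
+A minimal sketch of how `get_outbound_refs()` could be implemented against the
+elasticsearch index (release_ident lookup only; the index name and client
+wiring are placeholders):
+
+    from typing import List, Optional
+    from elasticsearch import Elasticsearch
+
+    def get_outbound_refs(
+        es: Elasticsearch,
+        release_ident: str,
+        limit: int = 100,
+        offset: Optional[int] = None,
+    ) -> List[BiblioRef]:
+        resp = es.search(
+            index="fatcat_ref_v01",
+            body={
+                "query": {"term": {"source_release_ident": release_ident}},
+                "sort": [{"ref_index": "asc"}],  # preserve in-document order
+                "size": limit,
+                "from": offset or 0,
+            },
+        )
+        # BiblioRef as sketched in the Schemas section above
+        return [BiblioRef(**hit["_source"]) for hit in resp["hits"]["hits"]]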
+
+## HTTP API Endpoints
+
+Possible HTTP API endpoints... not even sure we would use these or expose them
+publicly?
+
+ citations-api.fatcat.wiki
+ /refs/inbound
+ &release_ident=
+ &work_ident=
+ &openlibrary_work=
+ &url=
+ /refs/outbound
+ &release_ident=
+ &work_ident=
+ /refs/csl/outbound
+ /refs/fatcat/outbound
+
+ api.fatcat.wiki/citations/v0
+ /inbound
+
+ fatcat.wiki/release/{release_ident}/refs/outbound.json
+ fatcat.wiki/work/{work_ident}/refs/outbound.json
+ &filter_type
+ &filter_stage
+ &limit
+ &offset
+
+ fatcat.wiki/refs/openlibrary/{openlibrary_ident}/inbound.json
+
+ fatcat.wiki/refs/url/inbound.json
+ &url=
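+
+If we did expose these publicly, a thin read-only layer over the methods above
+would probably be enough. A hypothetical FastAPI sketch for the outbound
+endpoint (paths and wiring are placeholders):
+
+    from fastapi import FastAPI
+
+    app = FastAPI(title="fatcat citations API (sketch)")
+
+    @app.get("/refs/outbound")
+    def refs_outbound(release_ident: str, limit: int = 100, offset: int = 0):
+        # `es` client and get_outbound_refs() as sketched above
+        refs = get_outbound_refs(es, release_ident, limit=limit, offset=offset)
+        return {"count": len(refs), "refs": [r.dict() for r in refs]}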
+
+## Design Notes
+
+This proposed schema is relatively close to what a "normalized" SQL table
+would look like (a many-to-many relationship).
+
+Especially for "redistributing as a bulk corpus", we might want to consider an
+alternative data model: a single source entity containing a list of outbound
+references. This could even be a single source *work* for fatcat content, with
+many releases under the entity. One advantage of this is that source metadata
+(eg, `release_ident`) is not duplicated across multiple rows.
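+
+A rough sketch of that source-centric shape (hypothetical names):
+
+    from typing import List
+    from pydantic import BaseModel
+
+    class SourceRefs(BaseModel):
+        # all outbound references for one source work, in document order
+        source_work_ident: str
+        source_release_idents: List[str]
+        refs: List[BiblioRef]  # as above, minus the source_* fields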
+
+We could have "source objects" as a data model in the database as well; this
+would make "outbound" queries a trivial key lookup, instead of a query by
+`source_release_ident`. However, for "inbound" reference queries, many large
+rows would be returned, with unwanted metadata.
+
+Another alternative design would be storing more metadata about source and
+target in each row. This would remove the need to do separate
+"hydration"/"enrich" fetches. However, it would probably blow up the index
+size, and would require more aggressive re-indexing (in a live-updated
+scenario). Eg, when a new fulltext file (access option) becomes available for
+a work, we would need to update all citation records pointing to that work.
+
+## Third-Party Comparison
+
+Microsoft Academic provides a simple (source, destination) pair, at the
+"edition" level. There is an additional citation context table, which is
+(source, destination, context:str). A separate "PaperResources" table has
+typed URLs (the type can be project, data, or code), flagged as "cites" or
+"own". Presumably this allows mentions and citations of specific software and
+datasets, distinct from software and datasets described as part of the
+contribution of the paper itself.
+
+Open Citations REST API schema:
+
+ occ_id: the OpenCitations Corpus local identifier of the citing bibliographic resource (e.g. "br/2384552");
+ author: the semicolon-separated list of authors of the citing bibliographic resource;
+ year: the year of publication of the citing bibliographic resource;
+ title: the title of the citing bibliographic resource;
+ source_title: the title of the venue where the citing bibliographic resource has been published;
+ volume: the number of the volume in which the citing bibliographic resource has been published;
+ issue: the number of the issue in which the citing bibliographic resource has been published;
+ page: the starting and ending pages of the citing bibliographic resource in the context of the venue where it has been published;
+ doi: the DOI of the citing bibliographic resource;
+    occ_reference: the semicolon-separated OpenCitations Corpus local identifiers of all the bibliographic resources cited by the citing bibliographic resource in consideration;
+    doi_reference: the semicolon-separated DOIs of all the cited bibliographic resources that have such identifier associated;
+ citation_count: the number of citations received by the citing bibliographic resource.
+
+## TODO / Questions
+
+Should the enriched objects just extend the existing object type? Eg, have
+fields that are only sometimes set (`Optional[]`), like we have with
+`ReleaseEntity` (which always has `container_id` but only sometimes
+a full `ContainerEntity` at `container`).