citation graph schema/API proposal (first draft)

author: Bryan Newbold <bnewbold@robocracy.org> 2021-01-29 20:51:18 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-02-26 14:53:58 -0800
commit: 6e40cc800ec5428602a8f18b42111acf920ff2fd (patch)
tree: 5fb7a456106100abc535900fc10256c4971dbbc6 /proposals
parent: 017d6f96632cd2a083c94bf9c5bbbd107df3e222 (diff)
download: fatcat-6e40cc800ec5428602a8f18b42111acf920ff2fd.tar.gz
fatcat-6e40cc800ec5428602a8f18b42111acf920ff2fd.zip
1 files changed, 274 insertions, 0 deletions
diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md
new file mode 100644
index 00000000..1e329d61
--- /dev/null
+++ b/proposals/2021-01-29_citation_api.md
@@ -0,0 +1,274 @@
+
+Describes schemas, APIs, use-cases, and data store for citation graph.
+
+## Use Cases
+
+**Outbound reference web pages:** on fatcat.wiki and scholar.archive.org, want
+to have a page that lists all of the works cited by ("outgoing") a paper or
+other fatcat release.
+
+- query by fatcat `release_ident`
+- nice to have: list references in the order they appear in the paper, and
+  annotate with any "key" used in the source document itself (either an index
+  number or a short name for the reference)
+- need to have a formatted reference string for each reference, even if we have
+  not "linked" to a specific fatcat release (aka, would need structured or
+  unstructured citation text to display)
+
+
+**Inbound reference web pages:** on fatcat.wiki and scholar.archive.org, want
+to display a list of all works which cite a specific work ("inbound"
+citations).
+
+- query by fatcat `release_ident`, or possibly by `work_ident` and ability to
+  say "cites a different version of the same work"
+- nice to have: citation context snippet surrounding the citation
+- like outbound, want to have good display options and access options for each
+  entry
+- nice to have: non-traditional works (eg, mentions from wikipedia)
+
+**Inbound reference IA services:** OpenLibrary.org and/or web.archive.org might
+want to show a count or list of papers that reference a web page (by URL) or
+book (by openlibrary work identifier).
+
+**Inbound reference counts:** ability to display number of inbound citation
+links for a release or work, on demand. Eg, on a fatcat.wiki release landing
+page. Not sure how important this use-case is.
+
+**Bulk Metadata Releases:** we will want to share this citation graph as an
+artifact. We can easily serialize this format into JSON and share that, or push
+into a columnar file format like Parquet to get storage efficiency advantages,
+type/schema enforcement, and easier ingest and use for large-scale data
+analysis.
+
+TODO: more?
+
+
+## Schemas
+
+First, a combined JSON/pydantic/elasticsearch object that represents a
+reference between two things:
+
+    BiblioRef ("bibliographic reference")
+        _key: Optional[str] elasticsearch doc key
+            ("release", source_release_ident, ref_index)
+            ("wikipedia", source_wikipedia_article, ref_index)
+        update_ts: Optional[datetime] elasticsearch doc timestamp
+
+        # metadata about source of reference
+        source_release_ident: Optional[str]
+        source_work_ident: Optional[str]
+        source_wikipedia_article: Optional[str]
+            with lang prefix like "en:Superglue"
+        # skipped: source_openlibrary_work
+        # skipped: source_url_surt
+        source_release_stage: Optional[str]
+        source_year: Optional[int]
+
+        # context of the reference itself
+        ref_index: int
+            1-indexed, not 0-indexed
+        ref_key: Optional[str]
+            eg, "Lee86", "BIB23"
+        ref_locator: Optional[str]
+            eg, page number
+
+        # target of reference (identifiers)
+        target_release_ident: Optional[str]
+        target_work_ident: Optional[str]
+        target_openlibrary_work: Optional[str]
+        target_url_surt: Optional[str]
+        target_url: Optional[str]
+            would not be stored in elasticsearch, but would be auto-generated
+            by all "get" methods from the SURT, so calling code does not need
+            to do SURT transform
+        # skipped: target_wikipedia_article
+
+        match_provenance: str
+            crossref, pubmed, grobid, etc
+        match_status: Optional[str]
+            strong, weak, etc
+            TODO: "match_strength"?
+        match_reason: Optional[str]
+            "doi", "isbn", "fuzzy title, author", etc
+            maybe "fuzzy-title-author"?
+
+        target_unstructured: string (only if no release_ident link/match)
+        target_csl: free-form JSON (only if no release_ident link/match)
+            CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+            generated from unstructured by a GROBID parse, if needed
+
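As a rough illustration of the schema above, here is a minimal sketch of a subset of `BiblioRef` together with a document key builder. It uses a stdlib dataclass as a stand-in for the pydantic model the proposal mentions. The underscore-joined key format and the `doc_key` helper name are assumptions; the proposal only lists the tuple components.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BiblioRef:
    # Subset of the proposed fields; the remaining ones are omitted for brevity.
    ref_index: int  # 1-indexed, not 0-indexed
    match_provenance: str
    source_release_ident: Optional[str] = None
    source_wikipedia_article: Optional[str] = None
    ref_key: Optional[str] = None
    target_release_ident: Optional[str] = None
    target_unstructured: Optional[str] = None


def doc_key(ref: BiblioRef) -> str:
    """Build a doc key from (source type, source identifier, ref_index).

    Hypothetical helper: the string form of the key is an assumption.
    """
    if ref.ref_index < 1:
        raise ValueError("ref_index is 1-indexed")
    if ref.source_release_ident:
        return f"release_{ref.source_release_ident}_{ref.ref_index}"
    if ref.source_wikipedia_article:
        return f"wikipedia_{ref.source_wikipedia_article}_{ref.ref_index}"
    raise ValueError("reference has no known source identifier")


ref = BiblioRef(
    ref_index=3,
    match_provenance="crossref",
    source_release_ident="aaaaaaaaaaaaaaaaaaaaaaaaai",  # placeholder ident
    ref_key="Lee86",
)
key = doc_key(ref)
```

A deterministic key of this kind would let a re-run of the matching pipeline overwrite existing documents instead of creating duplicates.
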
+Then, two wrapper objects that add more complete metadata. These would be
+pydantic/JSON objects, used in python code, and maybe exposed via API, but not
+indexed in elasticsearch. These are the objects that would, eg, be used by
+jinja templates to display lists of references in the user interface.
+
+    AccessOption
+        access_type: str
+            describes type of access link
+            controlled values: wayback, ia_file, repository, loginwall, etc
+        access_url: str
+            note: for `target_url` refs, would do a CDX lookup and this URL
+            would be a valid/HTTP-200 web.archive.org capture URL
+        mimetype: Optional[str]
+            application/pdf, text/html, etc
+            blank for landing pages
+        size_bytes: Optional[int]
+        thumbnail_url: Optional[str]
+
+    CslBiblioRef
+        # an "enriched" version of BiblioRef with metadata about the source or
+        # target entity. would be "hydrated" via a lookup to, eg, the
+        # `fatcat_release` elasticsearch index (fast mget fetch with a single
+        # request), as opposed to fatcat API fetches
+        biblio_ref: BiblioRef
+        source_csl/target_csl: free-form CSL-JSON
+        source_access/target_access: List[AccessOption]
+
+    FatcatBiblioRef
+        # enriched version of BiblioRef with complete ReleaseEntity object as
+        # fetched from the fatcat API. CSL-JSON metadata would be derived from
+        # the full release entity.
+        biblio_ref: BiblioRef
+        source_release/target_release: Optional[ReleaseEntity]
+            complete ReleaseEntity from API, with optional expand/hide fields
+        source_csl/target_csl: free-form CSL-JSON
+            CSL-JSON version of ReleaseEntity metadata
+        source_access/target_access: List[AccessOption]
+
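The "hydration" step described for `CslBiblioRef` could look roughly like the sketch below. The `fetch_releases` callable is hypothetical: it stands in for a single batch lookup (eg, an elasticsearch mget) and is assumed to return CSL-like dicts keyed by release ident. Only the target side is shown, and the dataclasses are trimmed-down stand-ins for the proposed objects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class BiblioRef:
    ref_index: int
    target_release_ident: Optional[str] = None
    target_csl: Optional[Dict[str, Any]] = None


@dataclass
class CslBiblioRef:
    biblio_ref: BiblioRef
    target_csl: Optional[Dict[str, Any]] = None
    target_access: List[Any] = field(default_factory=list)


def enrich_outbound_refs(
    refs: List[BiblioRef],
    fetch_releases: Callable[[List[str]], Dict[str, Dict[str, Any]]],
) -> List[CslBiblioRef]:
    # Collect unique target idents so the lookup is a single batch request.
    idents = sorted({r.target_release_ident for r in refs if r.target_release_ident})
    found = fetch_releases(idents) if idents else {}
    enriched = []
    for r in refs:
        csl = found.get(r.target_release_ident) if r.target_release_ident else None
        # Unmatched refs fall back to the CSL stored on the ref itself.
        enriched.append(CslBiblioRef(biblio_ref=r, target_csl=csl or r.target_csl))
    return enriched


# In-memory stand-in for the index lookup, recording how often it is called.
fake_index = {"rel1": {"title": "Example Paper"}}
calls = []


def fake_fetch(idents):
    calls.append(list(idents))
    return {i: fake_index[i] for i in idents if i in fake_index}


refs = [
    BiblioRef(ref_index=1, target_release_ident="rel1"),
    BiblioRef(ref_index=2, target_csl={"title": "Unmatched Book"}),
    BiblioRef(ref_index=3, target_release_ident="rel1"),
]
result = enrich_outbound_refs(refs, fake_fetch)
```

Passing the lookup in as a callable keeps the sketch independent of any particular client library; a real implementation would also need to fill in access options.
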
+
+## Datastore
+
+Would store in Elasticsearch as a live database, at least to start.
+
+TODO: try generating ~1 million of these objects to estimate index size (at
+billions of docs).
+
+Might be reasonable to use PostgreSQL in the future, with more explicit control
+over indexes and tuning for latency. But Elasticsearch is pretty easy to
+operate (eg, replicas).
+
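Before indexing anything, a back-of-envelope extrapolation from synthetic documents might help frame the size question. The sketch below measures serialized JSON only, which is a crude proxy: actual Elasticsearch index size depends on mappings, compression, and replica count. The field values, the `synthetic_ref` and `estimate_bytes` names, and the sample sizes are all made up for illustration.

```python
import json
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def synthetic_ref(i: int, rng: random.Random) -> dict:
    """Generate one fake reference document with plausible field sizes."""
    def ident() -> str:
        return "".join(rng.choices(ALPHABET, k=26))

    return {
        "source_release_ident": ident(),
        "source_work_ident": ident(),
        "source_release_stage": "published",
        "source_year": rng.randint(1950, 2021),
        "ref_index": (i % 40) + 1,
        "ref_key": f"BIB{(i % 40) + 1}",
        "target_release_ident": ident(),
        "target_work_ident": ident(),
        "match_provenance": "crossref",
        "match_status": "strong",
        "match_reason": "doi",
    }


def estimate_bytes(sample_docs: int, total_docs: int, seed: int = 0) -> int:
    """Extrapolate total serialized size from the average sample doc size."""
    rng = random.Random(seed)
    sample_bytes = sum(
        len(json.dumps(synthetic_ref(i, rng)).encode("utf-8"))
        for i in range(sample_docs)
    )
    return round(sample_bytes / sample_docs * total_docs)


# Small sample here to keep the example quick; the TODO suggests ~1 million.
estimate = estimate_bytes(sample_docs=1000, total_docs=2_000_000_000)
print(f"~{estimate / 1e9:.0f} GB of raw JSON at 2 billion docs")
```
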
+
+## Methods / Implementation
+
+    get_outbound_refs(
+        release_ident | work_ident | wikipedia_article,
+        limit: int = 100,
+        offset: Optional[int] = None,
+    ) -> List[BiblioRef]
+
+    get_inbound_refs(
+        release_ident | work_ident | openlibrary_work | url_surt | url,
+        consolidate_works: bool = True,
+            # for work_ident lookups, whether to             
+        filter_stage: List[str],
+            # eg, only include "published" sources
+        filter_type: List[str],
+            # eg, only include "fatcat" sources, not "wikipedia" article refs
+        limit: int = 25,
+        offset: Optional[int] = None,
+    ) -> List[BiblioRef]
+
+    count_inbound_refs(...) -> int
+        same parameters as get_inbound_refs(), but returns just a count
+
+    get_all_outbound_refs(...) -> List[BiblioRef]
+    get_all_inbound_refs(...) -> List[BiblioRef]
+        same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?)
+        (optional; maybe not public)
+
+    # run elasticsearch mget query for all ref idents and include "enriched" refs when possible
+    # for outbound URL refs, would do wayback CDX fetches to find a direct wayback URL
+    # TODO: for openlibrary, would this query openlibrary.org API? or some fatcat-specific index?
+    enrich_inbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
+    enrich_outbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
+
+    # run fatcat API fetches for each ref and return "enriched" refs
+    enrich_inbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
+    enrich_outbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
+
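One possible shape for the query behind `get_inbound_refs()` is sketched below as a plain Elasticsearch query DSL body. It assumes the ident and stage fields are mapped as `keyword` so that `term`/`terms` filters match exactly; the `inbound_refs_query` name is hypothetical, and only the release/work lookups and the stage filter are covered.

```python
from typing import Any, Dict, List, Optional


def inbound_refs_query(
    release_ident: Optional[str] = None,
    work_ident: Optional[str] = None,
    filter_stage: Optional[List[str]] = None,
    limit: int = 25,
    offset: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a search body matching refs that point at one release or work."""
    if bool(release_ident) == bool(work_ident):
        raise ValueError("pass exactly one of release_ident or work_ident")

    filters: List[Dict[str, Any]] = []
    if release_ident:
        filters.append({"term": {"target_release_ident": release_ident}})
    else:
        # A work-level lookup matches citations to any release of the work.
        filters.append({"term": {"target_work_ident": work_ident}})
    if filter_stage:
        filters.append({"terms": {"source_release_stage": filter_stage}})

    body: Dict[str, Any] = {
        "query": {"bool": {"filter": filters}},
        "size": limit,
    }
    if offset:
        body["from"] = offset
    return body


body = inbound_refs_query(
    release_ident="aaaaaaaaaaaaaaaaaaaaaaaaai",  # placeholder ident
    filter_stage=["published"],
    offset=50,
)
```

Using `filter` context rather than scored `must` clauses fits this use: the lookups are exact matches, so relevance scoring adds nothing. A count variant could send the same `query` portion to a count request instead of a search.
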
+## HTTP API Endpoints
+
+Possible HTTP API endpoints... not even sure we would use these or expose them
+publicly?
+
+    citations-api.fatcat.wiki
+        /refs/inbound
+            &release_ident=
+            &work_ident=
+            &openlibrary_work=
+            &url=
+        /refs/outbound
+            &release_ident=
+            &work_ident=
+        /refs/csl/outbound
+        /refs/fatcat/outbound
+
+    api.fatcat.wiki/citations/v0
+        /inbound
+
+    fatcat.wiki/release/{release_ident}/refs/outbound.json
+    fatcat.wiki/work/{work_ident}/refs/outbound.json
+        &filter_type
+        &filter_stage
+        &limit
+        &offset
+
+    fatcat.wiki/refs/openlibrary/{openlibrary_ident}/inbound.json
+
+    fatcat.wiki/refs/url/inbound.json
+        &url=
+
+## Design Notes
+
+This proposed schema is relatively close to what the "normalized" SQL table
+would look like (many-to-many relationship).
+
+Especially for "redistributing as bulk corpus", we might want to consider an
+alternative data model which is a single source entity containing a list of
+outbound references. Could even be a single source *work* for fatcat content,
+with many releases under the entity. One advantage of this is that source
+metadata (eg, `release_ident`) is not duplicated on multiple rows.
+
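The alternative "single source entity" model could be derived from the row-per-reference form with a simple grouping pass. The sketch below is an illustration under assumptions: rows are plain dicts using the proposed field names, every row has a `source_release_ident`, and the `refs` key and `group_by_source` name are invented here.

```python
from typing import Any, Dict, List


def group_by_source(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fold per-reference rows into one object per source release."""
    grouped: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        ident = row["source_release_ident"]
        if ident not in grouped:
            # Source metadata is stored once, on the entity itself.
            source = {k: v for k, v in row.items() if k.startswith("source_")}
            grouped[ident] = {**source, "refs": []}
        # Each nested ref keeps only the non-source fields.
        ref = {k: v for k, v in row.items() if not k.startswith("source_")}
        grouped[ident]["refs"].append(ref)
    for entity in grouped.values():
        entity["refs"].sort(key=lambda r: r["ref_index"])
    return list(grouped.values())


rows = [
    {"source_release_ident": "relA", "source_year": 2019, "ref_index": 2,
     "target_release_ident": "relX"},
    {"source_release_ident": "relB", "source_year": 2020, "ref_index": 1,
     "target_release_ident": "relX"},
    {"source_release_ident": "relA", "source_year": 2019, "ref_index": 1,
     "target_release_ident": "relY"},
]
entities = group_by_source(rows)
```
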
+We could have "source objects" as a data model in the database as well; this
+would make "outbound" queries a trivial key lookup, instead of a query by
+`source_release_ident`. However, for "inbound" reference queries, many large
+rows would be returned, with unwanted metadata.
+
+Another alternative design would be storing more metadata about source and
+target in each row. This would remove the need to do separate
+"hydration"/"enrich" fetches. This would probably blow up in the index size
+though, and would require more aggressive re-indexing (in a live-updated
+scenario). Eg, when a new fulltext file is updated (access option), would need
+to update all citation records pointing to that work.
+
+## Third-Party Comparison
+
+Microsoft Academic provides a simple (source, destination) pair, at the
+"edition" level. An additional citation context table holds (source,
+destination, context:str) rows. A separate "PaperResources" table has typed URLs
+(type can be project, data, code), flagged as "cites" or "own". Presumably this
+allows mentions and citations of specific software and datasets, distinct from
+software and datasets described as part of the contribution of the paper itself.
+
+Open Citations REST API schema:
+
+    occ_id: the OpenCitations Corpus local identifier of the citing bibliographic resource (e.g. "br/2384552");
+    author: the semicolon-separated list of authors of the citing bibliographic resource;
+    year: the year of publication of the citing bibliographic resource;
+    title: the title of the citing bibliographic resource;
+    source_title: the title of the venue where the citing bibliographic resource has been published;
+    volume: the number of the volume in which the citing bibliographic resource has been published;
+    issue: the number of the issue in which the citing bibliographic resource has been published;
+    page: the starting and ending pages of the citing bibliographic resource in the context of the venue where it has been published;
+    doi: the DOI of the citing bibliographic resource;
+    occ_reference: the semicolon-separated OpenCitations Corpus local identifiers of all the bibliographic resources cited by the citing bibliographic resource in consideration;
+    doi_reference: the semicolon-separated DOIs of all the cited bibliographic resources that have such identifier associated;
+    citation_count: the number of citations received by the citing bibliographic resource.
+
+## TODO / Questions
+
+Should the enriched objects just extend the existing object type? Eg, have
+fields that are only sometimes set (`Optional[]`), like we have with
+`ReleaseEntity` (which always has `container_id` but only sometimes
+a full `ContainerEntity` at `container`).