From 6e40cc800ec5428602a8f18b42111acf920ff2fd Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 29 Jan 2021 20:51:18 -0800 Subject: citation graph schema/API proposal (first draft) --- proposals/2021-01-29_citation_api.md | 274 +++++++++++++++++++++++++++++++++++ 1 file changed, 274 insertions(+) create mode 100644 proposals/2021-01-29_citation_api.md (limited to 'proposals') diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md new file mode 100644 index 00000000..1e329d61 --- /dev/null +++ b/proposals/2021-01-29_citation_api.md @@ -0,0 +1,274 @@ + +Describes schemas, APIs, use-cases, and data store for citation graph. + +## Use Cases + +**Outbound reference web pages:** on fatcat.wiki and scholar.archive.org, want +to have a page that lists all of the works cited by ("outgoing") a paper or +other fatcat release. + +- query by fatcat `release_ident` +- nice to have: list references in the order they appear in the paper, and + annotate with any "key" used in the source document itself (either an index + number or a short name for the reference) +- need to have a formatted reference string for each reference, even if we have + not "linked" to a specific fatcat release (aka, would need structured or + unstructured citation text to display) + + +**Inbound reference web pages:** on fatcat.wiki and scholar.archive.org, want +to display a list of all works which cite a specific work ("inbound" +citations). + +- query by fatcat `release_ident`, or possibly by `work_ident` and ability to + say "cites a different version of the same work" +- nice to have: citation context snippet surrounding the citation +- like outbound, want to have good display options and access options for each + entry +- nice to have: non-traditional works (eg, mentions from wikipedia) + +**Inbound reference IA services:** OpenLibrary.org and/or web.archive.org might +want to show a count or list of papers that reference a web page (by URL) or +book (by openlibrary work identifier). + +**Inbound reference counts:** ability to display number of inbound citation +links for a release or work, on demand. Eg, on a fatcat.wiki release landing +page. Not sure how important this use-case is. + +**Bulk Metadata Releases:** we will want to share this citation graph as an +artifact. We can easily serialize this format into JSON and share that, or push +into a columnar file format like Parquet to get storage efficiency advances, +type/schema enforcement, and easier ingest and use for large-scale data +analysis. + +TODO: more? + + +## Schemas + +First, a combined JSON/pydantic/elasticsearch object that represents a +reference between two things: + + BiblioRef ("bibliographic reference") + _key: Optional[str] elasticsearch doc key + ("release", source_release_ident, ref_index) + ("wikipedia", source_wikipedia_article, ref_index) + update_ts: Optional[datetime] elasticsearch doc timestamp + + # metadata about source of reference + source_release_ident: Optional[str] + source_work_ident: Optional[str] + source_wikipedia_article: Optional[str] + with lang prefix like "en:Superglue" + # skipped: source_openlibrary_work + # skipped: source_url_surt + source_release_stage: Optional[str] + source_year: Optional[int] + + # context of the reference itself + ref_index: int + 1-indexed, not 0-indexed + ref_key: Optional[str] + eg, "Lee86", "BIB23" + ref_locator: Optional[str] + eg, page number + + # target of reference (identifiers) + target_release_ident: Optional[str] + target_work_ident: Optional[str] + target_openlibrary_work: Optional[str] + target_url_surt: Optional[str] + target_url: Optional[str] + would not be stored in elasticsearch, but would be auto-generated + by all "get" methods from the SURT, so calling code does not need + to do SURT transform + # skipped: target_wikipedia_article + + match_provenance: str + crossref, pubmed, grobid, etc + match_status: Optional[str] + strong, weak, etc + TODO: "match_strength"? + match_reason: Optional[str] + "doi", "isbn", "fuzzy title, author", etc + maybe "fuzzy-title-author"? + + target_unstructured: string (only if no release_ident link/match) + target_csl: free-form JSON (only if no release_ident link/match) + CSL-JSON schema (similar to ReleaseEntity schema, but not exactly) + generated from unstructured by a GROBID parse, if needed + +Then, two wrapper objects that add more complete metadata. These would be +pydantic/JSON objects, used in python code, and maybe exposed via API, but not +indexed in elasticsearch. These are the objects that would, eg, be used by +jinja templated to display lists of references in the user interface. + + AccessOption + access_type: str + describes type of access link + controlled values: wayback, ia_file, repository, loginwall, etc + access_url: str + note: for `target_url` refs, would do a CDX lookup and this URL + would be a valid/HTTP-200 web.archive.org capture URL + mimetype: Optional[str] + application/pdf, text/html, etc + blank for landing pages + size_bytes: Optional[int] + thumbnail_url: Optional[str] + + CslBiblioRef + # an "enriched" version of BiblioRef with metadata about the source or + # target entity. would be "hydrated" via a lookup to, eg, the + # `fatcat_release` elasticsearch index (fast mget fetch with a single + # request), as opposed to fatcat API fetches + biblio_ref: BiblioRef + source_csl/target_csl: free-form CSL-JSON + source_access/target_access: List[AccessOption] + + FatcatBiblioRef + # enriched version of BiblioRef with complete ReleaseEntity object as + # fetched from the fatcat API. CSL-JSON metadata would be derived from + # the full release entity. + biblio_ref: BiblioRef + source_release/target_release: Optional[ReleaseEntity] + complete ReleaseEntity from API, with optional expand/hide fields + source_csl/target_csl: free-form CSL-JSON + CSL-JSON version of ReleaseEntity metadata + source_access/target_access: List[AccessOption] + + +## Datastore + +Would store in Elasticsearch as a live database, at least to start. + +TODO: try generating ~1 million of these objects to estimate index size (at +billions of docs). + +Might be reasonable to use PostgreSQL in the future, with more explicit control +over indexes and tuning for latency. But Elasticsearch is pretty easy to +operate (eg, replicas). + + +## Methods / Implementation + + get_outbound_refs( + release_ident | work_ident | wikipedia_article, + limit: int = 100, + offset: Optional[int] = None, + ) -> List[BiblioRef] + + get_inbound_refs( + release_ident | work_ident | openlibrary_work | url_surt | url, + consolidate_works: bool = True, + # for work_ident lookups, whether to + filter_stage: List[str], + # eg, only include "published" sources + filter_type: List[str], + # eg, only include "fatcat" sources, not "wikipedia" article refs + limit: int = 25, + offset: Optional[int] = None, + ) -> List[BiblioRef] + + count_inbound_refs(...) -> int + same parameters as get_inbound_refs(), but returns just a count + + get_all_outbound_refs(...) -> List[BiblioRef] + get_all_inbound_refs(...) -> List[BiblioRef] + same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?) + (optional; maybe not public) + + # run elasticsearch mget query for all ref idents and include "enriched" refs when possible + # for outbound URL refs, would do wayback CDX fetches to find a direct wayback URL + # TODO: for openlibrary, would this query openlibrary.org API? or some fatcat-specific index? + enrich_inbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef] + enrich_outbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef] + + # run fatcat API fetches for each ref and return "enriched" refs + enrich_inbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef] + enrich_outbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef] + +## HTTP API Endpoints + +Possible HTTP API endpoints... not even sure we would use these or expose them +publicly? + + citations-api.fatcat.wiki + /refs/inbound + &release_ident= + &work_ident= + &openlibrary_work= + &url= + /refs/outbound + &release_ident= + &work_ident= + /refs/csl/outbound + /refs/fatcat/outbound + + api.fatcat.wiki/citations/v0 + /inbound + + fatcat.wiki/release/{release_ident}/refs/outbound.json + fatcat.wiki/work/{work_ident}/refs/outbound.json + &filter_type + &filter_stage + &limit + &offset + + fatcat.wiki/refs/openlibrary/{openlibrary_ident}/inbound.json + + fatcat.wiki/refs/url/inbound.json + &url= + +## Design Notes + +This proposed schema is relatively close to what the "normalize" SQL table +would look like (many-to-many relationship). + +Especiall for "redistributing as bulk corpus", we might want to consider an +alternative data model which is a single source entity containing a list of +outbound references. Could even be a single source *work* for fatcat content, +with many release under the entity. One advantage of this is that source +metadata (eg, `release_ident`) is not duplicated on multiple rows. + +We could have "source objects" as a data model in the database as well; this +would make "outbound" queries a trivial key lookup, instead of a query by +`source_release_ident`. However, for "inbound" reference queries, many large +rows would be returned, with unwanted metadata. + +Another alternative design would be storing more metadata about source and +target in each row. This would remove the ned to do separate +"hydration"/"enrich" fetches. This would probably blow up in the index size +though, and would require more aggressive re-indexing (in a live-updated +scenario). Eg, when a new fulltext file is updated (access option), would need +to update all citation records pointing to that work. + +## Third-Party Comparison + +Microsoft Academic provides a simple (source, destination) pair, at the +"edition" level. An additional citation context table, which is (source, +destination, context:str). A separate "PaperResources" table has typed URLs +(type can be project, data, code), flagged as "cites" or "own". Presumably this +allows mentions and citations of specific software and datasets, distinct from +software and datasets described as part of the contribution of the paper itself. + +Open Citations REST API schema: + + occ_id: the OpenCitations Corpus local identifier of the citing bibliographic resource (e.g. "br/2384552"); + author: the semicolon-separated list of authors of the citing bibliographic resource; + year: the year of publication of the citing bibliographic resource; + title: the title of the citing bibliographic resource; + source_title: the title of the venue where the citing bibliographic resource has been published; + volume: the number of the volume in which the citing bibliographic resource has been published; + issue: the number of the issue in which the citing bibliographic resource has been published; + page: the starting and ending pages of the citing bibliographic resource in the context of the venue where it has been published; + doi: the DOI of the citing bibliographic resource; + occ_reference: the semicolon-separated OpenCitations Corpus local identifiers of all the bibliograhic resources cited by the citing bibliographic resource in consideration; + doi_reference: the semicolon-separated DOIs of all the cited bibliograhic resources that have such identifier associated; + citation_count: the number of citations received by the citing bibliographic resource. + +## TODO / Questions + +Should the enriched objects just extend the existing object type? Eg, have +fields that are only sometimes set (`Optional[]`), like we have with +`ReleaseEntity` (which always has `container_id` but only sometimes +a full `ContainerEntity` at `container`). -- cgit v1.2.3