diff options
| author | Bryan Newbold <bnewbold@robocracy.org> | 2021-01-29 20:51:18 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-02-26 14:53:58 -0800 | 
| commit | 6e40cc800ec5428602a8f18b42111acf920ff2fd (patch) | |
| tree | 5fb7a456106100abc535900fc10256c4971dbbc6 /proposals | |
| parent | 017d6f96632cd2a083c94bf9c5bbbd107df3e222 (diff) | |
| download | fatcat-6e40cc800ec5428602a8f18b42111acf920ff2fd.tar.gz fatcat-6e40cc800ec5428602a8f18b42111acf920ff2fd.zip | |
citation graph schema/API proposal (first draft)
Diffstat (limited to 'proposals')
| -rw-r--r-- | proposals/2021-01-29_citation_api.md | 274 | 
1 files changed, 274 insertions, 0 deletions
| diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md new file mode 100644 index 00000000..1e329d61 --- /dev/null +++ b/proposals/2021-01-29_citation_api.md @@ -0,0 +1,274 @@ + +Describes schemas, APIs, use-cases, and data store for citation graph. + +## Use Cases + +**Outbound reference web pages:** on fatcat.wiki and scholar.archive.org, want +to have a page that lists all of the works cited by ("outgoing") a paper or +other fatcat release. + +- query by fatcat `release_ident` +- nice to have: list references in the order they appear in the paper, and +  annotate with any "key" used in the source document itself (either an index +  number or a short name for the reference) +- need to have a formatted reference string for each reference, even if we have +  not "linked" to a specific fatcat release (aka, would need structured or +  unstructured citation text to display) + + +**Inbound reference web pages:** on fatcat.wiki and scholar.archive.org, want +to display a list of all works which cite a specific work ("inbound" +citations). + +- query by fatcat `release_ident`, or possibly by `work_ident` and ability to +  say "cites a different version of the same work" +- nice to have: citation context snippet surrounding the citation +- like outbound, want to have good display options and access options for each +  entry +- nice to have: non-traditional works (eg, mentions from wikipedia) + +**Inbound reference IA services:** OpenLibrary.org and/or web.archive.org might +want to show a count or list of papers that reference a web page (by URL) or +book (by openlibrary work identifier). + +**Inbound reference counts:** ability to display number of inbound citation +links for a release or work, on demand. Eg, on a fatcat.wiki release landing +page. Not sure how important this use-case is. + +**Bulk Metadata Releases:** we will want to share this citation graph as an +artifact. We can easily serialize this format into JSON and share that, or push +into a columnar file format like Parquet to get storage efficiency advances, +type/schema enforcement, and easier ingest and use for large-scale data +analysis. + +TODO: more? + + +## Schemas + +First, a combined JSON/pydantic/elasticsearch object that represents a +reference between two things: + +    BiblioRef ("bibliographic reference") +        _key: Optional[str] elasticsearch doc key +            ("release", source_release_ident, ref_index) +            ("wikipedia", source_wikipedia_article, ref_index) +        update_ts: Optional[datetime] elasticsearch doc timestamp + +        # metadata about source of reference +        source_release_ident: Optional[str] +        source_work_ident: Optional[str] +        source_wikipedia_article: Optional[str] +            with lang prefix like "en:Superglue" +        # skipped: source_openlibrary_work +        # skipped: source_url_surt +        source_release_stage: Optional[str] +        source_year: Optional[int] + +        # context of the reference itself +        ref_index: int +            1-indexed, not 0-indexed +        ref_key: Optional[str] +            eg, "Lee86", "BIB23" +        ref_locator: Optional[str] +            eg, page number + +        # target of reference (identifiers) +        target_release_ident: Optional[str] +        target_work_ident: Optional[str] +        target_openlibrary_work: Optional[str] +        target_url_surt: Optional[str] +        target_url: Optional[str] +            would not be stored in elasticsearch, but would be auto-generated +            by all "get" methods from the SURT, so calling code does not need +            to do SURT transform +        # skipped: target_wikipedia_article + +        match_provenance: str +            crossref, pubmed, grobid, etc +        match_status: Optional[str] +            strong, weak, etc +            TODO: "match_strength"? +        match_reason: Optional[str] +            "doi", "isbn", "fuzzy title, author", etc +            maybe "fuzzy-title-author"? + +        target_unstructured: string (only if no release_ident link/match) +        target_csl: free-form JSON (only if no release_ident link/match) +            CSL-JSON schema (similar to ReleaseEntity schema, but not exactly) +            generated from unstructured by a GROBID parse, if needed + +Then, two wrapper objects that add more complete metadata. These would be +pydantic/JSON objects, used in python code, and maybe exposed via API, but not +indexed in elasticsearch. These are the objects that would, eg, be used by +jinja templated to display lists of references in the user interface. + +    AccessOption +        access_type: str +            describes type of access link +            controlled values: wayback, ia_file, repository, loginwall, etc +        access_url: str +            note: for `target_url` refs, would do a CDX lookup and this URL +            would be a valid/HTTP-200 web.archive.org capture URL +        mimetype: Optional[str] +            application/pdf, text/html, etc +            blank for landing pages +        size_bytes: Optional[int] +        thumbnail_url: Optional[str] + +    CslBiblioRef +        # an "enriched" version of BiblioRef with metadata about the source or +        # target entity. would be "hydrated" via a lookup to, eg, the +        # `fatcat_release` elasticsearch index (fast mget fetch with a single +        # request), as opposed to fatcat API fetches +        biblio_ref: BiblioRef +        source_csl/target_csl: free-form CSL-JSON +        source_access/target_access: List[AccessOption] + +    FatcatBiblioRef +        # enriched version of BiblioRef with complete ReleaseEntity object as +        # fetched from the fatcat API. CSL-JSON metadata would be derived from +        # the full release entity. +        biblio_ref: BiblioRef +        source_release/target_release: Optional[ReleaseEntity] +            complete ReleaseEntity from API, with optional expand/hide fields +        source_csl/target_csl: free-form CSL-JSON +            CSL-JSON version of ReleaseEntity metadata +        source_access/target_access: List[AccessOption] + + +## Datastore + +Would store in Elasticsearch as a live database, at least to start. + +TODO: try generating ~1 million of these objects to estimate index size (at +billions of docs). + +Might be reasonable to use PostgreSQL in the future, with more explicit control +over indexes and tuning for latency. But Elasticsearch is pretty easy to +operate (eg, replicas). + + +## Methods / Implementation + +    get_outbound_refs( +        release_ident | work_ident | wikipedia_article, +        limit: int = 100, +        offset: Optional[int] = None, +    ) -> List[BiblioRef] + +    get_inbound_refs( +        release_ident | work_ident | openlibrary_work | url_surt | url, +        consolidate_works: bool = True, +            # for work_ident lookups, whether to              +        filter_stage: List[str], +            # eg, only include "published" sources +        filter_type: List[str], +            # eg, only include "fatcat" sources, not "wikipedia" article refs +        limit: int = 25, +        offset: Optional[int] = None, +    ) -> List[BiblioRef] + +    count_inbound_refs(...) -> int +        same parameters as get_inbound_refs(), but returns just a count + +    get_all_outbound_refs(...) -> List[BiblioRef] +    get_all_inbound_refs(...) -> List[BiblioRef] +        same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?) +        (optional; maybe not public) + +    # run elasticsearch mget query for all ref idents and include "enriched" refs when possible +    # for outbound URL refs, would do wayback CDX fetches to find a direct wayback URL +    # TODO: for openlibrary, would this query openlibrary.org API? or some fatcat-specific index? +    enrich_inbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef] +    enrich_outbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef] + +    # run fatcat API fetches for each ref and return "enriched" refs +    enrich_inbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef] +    enrich_outbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef] + +## HTTP API Endpoints + +Possible HTTP API endpoints... not even sure we would use these or expose them +publicly? + +    citations-api.fatcat.wiki +        /refs/inbound +            &release_ident= +            &work_ident= +            &openlibrary_work= +            &url= +        /refs/outbound +            &release_ident= +            &work_ident= +        /refs/csl/outbound +        /refs/fatcat/outbound + +    api.fatcat.wiki/citations/v0 +        /inbound + +    fatcat.wiki/release/{release_ident}/refs/outbound.json +    fatcat.wiki/work/{work_ident}/refs/outbound.json +        &filter_type +        &filter_stage +        &limit +        &offset + +    fatcat.wiki/refs/openlibrary/{openlibrary_ident}/inbound.json + +    fatcat.wiki/refs/url/inbound.json +        &url= + +## Design Notes + +This proposed schema is relatively close to what the "normalize" SQL table +would look like (many-to-many relationship). + +Especiall for "redistributing as bulk corpus", we might want to consider an +alternative data model which is a single source entity containing a list of +outbound references. Could even be a single source *work* for fatcat content, +with many release under the entity. One advantage of this is that source +metadata (eg, `release_ident`) is not duplicated on multiple rows. + +We could have "source objects" as a data model in the database as well; this +would make "outbound" queries a trivial key lookup, instead of a query by +`source_release_ident`. However, for "inbound" reference queries, many large +rows would be returned, with unwanted metadata. + +Another alternative design would be storing more metadata about source and +target in each row. This would remove the ned to do separate +"hydration"/"enrich" fetches. This would probably blow up in the index size +though, and would require more aggressive re-indexing (in a live-updated +scenario). Eg, when a new fulltext file is updated (access option), would need +to update all citation records pointing to that work. + +## Third-Party Comparison + +Microsoft Academic provides a simple (source, destination) pair, at the +"edition" level. An additional citation context table, which is (source, +destination, context:str). A separate "PaperResources" table has typed URLs +(type can be project, data, code), flagged as "cites" or "own". Presumably this +allows mentions and citations of specific software and datasets, distinct from +software and datasets described as part of the contribution of the paper itself. + +Open Citations REST API schema: + +    occ_id: the OpenCitations Corpus local identifier of the citing bibliographic resource (e.g. "br/2384552"); +    author: the semicolon-separated list of authors of the citing bibliographic resource; +    year: the year of publication of the citing bibliographic resource; +    title: the title of the citing bibliographic resource; +    source_title: the title of the venue where the citing bibliographic resource has been published; +    volume: the number of the volume in which the citing bibliographic resource has been published; +    issue: the number of the issue in which the citing bibliographic resource has been published; +    page: the starting and ending pages of the citing bibliographic resource in the context of the venue where it has been published; +    doi: the DOI of the citing bibliographic resource; +    occ_reference: the semicolon-separated OpenCitations Corpus local identifiers of all the bibliograhic resources cited by the citing bibliographic resource in consideration; +    doi_reference: the semicolon-separated DOIs of all the cited bibliograhic resources that have such identifier associated; +    citation_count: the number of citations received by the citing bibliographic resource. + +## TODO / Questions + +Should the enriched objects just extend the existing object type? Eg, have +fields that are only sometimes set (`Optional[]`), like we have with +`ReleaseEntity` (which always has `container_id` but only sometimes +a full `ContainerEntity` at `container`). | 
