Merge branch 'bnewbold-refs-apis'

author: Bryan Newbold <bnewbold@robocracy.org> 2021-08-06 11:58:16 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-08-06 11:58:16 -0700
commit: 99885b458ad505ebb63b3e7cf5b1bae3dd2a459e (patch)
tree: de3fbb3e42b0bb7f6e447d2e13ac3f92a8bb90b2 /proposals
parent: 950d3f08bd439aed92d01dbc3cca9747570aa82c (diff)
parent: 56e4ce2d8347cdfedd492d54fde080772f3d8725 (diff)
download: fatcat-99885b458ad505ebb63b3e7cf5b1bae3dd2a459e.tar.gz
fatcat-99885b458ad505ebb63b3e7cf5b1bae3dd2a459e.zip
1 files changed, 40 insertions, 64 deletions
diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md
index 1e329d61..3805dcac 100644
--- a/proposals/2021-01-29_citation_api.md
+++ b/proposals/2021-01-29_citation_api.md
@@ -41,13 +41,13 @@ into a columnar file format like Parquet to get storage efficiency advances,
 type/schema enforcement, and easier ingest and use for large-scale data
 analysis.
 
-TODO: more?
-
 
 ## Schemas
 
 First, a combined JSON/pydantic/elasticsearch object that represents a
-reference between two things:
+reference from one thing to another, where the "source" must be known, but the
+"target" may either be known ("matched") or ambiguous (eg, just a reference
+string):
 
     BiblioRef ("bibliographic reference")
         _key: Optional[str] elasticsearch doc key
@@ -60,8 +60,6 @@ reference between two things:
         source_work_ident: Optional[str]
         source_wikipedia_article: Optional[str]
             with lang prefix like "en:Superglue"
-        # skipped: source_openlibrary_work
-        # skipped: source_url_surt
         source_release_stage: Optional[str]
         source_year: Optional[int]
 
@@ -71,7 +69,9 @@ reference between two things:
         ref_key: Optional[str]
             eg, "Lee86", "BIB23"
         ref_locator: Optional[str]
-            eg, page number
+            eg, specific page number in the book being referenced, if
+            applicable. Not used for, eg, first page of paper in a
+            volume/issue.
 
         # target of reference (identifiers)
         target_release_ident: Optional[str]
@@ -82,15 +82,15 @@ reference between two things:
             would not be stored in elasticsearch, but would be auto-generated
             by all "get" methods from the SURT, so calling code does not need
             to do SURT transform
-        # skipped: target_wikipedia_article
 
         match_provenance: str
             crossref, pubmed, grobid, etc
+            TODO: "ref_provenance"
         match_status: Optional[str]
             strong, weak, etc
-            TODO: "match_strength"?
+            TODO: "match_strength"? "match_confidence"?
         match_reason: Optional[str]
-            "doi", "isbn", "fuzzy title, author", etc
+            "doi", "isbn", "title-fuzzy, author", etc
             maybe "fuzzy-title-author"?
 
         target_unstructured: string (only if no release_ident link/match)
@@ -116,33 +116,22 @@ jinja templated to display lists of references in the user interface.
         size_bytes: Optional[int]
         thumbnail_url: Optional[str]
 
-    CslBiblioRef
-        # an "enriched" version of BiblioRef with metadata about the source or
-        # target entity. would be "hydrated" via a lookup to, eg, the
-        # `fatcat_release` elasticsearch index (fast mget fetch with a single
-        # request), as opposed to fatcat API fetches
-        biblio_ref: BiblioRef
-        source_csl/target_csl: free-form CSL-JSON
-        source_access/target_access: List[AccessOption]
-
-    FatcatBiblioRef
+    EnrichedBiblioRef
         # enriched version of BiblioRef with complete ReleaseEntity object as
-        # fetched from the fatcat API. CSL-JSON metadata would be derived from
-        # the full release entity.
+        # fetched from entity catalogs, if available. For example, fatcat API.
         biblio_ref: BiblioRef
         source_release/target_release: Optional[ReleaseEntity]
             complete ReleaseEntity from API, with optional expand/hide fields
-        source_csl/target_csl: free-form CSL-JSON
-            CSL-JSON version of ReleaseEntity metadata
         source_access/target_access: List[AccessOption]
+        # TODO: target_openlibrary? source_wikipedia?
 
 
 ## Datastore
 
 Would store in Elasticsearch as a live database, at least to start.
 
-TODO: try generating ~1 million of these objects to estimate index size (at
-billions of docs).
+Example Elasticsearch index `fatcat_ref_v02_20210716` has 1.8 billion docs
+(references), and consumes 435 GBytes of disk.
 
 Might be reasonable to use PostgreSQL in the future, with more explicit control
 over indexes and tuning for latency. But Elasticsearch is pretty easy to
@@ -172,59 +161,46 @@ operate (eg, replicas).
     count_inbound_refs(...) -> int
         same parameters as get_inbound_refs(), but returns just a count
 
-    get_all_outbound_refs(...) -> List[BiblioRef]
-    get_all_inbound_refs(...) -> List[BiblioRef]
-        same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?)
-        (optional; maybe not public)
+    # UNIMPLEMENTED
+    #get_all_outbound_refs(...) -> List[BiblioRef]
+    #get_all_inbound_refs(...) -> List[BiblioRef]
+    #    same as get_outbound_refs()/get_inbound_refs(), but does a scroll (return list or iterator?)
+    #    (optional; maybe not public)
 
-    # run elasticsearch mget query for all ref idents and include "enriched" refs when possible
-    # for outbound URL refs, would do wayback CDX fetches to find a direct wayback URL
-    # TODO: for openlibrary, would this query openlibrary.org API? or some fatcat-specific index?
-    enrich_inbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
-    enrich_outbound_refs(refs: List[BiblioRef]) -> List[CslBiblioRef]
-
-    # run fatcat API fetches for each ref and return "enriched" refs
-    enrich_inbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
-    enrich_outbound_refs_fatcat(refs: List[BiblioRef], hide, expand) -> List[FatcatBiblioRef]
+    # run catalog API fetches for each and return "enriched" refs
+    enrich_inbound_refs(refs: List[BiblioRef], hide, expand) -> List[EnrichedBiblioRef]
+    enrich_outbound_refs(refs: List[BiblioRef], hide, expand) -> List[EnrichedBiblioRef]
 
 ## HTTP API Endpoints
 
-Possible HTTP API endpoints... not even sure we would use these or expose them
-publicly?
-
-    citations-api.fatcat.wiki
-        /refs/inbound
-            &release_ident=
-            &work_ident=
-            &openlibrary_work=
-            &url=
-        /refs/outbound
-            &release_ident=
-            &work_ident=
-        /refs/csl/outbound
-        /refs/fatcat/outbound
-
-    api.fatcat.wiki/citations/v0
-        /inbound
-
-    fatcat.wiki/release/{release_ident}/refs/outbound.json
-    fatcat.wiki/work/{work_ident}/refs/outbound.json
-        &filter_type
-        &filter_stage
+Initial web endpoints, including unstable pseudo-APIs:
+
+    fatcat.wiki/release/{release_ident}/refs-in (and .json)
+    fatcat.wiki/release/{release_ident}/refs-out (and .json)
         &limit
         &offset
+        &sort (for inbound)
+        &filter_stage (for inbound)
 
-    fatcat.wiki/refs/openlibrary/{openlibrary_ident}/inbound.json
+    fatcat.wiki/openlibrary/{openlibrary_ident}/refs-in (and .json)
+        &limit
+        &offset
+        &sort
+        &filter_stage
 
-    fatcat.wiki/refs/url/inbound.json
-        &url=
+    fatcat.wiki/web/refs-in (and .json)
+        &url= (required)
+        &limit
+        &offset
+        &sort (newest, oldest)
+        &filter_stage
 
 ## Design Notes
 
 This proposed schema is relatively close to what the "normalize" SQL table
 would look like (many-to-many relationship).
 
-Especiall for "redistributing as bulk corpus", we might want to consider an
+Especially for "redistributing as bulk corpus", we might want to consider an
 alternative data model which is a single source entity containing a list of
 outbound references. Could even be a single source *work* for fatcat content,
 with many release under the entity. One advantage of this is that source
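
Note on the schema change above: the reworded description says the "source" of a BiblioRef must be known, while the "target" may be matched or only a reference string. A minimal sketch of that distinction, assuming plain Python dataclasses rather than the combined JSON/pydantic/elasticsearch object the proposal describes. Only a subset of the fields shown in the diff is included, the `is_matched()` helper is hypothetical, and the sketch treats `target_release_ident` as the only matched identifier, which the proposal does not state. The identifier values are placeholders, not real fatcat idents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BiblioRef:
    """Illustrative subset of the BiblioRef fields shown in the diff."""

    # required in the proposal: crossref, pubmed, grobid, etc
    match_provenance: str

    # source of the reference
    source_work_ident: Optional[str] = None
    source_year: Optional[int] = None

    # reference metadata
    ref_key: Optional[str] = None        # eg, "Lee86", "BIB23"
    ref_locator: Optional[str] = None    # eg, page in the referenced book

    # target of the reference
    target_release_ident: Optional[str] = None
    target_unstructured: Optional[str] = None  # only if no release_ident match

    match_status: Optional[str] = None   # strong, weak, etc
    match_reason: Optional[str] = None   # "doi", "isbn", etc

    def is_matched(self) -> bool:
        # Hypothetical helper: counts a ref as "matched" when a target
        # release ident is present (an assumption of this sketch).
        return self.target_release_ident is not None


# Placeholder values for illustration only.
matched = BiblioRef(
    match_provenance="crossref",
    source_work_ident="work-placeholder",
    target_release_ident="release-placeholder",
    match_status="strong",
    match_reason="doi",
)
unmatched = BiblioRef(
    match_provenance="grobid",
    source_work_ident="work-placeholder",
    ref_key="Lee86",
    target_unstructured="Lee (1986). Placeholder reference string.",
)
```

Under these assumptions an unmatched ref carries only `target_unstructured`, which is what a template would fall back to when listing references.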