| author | Bryan Newbold <bnewbold@archive.org> | 2021-06-02 00:26:51 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2021-06-02 00:26:51 -0700 | 
| commit | 87a9bb4ed81b503f03e6e77d6b082249523e67d4 (patch) | |
| tree | 791114dd83f83db57e53c74532fe1cc6b0782896 /proposals | |
| parent | 63271ba7e78fede9d9fb71c9022c2fb25fda0f8e (diff) | |
crossref DB proposal, and include in SQL schema
Diffstat (limited to 'proposals')
| -rw-r--r-- | proposals/2021-04-22_crossref_db.md | 86 | 
1 file changed, 86 insertions(+), 0 deletions(-)
diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md
new file mode 100644
index 0000000..bead7a4
--- /dev/null
+++ b/proposals/2021-04-22_crossref_db.md
@@ -0,0 +1,86 @@

status: work-in-progress

Crossref DOI Metadata in Sandcrawler DB
=======================================

The proposal is to keep a local copy of Crossref API metadata records in the
sandcrawler DB, accessible by simple key lookup via postgrest (a lookup sketch
appears at the end of this document).

The initial goal is to include these records in scholar work "bundles" (along
with fulltext, etc.), in particular as part of the reference extraction
pipeline. Around late 2020, many additional references became available via
Crossref records, but have not been imported (updated) into fatcat. Reference
storage in the fatcat API is a scaling problem we would like to put off, so
injecting content this way is desirable.

To start, we will work with a bulk dump made available by Crossref. In the
future, we might persist the daily feed so that we have a continuously
up-to-date copy.

Another application of Crossref-in-bundles is to identify the overall scale of
changes since the initial Crossref metadata import.


## Sandcrawler DB Schema

The "indexed" field in this case refers to the upstream timestamp, not the
sandcrawler database update time.

    CREATE TABLE IF NOT EXISTS crossref (
        doi                 TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)),
        indexed             TIMESTAMP WITH TIME ZONE NOT NULL,
        record              JSON NOT NULL,
        PRIMARY KEY(doi)
    );

For postgrest access, we may also need to:

    GRANT SELECT ON public.crossref TO web_anon;

## SQL Backfill Command

For an example file:

    cat sample.json \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

For a full snapshot:

    zcat crossref_public_data_file_2021_01.json.gz \
        | pv -l \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

jq is the bottleneck (it saturates a single CPU core).

## Kafka Worker

This worker pulls from the fatcat Crossref ingest Kafka feed and persists
records into the crossref table; a sketch follows.
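
A minimal sketch of such a worker, assuming `confluent-kafka` and `psycopg2`;
the topic and group names are hypothetical placeholders (the real feed name
lives in fatcat/sandcrawler configuration), and sandcrawler's existing persist
workers may structure this differently:

    import json

    import psycopg2
    from confluent_kafka import Consumer, KafkaException

    # Hypothetical topic/group names; the real ones live in config.
    KAFKA_TOPIC = "fatcat-prod.api-crossref"
    CONSUMER_GROUP = "persist-crossref"

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": CONSUMER_GROUP,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([KAFKA_TOPIC])
    conn = psycopg2.connect("dbname=sandcrawler")

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise KafkaException(msg.error())
        record = json.loads(msg.value())
        with conn.cursor() as cur:
            # Upsert, keeping only the newest version of each record; the DOI
            # is lowercased to satisfy the CHECK constraint in the schema.
            cur.execute(
                """
                INSERT INTO crossref (doi, indexed, record)
                VALUES (%s, %s, %s)
                ON CONFLICT (doi) DO UPDATE
                SET indexed = EXCLUDED.indexed, record = EXCLUDED.record
                WHERE EXCLUDED.indexed > crossref.indexed
                """,
                (
                    record["DOI"].lower(),
                    record["indexed"]["date-time"],
                    json.dumps(record),
                ),
            )
        conn.commit()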

## SQL Table Disk Utilization

An example backfill from early 2021 loaded about 120 million Crossref DOI
records.

Starting database size (with ingest running):

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  896G  818G  53% /1

    Size: 475.14G

The ingest SQL command took:

    120M 15:06:08 [2.22k/s]
    COPY 120684688

Database size afterwards:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  1.2T  498G  71% /1

    Size: 794.88G

So the crossref table costs about 320 GB of disk.
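
For reference, equivalent size figures can be pulled with Postgres's built-in
size functions; a minimal sketch, assuming a local connection to the
sandcrawler database:

    import psycopg2

    conn = psycopg2.connect("dbname=sandcrawler")
    with conn.cursor() as cur:
        # Whole-database size (comparable to the "Size:" lines above).
        cur.execute("SELECT pg_size_pretty(pg_database_size('sandcrawler'))")
        print("database:", cur.fetchone()[0])
        # The crossref table itself, including indexes and TOAST storage.
        cur.execute("SELECT pg_size_pretty(pg_total_relation_size('crossref'))")
        print("crossref table:", cur.fetchone()[0])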
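
And for completeness, a sketch of the simple key lookup via postgrest
mentioned at the top of this proposal, as seen from Python client code. The
base URL is a hypothetical placeholder (deployment-specific); the `eq.` column
filter is standard PostgREST query syntax:

    from typing import Optional

    import requests

    # Hypothetical PostgREST endpoint; the real host/port is
    # deployment-specific.
    POSTGREST_URL = "http://localhost:3000"

    def lookup_crossref(doi: str) -> Optional[dict]:
        """Fetch a single Crossref record by DOI via PostgREST key lookup."""
        # DOIs are stored lowercased (enforced by the schema CHECK
        # constraint), so normalize before querying.
        resp = requests.get(
            f"{POSTGREST_URL}/crossref",
            params={"doi": f"eq.{doi.lower()}"},
        )
        resp.raise_for_status()
        rows = resp.json()
        return rows[0] if rows else None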