diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
new file mode 100644
index 0000000..1fc79b6
--- /dev/null
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -0,0 +1,125 @@
+
+GROBID References in Sandcrawler DB
+===================================
+
+Want to start processing "unstructured" raw references coming from upstream
+metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
+XML), and save the results in sandcrawler DB. From there, they will get pulled
+in to fatcat-scholar "intermediate bundles" and included in reference exports.
+
+The initial use case for this is to parse "unstructured" references deposited
+in Crossref, and include them in refcat.
+
+
+## Schema and Semantics
+
+The output JSON/dict schema for parsed references follows that of the
+`GrobidBiblio` class in `grobid_tei_xml` version 0.1.x. The
+`unstructured` field that was parsed is included in the output, though it may
+not be byte-for-byte exact (see below). One notable change from the past (eg,
+older GROBID-parsed references) is that author `name` is now `full_name`. New
+fields include `editors` (same schema as `authors`), `book_title`, and
+`series_title`.
+
+The overall output schema matches that of the `grobid_refs` SQL table:
+
+    source: string, lower-case. eg 'crossref'
+    source_id: string, eg '10.1145/3366650.3366668'
+    source_ts: optional timestamp (full ISO datetime with timezone, eg `Z`
+               suffix), which identifies the version of upstream metadata
+    refs_json: JSON, list of `GrobidBiblio` JSON objects
+
+References are re-processed on a per-article (or per-release) basis. All the
+references for an article are handled as a batch and output as a batch. If
+there are no upstream references, a row with `refs_json` as an empty list may
+be returned.
+
+Not all upstream references get re-parsed, even if an 'unstructured' field is
+available. If 'unstructured' is not available, no row is ever output. For
+example, if a reference includes `unstructured` (raw citation string), but also
+has structured metadata for authors, title, year, and journal name, we might
+not re-parse the `unstructured` string. Whether to re-parse is evaluated on a
+per-reference basis. This behavior may change over time.
+
+`unstructured` strings may be pre-processed before being submitted to GROBID.
+This is because many sources have systemic encoding issues. GROBID itself may
+also do some modification of the input citation string before returning it in
+the output. This means the `unstructured` string is not a reliable way to map
+between specific upstream references and parsed references. Instead, the `id`
+field (str) of `GrobidBiblio` gets set to any upstream "key" or "index"
+identifier used to track individual references. If there is only a numeric
+index, the `id` is that number as a string.
+
+The `key` or `id` may need to be woven back into the ref objects manually,
+because GROBID `processCitationList` takes just a list of raw strings, with no
+attached reference-level key or id.
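+
+As a rough sketch of how a downstream consumer might line parsed references
+back up with upstream ones by `id` rather than by raw string: the query below
+assumes the `grobid_refs` table defined later in this document and PostgreSQL's
+built-in `json_array_elements` function. The DOI is only the example value
+used above, not a known row.
+
+```sql
+-- Expand the refs_json list into one row per parsed reference.
+-- `ref` is each GrobidBiblio JSON object; `id` carries the upstream key/index.
+SELECT
+    ref->>'id' AS ref_id,
+    ref->>'unstructured' AS unstructured
+FROM grobid_refs,
+     json_array_elements(refs_json) AS ref
+WHERE source = 'crossref'
+  AND source_id = '10.1145/3366650.3366668';
+```
+
+Joining on `ref_id` instead of comparing `unstructured` strings sidesteps the
+pre-processing and GROBID modifications described above.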
+
+
+## New SQL Table and View
+
+We may want to do re-parsing of references from sources other than `crossref`,
+so there is a generic `grobid_refs` table. But it is also common to fetch both
+the crossref metadata and any re-parsed references together, so as a convenience
+there is a PostgreSQL view (virtual table) that includes both a crossref
+metadata record and parsed citations, if available. If downstream code cares a
+lot about having the refs and record be in sync, the `source_ts` field on
+`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
+`.indexed.date-time` JSON field in the record itself).
+
+Remember that DOIs should always be lower-cased before querying, inserting,
+comparing, etc.
+
+    CREATE TABLE IF NOT EXISTS grobid_refs (
+        source TEXT NOT NULL CHECK (octet_length(source) >= 1),
+        source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
+        source_ts TIMESTAMP WITH TIME ZONE,
+        updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+        refs_json JSON NOT NULL,
+        PRIMARY KEY(source, source_id)
+    );
+
+    CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS
+        SELECT
+            crossref.doi as doi,
+            crossref.indexed as indexed,
+            crossref.record as record,
+            grobid_refs.source_ts as source_ts,
+            grobid_refs.refs_json as refs_json
+        FROM crossref
+        LEFT JOIN grobid_refs ON
+            grobid_refs.source_id = crossref.doi
+            AND grobid_refs.source = 'crossref';
+
+Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.
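+
+A minimal sketch of the "in sync" check described above, assuming the
+`indexed` column of `crossref` is a timestamp directly comparable to
+`source_ts`. The DOI literal is just the example value from the schema
+section, and the `has_parsed_refs` / `refs_in_sync` column names are
+illustrative only.
+
+```sql
+-- Look up one record through the view, lower-casing the DOI first.
+-- Because of the LEFT JOIN, refs_json and source_ts are NULL when no
+-- parsed references exist for this DOI.
+SELECT
+    doi,
+    indexed,
+    source_ts,
+    refs_json IS NOT NULL AS has_parsed_refs,
+    (source_ts IS NOT NULL AND source_ts = indexed) AS refs_in_sync
+FROM crossref_with_refs
+WHERE doi = lower('10.1145/3366650.3366668');
+```
+
+A consumer that needs strict consistency could skip rows where `refs_in_sync`
+is false and wait for the references to be re-parsed.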
+
+
+## New Workers / Tools
+
+For simplicity, to start, a single worker will consume from
+`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
+insert into both the `crossref` and `grobid_refs` tables. This worker will run
+locally on the machine with sandcrawler-db.
+
+Another tool will support taking large chunks of Crossref JSON (as lines),
+filtering them, processing with GROBID, and printing JSON to stdout, in the
+`grobid_refs` JSON schema.
+
+
+## Task Examples
+
+Command to process crossref records with refs tool:
+
+    cat crossref_sample.json \
+        | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \
+        | pv -l \
+        > crossref_sample.parsed.json
+
+    # => 10.0k 0:00:27 [ 368 /s]
+
+Load directly into postgres (after tables have been created):
+
+    cat crossref_sample.parsed.json \
+        | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \
+        | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');"
+
+    # => COPY 9999
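+
+A plain `COPY` into `grobid_refs` fails if any `(source, source_id)` pair is
+already present, because of the primary key. One possible way to make re-loads
+repeatable is to stage the rows and upsert them. This is only a sketch, not
+something the tools here are known to do; the `grobid_refs_staging` name is
+hypothetical.
+
+```sql
+BEGIN;
+
+-- Temporary copy of the table layout; dropped automatically at COMMIT.
+CREATE TEMP TABLE grobid_refs_staging
+    (LIKE grobid_refs INCLUDING DEFAULTS)
+    ON COMMIT DROP;
+
+-- Load rows into grobid_refs_staging here, within this transaction, using
+-- the same COPY ... FROM STDIN (DELIMITER E'\t') form as above.
+
+-- Keep one row per key (latest source_ts wins), then insert or overwrite.
+INSERT INTO grobid_refs (source, source_id, source_ts, refs_json)
+SELECT DISTINCT ON (source, source_id)
+    source, source_id, source_ts, refs_json
+FROM grobid_refs_staging
+ORDER BY source, source_id, source_ts DESC NULLS LAST
+ON CONFLICT (source, source_id) DO UPDATE SET
+    source_ts = EXCLUDED.source_ts,
+    refs_json = EXCLUDED.refs_json,
+    updated = now();
+
+COMMIT;
+```
+
+The `DISTINCT ON` step matters because a single `INSERT ... ON CONFLICT DO
+UPDATE` statement cannot update the same target row twice.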