GROBID References in Sandcrawler DB
===================================
We want to start processing "unstructured" raw references that come from
upstream metadata sources (as distinct from upstream fulltext sources, like
PDFs or JATS XML), and save the results in sandcrawler DB. From there, they
will be pulled into fatcat-scholar "intermediate bundles" and included in
reference exports.
The initial use case for this is to parse "unstructured" references deposited
in Crossref, and include them in refcat.
## Schema and Semantics
Follows that of `grobid_tei_xml` version 0.1.
Not all references are necessarily sent to GROBID for processing. Those that
are should be identified and mapped back using the entire unstructured string.
When present, `key` or `id` is woven back into the ref objects (GROBID
`processCitationList` never sees the keys). `index`, returned by
`grobid_tei_xml`, may not be accurate (because not all references were passed
to GROBID), and may be removed (TBD).
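
The mapping described above could be sketched roughly as follows. This is an
illustration only, not the sandcrawler implementation: `weave_keys` is a
hypothetical helper, refs are shown as plain dicts, and it assumes each parsed
ref still carries the raw string it was parsed from in an `unstructured` field.

```python
from typing import Any, Dict, List


def weave_keys(
    raw_refs: List[Dict[str, Any]],
    parsed_refs: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Copy `key`/`id` from upstream refs onto parsed refs, matched by raw string."""
    # Index upstream refs by their entire unstructured string.
    by_string: Dict[str, Dict[str, Any]] = {}
    for raw in raw_refs:
        text = raw.get("unstructured")
        if text:
            # setdefault keeps the first ref if the same string appears twice
            by_string.setdefault(text.strip(), raw)

    woven = []
    for parsed in parsed_refs:
        out = dict(parsed)  # copy, so the caller's objects are not mutated
        raw = by_string.get((parsed.get("unstructured") or "").strip())
        if raw is not None:
            for field in ("key", "id"):
                if raw.get(field):
                    out[field] = raw[field]
        woven.append(out)
    return woven
```

Matching on the string rather than on position means the result does not
depend on `index`, which is the field flagged above as possibly inaccurate.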
## New SQL Table and View

    CREATE TABLE IF NOT EXISTS grobid_refs (
        source TEXT NOT NULL CHECK (octet_length(source) >= 1),
        source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
        source_ts TIMESTAMP WITH TIME ZONE,
        updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        refs_json JSONB NOT NULL,
        PRIMARY KEY(source, source_id)
    );

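
Because the primary key is `(source, source_id)`, re-processing a record can
be written as an upsert. The sketch below is one possible shape, not the
sandcrawler persistence code: `upsert_grobid_refs` is a hypothetical helper, it
assumes a DB-API cursor using `%s` placeholders (as psycopg2 does), and it
assumes `refs_json` is stored as a JSON-serialized list.

```python
import json
from typing import Any, Dict, List, Optional

UPSERT_SQL = """
    INSERT INTO grobid_refs (source, source_id, source_ts, updated, refs_json)
    VALUES (%s, %s, %s, now(), %s)
    ON CONFLICT (source, source_id) DO UPDATE SET
        source_ts = EXCLUDED.source_ts,
        updated = EXCLUDED.updated,
        refs_json = EXCLUDED.refs_json
"""


def upsert_grobid_refs(
    cursor: Any,
    source: str,
    source_id: str,
    refs: List[Dict[str, Any]],
    source_ts: Optional[str] = None,
) -> None:
    """Insert or replace the parsed refs for one (source, source_id) pair."""
    # Mirror the table's CHECK constraints before hitting the database.
    if not source or not source_id:
        raise ValueError("source and source_id must be non-empty")
    cursor.execute(
        UPSERT_SQL,
        (source, source_id, source_ts, json.dumps(refs, sort_keys=True)),
    )
```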

    CREATE OR REPLACE VIEW crossref_with_refs
        (doi, indexed, record, source_ts, refs_json) AS
        SELECT
            crossref.doi as doi,
            crossref.indexed as indexed,
            crossref.record as record,
            grobid_refs.source_ts as source_ts,
            grobid_refs.refs_json as refs_json
        FROM crossref
        LEFT JOIN grobid_refs ON
            grobid_refs.source_id = crossref.doi
            AND grobid_refs.source = 'crossref';

Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.
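
As an illustration of what a lookup through postgrest might look like, the
sketch below builds a request URL using postgrest's `column=eq.value` filter
syntax. The helper name and the base URL in the usage line are hypothetical;
the actual host and port of the deployment are not specified here.

```python
from urllib.parse import urlencode


def grobid_refs_url(base_url: str, source: str, source_id: str) -> str:
    """Build a postgrest URL selecting one row of `grobid_refs` by primary key."""
    # urlencode percent-escapes the "/" found in DOIs.
    query = urlencode({"source": f"eq.{source}", "source_id": f"eq.{source_id}"})
    return f"{base_url.rstrip('/')}/grobid_refs?{query}"


# Usage (hypothetical local postgrest instance):
print(grobid_refs_url("http://localhost:3000", "crossref", "10.1234/abc"))
```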
## New Workers / Tools
For simplicity, to start, a single worker will consume from
`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
insert into both the `crossref` and `grobid_refs` tables. This worker will run
locally on the machine with sandcrawler-db.
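
The per-message logic of such a worker might look roughly like the sketch
below. It is not the actual worker: `handle_crossref_message` is a hypothetical
name, the GROBID call is injected as a `parse_citations` callable rather than
implemented, and the rule "a reference needs GROBID if it has an `unstructured`
string" is an assumption. The Crossref field names (`DOI`, `indexed`,
`reference`) follow the Crossref works JSON format.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def handle_crossref_message(
    record: Dict[str, Any],
    parse_citations: Callable[[List[str]], List[Dict[str, Any]]],
) -> Tuple[Row, Optional[Row]]:
    """Turn one Crossref record into a `crossref` row and, if needed, a `grobid_refs` row."""
    doi = record["DOI"]
    indexed = record["indexed"]["date-time"]
    crossref_row = {"doi": doi, "indexed": indexed, "record": record}

    # Assumption: only references that carry a raw string need GROBID.
    raw_strings = [
        ref["unstructured"]
        for ref in record.get("reference", [])
        if ref.get("unstructured")
    ]
    if not raw_strings:
        return crossref_row, None  # GROBID is not necessary for this record

    refs_row = {
        "source": "crossref",
        "source_id": doi,  # same value as crossref.doi, so the view's join matches
        "source_ts": indexed,
        "refs_json": parse_citations(raw_strings),
    }
    return crossref_row, refs_row
```

Returning both rows from one function keeps the Kafka consumer and the database
writes outside the logic that decides whether GROBID is needed.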
Another tool will take large chunks of Crossref JSON (as lines), filter them,
process them with GROBID, and print JSON to stdout in the `grobid_refs` JSON
schema.
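
A minimal sketch of the read, filter, process, print loop for that tool is
shown below. `run_batch` and its `process` callable are hypothetical stand-ins
(the GROBID step is passed in, not implemented), and the filter shown, skipping
records with no `unstructured` reference strings, is an assumed criterion.

```python
import json
import sys
from typing import Any, Callable, Dict, Iterable, Optional, TextIO


def run_batch(
    lines: Iterable[str],
    process: Callable[[Dict[str, Any]], Optional[Dict[str, Any]]],
    out: Optional[TextIO] = None,
) -> int:
    """Read Crossref records as JSON lines and print one JSON row per kept record."""
    out = out or sys.stdout
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Filter: skip records that have no raw reference strings to parse.
        refs = record.get("reference") or []
        if not any(ref.get("unstructured") for ref in refs):
            continue
        row = process(record)
        if row is None:
            continue
        out.write(json.dumps(row, sort_keys=True) + "\n")
        count += 1
    return count
```

Called as `run_batch(sys.stdin, process)`, this streams line by line, so large
dumps do not need to fit in memory.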