aboutsummaryrefslogtreecommitdiffstats
path: root/proposals/2021-10-28_grobid_refs.md
blob: 3a17af3670cc86a8138db4aa96d3c9ee41bd7dad (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63

GROBID References in Sandcrawler DB
===================================

Want to start processing "unstructured" raw references coming from upstream
metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
XML), and save the results in sandcrawler DB. From there, they will get pulled
in to fatcat-scholar "intermediate bundles" and included in reference exports.

The initial use case for this is to parse "unstructured" references deposited
in Crossref, and include them in refcat.


## Schema and Semantics

Follows that of `grobid_tei_xml` version 0.1.

Not all references are necessarily included for GROBID processing. They should
identified and mapped using the entire unstructured string.

When present, `key` or `id` is woven back in to the ref objects (GROBID
`processCitationList` doesn't ever see the keys). `index`, returned by
`grobid_tei_xml`, may not be accurate (because not all references were passed),
and may be removed (TBD).


## New SQL Table and View

    CREATE TABLE IF NOT EXISTS grobid_refs (
        source              TEXT NOT NULL CHECK (octet_length(source) >= 1),
        source_id           TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
        source_ts           TIMESTAMP WITH TIME ZONE,
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        refs_json           JSONB NOT NULL,
        PRIMARY KEY(source, source_id)
    );

    CREATE OR REPLACE VIEW crossref_with_refs
        doi, indexed, record, source_ts, refs_json AS
        SELECT
            crossref.doi as doi,
            crossref.indexed as indexed,
            crossref.record as record,
            grobid_refs.source_ts as source_ts,
            grobid_refs.refs_json as refs_json
        FROM crossref
        LEFT JOIN grobid_refs ON
            grobid_refs.source_id = crossref.doi
            AND grobid_refs.source = 'crossref';

Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.


## New Workers / Tools

For simplicity, to start, a single worker with consume from
`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
insert to both `crossref` and `grobid_refs` tables. This worker will run
locally on the machine with sandcrawler-db.

Another tool will support taking large chunks of Crossref JSON (as lines),
filter them, process with GROBID, and print JSON to stdout, in the
`grobid_refs` JSON schema.