path: root/proposals/2021-10-28_grobid_refs.md

GROBID References in Sandcrawler DB
===================================

We want to start processing "unstructured" raw references coming from upstream
metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
XML), and save the results in sandcrawler DB. From there, they will get pulled
in to fatcat-scholar "intermediate bundles" and included in reference exports.

The initial use case for this is to parse "unstructured" references deposited
in Crossref, and include them in refcat.


## Schema and Semantics

The output JSON/dict schema for parsed references follows that of
`grobid_tei_xml` version 0.1.x, for the `GrobidBiblio` field. The
`unstructured` field that was parsed is included in the output, though it may
not be byte-for-byte exact (see below). One notable change from the past (eg,
older GROBID-parsed references) is that author `name` is now `full_name`. New
fields include `editors` (same schema as `authors`), `book_title`, and
`series_title`.

The overall output schema matches that of the `grobid_refs` SQL table:

    source: string, lower-case. eg 'crossref'
    source_id: string, eg '10.1145/3366650.3366668'
    source_ts: optional timestamp (full ISO datetime with timezone, eg `Z`
               suffix), which identifies version of upstream metadata
    refs_json: JSONB, list of `GrobidBiblio` JSON objects

References are re-processed on a per-article (or per-release) basis. All the
references for an article are handled as a batch and output as a batch. If
there are no upstream references, a row with `refs_json` as an empty list may
be returned.
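
As a rough illustration of the row shape described above, here is a minimal
Python sketch. The `GrobidRefsRow` class name and the `source_ts` value are
placeholders invented for this example; only the four field names come from the
schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class GrobidRefsRow:
    """Hypothetical container mirroring the `grobid_refs` output schema."""

    source: str                      # lower-case, eg 'crossref'
    source_id: str                   # eg a lower-cased DOI
    source_ts: Optional[str] = None  # ISO datetime with timezone, or None
    # list of `GrobidBiblio`-style dicts; empty when there are no references
    refs_json: List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# An article with no upstream references: `refs_json` stays an empty list.
row = GrobidRefsRow(
    source="crossref",
    source_id="10.1145/3366650.3366668",
    source_ts="2021-10-28T00:00:00Z",  # placeholder timestamp
)
print(row.to_json())
```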

Not all upstream references get re-parsed, even if an `unstructured` field is
available. If `unstructured` is not available, no row is ever output. For
example, if a reference includes `unstructured` (raw citation string), but also
has structured metadata for authors, title, year, and journal name, we might
not re-parse the `unstructured` string. Whether to re-parse is evaluated on a
per-reference basis. This behavior may change over time.
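
One possible per-reference check is sketched below. The `should_reparse`
function and the exact list of "already structured" fields are assumptions for
illustration; the field names follow the Crossref reference JSON, but the
actual criteria may differ and may change.

```python
from typing import Any, Dict

# Crossref reference fields treated as "already structured" in this sketch.
# Which fields actually count is an assumption, not the worker's real rule.
STRUCTURED_FIELDS = ("author", "article-title", "year", "journal-title")


def should_reparse(ref: Dict[str, Any]) -> bool:
    """Decide whether a single upstream reference is worth sending to GROBID."""
    unstructured = ref.get("unstructured")
    if not unstructured or not unstructured.strip():
        # nothing to parse
        return False
    # skip re-parsing when the upstream metadata is already fairly complete
    already_structured = all(ref.get(key) for key in STRUCTURED_FIELDS)
    return not already_structured
```

With this rule, a reference carrying only `key` and `unstructured` would be
re-parsed, while one that also has all four structured fields would be skipped.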

`unstructured` strings may be pre-processed before being submitted to GROBID.
This is because many sources have systemic encoding issues. GROBID itself may
also do some modification of the input citation string before returning it in
the output. This means the `unstructured` string is not a reliable way to map
between specific upstream references and parsed references. Instead, the `id`
field (str) of `GrobidBiblio` gets set to any upstream "key" or "index"
identifier used to track individual references. If there is only a numeric
index, the `id` is that number as a string.

The `key` or `id` may need to be woven back in to the ref objects manually,
because GROBID `processCitationList` takes just a list of raw strings, with no
attached reference-level key or id.
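
A minimal sketch of that weaving step follows. It assumes the parsed list comes
back in the same order and with the same length as the submitted strings, and
that a zero-based position is used as the fallback index; both are assumptions,
and `assign_ref_ids` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def assign_ref_ids(
    upstream_refs: List[Dict[str, Any]],
    parsed_refs: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Copy each upstream "key" (or list position) onto the parsed ref as `id`."""
    if len(upstream_refs) != len(parsed_refs):
        raise ValueError("upstream and parsed reference lists differ in length")
    merged = []
    for index, (raw, parsed) in enumerate(zip(upstream_refs, parsed_refs)):
        out = dict(parsed)  # do not mutate the caller's dict
        key = raw.get("key")
        # fall back to the numeric index, as a string, when there is no key
        out["id"] = str(key) if key not in (None, "") else str(index)
        merged.append(out)
    return merged
```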


## New SQL Table and View

We may want to do re-parsing of references from sources other than `crossref`,
so there is a generic `grobid_refs` table. But it is also common to fetch both
the crossref metadata and any re-parsed references together, so as a convenience
there is a PostgreSQL view (virtual table) that includes both a crossref
metadata record and parsed citations, if available. If downstream code cares a
lot about having the refs and record be in sync, the `source_ts` field on
`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
`.indexed.date-time` JSON field in the record itself).

Remember that DOIs should always be lower-cased before querying, inserting,
comparing, etc.

    CREATE TABLE IF NOT EXISTS grobid_refs (
        source              TEXT NOT NULL CHECK (octet_length(source) >= 1),
        source_id           TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
        source_ts           TIMESTAMP WITH TIME ZONE,
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        refs_json           JSONB NOT NULL,
        PRIMARY KEY(source, source_id)
    );

    CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS
        SELECT
            crossref.doi as doi,
            crossref.indexed as indexed,
            crossref.record as record,
            grobid_refs.source_ts as source_ts,
            grobid_refs.refs_json as refs_json
        FROM crossref
        LEFT JOIN grobid_refs ON
            grobid_refs.source_id = crossref.doi
            AND grobid_refs.source = 'crossref';

Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.
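
For downstream code, the DOI lower-casing rule and the `source_ts` versus
`indexed` comparison could look roughly like the sketch below. The helper names
are hypothetical, and it assumes both timestamps arrive as ISO datetime strings
(for example, as serialized in JSON).

```python
from datetime import datetime
from typing import Optional


def normalize_doi(doi: str) -> str:
    """Lower-case (and trim) a DOI before querying, inserting, or comparing."""
    return doi.strip().lower()


def parse_ts(ts: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def refs_in_sync(source_ts: Optional[str], indexed: Optional[str]) -> bool:
    """True only when parsed refs were derived from this exact record version."""
    if source_ts is None or indexed is None:
        return False
    return parse_ts(source_ts) == parse_ts(indexed)
```

Comparing parsed datetimes rather than raw strings avoids false mismatches when
the same instant is written with `Z` in one place and `+00:00` in another.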


## New Workers / Tools

For simplicity, to start, a single worker will consume from
`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
insert to both `crossref` and `grobid_refs` tables. This worker will run
locally on the machine with sandcrawler-db.

Another tool will support taking large chunks of Crossref JSON (as lines),
filter them, process with GROBID, and print JSON to stdout, in the
`grobid_refs` JSON schema.
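
The overall shape of such a tool might look like the following sketch. It is
not the actual `grobid_tool.py` code: the function names are hypothetical, the
GROBID call is replaced by a stand-in `fake_parse` function, the sample DOIs
are made up, and per-reference filtering and id assignment are left out to keep
it short.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

# Stand-in type for "send a list of raw citation strings to GROBID"
ParseFn = Callable[[List[str]], List[Dict[str, Any]]]


def crossref_to_row(record: Dict[str, Any], parse_citations: ParseFn) -> Optional[Dict[str, Any]]:
    """Turn one Crossref record into a `grobid_refs`-shaped dict."""
    doi = record.get("DOI")
    if not doi:
        return None  # filter out records that cannot be keyed
    raw = [r["unstructured"] for r in record.get("reference", []) if r.get("unstructured")]
    return {
        "source": "crossref",
        "source_id": doi.lower(),
        "source_ts": (record.get("indexed") or {}).get("date-time"),
        "refs_json": parse_citations(raw) if raw else [],
    }


def process_lines(lines: Iterable[str], parse_citations: ParseFn) -> Iterator[str]:
    """Read Crossref JSON lines, yield one output JSON line per kept record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = crossref_to_row(json.loads(line), parse_citations)
        if row is not None:
            yield json.dumps(row, sort_keys=True)


def fake_parse(raw: List[str]) -> List[Dict[str, Any]]:
    # placeholder for the real GROBID request; just echoes the input strings
    return [{"unstructured": s} for s in raw]


# Made-up sample records, standing in for lines read from a Crossref dump
SAMPLE_LINES = [
    json.dumps({
        "DOI": "10.1234/ABC",
        "indexed": {"date-time": "2021-10-28T00:00:00Z"},
        "reference": [{"key": "ref1", "unstructured": "Doe J. Example title. 2001."}],
    }),
    json.dumps({"DOI": "10.1234/def"}),
]

for out in process_lines(SAMPLE_LINES, fake_parse):
    print(out)
```

Passing the parser in as a function keeps the line-handling logic testable
without a running GROBID instance.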


## Task Examples

Command to process crossref records with refs tool:

    cat crossref_sample.json \
        | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \
        | pv -l \
        > crossref_sample.parsed.json

    # => 10.0k 0:00:27 [ 368 /s]

Load directly in to postgres (after tables have been created):

    cat crossref_sample.parsed.json \
        | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');"

    # => COPY 9999