status: work-in-progress

Crossref DOI Metadata in Sandcrawler DB
=======================================

The proposal is to have a local copy of Crossref API metadata records in the
sandcrawler DB, accessible by simple key lookup via postgrest.

The initial goal is to include these records in scholar work "bundles" (along
with fulltext, etc), in particular as part of the reference extraction
pipeline. Around late 2020, many additional references became available via
Crossref records, but these have not been imported (or updated) into fatcat.
Reference storage in the fatcat API is a scaling problem we would like to put
off, so injecting content this way is desirable.

To start, we are working with a bulk dump made available by Crossref. In the
future, we might persist the daily feed so that we have a continuously
up-to-date copy.

Another application of Crossref-in-bundles is to identify the overall scale of
changes since the initial Crossref metadata import.

## Sandcrawler DB Schema

The "indexed" field in this case refers to the upstream timestamp, not the
sandcrawler database update time.

    CREATE TABLE IF NOT EXISTS crossref (
        doi     TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)),
        indexed TIMESTAMP WITH TIME ZONE NOT NULL,
        record  JSON NOT NULL,
        PRIMARY KEY(doi)
    );
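
Once loaded, individual records can be fetched with a simple primary key
lookup, for example (hypothetical DOI; note that DOIs are stored lower-cased,
per the CHECK constraint):

    SELECT record FROM crossref WHERE doi = '10.1234/example-doi';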

For postgrest access, we may also need to grant read access:

    GRANT SELECT ON public.crossref TO web_anon;
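
A key lookup via postgrest would then look something like the following sketch
(hostname and port are placeholders; `doi=eq.` is standard PostgREST filter
syntax, and the response is a JSON array of matching rows):

    curl -s "http://localhost:3000/crossref?doi=eq.10.1234/example-doi"
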
## SQL Backfill Command

For an example file:

    cat sample.json \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

For a full snapshot:

    zcat crossref_public_data_file_2021_01.json.gz \
        | pv -l \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

jq is the bottleneck (100% of a single CPU core).
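
If backfill throughput becomes a problem, one possible (untested) mitigation is
to fan the jq step out across cores with GNU parallel; row order does not
matter for the COPY:

    zcat crossref_public_data_file_2021_01.json.gz \
        | pv -l \
        | parallel --pipe --block 20M -j8 -q jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"
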
## Kafka Worker
Pulls from the fatcat crossref ingest Kafka feed and persists into the crossref
table.
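
A minimal sketch of such a worker, assuming the confluent_kafka and psycopg2
libraries; the bootstrap server, topic name, and consumer group below are
placeholders, and the actual sandcrawler worker may batch and commit
differently:

    import json

    import psycopg2
    from confluent_kafka import Consumer

    # Placeholder connection/topic settings; production values differ.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "persist-crossref",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["fatcat-prod.api-crossref"])

    db = psycopg2.connect("dbname=sandcrawler")

    # Upsert so that replaying the feed (or a re-crawled DOI) simply
    # overwrites the existing row.
    UPSERT_SQL = """
        INSERT INTO crossref (doi, indexed, record)
        VALUES (%s, %s, %s)
        ON CONFLICT (doi) DO UPDATE
        SET indexed = EXCLUDED.indexed, record = EXCLUDED.record
    """

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        doi = record["DOI"].lower()
        indexed = record["indexed"]["date-time"]
        with db.cursor() as cur:
            cur.execute(UPSERT_SQL, (doi, indexed, json.dumps(record)))
        db.commit()

Because the upsert is idempotent per DOI, the worker can be restarted or the
feed replayed without special handling.
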
## SQL Table Disk Utilization
An example backfill from early 2021, with about 120 million Crossref DOI
records.

Starting database size (with ingest running):

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  896G  818G  53% /1

    Size: 475.14G

The ingest SQL command took:

    120M 15:06:08 [2.22k/s]
    COPY 120684688

Database size after the backfill:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  1.2T  498G  71% /1

    Size: 794.88G

So the backfill uses about 320 GBytes of disk, on the order of 2.5 to 3 KBytes
per record.
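
For future measurements, the growth could also be attributed to the crossref
table directly (rather than by diffing whole-database size); a possible query:

    SELECT pg_size_pretty(pg_total_relation_size('crossref'));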