The current sandcrawler-db crossref table was backfilled from a 2021-01
snapshot, and has not been updated since.

Would like to use the existing fatcat Kafka feed to keep the crossref table up
to date, and also backfill in GROBID reference parsing of all `unstructured`
references.

Current plan is:

1. use the kafkacat CLI to dump the crossref Kafka topic, from the beginning of
   2021 up to some recent date
2. use `persist_tool.py`, with a large batch size (200?), to backfill this dump
   into sandcrawler-db. This will update some rows multiple times (if there
   have been updates)
3. dump the full crossref table, as a point-in-time snapshot
4. filter to crossref records that have any `unstructured` references in them
5. use `grobid_tool.py` with `parallel` to batch process references
6. backfill these refs using a simple SQL COPY statement
7. deploy the crossref persist worker, with ref updates enabled, and roll the
   consumer group back to the date of the dump (see the sketch after this list)
8. wait for everything to catch up
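
For step 7, the consumer group rollback can be done with the stock Kafka
tooling; a rough sketch only, since the consumer group name and broker port
here are guesses, not taken from the actual deployment:

    # roll the persist worker's consumer group back to the dump date;
    # 'persist-crossref' is a hypothetical group name, check the real deployment
    kafka-consumer-groups.sh --bootstrap-server wbgrp-svc263.us.archive.org:9092 \
        --group persist-crossref \
        --topic fatcat-prod.api-crossref \
        --reset-offsets --to-datetime 2021-10-29T00:00:00.000 \
        --dry-run
    # re-run with --execute once the dry-run output looks right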


## Commands

Get a timestamp in milliseconds:

    2021-01-01 is:
        1609488000 in unix time (seconds)
        1609488000000 in milliseconds
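
A quick way to compute these with GNU `date` (note: `-u` gives midnight UTC,
which is 1609459200; the value used above corresponds to midnight Pacific):

    # epoch seconds and milliseconds for 2021-01-01 00:00 UTC (GNU date)
    date -u -d '2021-01-01' +%s       # 1609459200
    date -u -d '2021-01-01' +%s%3N    # 1609459200000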

Hrm, the oldest messages actually seem to be from 2021-04-28T19:21:10Z though.
Due to topic retention? Yup, we have a 180 day retention policy on that topic,
probably from when kafka disk space was tight. Oh well!

Updated the retention for this topic to `46656000000` ms (~540 days, ~18
months) using the `kafka-manager` web app.
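
The same change could also be made with the stock Kafka CLI, something like the
following (assumes a reasonably recent broker and the default port 9092; older
versions need `--zookeeper` instead of `--bootstrap-server`):

    # bump retention.ms on the topic to ~540 days
    kafka-configs.sh --bootstrap-server wbgrp-svc263.us.archive.org:9092 \
        --entity-type topics --entity-name fatcat-prod.api-crossref \
        --alter --add-config retention.ms=46656000000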

    # -o s@<timestamp>: start at the first offset with a timestamp >= the given value (milliseconds)
    kafkacat -C -b wbgrp-svc263.us.archive.org -t fatcat-prod.api-crossref -o s@1609488000000 \
        | pv -l \
        | gzip \
        > crossref_feed_start20210428_end20211029.json.gz

This resulted in ~36 million rows, 46GB.

`scp` that around, then run persist on `sandcrawler-db`:

    # in pipenv, as sandcrawler user
    # manually edited to set batch size to 200
    zcat /srv/sandcrawler/tasks/crossref_feed_start20210428_end20211029.json.gz \
        | pv -l \
        | ./persist_tool.py crossref -

With a single thread, the persist process runs at about 1,000 rows/sec, which
works out to about 10 hours for 36 million rows.

At the start of this process, total PostgreSQL database size is 832.21G.
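
(The size check itself isn't shown in this note; something like the following
reports it, assuming the database is named `sandcrawler` as in the other
commands:)

    # report total on-disk size of the sandcrawler database
    psql sandcrawler -c "SELECT pg_size_pretty(pg_database_size(current_database()));"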

Query to dump crossref rows which have any `unstructured` refs, compressing the
output with pigz:

    -- dump_crossref.sql
    COPY (
        SELECT record
        FROM crossref
        WHERE record::jsonb @? '$.reference[*].unstructured'
        -- LIMIT 5
    )
    TO STDOUT
    WITH NULL '';

    # 'sed' required because of double quote escaping in postgresql COPY output:
    # https://stackoverflow.com/questions/29869983/postgres-row-to-json-produces-invalid-json-with-double-escaped-quotes/29871069
    # 'rg' filter is just being conservative

    psql sandcrawler < dump_crossref.sql \
        | sed 's/\\"/\"/g' \
        | rg '^\{' \
        | pv -l \
        | pigz \
        > /srv/sandcrawler/tasks/crossref_sandcrawler_unstructured.json.gz
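
A quick sanity check (not part of the original run) that the `sed` un-escaping
left valid JSON, using `jq`:

    # verify the first few dumped records parse cleanly as JSON
    zcat /srv/sandcrawler/tasks/crossref_sandcrawler_unstructured.json.gz \
        | head -n 100 \
        | jq -e . > /dev/null \
        && echo "looks like valid JSON"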

Parse the `unstructured` references out of the dump with GROBID, fanning out
with GNU `parallel`:

    # NOTE: -j40 is for production run with ~dedicated GROBID server with many cores
    zcat /srv/sandcrawler/tasks/crossref_sandcrawler_unstructured.json.gz \
        | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --grobid-host http://wbgrp-svc096.us.archive.org:8070 parse-crossref-refs - \
        | pv -l \
        | pigz \
        > /srv/sandcrawler/tasks/crossref_sandcrawler_unstructured.grobid_refs.json.gz

    # able to do about 300-500 records/second
    # 23.9k 0:01:14 [ 320 /s]
    # 134518 total refs parsed
    # ~1817 refs/second parsed