path: root/notes/tasks/2021-10-29_crossref_refs_backfill.md

The current sandcrawler-db crossref table was backfilled from a 2021-01
snapshot, and has not been updated since.

Would like to use the existing fatcat Kafka feed to keep the crossref table up
to date, and also backfill in GROBID reference parsing of all `unstructured`
references.

Current plan is:

1. use kafkacat CLI to dump crossref Kafka topic, from the beginning of 2021 up
   to some recent date
2. use `persist_tool.py`, with a large batch size (200?) to backfill this dump
   into sandcrawler-db. This will update some rows multiple times (if there
   have been updates)
3. dump the full crossref table, as a point-in-time snapshot
4. filter to crossref records that have `unstructured` references in them (at
   all)
5. use `grobid_tool.py` with `parallel` to batch process references
6. backfill these refs using a simple SQL COPY statement
7. deploy crossref persist worker, with ref updates on, and roll the consumer
   group back to date of dump
8. wait for everything to catch up
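
Step 4 could be sketched roughly as below. This is a minimal sketch, not the actual filter used: it assumes the table dump is one Crossref record per line as JSON, with references in a `reference` list whose entries may carry an `unstructured` key (as in the Crossref API schema). The function names and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def has_unstructured_refs(record: dict) -> bool:
    """True if any reference in the record has a non-empty `unstructured` field."""
    refs = record.get("reference") or []
    return any(isinstance(ref, dict) and ref.get("unstructured") for ref in refs)


def filter_unstructured(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the JSON lines whose record has at least one unstructured ref."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if has_unstructured_refs(json.loads(line)):
            yield line


# Made-up sample records (assumed dump shape: one JSON record per line)
sample = [
    '{"DOI": "10.123/a", "reference": [{"unstructured": "Smith J. (2001) ..."}]}',
    '{"DOI": "10.123/b", "reference": [{"DOI": "10.123/c"}]}',
    '{"DOI": "10.123/d"}',
]
kept = list(filter_unstructured(sample))
print(len(kept))  # 1
```

Only the first sample record is kept; records with purely structured references, or no references at all, are dropped.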


## Commands

Get a timestamp in milliseconds:

    2021-01-01 (midnight US Pacific time, i.e. 08:00 UTC) is:
        1609488000 in unix time (seconds)
        1609488000000 in milliseconds
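
One way to compute such a value is with Python's `datetime`. A small sketch; the helper name `to_epoch_ms` is made up here, and the fixed UTC-8 offset is an assumption about which local timezone produced the value noted above:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Midnight UTC on 2021-01-01
print(to_epoch_ms(datetime(2021, 1, 1, tzinfo=timezone.utc)))  # 1609459200000

# Midnight at a fixed UTC-8 offset matches the value noted above
pacific_std = timezone(timedelta(hours=-8))
print(to_epoch_ms(datetime(2021, 1, 1, tzinfo=pacific_std)))  # 1609488000000
```

The 8 hour difference doesn't matter much for a backfill start point, but it is worth knowing which one is being passed to kafkacat.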

Hrm, oldest messages seem to actually be from 2021-04-28T19:21:10Z though. Due
to topic retention? Yup, we have a 180 day retention policy on that topic,
probably from when kafka space was tight. Oh well!

Updated retention for this topic to `46656000000` (~540 days, ~18 months) using
`kafka-manager` web app.
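
As a quick sanity check of that number (plain arithmetic, nothing Kafka-specific):

```python
from datetime import timedelta

# 540 days expressed in milliseconds
retention = timedelta(days=540)
retention_ms = retention // timedelta(milliseconds=1)
print(retention_ms)  # 46656000000
```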

    kafkacat -C -b wbgrp-svc263.us.archive.org -t fatcat-prod.api-crossref -o s@1609488000000 \
        | pv -l \
        | gzip \
        > crossref_feed_start20210428_end20211029.json.gz

This resulted in ~36 million rows, 46GB.

`scp` that around, then run persist on `sandcrawler-db`:

    # in pipenv, as sandcrawler user
    zcat /srv/sandcrawler/tasks/crossref_feed_start20210428_end20211029.json.gz \
        | pv -l \
        | ./persist_tool.py crossref -
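
The batching mentioned in the plan (batch size around 200) happens inside the persist tool, which isn't reproduced here. Just to illustrate the idea, here is a generic sketch of grouping input lines into fixed-size batches. `batched_lines` is a hypothetical helper, not sandcrawler code:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_lines(lines: Iterable[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Group an iterable of lines into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# 450 fake lines -> batches of 200, 200, 50
sizes = [len(b) for b in batched_lines((f"line {i}" for i in range(450)), 200)]
print(sizes)  # [200, 200, 50]
```

Each batch would then go to the database in a single round trip, which is why a larger batch size helps for a backfill of this size.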