1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
|
This is the first attempt at a clean final production import. Running in QA; if
all goes well would dump and import in prod.
Made a number of changes since yesterday's import, so won't be surprised if run
in to problems. Plan is to make any fixes and push through to the end to turn
up any additional issues/bugs, then iterate yet again if needed.
NOTE: this import ended up being abandoned (too slow) in lieu of 2019-01-30.
## Service up/down
sudo service fatcat-web stop
sudo service fatcat-api stop
# shutdown all the import/export/etc
# delete any snapshots and /tmp/fatcat*
sudo rm /srv/fatcat/snapshots/*
sudo rm /tmp/fatcat_*
# git pull
# ansible playbook push
# re-build fatcat-api to ensure that worked
sudo service fatcat-web stop
sudo service fatcat-api stop
# as postgres user:
DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
sudo service postgresql restart
http delete :9200/fatcat_release
http delete :9200/fatcat_container
http delete :9200/fatcat_changelog
http put :9200/fatcat_release < release_schema.json
http put :9200/fatcat_container < container_schema.json
http put :9200/fatcat_changelog < changelog_schema.json
sudo service elasticsearch stop
sudo service kibana stop
sudo service fatcat-api start
# ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json
# create new auth keys via bootstrap (edit debug -> release first)
# update config/env/ansible/etc with new tokens
# delete existing entities
# run the imports!
# after running below imports
sudo service fatcat-web start
sudo service elasticsearch start
sudo service kibana start
## Import commands
rust version (as webcrawl): 1.32.0
git commit: 586458cacabd1d2f4feb0d0f1a9558f229f48f5e
export LC_ALL=C.UTF-8
export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json
hit a bug (see below) but safe to continue
Counter({'total': 107869, 'insert': 102623, 'exists': 5200, 'skip': 46, 'update': 0})
real 4m43.635s
user 1m55.904s
sys 0m5.376s
export FATCAT_AUTH_WORKER_ORCID="..."
time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -
hit another bug (see below), again safe to continue
Counter({'total': 48888, 'insert': 48727, 'skip': 161, 'exists': 0, 'update': 0}) (etc)
real 29m56.773s
user 89m2.532s
sys 5m11.104s
export FATCAT_AUTH_WORKER_CROSSREF="..."
time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode
running very slow; maybe batch size of 100 was too large? pushing 80+
MB/sec, but very little CPU utilization. some ISSN lookups taking up to
a second each (!). no vacuum in progress. at xzcat, only '2.1 MiB/s'
at current rate will take more than 48 hours. hrm.
after 3.5 hours or so, cancelled and restarted in non-bezerk mode, with batch size of 50.
xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py --batch-size 50 crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
at xzcat, about 5 Mib/s. just after citation efficiency, full import was around 20 hours.
if slows down again, may be due to some threads failing and not
dumping. if that's the case, should try with 'head -n200000' or so to
catch output errors.
ps aux | rg fatcat_import.py | rg -v perl | wc -l => 22
at 7.2% (beyond earlier progress), and now inserting (not just
lookups), pushing 5.6 MiB/sec, 17 hours (estimated) to go, seems to be
running fine.
at 12 hours in, at 20% and down to 1.9 MiB/sec again. Lots of disk I/O
(80 MB/sec write), seems to be bottleneck, not sure why.
would take... about an hour to restart, might save 20+ hours, might waste 14?
Counter({'total': 5005785, 'insert': 4319312, 'exists': 457819, 'skip': 228654, 'update': 0})
531544.60user 13597.32system 60:38:43elapsed 249%CPU (0avgtext+0avgdata 448748maxresident)k
124037840inputs+395235552outputs (140major+41973732minor)pagefaults 0swaps
real 3638m43.712s => 60 hours (!!!)
user 8944m37.944s
sys 232m25.200s
export FATCAT_AUTH_SANDCRAWLER="..."
export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched --bezerk-mode -
time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -
time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa
## Bugs encountered
x broke a constraint or made an otherwise invalid request: name is required for all Container entities
=> wasn't bezerk mode, so should be fine to continue
x {"success":false,"error":"BadRequest","message":"broke a constraint or made an otherwise invalid request: display_name is required for all Creator entities"}
=> wasn't bezerk mode, so should be fine to continue
|