The QA import is running very slowly; this is a parallel attempt in case things
go faster on the fatcat-prod2-vm machine, using batch size 50 and bezerk mode.

NOTE: this ended up being the successful/"final" bootstrap import.
## Service up/down
sudo service fatcat-web stop
sudo service fatcat-api stop
# shut down all the import/export/etc processes
# delete any snapshots and /tmp/fatcat*
sudo rm /srv/fatcat/snapshots/*
sudo rm /tmp/fatcat_*
# git pull
# ansible playbook push
# re-build fatcat-api to ensure that worked
sudo service fatcat-web stop
sudo service fatcat-api stop
# as postgres user:
DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
sudo service postgresql restart
http delete :9200/fatcat_release
http delete :9200/fatcat_container
http delete :9200/fatcat_changelog
http put :9200/fatcat_release < release_schema.json
http put :9200/fatcat_container < container_schema.json
http put :9200/fatcat_changelog < changelog_schema.json
sudo service elasticsearch stop
sudo service kibana stop
sudo service fatcat-api start
# ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json
# if necessary:
# ALTER USER fatcat WITH SUPERUSER;
# ALTER USER fatcat WITH PASSWORD '...';
# create new auth keys via bootstrap (edit debug -> release first)
# update config/env/ansible/etc with new tokens
# delete existing entities
# run the imports!
# after running below imports
sudo service fatcat-web start
sudo service elasticsearch start
sudo service kibana start
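Before kicking off imports, a quick sanity check that the freshly re-created
elasticsearch indexes exist with the new mappings (optional step, not part of
the original checklist; same httpie shorthand as above):

    # optional: confirm the empty indexes and their mappings are in place
    http get :9200/_cat/indices
    http get :9200/fatcat_release/_mapping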
## Import commands
git commit (checked out as webcrawl user): 1fe371288daf417cdf44b94e372b485426b47134
rust version: 1.32.0
export LC_ALL=C.UTF-8
export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json
Counter({'total': 107869, 'insert': 107823, 'skip': 46, 'update': 0, 'exists': 0})
real 6m2.287s
user 2m4.612s
sys 0m5.664s
export FATCAT_AUTH_WORKER_ORCID="..."
time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -
98% 79:1=22s
Counter({'total': 48097, 'insert': 47908, 'skip': 189, 'exists': 0, 'update': 0})
100% 80:0=0s
real 33m9.211s
user 93m33.040s
sys 5m32.176s
export FATCAT_AUTH_WORKER_CROSSREF="..."
time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode
seems to be maintaining 9.1 MiB/sec and estimates 15 hours, with roughly 200 MB/sec of disk writes. We'll see!
100 % 33.2 GiB / 331.9 GiB = 0.100 3.6 MiB/s 26:16:57
Counter({'total': 5001477, 'insert': 4784708, 'skip': 216769, 'update': 0, 'exists': 0})
395971.48user 8101.15system 26:17:07elapsed 427%CPU (0avgtext+0avgdata 431560maxresident)k
232972688inputs+477055792outputs (334645major+39067735minor)pagefaults 0swaps
real 1577m7.908s
user 6681m58.948s
sys 141m25.560s
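For an import this long-running, a crude way to watch progress is to poll row
counts directly (a hypothetical helper, not part of the import tooling; assumes
local psql access as the postgres user):

    # hypothetical progress check: poll the release count once a minute during a long import
    watch -n 60 "sudo -u postgres psql fatcat_prod -c 'SELECT count(*) FROM release_ident;'"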
export FATCAT_AUTH_SANDCRAWLER="..."
export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched - --bezerk-mode
(output accidentally lost, but the run took about 3 hours)
time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -
Counter({'total': 827944, 'insert': 555359, 'exists': 261441, 'update': 11129, 'skip': 15})
32115.82user 1370.12system 4:30:25elapsed 206%CPU (0avgtext+0avgdata 37312maxresident)k
28200inputs+3767112outputs (108major+471069minor)pagefaults 0swaps
real 270m25.288s
user 535m52.908s
sys 22m56.328s
time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa
1.6M 2:02:05 [ 218 /s]
Counter({'total': 133095, 'insert': 120176, 'inserted.release': 120176, 'exists': 12919, 'skip': 0, 'update': 0})
20854.82user 422.09system 2:02:12elapsed 290%CPU (0avgtext+0avgdata 63816maxresident)k
29688inputs+21057912outputs (118major+809972minor)pagefaults 0swaps
real 122m12.533s
user 350m14.824s
sys 7m29.820s
## After Import Stats
bnewbold@wbgrp-svc503$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 1.8T 591G 1.1T 36% /
Postgres database size: 294.82G
select count(*) from changelog => 2,306,900
table_name | table_size | indexes_size | total_size
--------------------------------------------------------------+------------+--------------+------------
"public"."refs_blob" | 70 GB | 1896 MB | 72 GB
"public"."release_rev" | 36 GB | 32 GB | 68 GB
"public"."release_contrib" | 25 GB | 23 GB | 48 GB
"public"."release_edit" | 9342 MB | 10 GB | 19 GB
"public"."work_edit" | 9342 MB | 10 GB | 19 GB
"public"."release_ident" | 6334 MB | 10235 MB | 16 GB
"public"."work_ident" | 6333 MB | 10235 MB | 16 GB
"public"."file_rev_url" | 6085 MB | 2251 MB | 8337 MB
"public"."work_rev" | 4092 MB | 3795 MB | 7887 MB
"public"."file_rev" | 1706 MB | 2883 MB | 4589 MB
"public"."abstracts" | 4089 MB | 300 MB | 4390 MB
"public"."file_edit" | 1403 MB | 1560 MB | 2963 MB
"public"."file_ident" | 944 MB | 1529 MB | 2473 MB
"public"."file_rev_release" | 889 MB | 1558 MB | 2447 MB
"public"."release_rev_abstract" | 404 MB | 536 MB | 941 MB
"public"."creator_rev" | 371 MB | 457 MB | 827 MB
"public"."creator_edit" | 377 MB | 420 MB | 797 MB
"public"."editgroup" | 480 MB | 285 MB | 766 MB
"public"."creator_ident" | 255 MB | 412 MB | 667 MB
"public"."changelog" | 135 MB | 139 MB | 274 MB
"public"."container_rev" | 31 MB | 11 MB | 42 MB
"public"."container_edit" | 10 MB | 12 MB | 22 MB
"public"."container_ident" | 7216 kB | 12 MB | 19 MB
relname | too_much_seq | case | rel_size | seq_scan | idx_scan
----------------------+--------------+------+-------------+----------+-----------
creator_edit | -94655 | OK | 395558912 | 2 | 94657
container_edit | -94655 | OK | 10911744 | 2 | 94657
file_edit | -94655 | OK | 1470627840 | 2 | 94657
work_edit | -94655 | OK | 9793445888 | 2 | 94657
release_edit | -94655 | OK | 9793241088 | 2 | 94657
container_rev | -1168077 | OK | 32546816 | 3 | 1168080
file_rev_release | -3405015 | OK | 931627008 | 2 | 3405017
file_rev_url | -3405015 | OK | 6379298816 | 2 | 3405017
changelog | -3883131 | OK | 141934592 | 382 | 3883513
abstracts | -8367919 | OK | 4011868160 | 1 | 8367920
creator_ident | -9066121 | OK | 267124736 | 5 | 9066126
creator_rev | -14129509 | OK | 388431872 | 3 | 14129512
release_contrib | -17121962 | OK | 26559053824 | 3 | 17121965
release_rev_abstract | -17123930 | OK | 423878656 | 3 | 17123933
file_ident | -18428366 | OK | 989888512 | 5 | 18428371
refs_blob | -50251199 | OK | 15969484800 | 1 | 50251200
container_ident | -74332007 | OK | 7364608 | 5 | 74332012
file_rev | -99555196 | OK | 1788166144 | 4 | 99555200
release_ident | -132347345 | OK | 6639624192 | 5 | 132347350
work_rev | -193625747 | OK | 4289314816 | 1 | 193625748
work_ident | -196604815 | OK | 6639476736 | 5 | 196604820
editgroup | -214491911 | OK | 503414784 | 3 | 214491914
release_rev | -482813156 | OK | 38609838080 | 11 | 482813167
(23 rows)
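The seq-scan vs. index-scan table above looks like the commonly circulated
"missing index" check against pg_stat_all_tables; a reconstructed sketch of
that query (run via psql), flagging any table where sequential scans outnumber
index scans:

    -- sketch: negative too_much_seq means index scans dominate, i.e. "OK"
    SELECT relname,
           seq_scan - idx_scan AS too_much_seq,
           CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END,
           pg_relation_size(relname::regclass) AS rel_size,
           seq_scan,
           idx_scan
    FROM pg_stat_all_tables
    WHERE schemaname = 'public'
      AND pg_relation_size(relname::regclass) > 80000
    ORDER BY too_much_seq DESC;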
## Dump Stats / Process
postgres@wbgrp-svc503:/srv/fatcat/src/extra/sql_dumps$ DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /tmp
Will move output to '/tmp'
Running SQL (from 'fatcat_prod')...
BEGIN
COPY 1
COPY 3906704 -> creators
COPY 107826 -> containers
COPY 14378465 -> files
COPY 3 -> filesets
COPY 3 -> webcaptures
COPY 96812903 -> releases
COPY 96812903 -> works
COPY 2306900 -> changelog
ROLLBACK
Done: /tmp/fatcat_idents.2019-02-01.214959.r2306900.tar.gz
fatcat-export:
x files
x containers
- releases_extended (TODO: estimate time to dump based on file timestamps)
cat /tmp/fatcat_ident_releases.tsv | ./target/release/fatcat-export release --expand files,filesets,webcaptures,container -j8 | pv -l | gzip > /srv/fatcat/snapshots/release_export_expanded.json.gz
96.8M 7:37:51 [3.52k/s]
-rw-rw-r-- 1 webcrawl webcrawl 64G Feb 2 05:45 release_export_expanded.json.gz
sql dumps:
time sudo -u postgres pg_dump --verbose --format=tar fatcat_prod | pigz > /srv/fatcat/snapshots/fatcat_private_dbdump_${DATESLUG}.tar.gz
real 112m34.310s
user 296m46.112s
sys 22m35.004s
-rw-rw-r-- 1 bnewbold bnewbold 81G Feb 2 04:15 fatcat_private_dbdump_2019-02-02.022209.tar.gz
Looking for repeated SHA-1 and DOI:
zcat file_hashes.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_sha1.tsv
=> none
zcat release_extid.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_doi.tsv
=> a few million repeated *blank* lines... could filter out?
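If the blanks really are empty extid fields (an assumption, not verified here),
filtering them out before the duplicate check is a one-word change:

    # sketch: drop empty DOI values before looking for duplicates
    zcat release_extid.tsv.gz | cut -f 3 | grep -v '^$' | sort -S 8G | uniq -cd | sort -n > repeated_doi.tsv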
## Load Stats / Progress
export LC_ALL=C.UTF-8
time zcat /srv/fatcat/snapshots/release_export_expanded.json.gz | pv -l | ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release
time zcat /srv/fatcat/snapshots/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container
time zcat /srv/fatcat/snapshots/2019-01-30/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container
real 0m58.528s
user 1m0.396s
sys 0m2.412s
# very python-CPU-limited, so crank that up to -j20
# hadn't used '--linebuffer' with parallel before, but without it parallel holds
# on to all the output lines before passing them to the next program in the pipe
time zcat /srv/fatcat/snapshots/2019-01-30/release_export_expanded.json.gz | pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release
165k 0:00:10 [18.4k/s]
2019/02/02 09:30:49 96812900 docs in 2h27m32.835681602s at 10935.807 docs/s with 8 workers
2019/02/02 09:30:49 applied setting: {"index": {"refresh_interval": "1s"}} with status 200 OK
2019/02/02 09:30:49 applied setting: {"index": {"number_of_replicas": "1"}} with status 200 OK
2019/02/02 09:31:03 index flushed: 200 OK
real 147m46.387s
user 2621m40.420s
sys 56m11.456s
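A quick way to confirm the bulk load landed (not in the original notes; uses
the elasticsearch count API via the same httpie shorthand) is to compare the
index document count against the esbulk total above:

    # optional: doc count should be close to the 96,812,900 reported by esbulk
    http get :9200/fatcat_release/_count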
sudo su postgres
dropdb fatcat_prod
#zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz | pg_restore --clean --if-exists --create --exit-on-error -d fatcat_prod
createdb fatcat_prod
time zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz | pg_restore --exit-on-error --clean --if-exists --dbname fatcat_prod
seems to go pretty fast, so multiple jobs probably not needed
real 284m40.448s
user 58m45.240s
sys 7m33.600s
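After the restore, a couple of row counts can be spot-checked against the ident
table snapshot numbers above (a sketch; assumes the postgres shell from above):

    # sanity check: counts should match the ident table snapshot
    psql fatcat_prod -c 'SELECT count(*) FROM changelog;'       # expect ~2,306,900
    psql fatcat_prod -c 'SELECT count(*) FROM release_ident;'   # expect ~96,812,903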
DONE: delete old elastic index
## Bugs/Issues encountered
x in_ia_sim is broken; not passing through
x elastic port (9200) was not open to cluster
=> but should close; should be over HTTP
x elasticsearch host was wrong; should be (and now is) search.fatcat.wiki
x postgres config wasn't actually getting installed in the right place by
ansible (!!!), which probably had crazy effects on performance, etc
x postgres version confusion was because both versions (server and client) can
be installed in parallel, and the older version "wins"; wiping the VM would solve this
x should try pigz for things like ident_table_snapshot and the exports? these seem to be gzip-limited (see the pigz sketch after this list)
- fatcat-export and pg_dump seem to lock each other out (transaction-wise), which is
unexpected. fatcat-export should have a very loose (low-priority) transaction
scope, because it already has the full set of release_rev ids, and pg_dump should
also run in background/non-linear mode (except for "public" dumps?)
=> this was somewhat subtle; they didn't completely lock each other
- this machine is postgres 10, not postgres 11. same with fatcat-prod1-vm.
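As a concrete example of the pigz idea above (untested here; this is the same
release export command as in the dump section, with pigz swapped in for gzip):

    # sketch: release export with pigz (parallel gzip) instead of single-threaded gzip
    cat /tmp/fatcat_ident_releases.tsv | ./target/release/fatcat-export release --expand files,filesets,webcaptures,container -j8 | pv -l | pigz > /srv/fatcat/snapshots/release_export_expanded.json.gz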
Added to TODO:
- want a better "write lock" flag (on the database) than just clearing the auth keys
- KBART CLOCKSS reports (and maybe LOCKSS?) have repeated lines that need to be merged
- empty AUTH_ALT_KEYS should just be ignored (not parsed)
## Metadata Quality Notes
- crossref references look great!
- extra/crossref/alternative-id often includes exact full DOI
10.1158/1538-7445.AM10-3529
10.1158/1538-7445.am10-3529
=> but not always? publisher-specific
- contribs[]/extra/seq often has "first" from crossref
=> is this helpful?
- abstracts content is fine, but should probably check for "jats:" when setting
mimetype
x BUG: `license_slug` not being set when the license URL is https://creativecommons.org/licenses/by-nc-sa/4.0
=> https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
10.26891/jik.v10i2.2016.92-97
- original title works, yay!
https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
10.2504/kds.26.358
- new license: https://www.karger.com/Services/SiteLicenses
- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
"9780080373027"
could at least put it in alternative-id?
- BUG: subtitle coming through as an array, not string
- `license_slug` does get set
eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- BUG (?): file missing size:
https://fatcat.wiki/file/wpvkiqx2w5celc3ajyfsh3cfsa
- webface BUG: file-to-release links missing
- webface meh: still need to collapse links by domain better, and also treat www.X and bare X as the same domain
I think this is good (enough)!
Possible other KBART sources: HathiTrust, PKP preservation network (open, OJS), Scholars Portal (?), British Library
Nature magazine's KBART report comes back empty (?):
ISSN-L: 0028-0836
https://fatcat.wiki/container/drfdii35rzaibj3aml5uhvr5xm
Missing DOIs (out of scope?):
DOI not found: 10.1023/a:1009888907797
DOI not found: 10.1186/1471-2148-4-49
DOI not found: 10.1023/a:1026471016927
DOI not found: 10.1090/s0002-9939-04-07569-0
DOI not found: 10.1186/1742-4682-1-11
DOI not found: 10.1186/1477-3163-2-5
DOI not found: 10.1186/gb-2003-4-4-210
DOI not found: 10.1186/gb-2004-5-9-r63
DOI not found: 10.13188/2330-2178.1000008
DOI not found: 10.4135/9781473960749
DOI not found: 10.1252/kakoronbunshu1953.36.479
DOI not found: 10.2320/materia.42.461
DOI not found: 10.1186/1742-4933-3-3
DOI not found: 10.14257/ijsh
DOI not found: 10.1023/a:1016008714781
DOI not found: 10.1023/a:1016648722322
DOI not found: 10.1787/5k990rjhvtlv-en
DOI not found: 10.4064/fm
DOI not found: 10.1090/s0002-9947-98-01992-8
DOI not found: 10.1186/1475-925x-2-16
DOI not found: 10.1186/1479-5868-3-9
DOI not found: 10.1090/s0002-9939-03-07205-8
DOI not found: 10.1023/a:1008111923880
DOI not found: 10.1090/s0002-9939-98-04322-6
DOI not found: 10.1186/gb-2005-6-11-r93
DOI not found: 10.5632/jila1925.2.236
DOI not found: 10.1023/a:1011359428672
DOI not found: 10.1090/s0002-9947-97-01844-8
DOI not found: 10.1155/4817
DOI not found: 10.1186/1472-6807-1-5
DOI not found: 10.1002/(issn)1542-0981
DOI not found: 10.1186/rr115