In docker:

    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.grobid-output-pg | pv -l | rg 'OA-JOURNAL-CRAWL-2019-08' > OA-JOURNAL-CRAWL-2019-08.grobid.json
    # 5.01M 0:31:04 [2.69k/s]
    # the full dump, grobid-output.prod.json, came to 277 GByte
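
The container setup was roughly the following (image tag, mount path, and
package install here are assumptions; the exact invocation wasn't recorded):

    docker run -it --rm -v /grande/oa-crawl-grobid:/data ubuntu:19.04 bash
    # inside the container:
    apt-get update && apt-get install -y kafkacat pv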

Then:

    cat grobid-output.prod.json | rg 'OA-JOURNAL-CRAWL-2019-08' | pv -l > OA-JOURNAL-CRAWL-2019-08.grobid.json
    # 265k 0:32:12 [ 137 /s]

    pigz grobid-output.prod.json
    # 63 GByte grobid-output.prod.json.gz

    cat OA-JOURNAL-CRAWL-2019-08.grobid.json | pv -l | jq "[.key, .status, .status_code, .error_msg] | @tsv" -r | sort -u -S 4G | uniq --check-chars 40 > OA-JOURNAL-CRAWL-2019-08.grobid.tsv
    # 265k
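    # (uniq --check-chars 40 dedups on the first TSV field, which should be the 40-char sha1 key)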

    wc -l OA-JOURNAL-CRAWL-2019-08.grobid.tsv
    # 212879 OA-JOURNAL-CRAWL-2019-08.grobid.tsv

    cut -f2 OA-JOURNAL-CRAWL-2019-08.grobid.tsv | sort | uniq -c
    #  14087 error
    # 198792 success

In sandcrawler pipenv:

    head -n100 /grande/oa-crawl-grobid/OA-JOURNAL-CRAWL-2019-08.grobid.json | ./grobid_tool.py transform --metadata-only - > /grande/oa-crawl-grobid/OA-JOURNAL-CRAWL-2019-08.metadata.json.sample
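    # quick sample to eyeball the --metadata-only transform output before the full run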

    cat /grande/oa-crawl-grobid/OA-JOURNAL-CRAWL-2019-08.grobid.json | parallel --linebuffer --round-robin --pipe -j8 ./grobid_tool.py transform --metadata-only - > /grande/oa-crawl-grobid/OA-JOURNAL-CRAWL-2019-08.metadata.json
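    # (--pipe --round-robin spreads input lines across the 8 grobid_tool processes)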

    cat OA-JOURNAL-CRAWL-2019-08.metadata.json | rg -v '"fatcat_release": null' > OA-JOURNAL-CRAWL-2019-08.metadata.matched.json
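    # keep only records where the transform found a matching fatcat release (fatcat_release not null)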

    wc -l OA-JOURNAL-CRAWL-2019-08.metadata.matched.json OA-JOURNAL-CRAWL-2019-08.grobid.tsv
    #  28162 OA-JOURNAL-CRAWL-2019-08.metadata.matched.json
    # 212879 OA-JOURNAL-CRAWL-2019-08.grobid.tsv

Next steps:
- import the matched files (while verifying the match; see the spot-check sketch after this list)
- some web interface to make sandcrawler easier?
    input: sha1 or url
    view: grobid status and metadata, ML results, fatcat metadata (via API lookup)
    links/actions: view PDF, re-run GROBID, add to a release (via API)
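
A rough way to spot-check matches before importing (the curl below hits the
standard fatcat API release lookup; RELEASE_IDENT is a placeholder for one of
the idents printed by jq):

    head -n5 OA-JOURNAL-CRAWL-2019-08.metadata.matched.json | jq '[.fatcat_release, .biblio.title]' -c
    curl -s https://api.fatcat.wiki/v0/release/RELEASE_IDENT | jq .title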

## BAD/BROKEN

All of the following didn't work because old versions of kafkacat only read
partial results. Ended up using docker to run a more recent Ubuntu (sketched
near the top of these notes), sigh.

    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.grobid-output-pg -e | pv -l > grobid-output.prod.json

    cat grobid-output.prod.json | rg '"status": "success"' > grobid-output.prod.success.json

    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.grobid-output-pg -e | pv -l | rg '"status": "success"' > grobid-output.prod.success.json

    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.grobid-output-pg -e | pv -l | rg 'OA-JOURNAL-CRAWL-2019-08' > OA-JOURNAL-CRAWL-2019-08.grobid.json

    head -n200 /grande/oa-crawl-grobid/grobid-output.prod.success.json | ./grobid_tool.py transform --metadata-only - | jq "[.fatcat_release, .biblio.title]" -c | less


    cat OA-JOURNAL-CRAWL-2019-08.grobid.json | parallel --pipe -j8 jq .status -r | sort | uniq -c
         1879 error
        26698 success


For the full grobid-output topic, counts were looking like:

     318561 error
     199607 success