author | Bryan Newbold <bnewbold@archive.org> | 2019-08-10 19:49:45 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-08-10 19:49:45 -0700 |
commit | ff60cb2411082b2e5ea4e09875006824632b81a2 (patch) | |
tree | 1ff14278e6b78e6032f8d2a3bc45f5cf1a92ba32 | |
parent | 6ac61cd09e8089c59b79edd303f855f74ce58d6c (diff) | |
postgrest backfill updates
-rw-r--r-- | postgrest/backfill/backfill.md | 20 |
1 files changed, 19 insertions, 1 deletions
diff --git a/postgrest/backfill/backfill.md b/postgrest/backfill/backfill.md
index f97c901..f1a5f86 100644
--- a/postgrest/backfill/backfill.md
+++ b/postgrest/backfill/backfill.md
@@ -65,6 +65,7 @@ NOTE: these largely didn't work; will need to write a batch importer.
 Batch import process:
 
     cat UNPAYWALL-PDF-CRAWL-2018-07.filtered.cdx MSAG-PDF-CRAWL-2017.cdx TARGETED-PDF-CRAWL-2017.cdx UNPAYWALL-PDF-CRAWL-2019-04.pdfs_sorted.cdx | ./backfill_cdx.py
+    # Done: Counter({'raw_lines': 123254127, 'total': 51365599, 'batches': 51365})
 
 ## `fatcat_file`
 
@@ -102,7 +103,24 @@ Quick test:
 
 Run big batch:
 
-    ls /bigger/unpaywall-transfer/2019-07-17-1741.30-dumpgrobidxml/part*gz |arallel --progress -j8 'zcat {} | cut -f2 | ./backfill_grobid.py'
+    ls /bigger/unpaywall-transfer/2019-07-17-1741.30-dumpgrobidxml/part*gz | parallel --progress -j8 'zcat {} | cut -f2 | ./backfill_grobid.py'
+    # [...]
+    # Done: Counter({'minio-success': 161605, 'total': 161605, 'raw_lines': 161605, 'batches': 161})
+    # [...]
+
+Was running slow with lots of iowait and 99% jdb2. This seems to be disk I/O. Going to try:
+
+    sudo mount /dev/sdc1 /sandcrawler-minio/ -o data=writeback,noatime,nobarrier
+
+    # -j8: 20+ M/s write, little jdb2
+    # -j16: 30+ M/s write, little jdb2
+    # -j12: 30+ M/s write, going with this
+
+For general use should go back to:
+
+    sudo mount /dev/sdc1 /sandcrawler-minio/ -o data=noatime
+
+    # -j4: Still pretty slow, only ~3-5 M/s disk write. jbd2 consistently at 99%, 360 K/s write
 
 ## rough table sizes
 
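
The `Done: Counter(...)` lines recorded in the diff are consistent with input being grouped into batches of about 1000 lines, with only full batches tallied (161605 lines gives 161 batches; 51365599 gives 51365). The following is a minimal sketch of that pattern, not the actual `backfill_cdx.py` or `backfill_grobid.py` code: the function name `run_batches`, the `push` callback, the default `batch_size` of 1000, and the choice to flush the final partial batch without counting it are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List


def run_batches(
    lines: Iterable[str],
    push: Callable[[List[str]], None],
    batch_size: int = 1000,  # assumed; inferred from total/batches in the logs
) -> Counter:
    """Group input lines into fixed-size batches and tally progress."""
    counts: Counter = Counter()
    batch: List[str] = []
    for line in lines:
        counts['raw_lines'] += 1
        line = line.strip()
        if not line:
            continue  # blank lines count as raw input but are not imported
        batch.append(line)
        counts['total'] += 1
        if len(batch) >= batch_size:
            push(batch)  # hypothetical sink, e.g. a bulk database insert
            counts['batches'] += 1
            batch = []
    if batch:
        # Assumption: the trailing partial batch is still pushed,
        # but only full batches increment the 'batches' counter.
        push(batch)
    return counts
```

Under these assumptions, feeding 161605 non-blank lines through `run_batches` with the default batch size yields `'batches': 161`, which matches the count logged for the GROBID backfill run. In a real script `lines` would typically be `sys.stdin`.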