From 5b175031e7c431828edfafaef1a3989171c32630 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Mon, 24 May 2021 17:45:42 +0200
Subject: update notes

---
 python/notes/version_3.md | 18 ++++++++++++++++++
 python/notes/version_4.md |  5 +++++
 2 files changed, 23 insertions(+)

diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index f828ee8..f9b6928 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -316,3 +316,21 @@ A subtle bug: a doi in refs ends with tab:
 ```
 10.1002/andp.19975090102\t
 ```
+
+----
+
+## URL lookup via pig
+
+* failed after a week; map spill
+
+```
+2021-05-21 15:04:25,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
+^C2021-05-24 15:22:57,073 [Thread-6] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
+2021-05-24 15:22:58,245 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
+2021-05-24 15:22:58,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
+2021-05-24 15:23:02,763 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed
+
+real 8276m35.071s
+user 425m6.748s
+sys 52m21.012s
+```
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index 633f47f..ff0b499 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -298,3 +298,8 @@ $ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMAT
 "resource/ISSN/0020-7160#KeyTitle"
 ],
 ```
+
+We would need:
+
+* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)
+
-- 
cgit v1.2.3