-rw-r--r--  python/notes/version_3.md  18
-rw-r--r--  python/notes/version_4.md   5
2 files changed, 23 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index f828ee8..f9b6928 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -316,3 +316,21 @@ A subtle bug: a doi in refs ends with tab:
 ```
 10.1002/andp.19975090102\t
 ```
+
+----
+
+## URL lookup via pig
+
+* failed after a week; map spill
+
+```
+2021-05-21 15:04:25,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
+^C2021-05-24 15:22:57,073 [Thread-6] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
+2021-05-24 15:22:58,245 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
+2021-05-24 15:22:58,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
+2021-05-24 15:23:02,763 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed
+
+real 8276m35.071s
+user 425m6.748s
+sys 52m21.012s
+```
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index 633f47f..ff0b499 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -298,3 +298,8 @@ $ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMAT
 "resource/ISSN/0020-7160#KeyTitle"
 ],
 ```
+
+We would need:
+
+* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)
+
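
The lookup chain in the version_4.md note (rough abbrev name -> full name -> issn -> container id) could be sketched roughly as below. This is only an illustration under assumptions: the three dictionaries are hypothetical in-memory stand-ins for data derived from jabbrev, issnlister and fatcat, the `normalize` and `resolve_container` helpers are made up for this sketch, and every value is a placeholder rather than real data.

```python
from typing import Optional

# Hypothetical stand-in tables (assumption: each would be built from the
# respective source named in the note). All entries are placeholders.
ABBREV_TO_NAME = {"ex j test": "Example Journal of Testing"}   # jabbrev
NAME_TO_ISSN = {"example journal of testing": "0000-0000"}     # issnlister
ISSN_TO_CONTAINER = {"0000-0000": "container-placeholder-id"}  # fatcat


def normalize(s: str) -> str:
    """Lowercase, turn periods into spaces, collapse runs of whitespace."""
    return " ".join(s.lower().replace(".", " ").split())


def resolve_container(abbrev: str) -> Optional[str]:
    """Follow abbrev -> full name -> ISSN -> container id.

    Returns None as soon as any step of the chain has no match.
    """
    name = ABBREV_TO_NAME.get(normalize(abbrev))
    if name is None:
        return None
    issn = NAME_TO_ISSN.get(normalize(name))
    if issn is None:
        return None
    return ISSN_TO_CONTAINER.get(issn)


if __name__ == "__main__":
    print(resolve_container("Ex. J. Test."))
    print(resolve_container("Unknown Abbrev."))
```

Returning `None` at the first missing step keeps partial matches from being mistaken for resolved containers; a real run would probably also want to record which step failed.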