Diffstat (limited to 'python/notes')
-rw-r--r-- python/notes/version_3.md 18
-rw-r--r-- python/notes/version_4.md 5
2 files changed, 23 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index f828ee8..f9b6928 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -316,3 +316,21 @@ A subtle bug: a doi in refs ends with tab:
```
10.1002/andp.19975090102\t
```
+
+----
+
+## URL lookup via pig
+
+* failed after a week; map spill
+
+```
+2021-05-21 15:04:25,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
+^C2021-05-24 15:22:57,073 [Thread-6] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
+2021-05-24 15:22:58,245 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
+2021-05-24 15:22:58,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
+2021-05-24 15:23:02,763 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed
+
+real 8276m35.071s
+user 425m6.748s
+sys 52m21.012s
+```
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index 633f47f..ff0b499 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -298,3 +298,8 @@ $ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMAT
"resource/ISSN/0020-7160#KeyTitle"
],
```
+
+We would need:
+
+* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)
+
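The chain noted in version_4.md (rough abbrev name -> full name -> issn -> container id) could be sketched roughly as below. This is only an illustration: the three mappings are plain dicts standing in for jabbrev, issnlister and fatcat, whose actual interfaces are not shown here; `normalize`, `resolve_container` and all sample values are hypothetical placeholders, not real journal records.

```python
from typing import Mapping, Optional


def normalize(name: str) -> str:
    """Lowercase, drop periods, and collapse whitespace so rough abbreviations match."""
    return " ".join(name.replace(".", " ").lower().split())


def resolve_container(
    rough_abbrev: str,
    abbrev_to_name: Mapping[str, str],
    name_to_issn: Mapping[str, str],
    issn_to_container: Mapping[str, str],
) -> Optional[str]:
    """Follow abbrev -> full name -> ISSN -> container id; None if any step misses."""
    full_name = abbrev_to_name.get(normalize(rough_abbrev))
    if full_name is None:
        return None
    issn = name_to_issn.get(normalize(full_name))
    if issn is None:
        return None
    return issn_to_container.get(issn)


# Placeholder data (assumption): stand-ins for the jabbrev, issnlister
# and fatcat lookups, keyed by normalized names.
abbrev_to_name = {normalize("J. Ex. Stud."): "Journal of Example Studies"}
name_to_issn = {normalize("Journal of Example Studies"): "0000-0000"}
issn_to_container = {"0000-0000": "container-placeholder-id"}

print(resolve_container("j ex stud", abbrev_to_name, name_to_issn, issn_to_container))
```

Returning `None` at the first missing hop keeps partial matches from leaking through; a real run would probably also want to record which step failed.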