-rw-r--r-- pig/README.md 23
-rw-r--r-- pig/papers_edu_tilde.cdx 15
2 files changed, 30 insertions, 8 deletions
diff --git a/pig/README.md b/pig/README.md
index d13c065..bc892fa 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -1,4 +1,4 @@
-# Notes
+# Pig Notes
In April 2021, we run pig 0.12 and hadoop 2.6.0-cdh5.14.4.
@@ -23,13 +23,13 @@ $ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
A test run with a single file.
```
-$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+$ pig \
+ -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz \
+ -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
+ -p OUTPUT=/user/martin/fatcat-refs-lookup-0 \
+ filter-cdx-join-urls.pig
```
-* http://ia802401.us.archive.org:6988/
-* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
-* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
-
Running against 1/300 block of global CDX took about 15h.
```
@@ -58,11 +58,18 @@ $ jq -rc .status refs_links_liveweb.json | sort | uniq -c | sort -nr 2> /dev/nul
Running against a full index.
```
-$ time pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-1 filter-cdx-join-urls.pig
+$ time pig \
+ -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz \
+ -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
+ -p OUTPUT=/user/martin/fatcat-refs-lookup-1 \
+ filter-cdx-join-urls.pig
```
-* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_300026/
+* application id was: `application_1611217683160_300026`
+The full lookup led to "map spill", since we needed to extract surts from the
+full CDX index. Not taking advantage of zipnum and other possible improvements.
+Killed the jobs; required hdfs cleanup.
----
diff --git a/pig/papers_edu_tilde.cdx b/pig/papers_edu_tilde.cdx
new file mode 100644
index 0000000..f43a11a
--- /dev/null
+++ b/pig/papers_edu_tilde.cdx
@@ -0,0 +1,15 @@
+#http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf
+#http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf
+#http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf
+#http://www.comp.hkbu.edu.hk/~ymc/papers/conference/ijcnn03_710.pdf
+
+# should be 6 matches:
+hk,edu,hkbu,comp)/~ymc/papers/conference/ijcnn03_710.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 LQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+edu,stanford,www)/~johntayl/Papers/taylor2.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 XQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+edu,nps,met)/~mtmontgo/papers/isabel_part2.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 PQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 9QHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+jp,ac,pitt,www)/~druzdzel/psfiles/ecai06.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 8QHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+co,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 7QHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+
+# NOT:
+com,corp,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf 20170706005950 http://mit.edu/file.pdf application/pdf 200 6QHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
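
Reviewer note: the first field of each fixture row is a SURT-style key (host labels reversed and comma-joined, then `)` and the path). A minimal Python sketch of how the commented URLs at the top of the fixture relate to those keys follows. The function name `simple_surt_key` is hypothetical and not part of this repository, and the logic is a simplification: real SURT canonicalization typically applies further normalization that is not attempted here. Note also that the `hkbu.edu.hk` key in the fixture omits the `www` label, which this sketch would keep.

```python
from urllib.parse import urlsplit


def simple_surt_key(url: str) -> str:
    """Build a simplified SURT-style key (assumption: no further canonicalization).

    Host labels are lowercased and reversed, the port is dropped, and the
    path (plus query string, if any) is appended unchanged after ')'.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # .hostname excludes the port
    reversed_labels = ",".join(reversed(host.split(".")))
    key = reversed_labels + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    # URLs taken from the comment lines of papers_edu_tilde.cdx
    for url in (
        "http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf",
        "http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf",
        "http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf",
    ):
        print(simple_surt_key(url))
```

For these three URLs the printed keys coincide with the first field of the corresponding fixture rows, which makes it easier to see which rows are the intended direct matches and which (`jp,ac,...`, `co,edu,...`, `com,corp,edu,...`) only exercise the filter pattern.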