author Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:02 -0700
committer Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:02 -0700
commit 1ae7fd2f0c5661560b15be86614c2c4d41b21205
tree 71ed116cfbc65562bfcbd2d913402c098c23c1df
parent f21bf5c66382a475a5127e449d05a75ba41a9a25
commit notes from my laptop
-rw-r--r-- notes/backfill_scalding_rewrite.txt 22
-rw-r--r-- notes/crawl_cdx_merge.md 16
-rw-r--r-- notes/hbase_table_sizes.txt 12
-rw-r--r-- notes/old_extract_results.txt 50
-rw-r--r-- notes/url_pattern_heuristic_backfill.txt 104
-rw-r--r-- notes/url_pattern_heuristic_verification.txt 52
6 files changed, 256 insertions, 0 deletions
diff --git a/notes/backfill_scalding_rewrite.txt b/notes/backfill_scalding_rewrite.txt
new file mode 100644
index 0000000..f5fb1d1
--- /dev/null
+++ b/notes/backfill_scalding_rewrite.txt
@@ -0,0 +1,22 @@
+
+Background context needed:
+- CDX text file format
+- rough arch outline (what runs where)
+- basic hadoop+hbase overview
+- hbase schema
+- quick look at hadoop and hbase web interfaces
+- maybe quick re-profile?
+
+Plan/Steps:
+x together: get *any* JVM map/reduce thing to build and run on cluster
+x together: get something to build that talks to hbase
+x basic JVM test infra; HBase mockup. "shopping"
+ => scalding and/or cascading
+x simple hbase scan report generation (counts/stats)
+x CDX parsing
+- complete backfill script
+
+Spec for CDX backfill script:
+- input is CDX, output to HBase table
+- filter input before anything ("defensive"; only PDF, HTTP 200, size limit)
+- reads HBase before insert; don't overwrite
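A minimal sketch of the backfill spec above, in Python rather than Scalding. It assumes the common 11-field, space-separated CDX layout (SURT key, timestamp, URL, mimetype, status, SHA-1, redirect, robot flags, length, offset, filename). The size cap, the use of the SHA-1 as row key, and the plain dict standing in for the HBase table are all hypothetical; the notes do not specify any of them.

```python
from typing import Dict, NamedTuple, Optional

# Assumed size cap; the notes only say "size limit" without a number.
MAX_SIZE_BYTES = 256 * 1024 * 1024


class CdxRow(NamedTuple):
    surt: str
    timestamp: str
    url: str
    mimetype: str
    status: str
    sha1: str
    length: int


def parse_cdx_line(line: str) -> Optional[CdxRow]:
    """Parse one 11-field CDX line; return None if it is malformed."""
    fields = line.strip().split(" ")
    if len(fields) != 11 or not fields[8].isdigit():
        return None
    surt, timestamp, url, mimetype, status, sha1 = fields[:6]
    return CdxRow(surt, timestamp, url, mimetype, status, sha1, int(fields[8]))


def keep_row(row: Optional[CdxRow]) -> bool:
    """Defensive filter: only PDF captures with HTTP 200 under the size cap."""
    return (
        row is not None
        and row.mimetype == "application/pdf"
        and row.status == "200"
        and row.length <= MAX_SIZE_BYTES
    )


def insert_if_absent(table: Dict[str, CdxRow], row: CdxRow) -> bool:
    """Read before insert: write the row only if its key is not present yet."""
    key = row.sha1  # hypothetical row key; the HBase schema is not given here
    if key in table:
        return False
    table[key] = row
    return True
```

Filtering happens before any table access, so malformed or unwanted lines never reach the read-then-write step.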
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
new file mode 100644
index 0000000..a843a8d
--- /dev/null
+++ b/notes/crawl_cdx_merge.md
@@ -0,0 +1,16 @@
+
+## Old Way
+
+
+Use metamgr to export an items list.
+
+Get all the CDX files and merge/sort:
+
+ mkdir CRAWL-2000 && cd CRAWL-2000
+ cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
+ ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
+ sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
+ wc -l CRAWL-2000.cdx
+ rm CRAWL-2000.unsorted.cdx
+
+ # gzip and upload to petabox, or send to HDFS, or whatever
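The merge/sort step above can also be sketched in Python, for cases where the shell pipeline is awkward to script. This is an illustration only: `merge_cdx` is a hypothetical helper, and it holds every unique line in memory, whereas `sort -u` spills to disk, so it only suits small crawls.

```python
import glob
import gzip


def merge_cdx(pattern: str, out_path: str) -> int:
    """Merge gzipped CDX files matching `pattern` into one sorted,
    de-duplicated text file; return the number of unique lines written."""
    lines = set()
    for path in glob.glob(pattern):
        # errors="replace" keeps going on the odd non-UTF-8 byte in a URL
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                lines.add(line.rstrip("\n"))
    with open(out_path, "w", encoding="utf-8") as out:
        for line in sorted(lines):
            out.write(line + "\n")
    return len(lines)
```

Usage would look like `merge_cdx("*/*.cdx.gz", "CRAWL-2000.cdx")`. Python sorts by code point, which corresponds to running `sort` under `LC_ALL=C`; other locales may order lines differently.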
diff --git a/notes/hbase_table_sizes.txt b/notes/hbase_table_sizes.txt
new file mode 100644
index 0000000..97bbb16
--- /dev/null
+++ b/notes/hbase_table_sizes.txt
@@ -0,0 +1,12 @@
+
+As of 2018-05-29:
+- qa rows: 1,246,013
+- prod rows: 8,974,188
+
+As of 2018-06-16:
+- qa: 1,246,013
+- prod: 18,308,086
+
+As of 2018-08-01:
+- qa: 1,246,013
+- prod: 18,308,141
diff --git a/notes/old_extract_results.txt b/notes/old_extract_results.txt
new file mode 100644
index 0000000..0327b8b
--- /dev/null
+++ b/notes/old_extract_results.txt
@@ -0,0 +1,50 @@
+
+command:
+
+ ./extraction_cdx_grobid.py --hbase-table wbgrp-journal-extract-0-qa --hbase-host bnewbold-dev.us.archive.org --grobid-uri http://wbgrp-svc096.us.archive.org:8070 -r hadoop -c mrjob.conf --archive $VENVSHORT.tar.gz#venv hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx --jobconf mapred.line.input.format.linespermap=8000 --jobconf mapreduce.job.queuename=extraction
+
+Started: Wed Apr 11 05:54:54 UTC 2018
+Finished: Sun Apr 15 20:42:37 UTC 2018
+(GROBID parallelism was fixed late Saturday night, PST)
+
+Elapsed: 110hrs, 47mins, 42sec
+
+line counts:
+ error 3896
+ existing 311209
+ invalid 2311343
+ skip 195641
+ success 1143094
+ total 3,965,183
+
+## Against prod table
+
+Started: Sun Apr 15 21:38:24 UTC 2018
+Finished: Wed Apr 18 17:36:44 UTC 2018
+Elapsed: 67hrs, 58mins, 20sec
+
+lines
+ error 143
+ existing 213292
+ invalid 2311343
+ skip 195641
+ success 1,244,764
+ total 3,965,183
+
+## TARGETED
+
+Job job_1513499322977_358533 failed with state FAILED due to: Task failed task_1513499322977_358533_m_000323
+
+Started: Thu Apr 19 05:21:25 UTC 2018
+Finished: Sat Apr 21 11:01:58 UTC 2018
+Elapsed: 53hrs, 40mins, 33sec
+
+lines
+ error=4093
+ existing=55448
+ invalid=688873
+ skip=257533
+ success=1,282,053
+ total=2,288,000
+
+
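A quick consistency check on the three runs above: the per-status counts should add up to the reported total, and the success fraction gives a rough yield per run. The run labels (`qa`, `prod`, `targeted`) are names chosen here for illustration; the numbers are copied from the notes.

```python
from typing import Dict, Tuple

RUNS: Dict[str, Dict[str, int]] = {
    "qa": {"error": 3896, "existing": 311209, "invalid": 2311343,
           "skip": 195641, "success": 1143094, "total": 3965183},
    "prod": {"error": 143, "existing": 213292, "invalid": 2311343,
             "skip": 195641, "success": 1244764, "total": 3965183},
    "targeted": {"error": 4093, "existing": 55448, "invalid": 688873,
                 "skip": 257533, "success": 1282053, "total": 2288000},
}


def check_run(counts: Dict[str, int]) -> Tuple[bool, float]:
    """Return (statuses sum to total?, success / total)."""
    parts = sum(v for k, v in counts.items() if k != "total")
    return parts == counts["total"], counts["success"] / counts["total"]


for name, counts in RUNS.items():
    ok, frac = check_run(counts)
    print(f"{name}: sums_ok={ok} success={frac:.1%}")
```

All three runs sum correctly; the success fractions come out at roughly 29%, 31%, and 56% of input lines.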
diff --git a/notes/url_pattern_heuristic_backfill.txt b/notes/url_pattern_heuristic_backfill.txt
new file mode 100644
index 0000000..8e422f5
--- /dev/null
+++ b/notes/url_pattern_heuristic_backfill.txt
@@ -0,0 +1,104 @@
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter
+ 21,434,960
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-msag
+ 13,637,948
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-unpaywall-20180329
+ 3,393,658
+
+#########
+
+Goal: backfill a bunch of existing content into the HBase table. Bonus for
+being re-runnable in the future.
+
+Source data:
+- GWB PDF CDX list
+- archive.org JSTOR files (?)
+- arxiv.org bulk files (?)
+- large URL lists (MSAG, etc)
+
+Methods:
+- pig filter GWB PDF CDX list based on regexes
+- pig join GWB PDF CDX list to known URL lists (then remove join)
+x iterate URL lists, hitting CDX API and saving response
+
+
+- (.edu, .ac.uk) domain with a tilde in the URL
+
+#http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf
+#http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf
+#http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf
+#http://www.comp.hkbu.edu.hk/~ymc/papers/conference/ijcnn03_710.pdf
+
+hk,edu,hkbu,comp)/~ymc/papers/conference/ijcnn03_710.pdf
+edu,stanford,www)/~johntayl/Papers/taylor2.pdf
+edu,nps,met)/~mtmontgo/papers/isabel_part2.pdf
+edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+jp,ak,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+co,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+
+NOT: com,corp,edu,,www)/~druzdzel/psfiles/ecai06.pdf
+
+- words in the URL path: paper(s), pubs, research, publications, article, proceedings
+
+#http://personal.ee.surrey.ac.uk/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+#http://files.eric.ed.gov/fulltext/EJ798626.pdf
+#http://www.hbs.edu/research/pdf/10-108.pdf
+#http://www.unifr.ch/biochem/assets/files/albrecht/publications/Abraham06.pdf
+#http://www.cnbc.cmu.edu/cns/papers/Kassetal2005.pdf
+#http://www.macrothink.org/journal/index.php/ijhrs/article/download/5765/4663
+#http://www.pims.math.ca:80/science/2004/fpsac/Papers/Liskovets.pdf
+#http://www.risc.uni-linz.ac.at/publications/download/risc_3287/synasc_revised.pdf
+#http://softsys.cs.uoi.gr/dbglobe/publications/wi04.pdf
+#http://lexikos.journals.ac.za/pub/article/download/1048/564
+#http://www.siam.org/proceedings/analco/2007/anl07_029ecesaratto.pdf
+#http://www.cs.bris.ac.uk/Publications/Papers/2000249.pdf
+
+uk,ac,surrey,ee,personal)/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+gov,ed,eric,files)/fulltext/EJ798626.pdf
+edu,hbs,www)/research/pdf/10-108.pdf
+ch,unifr,www)/biochem/assets/files/albrecht/publications/Abraham06.pdf
+edu,cmu,cnbc,www)/cns/papers/Kassetal2005.pdf
+org,macrothink,www)/journal/index.php/ijhrs/article/download/5765/4663
+ca,math,pims,www)/science/2004/fpsac/Papers/Liskovets.pdf
+at,ac,uni-linz,risc,www)/publications/download/risc_3287/synasc_revised.pdf
+gr,uoi,cs,softsys)/dbglobe/publications/wi04.pdf
+za,ac,journals,lexikos)/pub/article/download/1048/564
+org,siam,www)/proceedings/analco/2007/anl07_029ecesaratto.pdf
+uk,ac,bris,cs,www)/Publications/Papers/2000249.pdf
+
+
+- words in domains: hal., eprint, research., journal
+
+#http://research.fit.edu/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+#http://ijs.sgmjournals.org:80/cgi/reprint/54/6/2217.pdf
+#http://eprints.ecs.soton.ac.uk/12020/1/mind-the-semantic-gap.pdf
+#http://eprint.uq.edu.au/archive/00004120/01/R103_Forrester_pp.pdf
+
+edu,fit,research)/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+org,sgmjournals,ijs)/cgi/reprint/54/6/2217.pdf
+uk,ac,soton,ecs,eprints)/12020/1/mind-the-semantic-gap.pdf
+au,edu,uq,eprint)/archive/00004120/01/R103_Forrester_pp.pdf
+
+- doi-like pattern in URL
+#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
+#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607
+
+org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1
+org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607
+
+- short list of hosts/domains?
+ *.core.ac.uk
+ *scielo*
+ *.redalyc.org
+
+#http://www.scielo.br:80/pdf/cagro/v33n1/v33n1a19.pdf
+#https://revistas.unal.edu.co/index.php/dyna/article/viewFile/51385/57892
+#http://rives.revues.org:80/pdf/449
+
+br,scielo,www)/pdf/cagro/v33n1/v33n1a19.pdf
+co,edu,unal,revistas)/index.php/dyna/article/viewFile/51385/57892
+org,revues,rives)/pdf/449
+
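The heuristics above could be prototyped outside Pig along these lines. This is a sketch only: every regex below is a hypothetical reading of the bullet points, not the pattern actually used in the Pig filter, and `to_surt` is a simplified host reversal rather than full SURT canonicalization (it does not lowercase paths or strip `www`).

```python
import re
from typing import List, Tuple
from urllib.parse import urlsplit


def to_surt(url: str) -> Tuple[str, str]:
    """Return (reversed-host key, path), e.g. ('edu,stanford,www', '/~x/y.pdf')."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))), parts.path


# Hypothetical patterns, one per heuristic listed in the notes.
EDU_HOST = re.compile(r"^(edu|uk,ac|[a-z]{2},edu)(,|$)")
PATH_WORDS = re.compile(r"papers?|pubs|research|publications|article|proceedings", re.I)
HOST_WORDS = re.compile(r"(^|,)(hal|research)(,|$)|eprint|journal")
DOI_LIKE = re.compile(r"/10\.\d{4,9}/\S+")
HOST_LIST = re.compile(r"^uk,ac,core(,|$)|scielo|^org,redalyc(,|$)")


def matched_heuristics(url: str) -> List[str]:
    """List the names of the heuristics that a URL matches."""
    host, path = to_surt(url)
    hits = []
    if EDU_HOST.search(host) and "~" in path:
        hits.append("edu_tilde")
    if PATH_WORDS.search(path):
        hits.append("path_words")
    if HOST_WORDS.search(host):
        hits.append("host_words")
    if DOI_LIKE.search(path):
        hits.append("doi_like")
    if HOST_LIST.search(host):
        hits.append("host_list")
    return hits
```

For example, `matched_heuristics("http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607")` returns `["doi_like"]`. The path-word pattern is a plain substring match, so it will also hit words like "newspaper"; tightening it to path-segment boundaries is one obvious refinement.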
diff --git a/notes/url_pattern_heuristic_verification.txt b/notes/url_pattern_heuristic_verification.txt
new file mode 100644
index 0000000..7b35b88
--- /dev/null
+++ b/notes/url_pattern_heuristic_verification.txt
@@ -0,0 +1,52 @@
+
+## URL pattern regexing
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter/part*
+
+N https://nsarchive2.gwu.edu//rus/text_files/Volkogonov/1918.10.13%20Speech%20by%20BK,%20to%20Red%20Army%20Soldiers,%20R13977.pdf speech, russian
+
+edu tilde:
+ N http://www.d.umn.edu/~kgilbert/ened3342-1/Field%20Interp%202/snow/CloudIDKey.pdf homework?
+ N http://www.mech.utah.edu/~minor/BIOSKETCH-minor-october%202007.pdf CV
+ N http://web.archive.org/web/20030724175610/http://www.ssc.wisc.edu:80/~sseverin/lect12f01.pdf slides
+ N http://web.archive.org/web/20050117195001/http://www.csie.ntu.edu.tw:80/~b90013/DBhw7.pdf
+ Y http://web.archive.org/web/20040220222413/http://homepages.uc.edu:80/~lukovib/aiaa_02_0857.pdf
+ Y http://www.kki.yamanashi.ac.jp/~ohbuchi/online_pubs/IEEE_bigMM2015_Matsuda/BigMM_20150224b_web.pdf
+
+other words:
+ N https://files.eric.ed.gov/fulltext/ED069848.pdf tech report?
+ N http://istitutocomprensivopescara2.gov.it/attachments/article/164/griglia_osservativa_bes_terza_fascia.pdf table
+ M https://jfjustice.net/userfiles/file/Research/Report%20of%20the%20Outreach%20Forums%20on%20the%20PIL%20Cases%20on%20Sexual%20Gender%20Based%20Violence.pdf report
+ M http://www.iitk.ac.in/nicee/wcee/article/13_9035.pdf filler page? like a paper
+ Y http://www.dtic.mil/dtic/tr/fulltext/u2/314095.pdf
+ Y https://www.casact.org/pubs/proceed/proceed25/25400.pdf
+ Y http://circres.ahajournals.org/content/circresaha/111/8/1002.full.pdf
+ Y http://web.archive.org/web/20170313034332/http://thixomet.ru/UserFiles/File/Articles/1/2.CHM_2006_02-2.pdf
+ Y http://www.redalyc.org/pdf/873/87313713019.pdf
+ Y http://ukacc.group.shef.ac.uk/proceedings/control2004/Papers/213.pdf
+ Y http://periodicos.uem.br:80/ojs/index.php/RbhrAnpuh/article/download/23988/13095
+ Y http://w3.uqo.ca/photonique/papers/measurement.pdf
+ Y http://web.archive.org/web/20140312150030/http://afms.org.au/proceedings/9/Griffiths.pdf
+ Y http://www.hal.inserm.fr/file/index/docid/580194/filename/PROSTATE_SEGMENTATION_IN_HIFU_THERAPY.pdf
+ Y http://journal.ipb.ac.id/index.php/jmht/article/download/6003/4658
+
+publications:
+ N http://web.archive.org/web/20060527120026/http://www.merenkulkulaitos.fi:80/e/services/informationservices/publications/bulletin/avaa.php?id=336 treaty?
+ N http://orbit.dtu.dk/en/publications/status-for-skarven-i-danmark(8ffaf614-387e-429f-9fd4-4677ee5016ae).pdf?nofollow=true&rendering=standard related to a paper?
+ N http://community.trinity.nsw.edu.au/navbar/publications/docs/news/2_pn/2016/ps160103.pdf newsletter
+ N http://web.archive.org/web/20170216001602/https://www.nass.usda.gov/Statistics_by_State/New_Mexico/Publications/Annual_Statistical_Bulletin/2005/03_05.pdf report
+ N http://web.archive.org/web/20110109080048/http://www.ipria.org/publications/on-line-bulletins/austdev/AusDevsBulletin07.09.pdf
+ N http://web.archive.org/web/20060930192249/http://www.nmmfa.org/publications/CensusTracts/35031940200.pdf
+ N http://web.archive.org/web/20100621152841/http://psychologymatters.org/workforce/publications/01-doc-empl/table-11.pdf
+ N http://www.dtce.org.pk/DTCE/Publications/PN2 final report-dr8-F.pdf
+ Y https://www.frbatlanta.org/-/media/Documents/research/publications/wp/1995/wp9513.pdf
+ Y http://irrec.ifas.ufl.edu/IRSWS/publications/Lu_ESPR_2011.pdf
+
+doi:
+ M https://page-one.live.cf.public.springer.com/pdf/preview/10.1007/s11229-012-0117-8 paper, but only fragment (!?!?!)
+
+
+TODO:
+- drop "publications", "research", "pubs"
+- edu tilde is borderline... but keep it for now
+- black-list page-one.*
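To put rough numbers behind the TODO items, the spot checks above can be tallied per heuristic. The counts below are transcribed from the Y/N/M marks in the notes; reading "M" as "maybe" is an assumption, and the samples are far too small for more than a sanity check.

```python
from typing import Dict

# Y / N / M counts per heuristic, transcribed from the spot checks.
SPOT_CHECKS: Dict[str, Dict[str, int]] = {
    "edu tilde": {"Y": 2, "N": 4, "M": 0},
    "other words": {"Y": 11, "N": 2, "M": 2},
    "publications": {"Y": 2, "N": 8, "M": 0},
    "doi": {"Y": 0, "N": 0, "M": 1},
}


def yes_fraction(counts: Dict[str, int]) -> float:
    """Fraction of sampled URLs marked Y; 0.0 if nothing was sampled."""
    total = sum(counts.values())
    return counts["Y"] / total if total else 0.0


for name, counts in SPOT_CHECKS.items():
    print(f"{name}: {yes_fraction(counts):.0%} of {sum(counts.values())} samples")
```

On these samples "publications" has the lowest Y fraction (2 of 10), which is consistent with dropping it, while "edu tilde" sits at 2 of 6, matching the "borderline" call.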