From 1ae7fd2f0c5661560b15be86614c2c4d41b21205 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 24 Aug 2018 13:39:02 -0700
Subject: commit notes from my laptop

---
 notes/backfill_scalding_rewrite.txt          |  22 ++++++
 notes/crawl_cdx_merge.md                     |  16 +++++
 notes/hbase_table_sizes.txt                  |  12 ++++
 notes/old_extract_results.txt                |  50 +++++++++++++
 notes/url_pattern_heuristic_backfill.txt     | 104 +++++++++++++++++++++++++++
 notes/url_pattern_heuristic_verification.txt |  52 ++++++++++++++
 6 files changed, 256 insertions(+)
 create mode 100644 notes/backfill_scalding_rewrite.txt
 create mode 100644 notes/crawl_cdx_merge.md
 create mode 100644 notes/hbase_table_sizes.txt
 create mode 100644 notes/old_extract_results.txt
 create mode 100644 notes/url_pattern_heuristic_backfill.txt
 create mode 100644 notes/url_pattern_heuristic_verification.txt

diff --git a/notes/backfill_scalding_rewrite.txt b/notes/backfill_scalding_rewrite.txt
new file mode 100644
index 0000000..f5fb1d1
--- /dev/null
+++ b/notes/backfill_scalding_rewrite.txt
@@ -0,0 +1,22 @@
+
+Background context needed:
+- CDX text file format
+- rough arch outline (what runs where)
+- basic hadoop+hbase overview
+- hbase schema
+- quick look at hadoop and hbase web interfaces
+- maybe quick re-profile?
+
+Plan/Steps:
+x together: get *any* JVM map/reduce thing to build and run on cluster
+x together: get something to build that talks to hbase
+x basic JVM test infra; HBase mockup. "shopping"
+    => scalding and/or cascading
+x simple hbase scan report generation (counts/stats)
+x CDX parsing
+- complete backfill script
+
+Spec for CDX backfill script:
+- input is CDX, output to HBase table
+- filter input before anything ("defensive"; only PDF, HTTP 200, size limit)
+- reads HBase before insert; don't overwrite
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
new file mode 100644
index 0000000..a843a8d
--- /dev/null
+++ b/notes/crawl_cdx_merge.md
@@ -0,0 +1,16 @@
+
+## Old Way
+
+
+Use metamgr to export an items list.
+
+Get all the CDX files and merge/sort:
+
+    mkdir CRAWL-2000 && cd CRAWL-2000
+    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
+    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
+    sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
+    wc -l CRAWL-2000.cdx
+    rm CRAWL-2000.unsorted.cdx
+
+    # gzip and upload to petabox, or send to HDFS, or whatever
diff --git a/notes/hbase_table_sizes.txt b/notes/hbase_table_sizes.txt
new file mode 100644
index 0000000..97bbb16
--- /dev/null
+++ b/notes/hbase_table_sizes.txt
@@ -0,0 +1,12 @@
+
+As of 2018-05-29:
+- qa rows: 1,246,013
+- prod rows: 8,974,188
+
+As of 2018-06-16:
+- qa: 1,246,013
+- prod: 18,308,086
+
+As of 2018-08-01:
+- qa: 1,246,013
+- prod: 18,308,141
diff --git a/notes/old_extract_results.txt b/notes/old_extract_results.txt
new file mode 100644
index 0000000..0327b8b
--- /dev/null
+++ b/notes/old_extract_results.txt
@@ -0,0 +1,50 @@
+
+command:
+
+    ./extraction_cdx_grobid.py --hbase-table wbgrp-journal-extract-0-qa --hbase-host bnewbold-dev.us.archive.org --grobid-uri http://wbgrp-svc096.us.archive.org:8070 -r hadoop -c mrjob.conf --archive $VENVSHORT.tar.gz#venv hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx --jobconf mapred.line.input.format.linespermap=8000 --jobconf mapreduce.job.queuename=extraction
+
+Started: Wed Apr 11 05:54:54 UTC 2018
+Finished: Sun Apr 15 20:42:37 UTC 2018
+(late saturday night PST fixed grobid parallelism)
+
+Elapsed: 110hrs, 47mins, 42sec
+
+line counts:
+    error 3896
+    existing 311209
+    invalid 2311343
+    skip 195641
+    success 1143094
+    total 3,965,183
+
+## Against prod table
+
+Started: Sun Apr 15 21:38:24 UTC 2018
+Finished: Wed Apr 18 17:36:44 UTC 2018
+Elapsed: 67hrs, 58mins, 20sec
+
+lines
+    error 143
+    existing 213292
+    invalid 2311343
+    skip 195641
+    success 1,244,764
+    total 3,965,183
+
+## TARGETED
+
+Job job_1513499322977_358533 failed with state FAILED due to: Task failed task_1513499322977_358533_m_000323
+Started: Thu Apr 19 05:21:25 UTC 2018
+Finished: Sat Apr 21 11:01:58 UTC 2018
+Elapsed: 53hrs, 40mins, 33sec
+
+lines
+    error=4093
+    existing=55448
+    invalid=688873
+    skip=257533
+    success=1,282,053
+    total=2,288,000
+
+
diff --git a/notes/url_pattern_heuristic_backfill.txt b/notes/url_pattern_heuristic_backfill.txt
new file mode 100644
index 0000000..8e422f5
--- /dev/null
+++ b/notes/url_pattern_heuristic_backfill.txt
@@ -0,0 +1,104 @@
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter
+    21,434,960
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-msag
+    13,637,948
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-unpaywall-20180329
+    3,393,658
+
+#########
+
+Goal: backfill a bunch of existing content into the HBase table. Bonus for
+being re-runnable in the future.
+
+Source data:
+- GWB PDF CDX list
+- archive.org JSTOR files (?)
+- arxiv.org bulk files (?)
+- large URL lists (MSAG, etc)
+
+Methods:
+- pig filter GWB PDF CDX list based on regexes
+- pig join GWB PDF CDX list to known URL lists (then remove join)
+x iterate URL lists, hitting CDX API and saving response
+
+
+- (.edu, .ac.uk) domain with a tilde in the URL
+
+#http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf
+#http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf
+#http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf
+#http://www.comp.hkbu.edu.hk/~ymc/papers/conference/ijcnn03_710.pdf
+
+hk,edu,hkbu,comp)/~ymc/papers/conference/ijcnn03_710.pdf
+edu,stanford,www)/~johntayl/Papers/taylor2.pdf
+edu,nps,met)/~mtmontgo/papers/isabel_part2.pdf
+edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+jp,ak,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+co,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+
+NOT: com,corp,edu,,www)/~druzdzel/psfiles/ecai06.pdf
+
+- the words in URL: paper(s), pubs, research, publications, article, proceedings
+
+#http://personal.ee.surrey.ac.uk/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+#http://files.eric.ed.gov/fulltext/EJ798626.pdf
+#http://www.hbs.edu/research/pdf/10-108.pdf
+#http://www.unifr.ch/biochem/assets/files/albrecht/publications/Abraham06.pdf
+#http://www.cnbc.cmu.edu/cns/papers/Kassetal2005.pdf
+#http://www.macrothink.org/journal/index.php/ijhrs/article/download/5765/4663
+#http://www.pims.math.ca:80/science/2004/fpsac/Papers/Liskovets.pdf
+#http://www.risc.uni-linz.ac.at/publications/download/risc_3287/synasc_revised.pdf
+#http://softsys.cs.uoi.gr/dbglobe/publications/wi04.pdf
+#http://lexikos.journals.ac.za/pub/article/download/1048/564
+#http://www.siam.org/proceedings/analco/2007/anl07_029ecesaratto.pdf
+#http://www.cs.bris.ac.uk/Publications/Papers/2000249.pdf
+
+uk,ac,surrey,ee,personal)/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+gov,ed,eric,files)/fulltext/EJ798626.pdf
+edu,hbs,www)/research/pdf/10-108.pdf
+ch,unifr,www)/biochem/assets/files/albrecht/publications/Abraham06.pdf
+edu,cmu,cnbc,www)/cns/papers/Kassetal2005.pdf
+org,macrothink,www)/journal/index.php/ijhrs/article/download/5765/4663
+ca,math,pims,www)/science/2004/fpsac/Papers/Liskovets.pdf
+at,ac,uni-linz,risc,www)/publications/download/risc_3287/synasc_revised.pdf
+gr,uoi,cs,softsys)/dbglobe/publications/wi04.pdf
+za,ac,journals,lexikos)/pub/article/download/1048/564
+org,siam,www)/proceedings/analco/2007/anl07_029ecesaratto.pdf
+uk,ac,bris,cs,www)/Publications/Papers/2000249.pdf
+
+
+- words in domains: hal., eprint, research., journal
+
+#http://research.fit.edu/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+#http://ijs.sgmjournals.org:80/cgi/reprint/54/6/2217.pdf
+#http://eprints.ecs.soton.ac.uk/12020/1/mind-the-semantic-gap.pdf
+#http://eprint.uq.edu.au/archive/00004120/01/R103_Forrester_pp.pdf
+
+edu,fit,research)/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+org,sgmjournals,ijs)/cgi/reprint/54/6/2217.pdf
+uk,ac,soton,ecs,eprints)/12020/1/mind-the-semantic-gap.pdf
+au,edu,uq,eprint)/archive/00004120/01/R103_Forrester_pp.pdf
+
+- doi-like pattern in URL
+#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
+#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607
+
+org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1
+org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607
+
+- short list of hosts/domains?
+    *.core.ac.uk
+    *scielo*
+    *.redalyc.org
+
+#http://www.scielo.br:80/pdf/cagro/v33n1/v33n1a19.pdf
+#https://revistas.unal.edu.co/index.php/dyna/article/viewFile/51385/57892
+#http://rives.revues.org:80/pdf/449
+
+br,scielo,www)/pdf/cagro/v33n1/v33n1a19.pdf
+co,edu,unal,revistas)/index.php/dyna/article/viewFile/51385/57892
+org,revues,rives)/pdf/449
+
diff --git a/notes/url_pattern_heuristic_verification.txt b/notes/url_pattern_heuristic_verification.txt
new file mode 100644
index 0000000..7b35b88
--- /dev/null
+++ b/notes/url_pattern_heuristic_verification.txt
@@ -0,0 +1,52 @@
+
+## URL pattern regexing
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter/part*
+
+N https://nsarchive2.gwu.edu//rus/text_files/Volkogonov/1918.10.13%20Speech%20by%20BK,%20to%20Red%20Army%20Soldiers,%20R13977.pdf  speech, russian
+
+edu tilde:
+    N http://www.d.umn.edu/~kgilbert/ened3342-1/Field%20Interp%202/snow/CloudIDKey.pdf  homework?
+    N http://www.mech.utah.edu/~minor/BIOSKETCH-minor-october%202007.pdf  CV
+    N http://web.archive.org/web/20030724175610/http://www.ssc.wisc.edu:80/~sseverin/lect12f01.pdf  slides
+    N http://web.archive.org/web/20050117195001/http://www.csie.ntu.edu.tw:80/~b90013/DBhw7.pdf
+    Y http://web.archive.org/web/20040220222413/http://homepages.uc.edu:80/~lukovib/aiaa_02_0857.pdf
+    Y http://www.kki.yamanashi.ac.jp/~ohbuchi/online_pubs/IEEE_bigMM2015_Matsuda/BigMM_20150224b_web.pdf
+
+other words:
+    N https://files.eric.ed.gov/fulltext/ED069848.pdf  tech report?
+    N http://istitutocomprensivopescara2.gov.it/attachments/article/164/griglia_osservativa_bes_terza_fascia.pdf  table
+    M https://jfjustice.net/userfiles/file/Research/Report%20of%20the%20Outreach%20Forums%20on%20the%20PIL%20Cases%20on%20Sexual%20Gender%20Based%20Violence.pdf  report
+    M http://www.iitk.ac.in/nicee/wcee/article/13_9035.pdf  filler page? like a paper
+    Y http://www.dtic.mil/dtic/tr/fulltext/u2/314095.pdf
+    Y https://www.casact.org/pubs/proceed/proceed25/25400.pdf
+    Y http://circres.ahajournals.org/content/circresaha/111/8/1002.full.pdf
+    Y http://web.archive.org/web/20170313034332/http://thixomet.ru/UserFiles/File/Articles/1/2.CHM_2006_02-2.pdf
+    Y http://www.redalyc.org/pdf/873/87313713019.pdf
+    Y http://ukacc.group.shef.ac.uk/proceedings/control2004/Papers/213.pdf
+    Y http://periodicos.uem.br:80/ojs/index.php/RbhrAnpuh/article/download/23988/13095
+    Y http://w3.uqo.ca/photonique/papers/measurement.pdf
+    Y http://web.archive.org/web/20140312150030/http://afms.org.au/proceedings/9/Griffiths.pdf
+    Y http://www.hal.inserm.fr/file/index/docid/580194/filename/PROSTATE_SEGMENTATION_IN_HIFU_THERAPY.pdf
+    Y http://journal.ipb.ac.id/index.php/jmht/article/download/6003/4658
+
+publications:
+    N http://web.archive.org/web/20060527120026/http://www.merenkulkulaitos.fi:80/e/services/informationservices/publications/bulletin/avaa.php?id=336  treaty?
+    N http://orbit.dtu.dk/en/publications/status-for-skarven-i-danmark(8ffaf614-387e-429f-9fd4-4677ee5016ae).pdf?nofollow=true&rendering=standard  related to a paper?
+    N http://community.trinity.nsw.edu.au/navbar/publications/docs/news/2_pn/2016/ps160103.pdf  newsletter
+    N http://web.archive.org/web/20170216001602/https://www.nass.usda.gov/Statistics_by_State/New_Mexico/Publications/Annual_Statistical_Bulletin/2005/03_05.pdf  report
+    N http://web.archive.org/web/20110109080048/http://www.ipria.org/publications/on-line-bulletins/austdev/AusDevsBulletin07.09.pdf
+    N http://web.archive.org/web/20060930192249/http://www.nmmfa.org/publications/CensusTracts/35031940200.pdf
+    N http://web.archive.org/web/20100621152841/http://psychologymatters.org/workforce/publications/01-doc-empl/table-11.pdf
+    N http://www.dtce.org.pk/DTCE/Publications/PN2 final report-dr8-F.pdf
+    Y https://www.frbatlanta.org/-/media/Documents/research/publications/wp/1995/wp9513.pdf
+    Y http://irrec.ifas.ufl.edu/IRSWS/publications/Lu_ESPR_2011.pdf
+
+doi:
+    M https://page-one.live.cf.public.springer.com/pdf/preview/10.1007/s11229-012-0117-8  paper, but only fragment (!?!?!)
+
+
+TODO:
+- drop "publications", "research", "pubs"
+- edu tilde is borderline... but keep it for now
+- black-list page-one.*
-- 
cgit v1.2.3
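
Sketch (not part of the patch): the notes list the URL heuristics and the TODO outcome, but not the actual pig regexes. Below is a rough Python approximation of the rule set as it would stand after the TODO, for trying rules against sample URLs. Everything named here is hypothetical: `match_heuristics`, the rule labels, and all of the regexes are guesses reconstructed from the examples above. Treating "words in URL" as whole path segments, and keeping "research." as a domain word while dropping "research" as a path word, are also assumptions.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# All patterns below are guesses based on the example URLs in the notes,
# not the regexes used in the real pig filter.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+/(.+)$")
# .edu, .edu.hk, .ac.uk, .ac.jp, ... (assumed reading of "(.edu, .ac.uk)")
ACADEMIC_HOST = re.compile(r"\.(edu|ac)(\.[a-z]{2})?$")
# Path words left after the TODO drops "publications", "research", "pubs".
PATH_WORD = re.compile(r"/(papers?|articles?|proceedings)(/|$)", re.IGNORECASE)
# "hal." and "research." as host labels; "eprint" and "journal" as substrings.
DOMAIN_WORD = re.compile(r"(^|\.)(hal|research)\.|eprint|journal")
DOI_LIKE = re.compile(r"/10\.\d{4,9}/\S")
HOST_GLOBS = ("*.core.ac.uk", "*scielo*", "*.redalyc.org")


def match_heuristics(url):
    """Return the labels of the heuristic rules that a URL matches."""
    # Unwrap web.archive.org replay URLs so rules see the original URL.
    wrapped = WAYBACK.match(url)
    if wrapped:
        url = wrapped.group(1)
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""  # lowercased, port stripped
    except ValueError:
        return []
    # TODO item from the notes: black-list page-one.* hosts.
    if not host or host.startswith("page-one."):
        return []
    path = parts.path
    hits = []
    if ACADEMIC_HOST.search(host) and "~" in path:
        hits.append("edu_tilde")
    if PATH_WORD.search(path):
        hits.append("path_word")
    if DOMAIN_WORD.search(host):
        hits.append("domain_word")
    if DOI_LIKE.search(path):
        hits.append("doi")
    if any(fnmatchcase(host, glob) for glob in HOST_GLOBS):
        hits.append("host_list")
    return hits


if __name__ == "__main__":
    samples = [
        "http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf",
        "http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607",
        "http://www.hbs.edu/research/pdf/10-108.pdf",
    ]
    for sample in samples:
        print(match_heuristics(sample), sample)
```

Returning every matching label rather than a single yes/no makes it easier to tally Y/N/M verdicts per rule, as in the verification notes, before deciding which rules to keep.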