-rw-r--r--  notes/backfill_scalding_rewrite.txt             22
-rw-r--r--  notes/crawl_cdx_merge.md                        16
-rw-r--r--  notes/hbase_table_sizes.txt                     12
-rw-r--r--  notes/old_extract_results.txt                   50
-rw-r--r--  notes/url_pattern_heuristic_backfill.txt       104
-rw-r--r--  notes/url_pattern_heuristic_verification.txt    52
6 files changed, 256 insertions, 0 deletions
diff --git a/notes/backfill_scalding_rewrite.txt b/notes/backfill_scalding_rewrite.txt
new file mode 100644
index 0000000..f5fb1d1
--- /dev/null
+++ b/notes/backfill_scalding_rewrite.txt
@@ -0,0 +1,22 @@
+
+Background context needed:
+- CDX text file format
+- rough arch outline (what runs where)
+- basic hadoop+hbase overview
+- hbase schema
+- quick look at hadoop and hbase web interfaces
+- maybe quick re-profile?
+
+Plan/Steps:
+x together: get *any* JVM map/reduce thing to build and run on cluster
+x together: get something to build that talks to hbase
+x basic JVM test infra; HBase mockup. "shopping"
+    => scalding and/or cascading
+x simple hbase scan report generation (counts/stats)
+x CDX parsing
+- complete backfill script
+
+Spec for CDX backfill script:
+- input is CDX, output to HBase table
+- filter input before anything ("defensive"; only PDF, HTTP 200, size limit)
+- reads HBase before insert; don't overwrite
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
new file mode 100644
index 0000000..a843a8d
--- /dev/null
+++ b/notes/crawl_cdx_merge.md
@@ -0,0 +1,16 @@
+
+## Old Way
+
+
+Use metamgr to export an items list.
+
+Get all the CDX files and merge/sort:
+
+    mkdir CRAWL-2000 && cd CRAWL-2000
+    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
+    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
+    sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
+    wc -l CRAWL-2000.cdx
+    rm CRAWL-2000.unsorted.cdx
+
+    # gzip and upload to petabox, or send to HDFS, or whatever
diff --git a/notes/hbase_table_sizes.txt b/notes/hbase_table_sizes.txt
new file mode 100644
index 0000000..97bbb16
--- /dev/null
+++ b/notes/hbase_table_sizes.txt
@@ -0,0 +1,12 @@
+
+As of 2018-05-29:
+- qa rows:   1,246,013
+- prod rows: 8,974,188
+
+As of 2018-06-16:
+- qa:    1,246,013
+- prod: 18,308,086
+
+As of 2018-08-01:
+- qa:    1,246,013
+- prod: 18,308,141
diff --git a/notes/old_extract_results.txt b/notes/old_extract_results.txt
new file mode 100644
index 0000000..0327b8b
--- /dev/null
+++ b/notes/old_extract_results.txt
@@ -0,0 +1,50 @@
+
+command:
+
+    ./extraction_cdx_grobid.py         --hbase-table wbgrp-journal-extract-0-qa         --hbase-host bnewbold-dev.us.archive.org         --grobid-uri http://wbgrp-svc096.us.archive.org:8070 -r hadoop -c mrjob.conf --archive $VENVSHORT.tar.gz#venv hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx --jobconf mapred.line.input.format.linespermap=8000 --jobconf mapreduce.job.queuename=extraction
+
+Started:    Wed Apr 11 05:54:54 UTC 2018
+Finished:   Sun Apr 15 20:42:37 UTC 2018
+(late saturday night PST fixed grobid parallelism)
+
+Elapsed: 110hrs, 47mins, 42sec
+
+line counts:
+    error	3896
+    existing	311209
+    invalid	2311343
+    skip	195641
+    success	1143094
+    total	3,965,183
+
+## Against prod table
+
+Started:    Sun Apr 15 21:38:24 UTC 2018
+Finished:   Wed Apr 18 17:36:44 UTC 2018
+Elapsed:    67hrs, 58mins, 20sec
+
+lines
+    error   143
+    existing    213292
+    invalid 2311343
+    skip    195641
+    success 1,244,764
+    total   3,965,183
+
+## TARGETED
+
+Job job_1513499322977_358533 failed with state FAILED due to: Task failed task_1513499322977_358533_m_000323
+
+Started:	Thu Apr 19 05:21:25 UTC 2018
+Finished:	Sat Apr 21 11:01:58 UTC 2018
+Elapsed:	53hrs, 40mins, 33sec
+
+lines
+        error=4093
+        existing=55448
+        invalid=688873
+        skip=257533
+        success=1,282,053
+        total=2,288,000
+
+
diff --git a/notes/url_pattern_heuristic_backfill.txt b/notes/url_pattern_heuristic_backfill.txt
new file mode 100644
index 0000000..8e422f5
--- /dev/null
+++ b/notes/url_pattern_heuristic_backfill.txt
@@ -0,0 +1,104 @@
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter
+    21,434,960
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-msag
+    13,637,948
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-unpaywall-20180329
+    3,393,658
+
+#########
+
+Goal: backfill a bunch of existing content into the HBase table. Bonus for
+being re-runable in the future.
+
+Source data:
+- GWB PDF CDX list
+- archive.org JSTOR files (?)
+- arxiv.org bulk files (?)
+- large URL lists (MSAG, etc)
+
+Methods:
+- pig filter GWB PDF CDX list based on regexes
+- pig join GWB PDF CDX list to known URL lists (then remove join)
+x iterate URL lists, hitting CDX API and saving response
+
+
+- (.edu, .ac.uk) domain with a tilde in the URL
+
+#http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf
+#http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf
+#http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf
+#http://www.comp.hkbu.edu.hk/~ymc/papers/conference/ijcnn03_710.pdf
+
+hk,edu,hkbu,comp)/~ymc/papers/conference/ijcnn03_710.pdf
+edu,stanford,www)/~johntayl/Papers/taylor2.pdf
+edu,nps,met)/~mtmontgo/papers/isabel_part2.pdf
+edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+jp,ak,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+co,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+
+NOT: com,corp,edu,,www)/~druzdzel/psfiles/ecai06.pdf
+
+- the words in URL: paper(s), pubs, research, publications, article, proceedings
+
+#http://personal.ee.surrey.ac.uk/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+#http://files.eric.ed.gov/fulltext/EJ798626.pdf
+#http://www.hbs.edu/research/pdf/10-108.pdf
+#http://www.unifr.ch/biochem/assets/files/albrecht/publications/Abraham06.pdf
+#http://www.cnbc.cmu.edu/cns/papers/Kassetal2005.pdf
+#http://www.macrothink.org/journal/index.php/ijhrs/article/download/5765/4663
+#http://www.pims.math.ca:80/science/2004/fpsac/Papers/Liskovets.pdf
+#http://www.risc.uni-linz.ac.at/publications/download/risc_3287/synasc_revised.pdf
+#http://softsys.cs.uoi.gr/dbglobe/publications/wi04.pdf
+#http://lexikos.journals.ac.za/pub/article/download/1048/564
+#http://www.siam.org/proceedings/analco/2007/anl07_029ecesaratto.pdf
+#http://www.cs.bris.ac.uk/Publications/Papers/2000249.pdf
+
+uk,ac,surrey,ee,personal)/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+gov,ed,eric,files)/fulltext/EJ798626.pdf
+edu,hbs,www)/research/pdf/10-108.pdf
+ch,unifr,www)/biochem/assets/files/albrecht/publications/Abraham06.pdf
+edu,cmu,cnbc,www)/cns/papers/Kassetal2005.pdf
+org,macrothink,www)/journal/index.php/ijhrs/article/download/5765/4663
+ca,math,pims,www)/science/2004/fpsac/Papers/Liskovets.pdf
+at,ac,uni-linz,risc,www)/publications/download/risc_3287/synasc_revised.pdf
+gr,uoi,cs,softsys)/dbglobe/publications/wi04.pdf
+za,ac,journals,lexikos)/pub/article/download/1048/564
+org,siam,www)/proceedings/analco/2007/anl07_029ecesaratto.pdf
+uk,ac,bris,cs,www)/Publications/Papers/2000249.pdf
+
+
+- words in domains: hal., eprint, research., journal
+
+#http://research.fit.edu/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+#http://ijs.sgmjournals.org:80/cgi/reprint/54/6/2217.pdf
+#http://eprints.ecs.soton.ac.uk/12020/1/mind-the-semantic-gap.pdf
+#http://eprint.uq.edu.au/archive/00004120/01/R103_Forrester_pp.pdf
+
+edu,fit,research)/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+org,sgmjournals,ijs)//cgi/reprint/54/6/2217.pdf
+uk,ac,soton,ecs,eprints)/12020/1/mind-the-semantic-gap.pdf
+au,edu,uq,eprint)/archive/00004120/01/R103_Forrester_pp.pdf
+
+- doi-like pattern in URL
+#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
+#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607
+
+org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1
+org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607
+
+- short list of hosts/domains?
+    *.core.ac.uk
+    *scielo*
+    *.redalyc.org
+
+#http://www.scielo.br:80/pdf/cagro/v33n1/v33n1a19.pdf
+#https://revistas.unal.edu.co/index.php/dyna/article/viewFile/51385/57892
+#http://rives.revues.org:80/pdf/449
+
+br,scielo,www)/pdf/cagro/v33n1/v33n1a19.pdf
+co,edu,unal,revistas)/index.php/dyna/article/viewFile/51385/57892
+org,revues,rives)/pdf/449
+
diff --git a/notes/url_pattern_heuristic_verification.txt b/notes/url_pattern_heuristic_verification.txt
new file mode 100644
index 0000000..7b35b88
--- /dev/null
+++ b/notes/url_pattern_heuristic_verification.txt
@@ -0,0 +1,52 @@
+
+## URL pattern regexing
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter/part*
+
+N  https://nsarchive2.gwu.edu//rus/text_files/Volkogonov/1918.10.13%20Speech%20by%20BK,%20to%20Red%20Army%20Soldiers,%20R13977.pdf  speech, russian
+
+edu tilde:
+    N  http://www.d.umn.edu/~kgilbert/ened3342-1/Field%20Interp%202/snow/CloudIDKey.pdf homework?
+    N  http://www.mech.utah.edu/~minor/BIOSKETCH-minor-october%202007.pdf CV
+    N  http://web.archive.org/web/20030724175610/http://www.ssc.wisc.edu:80/~sseverin/lect12f01.pdf slides
+    N  http://web.archive.org/web/20050117195001/http://www.csie.ntu.edu.tw:80/~b90013/DBhw7.pdf
+    Y  http://web.archive.org/web/20040220222413/http://homepages.uc.edu:80/~lukovib/aiaa_02_0857.pdf
+    Y  http://www.kki.yamanashi.ac.jp/~ohbuchi/online_pubs/IEEE_bigMM2015_Matsuda/BigMM_20150224b_web.pdf
+
+other words:
+    N  https://files.eric.ed.gov/fulltext/ED069848.pdf tech report?
+    N  http://istitutocomprensivopescara2.gov.it/attachments/article/164/griglia_osservativa_bes_terza_fascia.pdf table
+    M  https://jfjustice.net/userfiles/file/Research/Report%20of%20the%20Outreach%20Forums%20on%20the%20PIL%20Cases%20on%20Sexual%20Gender%20Based%20Violence.pdf report
+    M  http://www.iitk.ac.in/nicee/wcee/article/13_9035.pdf filler page? like a paper
+    Y  http://www.dtic.mil/dtic/tr/fulltext/u2/314095.pdf
+    Y  https://www.casact.org/pubs/proceed/proceed25/25400.pdf
+    Y  http://circres.ahajournals.org/content/circresaha/111/8/1002.full.pdf
+    Y  http://web.archive.org/web/20170313034332/http://thixomet.ru/UserFiles/File/Articles/1/2.CHM_2006_02-2.pdf
+    Y  http://www.redalyc.org/pdf/873/87313713019.pdf
+    Y  http://ukacc.group.shef.ac.uk/proceedings/control2004/Papers/213.pdf
+    Y  http://periodicos.uem.br:80/ojs/index.php/RbhrAnpuh/article/download/23988/13095
+    Y  http://w3.uqo.ca/photonique/papers/measurement.pdf
+    Y  http://web.archive.org/web/20140312150030/http://afms.org.au/proceedings/9/Griffiths.pdf
+    Y  http://www.hal.inserm.fr/file/index/docid/580194/filename/PROSTATE_SEGMENTATION_IN_HIFU_THERAPY.pdf
+    Y  http://journal.ipb.ac.id/index.php/jmht/article/download/6003/4658
+
+publications:
+    N  http://web.archive.org/web/20060527120026/http://www.merenkulkulaitos.fi:80/e/services/informationservices/publications/bulletin/avaa.php?id=336 treaty?
+    N  http://orbit.dtu.dk/en/publications/status-for-skarven-i-danmark(8ffaf614-387e-429f-9fd4-4677ee5016ae).pdf?nofollow=true&rendering=standard related to a paper?
+    N  http://community.trinity.nsw.edu.au/navbar/publications/docs/news/2_pn/2016/ps160103.pdf newsletter
+    N  http://web.archive.org/web/20170216001602/https://www.nass.usda.gov/Statistics_by_State/New_Mexico/Publications/Annual_Statistical_Bulletin/2005/03_05.pdf report
+    N  http://web.archive.org/web/20110109080048/http://www.ipria.org/publications/on-line-bulletins/austdev/AusDevsBulletin07.09.pdf
+    N  http://web.archive.org/web/20060930192249/http://www.nmmfa.org/publications/CensusTracts/35031940200.pdf
+    N  http://web.archive.org/web/20100621152841/http://psychologymatters.org/workforce/publications/01-doc-empl/table-11.pdf
+    N  http://www.dtce.org.pk/DTCE/Publications/PN2 final report-dr8-F.pdf
+    Y  https://www.frbatlanta.org/-/media/Documents/research/publications/wp/1995/wp9513.pdf
+    Y  http://irrec.ifas.ufl.edu/IRSWS/publications/Lu_ESPR_2011.pdf
+
+doi:
+    M  https://page-one.live.cf.public.springer.com/pdf/preview/10.1007/s11229-012-0117-8  paper, but only fragment (!?!?!)
+
+
+TODO:
+- drop "publications", "research", "pubs"
+- edu tilde is borderline... but keep it for now
+- black-list page-one.*
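The URL heuristics in these notes, with the TODO items applied, could be sketched in Python roughly as below. This is only an illustration and not part of the commit: the notes say the real filtering was done with pig regexes over SURT-form CDX lines, whereas this sketch works on plain URLs. The names `is_academic_host` and `looks_like_paper`, the exact regexes, and the reading of "(.edu, .ac.uk)" as "`edu` or `ac` near the top of the domain" are all assumptions.

```python
import re
from urllib.parse import urlsplit

# All patterns below are hypothetical approximations of the bullet points
# in the notes; they are not the regexes used in the actual pig job.

# Path words, after dropping "publications", "research", "pubs" per the TODO.
PATH_WORDS = re.compile(r"papers?|article|proceedings", re.IGNORECASE)
# Words in domains: "hal.", "research." as whole labels; "eprint", "journal" anywhere.
HOST_WORDS = re.compile(r"(^|\.)(hal|research)\.|eprint|journal")
# DOI-like path segment such as /10.1175/2008BAMS2370.1
DOI_LIKE = re.compile(r"/10\.\d{4,9}/\S+")
# Short list of hosts/domains: *.core.ac.uk, *scielo*, *.redalyc.org
HOST_ALLOW = re.compile(r"\.core\.ac\.uk$|scielo|\.redalyc\.org$")
# TODO item: black-list page-one.*
HOST_DENY = re.compile(r"^page-one\.")


def is_academic_host(host: str) -> bool:
    """Assumed reading of the edu rule: 'edu' as TLD (stanford.edu) or
    'edu'/'ac' as the second-level label (hkbu.edu.hk, surrey.ac.uk)."""
    labels = host.lower().split(".")
    if labels[-1] == "edu":
        return True
    return len(labels) >= 2 and labels[-2] in ("edu", "ac")


def looks_like_paper(url: str) -> bool:
    """Return True if any heuristic from the notes matches the URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lowercased, port stripped
    path = parts.path
    if HOST_DENY.search(host):
        return False
    if is_academic_host(host) and "~" in path:
        return True  # "edu tilde" rule, kept for now per the TODO
    return bool(
        PATH_WORDS.search(path)
        or HOST_WORDS.search(host)
        or DOI_LIKE.search(path)
        or HOST_ALLOW.search(host)
    )
```

Substring matches like `paper` are deliberately loose here, which matches the spirit of the notes (a cheap first-pass filter followed by manual Y/N/M spot checks) but will also let through false positives such as newsletters.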
