aboutsummaryrefslogtreecommitdiffstats
path: root/notes/url_pattern_heuristic_backfill.txt
diff options
context:
space:
mode:
Diffstat (limited to 'notes/url_pattern_heuristic_backfill.txt')
-rw-r--r--notes/url_pattern_heuristic_backfill.txt104
1 files changed, 104 insertions, 0 deletions
diff --git a/notes/url_pattern_heuristic_backfill.txt b/notes/url_pattern_heuristic_backfill.txt
new file mode 100644
index 0000000..8e422f5
--- /dev/null
+++ b/notes/url_pattern_heuristic_backfill.txt
@@ -0,0 +1,104 @@
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter
+ 21,434,960
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-msag
+ 13,637,948
+
+/user/bnewbold/pdfs/gwb-pdf-20171227034923-join-unpaywall-20180329
+ 3,393,658
+
+#########
+
+Goal: backfill a bunch of existing content into the HBase table. Bonus for
+being re-runable in the future.
+
+Source data:
+- GWB PDF CDX list
+- archive.org JSTOR files (?)
+- arxiv.org bulk files (?)
+- large URL lists (MSAG, etc)
+
+Methods:
+- pig filter GWB PDF CDX list based on regexes
+- pig join GWB PDF CDX list to known URL lists (then remove join)
+x iterate URL lists, hitting CDX API and saving response
+
+
+- (.edu, .ac.uk) domain with a tilde in the URL
+
+#http://www.stanford.edu:80/~johntayl/Papers/taylor2.pdf
+#http://met.nps.edu/~mtmontgo/papers/isabel_part2.pdf
+#http://www.pitt.edu:80/~druzdzel/psfiles/ecai06.pdf
+#http://www.comp.hkbu.edu.hk/~ymc/papers/conference/ijcnn03_710.pdf
+
+hk,edu,hkbu,comp)/~ymc/papers/conference/ijcnn03_710.pdf
+edu,stanford,www)/~johntayl/Papers/taylor2.pdf
+edu,nps,met)/~mtmontgo/papers/isabel_part2.pdf
+edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+jp,ak,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+co,edu,pitt,www)/~druzdzel/psfiles/ecai06.pdf
+
+NOT: com,corp,edu,,www)/~druzdzel/psfiles/ecai06.pdf
+
+- the words in URL: paper(s), pubs, research, publications, article, proceedings
+
+#http://personal.ee.surrey.ac.uk/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+#http://files.eric.ed.gov/fulltext/EJ798626.pdf
+#http://www.hbs.edu/research/pdf/10-108.pdf
+#http://www.unifr.ch/biochem/assets/files/albrecht/publications/Abraham06.pdf
+#http://www.cnbc.cmu.edu/cns/papers/Kassetal2005.pdf
+#http://www.macrothink.org/journal/index.php/ijhrs/article/download/5765/4663
+#http://www.pims.math.ca:80/science/2004/fpsac/Papers/Liskovets.pdf
+#http://www.risc.uni-linz.ac.at/publications/download/risc_3287/synasc_revised.pdf
+#http://softsys.cs.uoi.gr/dbglobe/publications/wi04.pdf
+#http://lexikos.journals.ac.za/pub/article/download/1048/564
+#http://www.siam.org/proceedings/analco/2007/anl07_029ecesaratto.pdf
+#http://www.cs.bris.ac.uk/Publications/Papers/2000249.pdf
+
+uk,ac,surrey,ee,personal)/Personal/R.Bowden/publications/2012/Gilbert_ACCV_2012pp.pdf
+gov,ed,eric,files)/fulltext/EJ798626.pdf
+edu,hbs,www)/research/pdf/10-108.pdf
+ch,unifr,www)/biochem/assets/files/albrecht/publications/Abraham06.pdf
+edu,cmu,cnbc,www)/cns/papers/Kassetal2005.pdf
+org,macrothink,www)/journal/index.php/ijhrs/article/download/5765/4663
+ca,math,pims,www)/science/2004/fpsac/Papers/Liskovets.pdf
+at,ac,uni-linz,risc,www)/publications/download/risc_3287/synasc_revised.pdf
+gr,uoi,cs,softsys)/dbglobe/publications/wi04.pdf
+za,ac,journals,lexikos)/pub/article/download/1048/564
+org,siam,www)/proceedings/analco/2007/anl07_029ecesaratto.pdf
+uk,ac,bris,cs,www)/Publications/Papers/2000249.pdf
+
+
+- words in domains: hal., eprint, research., journal
+
+#http://research.fit.edu/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+#http://ijs.sgmjournals.org:80/cgi/reprint/54/6/2217.pdf
+#http://eprints.ecs.soton.ac.uk/12020/1/mind-the-semantic-gap.pdf
+#http://eprint.uq.edu.au/archive/00004120/01/R103_Forrester_pp.pdf
+
+edu,fit,research)/sealevelriselibrary/documents/doc_mgr/448/Florida_Keys_Low_Island_Biodiversity_&_SLR_-_Ross_et_al_2009.pdf
+org,sgmjournals,ijs)//cgi/reprint/54/6/2217.pdf
+uk,ac,soton,ecs,eprints)/12020/1/mind-the-semantic-gap.pdf
+au,edu,uq,eprint)/archive/00004120/01/R103_Forrester_pp.pdf
+
+- doi-like pattern in URL
+#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
+#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607
+
+org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1
+org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607
+
+- short list of hosts/domains?
+ *.core.ac.uk
+ *scielo*
+ *.redalyc.org
+
+#http://www.scielo.br:80/pdf/cagro/v33n1/v33n1a19.pdf
+#https://revistas.unal.edu.co/index.php/dyna/article/viewFile/51385/57892
+#http://rives.revues.org:80/pdf/449
+
+br,scielo,www)/pdf/cagro/v33n1/v33n1a19.pdf
+co,edu,unal,revistas)/index.php/dyna/article/viewFile/51385/57892
+org,revues,rives)/pdf/449
+