author | Bryan Newbold <bnewbold@archive.org> | 2018-05-07 22:10:51 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-05-07 22:11:18 -0700 |
commit | d1401444dbfb515e62094f873d520a23ccbc29d9 (patch) | |
tree | 418a21b93261230b006127107b124e5c12236ab7 /pig/tests/files/papers_url_doi.cdx | |
parent | 81d2f6290fff487f0f49b109227443c0f8a7aedb (diff) | |
pig script to filter GWB CDX by SURT regexes
Diffstat (limited to 'pig/tests/files/papers_url_doi.cdx')
-rw-r--r-- | pig/tests/files/papers_url_doi.cdx | 7 |
1 files changed, 7 insertions, 0 deletions
diff --git a/pig/tests/files/papers_url_doi.cdx b/pig/tests/files/papers_url_doi.cdx
new file mode 100644
index 0000000..1ad5792
--- /dev/null
+++ b/pig/tests/files/papers_url_doi.cdx
@@ -0,0 +1,7 @@
+#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
+#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607
+
+# should match 2:
+
+org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1 20170706005950 http://mit.edu/file.pdf application/pdf 200 MQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
+org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607 20170706005950 http://mit.edu/file.pdf application/pdf 200 MQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
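The fixture above says both CDX rows "should match". The Pig script and its regexes are not part of this diff, so the following is only a rough Python sketch of the same idea: take the SURT key (the first space-separated field of each CDX line) and test it against a pattern. The pattern `DOI_PDF_SURT` and the helper `matching_cdx_lines` are hypothetical names made up for illustration, not the regexes the script actually uses.

```python
import re

# Hypothetical pattern (an assumption, not taken from the Pig script):
# any SURT host, followed by a /doi/pdf/10.<registrant>/<suffix> path.
DOI_PDF_SURT = re.compile(r"^[a-z0-9,\-]+\)/doi/pdf/10\.\d+/\S+$")


def matching_cdx_lines(lines):
    """Return the CDX lines whose SURT key matches DOI_PDF_SURT."""
    matched = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines, as they appear in the fixture.
        if not line or line.startswith("#"):
            continue
        # The SURT key is the first space-separated field of a CDX line.
        surt_key = line.split(" ", 1)[0]
        if DOI_PDF_SURT.match(surt_key):
            matched.append(line)
    return matched


# Contents of the fixture added by this commit.
FIXTURE = """\
#http://journals.ametsoc.org/doi/pdf/10.1175/2008BAMS2370.1
#http://www.nejm.org:80/doi/pdf/10.1056/NEJMoa1013607

# should match 2:

org,ametsoc,journals)/doi/pdf/10.1175/2008BAMS2370.1 20170706005950 http://mit.edu/file.pdf application/pdf 200 MQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
org,nejm,www)/doi/pdf/10.1056/NEJMoa1013607 20170706005950 http://mit.edu/file.pdf application/pdf 200 MQHD36X5MNZPWFNMD5LFOYZSFGCHUN3V - - 123 456 CRAWL/CRAWL.warc.gz
"""

if __name__ == "__main__":
    print(len(matching_cdx_lines(FIXTURE.splitlines())))
```

Under this assumed pattern both data rows are kept and the comment lines are skipped, which agrees with the "should match 2" note in the fixture.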