| author | Bryan Newbold <bnewbold@archive.org> | 2020-01-29 12:00:20 -0800 |
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2020-01-29 12:00:20 -0800 |
| commit | 41d25749b46d9ed0bfcb57de8c6cd5399ea54de7 (patch) | |
| tree | 0448e160b24f941e7e9434a211da6ee10d79c72d | |
| parent | 052080ae28aeb2635f4b8917d4772e745ddbbf7b (diff) | |
2020q1 fulltext ingest plans
-rw-r--r-- | proposals/20200129_pdf_ingest.md | 272 |
1 file changed, 272 insertions(+), 0 deletions(-)
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
new file mode 100644
index 0000000..9469217
--- /dev/null
+++ b/proposals/20200129_pdf_ingest.md
@@ -0,0 +1,272 @@

status: planned

2020q1 Fulltext PDF Ingest Plan
===================================

This document lays out a plan and tasks for a push on crawling and ingesting
more fulltext PDF content in early 2020.

The goal is to get the current generation of pipelines and matching tools
running smoothly by the end of March, when the Mellon phase 1 grant ends. As a
"soft" goal, would love to see over 25 million papers (works) with fulltext in
fatcat by that deadline as well.

This document is organized by conceptual approach, then by jobs to run and
coding tasks needing work.

There is a lot of work here!


## Broad OA Ingest By External Identifier

There are a few million papers in fatcat which:

1. have a DOI, arxiv id, or pubmed central id, which can be followed to a
   landing page or directly to a PDF
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat

As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.

Of these, I think there are broadly two categories. The first is just papers we
haven't tried directly crawling or ingesting yet at all; these should be easy
to crawl and ingest. The second category is papers from large publishers with
difficult to crawl landing pages (for example, Elsevier, IEEE, Wiley, ACM). The
latter category will probably not crawl with heritrix, and we are likely to be
rate-limited or resource constrained when using brozzler or SPNv2.

Coding Tasks:

- improve `fatcat_ingest.py` script to allow more granular slicing and limiting
  the number of requests enqueued per batch (eg, to allow daily partial
  big-publisher ingests in random order); see the sketch after this list. Allow
  dumping arxiv+pmcid ingest requests.
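A minimal sketch of the kind of slicing/limiting logic this could add. The
function and flag names here are hypothetical, not the script's current
interface, and the real script would publish to the Kafka ingest request topic
rather than printing to stdout:

```python
import argparse
import json
import random
import sys

def slice_requests(requests, limit=None, shuffle=True, seed=None):
    """Yield at most `limit` ingest requests, optionally in random order.

    Hypothetical helper: shuffling requires materializing the full request
    list in memory, which is fine for per-publisher daily batches.
    """
    batch = list(requests)
    if shuffle:
        random.Random(seed).shuffle(batch)
    if limit is not None:
        batch = batch[:limit]
    for req in batch:
        yield req

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--limit", type=int, default=None,
                        help="max requests to enqueue in this run")
    parser.add_argument("--no-shuffle", action="store_true",
                        help="keep input order instead of randomizing")
    args = parser.parse_args()

    # JSON ingest requests, one per line, on stdin
    requests = (json.loads(line) for line in sys.stdin if line.strip())
    for req in slice_requests(requests, limit=args.limit, shuffle=not args.no_shuffle):
        # the real script would publish to the Kafka ingest request topic here
        print(json.dumps(req))

if __name__ == "__main__":
    main()
```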
Actions:

- run broad Datacite DOI landing crawl with heritrix ("pre-ingest")
- after Datacite crawl completes, run arabesque and ingest any PDF hits
- run broad non-Datacite DOI landing crawl with heritrix. Use ingest tool to
  generate (or filter a dump), removing Datacite DOIs and large publishers
- after non-Datacite crawl completes, run entire ingest request set through in
  bulk mode
- start enqueuing large-publisher (hard to crawl) OA DOIs to ingest queue
  for SPNv2 crawling (blocking ingest tool improvement, and also SPNv2 health)
- start new PUBMEDCENTRAL and ARXIV slow-burn crawls (heritrix). Use
  updated ingest tool to generate requests.


## Large Seedlist Crawl Iterations

We have a bunch of large, high quality seedlists, most of which haven't been
updated or crawled in a year or two. Some use DOIs as identifiers, some use an
internal identifier. As a quick summary:

- unpaywall: currently 25 million DOIs (Crossref only?) with fulltext. URLs may
  be doi.org, publisher landing page, or direct PDF; may be published version,
  pre-print, or manuscript (indicated with a flag). Only crawled with heritrix;
  last crawl was Spring 2019. There is a new dump from late 2019 with a couple
  million new papers/URLs.
- microsoft academic (MAG): tens of millions of papers, hundreds of millions of
  URLs. Last crawled 2018 (?) from a 2016 dump. Getting a new full dump via
  Azure; new dump includes type info for each URL ("pdf", "landing page", etc).
  Uses MAG id for each URL, not DOI; hoping new dump has better MAG/DOI
  mappings. Expect a very large crawl (tens of millions of new URLs).
- CORE: can do direct crawling of PDFs from their site, as well as external
  URLs. They largely have pre-prints and IR content. Have not released a dump
  in a long time. Would expect a couple million new direct (core.ac.uk) URLs
  and fewer new web URLs (often overlap with other lists, like MAG)
- semantic scholar: they do regular dumps. Use SHA1 hash of PDF as identifier;
  it's the "best PDF of a group", so not always the PDF you crawl. Host many OA
  PDFs on their domain, very fast to crawl, as well as wide-web URLs. Their
  scope has increased dramatically in recent years due to MAG import; expect a
  lot of overlap there.

It is increasingly important to not duplicate crawl and ingest effort across
these large, overlapping seedlists.

Coding Tasks:
- transform scripts for all these seedlist sources to create ingest request
  lists (see the sketch after this list)
- sandcrawler ingest request persist script, which supports setting datetime
- fix HBase thrift gateway so url agnostic de-dupe can be updated
- finish ingest worker "skip existing" code path, which looks in sandcrawler-db
  to see if URL has already been processed (for efficiency)
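As a rough sketch of what one of these transform scripts could look like, using
unpaywall as the example source. The ingest request field names below are
assumptions and should be aligned with the actual persist script and ingest
workers:

```python
import json
import sys

def unpaywall_to_ingest_request(row):
    """Transform one row of an unpaywall dump (parsed JSON) into a
    sandcrawler-style ingest request dict.

    The output field names ('base_url', 'link_source', etc.) are assumptions
    about the ingest request schema for this sketch.
    """
    doi = row.get("doi")
    oa_location = row.get("best_oa_location") or {}
    url = oa_location.get("url_for_pdf") or oa_location.get("url")
    if not doi or not url:
        return None
    return {
        "ingest_type": "pdf",
        "base_url": url,
        "link_source": "unpaywall",
        "link_source_id": doi,
        "ext_ids": {"doi": doi},
        # published/pre-print/manuscript flag, if the dump provides one
        "release_stage": oa_location.get("version"),
    }

if __name__ == "__main__":
    # one JSON object per line on stdin; ingest requests as JSON lines on stdout
    for line in sys.stdin:
        if not line.strip():
            continue
        req = unpaywall_to_ingest_request(json.loads(line))
        if req:
            print(json.dumps(req))
```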
Actions:
- transform and persist all these old seedlists, with the URL datetime set to
  roughly when the URL was added to the upstream corpus
- transform arabesque output for all old crawls into ingest requests and run
  through the bulk ingest queue. expect GROBID to be skipped for all these, and
  for the *requests* not to be updated (SQL ON CONFLICT DO NOTHING). Will
  update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
  into ingest request table. use SQL to dump only the *new* URLs (not seen in
  previous dumps) using the created timestamp, outputting new bulk ingest
  request lists. if possible, de-dupe between these two. then start bulk
  heritrix crawls over these two long lists. Probably sharded over several
  machines. Could also run serially (first one, then the other, with
  ingest/de-dupe in between). Filter out usual large sites (core, s2, arxiv,
  pubmed, etc)
- CORE and Semantic Scholar direct crawls, of only new URLs on their domain
  (should not significantly conflict/dupe with other bulk crawls)

After this round of big crawls completes we could do iterated crawling of
smaller seedlists, re-visit URLs that failed to ingest with updated heritrix
configs or the SPNv2 ingest tool, etc.


## GROBID/glutton Matching of Known PDFs

Of the many PDFs in the sandcrawler CDX "working set", many were broadly
crawled or added via CDX heuristic. In other words, we don't have an identifier
from a seedlist. We previously ran a matching script in Hadoop that attempted
to link these to Crossref DOIs based on GROBID extracted metadata. We haven't
done this in a long time; in the meantime we have added many more such PDFs,
added lots of metadata to our matching set (eg, pubmed and arxiv in addition to
crossref), and have the new biblio-glutton tool for matching, which may work
better than our old conservative tool.

We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:

- have at least one known CDX row
- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows

An update match importer can take this output and create new file entities.
Then lookup the release and confirm the match to the GROBID metadata, as well
as any other quality checks, then import into fatcat. We have some existing
filter code we could use. The verification code should be refactored into a
reusable method.

It isn't clear to me how many new files/matches we would get from this, but
could do some test SQL queries to check. At least a million?

A related task is to update the glutton lookup table (elasticsearch index and
on-disk lookup tables) after more recent metadata imports (Datacite, etc).
Unsure if we should filter out records or improve matching so that we don't
match "header" (paper) metadata to non-paper records (like datasets), but still
allow *reference* matching (citations to datasets).

Coding Tasks:
- write SQL select function. Optionally, come up with a way to get multiple CDX
  rows in the output (sub-query?)
- biblio metadata verify match function (between GROBID metadata and existing
  fatcat release entity); rough sketch after this list
- updated match fatcat importer
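A rough sketch of what the verify match function could check. Thresholds and
field names (`year`, `release_year`, `contribs`, etc.) are placeholders for
this sketch; the real routine should be refactored out of the existing filter
code:

```python
def biblio_match_ok(grobid_meta, release, min_title_overlap=0.9):
    """Plausibility check between GROBID-extracted metadata and a fatcat
    release entity (both passed as plain dicts).

    Thresholds and field shapes here are placeholders, not the final filters.
    """
    def norm(s):
        return " ".join((s or "").lower().split())

    g_title = norm(grobid_meta.get("title"))
    r_title = norm(release.get("title"))
    if not g_title or not r_title:
        return False

    # cheap token-overlap similarity on titles (could swap in a real fuzzy metric)
    g_tokens, r_tokens = set(g_title.split()), set(r_title.split())
    overlap = len(g_tokens & r_tokens) / max(len(g_tokens | r_tokens), 1)
    if overlap < min_title_overlap:
        return False

    # if both sides have a year, they should be close
    try:
        g_year = int(grobid_meta.get("year") or 0)
        r_year = int(release.get("release_year") or 0)
    except (TypeError, ValueError):
        g_year = r_year = 0
    if g_year and r_year and abs(g_year - r_year) > 1:
        return False

    # require at least one shared author surname when both sides have authors
    g_surnames = {norm(a.get("surname")) for a in grobid_meta.get("authors") or [] if a.get("surname")}
    r_surnames = {norm(c.get("surname")) for c in release.get("contribs") or [] if c.get("surname")}
    if g_surnames and r_surnames and not (g_surnames & r_surnames):
        return False

    return True
```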
Actions:
- update `fatcat_file` sandcrawler table
- check how many PDFs this might amount to, both by uniq SHA1 and uniq
  `fatcat_release` matches
- do some manual random QA verification to check that this method results in
  quality content in fatcat
- run full updated import


## No-Identifier PDF New Release Import Pipeline

Previously, as part of longtail OA crawling work, I took a set of PDFs crawled
from OA journal homepages (where the publisher does not register DOIs), took
successful GROBID metadata, filtered for metadata quality, and imported about
1.5 million new release entities into fatcat.

There were a number of metadata issues with this import that we are still
cleaning up, eg:

- paper actually did have a DOI and should have been associated with existing
  fatcat release entity; these PDFs mostly came from repository sites which
  aggregated many PDFs, or due to unintentional outlink crawl configs
- no container linkage for any of these releases, making coverage tracking or
  reporting difficult
- many duplicates in same import set, due to near-identical PDFs (different by
  SHA-1, but same content and metadata), not merged or grouped in any way

The cleanup process is out of scope for this document, but we want to do
another round of similar imports, while avoiding these problems.

As a rough sketch of what this would look like (may need to iterate):

- filter to PDFs from longtail OA crawls (eg, based on WARC prefix, or URL domain)
- filter to PDFs not in fatcat already (in sandcrawler, then verify with lookup)
- filter to PDFs with successful GROBID extraction and *no* glutton match
- filter/clean GROBID extracted metadata (in python, not SQL), removing stubs
  or poor/partial extracts
- run a fuzzy biblio metadata match against fatcat elasticsearch; use match
  verification routine to check results
- if fuzzy match was a hit, consider importing directly as a matched file
  (especially if there are no existing files for the release)
- identify container for PDF from any of: domain pattern/domain; GROBID
  extracted ISSN or journal name; any other heuristic
- if all these filters pass and there was no fuzzy release match, and there was
  a container match, import a new release (and the file) into fatcat

Not entirely clear how to solve the near-duplicate issue. Randomize import
order (eg, sort by file sha1), import slowly with a single thread, and ensure
the elasticsearch re-indexing pipeline is running smoothly so the fuzzy match
will find recently-imported hits?

In theory we could use the biblio-glutton API to do the matching lookups, but I
think it will be almost as fast to hit our own elasticsearch index. Also the
glutton backing store is always likely to be out of date. In the future we may
even write something glutton-compatible that hits our index. Note that this is
also very similar to how citation matching could work, though it might be
derailing or over-engineering to come up with a single solution for both
applications at this time.

A potential issue here is that many of these papers are probably already in
another large but non-authoritative metadata corpus, like MAG, CORE, SHARE, or
BASE. Importing from those corpuses would want to go through the same fuzzy
matching to ensure we aren't creating duplicate releases, but further it would
be nice to be matching those external identifiers for any newly created
releases. One approach would be to bulk-import metadata from those sources
first. There are huge numbers of records in those corpuses, so we would need to
filter down by journal/container or OA flag first. Another would be to do fuzzy
matching when we *do* end up importing those corpuses, and update these records
with the external identifiers. This issue really gets at the crux of a bunch of
design issues and scaling problems with fatcat! But I think we should or need
to make progress on these longtail OA imports without perfectly solving these
larger issues.

Details/Questions:
- what about non-DOI metadata sources like MAG, CORE, SHARE, BASE? Should we
  import those first, or do fuzzy matching against those?
- use GROBID language detection and copy results to newly created releases
- if single-threaded, could cache "recently matched/imported releases" locally
  to prevent double-importing
- cache container matching locally

Coding Tasks:
- write SQL select statement
- iterate on GROBID metadata cleaning/transform/filter (have existing code for
  this somewhere)
- implement a "fuzzy match" routine that takes biblio metadata (eg, GROBID
  extracted), looks in fatcat elasticsearch for a match (sketch of the lookup
  half after this list)
- implement "fuzzy container match" routine, using as much available info as
  possible. Could use chocula sqlite locally, or hit elasticsearch container
  endpoint
- update GROBID importer to use fuzzy match and other checks
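A sketch of the elasticsearch lookup half of that fuzzy match routine. The
default host, index name, and field names (`title`, `contrib_names`) are
assumptions for illustration; candidates returned here would still need to
pass the match verification routine:

```python
import requests

def fuzzy_release_match(grobid_meta, es_url="https://search.fatcat.wiki",
                        index="fatcat_release", max_hits=5):
    """Look up candidate fatcat releases in elasticsearch for a chunk of
    GROBID-extracted metadata.

    Host, index, and field names are sketch-level assumptions.
    """
    title = (grobid_meta.get("title") or "").strip()
    if not title:
        return []
    query = {
        "size": max_hits,
        "query": {
            "bool": {
                # fuzzy-ish full-text match on the title
                "must": [{"match": {"title": {"query": title, "operator": "and"}}}],
            },
        },
    }
    authors = grobid_meta.get("authors") or []
    if authors and authors[0].get("surname"):
        # soft boost on first author surname rather than a hard filter
        query["query"]["bool"]["should"] = [
            {"match": {"contrib_names": authors[0]["surname"]}}
        ]

    resp = requests.post(f"{es_url}/{index}/_search", json=query, timeout=30)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```

Hitting the REST `_search` endpoint directly keeps this decoupled from any
particular elasticsearch client library, and the same query shape could later
back a glutton-compatible service over our own index.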
Actions:
- run SQL select and estimate bounds on number of new releases created
- do some manual randomized QA runs to ensure this pipeline is importing
  quality content in fatcat
- run a full batch import


## Non-authoritative Metadata and Fulltext from Aggregators

This is not fully thought through, but at some point we will probably add one
or more large external aggregator metadata sources (MAG, Semantic Scholar,
CORE, SHARE, BASE), and bulk import both metadata records and fulltext at the
same time. The assumption is that those sources are doing the same fuzzy entity
merging/de-dupe and crawling we are doing, but they have already done it
(probably with more resources) and created stable identifiers that we can
include.

A major blocker for most such imports is metadata licensing (fatcat is CC0,
others have restrictions). This may not be the case for CORE and SHARE though.