| author | Bryan Newbold <bnewbold@archive.org> | 2020-01-29 12:00:20 -0800 |
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2020-01-29 12:00:20 -0800 |
| commit | 41d25749b46d9ed0bfcb57de8c6cd5399ea54de7 (patch) | |
| tree | 0448e160b24f941e7e9434a211da6ee10d79c72d | |
| parent | 052080ae28aeb2635f4b8917d4772e745ddbbf7b (diff) | |
2020q1 fulltext ingest plans
-rw-r--r-- | proposals/20200129_pdf_ingest.md | 272 |
1 file changed, 272 insertions(+), 0 deletions(-)
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
new file mode 100644
index 0000000..9469217
--- /dev/null
+++ b/proposals/20200129_pdf_ingest.md
@@ -0,0 +1,272 @@

status: planned

2020q1 Fulltext PDF Ingest Plan
===================================

This document lays out a plan and tasks for a push on crawling and ingesting
more fulltext PDF content in early 2020.

The goal is to get the current generation of pipelines and matching tools
running smoothly by the end of March, when the Mellon phase 1 grant ends. As a
"soft" goal, would love to see over 25 million papers (works) with fulltext in
fatcat by that deadline as well.

This document is organized by conceptual approach, then by jobs to run and
coding tasks needing work.

There is a lot of work here!


## Broad OA Ingest By External Identifier

There are a few million papers in fatcat which:

1. have a DOI, arxiv id, or pubmed central id, which can be followed to a
   landing page or directly to a PDF
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat

As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.

Of these, I think there are broadly two categories. The first is just papers we
haven't tried directly crawling or ingesting yet at all; these should be easy
to crawl and ingest. The second category is papers from large publishers with
difficult to crawl landing pages (for example, Elsevier, IEEE, Wiley, ACM). The
latter category will probably not crawl with heritrix, and we are likely to be
rate-limited or resource constrained when using brozzler or SPNv2.

Coding Tasks:

- improve `fatcat_ingest.py` script to allow more granular slicing and limiting
  the number of requests enqueued per batch (eg, to allow daily partial
  big-publisher ingests in random order); see the sketch after this list. Allow
  dumping arxiv+pmcid ingest requests.
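A minimal sketch of the kind of slicing/limiting logic this could add. The
function and flag names here are hypothetical, not the script's current
interface, and the real script would publish to the Kafka ingest request topic
rather than printing to stdout:

```python
import argparse
import json
import random
import sys

def slice_requests(requests, limit=None, shuffle=True, seed=None):
    """Yield at most `limit` ingest requests, optionally in random order.

    Hypothetical helper: shuffling requires materializing the full request
    list in memory, which is fine for per-publisher daily batches.
    """
    batch = list(requests)
    if shuffle:
        random.Random(seed).shuffle(batch)
    if limit is not None:
        batch = batch[:limit]
    for req in batch:
        yield req

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--limit", type=int, default=None,
                        help="max requests to enqueue in this run")
    parser.add_argument("--no-shuffle", action="store_true",
                        help="keep input order instead of randomizing")
    args = parser.parse_args()

    # JSON ingest requests, one per line, on stdin
    requests = (json.loads(line) for line in sys.stdin if line.strip())
    for req in slice_requests(requests, limit=args.limit, shuffle=not args.no_shuffle):
        # the real script would publish to the Kafka ingest request topic here
        print(json.dumps(req))

if __name__ == "__main__":
    main()
```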
Actions:

- run broad Datacite DOI landing crawl with heritrix ("pre-ingest")
- after Datacite crawl completes, run arabesque and ingest any PDF hits
- run broad non-Datacite DOI landing crawl with heritrix. Use ingest tool to
  generate (or filter a dump), removing Datacite DOIs and large publishers
- after non-Datacite crawl completes, run entire ingest request set through in
  bulk mode
- start enqueuing large-publisher (hard to crawl) OA DOIs to ingest queue
  for SPNv2 crawling (blocking ingest tool improvement, and also SPNv2 health)
- start new PUBMEDCENTRAL and ARXIV slow-burn crawls (heritrix). Use
  updated ingest tool to generate requests.


## Large Seedlist Crawl Iterations

We have a bunch of large, high quality seedlists, most of which haven't been
updated or crawled in a year or two. Some use DOIs as identifiers, some use an
internal identifier. As a quick summary:

- unpaywall: currently 25 million DOIs (Crossref only?) with fulltext. URLs may
  be doi.org, publisher landing page, or direct PDF; may be published version,
  pre-print, or manuscript (indicated with a flag). Only crawled with heritrix;
  last crawl was Spring 2019. There is a new dump from late 2019 with a couple
  million new papers/URLs.
- microsoft academic (MAG): tens of millions of papers, hundreds of millions of
  URLs. Last crawled 2018 (?) from a 2016 dump. Getting a new full dump via
  Azure; new dump includes type info for each URL ("pdf", "landing page", etc).
  Uses MAG id for each URL, not DOI; hoping new dump has better MAG/DOI
  mappings. Expect a very large crawl (tens of millions of new URLs).
- CORE: can do direct crawling of PDFs from their site, as well as external
  URLs. They largely have pre-prints and IR content. Have not released a dump
  in a long time. Would expect a couple million new direct (core.ac.uk) URLs
  and fewer new web URLs (often overlap with other lists, like MAG)
- semantic scholar: they do regular dumps. Use SHA1 hash of PDF as identifier;
  it's the "best PDF of a group", so not always the PDF you crawl. Host many OA
  PDFs on their domain, very fast to crawl, as well as wide-web URLs. Their
  scope has increased dramatically in recent years due to MAG import; expect a
  lot of overlap there.

It is increasingly important to not duplicate crawl and ingest effort across
these large, overlapping seedlists.

Coding Tasks:
- transform scripts for all these seedlist sources to create ingest request
  lists (see the sketch after this list)
- sandcrawler ingest request persist script, which supports setting datetime
- fix HBase thrift gateway so url agnostic de-dupe can be updated
- finish ingest worker "skip existing" code path, which looks in sandcrawler-db
  to see if URL has already been processed (for efficiency)
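As a rough sketch of what one of these transform scripts could look like, using
unpaywall as the example source. The ingest request field names below are
assumptions and should be aligned with the actual persist script and ingest
workers:

```python
import json
import sys

def unpaywall_to_ingest_request(row):
    """Transform one row of an unpaywall dump (parsed JSON) into a
    sandcrawler-style ingest request dict.

    The output field names ('base_url', 'link_source', etc.) are assumptions
    about the ingest request schema for this sketch.
    """
    doi = row.get("doi")
    oa_location = row.get("best_oa_location") or {}
    url = oa_location.get("url_for_pdf") or oa_location.get("url")
    if not doi or not url:
        return None
    return {
        "ingest_type": "pdf",
        "base_url": url,
        "link_source": "unpaywall",
        "link_source_id": doi,
        "ext_ids": {"doi": doi},
        # published/pre-print/manuscript flag, if the dump provides one
        "release_stage": oa_location.get("version"),
    }

if __name__ == "__main__":
    # one JSON object per line on stdin; ingest requests as JSON lines on stdout
    for line in sys.stdin:
        if not line.strip():
            continue
        req = unpaywall_to_ingest_request(json.loads(line))
        if req:
            print(json.dumps(req))
```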
Actions:
- transform and persist all these old seedlists, with the URL datetime set to
  roughly when the URL was added to the upstream corpus
- transform arabesque output for all old crawls into ingest requests and run
  through the bulk ingest queue. expect GROBID to be skipped for all these, and
  for the *requests* not to be updated (SQL ON CONFLICT DO NOTHING). Will
  update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
  into ingest request table. use SQL to dump only the *new* URLs (not seen in
  previous dumps) using the created timestamp, outputting new bulk ingest
  request lists. if possible, de-dupe between these two. then start bulk
  heritrix crawls over these two long lists. Probably sharded over several
  machines. Could also run serially (first one, then the other, with
  ingest/de-dupe in between). Filter out usual large sites (core, s2, arxiv,
  pubmed, etc)
- CORE and Semantic Scholar direct crawls, of only new URLs on their domain
  (should not significantly conflict/dupe with other bulk crawls)

After this round of big crawls completes we could do iterated crawling of
smaller seedlists, re-visit URLs that failed to ingest with updated heritrix
configs or the SPNv2 ingest tool, etc.


## GROBID/glutton Matching of Known PDFs

Of the many PDFs in the sandcrawler CDX "working set", many were broadly
crawled or added via CDX heuristic. In other words, we don't have an identifier
from a seedlist. We previously ran a matching script in Hadoop that attempted
to link these to Crossref DOIs based on GROBID extracted metadata. We haven't
done this in a long time; in the meantime we have added many more such PDFs,
added lots of metadata to our matching set (eg, pubmed and arxiv in addition to
crossref), and have the new biblio-glutton tool for matching, which may work
better than our old conservative tool.

We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:

- have at least one known CDX row
- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows

An update match importer can take this output and create new file entities.
Then lookup the release and confirm the match to the GROBID metadata, as well
as any other quality checks, then import into fatcat. We have some existing
filter code we could use. The verification code should be refactored into a
reusable method.

It isn't clear to me how many new files/matches we would get from this, but
could do some test SQL queries to check. At least a million?

A related task is to update the glutton lookup table (elasticsearch index and
on-disk lookup tables) after more recent metadata imports (Datacite, etc).
Unsure if we should filter out records or improve matching so that we don't
match "header" (paper) metadata to non-paper records (like datasets), but still
allow *reference* matching (citations to datasets).

Coding Tasks:
- write SQL select function. Optionally, come up with a way to get multiple CDX
  rows in the output (sub-query?)
- biblio metadata verify match function (between GROBID metadata and existing
  fatcat release entity); rough sketch after this list
- updated match fatcat importer
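A rough sketch of what the verify match function could check. Thresholds and
field names (`year`, `release_year`, `contribs`, etc.) are placeholders for
this sketch; the real routine should be refactored out of the existing filter
code:

```python
def biblio_match_ok(grobid_meta, release, min_title_overlap=0.9):
    """Plausibility check between GROBID-extracted metadata and a fatcat
    release entity (both passed as plain dicts).

    Thresholds and field shapes here are placeholders, not the final filters.
    """
    def norm(s):
        return " ".join((s or "").lower().split())

    g_title = norm(grobid_meta.get("title"))
    r_title = norm(release.get("title"))
    if not g_title or not r_title:
        return False

    # cheap token-overlap similarity on titles (could swap in a real fuzzy metric)
    g_tokens, r_tokens = set(g_title.split()), set(r_title.split())
    overlap = len(g_tokens & r_tokens) / max(len(g_tokens | r_tokens), 1)
    if overlap < min_title_overlap:
        return False

    # if both sides have a year, they should be close
    try:
        g_year = int(grobid_meta.get("year") or 0)
        r_year = int(release.get("release_year") or 0)
    except (TypeError, ValueError):
        g_year = r_year = 0
    if g_year and r_year and abs(g_year - r_year) > 1:
        return False

    # require at least one shared author surname when both sides have authors
    g_surnames = {norm(a.get("surname")) for a in grobid_meta.get("authors") or [] if a.get("surname")}
    r_surnames = {norm(c.get("surname")) for c in release.get("contribs") or [] if c.get("surname")}
    if g_surnames and r_surnames and not (g_surnames & r_surnames):
        return False

    return True
```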
Actions:
- update `fatcat_file` sandcrawler table
- check how many PDFs this might amount to, both by uniq SHA1 and uniq
  `fatcat_release` matches
- do some manual random QA verification to check that this method results in
  quality content in fatcat
- run full updated import


## No-Identifier PDF New Release Import Pipeline

Previously, as part of longtail OA crawling work, I took a set of PDFs crawled
from OA journal homepages (where the publisher does not register DOIs), took
successful GROBID metadata, filtered for metadata quality, and imported about
1.5 million new release entities into fatcat.

There were a number of metadata issues with this import that we are still
cleaning up, eg:

- paper actually did have a DOI and should have been associated with existing
  fatcat release entity; these PDFs mostly came from repository sites which
  aggregated many PDFs, or due to unintentional outlink crawl configs
- no container linkage for any of these releases, making coverage tracking or
  reporting difficult
- many duplicates in same import set, due to near-identical PDFs (different by
  SHA-1, but same content and metadata), not merged or grouped in any way

The cleanup process is out of scope for this document, but we want to do
another round of similar imports, while avoiding these problems.

As a rough sketch of what this would look like (may need to iterate):

- filter to PDFs from longtail OA crawls (eg, based on WARC prefix, or URL domain)
- filter to PDFs not in fatcat already (in sandcrawler, then verify with lookup)
- filter to PDFs with successful GROBID extraction and *no* glutton match
- filter/clean GROBID extracted metadata (in python, not SQL), removing stubs
  or poor/partial extracts
- run a fuzzy biblio metadata match against fatcat elasticsearch; use match
  verification routine to check results
- if fuzzy match was a hit, consider importing directly as a matched file
  (especially if there are no existing files for the release)
- identify container for PDF from any of: domain pattern/domain; GROBID
  extracted ISSN or journal name; any other heuristic
- if all these filters pass and there was no fuzzy release match, and there was
  a container match, import a new release (and the file) into fatcat

Not entirely clear how to solve the near-duplicate issue. Randomize import
order (eg, sort by file sha1), import slowly with a single thread, and ensure
the elasticsearch re-indexing pipeline is running smoothly so the fuzzy match
will find recently-imported hits?

In theory we could use the biblio-glutton API to do the matching lookups, but I
think it will be almost as fast to hit our own elasticsearch index. Also the
glutton backing store is always likely to be out of date. In the future we may
even write something glutton-compatible that hits our index. Note that this is
also very similar to how citation matching could work, though it might be
derailing or over-engineering to come up with a single solution for both
applications at this time.

A potential issue here is that many of these papers are probably already in
another large but non-authoritative metadata corpus, like MAG, CORE, SHARE, or
BASE. Importing from those corpuses would want to go through the same fuzzy
matching to ensure we aren't creating duplicate releases, but further it would
be nice to be matching those external identifiers for any newly created
releases. One approach would be to bulk-import metadata from those sources
first. There are huge numbers of records in those corpuses, so we would need to
filter down by journal/container or OA flag first. Another would be to do fuzzy
matching when we *do* end up importing those corpuses, and update these records
with the external identifiers. This issue really gets at the crux of a bunch of
design issues and scaling problems with fatcat! But I think we should or need
to make progress on these longtail OA imports without perfectly solving these
larger issues.

Details/Questions:
- what about non-DOI metadata sources like MAG, CORE, SHARE, BASE? Should we
  import those first, or do fuzzy matching against those?
- use GROBID language detection and copy results to newly created releases
- if single-threaded, could cache "recently matched/imported releases" locally
  to prevent double-importing
- cache container matching locally

Coding Tasks:
- write SQL select statement
- iterate on GROBID metadata cleaning/transform/filter (have existing code for
  this somewhere)
- implement a "fuzzy match" routine that takes biblio metadata (eg, GROBID
  extracted), looks in fatcat elasticsearch for a match (sketch of the lookup
  half after this list)
- implement "fuzzy container match" routine, using as much available info as
  possible. Could use chocula sqlite locally, or hit elasticsearch container
  endpoint
- update GROBID importer to use fuzzy match and other checks
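A sketch of the elasticsearch lookup half of that fuzzy match routine. The
default host, index name, and field names (`title`, `contrib_names`) are
assumptions for illustration; candidates returned here would still need to
pass the match verification routine:

```python
import requests

def fuzzy_release_match(grobid_meta, es_url="https://search.fatcat.wiki",
                        index="fatcat_release", max_hits=5):
    """Look up candidate fatcat releases in elasticsearch for a chunk of
    GROBID-extracted metadata.

    Host, index, and field names are sketch-level assumptions.
    """
    title = (grobid_meta.get("title") or "").strip()
    if not title:
        return []
    query = {
        "size": max_hits,
        "query": {
            "bool": {
                # fuzzy-ish full-text match on the title
                "must": [{"match": {"title": {"query": title, "operator": "and"}}}],
            },
        },
    }
    authors = grobid_meta.get("authors") or []
    if authors and authors[0].get("surname"):
        # soft boost on first author surname rather than a hard filter
        query["query"]["bool"]["should"] = [
            {"match": {"contrib_names": authors[0]["surname"]}}
        ]

    resp = requests.post(f"{es_url}/{index}/_search", json=query, timeout=30)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```

Hitting the REST `_search` endpoint directly keeps this decoupled from any
particular elasticsearch client library, and the same query shape could later
back a glutton-compatible service over our own index.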
Actions:
- run SQL select and estimate bounds on number of new releases created
- do some manual randomized QA runs to ensure this pipeline is importing
  quality content in fatcat
- run a full batch import


## Non-authoritative Metadata and Fulltext from Aggregators

This is not fully thought through, but at some point we will probably add one
or more large external aggregator metadata sources (MAG, Semantic Scholar,
CORE, SHARE, BASE), and bulk import both metadata records and fulltext at the
same time. The assumption is that those sources are doing the same fuzzy entity
merging/de-dupe and crawling we are doing, but they have already done it
(probably with more resources) and created stable identifiers that we can
include.

A major blocker for most such imports is metadata licensing (fatcat is CC0,
others have restrictions). This may not be the case for CORE and SHARE though.