status: brainstorm

This is one design proposal for how to scale up citation graph potential-match
generation, as well as for doing fuzzy matching of other types at scale (eg,
self-matches to group works within fatcat). This is not a proposal that we have
to do things this way; it is just one possible option.

This proposal makes the following assumptions:

- 100-200 million "target" works
- 1-3 billion structured references to match
- references mostly coming from GROBID, Crossref, and PUBMED
- paper-like works (could include books; references to web pages etc. to be
  handled separately)
- static snapshots are sufficient
- the python fuzzy match code works and is performant enough to verify matches
  within small buckets/pools

Additional major "source" and "target" works to think about:

- wikipedia articles as special "source" works. these should work fine with
  this system, and most wikipedia paper references already have persistent
  identifiers
- webpages as special "target" works, where we would want to do a CDX lookup or
  something. normalize the URL (SURT?) to generate a reverse index ("all works
  citing a given URL")
- openlibrary books as "target" works. these should also work fine with this
  system

The high-level proposal is:

- transform and normalize basic metadata for both citations and targets (eg,
  sufficient fields for fuzzy verification), and store only this minimal subset
- as a first pass, if external identifiers exist in the "source" reference set,
  do lookups against the fatcat API and verify the match on any resulting hits.
  remove these "source" matches from the next stages.
- generate one or more fixed-size hash identifiers (~64 bit) for each citation
  and target, and use these as a key to bucket works. these would not be hashes
  over the entire record metadata, only small subsets
- sort the "target" works into an index for self-grouping, lookups, and
  iteration. each record may appear multiple times if there are multiple hash
  types
- sort the "source" references into an index and run a merge-sort on bucket
  keys against the "target" index to generate candidate match buckets
- run the python fuzzy match code against the candidate buckets, outputting a
  status for each reference input and a list of all strong matches
- re-sort successful matches and index by both source and target identifiers as
  the output citation graph

## Record Schema

Imagining a subset of fatcat release entity fields, perhaps stored in a binary
format like protobuf for size efficiency, or in a SQL table or columnar
datastore. If we used JSON, we would want to use short key names to reduce key
storage overhead. Total data set size will impact performance because of disk
I/O, caching, etc. I think this may hold even with on-disk compression.

We would do a pass of normalization ahead of time (eg, aggressive string
cleaning) so we don't need to repeat it for every fuzzy-verify attempt.

The metadata subset might include:

- `title`
- `subtitle`
- `authors` (surnames? structured/full names? original/alternate/aliases?)
- `year`
- `container_name` (and abbreviation?)
- `volume`, `issue`, `pages`
- `doi`, `pmid`, `arxiv_id` (only ext ids used in citations?)
- `release_ident` (source or target)
- `work_ident` (source or target)
- `release_stage` (target only?)
- `release_type` (target only?)

Plus any other fields helpful in fuzzy verification.

These records can be transformed into python release entities with partial
metadata, then passed to the existing fuzzy verification code path.
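As a loose illustration of the record subset above, here is a minimal Python
sketch using a dataclass. The class name, types, and defaults are assumptions
for illustration only; the actual encoding could just as well be protobuf, SQL,
or a columnar format as discussed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchRecord:
    """Minimal, pre-normalized metadata subset for fuzzy matching.

    Sketch only: field names follow the list above, but types and which
    fields are source-only vs. target-only are not decided.
    """
    title: str
    authors: List[str] = field(default_factory=list)  # normalized surnames?
    subtitle: Optional[str] = None
    year: Optional[int] = None
    container_name: Optional[str] = None
    volume: Optional[str] = None
    issue: Optional[str] = None
    pages: Optional[str] = None
    doi: Optional[str] = None
    pmid: Optional[str] = None
    arxiv_id: Optional[str] = None
    release_ident: Optional[str] = None   # source or target release
    work_ident: Optional[str] = None      # source or target work
    release_stage: Optional[str] = None   # target only?
    release_type: Optional[str] = None    # target only?
```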
## Hashing Schemes

Hashing schemes could be flexible. Multiple schemes could be used at the same
time, and we can change schemes over time. Each record could be transformed
into one or more hashes. Ideally we could use the top couple of bits of the
hash to indicate the hash type.

An initial proposal would be to use the first and last N tokens of just the
title. In this scheme we would normalize and tokenize the title and remove a
few stopwords (eg, tokens sometimes omitted in citation or indexing). If the
title is shorter than 3 tokens, pad with blank tokens. Perhaps also filter here
against inordinately popular titles or other bad data. Then use some fast
non-cryptographic hash with fixed-size output (64 bits). Do this for both the
first and last three tokens; set the top bit to "0" for the hash of the first
three tokens, or "1" for the hash of the last three tokens. Emit two key/value
rows (eg, TSV?), with the same values but different hashes. (A rough sketch of
this scheme follows at the end of this section.)

Alternatively, in SQL, index a single row on the two different hash types.

Possible alternative hash variations we could experiment with:

- take the first 10 normalized characters, removing whitespace, and hash that
- include the first 3 title tokens, then 1 token of the first author's surname
- normalize and hash the entire title
- concatenate the subtitle to the title or not

Two advantages of hashing are:

- we can shard/partition based on the key. this would not be the case if the
  keys were raw natural language tokens
- advantages from fixed-size datatypes (eg, uint64)
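To make the initial title-token scheme above concrete, here is a rough Python
sketch. The stopword list, the use of `blake2b` truncated to 64 bits (standing
in for a faster non-cryptographic hash such as xxhash), and the TSV layout are
all illustrative assumptions rather than decisions.

```python
import hashlib
import re

# small illustrative stopword list; the real list is an open question
STOPWORDS = {"the", "a", "an", "of", "on", "and"}


def normalize_title(title: str) -> list:
    """Lowercase, strip punctuation, tokenize, drop stopwords, pad to 3 tokens."""
    tokens = re.split(r"\W+", title.lower())
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    while len(tokens) < 3:
        tokens.append("")  # pad short titles with blank tokens
    return tokens


def hash64(text: str) -> int:
    # blake2b truncated to 8 bytes as a stand-in; a real pipeline might prefer
    # a faster non-cryptographic hash (eg, xxhash)
    return int.from_bytes(
        hashlib.blake2b(text.encode(), digest_size=8).digest(), "big"
    )


def title_bucket_keys(title: str) -> list:
    """Return two 64-bit bucket keys: top bit 0 for the first three tokens,
    top bit 1 for the last three tokens."""
    tokens = normalize_title(title)
    first_key = hash64(" ".join(tokens[:3])) & 0x7FFF_FFFF_FFFF_FFFF
    last_key = hash64(" ".join(tokens[-3:])) | 0x8000_0000_0000_0000
    return [first_key, last_key]


def emit_rows(release_ident: str, title: str):
    # two TSV key/value rows per record: same value, different keys
    for key in title_bucket_keys(title):
        print(f"{key:016x}\t{release_ident}")
```

Sorting the emitted rows by the key column, for both the "source" and "target"
sides, sets up the merge-join described under Bulk Joining below.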
## Bulk Joining

The "target" index could include all hash types in a single index. In bulk
mode, the "source" index could also have all hash types concatenated together
and run together, with the output (eg, release-to-release pairings) then
re-sorted and de-duplicated. In many cases this would mean redundantly
computing the fuzzy verification multiple times (but only for a small, fixed
maximum number of duplicates). Alternatively, with greater pipeline complexity,
we could do an initial match on one hash type, then attempt matching (eg,
recompute, sort, and join) on the other hash types only for those records which
did not match. A rough sketch of the merge-join step appears at the end of this
document.

## Citation Graph Index Output

Imagining successful match rows to look like:

- `match_status` (eg, strong/weak)
- `source_release_ident`
- `source_release_stage`
- `ref_key` (optional? or `ref_index`?)
- `source_work_ident`
- `target_release_ident`
- `target_release_stage`
- `target_work_ident`

We would run a sort/uniq on `(source_release_ident, target_release_ident)`.

We could filter by stages, then sort/uniq work-to-work counts to generate
simple inbound citation counts for each target work.

We could also sort by `target_work_ident` and generate groups of inbound works
("best release per work") citing that work, then do fast key lookups to show
"works/releases citing this work/release".

## To Be Decided

- bulk datastore/schema: just TSV files sorted by key column? if protobuf, how
  to encode? what about SQL? parquet/arrow?
- what datastores would allow fast merge sorts? do SQL engines (parquet)
  actually do this?
- would we need to make changes to be more compatible with something like
  opencitations? eg, I think they want DOI-to-DOI citations; having to look
  those up again from the fatcat API would be slow
- should we do this in a large distributed system like spark (maybe pyspark for
  fuzzy verification) or stick to simple UNIX/CLI tools?
- wikipedia articles as sources?
- openlibrary identifiers?
- archive.org as additional identifiers?
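Finally, as referenced in the Bulk Joining section above, here is a rough
sketch of the candidate-bucket merge between two key-sorted TSV files (one of
"source" reference keys, one of "target" work keys). The file layout, function
names, and the deferred call into the existing fuzzy verification code are all
assumptions for illustration.

```python
import itertools
import sys


def read_keyed_tsv(path):
    """Yield (key, value) pairs from a TSV file already sorted by key."""
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            yield key, value


def grouped(rows):
    """Group a key-sorted (key, value) stream into (key, [values])."""
    for key, group in itertools.groupby(rows, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]


def merge_join(source_path, target_path):
    """Merge-join two key-sorted TSV files, yielding candidate match buckets."""
    sources = grouped(read_keyed_tsv(source_path))
    targets = grouped(read_keyed_tsv(target_path))
    s = next(sources, None)
    t = next(targets, None)
    while s is not None and t is not None:
        if s[0] < t[0]:
            s = next(sources, None)
        elif s[0] > t[0]:
            t = next(targets, None)
        else:
            yield s[0], s[1], t[1]  # key, source values, target values
            s = next(sources, None)
            t = next(targets, None)


if __name__ == "__main__":
    # hypothetical usage: python merge_join.py sources.sorted.tsv targets.sorted.tsv
    for key, source_vals, target_vals in merge_join(sys.argv[1], sys.argv[2]):
        for src in source_vals:
            for tgt in target_vals:
                # here the existing python fuzzy verification code would
                # compare src and tgt and emit strong matches (omitted)
                pass
```

The same merge could be expressed with UNIX `sort`/`join` or as a pyspark job;
this plain-python version is only meant to show the bucket semantics.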