author | Martin Czygan <martin.czygan@gmail.com> | 2021-05-05 15:55:39 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-05-05 15:55:39 +0200 |
commit | 634b7b7d910ddb20c5af0722de41ef5ccded7358 (patch) | |
tree | d83f5fb36dc4c98035511059202fc51dc676ee54 /notes | |
parent | a380bffa5fb0cf20ee84ede6fa590bf38e3675f8 (diff) | |
parent | 134752c2a160986c13d6c2b9428cb2720ed382d0 (diff) | |
Merge branch 'master' of git.archive.org:martin/cgraph
* 'master' of git.archive.org:martin/cgraph: (24 commits)
update notes
make: run go mod tidy after build
add test for ParseUnstructured
remove stub file
tweaks; move parsing out of command
skate-map: a bit more help output
update docs
set: some tweaks
update README
update deps
start overview docs
update README
update docs
map is a reference type
fix a typo
implement a few flags as mapper middleware
update ignore files
update deps
rename skate-ref-to-release to skate-conv
update README
...
Diffstat (limited to 'notes')
-rw-r--r-- | notes/overview.md | 52 |
1 files changed, 52 insertions, 0 deletions
diff --git a/notes/overview.md b/notes/overview.md
new file mode 100644
index 0000000..8cb1200
--- /dev/null
+++ b/notes/overview.md
@@ -0,0 +1,52 @@
+# Overview
+
+## Data inputs
+
+Mostly JSON, but each one different in form and quality.
+
+Core inputs:
+
+* refs schema, from metadata or grobid (1-4B)
+* fatcat release entities (100-200M)
+* open library solr export (10-50M)
+
+Other inputs:
+
+* researchgate sitemap, titles (10-30M)
+* oai-pmh harvest metadata (50-200M)
+* sim (serials in microfilm, "microfilm") metadata
+
+Inputs related to evaluation:
+
+* BASE md dump (200-300M)
+* Microsoft Academic, MAG (100-300M)
+
+Casually:
+
+* a single title, e.g. ILL related (1)
+* lists of titles (1-1M)
+
+## Targets
+
+### BiblioRef
+
+Most important high level target; basic schema for current setup; elasticsearch
+indexable, small JSON docs, allowing basic aggregations and lookups.
+
+This is not just a conversion, but may involve clustering, verification, etc.
+
+## Approach
+
+We may call it "local map-reduce", and we try to do it all in a single MR setup, e.g.
+
+* extract relevant fields and sort (map)
+* apply computation on groups (reduce)
+
+As we want performance and sometimes custom code (e.g. for finding information
+in unstructured data), we try to group code into a Go library with a suite of
+command line tools. Easy to build and deploy.
+
+If the scaffoling is good, we can plug in mappers and reducers as we go, and
+expose them in the tools.
+
+