author Martin Czygan <martin.czygan@gmail.com> 2020-11-24 15:06:34 +0100
committer Martin Czygan <martin.czygan@gmail.com> 2020-11-24 15:06:34 +0100
commit 621f50e685d9beeb1fe502a133e76fbd5a8a9c5c
tree 527f4abc067ef9c4d49ba0bea1f52b23ed219f5e
parent bfd6e08c3b1ffb16c98321d15dd5da6e8db400de
cleanup
Diffstat (limited to 'notes')
-rw-r--r-- notes/bm.md 19
-rw-r--r-- notes/clustering.md 102
-rw-r--r-- notes/general.md 197
-rw-r--r-- notes/todo.md 23
-rw-r--r-- notes/workflow.md 60
5 files changed, 0 insertions, 401 deletions
diff --git a/notes/bm.md b/notes/bm.md
deleted file mode 100644
index b6c3a7c..0000000
--- a/notes/bm.md
+++ /dev/null
@@ -1,19 +0,0 @@
-# b/m
-
-## cluster, verify
-
-* git pull deploy, aitio
-* cluster example
-* test with
-
-## regatedl match results
-
-* https://git.archive.org/martin/regatedl, in fixtures: https://git.archive.org/martin/regatedl/-/tree/master/fixtures
-
-## the temp data structure
-
-* should go in ~/.cache/...
-* sqlite; TSV
-
-## tigris ideas
-
diff --git a/notes/clustering.md b/notes/clustering.md
deleted file mode 100644
index 3f6312c..0000000
--- a/notes/clustering.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Clustering
-
-Original dataset:
-
-```
-$ sha1sum release_export_expanded.json.zst
-fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
-
-$ zstdcat -T0 release_export_expanded.json.zst | wc -l
-```
-
-Various clusters (title, title normalized, title nysiis (New York State
-Identification and Intelligence System), ...):
-
-```
-$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
-```
-
-Parallel (TODO: use `--pipepart`):
-
-```
-$ zstdcat -T0 release_export_expanded.json.zst | \
- parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
- fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
-```
-
-Numbers of clusters:
-
-```
- 141022216 cluster_title.json
- 134709771 cluster_title_normalized.json
- 119829458 cluster_title_nysiis.json
-```
-
-The number of duplicate records goes up as the number of clusters goes down:
-
-```
- 2858088 cluster_title_dups.json
- 5818143 cluster_title_normalized_dups.json
- 6274940 cluster_title_nysiis_dups.json
-```
-
-# Cluster numbers
-
-Using normalized title as example:
-
-* 4306860 have cluster size 2, 1511283 have cluster size 3 or larger
-
-```
- size len
-count 5818143.000 5818143.000
-mean 4.350 52.120
-std 196.347 35.026
-min 2.000 0.000
-25% 2.000 24.000
-50% 2.000 46.000
-75% 3.000 72.000
-max 151383.000 11686.000
-```
-
-Around 448170 clusters with size 5 or more (with some example titles):
-
-```
-Medical Notes
-日本鉄鋼協会第97回講演大会講演概要
-Boutades
-Allergic Contact Dermatitis
-Comité international
-Incontinence
-Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
-Early Intervention
-CURRENT READINGS IN NUCLEAR MEDICINE
-Nannocystis exedens
-```
-
-Grouping. API, hide.
-
-* gnu parallel; top, htop; how much; "chunks"; read one line; "pipepart";
-  batching; "read from a file"; scan a file; "chunking"
-
-# TODO
-
-* [ ] do a SS like clustering, using title and author ngrams
-* [ ] cluster by doi without "vX" suffix
-
-# Verification
-
-* we only need to look at identified duplicates, which will be a few millions
-* we want fast access to all release JSON blobs via ident, maybe do a
-  "fuzzycat-cache" that copies relevant files into the fs, e.g.
-  "~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json", or via microblob (but http we do not need), or sqlite3 (https://www.sqlite.org/fasterthanfs.html)
-
-For verification we need to have the cached json blobs in some fast,
-thread-safe store. At an estimated 1K accesses/s, we would still need a few
-hours for a run.
-
-* [ ] find all ids we need, generate cache, maybe reduce number of fields
-* [ ] run verification on each cluster; generate a file in the same format
-  containing "verified" clusters; take note of the clustering and verification method
-
-Overall, we can combine various clustering and verification methods. We can
-also put together a list of maybe 100-200 test cases and evaluate methods.
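A minimal sketch of the sqlite3 option from the verification notes above, assuming each exported release is a JSON object with an `ident` field. The class name `ReleaseCache`, the table layout, and the method names are illustrative and not part of fuzzycat.

```python
import json
import sqlite3


class ReleaseCache:
    """Hypothetical key-value cache: release ident -> JSON document."""

    def __init__(self, path=":memory:"):
        # A file path such as one under ~/.cache/fuzzycat/ would persist the
        # cache; ":memory:" keeps this sketch self-contained.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS release "
            "(ident TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put_many(self, docs):
        # One transaction for the whole batch keeps bulk inserts fast.
        rows = ((doc["ident"], json.dumps(doc)) for doc in docs)
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO release (ident, body) VALUES (?, ?)", rows
            )

    def get(self, ident):
        # Primary-key lookup; returns None for an unknown ident.
        row = self.conn.execute(
            "SELECT body FROM release WHERE ident = ?", (ident,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

By default a Python `sqlite3` connection may only be used from the thread that created it, so a multi-threaded verification run would likely open one connection per worker. Storing only the fields needed for verification would shrink the cache, as the TODO item suggests.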
diff --git a/notes/general.md b/notes/general.md
deleted file mode 100644
index 03f6ec4..0000000
--- a/notes/general.md
+++ /dev/null
@@ -1,197 +0,0 @@
-# fuzzycat (wip)
-
-Fuzzy matching publications for [fatcat](https://fatcat.wiki).
-
-* [fuzzycat](https://pypi.org/project/fuzzycat/)
-
-Note: This is currently work-in-progress.
-
-## Motivation
-
-Most of the results on sites like [Google
-Scholar](https://scholar.google.com/scholar?q=fuzzy+matching) group
-publications into clusters. Each cluster represents one publication, abstracted
-from its concrete representation as a link to a PDF.
-
-We call the abstract publication
-[work](https://guide.fatcat.wiki/entity_work.html) and the concrete instance a
-[release](https://guide.fatcat.wiki/entity_release.html). One goal is to group
-releases under works and to implement a versions feature (self-match). Another
-goal is to have support for matching of external lists (e.g. title lists or
-other documents) to the existing records.
-
-This repository contains both generic code for matching as well as
-fatcat-specific code using the fatcat openapi client.
-
-## Running and Deployment
-
-We defer more packaging polish until the code stabilizes a bit more. For now:
-
-```
-$ git clone git@github.com:miku/fuzzycat.git && cd fuzzycat
-$ pipenv install --deploy
-$ pipenv run python -m fuzzycat.main
-```
-
-For the future, an independent [pex](https://github.com/pantsbuild/pex) or
-[shiv](https://github.com/linkedin/shiv) executable would be a convenient
-option to allow execution from any directory.
-
-## Datasets
-
-A few relevant datasets are:
-
-* release and container metadata from a bulk fatcat export, e.g.
- [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05)
-* issn journal level data, via [issnlister](https://github.com/miku/issnlister)
-* journal abbreviation lists
-
-## Matching approaches
-
-![](static/approach.png)
-
-## Performance data points
-
-### Against elasticsearch
-
-Candidate generation via elasticsearch, 40 parallel queries, sustained speed at
-about 17857 queries per hour, that is around 5 queries/s.
-
-```
-$ time cat ~/data/researchgate/x04 | \
- parallel -j40 --pipe -N 1 ./fatcatx_rg_unmatched.py - \
- > ~/data/researchgate/x04_results.ndj
-...
-real 3409m16.442s
-user 29177m5.516s
-sys 4927m3.277s
-```
-
-### Without a search index
-
-Candidate grouping for self-match can be done locally by extracting a key per
-document, then a group by (via sort and uniq). Clustering 150M docs took about
-607min (around 4k docs/s, no verification step).
-
-## Data issues
-
-### A republished article
-
-* [https://fatcat.wiki/release/search?q=%22The+doctor+with+seven+billion+patients%22](https://fatcat.wiki/release/search?q=%22The+doctor+with+seven+billion+patients%22)
-
-There is "student BMJ" and "BMJ" - this (html) article (interview) was
-first published on "sbmj" (Published 07 July 2011), then "bmj" (Published 10
-August 2011).
-
-> Notes; Originally published as: Student BMJ 2011;19:d3983
-
-* https://www.bmj.com/content/343/sbmj.d3983
-* https://www.bmj.com/content/343/bmj.d4964
-
-It is essentially the same text, same title, author, just different DOI and
-probably a different recorded date.
-
-Generic pattern "republication" duplicate:
-
-* metadata mostly same, except date and doi
-
-### Common title
-
-Probably a few thousand very common short titles.
-
-* [https://fatcat.wiki/release/search?q=%22Book+Reviews%22](https://fatcat.wiki/release/search?q=%22Book+Reviews%22) (238852)
-
-Some authors do this regularly:
-
-* [https://fatcat.wiki/release/search?q=%22Book+Reviews%22+%22william%22+%22michael%22](https://fatcat.wiki/release/search?q=%22Book+Reviews%22+%22william%22+%22michael%22) (398)
-
-Different DOI, so we know it is different.
-
-More examples:
-
-* [https://fatcat.wiki/release/search?q=%22errata%22](https://fatcat.wiki/release/search?q=%22errata%22) (37680)
-* [https://fatcat.wiki/release/search?q=%22Einleitung%22](https://fatcat.wiki/release/search?q=%22Einleitung%22) (68005)
-* [https://fatcat.wiki/release/search?q=%22Notes%22](https://fatcat.wiki/release/search?q=%22Notes%22) (1507705)
-* [https://fatcat.wiki/release/search?q=%22Letters+to+the+Editor%22](https://fatcat.wiki/release/search?q=%22Letters+to+the+Editor%22) (30976)
-
-### Title with extra data
-
-* like ISBN, ISSN, price and all kinds of extra metadata
-* [https://fatcat.wiki/release/search?q=title%3A%22ISBN%22](https://fatcat.wiki/release/search?q=title%3A%22ISBN%22)
-* titles typically get longer: [https://fatcat.wiki/release/olxswrilxfci3ibb3bg5xhstr4](https://fatcat.wiki/release/olxswrilxfci3ibb3bg5xhstr4)
-* some of these are actually "reviews", e.g. [https://fatcat.wiki/release/4blc5mfc5bfaxkofuletqxuzp4](https://fatcat.wiki/release/4blc5mfc5bfaxkofuletqxuzp4)
-
-Another example:
-
-* too [long](https://fatcat.wiki/release/hewmq4afvnew7pwttvulzguubu), original suggested citation seems to be:
-
-> Parker, S. and Kerrod, R. (2002), "Children’s) Space Busters (1st) Looking at Stars (2nd)", Reference Reviews, Vol. 16 No. 5, pp. 26-27. https://doi.org/10.1108/rr.2002.16.5.26.252
-
-### Sometimes a title will be ambiguous
-
-For example, given a title "Shakespeare in Tokyo" we would have to always return "ambiguous", as there are at least two separate publications with that name:
-
-* [https://fatcat.wiki/release/search?q=%22Shakespeare+in+Tokyo%22](https://fatcat.wiki/release/search?q=%22Shakespeare+in+Tokyo%22)
-
-This is similar to journal names, where some journal names will always be ambiguous.
-
-### Versions
-
-* same title, same authors, "vX" doi
-* [https://fatcat.wiki/release/search?q=%22Self-similarity+analysis+of+the+non-linear%22](https://fatcat.wiki/release/search?q=%22Self-similarity+analysis+of+the+non-linear%22)
-
-Sometimes, we have a couple of preprint versions, plus a published version (with a slightly different title):
-
-* [https://fatcat.wiki/release/search?q=%22Time-periodic+solutions+of+massive%22](https://fatcat.wiki/release/search?q=%22Time-periodic+solutions+of+massive%22)
-
-### Almost same
-
-* same author, maybe year
-* different DOI
-* title almost the same, e.g. [MassIVE MSV000085583 - Aedes aegypti protein profile and proteome analysis](https://fatcat.wiki/release/search?q=%22Aedes+aegypti+protein+profile+and+proteome+analysis%22)
-
-### Duplication by different granularity
-
-* [https://fatcat.wiki/release/search?q=%22Volkshochschule+Leipzig%22](https://fatcat.wiki/release/search?q=%22Volkshochschule+Leipzig%22) (20308)
-* contains both yearly entries, as well as "DOI per page",
- [https://fatcat.wiki/release/r734v367nza4tl37j6d74rfqo4](https://fatcat.wiki/release/r734v367nza4tl37j6d74rfqo4);
-could group pages under "container" of yearly release?
-* We have [one container](https://github.com/internetarchive/fatcat/blob/4f80b87722d64f27c985f0040ea177269b6e028b/fatcat-openapi2.yml#L704-L709) per release, currently.
-
-### Partial titles
-
-A metadata title might differ from the full title.
-
-* [https://fatcat.wiki/release/search?q=%22Brain-derived+neurotrophic+factor%22](https://fatcat.wiki/release/search?q=%22Brain-derived+neurotrophic+factor%22)
-
-Here, the [release](https://fatcat.wiki/release/2vi655gcejffhnzzbkkcnjpscm) points to two PDFs, one is an article, the other a weekly report (summary).
-
-### Exact duplicates
-
-* [https://fatcat.wiki/release/search?q=%22WEIGHTED+LIPSCHITZ+ESTIMATES+FOR+COMMUTATORS+ON+WEIGHTED+MORREY-HERZ+SPACES%22](https://fatcat.wiki/release/search?q=%22WEIGHTED+LIPSCHITZ+ESTIMATES+FOR+COMMUTATORS+ON+WEIGHTED+MORREY-HERZ+SPACES%22)
-
-### Difference in Subtitle (invisible)
-
-The subtitle is not visible in the metadata; everything else is the same, except for the DOI and the page number. These are different releases.
-
-* [https://fatcat.wiki/release/search?q=%22Slip+in+tungsten+monocarbide%22](https://fatcat.wiki/release/search?q=%22Slip+in+tungsten+monocarbide%22)
-
-### The "what a difference a char makes" case
-
-Typically a yearly report, or "part 1", "part 2", like this:
-
-* [https://fatcat.wiki/release/search?q=%22The+Use+of+Bone+Age+in+Clinical+Practice+%22](https://fatcat.wiki/release/search?q=%22The+Use+of+Bone+Age+in+Clinical+Practice+%22)
-
-The DOI differs; we could hard-code some patterns.
-
-### Published to two sites
-
-An article can have multiple DOIs, e.g. when republished by a site that gives out DOIs, e.g. researchgate. Example:
-
-* [Effect of Chlorophyll and Anthocyanin on the Secondary Bonds of Poly Vinyl Chloride](https://fatcat.wiki/release/search?q=%22Effect+of+Chlorophyll+and+Anthocyanin+on+the+Secondary+Bonds+of+Poly+Vinyl+Chloride+%22)
-
-> https://doi.org/10.11648/j.ijmsa.s.2015040201.15, https://doi.org/10.13140/rg.2.1.2398.3606
-
-Probably many releases with a "10.13140"-prefixed DOI have at least one other DOI.
-
-Some might be "rg-only", like this: [https://fatcat.wiki/release/search?q=%22Marco+de+trabajo+basado+en+los+datos+enlazados+para%22](https://fatcat.wiki/release/search?q=%22Marco+de+trabajo+basado+en+los+datos+enlazados+para%22)
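A small sketch for flagging the "10.13140"-prefixed DOIs mentioned in the notes above, so that such releases can be treated as candidates for a second, publisher-issued DOI. The helper names `normalize_doi` and `is_researchgate_doi` are hypothetical and not part of fuzzycat; the prefix is taken from the notes.

```python
RG_PREFIX = "10.13140/"  # DOI prefix named in the notes above


def normalize_doi(value):
    """Lowercase a DOI and strip a leading resolver URL or "doi:" scheme."""
    value = value.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def is_researchgate_doi(value):
    """True if the DOI carries the "10.13140" prefix."""
    return normalize_doi(value).startswith(RG_PREFIX)
```

Applied to the two DOIs quoted above, only `https://doi.org/10.13140/rg.2.1.2398.3606` is flagged. Whether a flagged release really has another DOI still has to be checked by matching title and authors.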
diff --git a/notes/todo.md b/notes/todo.md
deleted file mode 100644
index 2c548b0..0000000
--- a/notes/todo.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Todo
-
-## Releases
-
-* [ ] stats of cases: versions, exact title matches; common prefixes (e.g. "XYZ Report 20XX", ...)
-
-## Containers
-
-* [ ] create notebook on duplicates
-* [ ] static mapping, that is efficient to store, maybe via: https://github.com/pytries/marisa-trie
-
-If matching only by name, we need to look up an (exact) name.
-
-* need a mapping from "name" and "name variants" to journal "issnl"
-
-## Bulk
-
-* [ ] download export
-
-## Performance
-
-* provide some fast path
-
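A sketch of the name-to-issnl mapping from the "Containers" items above, using a plain dict in place of a trie. The class `ContainerNameIndex`, the normalization rule, and the ISSN-L values in the usage note are placeholders for illustration only.

```python
import re


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


class ContainerNameIndex:
    """Hypothetical lookup table: normalized journal name -> set of ISSN-Ls."""

    def __init__(self):
        self._by_name = {}

    def add(self, issnl, names):
        # Register the main name and every variant under the same ISSN-L.
        for name in names:
            self._by_name.setdefault(normalize_name(name), set()).add(issnl)

    def lookup(self, name):
        # More than one result means the name alone is ambiguous.
        return sorted(self._by_name.get(normalize_name(name), set()))
```

For example, after `index.add("0000-0001", ["Example Journal of Testing", "Ex. J. Test."])` (a made-up ISSN-L), `index.lookup("ex j test")` returns `["0000-0001"]`. A static structure such as the marisa-trie mentioned above could replace the dict if memory becomes a concern.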
diff --git a/notes/workflow.md b/notes/workflow.md
deleted file mode 100644
index 8cdd817..0000000
--- a/notes/workflow.md
+++ /dev/null
@@ -1,60 +0,0 @@
-# Workflow
-
-Split the problem in half: first find clusters, then examine clusters (as
-proposed).
-
-## Finding clusters
-
-* group by raw exact title
-* group by lowercase title
-* group by slug title
-* group by ngram title and authors
-* group by ngram title (prefix, suffix) and authors
-* group by elasticsearch
-* group by doi without vX suffix
-* group by soundex
-* group by a simhash over the record
-
-As for performance, the feature needs to be calculated in one pass, then the
-grouping reduces to a sort, in a second pass.
-
-The output could be a TSV file, with method and then release identifiers.
-
-```
-rawt o3utonw5qzhddo7l4lmwptgeey nnpmnwln7be2zb5hd2qanq3r7q
-```
-
-Or jsonlines for a bit of structure (e.g. method, ids)
-
-```
-{"m": "rawt", "c": ["o3utonw5qzhddo7l4lmwptgeey", "nnpmnwln7be2zb5hd2qanq3r7q"]}
-```
-
-```
-$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -g > clusters.json
-```
-
-### Performance considerations
-
-* [orjson](https://github.com/ijl/orjson), [pysimdjson](https://github.com/TkTech/pysimdjson)
-
-## Format
-
-Options:
-
-* emit minimal cluster information, e.g. method description and actual identifiers
-* emit methods, and for each cluster item some core fields (title, author, id, date)
-
-## Examine cluster
-
-There will be various methods by which to examine the cluster as well.
-
-We need to fetch releases by identifier (API, but use "hide"), this can be the
-full record or some partial record that has been cached somewhere.
-
-The input is then a list of releases and the output would be an equally sized or
-smaller cluster of releases which we assume represent the same record.
-
-Apart from that, there may be different relations, e.g. not the exact same
-thing, but something that has an interval to it, like something that mostly
-differs in year?
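The two-pass scheme from "Finding clusters" above (compute a key per record, then sort and group) can be sketched as follows. This is an in-memory illustration, assuming each record is a dict with `ident` and `title` fields; the function names and the "slug" method label are hypothetical and do not reflect the actual `fuzzycat-cluster` implementation.

```python
import itertools
import re


def key_raw_title(doc):
    """Raw exact title."""
    return doc.get("title") or ""


def key_lower_title(doc):
    """Lowercased title."""
    return key_raw_title(doc).lower()


def key_slug_title(doc):
    """Lowercased title with everything but letters and digits removed."""
    return re.sub(r"[\W_]+", "", key_lower_title(doc))


def cluster(docs, key_func, method):
    """Yield {"m": method, "c": [idents]} for every key shared by 2+ records."""
    # Pass 1: compute the feature (key) for every record.
    pairs = [(key_func(doc), doc["ident"]) for doc in docs]
    # Pass 2: sorting brings equal keys together, so grouping is a linear scan.
    pairs.sort()
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        idents = [ident for _, ident in group]
        if key and len(idents) > 1:
            yield {"m": method, "c": idents}
```

Each yielded dict matches the jsonlines shape shown above and can be written out with `json.dumps`. For an export with well over 100M records the list would not fit in memory comfortably, so the sort would more likely happen on disk, in line with the `--tmpdir` usage in the clustering notes.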