path: root/README.md
author: Martin Czygan <martin.czygan@gmail.com> 2020-11-19 02:52:12 +0100
committer: Martin Czygan <martin.czygan@gmail.com> 2020-11-19 02:52:12 +0100
commit: ca0df663105335826d54baef8066f42a9d539961
tree: cd5399da0cd3f904128e3cde83c0a32f87577f13 /README.md
parent: cadd28ae6a06dc907fbbf2d8015899ab0f2a4f81
update notes
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 77
1 files changed, 43 insertions, 34 deletions
diff --git a/README.md b/README.md
index 412495c..72b91b4 100644
--- a/README.md
+++ b/README.md
@@ -87,21 +87,10 @@ Clustering 1M records (single core) takes about 64s (15K docs/s).
```shell
$ head -1 out.json
{
- "c": "release_key_title",
+ "k": "裏表紙",
"v": [
- "7ufkzsjywzejvjzsyegugradoa",
- "harjqexl5vagxc54zjfen5zlve",
- "i5jrdoxqmjfs3fk2dcpnqxqb2e",
- "i62bo63qqzggjjk7pf77z26djm",
- "omo3z5y7qvh6hbl7wjacinsfiq",
- "prkik3s5vzejnfe4u26g2vt2wu",
- "pyqss6ifnvgqjeqohlampswvkm",
- "spr2b23fk5asph7v6shrd6okt4",
- "togokylwfvcvzilhnx4jir2hfm",
- "us4artv2hbc5bljuwaopquicfu",
- "ycargjj4lzddnmyzbh2e22wsii"
- ],
- "k": "裏表紙"
+ ...
+ ]
}
```
@@ -115,25 +104,45 @@ Interestingly, the parallel variants detects fewer clusters (because data is
split and clusters are searched within each batch). TODO(miku): sort out sharding bug.
-## Cluster
+## QA
-```shell
-usage: fuzzycat command [options] cluster [-h] [--prefix PREFIX]
- [--tmpdir TMPDIR] [-P] [-f FILES]
- [-t TYPE]
- {cluster,verify} ...
-
-positional arguments:
- {cluster,verify}
- cluster group entities
- verify verify groups
-
-optional arguments:
- -h, --help show this help message and exit
- --prefix PREFIX temp file prefix
- --tmpdir TMPDIR temporary directory
- -P, --profile profile program
- -f FILES, --files FILES
- output files
- -t TYPE, --type TYPE cluster algorithm: title, tnorm, tnysi
+### 10M release dataset
+
+Notes on cadd28a version clustering (nysiis) and verification.
+
+* 10M docs
+* 9040789 groups
+* 665447 verification pairs
+
+```
+ 176 Miss.APPENDIX
+ 25 Miss.ARXIV_VERSION
+ 12082 Miss.BLACKLISTED
+ 5 Miss.BLACKLISTED_FRAGMENT
+ 46733 Miss.BOOK_CHAPTER
+ 1567 Miss.COMPONENT
+ 47691 Miss.CONTRIB_INTERSECTION_EMPTY
+ 30806 Miss.DATASET_DOI
+ 1 Miss.NUM_DIFF
+ 157718 Miss.RELEASE_TYPE
+ 16263 Miss.SHORT_TITLE
+ 6013 Miss.SUBTITLE
+ 57 Miss.TITLE_FILENAME
+ 148755 Miss.YEAR
+ 93 OK.ARXIV_VERSION
+ 88294 OK.DUMMY
+ 110 OK.PREPRINT_PUBLISHED
+ 15818 OK.SLUG_TITLE_AUTHOR_MATCH
+ 93240 OK.TITLE_AUTHOR_MATCH
```
+
+Cases
+
+* common title, "Books by Our Readers", https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq, https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq
+* common title, "The Future of Imprisonment"
+* same title "IEEE Transactions on Wireless Communications", same publisher, different year
+* same, except DOI, but maybe the same item, after all? https://fatcat.wiki/release/kxgsbh66v5bwhobcaiuh4i7dwy, https://fatcat.wiki/release/thl7o44z3jgk3njdypixwrdbve
+
+Possible improvements:
+
+* when title and authors match, check the year, and maybe the DOI prefix; DOIs with the same prefix may not be duplicates
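
Review note: the improvement proposed in the last bullet could be sketched roughly as below. This is a minimal illustration, not fuzzycat's actual verification code. The function names, the return labels, and the record keys `release_year` and `doi` are assumptions made for this example; real release documents may store these fields differently.

```python
from typing import Optional


def doi_prefix(doi: Optional[str]) -> Optional[str]:
    """Return the registrant prefix of a DOI (the part before the first slash)."""
    if not doi or "/" not in doi:
        return None
    return doi.split("/", 1)[0].lower()


def refine_title_author_match(a: dict, b: dict) -> str:
    """Extra checks to run once title and authors are already known to match.

    Hypothetical helper: the labels "miss_year", "ambiguous" and "ok" are
    illustrative and are not fuzzycat's own status values.
    """
    year_a, year_b = a.get("release_year"), b.get("release_year")
    # Both years known and different: treat the pair as distinct releases.
    if year_a and year_b and year_a != year_b:
        return "miss_year"

    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b and doi_a.lower() != doi_b.lower():
        # Two different DOIs under the same prefix may be two distinct items,
        # so flag the pair for review instead of accepting it.
        if doi_prefix(doi_a) == doi_prefix(doi_b):
            return "ambiguous"

    return "ok"
```

Under these assumptions, a pair such as the "IEEE Transactions on Wireless Communications" case (same title, same publisher, different year) would be rejected by the year check before the DOI is looked at.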