author | Martin Czygan <martin.czygan@gmail.com> | 2021-04-27 21:38:07 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-04-27 21:38:07 +0200 |
commit | 0cf00f57575fb71e79d9a4b1bd7b3d59a682c63a | |
tree | 4242172362932557f624297645d690eb3ed075db | |
parent | 9728cd3d48a4490b67cd7c03aa7f41de6a069771 | |
update notes
-rw-r--r-- | python/notes/version_3.md | 26 |
1 files changed, 26 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 7fce20f..66840bf 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -276,3 +276,29 @@ Sidenote, also in refs:
 ```
 
 How many titles have "s p a c e s" in title?
+
+----
+
+ISBN normalization.
+
+In refs, we mostly have ISBN in unstrcutured:
+
+```
+ISBN 3-906166-35-X.
+ISBN 978-0- 470-25003-7.
+Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).
+ISBN 88-13-19785-3
+ISBN GB3N-CL4-5HL4.
+```
+
+About 600/1M "isbn" in unstructured.
+
+```
+$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
+675
+```
+
+So maybe 500k isbn in total?
+
+* need to find them, then validate them
+
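The "find them, then validate them" step in the note could be sketched roughly as follows. This is only an illustration, not code from the refcat repository: the `CANDIDATE` pattern and the function names are hypothetical, and the regex is a guess at what the unstructured strings look like based on the five samples in the note. Validation uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check digit rules.

```python
import re

# Hypothetical candidate pattern (an assumption, not from the repository):
# "ISBN", optionally "-10"/"-13" and a colon, then a digit followed by
# digits, "X", hyphens or spaces.
CANDIDATE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9Xx \-]{8,20})", re.IGNORECASE)


def is_valid_isbn10(s):
    """ISBN-10: weights 10..1, weighted sum divisible by 11; 'X' counts as 10."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13: alternating weights 1 and 3, weighted sum divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return compact (no hyphens or spaces) ISBNs in text that pass their checksum."""
    found = []
    for match in CANDIDATE.finditer(text):
        compact = re.sub(r"[ \-]", "", match.group(1)).upper()
        # The candidate may have swallowed trailing digits, so test the
        # 13-character prefix first, then the 10-character prefix.
        for length, check in ((13, is_valid_isbn13), (10, is_valid_isbn10)):
            if check(compact[:length]):
                found.append(compact[:length])
                break
    return found


if __name__ == "__main__":
    samples = [
        "ISBN 3-906166-35-X.",
        "ISBN 978-0- 470-25003-7.",
        "Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).",
        "ISBN 88-13-19785-3",
        "ISBN GB3N-CL4-5HL4.",
    ]
    for line in samples:
        print(line, "->", find_isbns(line))
```

On the samples from the note, the four well-formed strings yield a checksum-valid ISBN and `ISBN GB3N-CL4-5HL4.` yields nothing. Stripping hyphens and spaces before validation is what lets the stray space in `978-0- 470-25003-7` through; whether that is too permissive for the full refs corpus is an open question.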