-rw-r--r-- | python/notes/version_3.md | 26 |
1 files changed, 26 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 7fce20f..66840bf 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -276,3 +276,29 @@ Sidenote, also in refs:
 ```
 
 How many titles have "s p a c e s" in title?
+
+----
+
+ISBN normalization.
+
+In refs, we mostly have ISBN in unstructured:
+
+```
+ISBN 3-906166-35-X.
+ISBN 978-0- 470-25003-7.
+Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).
+ISBN 88-13-19785-3
+ISBN GB3N-CL4-5HL4.
+```
+
+About 600/1M "isbn" in unstructured.
+
+```
+$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
+675
+```
+
+So maybe 500k isbn in total?
+
+* need to find them, then validate them
+
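
A minimal sketch of the "find them, then validate them" step, assuming the input is a plain `biblio.unstructured` string. The names `ISBN_CANDIDATE`, `is_valid_isbn10`, `is_valid_isbn13` and `find_isbns` are hypothetical and not part of the existing code. The candidate regex is a guess based on the five samples in the note. Only the checksum rules (mod 11 for ISBN-10, mod 10 with weights 1 and 3 for ISBN-13) are standard.

```python
import re

# Assumption: candidates follow the literal "ISBN" and consist of digits,
# hyphens, spaces and an optional X check character, as in the samples.
ISBN_CANDIDATE = re.compile(r"ISBN[:\s]*([0-9][0-9Xx\- ]{8,20})", re.IGNORECASE)


def is_valid_isbn10(s):
    """Check the ISBN-10 checksum: weights 10..1, sum divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check the ISBN-13 checksum: alternating weights 1 and 3, sum divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized (hyphen- and space-free) ISBNs with a valid checksum."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        compact = re.sub(r"[\s-]", "", match.group(1)).upper()
        # The regex is greedy and may swallow trailing digits (e.g. a year),
        # so only a 13- or 10-character prefix is tested. This can still
        # produce occasional false positives.
        if is_valid_isbn13(compact[:13]):
            found.append(compact[:13])
        elif is_valid_isbn10(compact[:10]):
            found.append(compact[:10])
    return found
```

Run over the samples from the note, this keeps the four well-formed ISBNs in normalized form and drops `GB3N-CL4-5HL4`, which never matches the candidate pattern. The sketch does not check the ISBN-13 prefix (978/979) and does not convert ISBN-10 to ISBN-13.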