From 0cf00f57575fb71e79d9a4b1bd7b3d59a682c63a Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Tue, 27 Apr 2021 21:38:07 +0200
Subject: update notes

---
 python/notes/version_3.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

(limited to 'python/notes')

diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 7fce20f..66840bf 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -276,3 +276,29 @@ Sidenote, also in refs:
 ```
 
 How many titles have "s p a c e s" in title?
+
+----
+
+ISBN normalization.
+
+In refs, we mostly have ISBN in unstructured:
+
+```
+ISBN 3-906166-35-X.
+ISBN 978-0- 470-25003-7.
+Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).
+ISBN 88-13-19785-3
+ISBN GB3N-CL4-5HL4.
+```
+
+About 600/1M "isbn" in unstructured.
+
+```
+$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
+675
+```
+
+So maybe 500k isbn in total?
+
+* need to find them, then validate them
+
-- 
cgit v1.2.3
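
The "find them, then validate them" step from the notes could look roughly like the sketch below. It is not part of the commit. The names `CANDIDATE`, `is_valid_isbn10`, `is_valid_isbn13` and `find_isbns` are hypothetical, and the regular expression is an assumption tuned only to the five sample strings above: it expects the literal marker "ISBN" followed by digits, hyphens and spaces. The checksum rules are the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) ones.

```python
import re

# Assumption: candidates follow the literal marker "ISBN" and consist of
# digits, hyphens and whitespace, optionally ending in the check character X.
CANDIDATE = re.compile(
    r"\bisbn[:\s]*([0-9][0-9\s-]{8,20}[0-9Xx])",
    re.IGNORECASE,
)


def is_valid_isbn10(s):
    """Check an ISBN-10 string without separators (weights 10..1, mod 11)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check an ISBN-13 string without separators (weights 1,3,1,3,..., mod 10)."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized ISBNs found in text whose checksum is valid."""
    found = []
    for match in CANDIDATE.finditer(text):
        # Normalize: drop hyphens and whitespace, uppercase a trailing "x".
        normalized = re.sub(r"[\s-]", "", match.group(1)).upper()
        if is_valid_isbn10(normalized) or is_valid_isbn13(normalized):
            found.append(normalized)
    return found


if __name__ == "__main__":
    samples = [
        "ISBN 3-906166-35-X.",
        "ISBN 978-0- 470-25003-7.",
        "Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).",
        "ISBN 88-13-19785-3",
        "ISBN GB3N-CL4-5HL4.",
    ]
    for sample in samples:
        print(find_isbns(sample), "<-", sample)
```

One known limitation of this sketch: because spaces are allowed inside a candidate, an ISBN directly followed by another number (for example a year) would be captured together with it and then rejected by the length check. A stricter pattern or a fallback on prefixes would be needed for such cases.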