author | Martin Czygan <martin.czygan@gmail.com> | 2021-04-21 20:15:53 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-04-21 20:15:53 +0200 |
commit | 3bf4710e2e63eb6706b444fc244a8cdfe59fac0c (patch) | |
tree | 2c3bc9cf11dffe30d3b0252199a495c1c7b1e125 /python/notes | |
parent | 8553a35bb0fa91edd37e5728c7e546d4888514ce (diff) | |
download | refcat-3bf4710e2e63eb6706b444fc244a8cdfe59fac0c.tar.gz refcat-3bf4710e2e63eb6706b444fc244a8cdfe59fac0c.zip |
note on dups
Diffstat (limited to 'python/notes')
-rw-r--r-- | python/notes/version_3.md | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 0656d39..4ed4df4 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -2,12 +2,21 @@
 
 V2 plus:
 
+* [ ] no dups
 * [ ] unmatched
 * [ ] wikipedia
 * [ ] some unstrucutured refs
 * [ ] OL
 * [ ] weblinks
 
+## Duplicates
+
+```
+$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
+```
+
+Only 0.001% though.
+
 ## Unstructured
 
 * about 300M w/o title, etc.
@@ -250,3 +259,6 @@ Options:
 * can sort refs by source ident
 
 That's almost the same, as the matching process, just another function working on the match group.
+
+----
+
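The `jq` filter added in this commit selects records whose source and target release idents are equal, i.e. references that point back at their own release. As a rough sketch of how the quoted share could be recomputed, the following counts such records in newline-delimited JSON. The function name `count_self_refs` is hypothetical and not part of refcat; only the two field names are taken from the command above.

```python
import json
from typing import Iterable, Tuple


def count_self_refs(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (self_refs, total) for newline-delimited JSON records.

    Hypothetical helper. A record counts as a self reference when
    source_release_ident equals target_release_ident. As with the jq
    filter, a record missing both fields also matches (None == None).
    """
    self_refs = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        total += 1
        if doc.get("source_release_ident") == doc.get("target_release_ident"):
            self_refs += 1
    return self_refs, total
```

Fed with the decompressed stream (for example `count_self_refs(sys.stdin)` behind `zstdcat -T0 ... |`), the share in percent would be `100 * self_refs / total`, guarding against an empty input.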