-rw-r--r-- | notes/data_issues.md | 7 |
-rw-r--r-- | python/notes/version_3.md | 12 |
2 files changed, 19 insertions, 0 deletions
diff --git a/notes/data_issues.md b/notes/data_issues.md
index 488ae94..a40fcb6 100644
--- a/notes/data_issues.md
+++ b/notes/data_issues.md
@@ -13,3 +13,10 @@ A DOI seems to be using the key, which leads to repeated entries.
 * https://qa.fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/outbound-refs
 
 The source and target seem to be the same.
+
+## Duplicated Edges
+
+* 2020-04-20
+* https://qa.fatcat.wiki/release/22222736evcc7kdn3bleua3fge/outbound-refs
+
+Source and target are the same, maybe DOI with ref key?
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 0656d39..4ed4df4 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -2,12 +2,21 @@
 
 V2 plus:
 
+* [ ] no dups
 * [ ] unmatched
 * [ ] wikipedia
 * [ ] some unstrucutured refs
 * [ ] OL
 * [ ] weblinks
 
+## Duplicates
+
+```
+$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
+```
+
+Only 0.001% though.
+
 ## Unstructured
 
 * about 300M w/o title, etc.
@@ -250,3 +259,6 @@ Options:
 * can sort refs by source ident
 
 That's almost the same, as the matching process, just another function working on the match group.
+
+----
+
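
The jq filter added under `## Duplicates` selects records whose source and target release identifiers are equal. A minimal Python sketch of the same check follows, which also counts all records so that the share of self-references can be computed. The field names `source_release_ident` and `target_release_ident` come from the jq filter; the function name `self_ref_share` and the sample records are hypothetical and only serve as illustration.

```python
import json


def self_ref_share(lines):
    """Count self-referencing edges in an iterable of JSON lines.

    Returns a tuple (self_refs, total). Mirrors the jq expression
    `.source_release_ident == .target_release_ident`: as in jq, two
    missing fields compare equal (None == None).
    """
    self_refs = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        total += 1
        if doc.get("source_release_ident") == doc.get("target_release_ident"):
            self_refs += 1
    return self_refs, total


# Made-up sample records, not taken from the actual dataset.
sample = [
    '{"source_release_ident": "aaa", "target_release_ident": "aaa"}',
    '{"source_release_ident": "aaa", "target_release_ident": "bbb"}',
    '{"source_release_ident": "ccc"}',
]
self_refs, total = self_ref_share(sample)
print(f"{self_refs} of {total} records ({100 * self_refs / total:.3f}%)")
```

Since the function accepts any iterable of lines, it could presumably be fed from `sys.stdin` at the end of the same `zstdcat` pipeline in place of `jq`.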