author Martin Czygan <martin.czygan@gmail.com> 2021-04-21 20:15:53 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-04-21 20:15:53 +0200
commit 3bf4710e2e63eb6706b444fc244a8cdfe59fac0c
tree 2c3bc9cf11dffe30d3b0252199a495c1c7b1e125
parent 8553a35bb0fa91edd37e5728c7e546d4888514ce
note on dups
-rw-r--r-- notes/data_issues.md | 7
-rw-r--r-- python/notes/version_3.md | 12
2 files changed, 19 insertions, 0 deletions
diff --git a/notes/data_issues.md b/notes/data_issues.md
index 488ae94..a40fcb6 100644
--- a/notes/data_issues.md
+++ b/notes/data_issues.md
@@ -13,3 +13,10 @@ A DOI seems to be using the key, which leads to repeated entries.
* https://qa.fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/outbound-refs
The source and target seem to be the same.
+
+## Duplicated Edges
+
+* 2020-04-20
+* https://qa.fatcat.wiki/release/22222736evcc7kdn3bleua3fge/outbound-refs
+
+Source and target are the same, maybe DOI with ref key?
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 0656d39..4ed4df4 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -2,12 +2,21 @@
V2 plus:
+* [ ] no dups
* [ ] unmatched
* [ ] wikipedia
* [ ] some unstrucutured refs
* [ ] OL
* [ ] weblinks
+## Duplicates
+
+```
+$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
+```
+
+Only 0.001% though.
+
## Unstructured
* about 300M w/o title, etc.
@@ -250,3 +259,6 @@ Options:
* can sort refs by source ident
That's almost the same, as the matching process, just another function working on the match group.
+
+----
+
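The duplicates check added to `python/notes/version_3.md` selects edges whose source and target release idents are equal. A minimal Python sketch of the same check, which also reports the share of such edges, might look like the following. It assumes newline-delimited JSON with the `source_release_ident` and `target_release_ident` fields used in the jq filter; the function name `count_self_edges` and the script name in the usage line are hypothetical, and decompression is left to `zstdcat`.

```python
import json
import sys


def count_self_edges(lines):
    """Return (self_edges, total) for an iterable of JSON lines.

    Hypothetical helper: field names follow the jq filter in the notes.
    """
    self_edges = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        total += 1
        # Mirror the jq filter: an edge counts when both idents compare equal
        # (as in jq, two missing/null idents also compare equal).
        if doc.get("source_release_ident") == doc.get("target_release_ident"):
            self_edges += 1
    return self_edges, total


if __name__ == "__main__":
    self_edges, total = count_self_edges(sys.stdin)
    share = 100.0 * self_edges / total if total else 0.0
    print(f"{self_edges} of {total} edges are self-links ({share:.4f}%)")
```

Possible usage, assuming the sketch is saved as `self_edges.py`: `zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | python self_edges.py`. Counting in a single pass gives both the number of self-links and the total, so the percentage can be checked without a second run over the file.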