From 4fc6c4eae7fdd6d2820041204ea88a7fa957c21c Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 3 Sep 2020 18:21:12 +0200
Subject: update various docs; start data issue log

---
 README.md                       | 19 +++++++++++++++++++
 notes/Abbreviations.md          |  3 +++
 notes/Todo.md                   | 23 +++++++++++++++++++++++
 notes/abbrev.md                 |  3 ---
 notes/plan.md                   | 23 -----------------------
 projects/grobid_refs/.gitignore |  2 ++
 projects/grobid_refs/README.md  |  5 ++++-
 7 files changed, 51 insertions(+), 27 deletions(-)
 create mode 100644 notes/Abbreviations.md
 create mode 100644 notes/Todo.md
 delete mode 100644 notes/abbrev.md
 delete mode 100644 notes/plan.md
 create mode 100644 projects/grobid_refs/.gitignore

diff --git a/README.md b/README.md
index 7c6468d..2fe2e5e 100644
--- a/README.md
+++ b/README.md
@@ -42,3 +42,22 @@ user 29177m5.516s
 sys 4927m3.277s
 ```
 
+## Data issues
+
+### A republised article
+
+There is "student BMJ" and "BMJ" - this (html) article (interview) has been
+first published on "sbmj" (Published 07 July 2011), then "bmj" (Published 10
+August 2011).
+
+> Notes; Originally published as: Student BMJ 2011;19:d3983
+
+* https://www.bmj.com/content/343/sbmj.d3983
+* https://www.bmj.com/content/343/bmj.d4964
+
+It is essentially the same text, same title, author, just different DOI and
+probably a different recorded date.
+
+Generic pattern "republication" duplicate:
+
+* metadata mostly same, except date and doi
diff --git a/notes/Abbreviations.md b/notes/Abbreviations.md
new file mode 100644
index 0000000..5106d5b
--- /dev/null
+++ b/notes/Abbreviations.md
@@ -0,0 +1,3 @@
+# Abbreviations
+
+* https://images.webofknowledge.com/images/help/WOS/V_abrvjt.html
diff --git a/notes/Todo.md b/notes/Todo.md
new file mode 100644
index 0000000..2c548b0
--- /dev/null
+++ b/notes/Todo.md
@@ -0,0 +1,23 @@
+# Todo
+
+## Releases
+
+* [ ] stats of cases: versions, exact title matches; common prefixes (e.g. "XYZ Report 20XX", ...)
+
+## Containers
+
+* [ ] create notebook on duplicates
+* [ ] static mapping, that is efficient to store, maybe via: https://github.com/pytries/marisa-trie
+
+If matching only by name, we need to lookup a (exact) name.
+
+* need a mapping from "name" and "name variants" to journal "issnl"
+
+## Bulk
+
+* [ ] download export
+
+## Performance
+
+* provide some fast path
+
diff --git a/notes/abbrev.md b/notes/abbrev.md
deleted file mode 100644
index 5106d5b..0000000
--- a/notes/abbrev.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Abbreviations
-
-* https://images.webofknowledge.com/images/help/WOS/V_abrvjt.html
diff --git a/notes/plan.md b/notes/plan.md
deleted file mode 100644
index 94c1297..0000000
--- a/notes/plan.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Plan
-
-## Releases
-
-* [ ] stats of cases: versions, exact title matches; common prefixes (e.g. "XYZ Report 20XX", ...)
-
-## Containers
-
-* [ ] create notebook on duplicates
-* [ ] static mapping, that is efficient to store, maybe via: https://github.com/pytries/marisa-trie
-
-If matching only by name, we need to lookup a (exact) name.
-
-* need a mapping from "name" and "name variants" to journal "issnl"
-
-## Bulk
-
-* [ ] download export
-
-## Performance
-
-* provide some fast path
-
diff --git a/projects/grobid_refs/.gitignore b/projects/grobid_refs/.gitignore
new file mode 100644
index 0000000..bd98a73
--- /dev/null
+++ b/projects/grobid_refs/.gitignore
@@ -0,0 +1,2 @@
+*.pdf
+
diff --git a/projects/grobid_refs/README.md b/projects/grobid_refs/README.md
index 13ca3fc..15eaae0 100644
--- a/projects/grobid_refs/README.md
+++ b/projects/grobid_refs/README.md
@@ -2,5 +2,8 @@
 
 References extracted from [grobid](https://grobid.readthedocs.io).
 
-Example grobid output: [grobid.tei.xml](grobid.tei.xml).
+Example grobid outputs:
+
+* [grobid.tei.xml](grobid.tei.xml), [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does not extract many refs; GS looks ok
+* [](), [pdf](https://ia803202.us.archive.org/21/items/jstor-1064270/1064270.pdf)
 
-- 
cgit v1.2.3