Diffstat (limited to 'notes/schema/mag_schema_comparison.txt')
-rw-r--r-- | notes/schema/mag_schema_comparison.txt | 65 |
1 files changed, 65 insertions, 0 deletions
diff --git a/notes/schema/mag_schema_comparison.txt b/notes/schema/mag_schema_comparison.txt
new file mode 100644
index 00000000..0328ff7e
--- /dev/null
+++ b/notes/schema/mag_schema_comparison.txt
@@ -0,0 +1,65 @@
+
+Looking at the Microsoft Academic Graph schema: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema
+
+My take-aways from this are:
+
+- should allow storing raw affiliations today in release_contrib rows, and some
+  day have a foreign key to institution there
+- maybe should have an "original_title" field for releases? though could go in
+  'extra' (along with subtitle)
+- have a well-known 'extra' key to use for saving citation context in references
+
+
+## Data Model (high-level)
+
+Includes rich affiliation (at the per-paper level) and "field of study"
+tagging.
+
+No work/release distinction.
+
+There are URLs, but no file-level metadata.
+
+Don't store full abstracts for legal reasons.
+
+
+## Details (lower-level)
+
+Across many entities, there are "normalized" and "display" names.
+
+Some stats are aggregated: paper and citation counts
+
+#### Affiliations
+
+Institution names: "normalized" vs. "display"
+
+"GRID" id?
+
+What is the WikiPage? Wikipedia?
+
+#### Authors
+
+Saves "last known" affiliation.
+
+#### Field of Study
+
+Nested hierarchy
+
+#### Citations
+
+"Context" table stores... presumably text around the citation itself.
+
+"References" table stores little metadata about the citation itself.
+
+#### Papers
+
+Paper URLs now have types (an int).
+
+"Paper Title" / "Original Title" / "Book Title"
+
+Year and Date separately (same as fatcat)
+
+Stores first and last page separately.
+
+"Original Venue" (string), presumably name of the container/journal
+
+Has arbitrary resources (URLs)
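
The first and third take-aways in the notes (raw affiliations on release_contrib rows, and a well-known 'extra' key for citation context) could be sketched roughly as below. This is only an illustration, not the project's actual schema: the class names, the `raw_affiliation` and `institution_id` fields, and the `citation_context` key are all assumed names, and the sample values are made up.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, Optional

# Hypothetical name for the well-known key under 'extra'.
CITATION_CONTEXT_KEY = "citation_context"


@dataclass
class ReleaseContrib:
    """Illustrative stand-in for a release_contrib row (assumed fields)."""
    raw_name: str
    # Free-text affiliation string, stored as-is today.
    raw_affiliation: Optional[str] = None
    # Placeholder for a possible future foreign key to an institution entity.
    institution_id: Optional[str] = None


@dataclass
class Reference:
    """Illustrative stand-in for a reference row with a free-form 'extra' dict."""
    index: int
    extra: Dict[str, Any] = field(default_factory=dict)


def add_citation_context(ref: Reference, text: str) -> Reference:
    """Append a snippet of text surrounding the citation under the well-known key."""
    ref.extra.setdefault(CITATION_CONTEXT_KEY, []).append(text)
    return ref


# Made-up example values.
contrib = ReleaseContrib(
    raw_name="Jane Doe",
    raw_affiliation="Dept. of Physics, Example University",
)
ref = add_citation_context(Reference(index=1), "as shown in earlier work [1]")

contrib_row = asdict(contrib)
```

Keeping `institution_id` optional lets the raw string be stored now and linked later without a migration of existing rows; storing context as a list allows for a work being cited more than once in the same paper.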