From c364e3cb9c55d36771e274cbac3d8825798b1612 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 22 Jan 2019 12:37:55 -0800
Subject: MAG schema notes

---
 notes/schema/mag_schema_comparison.txt | 65 ++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 notes/schema/mag_schema_comparison.txt

diff --git a/notes/schema/mag_schema_comparison.txt b/notes/schema/mag_schema_comparison.txt
new file mode 100644
index 00000000..0328ff7e
--- /dev/null
+++ b/notes/schema/mag_schema_comparison.txt
@@ -0,0 +1,65 @@
+
+Looking at the Microsoft Academic Graph schema: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema
+
+My take-aways from this are:
+
+- should allow storing raw affiliations today in release_contrib rows, and some
+  day have a foreign key to institution there
+- maybe should have an "original_title" field for releases? though could go in
+  'extra' (along with subtitle)
+- have a well-known 'extra' key to use for saving citation context in references
+
+
+## Data Model (high-level)
+
+Includes rich affiliation (at the per-paper level) and "field of study"
+tagging.
+
+No work/release distinction.
+
+There are URLs, but no file-level metadata.
+
+Don't store full abstracts for legal reasons.
+
+
+## Details (lower-level)
+
+Across many entities, there are "normalized" and "display" names.
+
+Some stats are aggregated: paper and citation counts
+
+#### Affiliations
+
+Institution names: "normalized" vs. "display"
+
+"GRID" id?
+
+What is the WikiPage? Wikipedia?
+
+#### Authors
+
+Saves "last known" affiliation.
+
+#### Field of Study
+
+Nested hierarchy
+
+#### Citations
+
+"Context" table stores... presumably text around the citation itself.
+
+"References" table stores little metadata about the citation itself.
+
+#### Papers
+
+Paper URLs now have types (an int).
+
+"Paper Title" / "Original Title" / "Book Title"
+
+Year and Date separately (same as fatcat)
+
+Stores first and last page separately.
+
+"Original Venue" (string), presumably name of the container/journal
+
+Has arbitrary resources (URLs)
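
A rough sketch of how the three take-aways in these notes might look as data, using plain Python dicts. Only `release_contrib`, `extra`, `original_title`, and subtitle come from the notes; every other name here (`raw_affiliation`, `institution_id`, `citation_context`, the helper functions) is a hypothetical placeholder, not an existing fatcat or MAG field.

```python
import json

# Hypothetical "well-known" key for citation context inside a reference's 'extra'.
CITATION_CONTEXT_KEY = "citation_context"


def make_contrib(raw_name, raw_affiliation=None, institution_id=None):
    """Build a release_contrib-style row (field names are assumptions)."""
    return {
        "raw_name": raw_name,
        # Free-text affiliation string that can be stored today.
        "raw_affiliation": raw_affiliation,
        # Placeholder for a future foreign key to an institution entity.
        "institution_id": institution_id,
    }


def make_reference(index, context=None, extra=None):
    """Build a reference row, putting citation context under the 'extra' key."""
    merged = dict(extra or {})
    if context is not None:
        merged[CITATION_CONTEXT_KEY] = context
    return {"index": index, "extra": merged}


def make_release(title, original_title=None, subtitle=None):
    """Build a release, keeping original_title and subtitle in 'extra'."""
    extra = {}
    if original_title is not None:
        extra["original_title"] = original_title
    if subtitle is not None:
        extra["subtitle"] = subtitle
    return {"title": title, "extra": extra, "contribs": [], "refs": []}


if __name__ == "__main__":
    release = make_release("Example title", original_title="Example Title!")
    release["contribs"].append(
        make_contrib("A. Author", raw_affiliation="Example University")
    )
    release["refs"].append(make_reference(0, context="as shown previously [1]"))
    print(json.dumps(release, indent=2))
```

Keeping `original_title` and the citation context in 'extra' avoids a schema change up front; either could be promoted to a real column later if it proves useful.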