notes/schema/mag_schema_comparison.txt




Looking at the Microsoft Academic Graph schema: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema

My take-aways from this are:

- should allow storing raw affiliations in release_contrib rows today, and
  some day add a foreign key to institution there
- maybe should have an "original_title" field for releases? though it could go
  in 'extra' (along with subtitle)
- should define a well-known 'extra' key for saving citation context in
  references
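
A rough sketch of what these take-aways could look like as data structures. All
class, field, and key names here (`raw_affiliation`, `institution_id`,
`citation_context`, and so on) are assumptions for illustration, not the actual
schema:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Hypothetical name for the well-known 'extra' key; not an established convention.
CITATION_CONTEXT_KEY = "citation_context"


@dataclass
class ReleaseContrib:
    raw_name: str
    # Raw affiliation string, stored as-is today.
    raw_affiliation: Optional[str] = None
    # Possible future foreign key to an institution entity (hypothetical).
    institution_id: Optional[str] = None


@dataclass
class ReleaseRef:
    index: int
    extra: Dict[str, Any] = field(default_factory=dict)

    def set_citation_context(self, text: str) -> None:
        # Store the text surrounding the citation under the agreed key.
        self.extra[CITATION_CONTEXT_KEY] = text


@dataclass
class Release:
    title: str
    # 'original_title' and 'subtitle' live in 'extra' in this sketch,
    # rather than as dedicated columns.
    extra: Dict[str, Any] = field(default_factory=dict)


contrib = ReleaseContrib(raw_name="Jane Doe", raw_affiliation="Example University")
ref = ReleaseRef(index=0)
ref.set_citation_context("as shown in earlier work [1], the effect is small")
release = Release(title="Example Title", extra={"original_title": "Titre d'exemple"})
```

Keeping `institution_id` optional means existing rows stay valid if the foreign
key is added later.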


## Data Model (high-level)

Includes rich affiliation (at the per-paper level) and "field of study"
tagging.

No work/release distinction.

There are URLs, but no file-level metadata.

Doesn't store full abstracts, for legal reasons.


## Details (lower-level)

Across many entities, there are "normalized" and "display" names.
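
To make the "normalized" vs. "display" distinction concrete, here is one
plausible normalization. The schema docs don't spell out the actual rule in
these notes, so this is an assumption about the general idea (strip accents,
lowercase, collapse punctuation), not MAG's real algorithm:

```python
import re
import unicodedata


def normalize_name(display_name: str) -> str:
    """Derive a lookup-friendly form from a display name (illustrative rule only)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", display_name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, and replace each run of non-alphanumeric characters with one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", stripped.lower())
    return cleaned.strip()


display = "Université de Montréal"
normalized = normalize_name(display)
```

The display form is kept for presentation; the normalized form is what would be
used for matching and de-duplication.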

Some stats are aggregated: paper and citation counts.

#### Affiliations

Institution names: "normalized" vs. "display"

"GRID" id?

What is the WikiPage? Wikipedia?

#### Authors

Saves "last known" affiliation.

#### Field of Study

Nested hierarchy
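
A small sketch of walking such a hierarchy, assuming it is available as
(parent, child) pairs and that a field may have more than one parent. The
function names and the sample pairs are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_parent_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each child field to the set of its direct parents."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for parent, child in pairs:
        parents[child].add(parent)
    return parents


def ancestors(field_name: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every field above field_name, following all parent links."""
    seen: Set[str] = set()
    stack = [field_name]
    while stack:
        current = stack.pop()
        # .get() avoids inserting empty entries into the defaultdict.
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Made-up sample data, not taken from the actual dataset.
pairs = [
    ("computer science", "machine learning"),
    ("mathematics", "machine learning"),
    ("machine learning", "deep learning"),
]
index = build_parent_index(pairs)
deep_learning_ancestors = ancestors("deep learning", index)
```

Tracking `seen` keeps the walk from revisiting a field reachable through two
different parents.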

#### Citations

"Context" table stores... presumably text around the citation itself.

"References" table stores little metadata about the citation itself.

#### Papers

Paper URLs now have types (an int).

"Paper Title" / "Original Title" / "Book Title"

Year and Date separately (same as fatcat)

Stores first and last page separately.
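
For comparison, a sketch of deriving separate first/last page values from a
single free-form pages string. The input format and the `split_pages` helper
are assumptions for illustration; abbreviated ranges like "123-30" are
deliberately not expanded:

```python
import re
from typing import Optional, Tuple


def split_pages(pages: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a pages string like '123-130' into (first_page, last_page)."""
    if not pages or not pages.strip():
        return (None, None)
    # Accept hyphen, en dash, or em dash as the range separator.
    parts = re.split(r"\s*[-\u2013\u2014]\s*", pages.strip(), maxsplit=1)
    first = parts[0] or None
    last = parts[1] if len(parts) > 1 and parts[1] else None
    return (first, last)


first_page, last_page = split_pages("123-130")
```

Pages are kept as strings because values such as "e1234" or "iv" are not
integers.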

"Original Venue" (string), presumably name of the container/journal

Has arbitrary resources (URLs)