Commit 1c6fb35

doc: add an explanation of Git's data model
Git very often uses the terms "object", "reference", or "index" in its
documentation. However, it's hard to find a clear explanation of these
terms and how they relate to each other in the documentation. The closest
candidates currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically
   ordered dictionary and a dictionary is not a good way to learn concepts.
   You have to jump around too much and it's not possible to present the
   concepts in the order that they should be explained.

2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
   This is a nice document to have, but it's not necessary to learn how
   `update-index` works to understand Git's data model, and we should not
   be requiring users to learn how to use the "plumbing" commands if they
   want to learn what the term "index" or "object" means.

3. `gitrepository-layout`. This is a great resource, but it includes a lot
   of information about configuration and internal implementation details
   which are not related to the data model. It also does not explain how
   commits work.

The result of this is that Git users (even users who have been using Git
for 15+ years) struggle to read the documentation because they don't know
what the core terms mean, and it's not possible to add links to help them
learn more.

Add an explanation of Git's data model. Some choices I've made in deciding
what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me if
   those are intended to be user facing or if they're more like internal
   implementation details.

2. Don't talk about submodules other than by mentioning how they relate to
   trees. This is because Git has a lot of special features, and explaining
   how they all work exhaustively could quickly go down a rabbit hole which
   would make this document less useful for understanding Git's core
   behaviour.

3. Don't discuss the structure of a commit message (first line, trailers
   etc).

4. Don't mention configuration.

5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
1 parent bb69721 commit 1c6fb35

4 files changed: +300 -2 lines changed

Documentation/Makefile

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc
 MAN7_TXT += gitcore-tutorial.adoc
 MAN7_TXT += gitcredentials.adoc
 MAN7_TXT += gitcvs-migration.adoc
+MAN7_TXT += gitdatamodel.adoc
 MAN7_TXT += gitdiffcore.adoc
 MAN7_TXT += giteveryday.adoc
 MAN7_TXT += gitfaq.adoc

Documentation/gitdatamodel.adoc

Lines changed: 296 additions & 0 deletions
@@ -0,0 +1,296 @@
gitdatamodel(7)
===============

NAME
----
gitdatamodel - Git's core data model

SYNOPSIS
--------
gitdatamodel

DESCRIPTION
-----------

It's not necessary to understand Git's data model to use Git, but it's
very helpful when reading Git's documentation so that you know what it
means when the documentation says "object", "reference" or "index".

Git's core operations use 4 kinds of data:

1. <<objects,Objects>>: commits, trees, blobs, and tag objects
2. <<references,References>>: branches, tags,
remote-tracking branches, etc.
3. <<index,The index>>, also known as the staging area
4. <<reflogs,Reflogs>>: logs of changes to references ("ref log")

[[objects]]
OBJECTS
-------

All of the commits and files in a Git repository are stored as "Git objects".
Git objects never change after they're created, and every object has an ID,
like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.

This means that if you have an object's ID, you can always recover its
exact contents as long as the object hasn't been deleted.

Every object has:

[[object-id]]
1. an *ID* (aka "object name"), which is a cryptographic hash of its
type and contents.
It's fast to look up a Git object using its ID.
The ID is usually represented in hexadecimal, like
`1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
2. a *type*. There are 4 types of objects:
<<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
and <<tag-object,tag objects>>.
3. *contents*. The structure of the contents depends on the type.

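To make "a cryptographic hash of its type and contents" concrete, here is a
minimal Python sketch of how an object ID can be computed. It assumes a
repository using the default SHA-1 object format, where the hashed bytes are
a header (`<type> <size>` followed by a NUL byte) and then the raw contents.
The function name `git_object_id` is made up for this illustration.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes) -> str:
    """Return the hex object ID for an object (SHA-1 object format assumed)."""
    # Header: object type, a space, the content size in bytes, then a NUL byte.
    header = f"{obj_type} {len(contents)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + contents).hexdigest()


# Same ID that `echo 'hello world' | git hash-object --stdin` reports.
hello_id = git_object_id("blob", b"hello world\n")
print(hello_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# The empty blob has a well-known ID too.
empty_id = git_object_id("blob", b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID depends only on the type and contents, two files with
identical contents always map to the same blob ID.
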
Here's how each type of object is structured:

[[commit]]
commit::
A commit contains the full directory structure of every file
in that version of the repository and each file's contents.
It has these required fields
(though there are other optional fields):
+
1. The *files* in the commit, stored as the *<<tree,tree>>* ID
of the commit's base directory.
2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
regular commits have 1 parent, and merge commits have 2 or more parents.
3. An *author* and the time the commit was authored.
4. A *committer* and the time the commit was committed.
5. A *commit message*.
+
Here's how an example commit is stored:
+
----
tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
author Maya <maya@example.com> 1759173425 -0400
committer Maya <maya@example.com> 1759173425 -0400

Add README
----
+
Like all other objects, commits can never be changed after they're created.
For example, "amending" a commit with `git commit --amend` creates a new
commit with the same parent.
+
Git does not store the diff for a commit: when you ask Git to show
the commit with linkgit:git-show[1], it calculates the diff from its
parent on the fly.

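The commit layout shown above (header lines, a blank line, then the message)
can be read with a few lines of Python. This is only a sketch: the
`parse_commit` helper is hypothetical, and it deliberately ignores optional
multi-line headers such as signatures.

```python
def parse_commit(text: str) -> dict:
    """Split `git cat-file -p <commit>` output into header fields and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A commit may have 0, 1, or several parent lines.
            commit["parents"].append(value)
        else:
            commit[key] = value
    return commit


example = """\
tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
author Maya <maya@example.com> 1759173425 -0400
committer Maya <maya@example.com> 1759173425 -0400

Add README
"""

commit = parse_commit(example)
print(commit["tree"], commit["parents"], commit["message"])
```

Note that the parsed commit holds a tree ID and parent IDs, not a diff.
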
[[tree]]
tree::
A tree is how Git represents a directory.
It can contain files or other trees (which are subdirectories).
It lists, for each item in the tree:
+
1. The *filename*, for example `hello.py`
2. The *file mode*. Git has these file modes, which are only
spiritually related to Unix file modes:
+
- `100644`: regular file (with <<objects,object type>> `blob`)
- `100755`: executable file (with type `blob`)
- `120000`: symbolic link (with type `blob`)
- `040000`: directory (with type `tree`)
- `160000`: gitlink, for use with submodules (with type `commit`)

3. The <<object-id,*object ID*>> with the contents of the file or directory
+
For example, this is how a tree containing one directory (`src`) and one file
(`README.md`) is stored:
+
----
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
----

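As a rough illustration of the tree listing above, the following Python
sketch parses the `git cat-file -p` output for a tree and checks that each
mode matches the object type listed for it. The helper name
`parse_tree_listing` and the dictionary keys are assumptions made for this
example.

```python
# File modes from the list above, mapped to the object type each one uses.
MODE_TO_TYPE = {
    "100644": "blob",    # regular file
    "100755": "blob",    # executable file
    "120000": "blob",    # symbolic link
    "040000": "tree",    # directory
    "160000": "commit",  # gitlink (submodule)
}


def parse_tree_listing(text: str) -> list:
    """Parse lines of the form '<mode> <type> <id> <name>'."""
    entries = []
    for line in text.splitlines():
        mode, obj_type, obj_id, name = line.split(None, 3)
        if MODE_TO_TYPE[mode] != obj_type:
            raise ValueError(f"mode {mode} does not match type {obj_type}")
        entries.append({"mode": mode, "type": obj_type, "id": obj_id, "name": name})
    return entries


listing = """\
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
"""

tree_entries = parse_tree_listing(listing)
subdirectories = [e["name"] for e in tree_entries if e["type"] == "tree"]
print(subdirectories)  # ['src']
```
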
[[blob]]
blob::
A blob object contains a file's contents.
+
When you make a commit, Git stores the full contents of each file that
you changed as a blob.
For example, if you have a commit that changes 2 files in a repository
with 1000 files, that commit will create 2 new blobs, and use the
previous blob ID for the other 998 files.
This means that commits can use relatively little disk space even in a
very large repository.

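The 1000-file example can be simulated with a toy content-addressed store.
This is a sketch under stated assumptions: `store` is a plain in-memory
dictionary standing in for Git's object database, `snapshot` is a
hypothetical helper, and blob IDs are computed for the SHA-1 object format.

```python
import hashlib

store = {}  # hypothetical object store: blob ID -> contents


def blob_id(contents: bytes) -> str:
    header = b"blob " + str(len(contents)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + contents).hexdigest()


def snapshot(files: dict) -> dict:
    """Store every file as a blob and return a path -> blob ID mapping."""
    ids = {}
    for path, contents in files.items():
        oid = blob_id(contents)
        store.setdefault(oid, contents)  # identical contents are stored once
        ids[path] = oid
    return ids


files = {f"file{i}.txt": f"contents of file {i}\n".encode() for i in range(1000)}
first = snapshot(files)
blobs_after_first = len(store)

# Change 2 of the 1000 files and take a second snapshot.
files["file0.txt"] = b"changed\n"
files["file1.txt"] = b"also changed\n"
second = snapshot(files)

new_blobs = len(store) - blobs_after_first
reused = sum(1 for path in files if first[path] == second[path])
print(new_blobs, reused)  # 2 998
```
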
[[tag-object]]
tag object::
Tag objects contain these required fields
(though there are other optional fields):
+
1. The *ID* of the object it references
2. The *type* of the object it references
3. The *tagger* and tag date
4. A *tag message*, similar to a commit message

Here's how an example tag object is stored:

----
object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3
type commit
tag v1.0.0
tagger Maya <maya@example.com> 1759927359 -0400

Release version 1.0.0
----

NOTE: All of the examples in this section were generated with
`git cat-file -p <object-id>`.

[[references]]
REFERENCES
----------

References are a way to give a name to a commit.
It's easier to remember "the changes I'm working on are on the `turtle`
branch" than "the changes are in commit bb69721404348e".
Git often uses "ref" as shorthand for "reference".

A reference can refer to either:

1. An object ID, usually a <<commit,commit>> ID
2. Another reference. This is called a "symbolic reference".

References are stored in a hierarchy, and Git handles references
differently based on where they are in the hierarchy.
Most references are under `refs/`. Here are the main types:

[[branch]]
branches: `refs/heads/<name>`::
A branch refers to a commit ID.
That commit is the latest commit on the branch.
+
To get the history of commits on a branch, Git will start at the commit
ID the branch references, and then look at the commit's parent(s),
the parent's parent, etc.

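The history walk described above can be sketched as a graph traversal that
starts at the commit a branch refers to and follows parent IDs. The commit
IDs and the `history` helper here are made up for illustration, and real Git
commands order their output differently (for example by commit date).

```python
# Hypothetical commit graph: commit ID -> list of parent commit IDs.
commits = {
    "a1": [],            # first commit: 0 parents
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["a1"],        # commit made on a side branch
    "e5": ["c3", "d4"],  # merge commit: 2 parents
}
refs = {"refs/heads/main": "e5"}


def history(refs: dict, commits: dict, branch: str) -> list:
    """Return every commit reachable from the branch, first parents first."""
    seen, order, stack = set(), [], [refs[branch]]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        order.append(commit_id)
        # Push parents in reverse so the first parent is visited next.
        stack.extend(reversed(commits[commit_id]))
    return order


main_history = history(refs, commits, "refs/heads/main")
print(main_history)  # ['e5', 'c3', 'b2', 'a1', 'd4']
```
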
[[tag]]
tags: `refs/tags/<name>`::
A tag refers to a commit ID, tag object ID, or other object ID.
There are two types of tags:
1. "Annotated tags", which reference a <<tag-object,tag object>> ID
which contains a tag message
2. "Lightweight tags", which reference a commit, blob, or tree ID
directly
+
Even though branches and tags can both refer to a commit ID, Git
treats them very differently.
Branches are expected to change over time: when you make a commit, Git
will update your <<HEAD,current branch>> to point to the new commit.
Tags are usually not changed after they're created.

[[HEAD]]
HEAD: `HEAD`::
`HEAD` is where Git stores your current <<branch,branch>>,
if there is a current branch. `HEAD` can either be:
+
1. A symbolic reference to your current branch, for example `ref:
refs/heads/main` if your current branch is `main`.
2. A direct reference to a commit ID. In this case there is no current branch.
This is called "detached HEAD state"; see the DETACHED HEAD section
of linkgit:git-checkout[1] for more.

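The two forms of `HEAD` can be modelled with a small lookup table. In this
sketch, `refs` is a hypothetical dictionary in which a value starting with
`ref: ` marks a symbolic reference; `resolve` and `current_branch` are
illustrative helpers, not Git APIs.

```python
refs = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "750b4ead9c87ceb3ddb7a390e6c7074521797fb3",
}


def resolve(refs: dict, name: str, max_depth: int = 5) -> str:
    """Follow symbolic references until an object ID is found."""
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):]
    raise ValueError("too many levels of symbolic references")


def current_branch(refs: dict):
    """Return the current branch name, or None in detached HEAD state."""
    prefix = "ref: refs/heads/"
    value = refs["HEAD"]
    return value[len(prefix):] if value.startswith(prefix) else None


head_commit = resolve(refs, "HEAD")
branch = current_branch(refs)
print(branch, head_commit)

# Detached HEAD: HEAD holds a commit ID directly, so there is no branch.
detached = dict(refs, HEAD="4ccb6d7b8869a86aae2e84c56523f8705b50c647")
detached_branch = current_branch(detached)
```
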
[[remote-tracking-branch]]
remote-tracking branches: `refs/remotes/<remote>/<branch>`::
A remote-tracking branch refers to a commit ID.
It's how Git stores the last-known state of a branch in a remote
repository. `git fetch` updates remote-tracking branches. When
`git status` says "Your branch is up to date with 'origin/main'", it's
looking at this.
+
`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's
default branch. This is the branch that `git clone` checks out by default.

[[other-refs]]
Other references::
Git tools may create references anywhere under `refs/`.
For example, linkgit:git-stash[1], linkgit:git-bisect[1],
and linkgit:git-notes[1] all create their own references
in `refs/stash`, `refs/bisect`, etc.
Third-party Git tools may also create their own references.
+
Git may also create references other than `HEAD` at the base of the
hierarchy, like `ORIG_HEAD`.

NOTE: Objects will only be deleted if they aren't "reachable" from any reference.
An object is "reachable" if we can find it by following tags to whatever
they tag, commits to their parents or trees, and trees to the trees or
blobs that they contain.
For example, if you amend a commit with `git commit --amend`,
the old commit will usually not be reachable, so it may be deleted eventually.

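Reachability as described in the note is a graph search starting from the
references. The sketch below uses an invented object graph (all IDs are
placeholders) in which `old_commit2` is a commit that was replaced by
amending; the `reachable` helper is hypothetical.

```python
# Hypothetical object graph: object ID -> IDs of the objects it points to.
objects = {
    "tag1": ["commit2"],                     # tag object -> tagged commit
    "commit2": ["commit1", "tree2"],         # commit -> parent and tree
    "commit1": ["tree1"],
    "old_commit2": ["commit1", "tree_old"],  # replaced by `git commit --amend`
    "tree2": ["blob_readme", "blob_fixed"],
    "tree1": ["blob_readme"],
    "tree_old": ["blob_readme", "blob_typo"],
    "blob_readme": [],
    "blob_fixed": [],
    "blob_typo": [],
}
refs = {"refs/heads/main": "commit2", "refs/tags/v1": "tag1"}


def reachable(objects: dict, refs: dict) -> set:
    """Return the IDs of all objects reachable from any reference."""
    found, stack = set(), list(refs.values())
    while stack:
        oid = stack.pop()
        if oid not in found:
            found.add(oid)
            stack.extend(objects[oid])
    return found


unreachable = set(objects) - reachable(objects, refs)
print(sorted(unreachable))  # ['blob_typo', 'old_commit2', 'tree_old']
```

Only the amended-away commit and the objects that nothing else uses become
candidates for deletion; `blob_readme` survives because newer trees still
point to it.
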
[[index]]
THE INDEX
---------
The index, also known as the "staging area", is a list of files and
the contents of each file, stored as a <<blob,blob>>.
You can add files to the index or update the contents of a file in the
index with linkgit:git-add[1]. This is called "staging" the file for commit.

Unlike a <<tree,tree>>, the index is a flat list of files.
When you commit, Git converts the list of files in the index to a
directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.

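The conversion from a flat index to a directory tree can be pictured with
nested dictionaries. This is a simplified sketch: `build_tree` is a
hypothetical helper, and real Git would write each directory as a tree
object with its own ID rather than as a Python dictionary.

```python
# Flat index entries: (mode, object ID, stage, path).
index = [
    ("100644", "8728a858d9d21a8c78488c8b4e70e531b659141f", 0, "README.md"),
    ("100644", "665c637a360874ce43bf74018768a96d2d4d219a", 0, "src/hello.py"),
]


def build_tree(index: list) -> dict:
    """Turn flat paths into nested directories (dict = tree, tuple = file)."""
    root = {}
    for mode, oid, stage, path in index:
        if stage != 0:
            raise ValueError(f"{path} has an unresolved merge conflict")
        *directories, filename = path.split("/")
        node = root
        for directory in directories:
            node = node.setdefault(directory, {})
        node[filename] = (mode, oid)
    return root


tree = build_tree(index)
print(sorted(tree))         # ['README.md', 'src']
print(sorted(tree["src"]))  # ['hello.py']
```
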
Each index entry has 4 fields:

1. The *file mode*, which must be one of:
- `100644`: regular file (with <<objects,object type>> `blob`)
- `100755`: executable file (with type `blob`)
- `120000`: symbolic link (with type `blob`)
- `160000`: gitlink, for use with submodules (with type `commit`)
2. The *<<blob,blob>>* ID of the file,
or (rarely) the *<<commit,commit>>* ID of the submodule
3. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
there's a merge conflict there can be multiple versions of the same
filename in the index.
4. The *file path*, for example `src/hello.py`

It's extremely uncommon to look at the index directly: normally you'd
run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
But you can use `git ls-files --stage` to see the index.
Here's the output of `git ls-files --stage` in a repository with 2 files:

----
100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
----

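The four fields can be pulled out of `git ls-files --stage` output with a
short parser. This is a sketch: `parse_index_listing` and
`conflicted_paths` are hypothetical helpers, and the object IDs in the
conflict example are obvious placeholders rather than real hashes.

```python
def parse_index_listing(text: str) -> list:
    """Parse lines of the form '<mode> <object ID> <stage> <path>'."""
    entries = []
    for line in text.splitlines():
        mode, oid, stage, path = line.split(None, 3)
        entries.append({"mode": mode, "id": oid, "stage": int(stage), "path": path})
    return entries


def conflicted_paths(entries: list) -> list:
    """Paths with a non-zero stage number have an unresolved merge conflict."""
    return sorted({e["path"] for e in entries if e["stage"] != 0})


listing = """\
100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
"""
entries = parse_index_listing(listing)
print(conflicted_paths(entries))  # []

# During a conflict the same path appears several times with stages 1 to 3.
# The IDs below are placeholders, not real object IDs.
conflict_listing = "\n".join(
    f"100644 {char * 40} {stage}\tsrc/hello.py"
    for stage, char in ((1, "a"), (2, "b"), (3, "c"))
)
conflict_entries = parse_index_listing(conflict_listing)
print(conflicted_paths(conflict_entries))  # ['src/hello.py']
```
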
[[reflogs]]
REFLOGS
-------

Every time a branch, remote-tracking branch, or HEAD is updated, Git
updates a log called a "reflog" for that <<references,reference>>.
This means that if you make a mistake and "lose" a commit, you can
generally recover the commit ID by running `git reflog <reference>`.

A reflog is a list of log entries. Each entry has:

1. The *commit ID*
2. *Timestamp* when the change was made
3. *Log message*, for example `pull: Fast-forward`

Reflogs only log changes made in your local repository.
They are not shared with remotes.

You can view a reflog with `git reflog <reference>`.
For example, here's the reflog for a `main` branch which has changed twice:

----
$ git reflog main --date=iso --no-decorate
750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README
4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit
----

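To show why a reflog makes "lost" commits recoverable, here is a toy model
in Python. It is only a sketch: `update_ref` and `show_reflog` are
hypothetical helpers, timestamps are passed in by hand, and each entry
keeps just the three fields listed above.

```python
refs = {}
reflogs = {}  # reference name -> list of entries, oldest first


def update_ref(name: str, commit_id: str, timestamp: str, message: str) -> None:
    """Point a reference at a commit and append an entry to its reflog."""
    refs[name] = commit_id
    reflogs.setdefault(name, []).append(
        {"id": commit_id, "time": timestamp, "message": message}
    )


def show_reflog(name: str) -> list:
    """Format entries newest first, numbered from 0."""
    short = name.rsplit("/", 1)[-1]
    newest_first = reversed(reflogs[name])
    return [
        f"{entry['id']} {short}@{{{i}}}: {entry['message']}"
        for i, entry in enumerate(newest_first)
    ]


main = "refs/heads/main"
update_ref(main, "4ccb6d7", "2025-09-29 15:16:48 -0400", "commit (initial): Initial commit")
update_ref(main, "750b4ea", "2025-09-29 15:17:05 -0400", "commit: Add README")

# Move the branch back by mistake: the branch no longer points at 750b4ea...
update_ref(main, "4ccb6d7", "2025-09-29 15:20:00 -0400", "reset: moving to HEAD~1")

# ...but the reflog still remembers it.
lines = show_reflog(main)
print("\n".join(lines))
```
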
GIT
---
Part of the linkgit:git[1] suite

Documentation/glossary-content.adoc

Lines changed: 2 additions & 2 deletions
@@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a
 identified by its <<def_object_name,object name>>. The objects usually
 live in `$GIT_DIR/objects/`.
 
-[[def_object_identifier]]object identifier (oid)::
-Synonym for <<def_object_name,object name>>.
+[[def_object_identifier]]object identifier, object ID, oid::
+Synonyms for <<def_object_name,object name>>.
 
 [[def_object_name]]object name::
 The unique identifier of an <<def_object,object>>. The

Documentation/meson.build

Lines changed: 1 addition & 0 deletions
@@ -192,6 +192,7 @@ manpages = {
 'gitcore-tutorial.adoc' : 7,
 'gitcredentials.adoc' : 7,
 'gitcvs-migration.adoc' : 7,
+'gitdatamodel.adoc' : 7,
 'gitdiffcore.adoc' : 7,
 'giteveryday.adoc' : 7,
 'gitfaq.adoc' : 7,
