Why can't Git handle large files and large repos?


Dozens of questions and answers, here and elsewhere, emphasize that Git can't handle large files or large repos. A handful of workarounds are suggested, such as git-fat and git-annex, but ideally Git would handle large files and repos natively.

If this limitation has been around for years, is there a reason it has not yet been removed? I assume there's a technical or design challenge baked into Git that makes large file and large repo support extremely difficult.

There are lots of related questions, but none of them seem to explain why this is such a big hurdle.

Basically, it comes down to tradeoffs.

One of those related questions has an example from Linus himself:

[...] CVS, ie it ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), Git ends up still always caring about the whole thing, and carrying the knowledge around.

So Git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you won't find a data structure with O(1) index access and insertion, you won't find a content tracker that does everything fantastically.

Git has deliberately chosen to be better at some things, to the detriment of others.


Disk usage

Since Git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone - see the sketch below).

This has some really nice advantages, and is why DVCSs like Git have become insanely popular.

However, a 4 TB repo on a central SVN or CVS server is manageable, whereas if you use Git, nobody will be thrilled about carrying that around.
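As an aside on the shallow clone mentioned above: it fetches only the most recent history instead of every object ever committed, which is the usual escape hatch when a full copy is too heavy. A minimal sketch, driven from Python with a placeholder repository URL:

```python
import subprocess

# Shallow clone: fetch only the latest commit's worth of history instead of
# the full object database. The URL here is a placeholder, not a real repo.
subprocess.run(
    ["git", "clone", "--depth", "1", "https://example.com/some/repo.git"],
    check=True,
)
```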

Git has some nifty mechanisms for minimizing the size of the repo by creating delta chains ("diffs") across files. Git isn't constrained by paths or commit orders in creating these, and they really do work quite well....kinda like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make retrieving objects take a little longer, but they are very effective at minimizing disk usage. (There's those tradeoffs again.)
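To illustrate the idea only (this is not Git's actual packfile or delta format), you can store one revision in full and represent a nearby revision as a small delta against it. A rough sketch in Python using difflib:

```python
import difflib

# Two revisions of the same 1000-line file; the second changes a single line.
v1 = ["line %d\n" % i for i in range(1000)]
v2 = list(v1)
v2[500] = "a changed line\n"

# Keep v1 in full, and store v2 only as a delta ("diff") against v1.
delta = list(difflib.unified_diff(v1, v2, lineterm="\n"))

print("full copy of v2:  ", sum(len(line) for line in v2), "bytes")
print("delta against v1: ", sum(len(line) for line in delta), "bytes")
```

The delta is a tiny fraction of the full file, which is why chains of deltas keep text-heavy repos small.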

That mechanism doesn't work as well for binary files, which tend to differ quite a bit even after a "small" change.
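A toy demonstration of why (again, not how Git actually computes deltas): change a single character in a document, run both versions through a compressor like the ones inside image or archive formats, and the compressed outputs diverge from roughly the change point onward, leaving little for a delta to share.

```python
import zlib

# Two versions of a document that differ by a single character in the middle.
base = bytearray(b"the quick brown fox jumps over the lazy dog. " * 200)
edit = bytearray(base)
edit[len(edit) // 2] = ord("X")

# Compress both, roughly what an image/archive/office format does internally.
comp_base = zlib.compress(bytes(base))
comp_edit = zlib.compress(bytes(edit))

# Count how long the compressed outputs stay byte-for-byte identical; past
# the change, the streams diverge, so a delta between them stays large.
prefix = 0
for a, b in zip(comp_base, comp_edit):
    if a != b:
        break
    prefix += 1

print(f"compressed sizes: {len(comp_base)} vs {len(comp_edit)} bytes")
print(f"identical leading bytes: {prefix}")
```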


History

When you check in a file, you have it forever and ever. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

This of course isn't unique to Git, but being a DVCS makes the consequences more significant.

And while it's possible to remove files, Git's content-based design (each object id is a SHA of its content) makes removing those files difficult, invasive, and destructive to history. In contrast, I can delete a crufty binary from an artifact repo, or an S3 bucket, without affecting the rest of my content.
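To make that coupling concrete: a blob's id in the classic SHA-1 object format is a hash over a small header plus the file's bytes, so changing or removing content means a different id, and every tree and commit that referenced the old id has to be rewritten as well. A small sketch, assuming the SHA-1 format:

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Compute a Git blob id: SHA-1 over a "blob <size>\\0" header plus the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The id is derived purely from the content, so dropping a file from history
# means rewriting every commit that can reach it - all the ids above it change.
print(blob_id(b"hello world\n"))  # should match `git hash-object --stdin` for the same bytes
```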


Difficulty

Working with very large files requires a lot of careful work, to make sure you minimize your operations and never load the whole thing into memory. This is extremely difficult to do reliably when creating a program with as complex a feature set as Git.
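A small sketch of the kind of care involved: process the file in fixed-size chunks so memory use stays constant no matter how big the file is (the function name and chunk size here are purely illustrative).

```python
import hashlib

def hash_file_streaming(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash an arbitrarily large file while holding only one chunk in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

Every operation that touches file content - diffing, packing, checkout - has to be written with this same discipline for large files not to blow up memory use.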


Conclusion

Ultimately, developers who say "don't put large files in Git" sound a bit like those who say "don't put large files in databases". They don't like it, but the alternatives have disadvantages (Git integration in the one case, ACID compliance and foreign keys in the other). In reality, it usually works okay, especially if you have enough memory.

It just doesn't work as well as what it was designed for.

