@dbuenzli
Last active August 22, 2022 07:38
OCaml compiler support for library linking
@gasche

gasche commented Nov 12, 2019

Executables are also made to embed the library names that have been fully linked into them when they are produced with the -linkall flag. This allows the Dynlink API to avoid reloading the library dependencies it already contains.

I don't understand how that would work. Is Dynlink supposed to locate the current executable and somehow extract the metadata? (It sounds painful to do on an ELF binary, etc.). Or is the idea to have some Sys.loaded_archives variable that gives the list of linked archives, and is populated by the runtime like Sys.argv?

@gasche

gasche commented Nov 12, 2019

lib_requires : string list

I would suggest lib_requires : Misc.library_name list.

@gasche

gasche commented Nov 12, 2019

Note that directories of dependent libraries are not added to the includes.

This idea, introduced by dune for interesting reasons, (1) does not necessarily work well with the compiler today, and (2) may create usability issues for interfaces that mention types coming from dependent libraries. It has upsides and downsides and is probably worth further discussion, as well as written justifications/explanations somewhere (in an appendix?).

@dbuenzli
Author

dbuenzli commented Nov 12, 2019

Thanks @gasche for the read.

I don't understand how that would work. Is Dynlink supposed to locate the current executable and somehow extract the metadata? (It sounds painful to do on an ELF binary, etc.).

That's just introduced via a special symbol in the executable. Nothing new is being invented: this is already done at the module level for CRC checking via the caml_global_map value, so I don't think it poses any problem. Here is the code that embeds the CRCs at link time. This is then retrieved by the Dynlink module via caml_natdynlink_getmap and used in various ways by the native dynlink API here. A similar strategy can be adopted for libs.
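A minimal sketch of how such an embedded list could be consumed, assuming a hypothetical `linked_libs` value that stands in for whatever the special symbol would expose. None of the names below belong to the actual Dynlink API; they only illustrate the filtering step.

```ocaml
module String_set = Set.Make (String)

(* Hypothetical: names of the libraries fully linked into the executable,
   as they might be exposed through an embedded symbol. *)
let linked_libs = String_set.of_list [ "unix"; "str" ]

(* Given the libraries a plugin requires (in dependency order), keep only
   those that still have to be loaded. *)
let libs_to_load ~already_linked requires =
  List.filter (fun lib -> not (String_set.mem lib already_linked)) requires

let to_load =
  libs_to_load ~already_linked:linked_libs [ "unix"; "fmt"; "str"; "logs" ]
```

Here `to_load` keeps only the libraries that are not already part of the executable, preserving their relative order.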

This idea, introduced by dune for interesting reasons, (1) does not necessarily work well with the compiler today, and (2) may create usability issues for interfaces that mention types coming from dependent libraries.

I don't think it was introduced by dune. It is simply the correct way of handling dependencies. If a library L mentions types that come from dependent libraries in its API, then L didn't make its library usage abstract; formally, as a client of L, you also depend on the other libraries it uses in its API and you have to specify them in your root dependencies.

Note that this is quite important in practice: I can't count the number of dependent packages that fail when I remove the result compatibility package from mine, because these dependent packages get their result dependency via my package rather than specifying it themselves.
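The under-specification described here can be sketched as a simple check. The `pkg` record and its fields are hypothetical and only model the situation; they are not part of the proposal.

```ocaml
(* Hypothetical package description: [requires] lists the declared root
   dependencies, [uses] the libraries whose modules the code refers to. *)
type pkg = { name : string; requires : string list; uses : string list }

(* Libraries that are used directly but not declared. *)
let underspecified p =
  List.filter (fun lib -> not (List.mem lib p.requires)) p.uses

(* A client that gets [result] only through [mypkg]. *)
let client =
  { name = "client"; requires = [ "mypkg" ]; uses = [ "mypkg"; "result" ] }

let missing = underspecified client
```

With non-recursive includes, a client like this one fails early instead of breaking later when `mypkg` drops its own `result` dependency.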

@gasche

gasche commented Nov 13, 2019

We discussed this at an OCaml development meeting today. There is broad support for the proposal, although we did not discuss the detailed aspects in-depth. One point that was raised especially by @lpw25 is that it would be interesting to store type-checking package dependencies in .cmi files as well as .cma, so that we can keep doing the type-checking by reading only the .cmi files. This matters quite a bit to @alainfrisch who is perpetually afraid of the terrible Windows filesystem performance, and its impact on build time.

@dbuenzli
Author

dbuenzli commented Nov 13, 2019

One point that was raised especially by @lpw25 is that it would be interesting to store type-checking package dependencies in .cmi files as well as .cma, so that we can keep doing the type-checking by reading only the .cmi files

If that's related to the point of "recursive include dependencies" then I'm not very fond of this idea. I would rather have the information whether and which recursive library dependencies are needed by a cmi because they export some aspect of them. That way graceful error messages can be produced if they are not specified on the cli while allowing clients to specify their root dependencies correctly.

@gasche

gasche commented Nov 13, 2019

I must say that I don't remember enough of the details to construct an argument for why one should store package dependencies in .cmi files as well. Indeed, if you are not planning to recursively add those dependencies at type-checking time, then they are not required for type-checking (but then having them gives the possibility of adding them recursively if you want?). Maybe @lpw25 or @alainfrisch can comment.

@alainfrisch

This matters quite a bit to @alainfrisch who is perpetually afraid of the terrible Windows filesystem performance, and its impact on build time.

I don't think I even mentioned concerns with filesystem performance (but yes, I'm concerned with that as well). I have several other concerns with having the type-checker load .cma files:

  • It would be weird that the type-checker in ocamlc and the one in ocamlopt read different files (.cma/.cmxa) and thus possibly behave differently. (And which one would Merlin use?)

  • Having the type-checker depend on .cma means that modifying any .ml file in the library (not even an .mli) means that all client modules of the library must be recompiled (even in bytecode), which is bad for incremental compilation. Libraries are already bad in terms of fine-grained dependency tracking (they force relinking all programs that depend on a library, even though they don't even depend on the module that has been modified), but that would be much worse.

And indeed, what we discussed was implicitly adding transitive dependencies to the load path, so that if a module A (in some library) exposes a type alias type t = int and another module B (in another library) exposes a function returning A.t, the client of B "knows" about the type alias. So, as far as I understood the discussion, the idea was to add information (preferably to B.cmi) about the dependency on the first library. But this implicitly ties modules to libraries, which might be too restrictive.
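The A/B situation can be written down as follows. This sketch collapses both units into one file, so it only illustrates the typing question, not the load-path behaviour; the function name `make` is made up for the example.

```ocaml
(* Stand-ins for two compilation units living in two different libraries. *)
module A = struct
  type t = int
end

module B : sig
  val make : unit -> A.t
end = struct
  let make () = 42
end

(* Client code of B: this addition only type-checks because the equation
   A.t = int is visible to the client. *)
let client_result = B.make () + 1
```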

Disclaimer: I did not read the discussion above (before @gasche's note, which mentions me).

@dbuenzli
Author

And indeed, what we discussed was implicitly adding transitive dependencies to the load path, so that if a module A (in some library) exposes a type alias type t = int and another module B (in another library) exposes a function returning A.t, the client of B "knows" about the type alias.

That's precisely what I'm against because it leads to library dependency under-specification (see the last two paragraphs of that comment). However, I think the library names (no archive lookup) that are needed to import the equations for using B.cmi could be embedded in the cmis, so that good error messages can be reported if one tries to compile against module B without also specifying the name of the library that has the needed A.

@alainfrisch

Thanks @dbuenzli for clarifying that, and for pointing to the relevant part of the discussion.

@lpw25

lpw25 commented Nov 14, 2019

so that if a module A (in some library) exposes a type alias type t = int and another module B (in another library) exposes a function returning A.t, the client of B "knows" about the type alias.

I would like to point out that removing accesses to aliases like this is not really supported. The type-checker would be perfectly justified in storing all the references to A.t as int in which case the client of B would already "see" the alias.

I think that forcing users to express transitive dependencies that they do not directly use is a mistake. I think it essentially breaks modularity and reveals implementation details of the type checker to the user. I should be able to add a new dependency to my library without possibly forcing all downstream users to update their dependencies (with whether they have to potentially depending on the whims of type inference).

@alainfrisch

From the discussion yesterday (and please correct me if I didn't understand correctly), this new system would not be used by Dune itself for managing dependencies between libraries in a single project. It would be used by people calling the compiler from the command line manually, or from a non-Dune build system, or by Dune when using installed libraries (not under its control).

In order to explore the design space, I'd like to understand the benefits/drawbacks of the system compared to something where .cma/.cmxa files disappear from the picture completely, replaced by "source (manifest) files" that specify the library content and dependencies. Imagine that MYLIB/lib.txt is a text file that lists the units which form the library (say, a list of capitalized module names), the names of dependent libraries, and a few other metadata attached to the library (C libraries, etc). Of course, .cmo files would now need to be installed. In essence, this lib.txt has the same information as .cma files, except that it doesn't copy the content of .cmo files but only points to them.
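To make the idea concrete, here is a small sketch of what reading such a manifest could look like. The `key: value ...` line syntax and the key names `units`, `requires` and `c-libs` are invented for this example; the comment above does not fix any concrete format.

```ocaml
type manifest = {
  units : string list;     (* capitalized unit names, in link order *)
  requires : string list;  (* names of dependent libraries *)
  c_libs : string list;    (* C libraries to link against *)
}

let empty = { units = []; requires = []; c_libs = [] }

(* Parse one "key: value value ..." line (hypothetical syntax); lines
   without a colon and unknown keys are ignored. *)
let parse_line m line =
  match String.index_opt line ':' with
  | None -> m
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let rest = String.sub line (i + 1) (String.length line - i - 1) in
    let values =
      List.filter (fun s -> s <> "") (String.split_on_char ' ' (String.trim rest))
    in
    match key with
    | "units" -> { m with units = m.units @ values }
    | "requires" -> { m with requires = m.requires @ values }
    | "c-libs" -> { m with c_libs = m.c_libs @ values }
    | _ -> m

let parse text =
  List.fold_left parse_line empty (String.split_on_char '\n' text)

let example = parse "units: Foo Bar\nrequires: unix\nc-libs: -lz\n"
```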

For the end-user, the behavior would be the same, I believe. But I think this brings some advantages:

  • The library file gives a declarative and more transparent definition of the installed library, instead of it being specified through the build system of the library. The files could be considered as source code, or created by the build system (or perhaps even by the compiler from a command-line spec, as is currently the case for .cma files, at least to help with the transition). We could imagine tools making use of the library manifest files before the library is even built; examples: (1) a very simple build tool (perhaps the compiler itself?) which builds and installs the library only based on the manifest file; (2) a connection with the package manager, to download/install dependent libraries before building the library (similar to the npm workflow I guess).

  • We avoid the library link step when building the library, which reduces a bit I/O traffic and disk usage.

  • A clever build system can now see inside the library, and do finer-grained dependency tracking. So Dune would not need to re-link an executable if the only modules in a dependent library that have changed are not actually dependencies of the executable. This is perhaps not a common situation, but this could happen if you are developing a "standalone" library, repeatedly re-installing it, and trying to rebuild one or several Dune projects that depend on it.

  • The text file could be read by the compiler during type-checking, to know the list of interfaces exposed by the library (i.e. .cmi files that can be loaded), instead of reading the file system. (This is related to the PR about adding individual .cmi instead of -I directories to the load path.) A step further could allow specifying multiple interfaces for the same library (different subsets of interface); this suggests that one could distinguish the file that describes the library (for linking purposes) from the file(s) that describe one or several interfaces for the library (for type-checking purposes).

  • This reduces dependencies on the native (C) toolchain concepts and tools (namely libraries and ar).
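The finer-grained dependency tracking point in the list above boils down to a set intersection. The following is a sketch under the assumption that the build system knows which units an executable actually links; `needs_relink` is a made-up name, not something Dune provides.

```ocaml
module String_set = Set.Make (String)

(* [needs_relink ~exe_units ~changed] is true when at least one changed unit
   of the library is among the units the executable actually links. *)
let needs_relink ~exe_units ~changed =
  not (String_set.is_empty (String_set.inter exe_units changed))

let exe_units = String_set.of_list [ "Foo"; "Bar" ]

(* Only Baz changed: the executable does not use it. *)
let relink_a = needs_relink ~exe_units ~changed:(String_set.of_list [ "Baz" ])

(* Bar changed as well: the executable must be relinked. *)
let relink_b =
  needs_relink ~exe_units ~changed:(String_set.of_list [ "Bar"; "Baz" ])
```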

A drawback is that more files need to be installed and copied around to make the library usable, but we already need to deploy all/most .cmi/.cmx files anyway; also deploying .cmo/.o files doesn't change this picture much.

Those library manifest files are quite similar to "response" files, with some differences: (1) units are referred to by unit names, not file names (so we don't need to maintain two versions with foo.cmo and foo.cmx); (2) the compilation unit names they mention are interpreted relative to the manifest file itself; (3) units are linked only if needed; (4) files themselves need to be looked up in the load path (contrary to normal response files), and a given file is loaded only once. But perhaps those aspects could be supported directly: (1) allow referring to unit names, not file names at link stage -- when specified directly on the command line, do a lookup in the full load path; (2) add a new kind of response file (or change their interpretation) so that when a unit name is specified in a file, the lookup is restricted to the directory of that file (similarly, if C libraries are specified, they would be looked up relative to the file of the response file); (3) allow specifying "weak unit names" (on the command line or in response files), with the same semantics as units in libraries; (4) support load-path based lookup of response files.
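Points (1) and (2) can be sketched as a function from a unit name to an object file located next to the manifest. This assumes file names are the uncapitalized unit names; `unit_file` and the `backend` type are illustrative only.

```ocaml
type backend = Bytecode | Native

(* Object file for a unit name, looked up next to the manifest that names
   it. One manifest entry serves both backends. *)
let unit_file ~manifest ~backend unit_name =
  let ext = match backend with Bytecode -> ".cmo" | Native -> ".cmx" in
  let base = String.uncapitalize_ascii unit_name ^ ext in
  Filename.concat (Filename.dirname manifest) base

let cmo = unit_file ~manifest:"MYLIB/lib.txt" ~backend:Bytecode "Foo"
let cmx = unit_file ~manifest:"MYLIB/lib.txt" ~backend:Native "Foo"
```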

@lpw25

lpw25 commented Nov 14, 2019

(All that said -- this proposal is not changing things in this respect, so I don't mind if we fix the problem later by adding automatic importing of transitive dependencies).

@lpw25

lpw25 commented Nov 14, 2019

If you're going to drop the .cma files, why do you even need the text file?

The library contents are already represented by the contents of the directory, and the .cmo and .cmx files know what the required dependencies are. See my namespaces proposal for details.

@gasche

gasche commented Nov 14, 2019

I think it essentially breaks modularity and reveals implementation details of the type checker to the user. I should be able to add a new dependency to my library without possibly forcing all downstream users to update their dependencies.

I think that the behavior that the authors of this proposal expect is the following:

  • if my library does not mention the new dependency in its interface, users don't need to add the dependency and everything works as expected
  • if my library does mention the new dependency in its interface, but it is missing from the include path, then types and signatures from that interface are handled by the type-checker as abstract types and abstract interfaces
  • if my library does mention the new dependency and downstream user code depends on the definition of its types and signatures, then they do need to add the new dependency

I agree with @dbuenzli that this model has benefits, and I disagree with @lpw25 that it breaks abstraction, assuming users need to add the downstream dependency only if their own code needs more information about it than abstract types.
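The three cases listed above can be summarised as a small decision function. This is only a restatement of the expected model; the constructor names are invented for the example.

```ocaml
type usage =
  | Not_in_interface       (* dependency only used in implementations *)
  | In_interface_abstract  (* mentioned in the interface; the client is
                              fine with abstract types and signatures *)
  | In_interface_concrete  (* the client relies on the definitions of its
                              types or signatures *)

(* Whether the client must list the new dependency among its own. *)
let client_must_add_dependency = function
  | Not_in_interface | In_interface_abstract -> false
  | In_interface_concrete -> true
```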

At the same time, I have strong doubts that the type-checker currently allows this to work flawlessly in all cases. (We know a bit about this because Dune tried to use this model before and it broke in various ways in corner-cases. @lpw25 and @Octachron have looked for example at ocaml/ocaml#8779). From a language perspective, it looks like a reasonable model and in fact a fairly good model (missing .cmis are just "free module/unit variables", with abstract type-level components and no value-level components), but our implementation probably isn't quite there yet.

@lpw25

lpw25 commented Nov 14, 2019

The issue is that "mentions in its interface" is not actually well-defined, and neither is "handled by the type-checker as abstract types". This is a fundamental aspect of OCaml's design -- you cannot remove an equality without specifying a complete interface. It sort of happens to sometimes work at the moment -- and this proposal doesn't do anything to change that -- but encouraging people to rely on the current behaviour is a mistake.

The need to read transitive .cmi files is really just an optimisation in the current compiler implementation -- it would be perfectly possible to avoid it by just expanding type aliases, module aliases and module type aliases. We should avoid making the behaviour of the system dependent on whether this optimisation is being applied.
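A toy model of the expansion mentioned here: if every known alias is replaced by its definition when a signature is saved, the saved signature no longer refers to the aliased name. The `ty` type below is a deliberately tiny stand-in and has nothing to do with the compiler's real type representation.

```ocaml
(* A toy type language: base types, named types (possibly aliases), arrows. *)
type ty =
  | Base of string
  | Named of string  (* e.g. "A.t" *)
  | Arrow of ty * ty

(* Aliases known at the time the library is compiled. *)
let aliases = [ ("A.t", Base "int") ]

(* Replace every known alias by its definition. *)
let rec expand = function
  | Base _ as t -> t
  | Named n as t ->
    (match List.assoc_opt n aliases with
     | Some def -> expand def
     | None -> t)
  | Arrow (a, b) -> Arrow (expand a, expand b)

(* The type of a function from B's interface, before and after expansion. *)
let b_make = expand (Arrow (Base "unit", Named "A.t"))
```

After expansion a client of B sees `unit -> int` directly, which is why the alias would already be "visible" without reading A's interface.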

@gasche

gasche commented Nov 14, 2019

When we discussed this together, @Octachron suggested that we could open the .cmi of transitive dependencies in a degraded mode where type-level definitions are available, but term-level definitions are not (using a term variable or a variant/field from the module would be an error/warning). This is not "enough" from the point of view of proposal authors, who would like (at least) any definition (even type-level) that was not needed when type-checking my library to be hidden from the users of my library.

Your remark on the fact that transitive .cmi reading is "just an optimization" does not consider the fact that it allows my library users to do more than just rely on the definition of aliases used by my library. The intention of the authors of the proposal is precisely to restrict the extra visibility that it allows, whose use is arguably problematic. I think we should acknowledge that this is a reasonable feature wish.

Meta-level remark: I think that this particular question (the type-checking visibility of transitive dependencies) is a small sub-point of the proposal, and maybe we shouldn't get too distracted with it when discussing the proposal as a whole. But I agree that it is controversial and needs to be discussed in detail.

@lpw25

lpw25 commented Nov 14, 2019

Your remark on the fact that transitive .cmi reading is "just an optimization" does not consider the fact that it allows my library users to do more than just rely on the definition of aliases used by my library. The intention of the authors of the proposal is precisely to restrict the extra visibility that it allows, whose use is arguably problematic.

Sorry, I was already assuming that we would remove that behaviour. @trefis and I have been planning to fix this for ages and he even wrote an RFC describing how to get rid of it. I forgot that he had not actually posted the RFC since ocaml/ocaml#9056 covered some of the same ground as his RFC.

The part of his proposal that is relevant to this discussion is that there should be a --hidden-cmi <file> option (alongside --cmi <file> as a per-file version of -I) for adding cmi files to the Path.t lookup without adding them to the Longident.t lookup -- essentially the degraded mode that you mention.
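The two kinds of lookup can be modelled roughly as follows. This is a sketch of the idea only: the association list and function names are invented, and the real Path.t and Longident.t machinery in the compiler is considerably more involved.

```ocaml
type visibility = Visible | Hidden

(* Hypothetical load path: unit name -> how its cmi was added. *)
let load_path = [ ("B", Visible); ("A", Hidden) ]

(* Lookup for names written by the user (Longident-style): hidden units
   are not found. *)
let lookup_user name =
  match List.assoc_opt name load_path with
  | Some Visible -> true
  | Some Hidden | None -> false

(* Lookup for already-resolved paths (Path-style), e.g. when expanding a
   type mentioned by B's interface: hidden units are found too. *)
let lookup_resolved name = List.mem_assoc name load_path
```

In this model user code cannot write `A.x`, but the type-checker can still follow a reference to `A.t` coming from B's interface.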

I agree that we should not get too side-tracked by this discussion -- the proposal here does not change things in this regard nor does it make it harder to fix these issues later.

@dbuenzli
Author

dbuenzli commented Nov 14, 2019

@gasche summarized exactly the model I want. I personally think it matches, at the library level, the notion of abstraction we have in the language.

This being said I'm all for changing the system in the long term along the various lines that are suggested here (e.g. the possible eventual removal of archive files), but I prefer if we avoid changing everything at the same time.

This proposal has the benefit that it mostly doesn't change anything, except re-encoding the current state of the world in a simpler, more obvious and formal manner and making the compiler aware of it.

This first step will then only make it easier to introduce gradual improvements without disturbing the eco-system -- for example namespacing with which this proposal is highly compatible and will only make it easier to introduce in my opinion. But I really think it's better if this first step which is rather big, not compiler-wise, but eco-system wise is done without trying to turn everything upside down.

Regarding the specific issue of recursive -I or not, if it's unclear then erring on the non-recursive side can only lead to over-specification rather than under-specification, which will then not break if it turns out recursive is what is needed (but I doubt it).

@alainfrisch

If you're going to drop the .cma files, why do you even need the text file?

  • The file provides library dependencies, which themselves give information on where to find dependent units. We could instead keep the library name, in addition to the unit name, for dependencies in each .cmo file.

  • Attaching various kinds of properties at the library level: dependencies to C libraries (we could also add them to .cmo/.cmx files), a per-library -linkall mode (if any of the objects in the library is used, link the entire library). (Possibly also attaching information on preprocessors to be applied when the library is used.)

  • Generally speaking, relying on explicit information (from the command line or files) rather than the mere presence of files on the filesystem is more robust, and allows detecting problems faster, esp. with parallel builds.

  • To support multiple library interfaces, we need to specify somewhere a list of units anyway (admittedly, this is to a large extent independent of .cma libraries).

  • The text files can also serve as specification for other tools (to compile the library itself, or download/install dependencies).

@alainfrisch

This being said I'm all for changing the system in the long term along the various lines that are suggested here (e.g. the possible eventual removal of archive files), but I prefer if we avoid changing everything at the same time.

It makes sense. I don't want to derail the project, and the proposal looks ok to me, even if it seems to me that the community direction is rather to push users towards Dune anyway, and even possibly a "mono-repo" approach (duniverse style); in that context, the user interface for OCaml is really Dune, and the current proposal doesn't bring much. But we are not there yet!

@dbuenzli
Author

Personally I don't care about dune and I think it's good if the OCaml compiler interface is good without assuming or needing a particular build system or a closed world mentality.

There are many different ways you may want to go about compiling OCaml, here's an alternate one for example.


ghost commented Nov 18, 2019

Regarding recursive include paths, it's not just about typing, it also impacts compilation and in particular optimisations such as inlining. I was discussing with @mshinwell recently who mentioned that the compiler not seeing some cmx files was a huge pain for flambda. And even without considering the middle-end, in the typer we might still want to carry the information that a type is immediate even if the user is not allowed to make assumptions about the type declaration.
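The immediacy point can be illustrated with the existing `[@@immediate]` attribute, which lets an interface keep a type abstract while still telling the compiler that its values are never boxed. The module `M` is a made-up example.

```ocaml
module M : sig
  type t [@@immediate]  (* abstract, but known to be an immediate value *)
  val zero : t
end = struct
  type t = int
  let zero = 0
end

(* Clients cannot rely on M.t = int, yet the runtime representation is
   still an unboxed integer. *)
let is_immediate = Obj.is_int (Obj.repr M.zero)
```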

That said, I concur regarding the benefits of not exposing transitive dependencies to the user. But it seems to me that @lpw25's idea is the most viable one.
