These days, I hear and read a lot about "open-source" models, in particular when it comes to LLMs. The truth of the matter is that, as far as I know, there is no actual open-source model out there. There are "open-weight" models, which is an entirely different thing.
The concept of open-source is that one can take the source code of a piece of software and study it, change it, and build the same thing or something different from it. Open-weight models are not the same. We have absolutely zero clue about what data went into the training sets, so we couldn't rebuild the weight matrices even if we wanted to.
You might ask: "Aren't the weights enough? You can customize the models from there..." to which I would reply "Meh, it depends..." Because, yes, we can definitely fine-tune models after the fact; however, not knowing what data went into the training means we will never know how the models are biased, or whether the training data reflects our needs. A good example is the extreme bias towards the English language: English is the language these models represent most efficiently, but is it what you need?
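One mechanical reason for this bias, shown with a rough sketch: common tokenizers are byte-level BPE schemes that start from UTF-8 bytes and learn merges from (mostly English) training text. Non-Latin scripts already pay more base bytes per character before any merges kick in, so the same sentence costs more tokens. The sentences below are illustrative placeholders, and bytes-per-character is only a crude proxy for true token counts:

```python
# Crude proxy for tokenizer cost: byte-level BPE tokenizers start from
# UTF-8 bytes, so non-ASCII scripts pay more base units per character
# before any learned merges (which are mostly learned from English text).

def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per character, a rough lower-bound proxy for token cost."""
    return len(text.encode("utf-8")) / len(text)

# Illustrative sentences (roughly the same meaning, different scripts).
english = "The patient shows no symptoms."
greek = "Ο ασθενής δεν εμφανίζει συμπτώματα."

print(f"English: {bytes_per_char(english):.2f} bytes/char")  # ASCII: 1 byte each
print(f"Greek:   {bytes_per_char(greek):.2f} bytes/char")    # Greek letters: 2 bytes each
```

A real comparison would use the actual tokenizer of the model in question, but the direction of the effect is the same: the further a script is from ASCII, and the rarer it is in the training data, the more tokens it burns.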
Also, have you ever wondered why we need strings of repeated emojis in the training set of an LLM? Unless you want the model to respond in kind, there is really no good reason to learn such a pattern. The more general the system, the easier it is to misuse it or to extract misinformation from it, and maybe you just need an LLM for a specific field of science, say medicine.