• 0 Posts
  • 5 Comments
Joined 2 months ago
Cake day: November 30th, 2024

  • pcalau12i@lemmygrad.ml to Open Source@lemmy.ml · Proton's biased article on Deepseek

    There is no “fundamentally” here; you are referring to an abstraction that doesn’t exist. The model’s weights are modified during the fine-tuning process, which trains it to adopt DeepSeek R1’s reasoning technique. You are acting like there is some “essence” underlying the model that is the same between the original Qwen and this model. There isn’t. It is a hybrid and its own thing. There is no such thing as “base capability”; the model is not two separate pieces that can be judged independently. You can only evaluate the model as a whole. Your comment is bizarre to respond to because you are referring to non-existent abstractions rather than anything concretely real.

    The model is neither Qwen nor DeepSeek R1; it is DeepSeek R1 Qwen Distill, as the name says. It would be like saying it’s false advertising to call a mule a hybrid of a donkey and a horse because its “base capabilities” are those of a donkey, so it has nothing to do with horses and is really just a donkey at the end of the day. The statement is so bizarre I do not even know how to address it. It is a hybrid, its own distinct third thing. The model’s capabilities can only be judged as it exists, and those capabilities differ from both Qwen’s and the original DeepSeek R1’s, as actually scored by various metrics.

    Do you not know what fine-tuning is? It means actually adjusting the weights of the model, and it is the weights that define the model. This fine-tuning is done using DeepSeek R1’s outputs, meaning the weights are adjusted to take on R1’s capabilities. The model gains R1 capabilities at the expense of Qwen capabilities: DeepSeek R1 Qwen Distill performs better on reasoning tasks but worse than the baseline model on non-reasoning tasks. The weights literally carry information from both Qwen and R1 at the same time.
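
    To make that concrete, here is a minimal sketch (assuming PyTorch, with a toy linear model standing in for an LLM) of what “fine-tuning adjusts the weights” means: one optimizer step, and the weights that define the model are no longer the same weights.

    ```python
    import torch
    import torch.nn as nn

    # Toy stand-in for a language model; fine-tuning means updating its weights in place.
    model = nn.Linear(8, 2)
    before = model.weight.detach().clone()

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))

    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()  # a single fine-tuning step

    # The weights, and therefore the model, have changed.
    print(torch.equal(before, model.weight))  # False
    ```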

    Speaking of its “base capabilities” is a meaningless floating abstraction which cannot be empirically measured and doesn’t refer to anything concretely real. It only has its real, concrete capabilities, not some hypothetical imagined ones. You accuse them of “marketing” even though the model is literally free. All DeepSeek sells is compute to run models, and you can pay any company to run these distill models, so they have no financial incentive to mislead people about them.

    You genuinely are not making any coherent sense. You are insisting that a hybrid model which is objectively different, and which objectively scores and performs differently, should be given the exact same name, for reasons you cannot seem to articulate. It clearly needs a different name, and since it was created by using the DeepSeek R1 model’s distillation process to fine-tune Qwen, it makes sense to call it DeepSeek R1 Qwen Distill. Yet you insist this is lying and misrepresentation, that it has literally nothing to do with DeepSeek R1, and that it should just be called Qwen, as if it were the same model. It is not the same model: its weights are different (you can do a “diff” on the two model files if you don’t believe me!) and it performs differently on the same benchmarks.
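
    For what it’s worth, that “diff” is easy to do yourself. A minimal sketch (file names here are hypothetical; it assumes both checkpoints are saved as PyTorch state dicts with matching parameter names):

    ```python
    import torch

    # Hypothetical checkpoint paths for the two models being compared.
    base = torch.load("qwen_base.pt")            # original Qwen weights
    distill = torch.load("r1_qwen_distill.pt")   # DeepSeek R1 Qwen Distill weights

    # Report every parameter tensor whose values differ between the two files.
    for name in base:
        if name in distill and not torch.equal(base[name], distill[name]):
            delta = (base[name].float() - distill[name].float()).abs().mean().item()
            print(f"{name}: mean absolute difference = {delta:.6f}")
    ```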

    There is simply no rational reason to intentionally mislabel the model as just being Qwen with no relevance to DeepSeek R1. You yourself admitted that the weights are trained on R1 data, so they necessarily contain some R1 capabilities. If DeepSeek were lying and trying to hide that the distill models are based on Qwen and Llama, they would not have literally put that in the name for everyone to see, nor released a paper explaining exactly how those models were produced.

    It is clear to me that you and your friends here have some alternative agenda that makes you not want to label it correctly. DeepSeek is open about the distill models using Qwen and Llama, yet you want them to be closed and not reveal that they also used DeepSeek R1. The current name is perfectly fine, and pretending it is just a Qwen model (or a Llama model, for the other distilled versions) is straight-up misinformation; anyone who downloads the models and runs them will immediately see that they perform differently. They are hybrid models correctly called what they are: DeepSeek R1 Qwen Distill and DeepSeek R1 Llama Distill.


  • pcalau12i@lemmygrad.ml to Open Source@lemmy.ml · Proton's biased article on Deepseek

    The 1.5B/7B/8B/14B/32B/70B models are all officially DeepSeek R1 models; that is what DeepSeek themselves call them. It is DeepSeek themselves who produced those models, released them to the public, and gave them their names. And their names are correct: it is just factually false to say they are not DeepSeek R1 models. They are.

    The “R1” in the name means “reasoning version one,” because it does not just spit out an answer but reasons through it with an internal monologue. For example, here is a simple query I asked DeepSeek R1 14B:

    Me: can all the planets in the solar system fit between the earth and the moon?

    DeepSeek: Yes, all eight planets could theoretically be lined up along the line connecting Earth and the Moon without overlapping. The combined length of their diameters (approximately 379,011 km) is slightly less than the average Earth-Moon distance (about 384,400 km), allowing them to fit if placed consecutively with no required spacing.

    However, on top of its answer, I can expand an option to see the internal monologue it went through before generating the answer (too long to paste here).
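
    You can sanity-check the arithmetic in its answer yourself. A quick check using rough equatorial diameters (approximate published values; Earth is excluded since it is one of the endpoints):

    ```python
    # Rough equatorial diameters in km (approximate values).
    diameters_km = {
        "Mercury": 4_879, "Venus": 12_104, "Mars": 6_779,
        "Jupiter": 139_820, "Saturn": 116_460,
        "Uranus": 50_724, "Neptune": 49_244,
    }

    total = sum(diameters_km.values())
    earth_moon_avg_km = 384_400  # average Earth-Moon distance

    print(total)                      # ~380,010 km
    print(total < earth_moon_avg_km)  # True: they fit, with a few thousand km to spare
    ```

    The small discrepancy with the model’s 379,011 km figure just comes down to which published diameter values you use.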

    What makes these consumer-oriented models different is that rather than being trained on raw data, they are trained on synthetic data from pre-existing models. That is what the “Qwen” or “Llama” part of the name refers to. The 7B model is trained on synthetic data produced by Qwen, so it is effectively a compressed version of Qwen. However, neither Qwen nor Llama can “reason”; they do not have an internal monologue.

    This is why it is just incorrect to claim that something like DeepSeek R1 7B Qwen Distill has no relevance to DeepSeek R1 and is just a Qwen model. If it were just a Qwen model, why can it do something that Qwen cannot do and only DeepSeek R1 can? Because, again, it is a DeepSeek R1 model: the R1 reasoning is added during the distillation process as part of its training. They use synthetic data generated by DeepSeek R1 to fine-tune it, readjusting its parameters so that it adopts a similar reasoning style. It is objectively a new model because it performs better on reasoning tasks than a plain Qwen model. It cannot be considered solely a Qwen model nor solely an R1 model, because its parameters contain information from both.
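
    In rough outline, the distillation pipeline looks like this (the model IDs are illustrative and the training loop is elided; this is a sketch of the shape of the process, not DeepSeek’s exact recipe):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 1: the teacher (DeepSeek R1) generates reasoning traces for a set of prompts.
    teacher_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
    teacher = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1")

    prompt = "Can all the planets in the solar system fit between the Earth and the Moon?"
    inputs = teacher_tok(prompt, return_tensors="pt")
    trace = teacher.generate(**inputs, max_new_tokens=512)
    synthetic_example = teacher_tok.decode(trace[0])  # prompt + internal monologue + answer

    # Step 2: the student (a Qwen base model) is fine-tuned on many such traces,
    # adjusting its own weights toward R1's reasoning style. (Training loop elided.)
    student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
    ```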


  • Isn’t the quantum communication (if it were possible) supposed to be actually instantaneous, not just “nearly instantaneous”?

    There is no instantaneous information transfer (“nonlocality”) in quantum mechanics. You can prove this with the No-communication Theorem. Quantum theory is a statistical theory, so predictions are made in terms of probabilities, and the No-communication Theorem is a relatively simple proof that no physical interaction with one particle of an entangled pair can alter the probabilities of the other particle it is entangled with.

    (It’s actually a bit broader than this: it shows that no interaction with one particle of an entangled pair can alter the reduced density matrix of the other. The density matrix captures not only the probabilities but also the particle’s ability to exhibit interference effects.)
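
    You can see the theorem’s content directly in a small numeric example (a minimal sketch with NumPy): prepare a Bell pair, act on particle A with any local unitary you like, and B’s reduced density matrix does not budge.

    ```python
    import numpy as np

    # Bell state |Φ+> = (|00> + |11>)/√2 on qubits A (first) and B (second).
    psi = np.array([1, 0, 0, 1]) / np.sqrt(2)
    rho = np.outer(psi, psi.conj())

    def reduced_B(rho):
        # Partial trace over qubit A; basis index is (a, b) -> 2*a + b.
        r = rho.reshape(2, 2, 2, 2)     # r[a, b, a', b']
        return np.einsum('abad->bd', r)  # sum over a = a'

    # Any local operation on A: here an arbitrary rotation, applied as U ⊗ I.
    theta = 0.7
    U = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    UA = np.kron(U, np.eye(2))
    rho_after = UA @ rho @ UA.conj().T

    print(reduced_B(rho))                                     # I/2: maximally mixed
    print(np.allclose(reduced_B(rho), reduced_B(rho_after)))  # True: B is unaffected
    ```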

    The speed of light limit is a fundamental property of special relativity, so if quantum theory violated this limit it would be incompatible with special relativity. Yet it is compatible, and the two have been unified under the framework of quantum field theory.

    There are two main confusions as to why people falsely think there is anything nonlocal in quantum theory, stemming from Bell’s theorem and the EPR paradox. I tried to briefly summarize these two in this article here. But to even more briefly summarize…

    People falsely think Bell’s theorem proves there is “nonlocality,” but it only proves that nonlocality would follow if you replaced quantum theory with a hidden variable theory. It is important to stress that quantum theory is not a hidden variable theory, so there is nothing nonlocal about it and Bell’s theorem simply is not applicable.
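
    A quick numeric check makes the distinction clear. This is a sketch of the standard CHSH setup, where E(x, y) = −cos(x − y) is quantum theory’s prediction for spin correlations on a singlet pair at measurement angles x and y:

    ```python
    import numpy as np

    # Quantum prediction for singlet-state correlations at angles x, y.
    def E(x, y):
        return -np.cos(x - y)

    # Standard CHSH angle choices.
    a, a2 = 0.0, np.pi / 2
    b, b2 = np.pi / 4, 3 * np.pi / 4

    S = E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2)
    print(abs(S))  # ≈ 2.828 = 2√2, exceeding 2
    ```

    The bound of 2 that gets violated belongs to local hidden variable models; quantum theory itself predicts 2√2 with no nonlocal mechanism required.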

    The EPR paradox is more of a philosophical argument that equates eigenstates with the ontology of the system, an equation that leads to the appearance of nonlocal action, but only because the assumption is a bad one. Relational quantum mechanics, for example, makes a different assumption about the relationship between the mathematics and the ontology of the system and does not run into this.


  • There is a strange phenomenon in academia of physicists so distraught over the fact that quantum mechanics is probabilistic that they invent a whole multiverse to get around it.

    Let’s say a photon hits a beam splitter and has a 25% chance of being reflected and a 75% chance of passing through. You could make this prediction deterministic if you claim the universe branches off into a grand multiverse where in 25% of the branches the photon is reflected and in 75% of the branches it passes through. The multiverse would branch off in this way with the same structure every single time, guaranteed.

    Believe it or not, while they are a minority, there are quite a few academics who unironically promote this idea just because they like that it restores determinism to the equations. One of them is David Deutsch who, to my knowledge, was the first to publish a paper arguing that quantum computers delegate subtasks to branches of the multiverse.

    It’s just not true at all that the quantum chip gives any evidence for the multiverse, because believing in the multiverse does not yield any new predictions. No one who proposes this multiverse view (called the Many-Worlds Interpretation) actually believes the other branches of the multiverse would be detectable. It is something purely philosophical, intended to restore determinism, so there is no test you could do to confirm it. If you believe the outcomes of experiments are just random and there is one universe, you would also predict that we can build quantum computers, so the invention of quantum computers in no way proves a multiverse.


  • It does not lend credence to the notion at all; that statement doesn’t even make sense. Quantum computing is in line with the predictions of quantum mechanics. It is not new physics; it is engineering, the implementation of physics we already know in order to build things, so it does not even make sense to suggest that engineering something “discovers” something fundamentally new about nature.

    MWI is just a philosophical worldview held by people who dislike that quantum theory is random. Outcomes of experiments are nondeterministic. Bell’s theorem proves you cannot simply interpret the nondeterminism as underlying deterministic chaos: any attempt to introduce determinism behind the outcomes would violate other known laws of physics, so you have to just accept that it is nondeterministic.

    MWI proponents, who really dislike nondeterminism (for reasons I don’t particularly understand), came up with a “clever” workaround. Rather than interpreting probability distributions as just that, probability distributions, you interpret them as physical objects in an enormously high-dimensional space. Let’s say I flip two coins, so the possible outcomes are HH, HT, TH, and TT, and each outcome can be assigned a probability value. Rather than interpreting those values as the likelihood of events occurring, you interpret the “faceness” property of the pair of coins as a multi-dimensional property that is physically “stretched” across four dimensions, where the amount of “stretch” in each dimension is given by those values. For example, if the probabilities are 25% HH, 0% HT, 25% TH, and 50% TT, you interpret the coins’ “faceness” as physically stretched out across four physical dimensions by 0.25 HH, 0 HT, 0.25 TH, and 0.5 TT, as the sketch below illustrates.
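
    Here is that two-coin example as a state vector (a minimal sketch with NumPy). The disagreement is over what this vector is: a bookkeeping device for probabilities, or, per MWI, a physically extended object.

    ```python
    import numpy as np

    # Amplitudes chosen so the outcome probabilities are 25% HH, 0% HT, 25% TH, 50% TT.
    amplitudes = np.array([np.sqrt(0.25), 0.0, np.sqrt(0.25), np.sqrt(0.5)])  # order: HH, HT, TH, TT

    probabilities = np.abs(amplitudes) ** 2
    print(dict(zip(["HH", "HT", "TH", "TT"], probabilities)))  # {'HH': 0.25, 'HT': 0.0, 'TH': 0.25, 'TT': 0.5}
    print(np.isclose(probabilities.sum(), 1.0))  # True: one normalized point in a 4-dimensional space
    ```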

    Of course, in real quantum mechanics it gets even more complicated than this, because probability amplitudes are complex-valued, which adds another degree of freedom: the two-coin case would be an eight-dimensional physical space in which the “quantum coins” (electron spin states, say) are stretched out. Notice also that the number of dimensions depends on the number of possible outcomes, which grows as 2^N with the number of coins under consideration. MWI proponents thus posit that each description like this is just a limited one, from a limited perspective. In reality, the dimension of this physical space would be 2^N where N is the number of two-state systems in the entire universe, so basically infinite. The whole universe is a single giant infinite-dimensional object propagating through this infinite-dimensional space, something they call the “universal wave function.”

    If you believe this, then it kind of restores determinism. If there is a 50% probability a photon will reflect off of a beam splitter and a 50% probability it will pass through, what MWI argues is that there is in fact a 100% chance it will both pass through and be reflected simultaneously, because it is stretched out in proportions of 0.5 in both directions. When the observer goes to observe it, the observer themselves also gets stretched out in those proportions, simultaneously seeing it pass through and seeing it be reflected. Since this outcome is guaranteed, it is deterministic.

    But why do we only perceive a single outcome? MWI proponents chalk it up to how our consciousness interprets the world: it forms models from a limited perspective, and these perspectives become separated from each other in the universal wave function during a process known as decoherence. This creates the illusion that only a single perspective can be seen at a time. Even though the human observer is actually stretched out across all possible outcomes, they believe they perceive only one of them at a time, and which one we settle on is random, I guess kind of like the blue-black/white-gold dress thing: your brain just picks one, but the randomness is apparent rather than real.

    This whole story is simply not necessary if you are fine with saying the outcome is random, and there is nothing about quantum computers that changes it. Crazy David has a bad habit of publishing embarrassingly bad papers in favor of MWI. In one paper he defends MWI with a false dichotomy, pitching it as if its only competition were Copenhagen, and then straw-mans Copenhagen by equating it with an objective collapse model, a characterization that no supporter of that interpretation I am aware of would ever accept.

    In another paper, where he brings up quantum computing, he basically argues that MWI must be right because it gives a more intuitive understanding of how quantum computing provides an advantage: it delegates subtasks to different branches of the multiverse. It is bizarre to me how anyone could think that something being “intuitive” (and it is debatable whether it even is) counts as evidence in its favor. At best it is an argument for utility: if you personally find MWI intuitive (I don’t) and it helps you solve problems, then have at ya, but pretending this is somehow evidence that there really is a multiverse makes no sense.