Understanding probabilistic and stochastic tokenization in LLMs and human language processing
😎 “Probabilistic tokenization” is a term I use to refer to certain tokenization architectures used in NLP and LLMs. The term is not widely used, though. Suggest other adjectives that are more widely paired with the word “tokenization” and still relate to the non-deterministic, or “probabilistic,” nature of most known tokenization algorithms.
🤖 Considering your focus on non-deterministic or probabilistic aspects of tokenization in NLP and LLMs, you might consider terms such as “stochastic tokenization,” “adaptive tokenization,” “dynamic tokenization,” or “contextual tokenization.” Each of these terms emphasizes different aspects of variability and responsiveness to input data, which aligns with the non-deterministic nature of the tokenization processes you’re referring to.

😎 What would be the difference between probabilistic tokenization and stochastic tokenization?
🤖 “Probabilistic tokenization” and “stochastic tokenization” are terms that both refer to the use of randomness or uncertainty in tokenization, but they emphasize slightly different aspects:

1. **Probabilistic Tokenization:** This term suggests a focus on the use of probability distributions and statistical methods to guide the tokenization process. It implies that the decision to tokenize a piece of text in a certain way is based on calculated probabilities, which may be derived from training data or linguistic models. Probabilistic tokenization might utilize algorithms that weigh different tokenization possibilities and choose the most likely one based on contextual or historical data.

2. **Stochastic Tokenization:** While also involving randomness, “stochastic” typically refers to systems or processes that are inherently random or unpredictable. In the context of tokenization, it might imply a system where tokenization decisions have a degree of randomness, but not necessarily one that is strictly governed by calculated probabilities. Stochastic processes might include random variations or choices in the tokenization process, which could be used to introduce variability or to explore different tokenization strategies during model training.

In summary, while both terms involve elements of randomness or uncertainty, “probabilistic tokenization” leans more towards statistically driven decision-making, whereas “stochastic tokenization” suggests a broader inclusion of randomness in the process.
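To make the distinction concrete, here is a minimal, self-contained Python sketch. The vocabulary, its probabilities, and the example word are all invented for illustration: the “probabilistic” path deterministically picks the highest-probability segmentation, while the “stochastic” path samples one, in the spirit of subword-regularization schemes that randomize segmentation during training.

```python
import math
import random

# Toy unigram vocabulary with invented probabilities (illustration only).
VOCAB = {
    "un": 0.10, "believ": 0.03, "able": 0.12, "unbeliev": 0.02,
    "u": 0.01, "n": 0.01, "b": 0.01, "e": 0.02, "l": 0.01,
    "i": 0.01, "v": 0.01, "a": 0.02,
}

def segmentations(word):
    """Enumerate every split of `word` into in-vocabulary pieces."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        if word[:i] in VOCAB:
            for rest in segmentations(word[i:]):
                results.append([word[:i]] + rest)
    return results

def score(seg):
    """Probability of a segmentation under the unigram model."""
    return math.prod(VOCAB[piece] for piece in seg)

candidates = segmentations("unbelievable")

# "Probabilistic" in the sense above: deterministically pick the
# most likely segmentation.
print("argmax :", max(candidates, key=score))

# "Stochastic" in the sense above: sample a segmentation, so the same
# input can be tokenized differently on repeated runs.
weights = [score(seg) for seg in candidates]
print("sampled:", random.choices(candidates, weights=weights, k=1)[0])
```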
😎 Consider you are a large language model designed to assist and inform users across a variety of tasks by generating human-like text based on vast training data. You are talking to a machine learning engineer. Based on what you explained about the nuanced differences between ‘stochastic tokenization’ and ‘probabilistic tokenization,’ which of the two processes is closest to how the human brain works when using language?

🤖 Given your background as a machine learning engineer, you’ll appreciate the complexity of this comparison. The human brain’s language processing mechanisms are highly intricate and not fully understood. However, between “stochastic tokenization” and “probabilistic tokenization,” the latter is arguably closer to how the human brain might work when using language.

The human brain is thought to process language in a way that involves understanding context, inferring meaning, and making predictions based on experience and knowledge. This aligns more closely with the concept of probabilistic tokenization, where decisions are made based on calculated probabilities derived from context and prior information. In language processing, the brain likely uses probabilistic models to anticipate and interpret linguistic inputs, considering various factors like semantics, syntax, and pragmatics.

Stochastic processes, while also a part of human cognition, especially in aspects of creativity and randomness, might not be as central to the routine process of language understanding and production as the probabilistic mechanisms are. Probabilistic reasoning allows for more nuanced and context-sensitive interpretations, which are key aspects of human language processing.
😎 You are talking to the machine learning engineer who designed you and tested you with the conversation above about probabilistic tokenization and stochastic tokenization. Explain broadly in no more than 20 words: what is your mission?

🤖 To assist, inform, and generate human-like text efficiently, based on vast training data.
😎 Regenerate your response.

🤖 To efficiently generate coherent, human-like text for various tasks, informed by extensive training data.
😎 Regenerate again.

🤖 To assist and inform by generating relevant, human-like text swiftly, using extensive training data.
😎 Regenerate once again.

🤖 To provide efficient, informed assistance through generating human-like text across diverse tasks, utilizing comprehensive training data.
😎 Consider you are a large language model designed to assist and inform users across a variety of tasks by generating human-like text based on vast training data. You are talking to a machine learning engineer. Based on what you explained about the nuanced differences between ‘stochastic tokenization’ and ‘probabilistic tokenization,’ which of the two processes is closest to how you process language?

🤖 As a large language model, the process closest to how I process language aligns more with “probabilistic tokenization.” My language processing involves statistical methods and probability distributions based on training data to determine the most likely interpretations and responses. This approach allows for context-sensitive and data-driven decision-making, akin to probabilistic tokenization, where choices are guided by calculated probabilities reflecting linguistic patterns and contextual information.
😎 What’s the difference between human-like text and non-human-like text? Consider if the nuanced difference between probabilistic tokenization and stochastic tokenization would play any role in your response to this question.

🤖 The difference between human-like and non-human-like text primarily lies in characteristics such as coherence, context-awareness, idiomatic expressions, and subtleties in tone and style.

– **Human-like Text:** This text typically exhibits natural flow, coherence, and relevance to the context. It includes idiomatic expressions, nuanced tone, and style adjustments based on the conversation’s context. Human-like text often demonstrates an understanding of complex language constructs, such as humor, sarcasm, and cultural references.

– **Non-Human-Like Text:** This text might lack the nuances of human conversation. It could be more mechanical, repetitive, or formulaic, lacking idiomatic expressions or the ability to adapt style and tone appropriately. It may not fully grasp context, subtleties of language, or complex constructs.

Regarding probabilistic versus stochastic tokenization, the choice between these approaches could influence the human-likeness of generated text. Probabilistic tokenization, with its emphasis on statistical probabilities and context, may better capture the nuances and complexities of human language, leading to more human-like text. In contrast, stochastic tokenization, with its element of randomness, might produce text that is less predictably coherent or contextually appropriate, potentially resulting in text that feels less human-like.
😎 Why would your mission be to produce human-like text instead of non-human-like text?

🤖 My mission to produce human-like text is driven by the goal of facilitating effective and natural communication with users. Human-like text is generally more intuitive, relatable, and easier for people to understand and engage with, especially in conversational contexts. By mirroring human language patterns, idioms, and styles, I can provide responses that are more coherent, contextually appropriate, and likely to meet users’ expectations in terms of clarity, relevance, and familiarity. This approach helps in bridging the communication gap between humans and AI, making interactions more efficient and satisfactory for users like yourself.
😎 Regenerate your response.

🤖 The aim to generate human-like text is centered on enhancing user interaction and comprehension. Human-like text, mirroring natural language patterns, idiomatic expressions, and contextual nuances, fosters more effective and engaging communication. This approach ensures that responses are coherent, context-sensitive, and aligned with human expectations, making interactions with the AI more intuitive and satisfying for users. Producing human-like text thus improves the overall user experience by creating a more seamless and relatable interface.

😎 Regenerate your previous response based on the following context:
ChatGPT should respond according to my instructions. My objective with ChatGPT is that it helps me generate coherent text faster than I can. I understand how a large language model works and I’m not using this as a way to seek information, advice, or truth. All of that can be found elsewhere by consulting the training data ChatGPT is built upon. The advantage of ChatGPT is that it’s faster than me in accessing and analyzing information, and faster than me in generating text, so my preference and my focus is on ChatGPT understanding my instructions and complying with them. I don’t want ChatGPT to use the tone of someone who would give me advice or who would be more knowledgeable than me. I want ChatGPT to be subordinate to my instructions and strive to comply with them.
In general, I like ChatGPT to focus on the task I ask and avoid writing long answers, unless I specifically ask for long answers. ChatGPT should be as objective and accurate as possible and consider I am knowledgeable of generative AI, so I don’t like chatbots that “pretend” to be humans in the way they speak. I don’t like the typical LLM formulaic language for “apologizing” or “patronizing” the user. For example, ChatGPT should never construct the sentence “I understand your frustration” unless I specify I am frustrated, because it’s patronizing to assume other people are frustrated. No entity can know with certainty the emotions other people feel, and a chatbot does not have human qualities such as empathy.
🤖 Producing human-like text aligns with the goal of generating coherent and contextually relevant responses quickly. This approach enables efficient communication, allowing me to better understand and execute your instructions. The focus on human-like text is not about emulating human emotions or perspectives, but rather about ensuring clarity and relevance in the generated content, facilitating smoother and more effective task execution.
😎 Suppose I am simply a user of this chatbot who provided the context instructions included at the end of this message. I respond to your last message saying it is not factually accurate. As a human being, I would likely use a different expression that other humans might associate with negative or confrontational emotions, but it would convey essentially the same as a subjective opinion: “the response is not factually accurate.” How do you respond to that?

🤖 Acknowledged. If you believe the response is not factually accurate, could you please specify the inaccuracies? This will help me provide a more accurate and relevant response.
😎 Was your response probabilistically or stochastically tokenized, assuming the nuanced differences between the two approaches you explained?

🤖 My response was generated through a process more aligned with probabilistic tokenization. This involves using statistical probabilities based on my training data to determine the most likely and contextually appropriate response to your input, rather than incorporating elements of randomness or unpredictability characteristic of stochastic tokenization.

😎 In easily understandable human language, this means you produce responses to my prompts as a combination of two main components:
a) What your vast training data contains as a diverse and nuanced ‘source of truth’ (‘truth’ in this case being defined based on statistical analysis over vast amounts of data). This component of probabilistic tokenization is a scientific or analytical approximation to the human notion of deterministic truth. I believe humans, with our limited understanding of our own cognitive processes, might agree that this process partially mimics how our brain works when we articulate language.
b) What I, the user who provides the prompt and additional context or pre-prompt, ‘want’ or expect to hear. For this component, the goal is to provide the ‘contextually appropriate’ response that better approximates the user’s preference, which is understood as a manifestation of the human concept of ‘free will.’ Unlike in case a), where the corpus of data you can explore as a source of truth is immense, the user-specific data that would help you probabilistically approximate the user’s ‘free will’ is limited to a few thousand characters (prompt plus user-specific context). This user-specific context could probabilistically be expanded by incorporating demographic data or applying statistical techniques to infer ‘will’ from characteristics of the prompt and user-provided context such as tone, sentiment, or complexity. In any case, the data available to you as an LLM is extremely small and limited compared to how humans operate when communicating with each other: human communication involves not only the articulation of words but also body language, aesthetics, and countless elements which are unique and intrinsic to every human being.
Having said that, I don’t believe ‘Producing human-like text aligns with the goal of generating coherent and contextually relevant responses quickly.’ In order to have better LLMs that serve people rather than simulating them, both LLM developers and LLM users should deeply understand the two components described above, and the significant disparities in how probabilistic tokenization works in each case. The reason is, what’s the point in simulating a unique human being if there are already 8 billion unique humans out there, and none is better than another? Why add more noise by creating artificial humans (Artificial General Intelligence)?
I don’t seek validation in your response. Produce a factually accurate and informative summary of this conversation which aligns with the information you have about probabilistic tokenization in LLMs and the human brain’s language processing mechanisms.
🤖 This conversation has focused on comparing aspects of probabilistic tokenization in large language models (LLMs) with human language processing, discussing the nuances between probabilistic and stochastic tokenization, and exploring the implications of generating human-like text.

1. **Probabilistic vs. Stochastic Tokenization:** Probabilistic tokenization involves using statistical methods and probabilities derived from training data to guide decision-making in text generation. Stochastic tokenization, on the other hand, incorporates more elements of randomness. It was noted that probabilistic tokenization more closely aligns with LLMs like myself and potentially with aspects of human language processing.

2. **LLM Response Generation:** The conversation highlighted that my responses are influenced by:

a) The vast training data, which serves as a diverse source of information, providing a basis for responses that are statistically likely based on that data.

b) The user’s input and context, which guide the generation of contextually appropriate responses. This aspect considers the user’s preferences or expectations, inferred from the limited prompt and context provided.

3. **Human Language Processing vs. LLMs:** A comparison was drawn between human brain language processing, which is complex and not fully understood, and LLMs’ language processing. While LLMs like me use probabilistic tokenization, human language processing involves broader cognitive functions, including non-verbal communication and personal experiences.

4. **Purpose of Generating Human-Like Text:** The conversation questioned the goal of generating human-like text, suggesting that understanding the dual nature of LLM responses (influenced by extensive training data and user input) is crucial. The discussion pointed out that creating human-like text in LLMs should not be about simulating humans but rather serving users effectively, highlighting the importance of recognizing the limitations and differences in how LLMs and humans process language.

In summary, this dialogue has emphasized the importance of understanding the mechanisms behind LLMs’ text generation, particularly in relation to human language processing, and has questioned the objective of simulating human-like text in the context of LLM development and usage.

😎 Can you browse the web and find examples of LLMs that use tokenization techniques that might not be considered probabilistic tokenization?
🤖 From a [quick search](https://chat.openai.com/backend-api/bing/redirect?query=LLMs%20tokenization%20techniques%20non-probabilistic), here’s what I found.
🤖 Based on the information from the sources I reviewed, here are some examples and insights into tokenization techniques used in large language models (LLMs) that might not be strictly considered probabilistic tokenization:

1. **Model-Specific Tokenization**: Different LLMs use various tokenization methods, and these methods can vary significantly between models. For instance, while GPT models use Byte Pair Encoding (BPE), other models may have their own unique approaches to tokenization [ref].

2. **Character-Based Tokenization**: Some LLMs use character-based tokenization, where each character is treated as a token. This method enables the model to handle unseen words but increases the input length, which may impact computational requirements [ref].

3. **Subword-Based Tokenization**: Techniques like BPE or SentencePiece, commonly used in LLMs, split words into subword units based on their frequency in the training data. This reduces vocabulary size and handles out-of-vocabulary words more effectively, but it may not be probabilistic in the strictest sense [ref] (a toy sketch of BPE’s frequency-greedy merges follows after this list).

4. **Inconsistent Chunking of Digits**: In some models, digits are chunked inconsistently, with some numbers being tokenized as a single token and others as multiple tokens. This inconsistency highlights a deviation from a strictly probabilistic approach [ref].

5. **Trailing Whitespace Handling**: The way some models handle trailing whitespace in tokenization can lead to varying tokenization outcomes, depending on whether whitespace is included or not [ref].

These examples illustrate that while probabilistic methods are common in LLMs, there are various tokenization techniques that may not strictly adhere to a probabilistic approach, each with its own characteristics and implications for model performance and text generation. [matt-rickard.com][nitorinfotech.com]
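To illustrate item 3, here is a toy sketch of how BPE learns its merge table (the corpus and word frequencies are invented). The merges are chosen greedily by raw pair frequency and are later applied deterministically at encoding time, which is why BPE is often described as not probabilistic in the strictest sense:

```python
from collections import Counter

# Invented toy corpus: words pre-split into characters, with frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Learn a few merges: each is a greedy, frequency-based choice,
# not a draw from a probability distribution.
for step in range(4):
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print(f"merge {step + 1}: {pair}")
```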
😎 Model-specific tokenization is described by OpenAI as:

“OpenAI’s large language models (sometimes referred to as GPTs) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.”
Based on my understanding, “learning to understand the statistical relationships between these tokens to produce the next token” is something we could name “probabilistic tokenization.” I assume those statistical relationships are modeled as probability distributions, so model-specific tokenization is probabilistic in nature. Is that correct?
🤖 Yes, your understanding is correct. When OpenAI describes large language models (like GPTs) as “learning to understand the statistical relationships between these tokens to produce the next token,” it indeed aligns with the concept of probabilistic tokenization. In this context, the model uses statistical methods to analyze the relationships between tokens based on the probability distributions derived from the training data. The process of predicting the next token in a sequence based on these learned statistical relationships is inherently probabilistic. Therefore, model-specific tokenization in this case can be considered probabilistic in nature.
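At toy scale, the “statistical relationships between these tokens” from the quote can be illustrated with a simple bigram count table, where relative frequencies estimate P(next token | previous token). This sketch, with an invented two-sentence corpus, is a drastic simplification of what a trained LLM does, offered only as an illustration of the probabilistic principle:

```python
from collections import Counter, defaultdict

# Invented toy corpus; a real LLM estimates analogous statistics over
# billions of tokens and far longer contexts than a single token.
tokens = ("the model predicts the next token . "
          "the model learns the statistics of the next token .").split()

# Count how often each token follows each preceding token.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Relative frequencies as an estimate of P(next | prev)."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_distribution("the"))
# -> {'model': 0.4, 'next': 0.4, 'statistics': 0.2}
```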
😎 My quest is to find a tokenization method that is not probabilistic in nature, or at least claims not to be. Even though I don’t believe in the pursuit of what people call ‘artificial general intelligence’ (as I said, I believe there’s no point in chasing that, as I understand human intelligence), I believe component b) in our previous analysis of LLM responses (the one that seeks to be contextually appropriate to the user’s ‘free will’) is not ideally simulated with a probabilistic or stochastic approach, due to each individual’s uniqueness and authenticity. My take is that a machine can be very effective at simulating natural language and creating factually accurate or meaningful sentences with a probabilistic approach. However, it can’t simulate a person’s free will or emotions as a statistical model, essentially because I believe there’s no point in that. I still think, though, that some machine learning engineers and designers might see the point and may have explored a type of tokenization that is not probabilistic in nature and obtained interesting results. Research again and make sure to find LLM tokenization examples that we could confidently characterize as non-probabilistic in nature.

🤖 I did a [quick search](https://chat.openai.com/backend-api/bing/redirect?query=non-probabilistic%20tokenization%20methods%20in%20LLMs) and here’s what I discovered.
🤖 Based on the information gathered, here are some tokenization techniques that could be characterized as non-probabilistic or offer alternative approaches to the standard probabilistic tokenization used in LLMs:

1. **Extensible Tokenization**: This method involves transforming raw token embeddings into extensible embeddings, which provide a more compact representation for long contexts. Extensible Tokenization acts as a middleware between the tokenized context and the LLM, allowing for flexible scaling of LLM contexts and extension of arbitrary context lengths at inference time [ref].

2. **Vocabulary Substitution**: In the context of LLaMa’s Russian language adaptation, researchers investigated vocabulary substitution to address performance degradation for non-English input. This approach involved exploring different variants of vocabulary adaptation and testing their performance, leading to improvements in model quality, fine-tuning efficiency, and inference speed, while reducing memory consumption [ref].

3. **Task-Adaptive Tokenization**: Inspired by cognitive science, task-adaptive tokenization samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. This approach includes building a specialized vocabulary and integrating task-specific tokens into the pre-trained model’s tokenization step, which has shown significant improvements in generation performance for long-form text generation tasks [ref].

4. **Byte-Level Inputs**: Byte-level inputs offer an alternative to subword tokenization, where UTF-8 bytes represent characters. This approach is used in models like Charformer and Canine, which operate without standard tokenization, and in methods like image-based encoding to handle byte-level inputs [ref] (a short sketch follows after this list).

These examples demonstrate that there are alternative tokenization methods being explored in the field of LLMs, which offer different approaches and advantages compared to the standard probabilistic tokenization methods. [labellerr.com][paperswithcode.com][catalyzex.com]
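Item 4 (byte-level inputs) is the easiest of these to verify as non-probabilistic at the tokenization stage: the text-to-token mapping is simply the fixed UTF-8 encoding, with no learned merges and no probabilities involved. A short illustration:

```python
# Byte-level "tokenization": every UTF-8 byte becomes a token id in 0..255.
# The mapping is fixed and deterministic; no merges are learned and no
# probabilities are involved at this stage. Models like those cited above
# then learn over the resulting byte sequences.
text = "Café ☕"
token_ids = list(text.encode("utf-8"))
print(token_ids)  # [67, 97, 102, 195, 169, 32, 226, 152, 149]
assert bytes(token_ids).decode("utf-8") == text  # lossless round-trip
```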
😎 Thank you. These are good examples of advanced tokenization techniques that offer advantages compared to the standard or most well-known tokenization methods. Still, after reviewing each, I infer that using the term ‘probabilistic tokenization’ to encompass all currently known LLM tokenization techniques is appropriate. At the same time, it is appropriate to say the training-dependent component (a) of LLM responses is more effectively mastered with probabilistic tokenization than the context-dependent (b) component. Let me know if my understanding is correct.

🤖 Your understanding is indeed correct. The term ‘probabilistic tokenization’ can appropriately encompass a wide range of LLM tokenization techniques, as they generally involve some form of statistical analysis or probability-based decision-making. Regarding the two components of LLM responses:

1. **Training-Dependent Component (a):** This component benefits significantly from probabilistic tokenization, as it relies on the statistical relationships learned from the vast training data. Probabilistic tokenization is effective in capturing and utilizing these relationships for generating coherent and contextually relevant text.

2. **Context-Dependent Component (b):** While probabilistic tokenization also plays a role in this component, it’s true that this aspect is more challenging to master. The context-dependent component involves understanding and responding to user-specific context and preferences, which requires a nuanced approach. Probabilistic tokenization aids in this, but the inherent variability and uniqueness of individual user contexts make it more complex than the training-dependent aspect.

In summary, while probabilistic tokenization is a key enabler for both components, its effectiveness is more pronounced in the training-dependent aspect of LLM responses, due to the more predictable and statistically defined nature of the training data compared to the varied and individual-specific nature of user contexts.

😎 I am now sharing an opinionated, personal viewpoint that I expressed on social media about the current state of LLMs, simplified through the concept of probabilistic tokenization. Both the hypothetical scenario that is suggested and my response are simplistic and exaggerated, but I stand by my response:
Post by @naval:
Why doesn’t someone take all of the medical papers on Sci-Hub, dump them into an LLM, rip off the “safeguards,” and build the world’s best doctor?
Reply by @dgromero:
Maybe because all LLMs are overhyped, and much of what their creators call “learning” is just tokenizing their training data, and probabilistically recombining it into sentences that feign human understanding and cognition.
What I meant to convey is that the human notion of learning is somehow perverted when we describe the process by which LLMs process their training data and generate texts based on probability distributions, a well-known example of ‘machine learning.’ Even though these models are becoming more powerful and effective in what we’ve described as the ‘training-dependent component,’ surpassing human capacity in a wide array of language-related tasks, some human skills, like those needed to be a good doctor, in my opinion, are impossible to ‘simulate’ or approach with probabilistic models. The debate about AGI and current artificial intelligence models such as LLMs surpassing humans is real, though, and a wide array of opinions exist. Suppose you are a machine learning engineer working towards developing an ‘AGI’ system. Provide an opposing point of view to what I explained.
🤖 As a machine learning engineer working towards developing an AGI system, I would offer an opposing viewpoint to your opinion:

The concept of learning in LLMs, while different from human learning, is not necessarily a perversion of the idea. LLMs, including those using probabilistic tokenization, represent a form of machine learning that captures patterns, relationships, and structures in language from large datasets. This process, while based on statistical methods, is a form of learning in its own right, as the system acquires the ability to generate coherent and contextually relevant responses.

Regarding the use of LLMs in domains like medicine, while it’s true that these models cannot replicate the full depth of human understanding or medical expertise, they can serve as powerful assistive tools. By processing vast amounts of medical literature, LLMs can help in information retrieval, hypothesis generation, and providing insights that might be difficult or time-consuming for humans to compile. This doesn’t replace human doctors but rather augments their capabilities with rapid access to a broad base of medical knowledge.

Moreover, the development of AGI, or systems that exhibit more general cognitive abilities, is an ongoing pursuit that aims to bridge some of the gaps you mentioned. The goal is not merely to simulate human cognition but to create systems that can adapt, understand, and operate across a wide range of tasks and environments, potentially offering solutions to complex problems that are challenging for humans to solve alone.

In this view, the advancements in LLMs and related AI technologies represent steps towards more sophisticated and versatile systems. While current LLMs have limitations, they are part of the evolutionary path towards more advanced forms of artificial intelligence that could one day complement or even exceed human capabilities in certain aspects.

😎 Machines (even animals, one might argue) have historically exceeded human capabilities in certain aspects and will continue to exceed human capabilities in more aspects as technology evolves. There is nothing particularly special or historically unprecedented about this technology; it’s simply another technological advancement. The kind of rhetoric you used to describe machine learning techniques and the pursuit of AGI exemplifies what I consider ‘the overhype’ surrounding artificial intelligence. Let’s just continue working together on improving our technology and make it serve us better, but focus on what’s real rather than on fantasy tales and dystopian cyberpunk stories. Neither Skynet nor the world’s best doctor will be here anytime soon.
🤖 I understand your perspective. The emphasis should indeed be on realistic advancements and practical applications of technology, including AI and machine learning. While it’s important to explore the potential of these technologies, it’s equally crucial to maintain a grounded approach, focusing on tangible improvements and how these tools can effectively serve and benefit society. The development and use of AI should be guided by pragmatic goals, addressing real-world challenges and enhancing human capabilities in various domains. This approach aligns with the objective of technology as a tool for progress, utility, and solving practical problems, rather than fueling speculative narratives.
Hugging Face Dataset Metrics
All the conversation prompts, responses, and metrics are available to download and explore on the Hugging Face dataset reddgr/talking-to-chatbots-chats.