2023-03-06 - Conversation log with a version of Keichobot, a chatbot that draws out your ideas, built on the ChatGPT API
One-line summary
- Systems written as natural language prompts look easy for non-engineers to improve at first glance, but a regression-test mechanism is necessary, because otherwise an unmaintainable hell awaits. An LLM is also necessary for that regression testing. On the other hand, combining this with A/B testing expands the possibilities further.
Summary ver.1 (human-modified)
- Prompts written in natural language need to be improved.
- Adding rules ad hoc is easy to implement
- But that approach has software engineering problems
- After fixing the prompt, we need to verify that the problem has been resolved and that other problems have not recurred.
- In other words, a regression test is needed.
- Regression testing of natural language output requires the use of an LLM
- Combination with A/B testing is also effective
- "Good behavior" for A/B tests does not have to be defined with an LLM
- But on the other hand, using an LLM expands the scope of what A/B testing can cover
Summary ver.1 by ChatGPT
- Need to improve natural language prompts
- Implementation of ad hoc addition of rules is problematic from a software engineering perspective
- Need to verify that it will not recur after correction.
- A/B testing is useful; the preferred behavior of natural language prompts does not have to be defined using an LLM
- No new approach is proposed.
Summary ver.2 by ChatGPT
- I was consulted about a system modification.
- Implementing the fix by simply adding a rule is questionable
- Step 1: Verify that the undesirable behavior no longer occurs after the fix
- Step 2: Detect recurrence with regression testing
- Step 3: Have an LLM perform the pass/fail judgment mechanically
- The A/B testing approach is also effective for determining whether a modification is beneficial
- Natural language, which has been difficult to quantify, can now be quantified via an LLM
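The three steps above can be sketched as a small regression-test harness. This is a minimal sketch, not the author's implementation: `keyword_judge` is a hypothetical stand-in for the LLM judge, and `fake_answers` stands in for the real prompt-driven chatbot.

```python
# Steps 1-3 in miniature: replay recorded inputs through the system
# and let a pluggable judge decide pass/fail. In practice the judge
# would call an LLM with a prompt like "Does this answer negate the
# user's statement?"; here a keyword stand-in keeps the sketch runnable.

def keyword_judge(user_input: str, answer: str) -> bool:
    """Stand-in judge: pass unless the answer opens with a denial."""
    denial_openers = ("no,", "that's wrong", "you are mistaken")
    return not answer.lower().startswith(denial_openers)

def run_regression(inputs, generate, judge):
    """Return (input, answer) pairs whose regenerated answer fails the judge."""
    failures = []
    for user_input in inputs:
        answer = generate(user_input)  # system under test, current prompt
        if not judge(user_input, answer):
            failures.append((user_input, answer))
    return failures

# Hypothetical recorded cases; the first one still denies the user.
fake_answers = {
    "I think X is true.": "No, that is incorrect.",
    "I prefer option B.": "Option B is interesting. Why do you prefer it?",
}
failures = run_regression(list(fake_answers), fake_answers.__getitem__, keyword_judge)
print(failures)
```

The point of the pluggable `judge` parameter is that the same harness works whether the verdict comes from keywords, a human label, or an LLM call.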
log
What would you like to see happen in this conversation? (GPT) I want to organize my thoughts by talking. To achieve your goal of organizing, ask questions that dig into the user's inner issues and form specific ideas. You want to organize your thoughts; you can ask questions to clarify what is confusing and what you need to think about. Fixing undesirable behavior of a system implemented with natural language prompts. Can you tell me specifically what behavior occurred? The system for listening to the user kept denying what the user said, all over the place. That can feel like an unpleasant experience. What modifications do you expect to make from that experience? For now I added "Don't deny the user" to the natural language prompt, but I thought this was questionable from a software engineering perspective. Indeed, the expression "don't deny the user" is quite subtle. Can we look for ways to improve the system by modifying its algorithms so that it doesn't happen? No, it's not that the expression is subtle, but that the implementation style of adding rules ad hoc is not good. The source code becomes bloated and no one knows which code is useful and which is useless. Understood. Regarding the modification, is it possible to check for the presence of negative words in the user's statement and improve the algorithm to avoid them? You jumped two steps ahead of what I was trying to explain (lol). Yes, but the reader can't keep up, so let's go step by step. The current "simply add a rule" approach has not been verified to see whether it actually made things better. First, we need to verify that the "undesirable behavior" no longer occurs after the modification, using the data from when it occurred. This is Step 1. Understood. So Step 1 is to verify that the "undesirable behavior" no longer occurs after the correction is made. What data should be collected for this purpose? User input and the system output in response.
And Step 2: regression testing is necessary to know that the "undesirable behavior" has not recurred after it stopped occurring; it is not enough for a human to check visually each time. Understood. So Step 2 is to perform regression testing to detect recurrence. What specific tests should be performed? A test is needed in which a machine, not a human, determines whether the output is OK or NG. However, since the output is a natural language sentence, it is hard for a human to write the pass/fail condition in code. I think we have no choice but to use an LLM here as well. This is Step 3. Going back to what you said earlier, what specific tests should be performed in relation to modifying the natural language prompts? Specifically, generate an answer to a given input with the current parameters, then give that answer to a given LLM with the prompt "Does this answer negate the user's statement?" and have the LLM judge it. The test you mentioned seems to judge whether the system's answers deny the user, but what other points should be kept in mind besides such a test? Setting that aside, let's wrap up the story so far. I think the summary of our discussion is that natural language prompts need to be modified, that ad hoc addition of rules is questionable from a software engineering perspective, that modification requires regression testing, and that the use of an LLM is necessary for that. Yes. And if we can get that far, improving natural language prompts to avoid "undesirable behavior" can be done with proper verification. Thank you. What would you like to talk about? I'm going to go further here. I want to sort this out, so listen first. Yes, we have heard you. We will ask questions from here about steps you can take to become more concrete. I think this could be connected to the A/B testing approach.
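Step 3's mechanical judgment can be sketched as building the judge prompt and parsing its reply. The exact wording, the YES/NO protocol, and both function names are assumptions for illustration; the actual call to the ChatGPT API is deliberately left out so the sketch stays self-contained.

```python
# Hypothetical sketch of the LLM-as-judge step. Only the payload
# construction and verdict parsing are shown; sending `messages` to a
# chat-completion endpoint is assumed to happen elsewhere.

def build_judge_messages(user_input: str, answer: str) -> list:
    """Build a chat-style payload asking an LLM to act as the judge."""
    question = (
        "Does this answer negate the user's statement? "
        "Reply with exactly YES or NO.\n\n"
        f"User statement: {user_input}\n"
        f"System answer: {answer}"
    )
    return [{"role": "user", "content": question}]

def parse_verdict(llm_reply: str) -> bool:
    """True means the judge says the answer negated the user (test fails)."""
    return llm_reply.strip().upper().startswith("YES")

messages = build_judge_messages("I think X is true.", "No, that is incorrect.")
```

Forcing a constrained YES/NO reply is one simple way to make a natural-language verdict machine-checkable.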
I think it is more necessary first to have concrete outputs about improving natural language prompts before thinking about connecting with an A/B testing approach. What approach do you need to take to make what you want to communicate concrete? Do not deny the user's statement! Sorry, my previous response was incorrect. So, to ask again about the improvement of natural language prompts: I think it is more necessary to have a concrete output first. What approach do you need to take to make what you want to communicate concrete? When we are unsure whether a fix should go into a certain prompt (lol), we could include it for some users and see whether it reduces undesirable behavior. Okay, that is a well-thought-out approach. It seems a good approach would be to apply the modified natural language prompt to a small number of users and then verify whether the undesirable behavior has been reduced. So what steps would be needed to do this? After writing this, I think you're making a bit of a leap. The definition of "desirable" does not have to be based on an LLM; a system using LLMs can also use withdrawal rates, click-through rates, or even an explicit good/bad feedback button for the user. "An LLM sees the output and judges it" is just a newly added means of defining it. Understood. There is an approach of doing A/B testing to obtain concrete outputs, but you don't have to use an LLM to define the preferred behavior; you can use withdrawal rates, click rates, explicit feedback buttons, and so on. Is it more appropriate to say "LLMs are effective in expanding the scope of A/B testing" rather than "A/B testing is effective in improving systems that use LLMs"? Thank you. I would like to ask again: would you say that "LLMs are effective in expanding the scope of A/B testing"? That is true, because natural language, which has been difficult to quantify, can now be quantified via an LLM. You have successfully presented various approaches.
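The A/B comparison described above reduces to comparing a rate between two groups, whatever the definition of "undesirable" is. A minimal sketch with made-up data; the session logs and variant names are hypothetical.

```python
# Quantifying an A/B test: each session logs a boolean for whether
# undesirable behavior was observed. The flag could come from an LLM
# judge, a withdrawal, or an explicit bad-feedback button; the
# arithmetic below is the same either way. Data here is invented.

def flagged_rate(sessions):
    """Fraction of sessions in which undesirable behavior was flagged."""
    return sum(sessions) / len(sessions)

control = [True, False, True, False, False]   # original prompt
variant = [False, False, True, False, False]  # prompt + "Don't deny the user"

print(flagged_rate(control), flagged_rate(variant))  # 0.4 0.2
```

With real traffic one would also run a significance test on the two proportions before deciding the prompt fix ships to everyone.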
So, are there any better directions or new approaches that could be organized from the discussion so far? We'll get to that another time. Give me a bullet-point summary of the conversation so far. To summarize what you said: the modification of natural language prompts requires concrete outputs such as defining preferred behavior and identifying target users, and there were suggestions for an A/B testing approach and the use of an LLM. In addition, when asking questions, it is important to look for key concepts and ask questions related to them. (I didn't say that last part.)
- Letâs add a separate function to generate a summary.
https://keicho.netlify.app/#talk=8pZW2kyh27C7lgy9VH4v #KeichobotLog
This page is auto-translated from /nishio/LLMと回帰テストとA/Bテスト using DeepL. If you see something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.