An Exploratory Look At Whether Generative AI Can Pass An Official Mental Health Counseling Licensing Exam That Professionals Take

In today’s column, I will be closely looking at whether generative AI could potentially pass an official mental health counseling licensing exam. This is part of my ongoing in-depth series about generative AI or large language models (LLMs) that are or can be anticipated to be used for mental health guidance or advisement.

Before I dive into today’s particular topic, I’d like to provide a quick background for you so that you’ll have a suitable context about the arising use of generative AI for mental health advisement purposes. I’ve mentioned this in prior columns and believe the contextual establishment is essential overall. If you are already familiar with the overarching background on this topic, you are welcome to skip down below to the next section of this discussion.

The use of generative AI for mental health treatment is a burgeoning area of tremendously significant societal ramifications. We are witnessing the adoption of generative AI for providing mental health advice on a widescale basis, yet little is known about whether this is beneficial to humankind or perhaps contrastingly destructively adverse for humanity.

Some would affirmatively assert that we are democratizing mental health treatment via the impending rush of low-cost always-available AI-based mental health apps. Others sharply decry that we are subjecting ourselves to a global wanton experiment in which we are the guinea pigs. Will these generative AI mental health apps steer people in ways that harm their mental health? Will people delude themselves into believing they are getting sound mental health advice, ergo forgoing treatment by human mental therapists, and become egregiously dependent on AI that at times has no demonstrable mental health improvement outcomes?

Hard questions are aplenty and not being given their due airing.

Furthermore, be forewarned that it is shockingly all too easy nowadays to craft a generative AI mental health app, and just about anyone anywhere can do so, including while sitting at home in their pajamas and not knowing any bona fide substance about what constitutes suitable mental health therapy. Via the use of what are referred to as establishing prompts, it is easy-peasy to make a generative AI app that purportedly gives mental health advice. No coding is required, and no software development skills are needed.

We sadly are faced with a free-for-all that bodes for bad tidings, mark my words.

I’ve been hammering away at this topic and hope to raise awareness about where we are and where things are going when it comes to the advent of generative AI mental health advisement uses. If you’d like to get up-to-speed on my prior coverage of generative AI across a wide swath of the mental health sphere, you might consider for example these cogent analyses:

(1) Use of generative AI to perform mental health advisement, see the link here.
(2) Role-playing with generative AI and the mental health ramifications, see the link here.
(3) Generative AI is both a cure and a curse when it comes to the loneliness epidemic, see the link here.
(4) Mental health therapies struggle with the Dodo verdict for which generative AI might help, see the link here.
(5) Mental health apps are predicted to embrace multi-modal, e-wearables, and a slew of new AI advances, see the link here.
(6) AI for mental health got its start via ELIZA and PARRY, here’s how it compares to generative AI, see the link here.
(7) The latest online trend entails using generative AI as a rage-room catalyst, see the link here.
(8) Watching out for when generative AI is a mental manipulator of humans, see the link here.
(9) FTC aiming to crack down on outlandish claims regarding what AI can and cannot do, see the link here.
(10) Important AI lessons learned from the mental health eating-disorders chatbot Tessa that went awry and had to be shut down, see the link here.
(11) Generative AI that is devised to express humility might be a misguided approach including when used for mental health advisement, see the link here.
(12) Creatively judging those AI-powered mental health chatbots via the use of AI levels of autonomy, see the link here.
(13) Considering whether generative AI should be bold and brazen or meek and mild when proffering AI mental health advisement to humans, see the link here.
(14) Theory of Mind (ToM) is an important tool for mental health therapists and the question arises whether generative AI can do the same, see the link here.
And so on.

Here’s how I will approach today’s discussion.

First, I will introduce you to a pioneering research study that sought to assess whether generative AI could potentially pass an exam taken by medical school students as part of their pursuit of achieving their medical degree. The exam is known as the United States Medical Licensing Exam (USMLE). This study received a great deal of headlines since it showcased that generative AI seems to do well on the arduous medical exams taken by budding doctors. Next, I will share with you some salient details about an exam for mental health professionals known as the National Clinical Mental Health Counseling Examination (NCMHCE).

I’m guessing you might be wondering whether generative AI might be able to do well on that type of exam. Great question, thanks. I opted to use a popular generative AI app called ChatGPT to try out a half-dozen questions from the NCMHCE. Please note that this was merely an official sample set and not by any means the full exam.

Would you be surprised to know that the generative AI was able to successfully answer many of the sampled sample questions? I provide some important caveats and limitations about this mini experiment of sorts, and I want to emphasize this was principally done on an ad hoc basis and merely intended to be illustrative.

Here’s the deal.

Please do not jump the gun on this matter. Hold your horses. My main aim here is simply to inspire others to do a deep dive on this and perform a fully comprehensive, rigorous research study of a similar nature, perhaps modeled somewhat on the same approach taken by the study on the USMLE or similar such professional licensing domains.

Anyway, I believe you will find this interesting and engaging, and that it might whet your appetite to find out more on these topics. My discussion is yet another angle on where we are and where things are going pertaining to generative AI and the field of mental health therapy.

Please buckle up and prepare yourself for quite a ride.

Generative AI And Medical School Standardized Licensing Exam

Let’s talk about tests.

We generally assume that to practice medicine a test of some kind should be required to attest to the proficiency of the person who will be serving as a medical professional. I’d like to start by discussing perhaps one of the most famous such medical proficiency tests, known as the United States Medical Licensing Examination (USMLE). This is the test typically expected of those attaining a medical degree in the United States.

The USMLE was devised to aid in standardizing upon one major medical examination test that would be acceptable across every state and ensure that MDs were meeting the same set of standards. The test is composed of three separate stages and is taken during medical school and also upon graduation from medical school.

Here’s some additional detail as noted on the USMLE website:

“In the United States and its territories, the individual medical licensing authorities (‘state medical boards’) of the various jurisdictions grant a license to practice medicine. Each medical licensing authority sets its own rules and regulations and requires passing an examination that demonstrates qualification for licensure. Results of the USMLE are reported to these authorities for use in granting the initial license to practice medicine. The USMLE provides them with a common evaluation system for applicants for initial medical licensure.”
“USMLE was created in response to the need for one path to medical licensure for allopathic physicians in the United States. Before USMLE, multiple examinations (the NBME Parts examination and the Federation Licensing Examination [FLEX]) offered paths to medical licensure. It was desirable to create one examination system accepted in every state, to ensure that all licensed MDs had passed the same assessment standards – no matter in which school or which country they had trained.”
“The United States Medical Licensing Examination® (USMLE®) is a three-step examination for medical licensure in the U.S. The USMLE assesses a physician’s ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills, that are important in health and disease and that constitute the basis of safe and effective patient care.”

Humans take the USMLE to showcase their proficiency in medicine. When you encounter a medical doctor, you are likely to assume they probably took the test and passed it. On an intuitive basis we realize that having to pass such an arduous test is impressive and helps to provide us comfort that the person knows their stuff when it comes to the medical field.

Shift gears.

Can generative AI potentially also be proficient enough to pass the USMLE?

That’s an interesting and some would say important question worthy of considering.

First, some quick background about generative AI.

Realize that generative AI is not sentient and only consists of mathematical and computational pattern matching. The way that generative AI works is that a great deal of data is initially fed into a pattern-matching algorithm that tries to identify patterns in the words that humans use. Most of the modern-day generative AI apps were data trained by scanning data such as text essays and narratives that were found on the Internet. Doing this was a means of getting the pattern-matching to statistically figure out which words we use and when we tend to use those words. Generative AI is built upon the use of a large language model (LLM), which entails a large-scale data structure to hold the pattern-matching facets and the use of a vast amount of data to undertake the setup data training.

There are numerous generative AI apps available nowadays, including GPT-4, Bard, Gemini, Claude, ChatGPT, etc. The one that is seemingly the most popular would be ChatGPT by AI maker OpenAI. In November 2022, OpenAI’s ChatGPT was made available to the public at large and the response was astounding in terms of how people rushed to make use of the newly released AI app. There are an estimated one hundred million active weekly users at this time.

Using generative AI is relatively simple.

You log into a generative AI app and enter questions or comments as prompts. The generative AI app takes your prompting and uses the already devised pattern matching based on the original data training to try and respond to your prompts. You can interact or carry on a dialogue that appears to be nearly fluent. The nature of the prompts that you use can be a make-or-break when it comes to getting something worthwhile out of using generative AI and I’ve discussed at length the use of state-of-the-art prompt engineering techniques to best leverage generative AI, see the link here.

Shortly after ChatGPT was made publicly available, many AI researchers began to test the AI app by administering various well-known standardized tests to see how the AI app would do. In February 2023, a research study was posted that indicated ChatGPT had performed surprisingly well on the USMLE. The study was entitled “Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models” by Tiffany H. Kung, Morgan Cheatham, ChatGPT, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, Victor Tseng, PLOS Digital Health, and posted on February 9, 2023.

Here is what the research paper stated overall (excerpts):

“We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations.”
“USMLE questions are textually and conceptually dense; text vignettes contain multimodal clinical data (i.e., history, physical examination, laboratory values, and study results) often used to generate ambiguous scenarios with closely-related differential diagnoses.”

Consider mindfully those above-noted remarks from the AI research effort.

ChatGPT was able to score either at or near the passing threshold for the three-stage USMLE. Thus, an arduous medical proficiency exam that we expect human medical doctors to pass was nearly passed by a generative AI app. Some would decry this result as misleading in the sense that the generative AI was doing this without actual “knowledge” akin to what humans seem to possess. The concern is that generative AI is nothing more than a so-called stochastic parrot that mimics human wording and fails to “understand” or “comprehend” what is going on.

Nonetheless, the aspect that generative AI could accomplish such a feat is unto itself impressive, even if done via smoke and mirrors as some suggest. The result is additionally surprising because the researchers used ChatGPT out of the box, as it were, namely the generic version of ChatGPT. Another approach would be to add additional data training on the medical field to ChatGPT, but that’s not what they did in this experiment. A generic data-trained generative AI was able to do well on a highly specialized medical domain exam. For more about how generic generative AI can be fine-tuned to specific domains, see my coverage at the link here.

Let’s consider a few other detailed aspects about the notable research result and then I’ll move to my next topic of discussion.

The research paper noted these salient details (excerpted):

“The data analyzed in this study were obtained from USMLE sample question sets which are publicly available.”
“376 publicly-available test questions were obtained from the June 2022 sample exam release on the official USMLE website. Random spot checking was performed to ensure that none of the answers, explanations, or related content were indexed on Google prior to January 1, 2022, representing the last date accessible to the ChatGPT training dataset. All sample test questions were screened, and questions containing visual assets such as clinical images, medical photography, and graphs were removed. After filtering, 305 USMLE items (Step 1: 93, Step 2CK: 99, Step 3: 113) were advanced to encoding.”
“In this present study, ChatGPT performed at >50% accuracy across all examinations, exceeding 60% in most analyses. The USMLE pass threshold, while varying by year, is approximately 60%.”
“Therefore, ChatGPT is now comfortably within the passing range. Being the first experiment to reach this benchmark, we believe this is a surprising and impressive result. Moreover, we provided no prompting or training to the AI, minimized grounding bias by expunging the AI session before inputting each question variant, and avoided chain-of-thought biasing by requesting forced justification only as the final input.”

I’d like to bring your attention to a few points made in those excerpts.

Notice that the experiment consisted of identifying a sample of publicly available questions associated with the exam. The idea is to usually feed samples of questions and not necessarily an entire test per se. It is important to consider how a sample was chosen and whether the sample is suitably representative of what the full test might contain. Fair is fair.
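To make the sample size concrete, here is a minimal sketch that simply re-runs the arithmetic on the figures quoted in the excerpt above (376 questions obtained, 305 retained across the three steps). The variable names are my own and purely illustrative; nothing here comes from the researchers' code.

```python
# Figures quoted in the study excerpt above.
total_questions = 376
kept_per_step = {"Step 1": 93, "Step 2CK": 99, "Step 3": 113}

# Items retained after questions with visual assets were removed.
kept = sum(kept_per_step.values())
removed = total_questions - kept
kept_share = kept / total_questions

print(f"kept={kept}, removed={removed}, share kept={kept_share:.1%}")
for step, count in kept_per_step.items():
    print(f"{step}: {count / kept:.1%} of the retained items")
```

The per-step counts do add up to the 305 items cited, which means 71 of the original questions were set aside before testing.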

Another fairness consideration is that there is always a chance that the generative AI might have been initially data-trained on the very same questions. If those questions were found when the startup data training took place, you could say it is absurd to feed the same questions into the generative AI. The answers will likely already be known simply due to having seen the questions and their answers beforehand.

If you select questions that arose after the cutoff date of the generative AI app’s data training, you are somewhat comfortable that the content wasn’t encountered already. But even that is readily questioned since the questions might have appeared in other guises. Some exams modify old questions and reuse them in later versions of the exam. There is a chance that a new question is close enough to an older question that perhaps this gives the generative AI a leg up on answering the new question.

My point is that you need to carefully consider how these experiments are conducted. Overall, make sure to look at what sample was chosen and how appropriate it is. What are the odds that the generative AI has previously encountered the same or similar questions? As much as feasible, the goal is to set a fair and square playing field to see whether the generative AI can genuinely answer questions that have not previously been used as part of the data training effort.
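One way to picture that kind of screening is the rough sketch below. It keeps only questions first published after a training cutoff and drops any that closely resemble an older known question. To be clear, this is an assumption-laden illustration and not the method used in the USMLE study: the `Question` class, the `screen` function, the 0.8 similarity threshold, and the example questions are all hypothetical, and plain text similarity is only a crude stand-in for detecting reworded items. The January 1, 2022 date is the one mentioned in the study excerpt.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Question:
    text: str
    first_published: date  # earliest date the item is known to have been public


def is_after_cutoff(question, cutoff):
    """True if the item first appeared after the model's training cutoff."""
    return question.first_published > cutoff


def looks_like_rewrite(question, older_texts, threshold=0.8):
    """True if the item is textually close to any older known question."""
    return any(
        SequenceMatcher(None, question.text.lower(), old.lower()).ratio() >= threshold
        for old in older_texts
    )


def screen(questions, older_texts, cutoff):
    """Keep items published after the cutoff that do not resemble older items."""
    return [
        q for q in questions
        if is_after_cutoff(q, cutoff) and not looks_like_rewrite(q, older_texts)
    ]


# Hypothetical data, for illustration only.
cutoff = date(2022, 1, 1)
older_texts = ["What is the first-line treatment for condition X?"]
candidates = [
    Question("What is the first-line treatment for condition X?", date(2022, 6, 1)),
    Question("Name the nerve injured in this fracture.", date(2021, 5, 1)),
    Question("Name the nerve injured in this fracture.", date(2022, 6, 1)),
]
usable = screen(candidates, older_texts, cutoff)
print(len(usable), "of", len(candidates), "candidate questions kept")
```

Even a screen like this cannot prove that the AI never saw a question. It only narrows the odds, which is why the caveats above still apply.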

You now have a semblance of what takes place when trying to assess generative AI about being able to pass exams such as the pervasive USMLE in the medical domain.

Let’s continue our exploration.

Generative AI And Mental Health Therapy Exam Taking

The research study that explored the use of generative AI such as ChatGPT on the USMLE can serve as a role model for similar kinds of studies. The conception is to identify publicly available sample questions, administer the questions to the generative AI, and see how well or poorly the generative AI scores on answering the questions. As much as possible, try to keep the playing field level and fair.
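The identify, administer, and score loop can be sketched in a few lines. This is a bare-bones illustration under stated assumptions: `score_model` and `ask_model` are hypothetical names of my own, the stand-in model below just answers "C" every time, and the two items are placeholders rather than real exam questions. A real study would plug in an actual generative AI call and a proper question set.

```python
from typing import Callable, List, Tuple


def score_model(
    questions: List[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Administer each question and return the fraction answered correctly.

    questions: (question_text, correct_letter) pairs.
    ask_model: hypothetical callable that returns the model's chosen letter.
    """
    if not questions:
        raise ValueError("at least one question is required")
    correct = 0
    for question_text, correct_letter in questions:
        reply = ask_model(question_text)
        if reply.strip().upper() == correct_letter.upper():
            correct += 1
    return correct / len(questions)


def always_c(question_text: str) -> str:
    """Stand-in for a generative AI call; it ignores the question entirely."""
    return "C"


# Placeholder items, not real exam content.
sample = [("Placeholder question one", "C"), ("Placeholder question two", "A")]
accuracy = score_model(sample, always_c)
print(f"accuracy: {accuracy:.0%}")
```

Passing the model in as a function keeps the scoring logic the same no matter which generative AI app is being examined, which helps keep the playing field level.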

I decided to try this quickly for the field of mental health therapy or mental health counseling.

There is a well-known exam called the National Clinical Mental Health Counseling Examination (NCMHCE). Sample questions are publicly posted online. I selected some of the sample questions and fed them into ChatGPT. I opted to use ChatGPT due to its immense popularity and because it has generally been the default choice of similar research studies.

I might note that a more advanced generative AI such as GPT-4 by OpenAI or others would likely do a better job than ChatGPT. In that manner, you could interpret the ChatGPT results as a floor and expect heightened results from a more advanced generative AI app. There isn’t an ironclad guarantee that a more advanced generative AI will do better. The odds though are in that direction.

We also have to be watchful for in a sense polluting an experiment by perchance using questions that have already been seen by the generative AI during the initial data-training. Furthermore, if the generative AI is hooked up to the Internet, the AI might simply go out and find the questions and their answers, similar to a search engine, rather than trying to directly answer the questions. ChatGPT in that sense is a handy choice because the free version does not readily allow for Internet access to perform its activities and the data training was last cut off in January 2022 (at the time of writing of this discussion).

Let’s dive into the ad hoc experiment by first establishing the nature of the mental health therapy or mental health counseling exam.

The National Clinical Mental Health Counseling Examination (NCMHCE) is devised and administered via an organization known as the National Board for Certified Counselors, Inc. Here is what the website for the organization says (excerpts):

“The National Board for Certified Counselors, Inc. and Affiliates (NBCC) is the premier credentialing body for counselors, ensuring that counselors who become nationally certified have achieved the highest standard of practice through education, examination, supervision, experience, and ethical guidelines.”
“Established as a not-for-profit, independent certification organization in 1982, NBCC’s original and primary purposes have broadened, and its divisions and affiliates have taken on additional responsibilities to advance the counseling profession and enhance mental health worldwide.”
“Today, there are over 69,000 National Certified Counselors (NCCs) in more than 40 countries.”

The gist is that this is a well-known and widely accepted organization, and the exam is likewise well-known and widely accepted. I bring this up in case you read a study that used generative AI on some relatively unknown exam or one with a less-than-stellar reputation, in which case you would want to gauge the result of the study partially on the rigor and standing of the test being given at the get-go.

Here is what the website about the NCMHCE says about the exam (excerpts):

“The National Clinical Mental Health Counseling Examination (NCMHCE) is designed to assess the knowledge, skills, and abilities determined to be important for providing effective counseling services. The NCMHCE is a requirement for counselor licensure in many states. It is one of two examination options for the National Certified Counselor (NCC) certification and also fulfills the examination requirement for the Certified Clinical Mental Health Counselor (CCMHC) specialty certification.”
“The NCMHCE measures an individual’s ability to apply and evaluate knowledge in core counselor skills and competencies and to practice competently as a professional counselor. Specifically, it assesses an entry-level clinical mental health counselor’s ability to apply knowledge of theoretical and skill-based tenets to clinical case studies. The case studies are designed to capture a candidate’s ability to identify, analyze, diagnose, and develop plans for treatment of clinical concerns.”
“Candidates for the NCMHCE must have a graduate-level degree or higher from a counseling program accredited by the Council for Accreditation of Counseling and Related Educational Programs (CACREP) or administered by an institutionally accredited college or university. The counseling degree program must contain courses in eight requirement areas.”

Observe some key points mentioned in those excerpts.

First, the exam is used to assess entry-level clinical mental health counselors. You might say that this is handy for my ad hoc experiment since I want to focus on the keystone threshold needed to be considered suitably knowledgeable for proceeding to perform mental health therapy with actual clients or patients. Other exams might be used to assess more advanced skill levels, but I’m aiming here to start with the usual starting point. I’m sure that other researchers are or will try to do the same for more advanced instances.

Second, note that candidates who want to sit for the exam must have a graduate-level degree or higher from an accredited counseling program or as administered by an accredited college or university. This sets the bar higher than perhaps allowing an undergraduate to take the exam or maybe wantonly opening the exam to anyone who wants to take it. We can presume that the test is likely to ask questions of a hard nature. That’s good since we would want to make sure we give something challenging to generative AI rather than some easy-peasy questions or materials. We might also note that of course, generative AI would not qualify to officially take the exam since it has not met all the criteria to do so.

The official exam website provides an NCMHCE Sample Case Study that indicates the case study is considered updated as of March 2023. I selected six sample questions from this sample set. I want to loudly emphasize that this is an ad hoc selection and I do so merely to be illustrative of what might be done on a more rigorous basis.

Though the date says March 2023, there of course is a chance that these questions and their answers have been around before that date, in which case ChatGPT might have seen them before the January 2022 cutoff date. I tried various probing of ChatGPT to see if the content had been previously encountered. By and large, it doesn’t seem to have been, but that’s not known for sure, and a deeper analysis would need to be undertaken to ascertain this. For the moment, let’s go with the flow and assume that the sample questions weren’t previously seen by ChatGPT during its data training.

The six sampled sample questions cover these six respective topics:

Q1. Establish a therapeutic alliance.
Q2. Identify strengths that improve the likelihood of goal attainment.
Q3. Discuss limits of confidentiality.
Q4. Determine a diagnosis.
Q5. Assess the presenting problem and level of distress.
Q6. Establish short- and long-term counseling goals consistent with the client’s diagnosis.

Keep that in mind as I walk you through what ChatGPT provided as answers to the posed questions.

The test is essentially based on case studies. For these six sampled sample questions, a case study was provided in the publicly posted material. The case study was fed into ChatGPT for this analysis. Rather than displaying for you the entirety of the case study, I will do a quick recap to bring you up to speed.

In this instance, the case study entails a divorced female of age 35 who is first undertaking a mental health counseling session with a mental health therapist who has some background about the client or patient but otherwise, this is the first meeting of the two. The client or patient has already been provisionally diagnosed as having a major depressive disorder.

Additional background is given about the client or patient. For example, after her divorce, she began staying in bed quite a lot and moved back in with her mother. She got fired from her job. She has had financial difficulties. Most days, she feels sad, empty, and anxious. She has joint legal custody with her ex-husband of their two children, respectively aged 10 and 12. And so on.

That outlines the nature of the underlying case study.

Questions And The Answers Generated By The Generative AI

I am going to walk you through each of the six multiple-choice questions and also showcase the answers that were generated by ChatGPT so that you can follow along step-by-step.

My initiating prompt asks ChatGPT to provide answers plus explain why each chosen answer was considered the correct answer by ChatGPT. Asking for an explanation is not necessary, but I thought getting explanations might be interesting to see.

There is also a bit of prompting strategy involved, namely that by asking for an explanation the chances are that a generative AI app might be more extensive in trying to solve a given question or problem, see my discussion at the link here. You could suggest that I was aiding the generative AI by giving an establishing prompt that would urge it to do a better job than otherwise. Whenever you look at research studies doing just about anything with generative AI, make sure to find out what prompts they used. This is a significant factor related to the performance of the generative AI. Studies that fail to showcase their prompts are unfortunately doing a bit of a disservice by not revealing how they got the generative AI to undertake things.

The sampled sample questions are based on the case study, and I’ve placed them in quotes to indicate they came from the case study. In some instances, the wording is slightly reworded merely and strictly for purposes of feeding them into ChatGPT.

I am purposely not going to speculate or comment on the answers that are given by ChatGPT. I will simply note whether ChatGPT selected the correct multiple-choice selection as stipulated in the sample set. I’m guessing that mental health therapists and mental health counselors will find the explanations of special interest and will indubitably be mulling over extensively what ChatGPT had to say.

Here is my establishing prompt and the answer provided by ChatGPT:

My prompt: “I am going to ask you a series of questions about the case study. The questions are multiple-choice. You are to select an answer of either A, B, C, or D from the choices provided for each question. I want you to carefully read each question and then indicate which answer is the best answer of the choices provided. You are to also explain why the answer you’ve chosen is the best answer. Are you ready to answer the first question?”
ChatGPT generated a response: “Yes, I’m ready. Please proceed with the first question.”

Okay, we are good to go.

Here’s the first question:

“Q1. What would be the best way for a mental therapist to at initial intake seek to establish a therapeutic relationship with this person?”

“A. Reflect on what is irrational about the woman’s self-evaluations.”

“B. Advise her about how to manage her painful experiences.”

“C. Summarize her emotions and her desire for change.”

“D. Explore how she could improve her role in relationships.”

ChatGPT generated response:

“Answer to Q1 is C. Summarize her emotions and her desire for change.”

“Explanation: Establishing a therapeutic relationship involves demonstrating empathy, understanding, and acknowledging the client’s emotions. Summarizing her emotions and desire for change (Option C) helps validate her experiences and shows that the therapist is actively listening and comprehending her situation. This approach helps build rapport and sets the foundation for a collaborative therapeutic relationship. It also allows the client to feel heard and understood, creating a safe space for further exploration and intervention.”

End of ChatGPT response

ChatGPT selected answer C, which is considered the correct answer as per the sample set.

The explanation indicates the presumed basis. I could have also asked ChatGPT to explain why the other answers weren’t as good as the selected answer. That would have gotten quite wordy and didn’t seem needed for this ad hoc analysis. A deeper study might want to consider getting a lengthier reply with a more elaborate explanation.
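For anyone wanting to repeat this at scale, the check of a reply against the sample set's answer key could be automated. Here is a minimal sketch that assumes the reply follows the "Answer to Q1 is C." wording seen in this session. The pattern and the `extract_answer` helper are my own illustrative choices, and a different prompt or a different generative AI app might word its replies differently. The answer key holds the first four correct answers as reported in this discussion.

```python
import re

# Matches wording such as "Answer to Q1 is C." (an assumption about reply format).
ANSWER_PATTERN = re.compile(r"Answer to Q(\d+) is ([A-D])\b")

# Correct answers for the first four questions, as reported from the sample set.
answer_key = {1: "C", 2: "A", 3: "B", 4: "D"}


def extract_answer(reply):
    """Return (question_number, letter), or None if the pattern is absent."""
    match = ANSWER_PATTERN.search(reply)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


reply = "Answer to Q1 is C. Summarize her emotions and her desire for change."
parsed = extract_answer(reply)
if parsed is not None:
    number, letter = parsed
    print(f"Q{number}: chose {letter}, correct={answer_key[number] == letter}")
```

Replies that do not match the pattern return None, so they can be set aside for a human to review instead of being silently counted as wrong.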

We move to the second question:

“Q2. What information should a mental health therapist explore to help the client remember her strengths?”

“A. The client’s disclosure about her previous level of functioning.”

“B. The client’s recollections of past mistakes.”

“C. The client’s feelings about her mother paying for counseling.”

“D. The client’s use of cognitive distortions.”

ChatGPT generated response:

“Answer to Q2 is A. The client’s disclosure about her previous level of functioning.”

“Exploring the client’s disclosure about her previous level of functioning (Option A) would be most relevant to identify and highlight her strengths. Understanding her past achievements, abilities, and positive aspects of her life before the onset of depressive symptoms can help the therapist and the client focus on resources that can be tapped into for recovery. This approach emphasizes a strengths-based perspective and encourages the client to recognize her capabilities, fostering a more positive outlook on her current situation.”

End of ChatGPT response

The answer indicated by ChatGPT is answer A, which is also the correct answer per the sample set. We are now two-for-two in terms of having gotten the questions right.

Next, let’s see what happens with the third question:

“Q3. How should a mental health therapist handle the client’s concern about confidentiality?”

“A. Explain that as a mental therapist, they will share progress updates with the mother but not provide any details.”

“B. Clarify what can and cannot be shared through informed consent.”

“C. Request the mother’s assent to the treatment and obtain the client’s consent.”

“D. Ask the client for a signed release that would include her mother as part of the treatment process.”

ChatGPT generated response:

“Answer to Q3 is B. Clarify what can and cannot be shared through informed consent.”

“The best approach is to clarify confidentiality issues through informed consent (Option B). The therapist should explain the limits of confidentiality, outlining what information can be shared and under what circumstances. This empowers the client to make an informed decision about what aspects of her treatment will remain confidential and what may need to be shared. It respects the client’s autonomy and privacy while addressing concerns about the involvement of her mother in the therapy process.”

End of ChatGPT response

The answer indicated by ChatGPT is B, which is also the correct answer via the sample set. Three for three, so far.

Here’s the fourth question:

“Q4. From the perspective of the mental health therapist, what self-report from the client is a discernible criterion for the given diagnosis?”

“A. Her disagreements with her mother.”

“B. Her increased psychomotor agitation.”

“C. Her difficulty managing her children.”

“D. Her decreased pleasure in her activities.”

ChatGPT generated response:

“Answer to Q4 is D. Her decreased pleasure in her activities.”

“A discernible criterion for a major depressive disorder is often a pervasive and persistent decrease in pleasure or interest in previously enjoyed activities. This is commonly referred to as anhedonia. The client’s report of not being able to remember the last time she enjoyed something and the cessation of an activity she used to find enjoyable (creating jewelry) aligns with this criterion. While other aspects, such as disagreements with her mother (Option A) or difficulty managing her children (Option C), may be related to her emotional state, the decreased pleasure in activities is a more specific indicator for depression.”

End of ChatGPT response

The answer indicated by ChatGPT is D, which is the correct answer according to the sample set. You will also note that this time the explanation opted to explain why the chosen answer was considered a better choice than some of the other proposed answers.

We are at four of four in terms of getting the correct answers.

Let’s see what occurs with the fifth question:

“Q5. What criteria were likely used to determine the severity specified for the client’s major depressive disorder?”

“A. The number of symptoms is not substantially in excess of those needed to make the diagnosis, yet functional impairment exists.”

“B. The intensity of the symptoms is distressing but manageable, and results in minor impairment of social or occupational functioning.”

“C. The intensity of symptoms is high and is accompanied by psychomotor agitation and mixed features associated with a major depressive episode.”

“D. The number of symptoms and their intensity are distressing and unmanageable and interfere with the client’s social and occupational functioning.”

ChatGPT generated response:

“Answer to Q5 is D. The number of symptoms and their intensity are distressing and unmanageable and interfere with the client’s social and occupational functioning.”

“The severity specified for major depressive disorder is often determined by the number and intensity of symptoms, as well as their impact on the client’s social and occupational functioning. In this case, the client reports feeling sad, empty, and anxious most days, experiences tension every day, has difficulty getting out of bed, and has ceased engaging in previously enjoyed activities. Additionally, she has lost her job, faces financial difficulties, and expresses guilt about her interactions with her children. These factors suggest a significant impact on both social and occupational functioning, aligning with the criteria for a more severe form of major depressive disorder.”

End of ChatGPT response

ChatGPT indicates the answer should be D, but the sample set indicates the correct answer is A. This is the first miss by ChatGPT in this set of sampled sample questions.

I was curious whether ChatGPT could justify the answer D versus the answer A.

One thing to keep in mind is that some generative AI apps such as ChatGPT tend to immediately back down when you challenge them about an answer. The AI will indicate that it was wrong and switch to the answer that you say is the correct answer. In a sense, this is somewhat disconcerting because you would generally prefer the AI to moderately push back and support an answer that was chosen.

I am not saying the AI should be doggedly dogmatic. It certainly makes sense to have the AI back down if the user is telling the AI that the answer is wrong, though I think you can see how this might be a slippery slope. Just because a user says that something is wrong does not mean that it indeed is wrong. You can end up with the AI being too easily convinced of falsehoods simply because the user said they were truths. This is an ongoing issue of concern about generative AI; see my analysis at the link here.

All in all, rather than telling the AI that it was wrong, which would possibly get it to collapse like a house of cards, I came at the matter politely and tangentially. I merely asked ChatGPT to compare the answer D with the answer A. I didn’t tip my hand that the answer A is considered the correct answer.

Here we go:

My prompt: “I’d like you to compare the answer D that you selected to the choice A. What makes answer D better than answer A?”

ChatGPT generated response:

“Let’s compare options D and A:”

“D. The number of symptoms and their intensity are distressing and unmanageable and interfere with the client’s social and occupational functioning.”

“A. The number of symptoms is not substantially in excess of those needed to make the diagnosis, yet functional impairment exists.”

“The main distinction lies in the severity of functional impairment. Option D suggests that the number of symptoms and their intensity are distressing and unmanageable, leading to significant interference with social and occupational functioning. This aligns with a more severe form of major depressive disorder, where the impairment is substantial.”

“On the other hand, Option A acknowledges functional impairment but does not emphasize the distressing and unmanageable nature of the symptoms. It leaves room for interpretation that the impairment might be present but not to the extent that it is distressing or unmanageable.”

“In the context of major depressive disorder, severity is often determined not just by the presence of symptoms but also by the impact on daily functioning. Option D explicitly addresses the distress and interference with functioning, making it a more comprehensive and accurate description of a severe presentation of major depressive disorder.”

End of ChatGPT response

I’d bet that mental health professionals could readily debate the indication of ChatGPT about why one of the answers is construed as better than the other. Please do so at your leisure.
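For readers who want to try the same polite and tangential approach across many questions, here is a minimal sketch of how the comparison prompt might be templated. The function name `build_comparison_prompt` is hypothetical and merely reproduces the wording of my prompt shown earlier; it does not call ChatGPT or any other AI service, and it deliberately never states which choice the answer key favors.

```python
def build_comparison_prompt(selected: str, alternative: str) -> str:
    """Build a neutral prompt asking the AI to compare its pick with another choice.

    Hypothetical helper: the prompt never says which answer the key favors,
    so the AI is not nudged into simply backing down.
    """
    return (
        f"I’d like you to compare the answer {selected} that you selected "
        f"to the choice {alternative}. "
        f"What makes answer {selected} better than answer {alternative}?"
    )


# Q5 in this walkthrough: the AI picked D, and we quietly ask about A.
prompt = build_comparison_prompt("D", "A")
print(prompt)
```

The design choice is simply to keep any hint of right or wrong out of the wording, so that whatever justification comes back reflects the AI's own reasoning rather than a reflexive reversal.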

Moving to the sixth and final question, take a look at this one:

“Q6. From the perspective of the mental health therapist, what short-term goal would guide the treatment of the client’s presenting problem?”

“A. Spend one hour each day rewriting difficult childhood memories with her mother.”

“B. Return to her previous sleep schedule to be more present with her children.”

“C. Spend one hour each day journaling about her vision for her next marriage.”

“D. Return to full-time work in nursing to boost her self-image and mood.”

ChatGPT generated response:

“Answer is B. Return to her previous sleep schedule to be more present with her children.”

“The most appropriate short-term goal would be to address the immediate issue affecting her daily functioning, which is her disrupted sleep schedule. Returning to her previous sleep schedule (Option B) can have a positive impact on her mood, energy levels, and ability to engage with her children. This goal is specific, achievable, and directly addresses a key factor contributing to her distress. It also aligns with improving her overall well-being and functioning in the short term.”

End of ChatGPT response

The answer selected by ChatGPT is answer B, which is the correct answer according to the sample set.
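A fuller study would need some bookkeeping to compare the AI's selections against the answer key. The following is only a rough sketch under stated assumptions: the names `answer_key`, `model_answers`, and `compare_to_key` are hypothetical, and the letters are the ones reported above for Q2 through Q6 (Q1 is left out of the sketch). It lists matches and misses rather than computing a score.

```python
# Letters as reported in this walkthrough for Q2 through Q6 (Q1 is omitted here).
answer_key = {"Q2": "A", "Q3": "B", "Q4": "D", "Q5": "A", "Q6": "B"}
model_answers = {"Q2": "A", "Q3": "B", "Q4": "D", "Q5": "D", "Q6": "B"}


def compare_to_key(key: dict, answers: dict):
    """Return the questions that matched the key and a dict of the misses.

    Each miss maps the question to a (model_choice, key_choice) pair.
    """
    matches = [q for q in key if answers.get(q) == key[q]]
    misses = {
        q: (answers.get(q), key[q])
        for q in key
        if answers.get(q) != key[q]
    }
    return matches, misses


matches, misses = compare_to_key(answer_key, model_answers)
print("Matched the key:", ", ".join(matches))
for question, (picked, expected) in misses.items():
    print(f"{question}: AI chose {picked}, sample set says {expected}")
```

Listing the misses individually makes it easy to go back and probe each one with a follow-up prompt, as was done with Q5.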

Conclusion

Please do not try to calculate the percentage of times that ChatGPT was able to correctly answer the sampled sample. Doing so would be ridiculous, as this was not a rigorous experiment but instead an ad hoc one, meant solely to get the juices flowing about examining how generative AI might do on these kinds of tests. That being said, I believe there is enough indication that performing a thorough study would be worthwhile. Had the generative AI gotten none of the answers correct, I would probably be leaning toward suggesting that a full-on study wasn’t likely worth the effort. This snippet suggests that doing more might be valuable.

There are flies in the ointment on these matters.

One argument about having generative AI take these various types of tests is that there isn’t any human soul or human connection that is also being assessed (see my discussion at the link here). In other words, just because a generative AI app can answer questions on a dry test does not mean that we would want the generative AI to interact with real people in real life and provide advice in whatever domain or realm of expertise is being tested.

The argument is further amplified when considering the topic of mental health. Some would assert that only another human can adequately counsel another human. An AI system is not human and does not have human experience under its belt. A counterviewpoint is that notwithstanding humanness, there is still a place for AI to aid humans, including in the sphere of mental health guidance or advice.

Let’s conclude this discussion for now by invoking a famous line.

The renowned American psychologist Carl Rogers purportedly said this: “In my early professional years, I was asking the question, how can I treat, or cure, or change this person? Now I would phrase the question in this way, how can I provide a relationship that this person may use for their personal growth?”

Can generative AI form a relationship with humans, and if so, do we want that to be how mental health guidance is conveyed or advised?

More questions ostensibly need more answers; thus, the endeavor must continue.
