2Department of Medical Informatics, Başkent University Faculty of Medicine, Ankara-Türkiye. DOI: 10.5505/tjo.2025.4563
Summary
OBJECTIVE
Cancer drug reimbursement policies play a crucial role in regulating access to novel therapies. In Türkiye, the Social Security Institution's Health Practice Regulation (SUT) defines the reimbursement criteria for cancer treatments. Despite frequent updates, oncologists face challenges in interpreting and applying these regulations. Large language models (LLMs) have shown potential in processing complex medical texts, yet their utility in regulatory compliance remains underexplored.
METHODS
We evaluated the effectiveness of Google Gemini 2.5 Pro Preview 03-25 in analyzing drug reimbursement policies within the SUT. A structured prompt was developed to ensure responses strictly adhered to the regulatory text. A total of 80 oncology-related test questions, covering multiple cancer types, were used to assess the model's accuracy. Responses were categorized as correct and complete, correct but incomplete, or incorrect. Performance metrics, including precision, recall, and F1-score, were calculated. An iterative prompt engineering process was employed to optimize model performance.
RESULTS
The LLM provided completely correct responses to 77 (96.3%) of 80 test cases and correct but incomplete responses to 3 (3.7%), with no incorrect answers. Performance metrics demonstrated high accuracy (precision: 1.00, recall: 0.96, F1-score: 0.98). The model successfully processed medical terminology variations but showed limitations in synthesizing implicit reimbursement rules.
CONCLUSION
LLMs demonstrate strong potential for interpreting cancer drug reimbursement regulations, reducing administrative burden for oncologists. Future refinements should address inference limitations to enhance regulatory compliance support in clinical practice.
Introduction
Cancer treatment is a rapidly evolving field, with novel therapeutic strategies emerging at an unprecedented pace. However, the high cost of these advancements necessitates the implementation of strict reimbursement policies to regulate access to novel treatments. In Türkiye, the Social Security Institution dictates the reimbursement criteria for healthcare services through the Health Practice Regulation (SUT), an official guideline defining the eligibility criteria and cost coverage for medical interventions, including cancer treatments.[1] Despite frequent updates to the SUT, discrepancies persist between international treatment guidelines and national reimbursement policies, imposing a significant burden on medical oncologists who must navigate these complexities to ensure timely and appropriate patient care. This often translates into extensive time spent consulting the SUT during various stages of treatment. Non-compliance can lead to treatment delays, increased workload, and administrative challenges.
Artificial intelligence (AI) has demonstrated significant potential in oncology, improving diagnostic accuracy, treatment decision-making, and workflow efficiency.[2-4] Large language models (LLMs), a subset of AI, can process extensive medical literature and regulatory documents, supporting clinical decisions. Prior studies have evaluated AI applications in radiology, precision oncology, and treatment recommendations,[5,6] yet their role in automating regulatory compliance for drug reimbursement remains underexplored.
This study aims to evaluate the effectiveness of LLMs in analyzing drug utilization principles in cancer treatment according to the SUT regulations. Specifically, we will investigate the capacity of LLMs to interpret complex regulatory text and provide relevant guidance on drug eligibility and reimbursement criteria within the context of cancer treatment.
Methods
Study Design and Objective
This study evaluated the effectiveness of LLMs in analyzing drug use principles in cancer treatment according to the SUT. The aim was to assess the accuracy and completeness of LLM-generated responses in extracting and interpreting regulatory information on drug reimbursement criteria.
Large Language Models and Prompt Development
We employed the advanced LLM Google Gemini 2.5 Pro Preview 03-25 to process and analyze the SUT regulations. The model was selected based on its demonstrated capabilities in natural language understanding and complex regulatory interpretation, as well as its large context capacity of up to 2 million tokens.[7,8] Notably, OpenAI's GPT-4 Turbo was not utilized in our study, as it was unable to process the prompt text due to length constraints. To ensure accurate and reproducible results, we developed a structured prompt designed to constrain the model's responses strictly to the provided regulatory text, minimizing hallucinations and extraneous information (Appendix 1). The prompt was constructed in a stepwise manner, beginning with a clearly defined Task statement: "Your task is to inform Medical Oncology specialists about the conditions under which certain drugs used in cancer treatment can be utilized according to the Republic of Türkiye's Social Security Institution 'Health Practice Regulation' (SUT)." Following this, we established Key Instructions to guide the model in generating responses strictly based on the provided regulatory document. These instructions included responding only in Turkish, using only the provided SUT text, referencing only drugs listed in the SUT, and providing verbatim regulatory information.[1]
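The stepwise prompt construction described above can be sketched as a simple template. The Task statement is quoted from the study, but the exact wording of the Key Instructions and the section layout are illustrative assumptions; the full prompt used in the study is given in Appendix 1.

```python
# Illustrative sketch of assembling the structured prompt in the stepwise
# manner described above. SUT_TEXT is a placeholder for the regulatory text;
# the Key Instructions wording is paraphrased, not the study's exact prompt.

TASK = (
    "Your task is to inform Medical Oncology specialists about the "
    "conditions under which certain drugs used in cancer treatment can be "
    "utilized according to the Republic of Türkiye's Social Security "
    "Institution 'Health Practice Regulation' (SUT)."
)

KEY_INSTRUCTIONS = [
    "Respond only in Turkish.",
    "Use only the provided SUT text; do not draw on outside knowledge.",
    "Reference only drugs listed in the SUT.",
    "Quote regulatory conditions verbatim.",
]

def build_prompt(sut_text: str) -> str:
    """Concatenate the Task statement, Key Instructions, and SUT text."""
    instructions = "\n".join(f"- {rule}" for rule in KEY_INSTRUCTIONS)
    return f"Task: {TASK}\n\nKey Instructions:\n{instructions}\n\nSUT:\n{sut_text}"

prompt = build_prompt("<full SUT text goes here>")
```

Keeping the instructions as a list makes the iterative refinement described later straightforward: each error analysis can append or reword a single rule and regenerate the prompt.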
Testing and Evaluation of Model Responses
To evaluate the LLM's ability to extract and apply SUT regulations, a set of test questions was developed by independent experts in medical oncology. The test questions were based on real-world clinical scenarios and covered various tumor types, including lung, breast, gastrointestinal, gynecological, genitourinary, central nervous system, melanoma and skin, sarcoma, lymphoma, multiple myeloma, and other cancers. Each question was tested in an isolated session to prevent contextual memory effects. The generated responses were then compared against the SUT text to assess their accuracy. LLM responses were categorized as: (1) correct and complete, (2) correct but incomplete, or (3) incorrect. We calculated precision, recall, and F1-score to quantitatively assess performance. These metrics were computed as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Support, defined as the number of instances per category, was also recorded to ensure balanced representation across different cancer types.
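The reported figures are consistent with a mapping, inferred from the results rather than stated explicitly, in which correct-and-complete responses count as true positives, correct-but-incomplete responses as false negatives, and no false positives occur:

```python
# Reproducing the reported metrics under the (inferred) mapping where
# "correct and complete" answers are true positives, "correct but
# incomplete" answers are false negatives, and no false positives occurred.

tp, fn, fp = 77, 3, 0  # counts from the 80 test questions

precision = tp / (tp + fp)  # 77 / 77 = 1.00
recall = tp / (tp + fn)     # 77 / 80 = 0.9625
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.96 0.98
```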
Iterative Refinement via Prompt Engineering
A key aspect of this study was the iterative refinement of the prompt to improve model accuracy. When a response was classified as incorrect or incomplete, an error analysis was conducted to identify potential sources of misunderstanding. We then updated the prompt by incorporating specific clarifications to address these issues. This process was repeated iteratively until the model demonstrated optimal performance across multiple test scenarios (Fig. 1). The final prompt consisted of 16,288 words, 40,844 tokens, and 131,472 characters. The model used the default temperature of 1.0.
Fig. 1. Flow chart of the study.
LLMs: Large language models; SUT: Health practice regulation.
Statistical Analysis
Descriptive findings were reported as frequencies and percentages. The Python sklearn.metrics module was utilized for calculating classification metrics, while matplotlib.pyplot and seaborn were employed to visualize the confusion matrix.
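Assuming the three response categories were encoded as class labels, the sklearn.metrics workflow mentioned above might look like the following sketch; the label names and encoding are our assumptions, not the authors' actual script:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Ground truth: every question has a fully correct reference answer.
y_true = ["correct_complete"] * 80
# Model output: 77 fully correct and 3 correct-but-incomplete responses.
y_pred = ["correct_complete"] * 77 + ["correct_incomplete"] * 3

labels = ["correct_complete", "correct_incomplete", "incorrect"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision/recall/F1 plus support (instances per category).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
```

The resulting matrix could then be rendered as the heatmap in Figure 2, e.g., with seaborn.heatmap(cm, annot=True) via matplotlib.pyplot.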
Results
The model was tested using a total of 80 questions covering various cancer types and treatment scenarios. The distribution of test questions included 12 (15%) related to lung cancer, 12 (15%) to breast cancer, 12 (15%) to gastrointestinal cancers, 6 (7.5%) to gynecological cancers, 9 (11.3%) to genitourinary cancers, 6 (7.5%) to central nervous system tumors, 6 (7.5%) to melanoma and skin cancers, 3 (3.7%) to sarcomas, 6 (7.5%) to lymphoma and multiple myeloma, and 8 (10%) to general oncology topics (Table 1).

Table 1. Test questions by medical oncology subspecialty
The LLM demonstrated a high degree of accuracy in extracting and interpreting drug reimbursement criteria from the SUT regulations. Out of 80 test questions, the model provided completely correct responses to 77 (96.3%) and correct but incomplete responses to 3 (3.7%), with no instances (0%) of incorrect answers (Fig. 2). The model's precision was calculated as 1.00, recall as 0.96, and the F1-score as 0.98, indicating strong overall performance in regulatory text interpretation.
Fig. 2. Confusion matrix illustrates the classification outcomes (D-reflects the truth).
Discussion
This study demonstrates the potential of LLMs in analyzing complex regulatory frameworks, specifically cancer drug reimbursement policies. The high accuracy rates achieved by the model, with an F1-score of 0.98, indicate that LLMs can effectively interpret structured medical regulations and provide reliable guidance to oncology specialists.
The model's ability to correctly interpret and extract drug eligibility criteria from the SUT aligns with prior studies evaluating LLMs in clinical decision support. Benary et al.[6] demonstrated that LLMs could assist in personalized oncology by identifying treatment options based on complex biomarkers. Similarly, Rydzewski et al.[9] found that LLMs exhibit promising performance in oncology-related queries. In the present study, the LLM accurately retrieved verbatim regulatory criteria in 96.3% of test cases, reinforcing its utility as a tool for regulatory compliance in oncology practice.
Despite these strengths, three specific test cases were answered incompletely. Notably, when asked about first-line treatment options for metastatic lung cancer without driver mutations, the model omitted some chemotherapies that could be used without explicit indication restrictions. Similarly, in a question concerning high-risk bone metastases in metastatic hormone-sensitive prostate cancer, the model correctly identified denosumab but failed to acknowledge zoledronic acid, despite its inclusion under broader reimbursement conditions. A third case regarding metastatic laryngeal squamous cell carcinoma highlighted the model's tendency to focus on explicitly mentioned therapies (cetuximab) while neglecting chemotherapy options that could be inferred from related SUT provisions. These limitations were likely due to our prompt design, which strictly instructed the model to rely only on the SUT text without making medical inferences. While this prevented hallucinations, it also restricted the model's ability to synthesize related rules. This trade-off is consistent with findings from AI studies in radiology and oncology, where models performed well when retrieving direct information but struggled with complex reasoning.[3,5]
Our study also tested the model's ability to handle variations in medical terminology. We deliberately rephrased key terms (e.g., "liver cancer" instead of "hepatocellular carcinoma," "cerbB2" instead of "HER2") and used common abbreviations to mimic real-world oncology practice. The model successfully recognized these variations, suggesting that it can adapt to the different ways oncologists phrase reimbursement-related queries. This contrasts with earlier LLM studies where models had difficulty processing medical language inconsistencies.[10]
The integration of AI into oncology decision-making has been met with both enthusiasm and caution. While AI holds significant promise for improving efficiency in clinical workflows, concerns remain regarding reliability, accountability, and regulatory compliance.[11] A major advantage observed in this study was the LLM's ability to rapidly process and retrieve relevant reimbursement criteria, potentially reducing the administrative burden on medical oncologists. This aligns with the efficiency gains that AI provides in facilitating clinical documentation and treatment planning.[12,13] However, as noted by Haftenberger et al.,[14] regulatory and legal frameworks must be carefully considered when implementing AI-driven decision-support tools in medical practice.
A significant challenge in cancer drug reimbursement worldwide is the misalignment between national policies and international treatment guidelines, often resulting in delays in patient access to novel therapies.[15,16] Studies have shown that Türkiye's reimbursement system, like those in many middle-income countries, is subject to economic constraints that necessitate stringent cost-control measures, sometimes at the expense of clinical flexibility.[17,18] The ability of LLMs to systematically navigate and interpret reimbursement criteria could help bridge this gap by providing real-time guidance to clinicians.
Conclusion
Our findings support the use of LLMs in analyzing Türkiye's cancer drug reimbursement policies. The model performed well in retrieving structured reimbursement rules, correctly interpreting terminology variations, and demonstrating high accuracy in regulatory text processing.

Informed Consent: Not applicable.
Conflict of Interest Statement: The authors have no conflicts of interest to declare.
Funding: The authors declared that this study received no financial support.
Use of AI for Writing Assistance: The authors used ChatGPT (GPT-4o, OpenAI) for language editing only. All content was authored by the authors without AI-generated material.
Author Contributions: Concept - R.I., Z.A.; Design - Z.A., M.K.; Supervision - Z.A.; Funding - Z.A.; Materials - R.I., Z.A., A.F., M.N.R.; Data collection and/or processing - A.O., Ö.A., R.I.; Data analysis and/or interpretation - R.I., Z.A.; Literature search - R.I.; Writing - R.I.; Critical review - Z.A., O.A.
Peer-review: Externally peer-reviewed.
References
1) Sosyal Güvenlik Kurumu Sağlık Uygulama Tebliği. Available at: https://mevzuat.gov.tr/mevzuat?MevzuatNo=17229&MevzuatTur=9&MevzuatTertip=5. Accessed Aug 18, 2025.
2) Lotter W, Hassett MJ, Schultz N, Kehl KL, Van Allen EM, Cerami E. Artificial intelligence in oncology: Current landscape, challenges, and future directions. Cancer Discov 2024;14(5):711-26.
3) Bera K, Braman N, Gupta A, Velcheti V, Madabhushi A. Predicting cancer outcomes with radiomics and artificial intelligence in radiology. Nat Rev Clin Oncol 2022;19(2):132-46.
4) Jiang Y, Yang M, Wang S, Li X, Sun Y. Emerging role of deep learning-based artificial intelligence in tumor pathology. Cancer Commun (Lond) 2020;40(4):154-66.
5) Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports. Br J Radiol 2025;98(1167):368-74.
6) Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open 2023;6(11):e2343689.
7) Akhtar ZB. From Bard to Gemini: An investigative exploration journey through Google's evolution in conversational AI and generative AI. Comput Artif Intell 2024;2(1):1378.
8) Rane N, Choudhary S, Rane J. Gemini versus ChatGPT: Applications, performance, architecture, capabilities, and implementation. J Appl Artif Intell 2024;5(1):69-93.
9) Rydzewski NR, Dinakaran D, Zhao SG, Ruppin E, Turkbey B, Citrin DE, et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI 2024;1(5):aioa2300151.
10) Irmici G, Cozzi A, Della Pepa G, De Berardinis C, D'Ascoli E, Cellina M, et al. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. Radiol Med 2024;129(10):1463-7.
11) Caglayan A, Slusarczyk W, Rabbani RD, Ghose A, Papadopoulos V, Boussios S. Large language models in oncology: Revolution or cause for concern? Curr Oncol 2024;31(4):1817-30.
12) Alhur A. Redefining healthcare with artificial intelligence (AI): The contributions of ChatGPT, Gemini, and Co-pilot. Cureus 2024;16(4):e57795.
13) Jin Q, Wang Z, Floudas CS, Chen F, Gong C, Bracken-Clarke D, et al. Matching patients to clinical trials with large language models. Nat Commun 2024;15(1):9074.
14) Haftenberger A, Dierks C. Legal integration of artificial intelligence into internal medicine: Data protection, regulatory, reimbursement and liability questions. Inn Med (Heidelb) 2023;64(11):1044-50.
15) Atikeler EK, Leufkens H, Goettsch W. Access to medicines in Turkey: Evaluation of the process of medicines brought from abroad. Int J Technol Assess Health Care 2020;36(6):585-91.
16) Güven AT. Time to close the gap between guidelines and the reimbursement policy for diabetes treatment in Turkey. Pharmacoeconomics 2023;41(8):843-4.