2Department of Medical Informatics, Başkent University Faculty of Medicine, Ankara-Türkiye. DOI: 10.5505/tjo.2025.4563
Summary
OBJECTIVE
Cancer drug reimbursement policies play a crucial role in regulating access to novel therapies. In Türkiye, the Social Security Institution's Health Practice Regulation (SUT) defines the reimbursement criteria for cancer treatments. Despite frequent updates, oncologists face challenges in interpreting and applying these regulations. Large language models (LLMs) have shown potential in processing complex medical texts, yet their utility in regulatory compliance remains underexplored.
METHODS
We evaluated the effectiveness of Google Gemini 2.5 Pro Preview 03-25 in analyzing drug reimbursement policies within the SUT. A structured prompt was developed to ensure responses strictly adhered to the regulatory text. A total of 80 oncology-related test questions, covering multiple cancer types, were used to assess the model's accuracy. Responses were categorized as correct and complete, correct but incomplete, or incorrect. Performance metrics, including precision, recall, and F1-score, were calculated. An iterative prompt engineering process was employed to optimize model performance.
RESULTS
The LLM provided completely correct responses to 77 (96.3%) of 80 test cases and correct but incomplete responses to 3 (3.7%), with no incorrect answers. Performance metrics demonstrated high accuracy (precision: 1.00, recall: 0.96, F1-score: 0.98). The model successfully processed medical terminology variations but showed limitations in synthesizing implicit reimbursement rules.
CONCLUSION
LLMs demonstrate strong potential for interpreting cancer drug reimbursement regulations, reducing administrative burden for oncologists. Future refinements should address inference limitations to enhance regulatory compliance support in clinical practice.
Introduction
Cancer treatment is a rapidly evolving field, with novel therapeutic strategies emerging at an unprecedented pace. However, the high cost of these advancements necessitates the implementation of strict reimbursement policies to regulate access to novel treatments. In Türkiye, the Social Security Institution dictates the reimbursement criteria for healthcare services through the Health Practice Regulation (SUT), an official guideline defining the eligibility criteria and cost coverage for medical interventions, including cancer treatments.[1] Despite frequent updates to the SUT, discrepancies persist between international treatment guidelines and national reimbursement policies, imposing a significant burden on medical oncologists who must navigate these complexities to ensure timely and appropriate patient care. This often translates into extensive time spent consulting the SUT during various stages of treatment. Non-compliance can lead to treatment delays, increased workload, and administrative challenges.
Artificial intelligence (AI) has demonstrated significant potential in oncology, improving diagnostic accuracy, treatment decision-making, and workflow efficiency.[2-4] Large language models (LLMs), a subset of AI, can process extensive medical literature and regulatory documents, supporting clinical decisions. Prior studies have evaluated AI applications in radiology, precision oncology, and treatment recommendations,[5,6] yet their role in automating regulatory compliance for drug reimbursement remains underexplored.
This study aims to evaluate the effectiveness of LLMs in analyzing drug utilization principles in cancer treatment according to the SUT regulations. Specifically, we will investigate the capacity of LLMs to interpret complex regulatory text and provide relevant guidance on drug eligibility and reimbursement criteria within the context of cancer treatment.
Methods
Study Design and Objective
This study evaluated the effectiveness of LLMs in analyzing drug use principles in cancer treatment according to the SUT. The aim was to assess the accuracy and completeness of LLM-generated responses in extracting and interpreting regulatory information on drug reimbursement criteria.
Large Language Models and Prompt Development
We employed the advanced LLM Google Gemini 2.5 Pro Preview 03-25 to process and analyze the SUT regulations. The model was selected based on its demonstrated capabilities in natural language understanding and complex regulatory interpretation, as well as its large context capacity of up to 2 million tokens.[7,8] Notably, OpenAI's GPT-4 Turbo was not utilized in our study, as it was unable to process the prompt text due to length constraints. To ensure accurate and reproducible results, we developed a structured prompt designed to constrain the model's responses strictly to the provided regulatory text, minimizing hallucinations and extraneous information (Appendix 1). The prompt was constructed in a stepwise manner, beginning with a clearly defined Task statement: "Your task is to inform Medical Oncology specialists about the conditions under which certain drugs used in cancer treatment can be utilized according to the Republic of Türkiye's Social Security Institution 'Health Practice Regulation' (SUT)." Following this, we established Key Instructions to guide the model in generating responses strictly based on the provided regulatory document. These instructions included responding only in Turkish, using only the provided SUT text, referencing only drugs listed in the SUT, and providing verbatim regulatory information.[1]
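The stepwise prompt construction described above can be sketched as a simple template. The Task statement is quoted from the study, but the exact wording of the Key Instructions and the section layout are illustrative assumptions; the full prompt used in the study is given in Appendix 1.

```python
# Illustrative sketch of assembling the structured prompt in the stepwise
# manner described above. SUT_TEXT is a placeholder for the regulatory text;
# the Key Instructions wording is paraphrased, not the study's exact prompt.

TASK = (
    "Your task is to inform Medical Oncology specialists about the "
    "conditions under which certain drugs used in cancer treatment can be "
    "utilized according to the Republic of Türkiye's Social Security "
    "Institution 'Health Practice Regulation' (SUT)."
)

KEY_INSTRUCTIONS = [
    "Respond only in Turkish.",
    "Use only the provided SUT text; do not draw on outside knowledge.",
    "Reference only drugs listed in the SUT.",
    "Quote regulatory conditions verbatim.",
]

def build_prompt(sut_text: str) -> str:
    """Concatenate the Task statement, Key Instructions, and SUT text."""
    instructions = "\n".join(f"- {rule}" for rule in KEY_INSTRUCTIONS)
    return f"Task: {TASK}\n\nKey Instructions:\n{instructions}\n\nSUT:\n{sut_text}"

prompt = build_prompt("<full SUT text goes here>")
```

Keeping the instructions as a list makes the iterative refinement described later straightforward: each error analysis can append or reword a single rule and regenerate the prompt.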
Testing and Evaluation of Model Responses
To evaluate the LLM's ability to extract and apply SUT regulations, a set of test questions was developed by independent experts in medical oncology. The test questions were based on real-world clinical scenarios and covered various tumor types, including lung, breast, gastrointestinal, gynecological, genitourinary, central nervous system, melanoma and skin, sarcoma, lymphoma, multiple myeloma, and other cancers. Each question was tested in an isolated session to prevent contextual memory effects. The generated responses were then compared against the SUT text to assess their accuracy. LLM responses were categorized as: (1) correct and complete, (2) correct but incomplete, or (3) incorrect. We calculated precision, recall, and F1-score to quantitatively assess performance. These metrics were computed as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Support, defined as the number of instances per category, was also recorded to ensure balanced representation across different cancer types.
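The reported figures are consistent with a mapping, inferred from the results rather than stated explicitly, in which correct-and-complete responses count as true positives, correct-but-incomplete responses as false negatives, and no false positives occur:

```python
# Reproducing the reported metrics under the (inferred) mapping where
# "correct and complete" answers are true positives, "correct but
# incomplete" answers are false negatives, and no false positives occurred.

tp, fn, fp = 77, 3, 0  # counts from the 80 test questions

precision = tp / (tp + fp)  # 77 / 77 = 1.00
recall = tp / (tp + fn)     # 77 / 80 = 0.9625
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.96 0.98
```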
Iterative Refinement via Prompt Engineering
A key aspect of this study was the iterative refinement of the prompt to improve model accuracy. When a response was classified as incorrect or incomplete, an error analysis was conducted to identify potential sources of misunderstanding. We then updated the prompt by incorporating specific clarifications to address these issues. This process was repeated iteratively until the model demonstrated optimal performance across multiple test scenarios (Fig. 1). The final prompt consisted of 16,288 words, 40,844 tokens, and 131,472 characters. The model used the default temperature of 1.0.
Fig. 1. Flow chart of the study.
LLMs: Large language models; SUT: Health practice regulation.
Statistical Analysis
Descriptive findings were reported as frequencies and percentages. The Python sklearn.metrics module was utilized for calculating classification metrics, while matplotlib.pyplot and seaborn were employed to visualize the confusion matrix.
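Assuming the three response categories were encoded as class labels, the sklearn.metrics workflow mentioned above might look like the following sketch; the label names and encoding are our assumptions, not the authors' actual script:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Ground truth: every question has a fully correct reference answer.
y_true = ["correct_complete"] * 80
# Model output: 77 fully correct and 3 correct-but-incomplete responses.
y_pred = ["correct_complete"] * 77 + ["correct_incomplete"] * 3

labels = ["correct_complete", "correct_incomplete", "incorrect"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision/recall/F1 plus support (instances per category).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
```

The resulting matrix could then be rendered as the heatmap in Figure 2, e.g., with seaborn.heatmap(cm, annot=True) via matplotlib.pyplot.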
Results
The model was tested using a total of 80 questions covering various cancer types and treatment scenarios. The distribution of test questions included 12 (15%) related to lung cancer, 12 (15%) to breast cancer, 12 (15%) to gastrointestinal cancers, 6 (7.5%) to gynecological cancers, 9 (11.3%) to genitourinary cancers, 6 (7.5%) to central nervous system tumors, 6 (7.5%) to melanoma and skin cancers, 3 (3.7%) to sarcomas, 6 (7.5%) to lymphoma and multiple myeloma, and 8 (10%) to general oncology topics (Table 1).

Table 1. Test questions by medical oncology subspecialty
The LLM demonstrated a high degree of accuracy in extracting and interpreting drug reimbursement criteria from the SUT regulations. Out of 80 test questions, the model provided completely correct responses to 77 (96.3%) and correct but incomplete responses to 3 (3.7%), with no instances (0%) of incorrect answers (Fig. 2). The model's precision was calculated as 1.00, recall as 0.96, and the F1-score as 0.98, indicating strong overall performance in regulatory text interpretation.
Fig. 2. Confusion matrix illustrates the classification outcomes (D-reflects the truth).
Discussion
This study demonstrates the potential of LLMs in analyzing complex regulatory frameworks, specifically cancer drug reimbursement policies. The high accuracy rates achieved by the model, with an F1-score of 0.98, indicate that LLMs can effectively interpret structured medical regulations and provide reliable guidance to oncology specialists.
The model's ability to correctly interpret and extract drug eligibility criteria from the SUT aligns with prior studies evaluating LLMs in clinical decision support. Benary et al.[6] demonstrated that LLMs could assist in personalized oncology by identifying treatment options based on complex biomarkers. Similarly, Rydzewski et al.[9] found that LLMs exhibit promising performance in oncology-related queries. In the present study, the LLM accurately retrieved verbatim regulatory criteria in 96.3% of test cases, reinforcing its utility as a tool for regulatory compliance in oncology practice.
Despite these strengths, three specific test cases were answered incompletely. Notably, when asked about first-line treatment options for metastatic lung cancer without driver mutations, the model omitted some chemotherapies that could be used without explicit indication restrictions. Similarly, in a question concerning high-risk bone metastases in metastatic hormone-sensitive prostate cancer, the model correctly identified denosumab but failed to acknowledge zoledronic acid, despite its inclusion under broader reimbursement conditions. A third case regarding metastatic laryngeal squamous cell carcinoma highlighted the model's tendency to focus on explicitly mentioned therapies (cetuximab) while neglecting chemotherapy options that could be inferred from related SUT provisions. These limitations were likely due to our prompt design, which strictly instructed the model to rely only on the SUT text without making medical inferences. While this prevented hallucinations, it also restricted the model's ability to synthesize related rules. This trade-off is consistent with findings from AI studies in radiology and oncology, where models performed well when retrieving direct information but struggled with complex reasoning.[3,5]
Our study also tested the model's ability to handle variations in medical terminology. We deliberately rephrased key terms (e.g., "liver cancer" instead of "hepatocellular carcinoma," "cerbB2" instead of "HER2") and used common abbreviations to mimic real-world oncology practice. The model successfully recognized these variations, suggesting that it can adapt to the different ways oncologists phrase reimbursement-related queries. This contrasts with earlier LLM studies where models had difficulty processing medical language inconsistencies.[10]
The integration of AI into oncology decision-making has been met with both enthusiasm and caution. While AI holds significant promise for improving efficiency in clinical workflows, concerns remain regarding reliability, accountability, and regulatory compliance.[11] A major advantage observed in this study was the LLM's ability to rapidly process and retrieve relevant reimbursement criteria, potentially reducing the administrative burden on medical oncologists. This aligns with the efficiency gains that AI provides in facilitating clinical documentation and treatment planning.[12,13] However, as noted by Haftenberger et al.,[14] regulatory and legal frameworks must be carefully considered when implementing AI-driven decision-support tools in medical practice.
A significant challenge in cancer drug reimbursement worldwide is the misalignment between national policies and international treatment guidelines, often resulting in delays in patient access to novel therapies.[15,16] Studies have shown that Türkiye's reimbursement system, like those in many middle-income countries, is subject to economic constraints that necessitate stringent cost-control measures, sometimes at the expense of clinical flexibility.[17,18] The ability of LLMs to systematically navigate and interpret reimbursement criteria could help bridge this gap by providing real-time guidance to clinicians.
Conclusion
Our findings support the use of LLMs in analyzing Türkiye's cancer drug reimbursement policies. The model performed well in retrieving structured reimbursement rules, correctly interpreting terminology variations, and demonstrating high accuracy in regulatory text processing.

Informed Consent: Not applicable.
Conflict of Interest Statement: The authors have no conflicts of interest to declare.
Funding: The authors declared that this study received no financial support.
Use of AI for Writing Assistance: The authors used ChatGPT (GPT-4o, OpenAI) for language editing only. All content was authored by the authors without AI-generated material.
Author Contributions: Concept - R.I., Z.A.; Design - Z.A., M.K.; Supervision - Z.A.; Funding - Z.A.; Materials - R.I., Z.A., A.F., M.N.R.; Data collection and/or processing - A.O., Ö.A., R.I.; Data analysis and/or interpretation - R.I., Z.A.; Literature search - R.I.; Writing - R.I.; Critical review - Z.A., O.A.
Peer-review: Externally peer-reviewed.
References
1) Sosyal Güvenlik Kurumu Sağlık Uygulama Tebliği. Available at: https://mevzuat.gov.tr/mevzuat?MevzuatNo=17229&MevzuatTur=9&MevzuatTertip=5. Accessed Aug 18, 2025.
2) Lotter W, Hassett MJ, Schultz N, Kehl KL, Van Allen EM, Cerami E. Artificial intelligence in oncology: Current landscape, challenges, and future directions. Cancer Discov 2024;14(5):711-26.
3) Bera K, Braman N, Gupta A, Velcheti V, Madabhushi A. Predicting cancer outcomes with radiomics and artificial intelligence in radiology. Nat Rev Clin Oncol 2022;19(2):132-46.
4) Jiang Y, Yang M, Wang S, Li X, Sun Y. Emerging role of deep learning-based artificial intelligence in tumor pathology. Cancer Commun (Lond) 2020;40(4):154-66.
5) Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports. Br J Radiol 2025;98(1167):368-74.
6) Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open 2023;6(11):e2343689.
7) Akhtar ZB. From Bard to Gemini: An investigative exploration journey through Google's evolution in conversational AI and generative AI. Comput Artif Intell 2024;2(1):1378.
8) Rane N, Choudhary S, Rane J. Gemini versus ChatGPT: Applications, performance, architecture, capabilities, and implementation. J Appl Artif Intell 2024;5(1):69-93.
9) Rydzewski NR, Dinakaran D, Zhao SG, Ruppin E, Turkbey B, Citrin DE, et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI 2024;1(5):aioa2300151.
10) Irmici G, Cozzi A, Della Pepa G, De Berardinis C, D'Ascoli E, Cellina M, et al. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. Radiol Med 2024;129(10):1463-7.
11) Caglayan A, Slusarczyk W, Rabbani RD, Ghose A, Papadopoulos V, Boussios S. Large language models in oncology: Revolution or cause for concern? Curr Oncol 2024;31(4):1817-30.
12) Alhur A. Redefining healthcare with artificial intelligence (AI): The contributions of ChatGPT, Gemini, and Co-pilot. Cureus 2024;16(4):e57795.
13) Jin Q, Wang Z, Floudas CS, Chen F, Gong C, Bracken-Clarke D, et al. Matching patients to clinical trials with large language models. Nat Commun 2024;15(1):9074.
14) Haftenberger A, Dierks C. Legal integration of artificial intelligence into internal medicine: Data protection, regulatory, reimbursement and liability questions. Inn Med (Heidelb) 2023;64(11):1044-50.
15) Atikeler EK, Leufkens H, Goettsch W. Access to medicines in Turkey: Evaluation of the process of medicines brought from abroad. Int J Technol Assess Health Care 2020;36(6):585-91.
16) Güven AT. Time to close the gap between guidelines and the reimbursement policy for diabetes treatment in Turkey. Pharmacoeconomics 2023;41(8):843-4.