Data Science Intern at Natwest Group
Jun - Aug 2024
Overview
I was placed into the Data Science team within Natwest Group Solutions, worked on improving the machine learning classifier for the Suspicious Activity Reports (SAR) which the FinCrime team need to analyze.
Challenges
The SAR are written by different humans, without a set strict format in general.
The purpose is to classify each SAR into a category (e.g. Money laundering, gambling, fraud etc), for further analysis. The old classifier used by the FinCrime team is a rule-based model (based on a bunch of if-elses), and can lead to low classication accuracy especially in terms of minority groups. A major challenge for using a machine learning model that can be trained for higher accuracy is the lack of labeled data.
My Approach
We used LLM to generate synthetic SARs for each category. To ensure the quality, we employed clustering analysis by looking at the vector embeddings of the original data and the generated data, both fed into the embedding layer of an LLM into a (1546,) vector. We used PCA to find the two most important dimension for plotting the data vectors, computed average consine similarity for both old and new data, and also constructed a cosine-distance based KNN graph. All evidence suggested there isn’t a systematic difference between the old and the new synthetic text data.
An alternative approach where we directly prompt the LLM to do the classication also worked, though it still suffers from hallucinations and occasionally create categories that didn’t exist.
Technologies Used
We used OpenAI API through Azure.
The coding environment was based on the AWS Sagemaker Studio, and data are stored in the AWS S3 database.