Are there any good open datasets for training fintech models (fraud detection, credit scoring, etc.)?

Are there any free data sources available that I can use to build or train AI for finance-related apps?

Seems so many…


There are good free/open datasets for fintech AI, but they are best used for learning, prototyping, research, benchmarking, and demos — not as a complete replacement for real bank, lender, or payment-company data.

The easiest way to think about it:

Goal | Good data type
Fraud detection | Card transactions, bank transfers, synthetic fraud data
Credit scoring | Loan/default datasets, mortgage data, credit-risk benchmarks
AML / money laundering | Synthetic bank-transfer data, crypto transaction graphs
Finance chatbots / RAG | SEC filings, financial reports, complaints, financial news
Market/macro apps | Economic time series, public financial indicators

Best open datasets by use case

1. Fraud detection

Credit Card Fraud Detection — Kaggle / ULB

This is the classic beginner dataset for payment-card fraud detection. It contains 284,807 transactions and 492 fraud cases, so it is highly imbalanced. That makes it useful for learning why fraud detection is hard: fraud is rare, and simple accuracy can be misleading. (Kaggle)

Good for:

  • beginner fraud models
  • imbalanced classification
  • anomaly detection
  • precision/recall practice
  • threshold tuning

Main caution:
Most features are anonymized, so it is not very business-interpretable. It is good for learning, not for proving a real fraud system is production-ready.
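To make the accuracy problem concrete, here is a minimal Python sketch that uses only the two counts quoted above. The "model" is hypothetical: it never flags anything, yet it still scores very high accuracy while catching no fraud at all.

```python
# Counts quoted above for the Kaggle / ULB credit card dataset.
total_transactions = 284_807
fraud_cases = 492

fraud_rate = fraud_cases / total_transactions

# Hypothetical baseline that predicts "not fraud" for every transaction.
true_negatives = total_transactions - fraud_cases
true_positives = 0

accuracy = (true_negatives + true_positives) / total_transactions
recall = true_positives / fraud_cases

print(f"fraud rate: {fraud_rate:.2%}")
print(f"accuracy of always predicting 'not fraud': {accuracy:.2%}")
print(f"recall on fraud: {recall:.0%}")
```

The fraud rate comes out at roughly 0.17%, so the do-nothing baseline reaches about 99.83% accuracy with 0% recall. Any real model has to be judged against that baseline, not against 50%.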


Bank Account Fraud (BAF) Dataset Suite

BAF is a public suite of synthetic bank-account-opening fraud datasets. It was published at NeurIPS 2022 and was designed to capture realistic problems such as class imbalance, bias, and temporal dynamics. (Kaggle)

Good for:

  • account-opening fraud
  • fairness testing
  • synthetic fraud modeling
  • temporal validation
  • tabular ML benchmarks

Main caution:
It is synthetic. That is useful for privacy, but a model can learn patterns from the generator rather than real fraud behavior.
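Temporal validation means training on earlier records and evaluating on later ones instead of shuffling everything together. A minimal sketch, assuming each record carries some time index; the `month` and `is_fraud` field names and the toy records below are illustrative and not taken from the BAF schema:

```python
def temporal_split(records, time_key, cutoff):
    """Return (train, test): train is strictly before the cutoff, test is at or after it."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test


# Toy records with hypothetical field names.
records = [
    {"month": 0, "is_fraud": 0},
    {"month": 1, "is_fraud": 0},
    {"month": 2, "is_fraud": 1},
    {"month": 5, "is_fraud": 0},
    {"month": 6, "is_fraud": 1},
    {"month": 7, "is_fraud": 0},
]

train, test = temporal_split(records, time_key="month", cutoff=5)
print(len(train), "training rows,", len(test), "test rows")
```

A random split would let the model see the future during training, which tends to make fraud results look better than they would be in deployment.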


Synthetic Financial Datasets for Fraud Detection — Hugging Face

This Hugging Face dataset is distributed in Parquet format and has millions of rows, making it convenient for larger fraud-detection experiments. (Hugging Face)

Good for:

  • scalable fraud modeling
  • tabular ML
  • pipeline testing
  • Hugging Face / pandas workflows

Main caution:
Synthetic fraud data is useful for practice, but it should not be treated as real bank validation.


2. AML / money-laundering detection

IBM AML-Data

IBM’s AML-Data repo provides synthetic financial transactions such as bank transfers, purchases, credit-card transactions, and checks. Most transactions are legitimate, while some represent money laundering; the data is in CSV format and generated using a multi-agent virtual-world model. (GitHub)

Good for:

  • AML transaction monitoring
  • suspicious-transaction classification
  • graph features
  • money-flow analysis
  • alert-ranking demos

Main caution:
Synthetic AML patterns may be easier or cleaner than real laundering behavior.
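"Graph features" and "money-flow analysis" usually start with simple per-account aggregates over the transfer graph. The sketch below is one possible starting point, assuming transfers are available as `(source, destination, amount)` tuples; that layout and the account names are made up for illustration and are not the IBM CSV schema.

```python
from collections import defaultdict


def account_flow_features(transfers):
    """Compute fan-in, fan-out, and total amounts per account from (src, dst, amount) tuples."""
    raw = defaultdict(
        lambda: {"out_partners": set(), "in_partners": set(), "sent": 0.0, "received": 0.0}
    )
    for src, dst, amount in transfers:
        raw[src]["out_partners"].add(dst)
        raw[src]["sent"] += amount
        raw[dst]["in_partners"].add(src)
        raw[dst]["received"] += amount
    return {
        account: {
            "fan_out": len(values["out_partners"]),
            "fan_in": len(values["in_partners"]),
            "sent": values["sent"],
            "received": values["received"],
        }
        for account, values in raw.items()
    }


# Toy example: three accounts pay into C, which forwards almost everything to E.
transfers = [
    ("A", "C", 900.0),
    ("B", "C", 950.0),
    ("D", "C", 980.0),
    ("C", "E", 2800.0),
]

features = account_flow_features(transfers)
print(features["C"])
```

Account C has high fan-in and passes most of what it receives straight on, which is the kind of pattern these features are meant to surface for a downstream classifier or alert ranker.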


IBM AMLSim

AMLSim is a multi-agent simulator for generating synthetic banking transaction data so researchers can test AML algorithms on shared synthetic data. (GitHub)

Good for:

  • generating custom AML data
  • testing money-laundering typologies
  • graph-based AML experiments
  • synthetic transaction simulation

Main caution:
You may need engineering work to generate exactly the scenarios you want.


Elliptic++ Bitcoin Dataset

Elliptic++ is useful for crypto-related AML. It contains about 203k Bitcoin transactions and 822k wallet addresses, enabling illicit-transaction and illicit-address detection with graph data. (GitHub)

Good for:

  • crypto AML
  • graph neural networks
  • illicit transaction detection
  • wallet/address risk scoring

Main caution:
Bitcoin graph behavior does not automatically generalize to normal bank transfers, card payments, or lending.


3. Credit scoring and default prediction

German Credit Data — UCI

German Credit is a classic credit-risk dataset. It classifies people as good or bad credit risks and has 1,000 instances and 20 features. (UCI Machine Learning Repository)

Good for:

  • beginner credit scoring
  • scorecard modeling
  • fairness demos
  • cost-sensitive classification

Main caution:
It is small and old. Use it for learning, not for production credit decisions.
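Cost-sensitive classification means picking a decision threshold by total cost rather than by error count. The sketch below assumes a 5:1 cost ratio (accepting a bad risk costs five times as much as rejecting a good one), which mirrors the cost matrix I recall from the UCI documentation for this dataset; treat the ratio, the scores, and the labels as assumptions for illustration.

```python
COST_BAD_ACCEPTED = 5   # predicted good, actually bad (assumed cost)
COST_GOOD_REJECTED = 1  # predicted bad, actually good (assumed cost)


def total_cost(p_bad, is_bad, threshold):
    """Total cost when every applicant with p_bad >= threshold is rejected."""
    cost = 0
    for p, bad in zip(p_bad, is_bad):
        rejected = p >= threshold
        if bad and not rejected:
            cost += COST_BAD_ACCEPTED
        elif not bad and rejected:
            cost += COST_GOOD_REJECTED
    return cost


# Toy model outputs: predicted probability of being a bad risk, and the true label.
p_bad = [0.05, 0.12, 0.22, 0.31, 0.37, 0.46, 0.62, 0.81]
is_bad = [0, 0, 0, 1, 0, 1, 0, 1]

# Evaluate a grid of thresholds from 0.1 to 0.9 and keep the cheapest one.
costs = {k / 10: total_cost(p_bad, is_bad, k / 10) for k in range(1, 10)}
best_threshold = min(costs, key=costs.get)

print(costs)
print("cheapest threshold:", best_threshold)
```

On this toy data the cheapest threshold is well below 0.5, because letting one bad risk through costs as much as rejecting five good applicants.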


Default of Credit Card Clients

This dataset has 30,000 instances, 24 features, and a binary default label. It is commonly used for credit-card default prediction. (OpenML)

Good for:

  • default prediction
  • credit-risk classification
  • calibration
  • fairness checks
  • explainability demos

Main caution:
It is older and jurisdiction-specific, so do not assume it reflects your customer base.
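Calibration asks whether predicted default probabilities match observed default rates. Two common checks are the Brier score and a reliability table. A minimal sketch in plain Python, using made-up predictions rather than anything from this dataset:

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def reliability_table(probs, labels, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_predicted = sum(p for p, _ in members) / len(members)
            observed_rate = sum(y for _, y in members) / len(members)
            rows.append((mean_predicted, observed_rate, len(members)))
    return rows


# Toy predictions: the low-probability group defaults more often than predicted.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1]

score = brier_score(probs, labels)
rows = reliability_table(probs, labels)
print(f"Brier score: {score:.3f}")
for mean_predicted, observed_rate, count in rows:
    print(f"predicted {mean_predicted:.2f} vs observed {observed_rate:.2f} (n={count})")
```

In the lowest bin the model predicts about 10% default but 25% actually default, which is the kind of gap a reliability table makes visible even when ranking metrics look fine.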


Home Credit Default Risk — Kaggle

This Kaggle competition dataset asks whether each applicant is capable of repaying a loan. The data includes a main application table, with one row per loan, and a target for the training set. (Kaggle)

Good for:

  • more realistic credit-risk modeling
  • feature engineering
  • relational/tabular data
  • loan repayment prediction

Main caution:
It is more complex than beginner datasets. Also, Kaggle datasets may have specific competition/data-use rules.


FICO HELOC Dataset

The FICO HELOC dataset is widely used for explainable credit-risk modeling. The task is to predict whether applicants will repay a home-equity line of credit within two years. (docs.interpretable.ai)

Good for:

  • explainable AI
  • credit underwriting examples
  • scorecards
  • interpretable ML
  • adverse-action-style explanations

Main caution:
Check access terms before using it commercially.


HMDA Mortgage Data

HMDA is one of the most important public datasets for U.S. mortgage analysis. The CFPB describes HMDA data as the most comprehensive public source of information on the U.S. mortgage market. (Consumer Financial Protection Bureau)

Good for:

  • mortgage lending analysis
  • fair-lending research
  • loan approval/denial analysis
  • geographic and demographic studies

Main caution:
HMDA is not a full credit-bureau dataset. It does not contain every underwriting variable a lender would use.


4. Finance NLP, chatbots, and RAG

SEC EDGAR APIs

The SEC provides RESTful APIs for company submissions and XBRL financial-statement data. The APIs return JSON, need no authentication or API key, and include submissions history plus XBRL data from filings such as 10-K, 10-Q, 8-K, 20-F, 40-F, and related forms. (U.S. Securities and Exchange Commission)
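As a rough sketch of how these APIs are addressed: to my understanding, each company is identified by its CIK, zero-padded to 10 digits, under `data.sec.gov`. The helper names below are my own, CIK 320193 (Apple Inc.) is only an example, and the `User-Agent` value is a placeholder; the SEC asks automated clients to identify themselves, so check its current access guidance before sending real requests.

```python
import json
import urllib.request

SEC_DATA_BASE = "https://data.sec.gov"


def submissions_url(cik):
    """URL for a company's filing history; the CIK is zero-padded to 10 digits."""
    return f"{SEC_DATA_BASE}/submissions/CIK{int(cik):010d}.json"


def company_facts_url(cik):
    """URL for a company's XBRL facts across its filings."""
    return f"{SEC_DATA_BASE}/api/xbrl/companyfacts/CIK{int(cik):010d}.json"


def fetch_json(url, user_agent):
    """Fetch and parse JSON, sending a descriptive User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the URLs needs no network access.
print(submissions_url(320193))
print(company_facts_url(320193))

# Example call (not executed here; the contact string is a placeholder):
# data = fetch_json(submissions_url(320193), "Your Name your.email@example.com")
```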

Good for:

  • finance chatbots
  • SEC filing search
  • 10-K / 10-Q analysis
  • financial-statement extraction
  • RAG systems
  • company-risk summarization

Main caution:
Raw filings are long and messy. You need good chunking, retrieval, citations, and date handling.
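One simple way to handle chunking, citations, and dates together is to split each filing into overlapping word windows and copy the filing's metadata onto every chunk. This is a minimal sketch, not a recommended production chunker; the metadata fields (`form`, `filed`, `source`) are hypothetical names chosen for the example.

```python
def chunk_words(text, metadata, chunk_size=200, overlap=50):
    """Split text into overlapping word windows, attaching metadata to each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({**metadata, "start_word": start, "text": " ".join(window)})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical filing metadata, kept on every chunk so answers can cite and date their source.
filing = {"form": "10-K", "filed": "2024-02-01", "source": "example-filing"}
text = " ".join(f"w{i}" for i in range(10))

chunks = chunk_words(text, filing, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk["filed"], chunk["start_word"], chunk["text"])
```

Keeping the filing date on each chunk lets the retriever filter or sort by period, so an answer about the latest fiscal year is not built from a ten-year-old filing.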


PleIAs/SEC — Hugging Face

This Hugging Face dataset contains SEC annual reports (Form 10-K) from 1993 to 2024, stored in Parquet format. (Hugging Face)

Good for:

  • SEC filing RAG
  • finance document search
  • long-document summarization
  • financial text embeddings
  • risk-factor extraction

Main caution:
A chatbot trained or built on filings should cite sources and avoid giving unsupported investment advice.


Financial PhraseBank

Financial PhraseBank contains 4,840 English financial-news sentences labeled by sentiment. (Hugging Face)

Good for:

  • financial sentiment classification
  • small NLP baselines
  • fine-tuning a classifier
  • positive/neutral/negative financial text analysis

Main caution:
It is small. It is better for evaluation or a simple classifier than for training a large financial language model.


CFPB Consumer Complaint Database

The CFPB Consumer Complaint Database lets users explore, filter, map, read, and export consumer complaints about financial products and services. (Consumer Financial Protection Bureau)

Good for:

  • complaint classification
  • customer-support routing
  • financial product taxonomy
  • consumer-finance NLP
  • topic modeling
  • trend detection

Main caution:
Complaints are not a random sample of all customers. They reflect people who chose to complain.


5. Market, macroeconomic, and public financial data

FRED API

The FRED API gives programmatic access to economic data from FRED and ALFRED. It can retrieve data by source, release, category, series, and other parameters. (FRED)

Good for:

  • interest rates
  • inflation
  • unemployment
  • GDP
  • credit-cycle indicators
  • macroeconomic features

Main caution:
FRED is great for context, but it is not customer-level fintech data.
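A request for one series is a plain HTTP GET with query parameters. The sketch below only builds the URL. It assumes the `series/observations` endpoint and a free API key, which FRED requires as far as I know; `UNRATE` (the U.S. unemployment rate) is used as an example series and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

FRED_OBSERVATIONS_ENDPOINT = "https://api.stlouisfed.org/fred/series/observations"


def fred_observations_url(series_id, api_key, start=None):
    """Build a URL that requests one series as JSON, optionally from a start date."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    if start is not None:
        params["observation_start"] = start  # YYYY-MM-DD
    return f"{FRED_OBSERVATIONS_ENDPOINT}?{urlencode(params)}"


fred_url = fred_observations_url("UNRATE", api_key="YOUR_API_KEY", start="2015-01-01")
print(fred_url)
```

When joining a series like this onto loan or transaction records, align on the date the value was known rather than the date it describes, otherwise revised figures can leak future information into training data.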


World Bank Indicators API

The World Bank Indicators API provides programmatic access to nearly 16,000 time-series indicators across many databases, with many series going back more than 50 years. (World Bank Data Help Desk)

Good for:

  • country risk
  • financial inclusion analysis
  • macroeconomic modeling
  • emerging-market fintech analysis
  • development-finance applications

Main caution:
Most indicators are country-level or macro-level, not transaction-level.
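For orientation, an indicator request is addressed by country code and indicator code. The sketch below only builds the URL, following the v2 URL pattern as I understand it; `BR` and `NY.GDP.MKTP.CD` (GDP in current US$) are example values, and the helper name is my own.

```python
from urllib.parse import urlencode

WORLD_BANK_BASE = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, start_year, end_year, per_page=100):
    """Build a URL for one indicator in one country over a range of years, as JSON."""
    params = {
        "format": "json",
        "date": f"{start_year}:{end_year}",
        "per_page": per_page,
    }
    # Keep ":" unescaped so the year range stays readable in the URL.
    query = urlencode(params, safe=":")
    return f"{WORLD_BANK_BASE}/country/{country}/indicator/{indicator}?{query}"


wb_url = indicator_url("BR", "NY.GDP.MKTP.CD", 2010, 2020)
print(wb_url)
```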


Best starting choices

If you want to build a fraud model

Start with:

  1. Credit Card Fraud Detection
  2. BAF
  3. IBM AML-Data
  4. Elliptic++ if you want crypto/graph fraud

Use metrics like:

  • precision
  • recall
  • PR-AUC
  • precision at top-k
  • recall at fixed false-positive rate

Do not rely only on accuracy. Fraud is usually rare, so accuracy can look good even when the model misses most fraud.
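Here is a minimal sketch of three of these metrics in plain Python, on made-up scores and labels. In practice you would likely use a library such as scikit-learn, but the definitions are short enough to write out:

```python
def precision_recall(labels, preds):
    """Precision and recall for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def precision_at_k(labels, scores, k):
    """Share of true fraud among the k highest-scored cases."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data: 2 fraud cases out of 10.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.10, 0.20, 0.90, 0.80, 0.05, 0.30, 0.40, 0.15, 0.02, 0.25]
preds = [1 if s >= 0.5 else 0 for s in scores]

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
precision, recall = precision_recall(labels, preds)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"precision@3={precision_at_k(labels, scores, 3):.2f}")
```

On this toy data accuracy is 0.80, the same as flagging nothing at all, while precision and recall are both 0.50. Precision at top-k is often the most practical of the three when a review team can only look at a fixed number of alerts per day.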


If you want to build a credit scoring model

Start with:

  1. German Credit
  2. Default of Credit Card Clients
  3. Home Credit Default Risk
  4. FICO HELOC
  5. HMDA for mortgage and fair-lending analysis

For credit scoring, also think about explainability. The CFPB has said lenders using AI or complex credit models must provide specific and accurate reasons when taking adverse action against consumers. (Consumer Financial Protection Bureau)
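One simple way to see what "specific reasons" can look like for an interpretable model: with a linear score, each feature's contribution relative to a reference applicant can be ranked directly. The sketch below is an illustration only. The weights, feature names, and reference values are invented, and this is not a statement of what any regulator accepts.

```python
def top_adverse_reasons(weights, applicant, reference, n=2):
    """Rank features by how far they pull a linear score below a reference applicant's."""
    contributions = {
        name: weights[name] * (applicant[name] - reference[name])
        for name in weights
    }
    # Most negative contribution first; features that do not lower the score are skipped.
    negative = sorted((value, name) for name, value in contributions.items() if value < 0)
    return [name for _, name in negative[:n]]


# Hypothetical linear scorecard: a higher score means lower risk.
weights = {"utilization": -2.0, "years_of_history": 0.3, "recent_inquiries": -0.5}
applicant = {"utilization": 0.9, "years_of_history": 2, "recent_inquiries": 1}
reference = {"utilization": 0.3, "years_of_history": 10, "recent_inquiries": 1}

reasons = top_adverse_reasons(weights, applicant, reference)
print(reasons)
```

For this applicant the short credit history lowers the score most, followed by high utilization. With non-linear models the same idea needs an attribution method, and the result is harder to defend, which is one reason scorecards and other interpretable models remain popular in credit.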


If you want to build a finance chatbot or RAG app

Start with:

  1. SEC EDGAR APIs
  2. PleIAs/SEC on Hugging Face
  3. CFPB complaints
  4. Financial PhraseBank
  5. FRED / World Bank indicators

This is often a safer and more practical starting point than credit scoring because public filings and complaints are real public text data.


Simple ranking: best datasets by beginner-friendliness

Dataset/source | Beginner-friendly? | Best use
Credit Card Fraud Detection | High | fraud basics
German Credit | High | credit scoring basics
Default of Credit Card Clients | High | default prediction
Financial PhraseBank | High | sentiment classification
CFPB complaints | Medium | finance NLP
SEC EDGAR / PleIAs SEC | Medium | finance RAG
Home Credit Default Risk | Medium/hard | advanced credit modeling
IBM AML-Data | Medium | AML transaction modeling
BAF | Medium | realistic fraud/fairness research
Elliptic++ | Hard | crypto graph ML

Important warning

Open fintech datasets are useful, but they usually have limits:

  • Many fraud datasets are synthetic or anonymized.
  • Many credit datasets are old or small.
  • Public datasets rarely contain the full data a bank or lender would use.
  • Real fraud and AML labels are hard to publish because of privacy and security.
  • Production credit models need legal, compliance, fairness, and explainability review.

So the best use of open fintech data is:

Build prototypes, learn modeling techniques, test pipelines, create demos, and benchmark methods.

The wrong use is:

Train on a public dataset and assume it is ready for real lending, fraud blocking, or AML decisions.


Short summary

  • Yes, good free fintech datasets exist.
  • For fraud, start with Credit Card Fraud Detection, BAF, IBM AML-Data, and Elliptic++.
  • For credit scoring, start with German Credit, Default of Credit Card Clients, Home Credit, FICO HELOC, and HMDA.
  • For finance NLP/RAG, use SEC EDGAR, PleIAs/SEC, CFPB complaints, and Financial PhraseBank.
  • For macro/market context, use FRED and World Bank indicators.
  • These datasets are best for learning, prototyping, research, and demos, not direct production deployment.