Awesome

GitHub projects from awesome lists

Search awesome repositories

Search names, descriptions, topics, tags, and stacks, then tune results by ecosystem, freshness, health, and cross-list signal.

Repos indexed: 17,373
Awesome lists tracked: 125
Current results: 32

32 repos shown

Topic: spark

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

AI dev

Stack

Scala Bundler Maven npm

GitHub topics

#big-data #java #jdbc #python #r #scala

Updated: 2026-07-15
Lists: 4 list mentions
First commit: 2010-03-29
History: 10 history points
License: Apache-2.0
Issues: 450 open

43,614

stars

Forks: 29,275
Commits: 49,068 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

Stack

Python Flask pytest React npm pip pnpm

GitHub topics

#analytics #athena #bi #bigquery #business-intelligence #dashboard

Updated: 2026-07-09
Lists: 5 list mentions
First commit: 2013-10-25
History: 7 history points
License: BSD-2-Clause
Issues: 795 open

28,691

stars

Forks: 4,614
Commits: 7,975 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

yeasy/docker_practice

The latest Docker container technology: learn best practices from real-world cases! | Learn and understand Docker & container technologies, with real DevOps practice!

Stack

Go Django npm pip

GitHub topics

#book #cloud-computing #container #devops #docker #kubernetes

Updated: 2026-07-10
Lists: 1 list mention
First commit: 2014-09-05
History: 2 history points
License: Unknown
Issues: 0 open

26,146

stars

Forks: 5,812
Commits: 1,577 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

FavioVazquez/ds-cheatsheets

List of Data Science Cheatsheets to rule the world

GitHub topics

#cheatsheet #datascience #jupyter #programming #python #r

Updated: 2024-07-18
Lists: 0 list mentions
First commit: 2018-12-22
History: 18 history points
License: MIT
Issues: 13 open

16,281

stars

Forks: 4,050
Commits: 51 commits
Star growth, last 7 days: +6 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

GaiZhenbiao/ChuanhuChatGPT

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

Stack

Python FastAPI Gradio LangChain pip

GitHub topics

#chatbot #chatglm #chatgpt-api #claude #dalle3 #ernie

Updated: 2026-04-30
Lists: 2 list mentions
First commit: 2023-03-02
History: 5 history points
License: GPL-3.0
Issues: 129 open

15,307

stars

Forks: 2,220
Commits: 1,263 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

horovod/horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Stack

Python CMake pip

GitHub topics

#baidu #deep-learning #deeplearning #keras #machine-learning #machinelearning

Updated: 2026-06-20
Lists: 1 list mention
First commit: 2017-08-09
History: 6 history points
License: NOASSERTION
Issues: 406 open

14,692

stars

Forks: 2,240
Commits: 1,349 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

Stack

Java pytest CMake Maven npm

GitHub topics

#artificial-intelligence #clojure #deeplearning #deeplearning4j #dl4j #gpu

Updated: 2026-07-10
Lists: 2 list mentions
First commit: 2019-06-06
History: 5 history points
License: Apache-2.0
Issues: 53 open

14,241

stars

Forks: 3,834
Commits: 2,888 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

tobymao/sqlglot

Python SQL Parser and Transpiler

AI dev

Stack

Python PEP 517 pip

GitHub topics

#bigquery #clickhouse #databricks #duckdb #hive #mysql

Updated: 2026-07-17
Lists: 1 list mention
First commit: 2021-03-13
History: 22 history points
License: MIT
Issues: 5 open

9,433

stars

Forks: 1,207
Commits: 7,893 commits
Star growth, last 7 days: +23 +0.2%
Commit velocity, last 7 days: +44 +0.6%

Website GitHub

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

Stack

Scala Astro Vite Maven PEP 517 pip

GitHub topics

#acid #analytics #big-data #delta-lake #spark

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2019-04-12
History: 5 history points
License: Apache-2.0
Issues: 1,547 open

8,905

stars

Forks: 2,137
Commits: 5,361 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

Stack

Python LangChain Next.js pytest React npm PEP 517 pip

GitHub topics

#artificial-intelligence #data #data-engineering #data-integration #data-pipelines #data-science

Updated: 2026-07-02
Lists: 1 list mention
First commit: 2022-05-16
History: 5 history points
License: Apache-2.0
Issues: 617 open

8,769

stars

Forks: 978
Commits: 5,784 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

AI dev

Stack

Jupyter Notebook Bundler Gradle npm

GitHub topics

#automl #big-data #data-science #deep-learning #distributed #ensemble-learning

Updated: 2026-07-12
Lists: 3 list mentions
First commit: 2014-03-03
History: 8 history points
License: Apache-2.0
Issues: 2,879 open

7,498

stars

Forks: 2,030
Commits: 32,791 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

donnemartin/dev-setup

macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

Stack

Python

GitHub topics

#android-development #aws #bash #cli #cloud #elasticsearch

Updated: 2023-02-27
Lists: 0 list mentions
First commit: 2015-07-08
History: 22 history points
License: NOASSERTION
Issues: 35 open

6,264

stars

Forks: 1,142
Commits: 356 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

microsoft/SynapseML

Simple and Distributed Machine Learning

AI dev

Stack

Scala React npm PEP 517 pip

GitHub topics

#ai #apache-spark #azure #big-data #cognitive-services #data-science

Updated: 2026-07-06
Lists: 1 list mention
First commit: 2017-06-02
History: 5 history points
License: MIT
Issues: 392 open

5,231

stars

Forks: 863
Commits: 1,792 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Stack

Scala Maven

GitHub topics

#dataquality #scala #spark #unit-testing

Updated: 2026-07-13
Lists: 1 list mention
First commit: 2018-08-07
History: 5 history points
License: Apache-2.0
Issues: 92 open

3,631

stars

Forks: 585
Commits: 371 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Stack

Java Spring Boot Vite Vue Maven npm

GitHub topics

#application-manager #context-service #engine #hive #hive-table #impala

Updated: 2026-07-16
Lists: 1 list mention
First commit: 2019-07-23
History: 55 history points
License: Apache-2.0
Issues: 170 open

3,407

stars

Forks: 1,168
Commits: 4,339 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: +3 +0.1%

Website GitHub

WeBankFinTech/DataSphereStudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

Stack

Java Spring Boot Vite Vue Maven npm

GitHub topics

#airflow #atlas #azkaban #dataworks #davinci #dolphinscheduler

Updated: 2025-11-04
Lists: 1 list mention
First commit: 2019-11-24
History: 5 history points
License: Apache-2.0
Issues: 360 open

3,265

stars

Forks: 1,035
Commits: 12,523 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

lakehq/sail

Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads.

AI dev

Stack

Rust Axum pytest React Tailwind CSS Cargo PEP 517 pnpm

GitHub topics

#apache-iceberg #apache-spark #arrow #artificial-intelligence #big-data #data-engineering

Updated: 2026-07-17
Lists: 1 list mention
First commit: 2023-12-21
History: 20 history points
License: Apache-2.0
Issues: 215 open

3,188

stars

Forks: 194
Commits: 1,458 commits
Star growth, last 7 days: +50 +1.6%
Commit velocity, last 7 days: +14 +1.0%

Website GitHub

gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

Archived

Stack

Java Spring Boot Maven

GitHub topics

#accumulo #aggregation #big-data #graph #graph-database #hadoop

Updated: 2025-06-06
Lists: 1 list mention
First commit: 2015-12-14
History: 5 history points
License: Apache-2.0
Issues: 138 open

1,788

stars

Forks: 363
Commits: 7,332 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

strapdata/elassandra

Elassandra = Elasticsearch + Apache Cassandra

Stack

Java Gradle pip

GitHub topics

#aggregation #cassandra #completion #elasticsearch #fuzzy-search #kibana

Updated: 2026-05-17
Lists: 1 list mention
First commit: 2010-02-08
History: 5 history points
License: Apache-2.0
Issues: 60 open

1,713

stars

Forks: 194
Commits: 44,130 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

maxpumperla/elephas

Distributed Deep learning with Keras & Spark

Stack

Python pip

GitHub topics

#deep-learning #distributed-computing #keras #neural-networks #spark

Updated: 2023-05-01
Lists: 1 list mention
First commit: 2015-08-13
History: 6 history points
License: MIT
Issues: 11 open

1,579

stars

Forks: 305
Commits: 509 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

iflytek/datasophon

The next generation of cloud-native big data management expert, aiming to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.

Stack

Java Spring Boot Vue Maven npm

GitHub topics

#cloudnative #doris #easy-to-use #kubernetes #spark #yarn

Updated: 2025-07-22
Lists: 1 list mention
First commit: 2022-11-02
History: 4 history points
License: Apache-2.0
Issues: 173 open

1,325

stars

Forks: 458
Commits: 697 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

TIBCOSoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

Stack

Scala Gradle pip

GitHub topics

#analytics #memory-database #scale #snappydata #spark #stream

Updated: 2022-11-21
Lists: 1 list mention
First commit: 2015-05-13
History: 54 history points
License: NOASSERTION
Issues: 117 open

1,032

stars

Forks: 198
Commits: 4,151 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

a616567126/GPT-WEB-JAVA

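A JDK 8-based AI chatbot! The WeChat official account supports Midjourney image generation and card-key redemption; the web app supports ChatGPT, Midjourney image generation, SD (Stable Diffusion) image generation, card-key redemption, EasyPay (易支付) payments, official-account traffic referral, and email registration 🔥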

Stack

Java Spring Boot Maven

GitHub topics

#bard-api #chatgpt #google #midjourney-api #spark #stable-diffusion

Updated: 2026-05-18
Lists: 2 list mentions
First commit: 2023-03-28
History: 40 history points
License: Apache-2.0
Issues: 8 open

779

stars

Forks: 197
Commits: 451 commits
Star growth, last 7 days: +1 +0.1%
Commit velocity, last 7 days: 0 0.0%

GitHub

MigoXLab/dingo

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

AI dev

Stack

Python LangChain pip

GitHub topics

#agent-as-a-judge #common-crawl #data-agent #data-evaluation #data-quality #data-quality-assessment

Updated: 2026-07-13
Lists: 2 list mentions
First commit: 2024-12-27
History: 8 history points
License: Apache-2.0
Issues: 5 open

725

stars

Forks: 74
Commits: 838 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

capitalone/datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

AI dev

Stack

Python PEP 517

GitHub topics

#compare #dask #data #data-science #dataframes #fugue

Updated: 2026-06-19
Lists: 2 list mentions
First commit: 2018-03-27
History: 6 history points
License: Apache-2.0
Issues: 7 open

652

stars

Forks: 161
Commits: 336 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

polyaxon/traceml

Engine for AI/ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Stack

Python pytest pip

GitHub topics

#dask #data-exploration #data-profiling #data-quality #data-quality-checks #data-science

Updated: 2026-06-17
Lists: 1 list mention
First commit: 2016-03-25
History: 6 history points
License: Apache-2.0
Issues: 7 open

533

stars

Forks: 47
Commits: 10,003 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

GitHub

linkedin/isolation-forest

A distributed Spark/Scala implementation of the isolation forest and extended isolation forest algorithms for unsupervised outlier detection, featuring support for scalable training and ONNX export for easy cross-platform inference.

Stack

Scala pytest Gradle PEP 517 pip

GitHub topics

#anomaly-detection #isolation-forest #linkedin #machine-learning #onnx #outlier-detection

Updated: 2026-06-12
Lists: 2 list mentions
First commit: 2019-08-12
History: 5 history points
License: NOASSERTION
Issues: 1 open

260

stars

Forks: 53
Commits: 101 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Stack

Scala

GitHub topics

#archivespark #internet-archive #spark #spark-framework #warc #web-archiving

Updated: 2025-10-08
Lists: 1 list mention
First commit: 2015-08-06
History: 5 history points
License: MIT
Issues: 5 open

161

stars

Forks: 19
Commits: 154 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Stack

Scala Maven

GitHub topics

#analysis #apache-spark #big-data #big-data-analytics #dataframe #digital-humanities

Updated: 2025-12-05
Lists: 1 list mention
First commit: 2013-07-13
History: 5 history points
License: Apache-2.0
Issues: 5 open

158

stars

Forks: 33
Commits: 1,032 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

archivesunleashed/notebooks

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

Stack

Jupyter Notebook

GitHub topics

#juypter-notebook #notebooks #pyspark-notebook #python3 #spark #web-archives

Updated: 2022-12-05
Lists: 1 list mention
First commit: 2019-11-06
History: 5 history points
License: Apache-2.0
Issues: 0 open

stars

Forks: 5
Commits: 185 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub
