Open highlighted repo slot
Put your repository first
Promote a GitHub repo at the top of Awesome repository list views for 7 days.
Awesome-list intelligence for GitHub
Discover projects curated by awesome-list maintainers, then narrow them by stars, age, freshness, archive status, language, topics, generated tags, detected stacks, package managers, and source list.
Apache Spark - A unified analytics engine for large-scale data processing
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Build, run, and manage data pipelines for integrating and transforming data.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Simple and Distributed Machine Learning
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
A large-scale entity and relation database supporting aggregation of properties
Elassandra = Elasticsearch + Apache Cassandra
Distributed Deep learning with Keras & Spark
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
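JDK 8-based AI chatbot. WeChat official account: Midjourney image generation and redemption-code exchange. Web: supports ChatGPT, Midjourney image generation, Stable Diffusion (sd) image generation, redemption-code exchange, Epay (易支付) payments, official-account traffic referral, and email registration.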
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
A distributed Spark/Scala implementation of the isolation forest and extended isolation forest algorithms for unsupervised outlier detection, featuring support for scalable training and ONNX export for easy cross-platform inference.
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz