parallel_corpus_mnbvc Public
Forked from mnbvc-parallel-corpus-team/parallel_corpus_mnbvc
parallel corpus dataset from the mnbvc project
Jupyter Notebook Apache License 2.0 Updated Jan 21, 2024
charset_mnbvc Public template
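Forked from alanshi/charset_mnbvc
This project aims to quickly detect and convert the encodings of large numbers of text files, in support of the data cleaning work for the mnbvc corpus project.
Python MIT License Updated Jan 18, 2024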
MNBVC Public
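Forked from esbatmop/MNBVC
MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche cultures and even "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, dialogue lines, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.
MIT License Updated Jan 17, 2024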
augmentoolkit Public
Forked from e-p-armstrong/augmentoolkit
Convert Compute And Books Into Instruct-Tuning Datasets
Python MIT License Updated Jan 8, 2024
github_downloader_mnbvc Public
Forked from imgingroot/github_downloader_mnbvc
GitHub repository downloader
Python Updated Jan 3, 2024
deduplication_mnbvc Public
Forked from aplmikex/deduplication_mnbvc
Text deduplication
Python MIT License Updated Dec 24, 2023
pdf_meta_data_mnbvc Public
Forked from MIracleyin/pdf_meta_data_mnbvc
Jupyter Notebook Updated Dec 9, 2023
Qwen-Agent Public
Forked from QwenLM/Qwen-Agent
Agent framework and applications built upon Qwen, featuring Code Interpreter and Chrome browser extension.
Python Other Updated Dec 6, 2023
data_management_LLM Public
Forked from ZigeW/data_management_LLM
Collection of training data management explorations for large language models
Updated Dec 4, 2023
awesome-foundation-and-multimodal-models Public
Forked from SkalskiP/awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]
Python Updated Nov 22, 2023
gpt-crawler Public
Forked from BuilderIO/gpt-crawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL
TypeScript MIT License Updated Nov 21, 2023
crawlee Public
Forked from apify/crawlee
Crawlee: a web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
TypeScript Apache License 2.0 Updated Nov 20, 2023
Exam-Question-Bank-Dataset-zh_mnbvc Public
Forked from UnstoppableCurry/Exam-Question-Bank-Dataset-zh_mnbvc
General exam question bank dataset: multiple choice, fill-in-the-blank, short answer
Jupyter Notebook MIT License Updated Nov 10, 2023
Skywork Public
Forked from SkyworkAI/Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model, training data, evaluation data, evaluation me…
Python Other Updated Nov 6, 2023
notabug_download_mnbvc Public
Forked from gezi2333/notabug_download_mnbvc
Jupyter Notebook Updated Nov 3, 2023
OpenAGI Public
Forked from agiresearch/OpenAGI
OpenAGI: When LLM Meets Domain Experts
Python Apache License 2.0 Updated Oct 29, 2023
siyuan Public
Forked from siyuan-note/siyuan
A privacy-first, self-hosted, fully open source personal knowledge management software, written in typescript and golang.
TypeScript GNU Affero General Public License v3.0 Updated Oct 23, 2023
AI_Chinese_DataSet_KnowledgeDAO Public
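Forked from shuliu586/AI_Chinese_DataSet_KnowledgeDAO
Chinese datasets for AI training (continuously updated...). Current datasets: 8,000 catering-industry questions, Baidu Zhidao, the Chinese Alpaca dataset, a computer-science domain dataset, the Vicuna dataset, the RedPajama dataset, a Chinese Wikipedia entries dataset, and a website/forum Q&A dataset.
Python MIT License Updated Oct 16, 2023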
Data-Copilot Public
Forked from zwq2018/Data-Copilot
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow
Python MIT License Updated Oct 10, 2023
githubcode_extractor_mnbvc Public
Forked from LinnaWang76/githubcode_extractor_mnbvc
Extracts the contents of github-code-zip files and saves them in JSONL format
Python Updated Aug 27, 2023
ray Public
Forked from ray-project/ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for accelerating ML workloads.
Python Apache License 2.0 Updated Aug 16, 2023
AI-Code-Convert Public
Forked from siknet/AI-Code-Convert
AI Code Translator, Generate Code or Natural Language To Programming Language
TypeScript Updated Aug 5, 2023