Skip to content

Releases: ConardLi/easy-dataset

[1.6.1] 2025-11-22

22 Nov 14:31
f16d3df

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 数据集管理翻页后返回,分页设置恢复默认值(#594
    → 修复翻页后进入数据集详情再返回列表,页码、每页条数等翻页设置自动重置为默认的问题,保持分页状态一致性。
  2. 领域树视图及问题列表相关Bug(#598
    → 修复领域树视图中问题无法删除、未分类问题展示异常、问题列表查询条件分页状态不正确的问题。

⚡ 优化

  1. 菜单与组件样式适配
    → 菜单宽度不足时自动收缩至左侧菜单栏;模型选择框默认收缩为图标,鼠标悬浮时恢复完整显示,提升窄屏适配性。
  2. Toast提示优化(#595
    → 调整默认Toast提示位置,降低遮挡风险;将默认停留时间缩短至1秒,减少对操作的干扰。

✨ 新功能/支持

  1. 多语言支持扩展
    → 新增土耳其语支持,适配多地区用户使用需求。
  2. 图片导入优化(#590
    → 支持通过压缩包导入图片,解决Docker容器环境下无法直接选择本地图片路径的问题。
  3. 图片管理功能增强
    → 图片管理列表视图新增全选、多选删除功能,提升批量图片管理效率。

🔧 Fixes

  1. Pagination settings reset to default after returning from dataset details(#594
    → Fixed the issue where pagination settings (page number, items per page, etc.) automatically reset to default when returning to the dataset list from the details page, ensuring consistent pagination state.
  2. Domain tree view and question list bugs(#598
    → Resolved issues including inability to delete questions in domain tree view, incorrect display of unclassified questions, and abnormal pagination status of question list query conditions.

⚡ Optimizations

  1. Menu and component style adaptation
    → The menu automatically collapses to the left sidebar when width is insufficient; the model selection box defaults to an icon and expands on mouse hover, improving narrow-screen compatibility.
  2. Toast notification optimization(#595
    → Adjusted the default position of Toast notifications to reduce obstruction; shortened the default display duration to 1 second, minimizing interference with operations.

✨ New Features/Support

  1. Multilingual support expansion
    → Added support for Turkish, adapting to users in multiple regions.
  2. Image import enhancement(#590
    → Supports importing images via compressed packages, solving the problem of being unable to select local image paths directly in Docker container environments.
  3. Image management improvement
    → Added select all/multi-select delete functions in the image management list view, enhancing batch image management efficiency.

[1.6.0] 2025-10-30

30 Oct 13:59
a456b47

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

✨ 新功能

  1. 生成图像问答(VQA)数据集(#130#483#537
    → 支持上传图像文件,自动生成图像相关问题与答案,构建 VQA 数据集,适配视觉语言模型训练。
image
  1. 全自动蒸馏数据集后台异步任务(#432#492#495#496
    → 支持从触发蒸馏到生成数据集的全流程自动化,通过后台异步任务完成,无需手动干预,支持查看实时进度。
image
  1. 问题模版功能
    → 可创建多种自定义问题类型(如“描述图像内容”“分析文本观点”),并应用于所有图像或文本块批量生成对应问题,提升问题生成的标准化与场景适配性。
image
  1. 支持更改蒸馏标签名称(#422
    → 允许自定义蒸馏过程中生成的标签名称,适配不同场景下的标签管理需求。

🔧 修复

  1. 修复保存模型时 ModelId 更新错误的 Bug
    → 修正模型配置保存流程中 ModelId 字段同步异常的问题,确保模型标识唯一性。

  2. 修复数据集批量评估问题(#576
    → 新增批量评估任务中断功能,支持手动终止正在执行的评估;优化评估算法,提升批量处理速度。

  3. 修复数据集快捷键导致输入中断(#578
    → 调整快捷键触发逻辑,避免与文本输入操作冲突,确保输入过程不被意外打断。

  4. 修复大量数据集选择后导出失败(#578
    → 优化导出任务分片机制,解决因数据量过大导致的内存溢出或连接超时问题。

  5. 修复平衡导出不生效(#561
    → 修正平衡导出逻辑中样本分布计算错误,确保按预设比例导出不同类别数据。

  6. 修复阿里云百炼调用 Qwen3 模型报错(#412#482
    → 适配 Qwen3 模型接口协议,修正请求参数格式与认证逻辑,确保调用正常。

⚡ 优化

  1. 提升多轮对话数据集解析稳定性
    → 增强对多轮对话格式(如 ShareGPT)的兼容解析,减少因格式变体导致的解析失败。

  2. 异步执行单个文本块操作(#530#494
    → 将“单个文本块生成问题”“AI 智能优化数据集”改为后台异步任务,执行时不阻塞前端其他操作。

  3. 文本块筛选增强(#541
    → 支持按关键字搜索文本块内容,及按字数范围(如 100-500 字)筛选,快速定位目标文本。

  4. 模型配置支持 Top 参数控制(#517
    → 模型配置页新增 Top 参数(如 Top-K/Top-P)设置,可调节生成内容的多样性与确定性。

  5. 按文本块名称筛选(#275
    → 问题列表与数据集列表支持按关联文本块(文件)名称筛选,提升跨模块数据定位效率。

🔧 Fixes

  1. Fixed ModelId update error when saving models
    → Corrected the abnormal synchronization of the ModelId field during model configuration saving to ensure unique model identification.

  2. Fixed issues with batch dataset evaluation(#576
    → Added the ability to interrupt batch evaluation tasks, supporting manual termination of ongoing evaluations; optimized evaluation algorithms to improve batch processing speed.

  3. Fixed input interruption caused by dataset shortcuts(#578
    → Adjusted shortcut trigger logic to avoid conflicts with text input operations, ensuring input processes are not accidentally interrupted.

  4. Fixed export failure when selecting large numbers of datasets(#578
    → Optimized the export task sharding mechanism to resolve memory overflow or connection timeout issues caused by excessive data volume.

  5. Fixed ineffective balanced export(#561
    → Corrected sample distribution calculation errors in balanced export logic to ensure data of different categories are exported according to preset ratios.

  6. Fixed errors when calling Qwen3 model via Alibaba Cloud Bailian(#412#482
    → Adapted to Qwen3 model interface protocols, corrected request parameter formats and authentication logic to ensure normal calls.

⚡ Optimizations

  1. Improved stability of multi-turn dialogue dataset parsing
    → Enhanced compatible parsing of multi-turn dialogue formats (e.g., ShareGPT) to reduce parsing failures caused by format variations.

  2. Asynchronous execution of single text block operations(#530#494
    → Changed "generate questions for single text blocks" and "AI intelligent dataset optimization" to background asynchronous tasks, which do not block other front-end operations during execution.

  3. Enhanced text block filtering(#541
    → Supports filtering text blocks by keyword search and word count range (e.g., 100-500 words) for quick定位 of target text.

  4. Model configuration supports Top parameter control(#517
    → Added Top parameter (e.g., Top-K/Top-P) settings on the model configuration page to adjust the diversity and determinism of generated content.

  5. Filter by text block name(#275
    → Question lists and dataset lists support filtering by associated text block (file) names, improving cross-module data positioning efficiency.

✨ New Features

  1. Fully automated dataset distillation background tasks(#432#492#495#496
    → Supports full-process automation from triggering distillation to dataset generation, completed via background asynchronous tasks without manual intervention, with real-time progress tracking.

  2. Support for renaming distillation labels(#422
    → Allows custom naming of labels generated during distillation to adapt to label management needs in different scenarios.

  3. Generate Visual Question Answering (VQA) datasets(#130#483#537
    → Supports uploading image files, automatically generating image-related questions and answers to build VQA datasets, suitable for vision-language model training.

  4. Question template function
    → Enables creation of multiple custom question types (e.g., "describe image content", "analyze text opinions") and applies them to all images or text blocks to generate corresponding questions in batches, improving standardization and scenario adaptability of question generation.

What's Changed

[1.5.1] 2025-10-19

19 Oct 15:18
9629465

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 删除文件时领域树修订不准确
    → 再次优化文件删除后领域树的更新逻辑,确保仅移除与删除文件强关联的节点,避免误删或残留无效节点,提升领域树结构准确性。

  2. 删除答案后问题状态未更新(#572
    → 修复删除问题生成的答案后,问题管理中仍显示“已生成答案”状态的问题,确保状态与实际数据一致。

  3. 数据集管理筛选BUG(#571#569#568
    → 修复筛选条件组合失效、筛选结果不更新、特定标签筛选无响应等问题,提升筛选功能稳定性。

  4. Alpaca/ShareGPT格式导入字段识别问题(#549#564
    → 优化两种格式数据集的字段映射逻辑,解决instruction/input/conversation等核心字段识别不准确的问题,确保导入数据完整性。

⚡ 优化

  1. 数据集导出支持选中项导出(#570
    → 导出数据集时新增“仅导出选中项”选项,支持手动勾选特定数据集进行导出,提升批量操作灵活性。

  2. 数据集确认与编辑优化(#542

    • 新增“取消确认”功能:确认数据集后可随时撤销确认状态,避免误操作导致的不可逆影响。
    • 数据集详情页支持直接编辑问题内容,无需跳转至单独页面,简化修改流程。

🔧 Fixes

  1. Inaccurate domain tree revision when deleting files
    → Further optimized domain tree update logic after file deletion to ensure only nodes strongly associated with deleted files are removed, avoiding incorrect deletions or residual invalid nodes and improving structural accuracy.

  2. Question status remains "answered" after deleting answers(#572
    → Fixed the issue where questions in the management list still showed "answer generated" status after their answers were deleted, ensuring status consistency with actual data.

  3. Dataset management filtering bugs(#571#569#568
    → Resolved issues such as invalid filter combinations, unupdated results, and unresponsive tag filtering, improving the stability of filtering functions.

  4. Inaccurate field recognition during Alpaca/ShareGPT import(#549#564
    → Optimized field mapping logic for these formats, fixing misrecognition of core fields like instruction/input/conversation to ensure complete data import.

⚡ Optimizations

  1. Support exporting only selected datasets(#570
    → Added an option to "export only selected items" during dataset export, allowing manual selection of specific datasets for export to enhance batch operation flexibility.

  2. Dataset confirmation and editing improvements(#542

    • Added "undo confirmation" function: Allows reverting the confirmed status of datasets to avoid irreversible impacts from misoperations.
    • Enabled direct question editing on the dataset details page, eliminating the need to navigate to a separate page and simplifying modification workflows.

[1.5.0] 2025-09-29

29 Sep 11:18
2d08495

Choose a tag to compare

⚠️ BreakChange(兼容性变更)

  • 1.5.0 之前版本配置的自定义提示词将失效,升级后需重新配置核心提示词。

✨ 新功能

  1. 全量核心提示词开放自定义
    → Easy Dataset 所有核心提示词(如问题生成、答案生产、数据清洗等)均开放配置,后续无需修改代码即可灵活调整,适配不同场景需求。

  2. AI 数据集质量评估(#546
    → 新增数据集质量自动评估功能,支持:

    • 单个数据集即时评估(含相关性、准确性、完整性等维度);
    • 批量数据集异步评估(后台任务处理,支持查看评估报告)。
  3. 多轮对话 SFT 数据集生成(#504
    → 支持生成多轮对话格式的 SFT 数据集,两种生成方式:

    • 基于文献内容提取多轮问答;
    • 直接从大模型蒸馏多轮对话数据。
  4. GPT OSS 多语言思维数据集格式导出(#560
    → 新增对 GPT OSS Multilingual-Thinking 格式的导出支持,适配多语言模型训练场景。

  5. 自定义分隔符分块(#559
    → 支持按自定义分隔符(如换行、特定符号)分割文本,分隔符将被自动舍弃,且分割后的文本块不受预设块大小限制,保留完整语义单元。

⚡ 优化

  1. 模型输出结构化稳定性提升
    → 增加更多兼容解析逻辑,减少模型输出格式异常(如JSON解析失败、字段缺失),提升结构化数据生成的稳定性。

  2. Markdown 展示风格优化
    → 优化数据集详情页、自定义提示词编辑页的 Markdown 渲染样式,增强文本可读性(如调整字体、行间距、代码块高亮)。

🔧 修复

  1. 文献目录过大导致上下文溢出
    → 优化文献目录处理逻辑,自动截断或分段处理超长大目录,避免模型上下文长度超限。

  2. 数据清洗异常内容引入(#504#529
    → 修复数据清洗过程中意外引入无关内容或思维链信息的问题,确保清洗后文本纯净度。

  3. 删除文件时领域树修订不准确
    → 修正文件删除后领域树节点更新逻辑,确保仅移除与删除文件相关的节点,避免误删或残留无效节点。

⚠️ BreakChange

  • Custom prompts configured in versions prior to 1.5.0 will become invalid. Users need to reconfigure core prompts after upgrading to 1.5.0.

✨ New Features

  1. Full Core Prompts Customization
    → All core prompts in Easy Dataset (e.g., question generation, answer production, data cleaning) are now configurable. No code changes are required for future adjustments, adapting to diverse scenarios.

  2. AI Dataset Quality Evaluation(#546
    → Added automatic dataset quality evaluation, supporting:

    • Instant evaluation for single datasets (covering relevance, accuracy, completeness, etc.);
    • Asynchronous batch evaluation for multiple datasets (processed via background tasks, with evaluation reports available).
  3. Multi-turn Dialogue SFT Dataset Generation(#504
    → Supports generating multi-turn dialogue SFT datasets through two methods:

    • Extracting multi-turn Q&A from literature content;
    • Distilling multi-turn dialogue data directly from large models.
  4. GPT OSS Multilingual-Thinking Dataset Export(#560
    → Added export support for GPT OSS Multilingual-Thinking format, adapting to multilingual model training scenarios.

  5. Custom Delimiter Chunking(#559
    → Supports text splitting by custom delimiters (e.g., line breaks, specific symbols). Delimiters are automatically discarded, and split text blocks are not restricted by preset chunk sizes, preserving complete semantic units.

⚡ Optimizations

  1. Improved Stability of Structured Model Output
    → Added more compatible parsing logic to reduce format anomalies in model outputs (e.g., JSON parsing failures, missing fields), enhancing the stability of structured data generation.

  2. Markdown Display Style Optimization
    → Optimized Markdown rendering styles for dataset detail pages and custom prompt editing pages, improving readability (e.g., adjusted fonts, line spacing, code block highlighting).

🔧 Fixes

  1. Context Overflow Due to Oversized Literature Catalogs
    → Optimized literature catalog processing logic to automatically truncate or segment overly large catalogs, avoiding model context length limits.

  2. Unexpected Content Introduction in Data Cleaning(#504#529
    → Fixed issues where irrelevant content or thought chain information was accidentally introduced during data cleaning, ensuring the purity of cleaned text.

  3. Inaccurate Domain Tree Revision When Deleting Files
    → Corrected the domain tree node update logic after file deletion, ensuring only nodes related to deleted files are removed, avoiding incorrect deletions or residual invalid nodes.

[1.4.0] 2025-08-31

31 Aug 15:09
d009c44

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

✨ 新功能

  1. 支持本地部署 MinerU 集成(#200#245
    → 可在任务设置中配置本地 MinerU 服务 URL,实现与本地部署的 MinerU 工具联动。

  2. 数据集增强管理功能(#81
    → 新增数据集评分、自定义标签及备注功能,支持基于这些属性进行筛选查询。

  3. 文献内容清洗功能(#516
    → 支持对原始文献内容进行预处理清洗,提升后续数据集生成质量;支持自定义数据清洗提示词,适配不同场景需求。

  4. 数据集导出选项扩展

    • 支持导出时选择包含原始文本块(自定义格式)(#288#185#476#464
    • 支持仅导出问题列表,适配轻量数据应用场景(#394
    • 支持平衡导出功能,可根据领域标签筛选导出数据集
  5. 文献格式支持扩展(#205
    → 新增对 .epub 格式文献的上传与分析功能,拓宽文献处理范围。

  6. 数据集导入功能(#498
    → 支持从本地文件导入已有数据集,快速复用外部数据资源。

⚡ 优化

  1. 数据集翻页体验优化(#497
    → 翻页时自动保存 Markdown 标签的选中状态,避免重复操作。

  2. 数据集列表筛选增强(#275
    → 支持筛选“是否为蒸馏数据集”,快速定位特定类型数据。

🔧 修复

  1. 超大数据集导出问题(#502
    → 修复大规模数据集导出时的卡死问题,新增分批导出机制,提升稳定性。

  2. 项目间问题冲突(#509
    → 修复不同项目中问题 DIFF 对比时出现的冲突异常,确保跨项目数据一致性。

✨ New Features

  1. Support for Local MinerU Deployment(#200#245
    → Allows configuration of local MinerU service URL in task settings, enabling integration with locally deployed MinerU tools.

  2. Enhanced Dataset Management(#81
    → Added dataset rating, custom tags, and notes functions, with support for filtering based on these attributes.

  3. Literature Content Cleaning(#516
    → Supports preprocessing and cleaning of original literature content to improve subsequent dataset quality; allows custom data cleaning prompts for different scenarios.

  4. Extended Dataset Export Options

    • Supports exporting with original text blocks (custom format)(#288#185#476#464
    • Supports exporting only question lists for lightweight data applications(#394
  5. Expanded Literature Format Support(#205
    → Added support for uploading and analyzing .epub format documents, broadening literature processing scope.

  6. Dataset Import Function(#498
    → Supports importing existing datasets from local files for quick reuse of external data resources.

⚡ Optimizations

  1. Dataset Pagination Improvement(#497
    → Automatically saves the selected state of Markdown tags during pagination to avoid repeated operations.

  2. Dataset List Filter Enhancement(#275
    → Added filtering for "whether it is a distilled dataset" to quickly locate specific data types.

🔧 Fixes

  1. Large Dataset Export Issue(#502
    → Fixed freezing when exporting large-scale datasets; added batch export mechanism to improve stability.

  2. Cross-Project Question Conflicts(#509
    → Resolved conflict anomalies in question DIFF comparisons between different projects, ensuring cross-project data consistency.

[1.3.9] 2025-07-03

03 Jul 15:43
2906df8

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

⚡ 优化

  1. Docker 部署配置优化(#442

    • 优化 docker-compose.yml,新增数据库文件夹挂载配置,确保数据持久化
    • 补充 Docker 部署数据迁移说明文档,指导跨环境数据迁移流程
  2. 文献处理模块交互优化

    • 支持文件拖拽上传至文献处理区域,提升批量导入效率
    • 重构按钮布局,减少空间占用,优化移动端适配
  3. 数据集详情查询性能(#419

    • 优化数据库查询索引,提升数据集详情页加载速度
  4. 数据集筛选功能增强(#429

    • 新增筛选维度:
      • 思维链生成状态(已生成/未生成)
      • 问题/答案内容关键词筛选
      • 标签分类筛选(支持多标签组合查询)
  5. 文档搜索功能(#426

  • 支持全文检索文档内容,可通过关键词快速定位相关文献

⚡ Optimizations

  1. Docker Deployment Configuration(#442

    • Optimized docker-compose.yml to add database folder mounting for data persistence
    • Added documentation for Docker data migration, guiding cross-environment data transfer
  2. Literature Processing UI Enhancement

    • Enabled drag-and-drop file upload to the literature processing area
    • Refactored button layout to reduce space occupation and improve mobile adaptability
  3. Dataset Details Query Performance(#419

    • Optimized database query indexes, improving dataset details page loading speed by ~30%
    • Implemented lazy-loading for paginated data to reduce memory usage with large datasets
  4. Dataset Filtering Enhancement(#429

    • Added new filtering dimensions:
      • Thought chain generation status (generated/ungenerated)
      • Keyword filtering by question/answer content
      • Tag classification filtering (supporting multi-tag combination queries)
  5. Document Search Function(#426

  • Supports full-text search of document content, enabling quick keyword-based literature

[1.3.8] 2025-06-24

24 Jun 16:06

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 修复部分本地推理模型生成问题失败的问题
  2. 修复 Ollama 在 Playground 流式输出时的报错问题
  3. 修复文件内容过大时生成 GA 失败的问题

⚡ 优化

  1. 优化文献处理模块布局:支持文献区域的收缩/展开操作
  2. 移除系统对重复文件的校验逻辑,提升上传效率
  3. 修复文件名过长时与操作区域重叠的显示问题

✨ 新功能

  1. 新增批量编辑文本块功能:支持同时修改多个文本块的标签、状态等属性

🔧 Fixes

  1. Fixed issue where some local inference models failed to generate questions
  2. Fixed streaming output errors of Ollama in Playground
  3. Fixed GA generation failure when processing oversize files

⚡ Optimizations

  1. Optimized literature processing module layout: supports collapsing/expanding the literature section
  2. Removed duplicate file validation to improve upload efficiency
  3. Fixed display overlap issue between long filenames and operation areas

✨ New Features

  1. Added batch editing for text blocks: supports modifying labels, status, etc., for multiple text blocks simultaneously

[1.3.7] 2025-06-11

11 Jun 15:55
88c8f83

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 视觉模型PDF处理客户端报错
    → 解决视觉模型解析PDF时在客户端环境的兼容性报错,确保跨平台稳定运行。
  2. NPM install Canvas模块编译失败
    → 修复Canvas模块在不同系统环境下的编译异常,完善依赖安装流程。
  3. 部分推理模型思维链获取失败(#381
    → 修正推理模型输出解析逻辑,确保思维链内容完整提取至问题关联字段。
  4. 批量生产GA并发数限制(#385
    → 解除批量生成GA数据时最多同时处理10个任务的限制,支持自定义并发配置。
  5. 文件列表展示数量限制(#350
    → 修复文件列表仅显示前10条的问题,支持完整展示所有上传文件。

⚡ 优化

  1. 文献处理异步化改造
    → 重构文献处理流程为后台异步任务,支持实时查看处理进度条与状态日志。
  2. GA提示词污染修复
    → 清理提示词模板中的冗余字符与格式干扰,确保生成内容纯净度。
  3. 模型操作前置校验
    → 未选择模型时自动禁用相关功能按钮,避免因参数缺失导致的非预期报错。
  4. 新建模型提示优化
    → 新增输入提示文本,明确告知用户可自定义模型提供商(如OpenAI/本地部署)及模型名称。
  5. Playground界面功能增强(#381
    → 在交互测试界面新增思维链展示区域,实时可视化推理模型的思考过程。

🔧 Fixes

  1. Visual model PDF processing error in client environment
    → Resolved compatibility errors when visual models parse PDFs in client environments for cross-platform stability.
  2. NPM install Canvas module compilation failure
    → Fixed compilation issues of the Canvas module across different systems to improve dependency installation.
  3. Missing thought chains in some inference models(#381
    → Corrected the parsing logic for inference model outputs to ensure complete extraction of thought chains into question fields.
  4. Batch GA generation concurrency limit(#385
    → Removed the restriction of processing only 10 tasks simultaneously in batch GA generation, supporting custom concurrency configuration.
  5. File list display limit(#350
    → Fixed the issue where only the first 10 files were shown, enabling full display of all uploaded files.

⚡ Optimizations

  1. Asynchronous literature processing
    → Refactored literature processing as background asynchronous tasks with real-time progress tracking and status logs.
  2. GA prompt contamination fix
    → Cleaned redundant characters and format interference in prompt templates to ensure pure generated content.
  3. Model operation pre-checks
    → Automatically disables related function buttons when no model is selected to prevent unexpected errors from missing parameters.
  4. New model creation prompt enhancement
    → Added input hints to inform users that both model providers (e.g., OpenAI/local deployment) and model names can be customized.
  5. Playground interface improvement(#381
    → Added a thought chain display area in the interactive testing interface to visualize the reasoning process of inference models in real time.

[1.3.6] 2025-06-02

02 Jun 15:45
b2c80af

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 选择模型后刷新列表跨域问题
→ 修复模型列表刷新时的跨域请求错误,确保不同域下模型数据正常加载。
  2. 上传 DOCX 文件处理超时
→ 优化文件解析线程池配置,解决大文件处理时的超时异常。
  3. 删除文献时原始目录删除失败
→ 修正文件系统操作逻辑,确保文献删除时关联的原始目录同步清理。

⚡ 优化

  1. Docker 打包脚本
→ 优化镜像构建流程,减少冗余依赖,提升打包效率。
  2. 数据蒸馏任务问题生成
→ 问题生成时不再包含标签序号,适配无结构化格式需求。
  3. 数据集详情 Token 展示
→ 在数据集详情页新增 Token 数量统计,直观显示文本长度(支持模型输入限制参考)。

✨ 新功能

  1. GA(载体、受众)对的数据集增强
    引入 “载体(Generator)- 受众(Audience)” 配对机制,根据数据应用场景生成针对性内容。
    文档:https://docs.easy-dataset.com/jin-jie-shi-yong/mga-zeng-qiang-shu-ju-ji

🔧 Fixes

  1. Cross-origin issue when refreshing model list
→ Fixed cross-origin request errors to ensure model data loads properly across domains.
  2. Timeout when processing uploaded DOCX files
→ Optimized file parsing thread pool to resolve timeouts during large document handling.
  3. Failed deletion of original directory when removing literature
→ Corrected file system logic to ensure associated original directories delete with literature.

⚡ Optimizations

  1. Docker packaging script
→ Optimized image build process to reduce redundant dependencies and improve packaging efficiency.
  2. Question generation in data distillation tasks
→ Removed label indices (e.g., "Q1:", "A1:") from generated questions for unstructured format compatibility.
  3. Dataset details Token display
→ Added Token count statistics on dataset pages for clear text length visualization (supports model input limit reference).

✨ New Feature: GA (Generator-Audience) Pair Dataset Enhancement
Introduces "Generator-Audience" pairing to generate targeted content based on usage scenarios:

[1.3.5] 2025-05-21

21 May 15:11
55a7ec9

Choose a tag to compare

如果遇到 Github 下载慢的问题可以使用网盘下载:https://pan.quark.cn/s/194b7eedf16e

🔧 修复

  1. 数据集确认/保存失败
    → 修复因权限校验异常或网络波动导致的数据集保存失败问题,提升操作稳定性。
  2. 修改文本块后筛选条件失效
    → 解决文本块内容更新后,筛选条件(如标签、状态)未同步刷新的问题。
  3. 硅基流动默认 API 错误
    → 修正默认配置中硅基流动 API 地址及认证参数,确保模型调用正常。
  4. 导出自定义格式数据集丢失标签
    → 恢复自定义格式导出时标签字段的正常提取,支持保留完整元数据。

⚡ 优化

  1. Windows 安装路径自定义
    → 安装程序新增路径选择功能,默认不再强制安装至 C 盘,支持用户指定安装目录。
  2. Alpaca 数据集导出配置优化
    • 字段选择:支持切换问题使用 instructioninput 字段,适配不同模型训练需求。
    • 自定义指令:允许手动输入或修改 instruction 内容,提升数据生成灵活性。

🔧 Fixes

  1. Dataset confirmation/saving failures
    → Fixed issues with dataset saving due to permission errors or network fluctuations, improving operational stability.
  2. Filter criteria失效 after text block modification
    → Resolved synchronization issues where filter conditions (e.g., labels, status) failed to update after text block edits.
  3. Default API error for SiliconFlow
    → Corrected the default API endpoint and authentication parameters for SiliconFlow to ensure proper model invocation.
  4. Missing labels in custom-format dataset exports
    → Restored label fields in custom exports to preserve complete metadata during data export.

⚡ Optimizations

  1. Windows installation path customization
    → Added a path selection feature during installation, allowing users to specify a directory instead of forcing C:\ by default.
  2. Alpaca dataset export configuration
    • Field selection: Supported switching between instruction and input fields for questions, adapting to different model training needs.
    • Custom instruction: Allowed manual input or modification of instruction content for more flexible data generation.