Alibaba launches an AI data scientist: the whole process is automated, and even research novices can use it
Built on an open-source agent framework, an agent that can automatically solve complex data science problems is here!
Specifically, Data Science Assistant (DS Assistant) is a data science assistant developed based on the Modelscope-Agent framework.
With it, we only need to give the requirements, and this assistant can run through the steps of exploratory data analysis (EDA), data preprocessing, feature engineering, model training, model evaluation, etc. by itself.
Of course, in addition to the DS Assistant that will be highlighted in this article, the Modelscope-Agent framework behind it is also worth mentioning.
This framework is open-sourced by Alibaba, and the main features include:
Even better, the Modelscope-Agent framework allows developers to create Agent assistants interactively without coding.
And just like that, our data science assistant was "born"~
Automating complex data science tasks has always been challenging.
DS Assistant, on the other hand, uses the plan-and-execute framework, an emerging agent framework that accomplishes complex tasks efficiently through explicit planning and execution steps.
LangChain's description of plan-and-execute agents: https://blog.langchain.dev/planning-agents/
Specifically, the workflow includes the following steps:
1. Task planning: The agent receives the task description entered by the user, interprets it, and decomposes the task into multiple executable subtasks.
2. Subtask scheduling: The execution order of the subtasks is determined from the dependencies and priorities between them.
3. Task execution: Each subtask is assigned to a specific module for execution.
4. Result integration: The results of all subtasks are summarized into the final output and returned to the user.
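The four steps above can be sketched in a few lines of Python. This is only an illustration of the plan-and-execute pattern, not DS Assistant's actual code: the `Subtask` class, the function names, and the fixed plan are all hypothetical, the planner is a stub where a real agent would call an LLM, and the scheduler only sorts by a priority number.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Subtask:
    """One unit of work produced by the planner (hypothetical structure)."""
    task_id: str
    instruction: str
    priority: int  # lower number runs first


def plan(request: str) -> List[Subtask]:
    """Step 1: decompose the request into subtasks.
    A real agent would ask an LLM; this stub returns a fixed plan."""
    return [
        Subtask("train", "train a model", priority=3),
        Subtask("explore", "explore the data", priority=1),
        Subtask("preprocess", "preprocess the data", priority=2),
    ]


def schedule(tasks: List[Subtask]) -> List[Subtask]:
    """Step 2: decide the execution order (here: by priority only)."""
    return sorted(tasks, key=lambda t: t.priority)


def execute(task: Subtask) -> str:
    """Step 3: run one subtask. Stubbed; a real executor would generate and run code."""
    return f"done: {task.instruction}"


def integrate(results: Dict[str, str]) -> str:
    """Step 4: merge the subtask results into one report for the user."""
    return "\n".join(f"[{task_id}] {text}" for task_id, text in results.items())


def run_agent(request: str) -> str:
    results = {t.task_id: execute(t) for t in schedule(plan(request))}
    return integrate(results)


report = run_agent("Predict the target column and report the F1 score.")
print(report)
```

Keeping planning, scheduling, execution, and integration in separate functions mirrors the separation of concerns the workflow describes, so each stage can be swapped out independently.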
With this framework in mind, let's look at the overall system architecture. DS Assistant has 4 main modules.
Let's start with the DS Assistant on the right, which acts as the brain of the whole system and is responsible for scheduling the operation of the whole system.
The Plan module is responsible for generating a list of tasks from the user's requirements and for topologically sorting those tasks.
At this stage, DS Assistant automatically breaks down complex data science problems based on user input into multiple subtasks.
These subtasks are organized and scheduled based on dependencies and priorities, ensuring that the order of execution is logical and efficient.
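To make the topological sorting concrete, here is a minimal sketch using Python's standard-library `graphlib`. The task names and their dependencies are made up for illustration; the article does not show how DS Assistant represents its task graph internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each task maps to the set of tasks it depends on.
task_dependencies = {
    "load_data": set(),
    "explore_data": {"load_data"},
    "clean_data": {"load_data"},
    "train_model": {"explore_data", "clean_data"},
}

# static_order() yields every task only after all of its dependencies.
execution_order = list(TopologicalSorter(task_dependencies).static_order())
print(execution_order)
```

If the plan ever contained a circular dependency, `TopologicalSorter` would raise `graphlib.CycleError`, which gives a planner a natural point to reject or regenerate an invalid plan.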
Next, you will find the Execution module, which is responsible for the specific execution of the task and saves the task execution result.
Here, each subtask is concretized into actionable operations, such as data preprocessing, model training, and so on.
Finally, the Memory management module is responsible for recording intermediate information produced during the task, such as execution results, code, and data details.
After all tasks have been executed, DS Assistant saves this intermediate data (including the code and results generated by each task, the number of tokens consumed, and the time taken) to a file.
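As a rough picture of what such a record could look like, the sketch below stores one entry per task and writes the list to a JSON file. The `TaskRecord` fields, the file name, and the token count are hypothetical placeholders chosen to match the items listed above; DS Assistant's real schema may differ.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class TaskRecord:
    """Hypothetical per-task record: code, result, token usage, and time."""
    task_id: str
    code: str
    result: str
    total_tokens: int
    elapsed_seconds: float


def save_records(records: List[TaskRecord], path: Path) -> None:
    """Serialize all task records to a single JSON file."""
    payload = [asdict(record) for record in records]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


# Time a trivial "task" and record it; the token count is a placeholder value.
start = time.perf_counter()
task_code = "print(1 + 1)"
task_result = "2"
record = TaskRecord(
    task_id="1",
    code=task_code,
    result=task_result,
    total_tokens=42,
    elapsed_seconds=time.perf_counter() - start,
)

out_path = Path(tempfile.mkdtemp()) / "run_record.json"
save_records([record], out_path)
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(loaded[0]["task_id"], loaded[0]["result"])
```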
Below, we take a concrete example to understand the execution process of DS Assistant.
Let's take one of the contest tasks ICR - Identifying Age-Related Conditions on Kaggle as an example:
The task is a machine learning task whose main purpose is to identify age-related health conditions by analyzing various data such as medical records, genetic data, lifestyle data, etc.
The end result will be used to help medical professionals identify common health problems in older populations early and provide personalized prevention and treatment plans.
Without further ado, let's start now~
First of all, we need to configure the LLM we choose.
We also bring in MetaGPT's data science tools and Tool Recommender, which recommends suitable data science tools to DS Assistant according to the type of task.
Next, we need to pass the specific requirements of the task to the DS Assistant. It is important to note that the path to the data file needs to be indicated to the DS Assistant in the requirements:
from modelscope_agent.agents.data_science_assistant import DataScienceAssistant
from modelscope_agent.tools.metagpt_tools.tool_recommend import TypeMatchToolRecommender

# LLM configuration: the model name and the service that hosts it
llm_config = {
    'model': 'qwen2-72b-instruct',
    'model_server': 'dashscope',
}

# Recommend tools by task type; "<all>" makes every registered tool a candidate
tool_recommender = TypeMatchToolRecommender(tools=["<all>"])
ds_assistant = DataScienceAssistant(llm=llm_config, tool_recommender=tool_recommender)

# The requirement text must include the paths to the train and eval data files
ds_assistant.run(
    "This is a medical dataset with over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has or has not been diagnosed with one of these conditions. The target column is Class. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report F1 Score on the eval data. Train data path: './dataset/07_icr-identify-age-related-conditions/split_train.csv', eval data path: './dataset/07_icr-identify-age-related-conditions/split_eval.csv'."
)
In the Plan phase, DS Assistant generates a task list based on the user's needs, breaks down the entire data processing process, and then processes the task list sequentially.
As you can see, DS Assistant generates 5 tasks, which are data exploration, data preprocessing, feature engineering, model training, and prediction.
These 5 tasks then enter the Execute stage. Let's take a look at them one by one.
You can see that the generated code fails with the following error because the numpy package was not imported.
DS Assistant reflects on the error message, regenerates the code, executes it, and successfully outputs the results of data exploration.
Finally, Code Judge will perform a quality check on the code to ensure that the logic of the generated code is correct.
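The execute, reflect, and regenerate cycle described above can be sketched as a small retry loop. This is an illustration under stated assumptions, not DS Assistant's implementation: `stub_reflect` is a hypothetical stand-in for the LLM's reflection step, the example uses the standard-library `math` module instead of numpy to stay self-contained, and the Code Judge check is left out.

```python
import contextlib
import io
import traceback
from typing import Callable, Tuple


def run_code(code: str) -> Tuple[bool, str]:
    """Run code in a fresh namespace; return (succeeded, stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def stub_reflect(code: str, error: str) -> str:
    """Hypothetical stand-in for LLM reflection: add a missing 'math' import."""
    if "NameError" in error and "'math'" in error:
        return "import math\n" + code
    return code


def execute_with_retry(
    code: str,
    reflect: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, str, int]:
    """Run the code; on failure, feed the error to `reflect` and try again."""
    for attempt in range(1, max_attempts + 1):
        succeeded, output = run_code(code)
        if succeeded:
            return code, output, attempt
        code = reflect(code, output)
    raise RuntimeError(f"still failing after {max_attempts} attempts:\n{output}")


# The first attempt fails with a NameError; the second succeeds after the fix.
final_code, output, attempts = execute_with_retry("print(math.sqrt(16))", stub_reflect)
print(attempts, output.strip())
```

Running generated code with `exec` in the host process is acceptable for a toy example only; an agent that executes LLM-written code would normally isolate it, for instance in a separate kernel or sandbox.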
In the data preprocessing stage, DS Assistant handles missing values appropriately for both numeric and categorical data and drops the ID column.
After fixing two bugs, DS Assistant performs feature engineering on the data, encoding the categorical variables.
At the same time, the previously defined categorical_columns variable is updated to remove the ID column.
In the model training stage, DS Assistant proactively installs the dependencies it needs, trains several models (random forest, gradient boosting, logistic regression), and selects the one with the best results.
Finally, DS Assistant takes the model with the highest F1 score on the training set, evaluates it on the validation set, and computes its F1 score there, successfully completing the task.
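The selection step comes down to computing an F1 score per candidate and keeping the best one. The sketch below does this in plain Python on made-up labels and predictions; the numbers are toy data for illustration only, not results from the actual run, and a real pipeline would more likely use a library such as scikit-learn.

```python
from typing import Dict, List


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy ground truth and toy predictions from three hypothetical candidates.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions: Dict[str, List[int]] = {
    "random_forest": [1, 0, 1, 0, 0, 0, 1, 0],
    "gradient_boosting": [1, 1, 1, 1, 0, 0, 1, 0],
    "logistic_regression": [0, 0, 1, 0, 1, 0, 1, 0],
}

scores = {name: f1_score(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(scores, key=scores.get)
print(best_model, round(scores[best_model], 4))
```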
After completing the tasks above, DS Assistant can save the run results as a Jupyter Notebook file and record the intermediate steps of the run.
△ Jupyter Notebook
△ The intermediate process is recorded in a JSON file
We use ML-Benchmark (from "Data Interpreter: An LLM Agent For Data Science") as the test set and evaluate DS Assistant along three dimensions: Normalized Performance Score (NPS), total time, and total tokens.
NPS is a method to standardize the performance metrics of different tasks or models, so that different metrics can be compared with each other.
Its calculation usually involves the following steps:
Step 1: Determine the optimization direction, that is, whether a larger or a smaller value of the performance metric is better.
Step 2: Normalize. If the metric is "bigger is better" (e.g., accuracy, F1 score, AUC), the NPS equals the original value. If the metric is "smaller is better" (e.g., a loss value), the original value is mapped so that smaller values yield an NPS closer to 1.
The normalized performance score typically ranges from 0 to 1, where 1 is the best performance and 0 is the worst.
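The two steps can be written as a small function. The article does not give the exact formula for the "smaller is better" case, so the sketch assumes the mapping 1 / (1 + s), which to the best of our knowledge is the one used in the Data Interpreter paper; treat it as an assumption rather than a confirmed definition.

```python
def normalized_performance_score(value: float, bigger_is_better: bool) -> float:
    """Map a raw metric to an NPS where higher is always better.

    Assumption: smaller-is-better metrics are mapped with 1 / (1 + value),
    so a value of 0 gives an NPS of 1 and large values approach 0.
    """
    if bigger_is_better:
        return value
    return 1.0 / (1.0 + value)


f1_nps = normalized_performance_score(0.9, bigger_is_better=True)
loss_nps = normalized_performance_score(1.0, bigger_is_better=False)
print(f1_nps, loss_nps)
```

With this mapping, an F1 score passes through unchanged, while a lower loss always produces a higher NPS, so both kinds of metric can be compared on the same scale.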
The details and results of the experimental task are as follows (green represents the optimal indicator under the current task):
It can be seen that on some complex data science tasks, DS Assistant outperforms the open-source SOTA in normalized performance score (NPS), task time, and number of tokens consumed. (The open-source SOTA figures are the measured values for MetaGPT.)
Full Experiment Log: https://modelscope-agent.oss-cn-hangzhou.aliyuncs.com/resources/DS_Assistant_results.zip
For different people, DS Assistant serves different purposes:
Next, DS Assistant will be optimized in three directions:
1. Further improve the success rate of task execution:
a) For the Code Agent, passing in a large amount of information (error messages, intermediate data, and previously generated code) lowers the accuracy of the code the model generates. In the future, an LLM could be used to summarize and filter this information first.
b) A single task can be decomposed further to reduce the demands on the LLM's reasoning ability.
2. Support interactive dialogue, which separates planning a task from executing it, so that users can advance the task through dialogue and influence the execution results.
3. Support batch processing, so that the same task can be run over multiple batches of files.
For more details, see the Data Science Assistant example in the Modelscope-Agent official repository.