ZHANG Ji never expected that, in just five months, he would go from being a complete beginner to a lecturer in the large language model (LLM) seed class.
"Everything here is new, and you learn it as you go," ZHANG Ji said. Having sidestepped the pitfalls that tripped up others, he felt obliged to pass his experience on to those who come after him.
Rapid learning, mutual support, and hands-on, practice-oriented training define the atmosphere of the LLM seed class and drive the growth of its "seeds".
Probing the Boundaries of LLMs
In March this year, Zhejiang Lab (ZJ Lab) launched a sponsorship program for young elite scientists, under which LLM seed training courses (the "seed class") are offered on a rolling basis every three months to cultivate full-stack LLM developers.
ZHANG Ji entered the second seed class with a background in physics and traditional supercomputing, a complete newcomer to the field of LLMs.
"At the beginning, I attended classes and read books by myself to learn basic theories and algorithms recommended by my classmates, copied others' steps, and then built a foundation model with 1 billion parameters by myself." With the help of his instructor and classmates, ZHANG Ji made rapid progress.
During the seed class, ZHANG Ji routinely worked past midnight. Facing high-intensity training and one challenge after another, he threw himself entirely into LLMs, often at the expense of sleep.
ZHANG Ji joined the seed class simply to learn. Yet as he dug deeper into LLMs and took on related work, he found that the models still leave much room for improvement.
"The domain model we are working on requires scientific data processing, and the calculation of scientific data focuses on precision and quantification, so the standard language model may not add, subtract, multiply or divide big numbers correctly," said ZHANG Ji.
Another area that needs improvement is training efficiency. Since GPU compute accounts for the largest share of LLM training costs, the floating-point throughput of the GPUs has to be used to the fullest.
While building the domain model, ZHANG Ji found, somewhat to his surprise, that data throughput is a key factor in computing performance. If data is likened to water, then memory is the outlet: when the flow from the outlet is too slow, the GPU sits idle waiting for data, and all that waiting wastes expensive GPU resources.
"The scientific data we are working with is three-dimensional (3D). If, in a language model, the length of 1,000 texts is 1,000, then it's the third power of 1,000 in a 3D scientific model, which means massive throughput." ZHANG Ji added, "What we need to do is to optimize algorithms and improve computing performance to boost GPU utilization."
ZHANG Ji and his classmates improved the overall efficiency of the model by an order of magnitude by optimizing algorithms and increasing data stream throughput.
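The article does not describe the lab's actual pipeline, but the idea of "keeping the outlet flowing" is commonly implemented by overlapping data loading with GPU computation. The sketch below is a minimal, illustrative example assuming a PyTorch-style workflow; the dataset class, sample sizes, and loader parameters are hypothetical, not ZHANG Ji's code.

```python
# Minimal sketch: keeping the GPU fed with 3D scientific data.
# Assumes a PyTorch-style pipeline; names and sizes are illustrative only.
import torch
from torch.utils.data import Dataset, DataLoader

class Volume3DDataset(Dataset):
    """Hypothetical dataset of 3D volumes (e.g. 128^3 voxels per sample)."""
    def __init__(self, num_samples=1024, side=128):
        self.num_samples = num_samples
        self.side = side

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In practice this would read and preprocess real scientific data.
        return torch.randn(self.side, self.side, self.side)

loader = DataLoader(
    Volume3DDataset(),
    batch_size=4,
    num_workers=8,      # CPU workers prepare the next batches while the GPU computes
    pin_memory=True,    # page-locked host memory speeds up host-to-device copies
    prefetch_factor=2,  # each worker keeps extra batches ready in advance
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch in loader:
    # non_blocking=True lets the copy overlap with computation already on the GPU
    batch = batch.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```

With a 3D sample of side 1,000 holding 1,000³ elements versus 1,000 tokens for a text of the same nominal length, the payoff of this kind of prefetching grows accordingly: the less time the GPU spends waiting on the data stream, the closer it runs to its floating-point peak.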
ZHANG Ji (Middle)
"LLMs are not a panacea for all problems, but we can do nothing without an explicit understanding of LLMs." This is a sentence that all the students in the seed class have heard. Now, ZHANG Ji knows more about what it means.
The power of LLM lies in its ability to help us automate mass data processing, improve production efficiency, and identify things probably in data that are not easily visible to humans. More importantly, we recognize the capability boundaries of LLM and turn it into truly an open tool to facilitate cooperation with scientists in many fields to advance their scientific discoveries with easy access to this tool.
From "Seed" to "Dandelion"
After graduating from the second seed class, ZHANG Ji joined ZJ Lab's coordinated research program on "foundation models for science" and immediately set to work.
"I devote at least one-third of my energies to data every day. Data, models, and computational power are known as the troika of artificial intelligence. First and foremost, it is critical to let the data flow smoothly," said ZHANG Ji.
Collecting and cleaning scientific data is the first step in developing foundation models for science, and it demands a great deal of patience, especially when there are few proven precedents to follow. "Every domain's scientific data is unique, so turning raw scientific data into a model corpus is a collaborative process. After each round of processing, we summarize the mistakes we made and the lessons learned, and distill them into a set of procedures that others can reference, which improves overall data processing efficiency."
While working on foundation models, ZHANG Ji also served as a teaching assistant for the seed class, giving lectures and answering students' questions.
In the seed class, ZHANG Ji taught two topics: LLM evaluation, and a case study of seismic exploration models for oil and gas that shows what LLMs can help scientists do.
ZHANG Ji told the students before every lecture, "Not everything I say is necessarily right, so feel free to interrupt me and raise questions at any time."
"Because it's really an incredibly new area, updated every month. For example, some of my PPT focuses on cutting-edge issues in LLM evaluation. Perhaps, when lecturing to the students now, I have to update the content of the PPT that I made last month."
When attending classes and serving as a teaching assistant, ZHANG Ji stayed in sync with this field at all times and also handled problems encountered by students. "These solutions are eventually translated into my experience gained in the seed class, and my accumulated documents can be spread to the later students."
Being both a student and a teacher, ZHANG Ji learned as he taught, and the constant cycle of input and output pushed him forward. In this way, a "seed" grows into a "dandelion" that keeps producing and spreading new seeds.