
Guide 004

This project focuses on extracting and digitizing STEM problems and solutions from educational materials in PDF format to enhance AI capabilities. The process involves accurate transcription, LaTeX conversion, and structured output formatting, ensuring clarity and self-containment of each problem. Attention to detail and adherence to specific criteria for problem selection and answer formatting are essential for the project's success.

Uploaded by

jaisingh2744
Copyright
© All Rights Reserved

This project represents a significant contribution to the field of AI and educational resource
digitization. Your meticulous work will directly improve the capabilities of a powerful model,
potentially benefiting countless students and researchers. Remember to take your time, focus
on accuracy, and don't hesitate to ask for clarification if needed. Your dedication and attention
to detail are essential to the success of this project. We are confident in your abilities, and we
appreciate your commitment to this important undertaking. Good luck!

What to Expect in the Project:

• Source Material: Extracting problems and solutions from STEM textbooks and math
competition materials (originally in PDF format).

• Accurate Transcription: Maintaining the original form and content of the problems and
solutions.

• Final Answer Provision: Providing the final answer for each problem.

• LaTeX Conversion: Utilizing Mathpix to obtain LaTeX versions for efficient handling.

• High Accuracy and Clarity: Careful attention to detail is crucial for accurate transcription
and solution verification.

• Structured Output: Formatting the output as a highly structured list of dictionaries with
specific keys for machine processing.

• Answer Simplification: Simplifying final answers and double-checking for accuracy
against the original source.

• Image Handling: Working with base64-encoded images.

• Potential Rewording: Rewording problems for enhanced clarity where necessary, while
preserving the original intent.

What you will do:

1. Identify Suitable Problems: Select problems from the provided PDFs that meet the
specified criteria (clear final answer, no proof-based questions, self-contained, etc.).
2. Extract and Transcribe: Accurately transcribe the problem text from the PDF. If needed,
reword to ensure self-containment and clarity, adding necessary physical constants.
Never create a new solution.

3. Extract and Encode Figures: Extract figures that are absolutely necessary for solving the
problem and encode them as base64 strings.

4. Copy-Paste LaTeX (Mathpix): Utilize the provided LaTeX versions from Mathpix.

5. Write Final Answer: Using the solution, determine and write the final answer
(simplifying it as much as possible), noting multiple variations if they exist (up to 3).

6. Create Metadata: Record metadata including the source (resource name and page
number) for traceability.

7. Format Output: Arrange the extracted data into the specified structured format (list of
dictionaries with keys for problem, problem figures, solution, final answer variations,
and metadata).

Part 1
We are focusing on extracting existing problems and solutions from educational materials like
STEM textbooks and math competition challenges that are typically in PDF format. The aim is to
accurately transcribe these problems. We will utilize tools like Mathpix to convert the content
into LaTeX format, which is particularly helpful for capturing complex equations and figures
precisely.

Key Points:

Purpose: Capture existing problems and solutions for educational use, specifically those that
can end with a definitive answer.

Format: Problems should have a final, concise answer like a number or a combination of
mathematical entities (e.g., a matrix, vector). They should be straightforward to grade as either
correct or incorrect.

Metadata and Answer: Alongside the problem and solution, we need a concise final answer to
use as a benchmark for checking the correctness of future solutions.

Exclusion Criteria: Avoid proof-based questions, extensive explanations, or tasks that expect
specific solving methods. We should steer clear of problems asking for plots or drawings, as
these don't fit the definitive answer requirement.

Part 2
Now let’s focus on the careful extraction and transcription of mathematical and scientific
problems from educational resources. The aim is to ensure that each problem is clear,
self-contained, and has all the necessary information for someone to solve it without needing to
reference other materials.

The process is divided into two key phases: problem selection and output formatting.

Part 1: Problem Selection Criteria

• Self-Contained Problems: All variables must be clearly defined within the problem
statement, except for standard mathematical constants like π or e.

• Inclusion of Constants: Include the values of any physical constants (e.g., melting point
of an element) needed to solve the problem.

• Independence: Problems should be self-contained and not refer to other parts of the
book.

• Extraction and Rewording: Extract problems as they appear, but reword if needed for
clarity or to make them self-contained (adding constants if required).

• Available Solutions: Only select problems with final answers already presented in the
book; do not solve problems without provided answers. Skip problems with suspect
answers or unclear phrasing.

Examples:

Bad ones:

1. Sketch the graphs of the following polynomials y = P(x), where P(x) is:

a. x(x+1)(x-3)

b. x(x+1)(3-x)

Note: This example doesn't fit the instructions because it asks for a sketch of graphs, which is
explicitly forbidden. The task requires problems with clearly extractable final answers that can
be automatically graded as correct or incorrect. A sketch is subjective and cannot be
automatically graded.

2. In the circuit described in Figure 3.2 (see page 45), calculate the current flowing through
resistor R3. Assume the values of all other components are as defined on page 45.

Note: This problem is not self-contained. It explicitly relies on information from a figure and
other parts of the book ("Figure 3.2," "page 45," "values defined on page 45").

Good ones:

1. Find the maximum time in weeks a comet of mass $m$ following a parabolic trajectory
around the Sun can spend within the orbit of the Earth. Assume that the Earth's orbit is circular
and in the same plane as that of the comet.

Note: Solving this problem leads to a single, definitive numerical answer (maximum time in
weeks) representing the maximum time the comet spends within Earth's orbit. This numerical
answer can be directly compared to the provided solution and automatically graded for
correctness.

2. I meet someone with 2 children, and I learn that one of the children is a boy. What's the
probability that the other child is also a boy? What if one of the children is a boy born on a
Tuesday?

Note: While it's a classic probability puzzle, and involves some nuanced reasoning about
conditional probabilities, the ultimate goal is to calculate and state a numerical probability. This
final answer allows for automated grading (it's either right or wrong).

Part 2: Formatting and Answer Criteria

• Multiple Answer Variations: Document up to three variations of the final answer if
multiple correct forms exist. Simplify the answer as much as possible.

• Figure Inclusion: Only include figures absolutely necessary for solving the problem;
exclude figures if the text provides sufficient information.

• Acceptable Answer Types: Exact answers (integers, rationals, radicals, π, e, algebraic
formulas, and decimals representing exact results) are acceptable. Approximate
decimals are not acceptable.

• Multiple Sub-Questions: Treat each sub-question as a separate, independent problem,
including necessary context from previous sub-questions. Each sub-question should have
a single mathematical quantity as its final answer.

• Multiple-Choice Questions: Reword multiple-choice questions (if possible) so that they
have only one correct answer. If the final answer is given as A, B, C, etc., write the
corresponding value. Example: "A man is 35 years old, and his son is 7 years old. In how
many years will the father be three times as old as his son?"

Exceptions: If rewording isn't feasible (e.g., the answer is "E) Other" or the question
directly refers to the choices like "Which of the following statements are correct? A)...,
B)...,..."), skip those questions.

• Unacceptable Answer Types: Approximate decimal answers (e.g., x ≈ 4.334) are not
acceptable.

• Numerical Answers with Formulas: If the final answer is given by a formula followed by
a numerical value, present the final answer in one of the following formats:
o Option 1: v1 = (0.9g/g)^(1/3)v0 ≈ 0.97v0
o Option 2: v1 = (0.9)^(1/3)v0 ≈ 0.97v0

Note: The formula must be included to clarify how the approximate value (0.966,
0.971, or 0.97) was obtained. This is especially important for physics/science
problems. Example: sin⁻¹(6*460/3333) ≈ 55.9° or sin⁻¹(6*460/3333) (depending on
precision requirements).

• Simplification Process: Do not include the simplification process in the final answer.

• Shortened Solutions: If the solution is available in the Mathpix version, ideally copy and
paste the entire solution unless it's excessively time-consuming.
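
The pairing of a formula with its rounded value, as in the "Numerical Answers with Formulas" item above, can be sanity-checked numerically before the answer is recorded. The following is a minimal Python sketch using only the standard library; the helper name `matches_rounded` is hypothetical and is not part of any project tooling described in this guide.

```python
import math


def matches_rounded(exact_value: float, quoted_value: float, decimals: int) -> bool:
    """Return True if exact_value rounds to quoted_value at the given precision."""
    return round(exact_value, decimals) == round(quoted_value, decimals)


# sin^-1(6*460/3333) converted to degrees; the guide quotes 55.9 degrees.
angle_deg = math.degrees(math.asin(6 * 460 / 3333))

# (0.9)^(1/3), the coefficient of v0; the guide quotes 0.97.
coefficient = 0.9 ** (1 / 3)

print(matches_rounded(angle_deg, 55.9, 1))
print(matches_rounded(coefficient, 0.97, 2))
```

A check like this only confirms that the quoted decimal is consistent with the formula; the formula itself still has to come from the source solution.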

Examples:

Bad ones:

1. A helicopter needs a minimum of a 100 hp engine to hover ($1 \text{ hp} = 746 \text{ W}$).
Estimate the minimum power necessary to hover for the motor of a 10 times reduced model of
this helicopter (assuming that it is made of the same materials).

Note: This example problem likely doesn't fit the instructions because its solution relies on
scaling arguments and estimations, rather than a single, directly calculable numerical answer.
The instructions prefer problems solvable with a precise numerical or algebraic answer that can
be easily checked for correctness.

2. A system of three interconnected water tanks (Tank A, Tank B, Tank C) has the following
properties: Initially, Tank A contains 10 liters of water, Tank B contains 5 liters, and Tank C is
empty. Water flows from Tank A to Tank B at a rate of 1 liter/minute, and from Tank B to Tank C
at a rate of 0.5 liters/minute. There is no flow from Tank C.

a) How much water is in Tank B after 2 minutes?

b) How much water is in Tank C after 5 minutes?

c) Determine the exact time when Tank B will contain exactly 7 liters of water and how much
water will be in Tank A and Tank C at that time.

Note: It's not a self-contained question with a single mathematical quantity as a final answer. It
requires using the results (or at least the underlying equations) derived from parts (a) and (b) to
solve. The final answer to (c) is not a single value but actually three values (time, and amount of
water in two other tanks). The instruction explicitly states each sub-question needs a single
answer; (c) intrinsically requires multiple answers.

Good ones:

1. Solve the equation in real numbers $$ 5^{x}+12^{x}=13^{x}$$

Note: The solution will be a specific number (likely an integer or a simple rational number in
this case), suitable for easy automated verification. The problem statement is completely
self-contained. It is a single question that leads to a single mathematical result.
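
As a quick check of why this example has a single gradable answer (a verification sketch added here, not taken from any source solution): dividing both sides by $13^{x}$ gives a strictly decreasing function of $x$ set equal to 1, so there is at most one real solution, and $x=2$ satisfies it.

```latex
\[
\left(\tfrac{5}{13}\right)^{x}+\left(\tfrac{12}{13}\right)^{x}=1,
\qquad
5^{2}+12^{2}=25+144=169=13^{2}
\;\Longrightarrow\; x=2 .
\]
```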

2.1. A person on Earth observes two rocket ships moving directly toward each other in one
straight line and colliding. At time $t=0$ in the Earth frame, the Earth observer determines that
rocket $A$, travelling to the right at $v_{A}=0.8 c$, is at point $a$, and rocket $B$ is at point
$b$, travelling to the left at $v_{B}=0.6 c$. They are separated by a distance $l=4.2 \cdot
10^{8} \mathrm{~m}$. How fast is rocket $B$ approaching in $A$ 's frame? How fast is rocket
$A$ approaching in $B$ 's frame?

2.2 A person on Earth observes two rocket ships moving directly toward each other in one
straight line and colliding. At time $t=0$ in the Earth frame, the Earth observer determines that
rocket $A$, travelling to the right at $v_{A}=0.8 c$, is at point $a$, and rocket $B$ is at point
$b$, travelling to the left at $v_{B}=0.6 c$. They are separated by a distance $l=4.2 \cdot
10^{8} \mathrm{~m}$. How much time will elapse in $A$ 's frame from the time rocket $A$
passes point $a$ until collision? How much time will elapse in $B$ 's frame from the time rocket
$B$ passes point $b$ until collision?

Note: When a problem requires calculating multiple quantities, it should be divided into
separate problems. Each part (2.1 and 2.2) presents a distinct question, and the answer to one
does not directly depend on the answer to the other. While both parts use the same initial
conditions (velocities and separation), they ask for different quantities.
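
The splitting described in this note can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_subquestions` is hypothetical, the shared context is abbreviated from the rocket example, and in practice each reworded problem should still be read through by hand.

```python
def split_subquestions(context: str, questions: list[str]) -> list[str]:
    """Prefix each sub-question with the shared context so it stands alone."""
    context = context.strip()
    return [f"{context} {question.strip()}" for question in questions]


# Abbreviated shared setup (the full statement would also carry the
# velocities and the separation distance).
shared = (
    "A person on Earth observes two rocket ships moving directly toward "
    "each other in one straight line and colliding."
)

parts = split_subquestions(shared, [
    "How fast is rocket $B$ approaching in $A$'s frame?",
    "How much time will elapse in $A$'s frame until collision?",
])

for part in parts:
    print(part)
```

Each resulting string repeats the full setup, so it can be stored and graded as an independent problem.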

Part 3
Now we can detail the structured format for organizing the extracted problems. Each
problem-solution pair will be stored as a dictionary, simplifying management and retrieval.

Structured Format: The output must be a list of dictionaries, with each dictionary having
specific keys to ensure uniformity and clarity, which helps in organizing the data for further use.
Please remember that the keys of each dictionary must be entered in the exact order specified in
the instructions.

Required Keys:

'problem': The text of the problem itself.

'problem_figures': A list containing base64-encoded images relevant to the problem, which
should be derived from the corresponding Mathpix files.

'solution': The solution text, if available, should be included.

'final_answer_variations': A list of variations of the final answer, allowing for different
acceptable formats.

'metadata': Contains the resource name and page number for easy tracking and referencing.
Here is the fixed format:
{"filename": "Textbook1", "page_number": ["23"], "solution_page": ["39"]}

Encoding Figures: Figures must be encoded in base64 format to facilitate easy embedding in
digital documents.
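
For the figure encoding step, a minimal Python sketch with the standard `base64` module might look like the following. The function names are hypothetical, and the guide does not say whether a prefix such as a data URI is expected, so this sketch assumes the bare base64 string is what goes into 'problem_figures'.

```python
import base64
from pathlib import Path


def encode_bytes(raw_bytes: bytes) -> str:
    """Return the base64 representation of raw image bytes as an ASCII string."""
    return base64.b64encode(raw_bytes).decode("ascii")


def encode_figure(image_path: str) -> str:
    """Read an image file from disk and return its base64-encoded contents."""
    return encode_bytes(Path(image_path).read_bytes())


def decode_figure(encoded: str) -> bytes:
    """Reverse the encoding, e.g. to confirm the figure survived the round trip."""
    return base64.b64decode(encoded)
```

Decoding the stored string and comparing it with the original file bytes is a cheap way to confirm that nothing was truncated when pasting.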

The final answer could also be a combination of numbers, functions, vectors, etc. if the
question has multiple parts. In these cases, please still format the final answer as a single string
with the combined answers, e.g. “(a): TEXT_1, (b): TEXT_2,...”.

Note: For page_number and solution_page, use the actual page numbers from the PDF
(including pages with numbers i, ii, etc.). Also, include the problem_number for each book in
the metadata.
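
Putting the pieces of Part 3 together, one record could be assembled and checked as sketched below in Python. Several things here are assumptions rather than project requirements: the helper names are hypothetical, the key order is taken from the order of the "Required Keys" list, the solution text is a placeholder, and 'problem_number' is left out because the fixed metadata format shown above does not say how it should be written.

```python
import json

# Key order assumed from the order in which the guide lists the required keys.
REQUIRED_KEYS = [
    "problem",
    "problem_figures",
    "solution",
    "final_answer_variations",
    "metadata",
]


def make_record(problem, figures, solution, answers, filename, pages, solution_pages):
    """Build one problem dictionary with keys inserted in the required order."""
    if not 1 <= len(answers) <= 3:
        raise ValueError("provide between one and three final answer variations")
    return {
        "problem": problem,
        "problem_figures": list(figures),
        "solution": solution,
        "final_answer_variations": list(answers),
        "metadata": {
            "filename": filename,
            "page_number": list(pages),
            "solution_page": list(solution_pages),
        },
    }


def has_required_key_order(record) -> bool:
    """Check that the record has exactly the required keys, in order."""
    return list(record.keys()) == REQUIRED_KEYS


record = make_record(
    problem="Solve the equation in real numbers $$ 5^{x}+12^{x}=13^{x}$$",
    figures=[],  # no figure is needed for this problem
    solution="(solution text copied from the Mathpix version)",  # placeholder
    answers=["2", "x=2"],
    filename="Textbook1",
    pages=["23"],
    solution_pages=["39"],
)

# The final output is a list of such dictionaries.
print(json.dumps([record], indent=2))
```

Python dictionaries preserve insertion order, and `json.dumps` writes keys in that order, so building the dictionary in the listed sequence is enough to keep the order in the serialized output.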
