Starting a Data Parsing Script
We already identified a good source for most of the data we want to use from the Monster Infusion FAQ post on Gamefaqs.com. However, we don't want to type the thousands of lines of data from this FAQ into our database because we would be introducing human error with the data copying, and the writers of this FAQ have already gone through all of the trouble of entering the data the first time, hopefully without mistakes. Besides, why would we go through such a tedious process when we could have fun writing a script to do the work for us? Come on, we're programmers! Let's write this script.
Since the website will eventually be in Ruby on Rails, we might as well write this script in Ruby, too. It's not absolutely necessary to write the script in Ruby because it's a one-off deal that will only be run once (when it works) to convert the text file into a format that we can easily import into a database, but Ruby is pretty darn good at text processing, so let's stick with it. I like writing scripts in stages, breaking things down into simple problems and starting with an easy first step, so let's do that here. The simplest thing we can do is read in the text file after saving the FAQ to a local file. To add a bit of debug to make sure we have the file read in, let's scan through and print out the section header for the data we're looking for in the file:
File.foreach("ffiii2_monster_taming_faq.txt") do |line|
if line.include? "MLTameV"
puts line
end
end
Already, this code gives the basic structure of what we're trying to do. We're going to read in the file, loop through every line, look for certain patterns, and output what we find that matches those patterns. The real deal will be much more complex, but it's always good to have a working starting point.This code also has a few problems that we may or may not want to do anything about. First, it's just hanging out in the middle of nowhere. It's not in a class or function or anything more structured. If this was going to be a reusable parsing tool for converting various FAQs into rows of data, I would definitely want to engineer this code more robustly. But hey, this is a one-off script, and it doesn't need all of that extra support to make it reusable. Over engineering is just a waste of time so we'll leave this code out in the open.
Second, I've got two constant strings hard-coded in those lines: the file name and the search string. I may want to stick the search string in a variable because it's not terribly obvious what "MLTameV" means. The file name, on the other hand, doesn't need to be in a variable. I plan to keep this part of the code quite simple, and it's the obvious loop where the file is read in. On top of that, this code will be very specific to handling this exact file, so I want the file name to be tightly coupled to this loop. If the script is ever copied and modified to work on a different file, this file name string can be changed in this one place to point to the new file that that script works with. I don't see a need to complicate this code with a variable.
Third, when this code runs, it prints out two lines instead of one because there's another instance of "MLTameV" in the table of contents of the file. For locating the place to start parsing monster data, we want the second instance of this string. One way to accomplish this task is with the following code:
SECTION_TAG = "MLTameV"
section_tag_found = false
File.foreach("ffiii2_monster_taming_faq.txt") do |line|
if section_tag_found and line.include? SECTION_TAG
puts line
elsif line.include? SECTION_TAG
section_tag_found = true
end
end
Now only the section header line is printed when this script is run. However, as what inevitably happens when we add more code, we've introduced a new problem. It may not be obvious right now, but the path that we're on with the section_tag_found variable is not sustainable. This variable is a piece of state that notifies the code when we've seen a particular pattern in the text file so we can do something different afterward. When parsing a text file using state variables like this one, we'll end up needing a lot of state variables, and it gets unmanageable and unreadable fast. What we are going to need instead, to keep track of what we need to do next, is a state machine.Parsing Text with a Finite State Machine
Finite state machines (FSM) are great for keeping track of where you are in a process and knowing which state to go to next, like we need to know in the case of finding the section header for the list of tamable monsters in this text file. In the FSM we always have a current state that is one of a finite number of states, hence the name. Depending on the input in that state, the FSM will advance to a next state and possibly perform some output task. Here is what that process looks like in Ruby for finding the second section tag:
SECTION_TAG = "MLTameV"
section_tag_found = lambda do |line|
if line.include? SECTION_TAG
puts line
end
return section_tag_found
end
start = lambda do |line|
if line.include? SECTION_TAG
return section_tag_found
end
return start
end
next_state = start
File.foreach("ffiii2_monster_taming_faq.txt") do |line|
next_state = next_state.(line)
end
First, the states are defined as lambda methods so that they can easily be passed around as variables, but still called as functions. These variables have to be declared before they're used, so the section_tag_found method either has to be defined first because the start method uses it, or all methods could be predefined at the start of the file and then redefined with their method bodies in any desired order. Another way to define these states would be to wrap the whole thing in a class so that the states are class members, but that kind of design would be more warranted if this FSM was part of a larger system. As it is, this parser will be almost entirely made up of this FSM, so we don't need to complicate things.We can also represent this FSM with a diagram:
The FSM starts in the Start state, obviously, and it transitions to the Section Tag Found state when there's a matching SECTION_TAG. The unlabeled lines pointing back to the same states mean that for any other condition, the state remains unchanged. This diagram is quite simple, but when the FSM gets more complex, it will definitely help understanding to see it drawn out.
Notice that running through the lines of the text file in the foreach loop became super simple. All that's necessary is to feed each line into the next_state, and assign the return value as the new next_state. The current state is kind of hidden because we're assigning the next_state to itself. Also notice that we need to be careful to always return a valid state in each path of each state method, even if it's the same state that we're currently in. Inadvertently returning something that was not a valid state would be bad, as the FSM is going to immediately try to call it on the next line.
Now that we have an FSM started, it'll be easy to add more states and start working our way through the tamable monster data. What do we need to look for next? Well, we can take a look at the data for one monster and see if there are any defining characteristics:
...............................................................................
MONSTER 001
Name---------: Apkallu Minimum Base HP------: 1,877
Role---------: Commando Maximum Base HP------: 2,075
Location-----: Academia 500 AF Minimum Base Strength: 99
Max Level----: 45 Maximum Base Strength: 101
Speed--------: 75 Minimum Base Magic---: 60
Tame Rate----: 10% Maximum Base Magic---: 62
Growth-------: Standard
Immune-------: N/A
Resistant----: N/A
Halved-------: All Ailments
Weak---------: Fire, Lightning
Constellation: Sahagin
Feral Link-----: Abyssal Breath
Description----: Inflicts long-lasting status ailments on target and nearby
opponents.
Type-----------: Magic
Effect---------: 5 Hits, Deprotect, Deshell, Wound
Damage Modifier: 1.8
Charge Time----: 1:48
PS3 Combo------: Square
Xbox 360 Combo-: X
Default Passive: Attack: ATB Charge
Default Skill--: Attack
Default Skill--: Ruin
Default Skill--: Area Sweep
Lv. 05 Skill---: Powerchain
Lv. 12 Passive-: Strength +16%
Lv. 18 Skill---: Slow Chaser
Lv. 21 Skill---: Scourge
Lv. 27 Passive-: Strength +20%
Lv. 35 Passive-: Resist Dispel +10%
Lv. 41 Passive-: Strength +25%
Lv. 42 Passive-: Resist Dispel +44%
Lv. 45 Skill---: Ruinga
Special Notes: Apkallu only spawns twice in Academia 500 AF. If you fail to
acquire its Crystal in both encounters, you will have to close
the Time Gate and replay the area again.
...............................................................................
That series of dots at the beginning looks like a good thing to search for. It repeats at the start of every monster, so it's a good marker for going into a monster state. We'll also want to pass in a data structure that will be used to accumulate all of this monster data that we're going to find. To make it easy to export to a .csv file at the end, we're going to make this data structure an array of hashes, and it looks like this with the new state:
SECTION_TAG = "MLTameV"
MONSTER_SEPARATOR = "........................................"
new_monster = lambda do |line, data|
if line.include? MONSTER_SEPARATOR
return new_monster, data << {}
end
return new_monster, data
end
section_tag_found = lambda do |line, data|
if line.include? SECTION_TAG
return new_monster, data
end
return section_tag_found, data
end
start = lambda do |line, data|
if line.include? SECTION_TAG
return section_tag_found, data
end
return start, data
end
next_state = start
data = []
File.foreach("ffiii2_monster_taming_faq.txt") do |line|
next_state, data = next_state.(line, data)
end
puts data.length
I shortened the MONSTER_SEPARATOR pattern in case there were some separators that were shorter than the first one, but it should still be plenty long to catch all of the instances of separators between monsters in the file. Notice that we now have to pass the data array into and out of each state method so that we can accumulate the monster data in it. Right now it simply appends an empty hash for each monster it finds. We'll add to those hashes in a bit. At the end of the script, I print out the number of monsters found, which we expect to be 164, and it turns out to be a whopping 359! That's because that same separator is used more after the tamable monster section of the file, and we didn't stop at the end of the section. That should be easy enough to fix:SECTION_TAG = "MLTameV"
MONSTER_SEPARATOR = "........................................"
NEXT_SECTION_TAG = "SpecMon"
end_monsters = lambda do |line, data|
return end_monsters, data
end
new_monster = lambda do |line, data|
if line.include? MONSTER_SEPARATOR
return new_monster, data << {}
elsif line.include? NEXT_SECTION_TAG
return end_monsters, data
end
return new_monster, data
end
# ...
I added another state end_monsters that consumes every line to the end of the file, and we enter that state from the new_monster state if we see the NEXT_SECTION_TAG. Now if we run the script again, we get a count of 166 monsters. Close, but still not right. The problem is that there are a couple extra separator lines used in the tamable monster section, one after the last monster and one extra separator after a sub-heading for DLC monsters. We're going to have to get a bit more creative with how we detect a new monster. If we look back at the example of the first monster, we see that after the separator the next text is MONSTER 001. This title for each monster is consistent for all of the monsters, with MONSTER followed by a three digit number. Even the DLC monsters have this tag with DLC in front of it. This pattern is perfect for matching on a regular expression (regex).Finding Monster Data with Regular Expressions
A regex is a text pattern defined with special symbols that mean various things like "this character is repeated one or more times" or "any of these characters" or "this character is a digit." This pattern can be used to search a string of text, which is called matching the regex. In Ruby a regex pattern is denoted by wrapping it in forward slashes (/), and we can easily define a regex for our MONSTER 001 pattern:
SECTION_TAG = "MLTameV"
MONSTER_SEPARATOR = "........................................"
NEXT_SECTION_TAG = "SpecMon"
NEW_MONSTER_REGEX = /MONSTER\s\d{3}/
find_separator = nil
end_monsters = lambda do |line, data|
return end_monsters, data
end
new_monster = lambda do |line, data|
if NEW_MONSTER_REGEX =~ line
return find_separator, data << {}
elsif line.include? NEXT_SECTION_TAG
return end_monsters, data
end
return new_monster, data
end
find_separator = lambda do |line, data|
if line.include? MONSTER_SEPARATOR
return new_monster, data
end
return find_separator, data
end
# ...
The NEW_MONSTER_REGEX is defined as the characters MONSTER, followed by a space (\s), followed by three digits (\d). I changed the new_monster state to look for a match on our new regex, and added a find_separator state to still search for the MONSTER_SEPARATOR. Notice that the FSM will bounce between these two states, so the state that's defined later has to be declared at the top of the file, otherwise Ruby will complain that find_separator is undefined in new_monster.These regex patterns are useful and powerful, but they can also be quite tricky to get right, especially when they get long and complicated. We'll be using them to pull out all of the data we want from each monster, but we'll try to keep them as simple as possible. The next regex is more complicated, but it will allow us to pull nearly all of the properties for each monster and put it into the empty hash that was added to the list of hashes for that monster. Ready? Here it is:
MONSTER_PROP_REGEX = /(\w[\w\s\.]*\w)-*:\s(\S+(?:\s\S+)*)/
We'll break this regex apart and figure out what each piece means separately.The first part of the regex, (\w[\w\s\.]*\w), is surrounded by parentheses and is called a capture. A capture will match on whatever the pattern is inside the parentheses and save that matching text so that it can be accessed later. We'll see how that works in the code a little later, but right now we just need to know that this is how we're going to separate out the property name and its value from the full matching text. This particular capture is the property name, and it starts with a letter or number, symbolized with \w. The stuff in the brackets means that the next character can be a letter or number, a space, or a period. Any of those characters will match. Then the following '*' means that a string of zero or more of the preceding character will match. Finally, the property name must end with a letter or number, symbolized with \w again. The reason this pattern can't just be a string of letters and numbers is because some of the property names are multiple words, and the "Lv. 05 Skill" type properties also have periods in them. We want to match on all of those possibilities.
The next part of the regex is -*:\s, which simply means it will match on zero or more '-', followed by a ':', followed by a space. Reviewing the different lines for the MONSTER 001 example above, we can see that this pattern is indeed what happens. Some cases have multiple dashes after the property name, while others are immediately followed by a colon. The colon is always immediately followed by a single space, so this should work well as our name-value separator. It's also outside of any parentheses because we don't want to save it for later.
The last part of the regex is another capture for the property value: (\S+(?:\s\S+)*). The \S+—note the capital S—will match on one or more characters that are not white space. It's the inverse of \s. The next thing in this regex looks like yet another capture, but it has this special '?:' after the open parenthesis. This special pattern is called a grouping. It allows us to put a repeat pattern after the grouping, like the '*' in this case, so that it will match on zero or more of the entire grouping. It will not save it for later, though. Since this grouping is a space followed by one or more non-space characters, this pattern will match on zero or more words, including special characters. If we look at the example monster above, we see that this pattern is exactly what we want for most of the property values. Special characters are strewn throughout, and it would be too much trouble to enumerate them all without risking missing some so we cover our bases this way.
Fairly simple, really. We're going to match on a property name made up of one or more words, followed by a dash-colon separator, and ending with a property value made up of one or more words potentially including a mess of special characters. Note how we couldn't have used the \S character for the property name because it would have also matched on and consumed the dash-colon separator. We also could not have used the [\s\S]* style pattern for the words in the property value because it would have matched on any number of spaces between words. That wouldn't work for the first few lines of the monster properties because there are two name-value pairs on those lines. Now that we have our regex, how do we use those captured names and values, and how exactly is this going to work for the lines with two pairs of properties on them? Here's what the new add_property state looks like with some additional context:
# ...
MONSTER_PROP_REGEX = /(\w[\w\s\.]*\w)-*:\s(\S+(?:\s\S+)*)/
find_separator = nil
new_monster = nil
end_monsters = lambda do |line, data|
return end_monsters, data
end
add_property = lambda do |line, data|
props = line.scan(MONSTER_PROP_REGEX)
props.each { |prop| data.last[prop[0]] = prop[1] }
return new_monster, data if line.include? MONSTER_SEPARATOR
return add_property, data
end
new_monster = lambda do |line, data|
if NEW_MONSTER_REGEX =~ line
return add_property, data << {}
elsif line.include? NEXT_SECTION_TAG
return end_monsters, data
end
return new_monster, data
end
# ...
The double-property lines are handled with a different type of regex matcher, line.scan(MONSTER_PROP_REGEX). This scan returns an array of all of the substrings that matched the given regex in the string that it was called on. Conveniently, if the regex contains captures, the array elements are themselves arrays of each of the captures. For example, the scan of the first property line of our MONSTER 001 results in this array:[['Name', 'Apkallu'],['Minimum Base HP', '1,877']]
We can simply loop through this array, adding property name and property value to the last hash in the list of hashes. Then, if the line was actually the MONSTER_SEPARATOR string, it didn't match any properties and we'll move on to the next monster. Otherwise, we stay in the add_property state for the next line.One last thing that we're not handling is those multi-line descriptions and special notes. We need to append those lines to the correct property when we come across them, but how do we do that? Keep in mind that these extra lines won't match on MONSTER_PROP_REGEX, so we can simply detect that non-match, make sure it's not an empty line, and add it to the special notes if it exists or the description if the special notes doesn't exist. Here's what that code looks like in add_property.
MONSTER_PROP_EXT_REGEX = /\S+(?:\s\S+)*/
# ...
add_property = lambda do |line, data|
props = line.scan(MONSTER_PROP_REGEX)
props.each { |prop| data.last[prop[0]] = prop[1] }
return new_monster, data if line.include? MONSTER_SEPARATOR
ext_line_match = MONSTER_PROP_EXT_REGEX.match(line)
if props.empty? and ext_line_match
if data.last.key? 'Special Notes'
data.last['Special Notes'] += ' ' + ext_line_match[0]
else
data.last['Description'] += ' ' + ext_line_match[0]
end
end
return add_property, data
end
By putting the extra code after the return if the line is the MONSTER_SEPARATOR, we can assume that this line is not the MONSTER_SEPARATOR and just check if the MONSTER_PROP_REGEX didn't match and there's something on the line. Then decide on which property to add the line to, and we're good to go.Okay, that was a lot of stuff, so let's review. First, we read in the file that we wanted to parse that contains most of the monster taming data we need. Then, we loop through the lines of the file, feeding them into a FSM in order to find the section of the file where the list of monsters is and separate each monster's properties into its own group. Finally, we use a few simple regex patterns to capture each monster's property name-value pairs and add them to a list of hashes that will be fairly easy to print out to a .csv file later. All of this was done in 66 lines of Ruby code! Here's the program in full so we can see how it all fits together:
SECTION_TAG = "MLTameV"
MONSTER_SEPARATOR = "........................................"
NEXT_SECTION_TAG = "SpecMon"
NEW_MONSTER_REGEX = /MONSTER\s\d{3}/
MONSTER_PROP_REGEX = /(\w[\w\s\.]*\w)-*:\s(\S+(?:\s\S+)*)/
MONSTER_PROP_EXT_REGEX = /\S+(?:\s\S+)*/
find_separator = nil
new_monster = nil
end_monsters = lambda do |line, data|
return end_monsters, data
end
add_property = lambda do |line, data|
props = line.scan(MONSTER_PROP_REGEX)
props.each { |prop| data.last[prop[0]] = prop[1] }
return new_monster, data if line.include? MONSTER_SEPARATOR
ext_line_match = MONSTER_PROP_EXT_REGEX.match(line)
if props.empty? and ext_line_match
if data.last.key? 'Special Notes'
data.last['Special Notes'] += ' ' + ext_line_match[0]
else
data.last['Description'] += ' ' + ext_line_match[0]
end
end
return add_property, data
end
new_monster = lambda do |line, data|
if NEW_MONSTER_REGEX =~ line
return add_property, data << {}
elsif line.include? NEXT_SECTION_TAG
return end_monsters, data
end
return new_monster, data
end
find_separator = lambda do |line, data|
if line.include? MONSTER_SEPARATOR
return new_monster, data
end
return find_separator, data
end
section_tag_found = lambda do |line, data|
if line.include? SECTION_TAG
return find_separator, data
end
return section_tag_found, data
end
start = lambda do |line, data|
if line.include? SECTION_TAG
return section_tag_found, data
end
return start, data
end
next_state = start
data = []
File.foreach("ffiii2_monster_taming_faq.txt") do |line|
next_state, data = next_state.(line, data)
end
And here's the corresponding FSM diagram:We still need to write the collected data out to a .csv file so that we can import it into a database, but that is a task for next time. Also, notice that we have done almost no data integrity checks on this input other than what the FSM and regex patterns inherently provide. Any mistakes, typos, or unexpected text in the file will likely result in missing or corrupt data, so we'll need to do some checks on the data as well. Additionally, this data is just the tamable monster data. We still need the other table data for abilities, game areas, monster materials, and monster characteristics. However, this is a great start on the data that was the most difficult to get, and we ended up with quite a few extra properties that we weren't intending to collect in the list. That's okay, I'm sure we'll find a use for them.
No comments:
Post a Comment