I mentioned some weeks ago that I was looking to automate the validation of a bunch of source files I have on disk before uploading them to GitHub. I decided to get started on that, this week, to see what was involved. As mentioned in the previous post, I wanted to programmatically query posts from Typepad, extract any embedded code from them and compare it with what I had on disk, to see which files were already correct and which needed to be created or removed.
The first step was to choose a language. The fact this is really an OS-level task – with a bunch of string processing and file I/O, but nothing whatsoever to do with AutoCAD – gave me some freedom to choose pretty much any language that can be run on OS X or Windows. I briefly considered Python or Ruby, but ended up going back to F#: it’s nicely integrated into Visual Studio (my primary development tool) and a functional approach makes a lot of sense for this kind of problem. I also had the itch to do more with F# after my recent (too brief) foray into machine learning. One final factor in the decision process – not that it precludes the use of F#, at all – is that as this is a one-off activity I’m not at all worried about performance or memory constraints. It doesn’t matter if it takes 10 minutes to complete, for instance. Just as long as it completes. ;-)
Armed with F#, I then started looking at the problem itself. First up was pulling data down from the blog: I needed an API to access my blog’s content on Typepad – thankfully there’s a simple REST API which gives access to its posts’ complete content – and then had some additional choices to make about how to access the data.
To simplify working with the blog’s post data, I decided to use F# Type Providers. These allow you to code against data-oriented services as if they had local object models. Which is exactly what happens, I guess: when the code instantiates a JSON Type Provider against a particular resource – I downloaded a sample JSON file from the Typepad API for this – its contents are then accessible via a locally-generated set of objects and properties.
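As a rough illustration of the idea, here’s a minimal sketch that points the JSON Type Provider at an inline sample rather than a downloaded file. The sample document and its `entries`, `title` and `content` fields are assumptions made up for this example (the real Typepad response has many more fields), and the `#r "nuget: ..."` line assumes a recent version of F# Interactive.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical, cut-down sample: not the real Typepad schema
type SamplePost =
  JsonProvider<""" {"entries":[{"title":"A post","content":"<p>Hi</p>"}]} """>

// Parse a document that has the same shape as the sample
let doc =
  SamplePost.Parse(
    """ {"entries":[{"title":"First","content":"<p>One</p>"},
                    {"title":"Second","content":"<p>Two</p>"}]} """)

// Entries, Title and Content are generated from the sample's fields
let titles = doc.Entries |> Array.map (fun e -> e.Title)
printfn "%A" titles
```

The point is that `Entries`, `Title` and `Content` are checked at compile time, so a typo in a property name shows up as a build error rather than a runtime failure.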
The next problem was the “screen scraping”: we need to extract HTML code and convert it to plain text to compare with the local files. I opted for the HtmlAgilityPack for this: it comes with an Html2Txt sample that I converted to F#. I haven’t followed it exactly, as I wanted to keep some amount of whitespace in the generated plain text, but it got me a good part of the way there.
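To give a feel for the HTML-to-text step, here’s a small sketch using the HtmlAgilityPack on a made-up fragment. Note that it leans on the library’s own `HtmlEntity.DeEntitize` for entity decoding, which is a different technique from the hand-written replacement list in my code. The fragment and the one-line-per-`<p>` assumption are purely illustrative.

```fsharp
#r "nuget: HtmlAgilityPack"
open HtmlAgilityPack

// Hypothetical fragment shaped like a code section in a post
let html =
  "<div style=\"font-size: 8pt\"><p>if (x &lt; 3)</p><p>  return;</p></div>"

let doc = HtmlDocument()
doc.LoadHtml(html)

// One line of text per <p>, with entities decoded by the library
let lines =
  doc.DocumentNode.Descendants("p")
  |> Seq.map (fun p -> HtmlEntity.DeEntitize(p.InnerText))
  |> Seq.toList

lines |> List.iter (printfn "%s")
```

This is simpler than walking the node tree by hand, but it gives less control over whitespace, which is why I went with the Html2Txt-style traversal.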
In a general sense, here’s what the algorithm needs to do:
- Parse the various C# files on disk and create an index that links a comma-separated command-list with the source filename
  - The order is significant: if the commands don’t come in the same order then the files will be different
- Extract the post content from my blog and parse it for code fragments
  - I’ve used the same CopyAsHtml tool since this blog’s inception, so all code sections are enclosed in a similar-looking <div>
  - For now I only care about posts with a single code section. There are certainly posts where I’ve used this tool to copy smaller fragments for illustrative purposes, so at some point I need the code to pick these up, too
- For the posts with a single code section, extract the code and convert it to plain text
- Extract the commands implemented in the code and check which files on disk have the same commands, in the same sequence
- Perform a lower-level comparison between the code extracted from HTML and the files on disk
  - This still needs some work: there are files which should match that for some reason don’t, right now, but overall it’s working quite well
  - For debugging purposes I’m currently writing any unmatched code fragments as files in another folder, so that I can go through and see what problems are worth fixing
- The ultimate output is a list of post titles with the matching local filename
  - It’ll be a simple matter to copy these files programmatically into a local folder that will sync with GitHub
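The index lookup at the heart of these steps can be sketched on in-memory data. The filenames and command names below are hypothetical, and `String.concat` stands in for a hand-rolled join, so treat this as an illustration of the matching rule rather than the actual implementation.

```fsharp
// Hypothetical index: (comma-separated command list, filename)
let index =
  [ ("FOO,BAR", "foo-bar.cs")
    ("BAR,FOO", "bar-foo.cs")
    ("", "no-commands.cs") ]

// Build the lookup key from commands in the order they appear
let keyFor (cmds : string list) = String.concat "," cmds

// Files whose command list matches exactly, ignoring empty keys
let filesFor cmds =
  let key = keyFor cmds
  index
  |> List.filter (fun (k, _) -> k = key && k <> "")
  |> List.map snd

printfn "%A" (filesFor ["FOO"; "BAR"]) // ["foo-bar.cs"]
printfn "%A" (filesFor ["BAR"; "FOO"]) // ["bar-foo.cs"]
```

Because the key preserves command order, the two files above are treated as distinct even though they implement the same set of commands.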
Here’s the code I have, so far:
(*
#r "Z:/GitHub/FSharp.Data/bin/FSharp.Data.dll"
#r "packages\HtmlAgilityPack.1.4.9\lib\Net45\HtmlAgilityPack.dll"
#r "System.Xml"
*)
open FSharp.Data
open HtmlAgilityPack
open System.Xml
open System
open System.IO
let codeHeader = "<div style="
let blogRoot = "http://api.typepad.com/blogs/6a00d83452464869e200d83452baa169e2/post-assets.json"
let csFolder = @"Z:\data\Blogs\Projects\Basic C# app"
let tmpFolder = @"Z:\data\Blogs\Projects\Basic C# app\Notfound"
let csTest =
  @"Z:\data\Blogs\Projects\Basic C# app\enumerate-sysvars.cs"
let cmdAttrib = "[CommandMethod("
type Post = JsonProvider<"data.json">
// Use TypePad's REST API to retrieve batches of posts
let getPosts m n =
  let url = String.Format("{0}?max-results={1}", blogRoot, m)
  let url2 =
    match n with
    | 0 -> url
    | _ -> url + "&start-index=" + (m * n).ToString()
  let doc = Post.Load(url2)
  doc.Entries
// Count the number of times a substring appears in a string
let countOccurrences (sub:string) (text:string) =
  match sub with
  | "" -> 0
  | _ ->
    (text.Length - text.Replace(sub, @"").Length) / sub.Length
// These are HTML entity codes etc. that need to be replaced
// as we convert from HTML to plain text
// NOTE: the exact entity spellings in this list are assumed
let reps =
  [("&nbsp;"," ");("&#160;"," ");("\u00A0"," ");("&gt;",">");
   ("&lt;","<");("&#39;","'");("&quot;", "\"");("&ndash;","-");
   ("&amp;","&");("Â","")]
let convertText (t : string) =
  List.fold
    (fun (a : string) (b : string, c : string) -> a.Replace(b,c))
    t reps
// Use the HtmlAgilityPack to convert from HTML to plain text
let rec convertTo (node : HtmlNode ) =
  match node.NodeType with
  | HtmlNodeType.Comment -> ""
  | HtmlNodeType.Document ->
    Seq.map convertTo node.ChildNodes |>
    Seq.fold (fun r s -> r + s) ""
  | HtmlNodeType.Text ->
    // script and style must not be output
    let parentName = node.ParentNode.Name
    if parentName = "script" || parentName = "style" then
      ""
    else
      // get text
      let html = (node :?> HtmlTextNode).Text
      // is it in fact a special closing node output as text?
      if HtmlNode.IsOverlappedClosingElement(html) then
        ""
      else
        convertText html
  | HtmlNodeType.Element ->
    if node.Name = "p" then
      if node.HasChildNodes then
        (Seq.map convertTo node.ChildNodes |>
         Seq.fold (fun r s -> r + s) "") + "\r\n"
      else
        "\r\n"
    else if node.HasChildNodes then
      Seq.map convertTo node.ChildNodes |>
      Seq.fold (fun r s -> r + s) ""
    else
      ""
  | _ -> ""
// Take post data and extract the HTML fragment representing code
let extractCode (content : string) =
  let start = content.IndexOf(codeHeader)
  let finish = content.LastIndexOf("</div>") + 6
  let html = content.Substring(start, finish - start)
  let doc = new HtmlDocument()
  doc.LoadHtml(html)
  convertTo doc.DocumentNode
// If a post contains only 1 code segment, we'll extract it
let processPost (ent : Post.Entry) =
  let count = countOccurrences codeHeader ent.Content
  ent.Title,
  count,
  match count with
  | 1 -> extractCode ent.Content
  | _ -> ""
// List the files conforming to a pattern in a folder
let filesInFolder pat folder =
  try
    Directory.GetFiles(folder, pat, SearchOption.TopDirectoryOnly)
    |> Array.toList
  with | e -> []
// Get the indices at which a substring occurs in a string
let stringIndices (pat:string) (text:string) =
  let rec getIndices (pat:string) (text:string) (start:int) =
    match text.IndexOf(pat, start) with
    | -1 -> []
    | x -> x :: getIndices pat text (x+1)
  getIndices pat text 0
// Extract the command name from a CommandMethod attribute
let extractCommandName (text : string) =
  let delim = "\""
  let count = countOccurrences delim text
  match count with
  | 0 -> ""
  | 1 -> ""
  | 2 -> text.Substring(1, text.LastIndexOf(delim) - 1)
  | 3 -> ""
  | _ ->
    let idxs = stringIndices delim text
    text.Substring(idxs.[2] + 1, idxs.[3] - idxs.[2] - 1)
// Extract the various command names from a code segment
let rec commandsFromCode (text : string) =
  match text.Contains(cmdAttrib) with
  | false -> []
  | true ->
    let start = text.IndexOf(cmdAttrib) + cmdAttrib.Length
    let finish = text.IndexOf(")", start + 1)
    let name =
      text.Substring(start, finish - start) |> extractCommandName
    name :: commandsFromCode (text.Substring finish)
// Create a comma-separated string from a list of strings
let rec commaSepString (cmds : string list) =
  match cmds with
  | [] -> ""
  | x::[] -> x
  | x::xs -> x + "," + commaSepString xs
// Get the commands for a particular file on disk as a
// comma-separated list and return them with the filename
let commandsForFile file =
  File.ReadAllText file |>
  commandsFromCode |>
  commaSepString |>
  (fun x -> (x, file))
// Get the command names for a set of files on disk
let rec commandsForFiles files =
  match files with
  | [] -> []
  | file::xs -> commandsForFile file :: commandsForFiles xs
// Create an index from commands to files for a particular folder
let indexCommands (folder : string) =
  filesInFolder "*.cs" folder |> commandsForFiles
// From our index, get the files associated with a command-set
let filesForCommandsFromIndex index cmds =
  index |>
  List.filter (fun (a,b) -> a = cmds && a <> "") |>
  List.map (fun (a,b) -> b)
// Strip blank lines from a sequence of strings
let stripBlanks (s : seq<string>) =
  Seq.filter (fun x -> not(String.IsNullOrWhiteSpace(x))) s
// Compare sequences of strings, ignoring non-relevant whitespace
let compareSequences (s1 : seq<string>) (s2 : seq<string>) =
  Seq.compareWith
    (fun (a:string) (b:string) -> String.Compare(a.Trim(), b.Trim()))
    (stripBlanks s1) (stripBlanks s2)
// Write code to a temp file - for debugging only
let writeToTmpFile (code:string) =
  let rec getTmpFile i =
    let file = tmpFolder + "\\" + i.ToString() + ".cs"
    if not(File.Exists(file)) then
      file
    else
      getTmpFile (i+1)
  use wr = new StreamWriter((getTmpFile 0))
  wr.Write(code)
// Take a code fragment and a file and check them for equivalence
let checkCodeAgainstFile (code:string) (file:string) =
  let clines = code.Split("\n\r".ToCharArray())
  let flines = File.ReadAllLines(file)
  let s1 = Seq.ofArray clines
  let s2 = Seq.ofArray flines
  if compareSequences s1 s2 = 0 then
    [file]
  else
    []
// Take a code fragment and a set of files and see if one matches
let checkCodeAgainstFiles code files =
  let rec checkAgainstFiles code files =
    match files with
    | [] -> []
    | x::xs ->
      checkCodeAgainstFile code x :: checkAgainstFiles code xs
  checkAgainstFiles code files |> List.concat
// Our main function
[<EntryPoint>]
let main argv =
  // Build an index from commands to source files on the hard drive
  let index = indexCommands csFolder
  // Pull down post information from TypePad and process it
  let posts =
    [|0..25|] |>
    Array.map (getPosts 50) |> // Get up to 1300 posts: 26 batches of 50
    Array.concat |> // Flatten the nested arrays
    Array.map processPost // Process the posts
  // Separate the posts into posts with code and those without
  let postsWith, postsWithout =
    Array.partition (fun (a,b,c) -> b > 0) posts
  // Separate the posts with code into those with one section
  // and those with more
  let postsWithOne, postsWithMore =
    Array.partition (fun (a,b,c) -> b = 1) postsWith
  printfn
    "%d posts with zero, %d posts with one, %d posts with more"
    postsWithout.Length
    postsWithOne.Length
    postsWithMore.Length
  // We'll take the posts with a single code section and process
  // them
  let res =
    postsWithOne |>
    Array.map (fun (a,b,c) -> commandsFromCode c) |> // Get commands
    Array.map commaSepString |> // Make a comma-delimited cmd list
    Array.map (filesForCommandsFromIndex index) |> // Use our index
    Array.map2 (fun (a,b,c) d -> (a,c,d)) postsWithOne |> // Add title, code
    Array.filter (fun (a,b,c) -> b <> "") |> // Strip codeless
    Array.map
      (fun (a,b,c) ->
        a,
        let x = checkCodeAgainstFiles b c
        if x = [] then writeToTmpFile b // This is for debugging
        x) |>
    Array.filter (fun (a,b) -> b <> []) // Strip fileless
  0 // return an integer exit code
Right now it finds 108 source files that are “correct” on disk. This is a reasonable start, but there are certainly more to be found.
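To track down the files that ought to match but don’t, one option is to report the first line at which two versions diverge, rather than just a yes/no answer. Here’s a hedged sketch of that idea: `normalize` and `firstMismatch` are hypothetical helpers, and they use plain string equality rather than `String.Compare`, so the results could differ slightly from the comparison in the code above.

```fsharp
open System

// Drop blank lines and trim the rest, mirroring the comparison rules
let normalize (lines : seq<string>) =
  lines
  |> Seq.filter (fun l -> not (String.IsNullOrWhiteSpace(l)))
  |> Seq.map (fun l -> l.Trim())
  |> Seq.toList

// Return the first (index, left, right) where the inputs differ,
// or None if they are equivalent; a missing line is reported as ""
let firstMismatch (a : seq<string>) (b : seq<string>) =
  let xs, ys = normalize a, normalize b
  let n = max xs.Length ys.Length
  let at (l : string list) i = if i < l.Length then l.[i] else ""
  seq { 0 .. n - 1 }
  |> Seq.tryPick (fun i ->
       let x, y = at xs i, at ys i
       if x <> y then Some (i, x, y) else None)

printfn "%A" (firstMismatch ["a"; ""; " b "] ["a"; "b"])
printfn "%A" (firstMismatch ["a"; "b"] ["a"; "c"])
```

Running this over the fragments written to the temp folder should make it quicker to tell whether a mismatch comes from a stray entity, a whitespace quirk or a genuine code difference.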
By the way, while I don’t currently have a second part of this series planned, specifically, I know I’m going to need one to share the final version of the code. Which I’ll also place on GitHub, of course. :-)