其他
我优化多年的 C 语言竟然被 80 行 Haskell 打败了?
正确性:它应当返回被测试文件的正确的字符数、单词数和行数。
速度(真实世界的时间):与wc的执行时间相比是快是慢?
最大常驻内存量:最多使用多少内存?内存使用量是常量还是线性的,或者是其他?
2stupid fp = do
3 contents <- readFile fp
4 return (length s, length (words s), length (lines s))
2import Data.Char
3
4simpleFold :: FilePath -> IO (Int, Int, Int)
5simpleFold fp = do
6 countFile <$> readFile fp
7
8countFile :: String -> (Int, Int, Int)
9countFile s =
10 let (cs, ws, ls, _) = foldl' go (0, 0, 0, False) s
11 in (cs, ws, ls)
12 where
13 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
14 go (cs, ws, ls, wasSpace) c =
15 let addLine | c == '\n' = 1
16 | otherwise = 0
17 addWord | wasSpace = 0
18 | isSpace c = 1
19 | otherwise = 0
20 in (cs + 1, ws + addWord, ls + addLine, isSpace c)
2
3...
4 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
5 go (!cs, !ws, !ls, !wasSpace) c =
6 let addLine | c == '\n' = 1
7 | otherwise = 0
8 addWord | wasSpace = 0
9 | isSpace c = 1
10 | otherwise = 0
11 in (cs + 1, ws + addWord, ls + addLine, isSpace c)
2import qualified Data.ByteString.Lazy.Char8 as BS
3
4simpleFold :: FilePath -> IO (Int, Int, Int)
5simpleFold fp = do
6 simpleFoldCountFile <$> BS.readFile fp
7
8simpleFoldCountFile :: BS.ByteString -> (Int, Int, Int)
9simpleFoldCountFile s =
10 let (cs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
11 in (cs, ws, ls)
12 where
13 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
14 go (!cs, !ws, !ls, !wasSpace) c =
15 let addLine | c == '\n' = 1
16 | otherwise = 0
17 addWord | wasSpace = 0
18 | isSpace c = 1
19 | otherwise = 0
20 in (cs + 1, ws + addWord, ls + addLine, isSpace c)
2 deriving Show
3
4data Flux =
5 Flux !CharType
6 {-# UNPACK #-} !Int
7 !CharType
8 | Unknown
9 deriving Show
2 Unknown <> x = x
3 x <> Unknown = x
4 Flux l n NotSpace <> Flux NotSpace n' r = Flux l (n + n' - 1) r
5 Flux l n _ <> Flux _ n' r = Flux l (n + n') r
6
7instance Monoid Flux where
8 mempty = Unknown
2flux c | isSpace c = Flux IsSpace 0 IsSpace
3 | otherwise = Flux NotSpace 1 NotSpace
2Flux NotSpace 4 NotSpace
3
4 > foldMap flux "testing on" <> foldMap flux "e two three"
5Flux NotSpace 4 NotSpace
6
7 > foldMap flux "testing one " <> foldMap flux " two three"
8Flux NotSpace 4 NotSpace
2 Counts { charCount :: {-# UNPACK #-} !Int
3
4 , wordCount :: !Flux
5 , lineCount :: {-# UNPACK #-} !Int
6 }
7 deriving (Show)
8
9instance Semigroup Counts where
10 (Counts a b c) <> (Counts a' b' c') = Counts (a + a') (b <> b') (c + c')
11
12instance Monoid Counts where
13 mempty = Counts 0 mempty 0
2countChar c =
3 Counts { charCount = 1
4 , wordCount = flux c
5 , lineCount = if (c == '\n') then 1 else 0
6 }
2Counts {charCount = 13, wordCount = Flux NotSpace 3 NotSpace, lineCount = 1}
2
3import Data.Char
4import qualified Data.ByteString.Lazy.Char8 as BS
5
6monoidBSFold :: FilePath -> IO Counts
7monoidBSFold paths = monoidFoldFile <$> BS.readFile fp
8
9monoidFoldFile :: BS.ByteString -> Counts
10monoidFoldFile = BS.foldl' (\a b -> a <> countChar b) mempty
2monoidBSFold paths = monoidBSFoldFile <$> BS.readFile fp
3{-# INLINE monoidBSFold #-}
4
5monoidBSFoldFile :: BS.ByteString -> Counts
6monoidBSFoldFile = BS.foldl' (\a b -> a <> countChar b) mempty
7{-# INLINE monoidBSFoldFile #-}
8
2import Control.Monad
3import Data.Traversable
4import Data.Bits
5import GHC.Conc (numCapabilities)
6import Control.Concurrent.Async
7import Data.Foldable
8import System.IO
9import System.Posix.Files
10import qualified Data.ByteString.Lazy.Char8 as BL
11import Data.ByteString.Internal (c2w)
12import GHC.IO.Handle
13
14multiCoreCount :: FilePath -> IO Counts
15multiCoreCount fp = do
16 putStrLn ("Using available cores: " <> show numCapabilities)
17 size <- fromIntegral . fileSize <$> getFileStatus fp
18 let chunkSize = fromIntegral (size `div` numCapabilities)
19 fold <$!> (forConcurrently [0..numCapabilities-1] $ \n -> do
20 -- Take all remaining bytes on the last capability due to integer division anomolies
21 let limiter = if n == numCapabilities - 1
22 then id
23 else BL.take (fromIntegral chunkSize)
24 let offset = fromIntegral (n * chunkSize)
25 fileHandle <- openBinaryFile fp ReadMode
26 hSeek fileHandle AbsoluteSeek offset
27 countBytes . limiter <$!> BL.hGetContents fileHandle)
28{-# INLINE handleSplitUTF #-}
29
30countBytes :: BL.ByteString -> Counts
31countBytes = BL.foldl' (\a b -> a <> countChar b) mempty
32{-# INLINE countBytes #-}
33
输入可以是ASCII或UTF-8编码。当然还有其他流行的编码方式,但根据我有限的经验,绝大部分现代文本文件都采用两者之一。实际上,有许多网站都在致力于让UTF-8成为唯一的编码格式。
我们仅把ASCII中的空格和换行当做空格和换行处理;MONGOLIAN VOWEL SEPARATOR等字符就不考虑了。
2import Data.ByteString.Internal (c2w)
3countByte :: Char -> Counts
4countByte c =
5 Counts {
6 -- Only count bytes at the START of a codepoint, not continuation bytes
7 charCount = if (bitAt 7 && not (bitAt 6)) then 0 else 1
8 , wordCount = flux c
9 , lineCount = if (c == '\n') then 1 else 0
10 }
11 where
12 bitAt = testBit (c2w c)
13{-# INLINE countByte #-}
2
3import Types
4import Data.Traversable
5import GHC.Conc (numCapabilities)
6import System.IO (openFile, IOMode(..))
7import qualified Streamly as S
8import qualified Streamly.Data.String as S
9import qualified Streamly.Prelude as S
10import qualified Streamly.Internal.Memory.Array as A
11import qualified Streamly.Internal.FileSystem.Handle as FH
12
13streamingBytestream :: FilePath -> IO Counts
14streamingBytestream fp = do
15 src <- openFile fp ReadMode
16 S.foldl' mappend mempty
17 $ S.aheadly
18 $ S.maxThreads numCapabilities
19 $ S.mapM countBytes
20 $ FH.toStreamArraysOf 1024000 src
21 where
22 countBytes =
23 S.foldl' (\acc c -> acc <> countByte c) mempty
24 . S.decodeChar8
25 . A.toStream
26
27{-# INLINE streamingBytestream #-}
2 S.foldl' (\acc c -> acc <> countByte c) mempty
3 . S.decodeChar8
4 . A.toStream
☞只因写了一段爬虫,公司 200 多人被抓!☞不足 20 行 Python 代码,高效实现 k-means 均值聚类算法!
☞三年一跳槽、拒绝“唯学历”,火速 Get 这份程序员求生指南!
☞确认!语音识别大牛Daniel Povey将入职小米,曾遭霍普金斯大学解雇,怒拒Facebook
☞【又是一波重点】深度解析服务器科普知识 | CSDN博文精选
☞“国家队”入局! 中移动、银联等宣布区块链服务网络(BSN)正式内测!
☞行!嘀嗒不甘第二